fix(llm): don't continue generation while dangling tool_calls are present by wustwyh · Pull Request #916 · modelscope/ms-agent

wustwyh · 2026-06-24T13:47:12Z

Summary

Fixes #909.

When the LLM returns a truncated (finish_reason: length / finish_reason: null) assistant message that also contains tool_calls, the continue-generation path in openai_llm.py was appending that partial assistant message to the conversation history and immediately calling the LLM again. This produces an invalid OpenAI conversation state: an assistant message with tool_calls that is not followed by the required tool responses. Providers such as DeepSeek reject this with:

An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'.

What changed

In _continue_generate, return the message as-is when new_message.tool_calls is non-empty instead of entering the continue-generation loop.
- If this happens during a recursive continuation run (i.e. messages[-1] is partial), merge the accumulated partial message, clear the partial flag, and return the merged message.
In _stream_continue_generate, skip the continue-generation branch when the accumulated message has tool_calls.
- If this happens during a recursive continuation run (first_run is False), merge the accumulated partial message and clear the partial flag.

This lets the normal agent step() loop execute the pending tool calls, append their responses, and only then resume generation with a valid conversation state.

Text-only truncated responses continue to work exactly as before.

Tests

Added tests/llm/test_openai_continue_generation.py with six unit tests:

_continue_generate returns early when the first truncated message has tool calls.
_continue_generate still continues plain text truncation.
_continue_generate merges partial state when tool calls appear on a subsequent continuation run.
_stream_continue_generate returns early when the first accumulated stream has tool calls.
_stream_continue_generate still continues plain text truncation streams.
_stream_continue_generate merges partial state when tool calls appear on a subsequent continuation run.

All six tests pass.

…sent When a truncated assistant response contains tool_calls, the continue- generation path was appending the partial message to the conversation and immediately calling the LLM again. Providers that validate the OpenAI conversation format reject an assistant message with tool_calls that is not followed by matching tool responses. Now both _continue_generate and _stream_continue_generate return early when the message has tool_calls, letting the normal step() loop execute the tools and append valid tool responses before the next LLM call. Fixes modelscope#909.

gemini-code-assist

Code Review

This pull request prevents the OpenAI LLM from continuing generation inline when the model emits tool calls, ensuring that the caller's tool execution loop runs first to avoid sending dangling tool calls. The review feedback points out critical edge cases where stopping the continuation early during recursive runs (when previous chunks are marked as partial) will discard accumulated text and leave messages in a dangling partial state. The reviewer suggests merging these partial messages and clearing the partial flag before returning, and provides additional unit tests to cover these scenarios.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-24T13:49:06Z

+        if new_message.tool_calls:
+            return new_message


If _continue_generate is in a recursive continuation run (i.e., messages[-1] is marked as partial), returning new_message directly when tool_calls are present will discard all previously accumulated text from the prior runs and leave messages[-1] in a dangling partial state. We must merge new_message into messages[-1], clear the partial flag, and return the merged message, just like we do when finishing naturally.

if new_message.tool_calls: if messages[-1].to_dict().get('partial', False): self._merge_partial_message(messages, new_message) messages[-1].partial = False return messages.pop(-1) else: return new_message

gemini-code-assist · 2026-06-24T13:49:06Z

+                    if not message.tool_calls:
+                        logger.info(
+                            f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'
+                        )
+                        completion = self._call_llm_for_continue_gen(
+                            messages, message, tools, **kwargs)
+                        for chunk in self._stream_continue_generate(
+                                messages, completion, tools,
+                                max_runs - 1 if max_runs is not None else None,
+                                **kwargs):
+                            if first_run:
+                                yield self._merge_stream_message(
+                                    messages[-1], chunk)
+                            else:
+                                yield chunk


When streaming and stopping continuation early due to tool_calls being present, if we are in a recursive continuation run (first_run is False), we must merge the accumulated message into messages[-1] and clear the partial flag. Otherwise, the final chunk of generated text/tool calls is never merged into the conversation history, and messages[-1] is left in a dangling partial state.

Suggested change

if not message.tool_calls:

logger.info(

f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'

)

completion = self._call_llm_for_continue_gen(

messages, message, tools, **kwargs)

for chunk in self._stream_continue_generate(

messages, completion, tools,

max_runs - 1 if max_runs is not None else None,

**kwargs):

if first_run:

yield self._merge_stream_message(

messages[-1], chunk)

else:

yield chunk

if not message.tool_calls:

logger.info(

f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'

)

completion = self._call_llm_for_continue_gen(

messages, message, tools, **kwargs)

for chunk in self._stream_continue_generate(

messages, completion, tools,

max_runs - 1 if max_runs is not None else None,

**kwargs):

if first_run:

yield self._merge_stream_message(

messages[-1], chunk)

else:

yield chunk

elif not first_run:

self._merge_partial_message(messages, message)

messages[-1].partial = False

message = messages[-1]

gemini-code-assist · 2026-06-24T13:49:06Z

+            yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks)))
+
+        mock_continue.assert_called_once()
+        self.assertEqual(yielded[-1].content, 'first part continued')


Add unit tests to cover the scenario where a subsequent continuation run (both streaming and non-streaming) returns tool calls, ensuring that the partial messages are correctly merged and the partial flag is cleared.

self.assertEqual(yielded[-1].content, 'first part continued') def test_continue_generate_merges_and_returns_when_tool_calls_on_subsequent_run(self): """If a subsequent continuation run returns tool calls, it must merge and return the accumulated message.""" llm = self._make_llm() messages = [ Message(role='system', content='You are a helpful assistant.'), Message(role='user', content='Write a long report.'), ] initial_completion = _make_completion( content='first part', finish_reason='length', ) continued_completion = _make_completion( content=' continued with tool', finish_reason='stop', tool_calls=[{ 'id': 'call_abc', 'tool_name': 'write_file', 'arguments': '{"path": "/tmp/report.md"}', }], ) def fake_continue(messages, new_message, tools, **kwargs): messages.append(new_message) messages[-1].partial = True return continued_completion with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue): result = llm._continue_generate(messages, initial_completion) self.assertEqual(result.content, 'first part continued with tool') self.assertEqual(len(result.tool_calls), 1) self.assertEqual(result.tool_calls[0]['id'], 'call_abc') self.assertEqual(len(messages), 2) self.assertFalse(result.partial) def test_stream_continue_generate_merges_when_tool_calls_on_subsequent_run(self): """If a subsequent streaming continuation run returns tool calls, it must merge and clear partial flag.""" llm = self._make_llm() messages = [ Message(role='system', content='You are a helpful assistant.'), Message(role='user', content='Write a long report.'), ] initial_chunks = [ _make_stream_chunk(content='first part'), _make_stream_chunk(finish_reason='length'), ] continued_chunks = [ _make_stream_chunk(content=' continued'), _make_stream_chunk( content='', tool_call={ 'id': 'call_abc', 'tool_name': 'write_file', 'arguments': '{"path": "/tmp/report.md"}', }, ), _make_stream_chunk(finish_reason='length'), ] def fake_continue(messages, message, tools, **kwargs): messages.append(message) messages[-1].partial = True return iter(continued_chunks) with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue): yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks))) self.assertEqual(yielded[-1].content, 'first part continued') self.assertEqual(len(yielded[-1].tool_calls), 1) self.assertEqual(yielded[-1].tool_calls[0]['id'], 'call_abc') self.assertFalse(messages[-1].partial)

…uation Address code review feedback: - In _continue_generate, when tool_calls appear during a recursive continuation run, merge the accumulated partial message and clear the partial flag instead of returning new_message directly. - In _stream_continue_generate, do the same when first_run is False. - Add unit tests covering tool_calls appearing on subsequent continuation runs for both streaming and non-streaming paths.

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(llm): don't continue generation while dangling tool_calls are present#916

fix(llm): don't continue generation while dangling tool_calls are present#916
wustwyh wants to merge 2 commits into
modelscope:mainfrom
wustwyh:fix/continue-gen-tool-calls

wustwyh commented Jun 24, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

wustwyh commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wustwyh commented Jun 24, 2026 •

edited

Loading