fix(llm): don't continue generation while dangling tool_calls are present#916
fix(llm): don't continue generation while dangling tool_calls are present#916wustwyh wants to merge 2 commits into
Conversation
…sent When a truncated assistant response contains tool_calls, the continue- generation path was appending the partial message to the conversation and immediately calling the LLM again. Providers that validate the OpenAI conversation format reject an assistant message with tool_calls that is not followed by matching tool responses. Now both _continue_generate and _stream_continue_generate return early when the message has tool_calls, letting the normal step() loop execute the tools and append valid tool responses before the next LLM call. Fixes modelscope#909.
There was a problem hiding this comment.
Code Review
This pull request prevents the OpenAI LLM from continuing generation inline when the model emits tool calls, ensuring that the caller's tool execution loop runs first to avoid sending dangling tool calls. The review feedback points out critical edge cases where stopping the continuation early during recursive runs (when previous chunks are marked as partial) will discard accumulated text and leave messages in a dangling partial state. The reviewer suggests merging these partial messages and clearing the partial flag before returning, and provides additional unit tests to cover these scenarios.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if new_message.tool_calls: | ||
| return new_message |
There was a problem hiding this comment.
If _continue_generate is in a recursive continuation run (i.e., messages[-1] is marked as partial), returning new_message directly when tool_calls are present will discard all previously accumulated text from the prior runs and leave messages[-1] in a dangling partial state. We must merge new_message into messages[-1], clear the partial flag, and return the merged message, just like we do when finishing naturally.
if new_message.tool_calls:
if messages[-1].to_dict().get('partial', False):
self._merge_partial_message(messages, new_message)
messages[-1].partial = False
return messages.pop(-1)
else:
return new_message| if not message.tool_calls: | ||
| logger.info( | ||
| f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.' | ||
| ) | ||
| completion = self._call_llm_for_continue_gen( | ||
| messages, message, tools, **kwargs) | ||
| for chunk in self._stream_continue_generate( | ||
| messages, completion, tools, | ||
| max_runs - 1 if max_runs is not None else None, | ||
| **kwargs): | ||
| if first_run: | ||
| yield self._merge_stream_message( | ||
| messages[-1], chunk) | ||
| else: | ||
| yield chunk |
There was a problem hiding this comment.
When streaming and stopping continuation early due to tool_calls being present, if we are in a recursive continuation run (first_run is False), we must merge the accumulated message into messages[-1] and clear the partial flag. Otherwise, the final chunk of generated text/tool calls is never merged into the conversation history, and messages[-1] is left in a dangling partial state.
| if not message.tool_calls: | |
| logger.info( | |
| f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.' | |
| ) | |
| completion = self._call_llm_for_continue_gen( | |
| messages, message, tools, **kwargs) | |
| for chunk in self._stream_continue_generate( | |
| messages, completion, tools, | |
| max_runs - 1 if max_runs is not None else None, | |
| **kwargs): | |
| if first_run: | |
| yield self._merge_stream_message( | |
| messages[-1], chunk) | |
| else: | |
| yield chunk | |
| if not message.tool_calls: | |
| logger.info( | |
| f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.' | |
| ) | |
| completion = self._call_llm_for_continue_gen( | |
| messages, message, tools, **kwargs) | |
| for chunk in self._stream_continue_generate( | |
| messages, completion, tools, | |
| max_runs - 1 if max_runs is not None else None, | |
| **kwargs): | |
| if first_run: | |
| yield self._merge_stream_message( | |
| messages[-1], chunk) | |
| else: | |
| yield chunk | |
| elif not first_run: | |
| self._merge_partial_message(messages, message) | |
| messages[-1].partial = False | |
| message = messages[-1] |
| yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks))) | ||
|
|
||
| mock_continue.assert_called_once() | ||
| self.assertEqual(yielded[-1].content, 'first part continued') |
There was a problem hiding this comment.
Add unit tests to cover the scenario where a subsequent continuation run (both streaming and non-streaming) returns tool calls, ensuring that the partial messages are correctly merged and the partial flag is cleared.
self.assertEqual(yielded[-1].content, 'first part continued')
def test_continue_generate_merges_and_returns_when_tool_calls_on_subsequent_run(self):
"""If a subsequent continuation run returns tool calls, it must merge and return the accumulated message."""
llm = self._make_llm()
messages = [
Message(role='system', content='You are a helpful assistant.'),
Message(role='user', content='Write a long report.'),
]
initial_completion = _make_completion(
content='first part',
finish_reason='length',
)
continued_completion = _make_completion(
content=' continued with tool',
finish_reason='stop',
tool_calls=[{
'id': 'call_abc',
'tool_name': 'write_file',
'arguments': '{"path": "/tmp/report.md"}',
}],
)
def fake_continue(messages, new_message, tools, **kwargs):
messages.append(new_message)
messages[-1].partial = True
return continued_completion
with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue):
result = llm._continue_generate(messages, initial_completion)
self.assertEqual(result.content, 'first part continued with tool')
self.assertEqual(len(result.tool_calls), 1)
self.assertEqual(result.tool_calls[0]['id'], 'call_abc')
self.assertEqual(len(messages), 2)
self.assertFalse(result.partial)
def test_stream_continue_generate_merges_when_tool_calls_on_subsequent_run(self):
"""If a subsequent streaming continuation run returns tool calls, it must merge and clear partial flag."""
llm = self._make_llm()
messages = [
Message(role='system', content='You are a helpful assistant.'),
Message(role='user', content='Write a long report.'),
]
initial_chunks = [
_make_stream_chunk(content='first part'),
_make_stream_chunk(finish_reason='length'),
]
continued_chunks = [
_make_stream_chunk(content=' continued'),
_make_stream_chunk(
content='',
tool_call={
'id': 'call_abc',
'tool_name': 'write_file',
'arguments': '{"path": "/tmp/report.md"}',
},
),
_make_stream_chunk(finish_reason='length'),
]
def fake_continue(messages, message, tools, **kwargs):
messages.append(message)
messages[-1].partial = True
return iter(continued_chunks)
with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue):
yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks)))
self.assertEqual(yielded[-1].content, 'first part continued')
self.assertEqual(len(yielded[-1].tool_calls), 1)
self.assertEqual(yielded[-1].tool_calls[0]['id'], 'call_abc')
self.assertFalse(messages[-1].partial)…uation Address code review feedback: - In _continue_generate, when tool_calls appear during a recursive continuation run, merge the accumulated partial message and clear the partial flag instead of returning new_message directly. - In _stream_continue_generate, do the same when first_run is False. - Add unit tests covering tool_calls appearing on subsequent continuation runs for both streaming and non-streaming paths.
Summary
Fixes #909.
When the LLM returns a truncated (
finish_reason: length/finish_reason: null) assistant message that also containstool_calls, the continue-generation path inopenai_llm.pywas appending that partial assistant message to the conversation history and immediately calling the LLM again. This produces an invalid OpenAI conversation state: an assistant message withtool_callsthat is not followed by the required tool responses. Providers such as DeepSeek reject this with:What changed
_continue_generate, return the message as-is whennew_message.tool_callsis non-empty instead of entering the continue-generation loop.messages[-1]is partial), merge the accumulated partial message, clear thepartialflag, and return the merged message._stream_continue_generate, skip the continue-generation branch when the accumulatedmessagehastool_calls.first_run is False), merge the accumulated partial message and clear thepartialflag.This lets the normal agent
step()loop execute the pending tool calls, append their responses, and only then resume generation with a valid conversation state.Text-only truncated responses continue to work exactly as before.
Tests
Added
tests/llm/test_openai_continue_generation.pywith six unit tests:_continue_generatereturns early when the first truncated message has tool calls._continue_generatestill continues plain text truncation._continue_generatemerges partial state when tool calls appear on a subsequent continuation run._stream_continue_generatereturns early when the first accumulated stream has tool calls._stream_continue_generatestill continues plain text truncation streams._stream_continue_generatemerges partial state when tool calls appear on a subsequent continuation run.All six tests pass.