Skip to content

fix(llm): don't continue generation while dangling tool_calls are present#916

Open
wustwyh wants to merge 2 commits into
modelscope:mainfrom
wustwyh:fix/continue-gen-tool-calls
Open

fix(llm): don't continue generation while dangling tool_calls are present#916
wustwyh wants to merge 2 commits into
modelscope:mainfrom
wustwyh:fix/continue-gen-tool-calls

Conversation

@wustwyh

@wustwyh wustwyh commented Jun 24, 2026

Copy link
Copy Markdown

Summary

Fixes #909.

When the LLM returns a truncated (finish_reason: length / finish_reason: null) assistant message that also contains tool_calls, the continue-generation path in openai_llm.py was appending that partial assistant message to the conversation history and immediately calling the LLM again. This produces an invalid OpenAI conversation state: an assistant message with tool_calls that is not followed by the required tool responses. Providers such as DeepSeek reject this with:

An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'.

What changed

  • In _continue_generate, return the message as-is when new_message.tool_calls is non-empty instead of entering the continue-generation loop.
    • If this happens during a recursive continuation run (i.e. messages[-1] is partial), merge the accumulated partial message, clear the partial flag, and return the merged message.
  • In _stream_continue_generate, skip the continue-generation branch when the accumulated message has tool_calls.
    • If this happens during a recursive continuation run (first_run is False), merge the accumulated partial message and clear the partial flag.

This lets the normal agent step() loop execute the pending tool calls, append their responses, and only then resume generation with a valid conversation state.

Text-only truncated responses continue to work exactly as before.

Tests

Added tests/llm/test_openai_continue_generation.py with six unit tests:

  1. _continue_generate returns early when the first truncated message has tool calls.
  2. _continue_generate still continues plain text truncation.
  3. _continue_generate merges partial state when tool calls appear on a subsequent continuation run.
  4. _stream_continue_generate returns early when the first accumulated stream has tool calls.
  5. _stream_continue_generate still continues plain text truncation streams.
  6. _stream_continue_generate merges partial state when tool calls appear on a subsequent continuation run.

All six tests pass.

…sent

When a truncated assistant response contains tool_calls, the continue-
generation path was appending the partial message to the conversation
and immediately calling the LLM again. Providers that validate the
OpenAI conversation format reject an assistant message with tool_calls
that is not followed by matching tool responses.

Now both _continue_generate and _stream_continue_generate return early
when the message has tool_calls, letting the normal step() loop execute
the tools and append valid tool responses before the next LLM call.

Fixes modelscope#909.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request prevents the OpenAI LLM from continuing generation inline when the model emits tool calls, ensuring that the caller's tool execution loop runs first to avoid sending dangling tool calls. The review feedback points out critical edge cases where stopping the continuation early during recursive runs (when previous chunks are marked as partial) will discard accumulated text and leave messages in a dangling partial state. The reviewer suggests merging these partial messages and clearing the partial flag before returning, and provides additional unit tests to cover these scenarios.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +610 to +611
if new_message.tool_calls:
return new_message

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If _continue_generate is in a recursive continuation run (i.e., messages[-1] is marked as partial), returning new_message directly when tool_calls are present will discard all previously accumulated text from the prior runs and leave messages[-1] in a dangling partial state. We must merge new_message into messages[-1], clear the partial flag, and return the merged message, just like we do when finishing naturally.

        if new_message.tool_calls:
            if messages[-1].to_dict().get('partial', False):
                self._merge_partial_message(messages, new_message)
                messages[-1].partial = False
                return messages.pop(-1)
            else:
                return new_message

Comment on lines +426 to +440
if not message.tool_calls:
logger.info(
f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'
)
completion = self._call_llm_for_continue_gen(
messages, message, tools, **kwargs)
for chunk in self._stream_continue_generate(
messages, completion, tools,
max_runs - 1 if max_runs is not None else None,
**kwargs):
if first_run:
yield self._merge_stream_message(
messages[-1], chunk)
else:
yield chunk

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When streaming and stopping continuation early due to tool_calls being present, if we are in a recursive continuation run (first_run is False), we must merge the accumulated message into messages[-1] and clear the partial flag. Otherwise, the final chunk of generated text/tool calls is never merged into the conversation history, and messages[-1] is left in a dangling partial state.

Suggested change
if not message.tool_calls:
logger.info(
f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'
)
completion = self._call_llm_for_continue_gen(
messages, message, tools, **kwargs)
for chunk in self._stream_continue_generate(
messages, completion, tools,
max_runs - 1 if max_runs is not None else None,
**kwargs):
if first_run:
yield self._merge_stream_message(
messages[-1], chunk)
else:
yield chunk
if not message.tool_calls:
logger.info(
f'finish_reason: {chunk.choices[0].finish_reason}, continue generate.'
)
completion = self._call_llm_for_continue_gen(
messages, message, tools, **kwargs)
for chunk in self._stream_continue_generate(
messages, completion, tools,
max_runs - 1 if max_runs is not None else None,
**kwargs):
if first_run:
yield self._merge_stream_message(
messages[-1], chunk)
else:
yield chunk
elif not first_run:
self._merge_partial_message(messages, message)
messages[-1].partial = False
message = messages[-1]

yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks)))

mock_continue.assert_called_once()
self.assertEqual(yielded[-1].content, 'first part continued')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add unit tests to cover the scenario where a subsequent continuation run (both streaming and non-streaming) returns tool calls, ensuring that the partial messages are correctly merged and the partial flag is cleared.

        self.assertEqual(yielded[-1].content, 'first part continued')

    def test_continue_generate_merges_and_returns_when_tool_calls_on_subsequent_run(self):
        """If a subsequent continuation run returns tool calls, it must merge and return the accumulated message."""
        llm = self._make_llm()
        messages = [
            Message(role='system', content='You are a helpful assistant.'),
            Message(role='user', content='Write a long report.'),
        ]
        initial_completion = _make_completion(
            content='first part',
            finish_reason='length',
        )
        continued_completion = _make_completion(
            content=' continued with tool',
            finish_reason='stop',
            tool_calls=[{
                'id': 'call_abc',
                'tool_name': 'write_file',
                'arguments': '{"path": "/tmp/report.md"}',
            }],
        )

        def fake_continue(messages, new_message, tools, **kwargs):
            messages.append(new_message)
            messages[-1].partial = True
            return continued_completion

        with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue):
            result = llm._continue_generate(messages, initial_completion)

        self.assertEqual(result.content, 'first part continued with tool')
        self.assertEqual(len(result.tool_calls), 1)
        self.assertEqual(result.tool_calls[0]['id'], 'call_abc')
        self.assertEqual(len(messages), 2)
        self.assertFalse(result.partial)

    def test_stream_continue_generate_merges_when_tool_calls_on_subsequent_run(self):
        """If a subsequent streaming continuation run returns tool calls, it must merge and clear partial flag."""
        llm = self._make_llm()
        messages = [
            Message(role='system', content='You are a helpful assistant.'),
            Message(role='user', content='Write a long report.'),
        ]
        initial_chunks = [
            _make_stream_chunk(content='first part'),
            _make_stream_chunk(finish_reason='length'),
        ]
        continued_chunks = [
            _make_stream_chunk(content=' continued'),
            _make_stream_chunk(
                content='',
                tool_call={
                    'id': 'call_abc',
                    'tool_name': 'write_file',
                    'arguments': '{"path": "/tmp/report.md"}',
                },
            ),
            _make_stream_chunk(finish_reason='length'),
        ]

        def fake_continue(messages, message, tools, **kwargs):
            messages.append(message)
            messages[-1].partial = True
            return iter(continued_chunks)

        with patch.object(llm, '_call_llm_for_continue_gen', side_effect=fake_continue):
            yielded = list(llm._stream_continue_generate(messages, iter(initial_chunks)))

        self.assertEqual(yielded[-1].content, 'first part continued')
        self.assertEqual(len(yielded[-1].tool_calls), 1)
        self.assertEqual(yielded[-1].tool_calls[0]['id'], 'call_abc')
        self.assertFalse(messages[-1].partial)

…uation

Address code review feedback:
- In _continue_generate, when tool_calls appear during a recursive
  continuation run, merge the accumulated partial message and clear
  the partial flag instead of returning new_message directly.
- In _stream_continue_generate, do the same when first_run is False.
- Add unit tests covering tool_calls appearing on subsequent
  continuation runs for both streaming and non-streaming paths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[framework] _continue_generate corrupts conversation when truncated response contains tool_calls

1 participant