[SDKv2] Add speech result types. #746
Update C#, Python, and JS SDKs to use new audio output types.
Moved tests into audio-session.test.ts
Pull request overview
This PR introduces two new output-only item types — SpeechSegmentItem and SpeechResultItem — across the entire multi-language SDK stack (C++ core, C ABI, C++ wrapper, C#, JavaScript, Python). Previously, AudioSession produced TextItem outputs; now it produces structured speech result types that carry per-segment detail, preparing the ground for future word-level timing, confidence, speaker diarization, and streaming hypothesis updates.
Changes:
- New C ABI types (`flSpeechSegmentData`, `flSpeechResultData`, `flSpeechWord`, `flSpeechSegmentKind`) with `Get` accessors and an explicit rejection of `Item_Create` for these output-only types, consistently implemented in C++, C#, JS, and Python.
- AudioSession's internal token generation now accumulates a `SpeechSegmentItem` per decoded token and produces a `SpeechResultItem` aggregate as the final response, replacing the former `TextItem` output in both streaming and non-streaming paths.
- All OpenAI/live-audio adapter layers (`LiveAudioTranscriptionClient`/`LiveAudioTranscriptionSession`) updated to consume the new types, plus comprehensive new integration and unit tests across all four SDKs (including the consolidation of the former `audio-session-streaming.test.ts` into the main test file).
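The accumulate-then-aggregate flow described in the second bullet can be sketched roughly as follows. This is a minimal illustration only: `SegmentSketch`, `ResultSketch`, `AppendToken`, and `BuildResult` are hypothetical names invented here, and the real `SpeechSegmentItem`/`SpeechResultItem` carry more state (for example the cached C ABI representation) than this shows.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the SDK's real types.
enum class SegmentKindSketch { None, Final };

struct SegmentSketch {
  std::string text;
  SegmentKindSketch kind;
};

struct ResultSketch {
  std::string text;                     // full transcript
  std::vector<SegmentSketch> segments;  // per-token detail
};

// Record one decoded token as its own segment.
inline void AppendToken(std::vector<SegmentSketch>& segments, std::string token_text) {
  SegmentSketch segment;
  segment.text = std::move(token_text);
  segment.kind = SegmentKindSketch::Final;
  segments.push_back(std::move(segment));
}

// Build the aggregate: concatenate the segment texts, then take ownership
// of the segment list so callers can still inspect per-token detail.
inline ResultSketch BuildResult(std::vector<SegmentSketch> segments) {
  ResultSketch result;
  for (const SegmentSketch& segment : segments) {
    result.text += segment.text;
  }
  result.segments = std::move(segments);
  return result;
}
```

In this sketch a streaming consumer would see each segment as it is appended, while a non-streaming consumer would only read the final aggregate.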
Reviewed changes
Copilot reviewed 39 out of 39 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `sdk_v2/cpp/include/foundry_local/foundry_local_c.h` | Adds flSpeechSegmentKind, flSpeechWord, flSpeechSegmentData, flSpeechResultData structs and GetSpeechSegment/GetSpeechResult vtable entries |
| `sdk_v2/cpp/include/foundry_local/foundry_local_cpp.h` | Adds SpeechWord, SpeechSegmentContent, SpeechResultContent wrapper structs and new Item::GetSpeech* methods |
| `sdk_v2/cpp/include/foundry_local/foundry_local_cpp.inline.h` | Implements the wrapper accessors with sentinel→optional conversion |
| `sdk_v2/cpp/src/items/speech_segment_item.h` | New internal SpeechSegmentItem struct with Finalize/GetApiData |
| `sdk_v2/cpp/src/items/speech_segment_item.cc` | Implements Finalize for the segment's cached C ABI representation |
| `sdk_v2/cpp/src/items/speech_result_item.h` | New internal SpeechResultItem (header-only, inline Finalize) |
| `sdk_v2/cpp/src/inferencing/generative/audio/audio_session.h` | Updated signatures for ProcessChunk/DecodeTokens to carry segment vectors |
| `sdk_v2/cpp/src/inferencing/generative/audio/audio_session.cc` | Core producer logic: emits SpeechSegmentItem per token, builds SpeechResultItem aggregate |
| `sdk_v2/cpp/src/c_api.cc` | GetSpeechSegment/GetSpeechResult implementations + Create rejection |
| `sdk_v2/cpp/CMakeLists.txt` | Adds speech_segment_item.cc to the build |
| `sdk_v2/cpp/docs/SpeechOutputTypes.md` | Comprehensive design document for the new types |
| `sdk_v2/cpp/examples/realtime_audio/main.cc` | Updated example to use new types |
| `sdk_v2/cpp/test/sdk_api/streaming_audio_test.cc` | Updated to assert SPEECH_SEGMENT type on streamed items |
| `sdk_v2/cpp/test/sdk_api/audio_transcriptions_test.cc` | New test validating SpeechResultItem structure |
| `sdk_v2/cpp/test/sdk_api/model_fixture.h` | CollectResponseText now handles SPEECH_RESULT |
| `sdk_v2/cpp/test/internal_api/item_test.cc` | Unit tests for segment/result item construction and wrapper translation |
| `sdk_v2/cpp/test/internal_api/c_api_test.cc` | Tests that Create is rejected for output-only types |
| `sdk_v2/cpp/test/internal_api/audio/audio_session_test.cc` | Updated internal audio test to expect SPEECH_RESULT |
| `sdk_v2/cs/src/Enums.cs` | Adds SpeechSegmentKind enum and new ItemType values |
| `sdk_v2/cs/src/Items/SpeechSegmentItem.cs` | New C# SpeechSegmentItem with eager data materialization |
| `sdk_v2/cs/src/Items/SpeechResultItem.cs` | New C# SpeechResultItem with borrowed segment reading |
| `sdk_v2/cs/src/Items/Item.cs` | Dispatch to new types in FromNative |
| `sdk_v2/cs/src/Detail/NativeMethods.cs` | P/Invoke structs, delegates, and vtable entries |
| `sdk_v2/cs/src/AudioSession.cs` | Updated doc comment |
| `sdk_v2/cs/src/OpenAI/LiveAudioTranscriptionClient.cs` | Consumes SpeechSegment/SpeechResult instead of TextItem |
| `sdk_v2/cs/test/FoundryLocal.Tests/AudioSessionTests.cs` | Updated existing tests + new streaming PCM test |
| `sdk_v2/js/native/src/items.cc` | SpeechSegmentToJs/SpeechResultToJs + ItemTypeToString |
| `sdk_v2/js/src/items.ts` | TypeScript interfaces for the new types |
| `sdk_v2/js/src/openai/liveAudioSession.ts` | Consumes speechSegment/speechResult instead of text |
| `sdk_v2/js/test/audio-session.test.ts` | Consolidated streaming tests + new speech type assertions |
| `sdk_v2/js/test/audio-session-streaming.test.ts` | Deleted (content moved to main test file) |
| `sdk_v2/python/src/foundry_local_sdk/__init__.py` | Exports new types |
| `sdk_v2/python/src/foundry_local_sdk/items.py` | Python SpeechSegmentItem/SpeechResultItem/SpeechWord + dispatch |
| `sdk_v2/python/src/foundry_local_sdk/_native/build_cffi.py` | cffi struct/enum definitions for new ABI types |
| `sdk_v2/python/src/foundry_local_sdk/openai/live_audio_session.py` | Consumes SpeechSegment/SpeechResult instead of TextItem |
| `sdk_v2/python/test/unit/test_items.py` | Negative construction tests for new types |
| `sdk_v2/python/test/unit/test_imports.py` | Import smoke test for new exports |
| `sdk_v2/python/test/integration/test_audio_session.py` | New streaming PCM integration test |
| `sdk_v2/python/test/conftest.py` | New streaming_audio_model fixture |
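The "sentinel→optional conversion" mentioned for `foundry_local_cpp.inline.h` could look something like the sketch below. Everything here is an assumption for illustration: the field names, the `-1` and negative-value sentinels, and the `SegmentDataSketch`/`SegmentViewSketch` types are invented, since the actual `flSpeechSegmentData` layout is not shown in this summary. It requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sentinel for "time not set"; the real ABI may use another value.
constexpr std::int64_t kNoTimeSketch = -1;

// Hypothetical C-ABI-style struct: plain fields only, so "absent" must be
// encoded as a sentinel value rather than as a missing member.
struct SegmentDataSketch {
  std::int64_t start_ms;  // kNoTimeSketch when not set
  std::int64_t end_ms;    // kNoTimeSketch when not set
  float confidence;       // negative when not set (assumption of this sketch)
};

// Hypothetical wrapper-side view that exposes absence as std::nullopt.
struct SegmentViewSketch {
  std::optional<std::int64_t> start_ms;
  std::optional<std::int64_t> end_ms;
  std::optional<float> confidence;
};

inline std::optional<std::int64_t> TimeOrNullopt(std::int64_t value) {
  if (value == kNoTimeSketch) return std::nullopt;
  return value;
}

// Translate the sentinel-encoded struct into the optional-based view.
inline SegmentViewSketch ToView(const SegmentDataSketch& data) {
  SegmentViewSketch view;
  view.start_ms = TimeOrNullopt(data.start_ms);
  view.end_ms = TimeOrNullopt(data.end_ms);
  if (data.confidence >= 0.0f) view.confidence = data.confidence;
  return view;
}
```

The point of the pattern is that the C ABI stays a flat struct, while C++ callers never have to remember which magic value means "missing".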
// Initial capacity for the per-token accumulators. Picked empirically: a few seconds of speech
// (~10s on Whisper, ~5s on Nemotron streaming) produces under 256 tokens, so most short-form
// transcriptions avoid any reallocation. Longer transcriptions still grow geometrically.
What is the reallocation policy?
Depends on the std::vector implementation. Some do 2x. Some use the golden ratio. Some use 1.5x.
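Because the growth factor is implementation-defined, one way to see the policy on a given toolchain is to count capacity changes directly. The helper below is a small sketch, not code from this PR; `CountReallocations` is a hypothetical name, and the 256 used in the usage note is an assumption based on the "under 256 tokens" comment rather than the confirmed value of `kInitialTokenCapacity`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times a std::vector<int> changes capacity while n elements
// are pushed after an initial reserve(initial_capacity).
inline std::size_t CountReallocations(std::size_t initial_capacity, std::size_t n) {
  std::vector<int> values;
  values.reserve(initial_capacity);
  std::size_t reallocations = 0;
  std::size_t last_capacity = values.capacity();
  for (std::size_t i = 0; i < n; ++i) {
    values.push_back(static_cast<int>(i));
    if (values.capacity() != last_capacity) {
      // Capacity changed, so the vector moved to a larger buffer.
      ++reallocations;
      last_capacity = values.capacity();
    }
  }
  return reallocations;
}
```

The standard guarantees that `CountReallocations(256, 256)` is zero, since `reserve` ensures capacity for at least that many elements. Beyond the reserved size, the count grows only logarithmically with `n` under the geometric growth that common standard libraries use, whichever factor they pick.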
token_texts.reserve(kInitialTokenCapacity);
std::vector<std::unique_ptr<SpeechSegmentItem>> segments;
segments.reserve(kInitialTokenCapacity);
// Streaming ASR has no text prompt (input is audio), so prompt_tokens stays 0.
If we were to use Whisper for streaming ASR in the future (e.g. with a VAD), we could potentially have a text prompt as well (see the previous text tokens part here for reference).
Is there anything we should add now to support this? It's internal code so we can add later if that's easier.
ASRs in general can have a conditioning prompt (Whisper and Cohere have it), but most of that should probably be abstracted from the user IMO. I cannot think of a scenario where a user would need to care about it, so it's internal.
This is only used for returning usage counts, which may not be very meaningful for local inferencing.
///
/// As an entry of a SpeechResultItem, `kind` is FINAL (or NONE for a single
/// non-segmented transcript).
struct SpeechSegmentItem : Item {
Every decoded token creates a heap-allocated SpeechSegmentItem with its own Finalize() + cached C ABI struct. For long transcriptions (thousands of tokens), this could be significant. A pool allocator or a flat struct array could reduce the per-segment allocation overhead, though this is an optimization concern rather than a correctness issue.
Would prefer to see perf data before adding complexity here.
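As a starting point for gathering that data, a rough timing harness along these lines could compare one heap allocation per segment against a flat array. This is only a sketch under stated assumptions: `BenchSegment`, `BuildBoxed`, and `BuildFlat` are hypothetical, the segment is far simpler than the real `SpeechSegmentItem` (no `Finalize()`, no cached C ABI struct), and a single wall-clock measurement like this is noisy, so real conclusions would need a proper benchmark framework and realistic payloads.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-token segment; not the SDK's type.
struct BenchSegment {
  std::string text;
  int kind;
};

struct BenchResult {
  std::size_t count;    // number of segments built
  double microseconds;  // wall-clock time for the build loop
};

// One heap allocation per segment, mirroring a vector of unique_ptr.
inline BenchResult BuildBoxed(std::size_t n) {
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::unique_ptr<BenchSegment>> segments;
  segments.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    auto segment = std::make_unique<BenchSegment>();
    segment->text = "tok";  // short text, so the string itself rarely allocates
    segment->kind = 1;
    segments.push_back(std::move(segment));
  }
  const auto end = std::chrono::steady_clock::now();
  return {segments.size(), std::chrono::duration<double, std::micro>(end - start).count()};
}

// Segments stored contiguously in a single buffer.
inline BenchResult BuildFlat(std::size_t n) {
  const auto start = std::chrono::steady_clock::now();
  std::vector<BenchSegment> segments;
  segments.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    segments.push_back(BenchSegment{"tok", 1});
  }
  const auto end = std::chrono::steady_clock::now();
  return {segments.size(), std::chrono::duration<double, std::micro>(end - start).count()};
}
```

Running both with `n` in the thousands and comparing `microseconds` against total decode time would indicate whether the allocation overhead is large enough to justify a pool allocator or flat layout.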
Move C# test to the file with the other realtime audio tests to reduce duplication of test infra. Clean up some ownership issues in C#. Clean up CA2000 handling. Make logging more deterministic when running at debug level, which will help when diagnosing failures from CI logs.
…uppression in multiple places