[SDKv2] Add speech result types. #746

Open
skottmckay wants to merge 13 commits into main from skottmckay/AudioOutputTypes
Conversation

@skottmckay

Collaborator

No description provided.

@vercel

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
foundry-local Ready Ready Preview, Comment Jun 19, 2026 4:26am

Comment thread sdk_v2/cpp/src/inferencing/generative/audio/SPEECH_TYPES.md Outdated
Comment thread sdk_v2/cpp/src/inferencing/generative/audio/SPEECH_TYPES.md Outdated
@skottmckay changed the title from "Initial proposed set of types for review/refinement" to "[SDKv2] Add speech result types." on Jun 2, 2026
Comment thread sdk_v2/cpp/include/foundry_local/foundry_local_c.h Outdated
Comment thread sdk_v2/cpp/include/foundry_local/foundry_local_cpp.h

Collaborator Author

Moved tests into audio-session.test.ts

@skottmckay marked this pull request as ready for review June 18, 2026 01:54
Copilot AI review requested due to automatic review settings June 18, 2026 01:54

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces two new output-only item types — SpeechSegmentItem and SpeechResultItem — across the entire multi-language SDK stack (C++ core, C ABI, C++ wrapper, C#, JavaScript, Python). Previously, AudioSession produced TextItem outputs; now it produces structured speech result types that carry per-segment detail, preparing the ground for future word-level timing, confidence, speaker diarization, and streaming hypothesis updates.

Changes:

  • New C ABI types (flSpeechSegmentData, flSpeechResultData, flSpeechWord, flSpeechSegmentKind) with Get accessors and an explicit rejection of Item_Create for these output-only types — consistently implemented in C++, C#, JS, and Python.
  • AudioSession's internal token generation now accumulates SpeechSegmentItem per decoded token and produces a SpeechResultItem aggregate as the final response, replacing the former TextItem output in both streaming and non-streaming paths.
  • All OpenAI/live-audio adapter layers (LiveAudioTranscriptionClient / LiveAudioTranscriptionSession) updated to consume the new types, plus comprehensive new integration and unit tests across all four SDKs (including the consolidation of the former audio-session-streaming.test.ts into the main test file).
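
As a rough illustration of the first two bullets, the sketch below accumulates one segment per decoded token, folds them into a single aggregate, and refuses creation of output-only types. It is not the PR's code: `ItemType`, `SpeechSegment`, `SpeechResult`, `CanCreate`, and `BuildResult` are hypothetical stand-ins chosen for this example, and the real C ABI structs carry more fields than shown here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the PR's types; names and fields are illustrative only.
enum class ItemType { Text, SpeechSegment, SpeechResult };

struct SpeechSegment {
    std::string text;
};

struct SpeechResult {
    std::string text;                     // concatenation of all segment texts
    std::vector<SpeechSegment> segments;  // per-token detail
};

// Output-only types may be read by callers but never constructed by them.
bool IsOutputOnly(ItemType type) {
    return type == ItemType::SpeechSegment || type == ItemType::SpeechResult;
}

// Mimics a Create entry point that refuses output-only types.
// Returns true when creation would be allowed.
bool CanCreate(ItemType type) {
    return !IsOutputOnly(type);
}

// Builds one segment per decoded token and folds them into a single aggregate.
SpeechResult BuildResult(const std::vector<std::string>& token_texts) {
    SpeechResult result;
    result.segments.reserve(token_texts.size());
    for (const std::string& token : token_texts) {
        result.text += token;
        result.segments.push_back(SpeechSegment{token});
    }
    assert(result.segments.size() == token_texts.size());
    return result;
}
```

Under these assumptions, `BuildResult({"Hello", ",", " world"})` yields an aggregate whose `text` is the concatenation of the token texts and whose `segments` vector has one entry per token, while `CanCreate(ItemType::SpeechResult)` returns false.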

Reviewed changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sdk_v2/cpp/include/foundry_local/foundry_local_c.h | Adds flSpeechSegmentKind, flSpeechWord, flSpeechSegmentData, flSpeechResultData structs and GetSpeechSegment/GetSpeechResult vtable entries |
| sdk_v2/cpp/include/foundry_local/foundry_local_cpp.h | Adds SpeechWord, SpeechSegmentContent, SpeechResultContent wrapper structs and new Item::GetSpeech* methods |
| sdk_v2/cpp/include/foundry_local/foundry_local_cpp.inline.h | Implements the wrapper accessors with sentinel→optional conversion |
| sdk_v2/cpp/src/items/speech_segment_item.h | New internal SpeechSegmentItem struct with Finalize/GetApiData |
| sdk_v2/cpp/src/items/speech_segment_item.cc | Implements Finalize for the segment's cached C ABI representation |
| sdk_v2/cpp/src/items/speech_result_item.h | New internal SpeechResultItem (header-only, inline Finalize) |
| sdk_v2/cpp/src/inferencing/generative/audio/audio_session.h | Updated signatures for ProcessChunk/DecodeTokens to carry segment vectors |
| sdk_v2/cpp/src/inferencing/generative/audio/audio_session.cc | Core producer logic: emits SpeechSegmentItem per token, builds SpeechResultItem aggregate |
| sdk_v2/cpp/src/c_api.cc | GetSpeechSegment/GetSpeechResult implementations + Create rejection |
| sdk_v2/cpp/CMakeLists.txt | Adds speech_segment_item.cc to the build |
| sdk_v2/cpp/docs/SpeechOutputTypes.md | Comprehensive design document for the new types |
| sdk_v2/cpp/examples/realtime_audio/main.cc | Updated example to use new types |
| sdk_v2/cpp/test/sdk_api/streaming_audio_test.cc | Updated to assert SPEECH_SEGMENT type on streamed items |
| sdk_v2/cpp/test/sdk_api/audio_transcriptions_test.cc | New test validating SpeechResultItem structure |
| sdk_v2/cpp/test/sdk_api/model_fixture.h | CollectResponseText now handles SPEECH_RESULT |
| sdk_v2/cpp/test/internal_api/item_test.cc | Unit tests for segment/result item construction and wrapper translation |
| sdk_v2/cpp/test/internal_api/c_api_test.cc | Tests that Create is rejected for output-only types |
| sdk_v2/cpp/test/internal_api/audio/audio_session_test.cc | Updated internal audio test to expect SPEECH_RESULT |
| sdk_v2/cs/src/Enums.cs | Adds SpeechSegmentKind enum and new ItemType values |
| sdk_v2/cs/src/Items/SpeechSegmentItem.cs | New C# SpeechSegmentItem with eager data materialization |
| sdk_v2/cs/src/Items/SpeechResultItem.cs | New C# SpeechResultItem with borrowed segment reading |
| sdk_v2/cs/src/Items/Item.cs | Dispatch to new types in FromNative |
| sdk_v2/cs/src/Detail/NativeMethods.cs | P/Invoke structs, delegates, and vtable entries |
| sdk_v2/cs/src/AudioSession.cs | Updated doc comment |
| sdk_v2/cs/src/OpenAI/LiveAudioTranscriptionClient.cs | Consumes SpeechSegment/SpeechResult instead of TextItem |
| sdk_v2/cs/test/FoundryLocal.Tests/AudioSessionTests.cs | Updated existing tests + new streaming PCM test |
| sdk_v2/js/native/src/items.cc | SpeechSegmentToJs/SpeechResultToJs + ItemTypeToString |
| sdk_v2/js/src/items.ts | TypeScript interfaces for the new types |
| sdk_v2/js/src/openai/liveAudioSession.ts | Consumes speechSegment/speechResult instead of text |
| sdk_v2/js/test/audio-session.test.ts | Consolidated streaming tests + new speech type assertions |
| sdk_v2/js/test/audio-session-streaming.test.ts | Deleted (content moved to main test file) |
| sdk_v2/python/src/foundry_local_sdk/__init__.py | Exports new types |
| sdk_v2/python/src/foundry_local_sdk/items.py | Python SpeechSegmentItem/SpeechResultItem/SpeechWord + dispatch |
| sdk_v2/python/src/foundry_local_sdk/_native/build_cffi.py | cffi struct/enum definitions for new ABI types |
| sdk_v2/python/src/foundry_local_sdk/openai/live_audio_session.py | Consumes SpeechSegment/SpeechResult instead of TextItem |
| sdk_v2/python/test/unit/test_items.py | Negative construction tests for new types |
| sdk_v2/python/test/unit/test_imports.py | Import smoke test for new exports |
| sdk_v2/python/test/integration/test_audio_session.py | New streaming PCM integration test |
| sdk_v2/python/test/conftest.py | New streaming_audio_model fixture |
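
For readers unfamiliar with the "sentinel→optional conversion" mentioned for foundry_local_cpp.inline.h, a minimal sketch of the general pattern follows. The struct names, the timing fields, and the choice of -1 as the sentinel are assumptions made for illustration; the actual ABI fields and sentinel values are defined in the PR's headers and may differ.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sentinel: -1 marks "value not available" in the C-style struct.
constexpr std::int64_t kUnsetTime = -1;

// Hypothetical C ABI-style struct that cannot express "missing" directly.
struct RawSegment {
    std::int64_t start_ms;
    std::int64_t end_ms;
};

// Hypothetical C++ wrapper view that makes absence explicit.
struct Segment {
    std::optional<std::int64_t> start_ms;
    std::optional<std::int64_t> end_ms;
};

// Maps the sentinel to std::nullopt and passes every other value through.
std::optional<std::int64_t> FromSentinel(std::int64_t raw) {
    if (raw == kUnsetTime) {
        return std::nullopt;
    }
    return raw;
}

// Translates the raw struct into the wrapper view field by field.
Segment Wrap(const RawSegment& raw) {
    return Segment{FromSentinel(raw.start_ms), FromSentinel(raw.end_ms)};
}
```

With this shape, `Wrap(RawSegment{120, -1})` produces a `Segment` whose `start_ms` holds 120 and whose `end_ms` is empty, so callers test `has_value()` instead of comparing against a magic number.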

baijumeswani previously approved these changes Jun 18, 2026

// Initial capacity for the per-token accumulators. Picked empirically: a few seconds of speech
// (~10s on Whisper, ~5s on Nemotron streaming) produces under 256 tokens, so most short-form
// transcriptions avoid any reallocation. Longer transcriptions still grow geometrically.

Contributor

What is the reallocation policy?

Collaborator Author

Depends on the std::vector implementation. Some do 2x. Some use the golden ratio. Some use 1.5x.
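
Since the growth factor is implementation-defined, one way to see what a given standard library does is to count capacity changes directly. The helper below is a small sketch for that purpose; `CountReallocations` is a name invented for this example and is not part of the PR.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times a vector's capacity changes while appending `count`
// elements after reserving `initial_capacity` slots up front.
std::size_t CountReallocations(std::size_t initial_capacity, std::size_t count) {
    std::vector<int> values;
    values.reserve(initial_capacity);
    std::size_t reallocations = 0;
    std::size_t last_capacity = values.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        values.push_back(static_cast<int>(i));
        if (values.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = values.capacity();
        }
    }
    return reallocations;
}
```

`CountReallocations(256, 256)` is 0 on any conforming implementation, because `reserve` guarantees capacity for at least that many elements. The count for a larger input such as `CountReallocations(256, 10000)` depends on the library's growth factor, which is the point of measuring it.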

token_texts.reserve(kInitialTokenCapacity);
std::vector<std::unique_ptr<SpeechSegmentItem>> segments;
segments.reserve(kInitialTokenCapacity);
// Streaming ASR has no text prompt (input is audio), so prompt_tokens stays 0.

Contributor

If we were to use Whisper for streaming ASR in the future (e.g. with a VAD), we could potentially have a text prompt as well (see the previous text tokens part here for reference).

Collaborator Author

Is there anything we should add now to support this? It's internal code so we can add later if that's easier.

Contributor

ASRs in general can have a conditioning prompt (Whisper and Cohere have it), but most of that should probably be abstracted from the user IMO. I cannot think of a scenario where a user would need to care about it, so it's internal.

Collaborator Author

This is only used for returning usage counts, which may not be super meaningful in local inferencing.

///
/// As an entry of a SpeechResultItem, `kind` is FINAL (or NONE for a single
/// non-segmented transcript).
struct SpeechSegmentItem : Item {

Contributor

Every decoded token creates a heap-allocated SpeechSegmentItem with its own Finalize() + cached C ABI struct. For long transcriptions (thousands of tokens), this could be significant. A pool allocator or a flat struct array could reduce the per-segment allocation overhead, though this is an optimization concern rather than a correctness issue.

Collaborator Author

Would prefer to see perf data before adding complexity here.
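
One hedged way to gather that perf data is a small harness comparing the two layouts discussed in this thread: one heap-allocated object per segment versus a flat array of structs. Everything below is a sketch under assumptions: `Segment` is a hypothetical stand-in with a single text field, not the PR's SpeechSegmentItem, and `MeasureMicros` gives only a coarse wall-clock reading.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-token segment; the real item carries more state.
struct Segment {
    std::string text;
};

// Layout discussed in the thread: one heap allocation per segment.
std::vector<std::unique_ptr<Segment>> BuildBoxed(std::size_t n) {
    std::vector<std::unique_ptr<Segment>> segments;
    segments.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        segments.push_back(std::make_unique<Segment>(Segment{"tok"}));
    }
    return segments;
}

// Alternative layout: segments stored contiguously in a flat array.
std::vector<Segment> BuildFlat(std::size_t n) {
    std::vector<Segment> segments;
    segments.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        segments.push_back(Segment{"tok"});
    }
    return segments;
}

// Runs `fn` once and returns the elapsed wall-clock time in microseconds.
template <typename Fn>
double MeasureMicros(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

Calling `MeasureMicros([] { BuildBoxed(5000); })` and `MeasureMicros([] { BuildFlat(5000); })` several times and comparing the medians would give a first indication of whether the per-segment allocation cost matters at realistic token counts. A single run is too noisy to draw conclusions from.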

Move C# test to file with other realtime audio tests to reduce duplication of test infra.
Clean up some ownership issues in C#.
Clean up CA2000 handling.
Make logging more deterministic when running at debug level. Will help when diagnosing failures from CI logs.
5 participants