[SDKv2] Add speech result types. by skottmckay · Pull Request #746 · microsoft/Foundry-Local

skottmckay · 2026-05-30T00:01:09Z

No description provided.

skottmckay · 2026-06-18T01:50:55Z

Moved tests into audio-session.test.ts

Copilot

Pull request overview

This PR introduces two new output-only item types — SpeechSegmentItem and SpeechResultItem — across the entire multi-language SDK stack (C++ core, C ABI, C++ wrapper, C#, JavaScript, Python). Previously, AudioSession produced TextItem outputs; now it produces structured speech result types that carry per-segment detail, preparing the ground for future word-level timing, confidence, speaker diarization, and streaming hypothesis updates.

Changes:

- New C ABI types (flSpeechSegmentData, flSpeechResultData, flSpeechWord, flSpeechSegmentKind) with Get accessors and an explicit rejection of Item_Create for these output-only types, implemented consistently in C++, C#, JS, and Python.
- AudioSession's internal token generation now accumulates a SpeechSegmentItem per decoded token and produces a SpeechResultItem aggregate as the final response, replacing the former TextItem output in both streaming and non-streaming paths.
- All OpenAI/live-audio adapter layers (LiveAudioTranscriptionClient / LiveAudioTranscriptionSession) are updated to consume the new types, with new integration and unit tests across all four SDKs (including the consolidation of the former audio-session-streaming.test.ts into the main test file).
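The accumulate-then-aggregate flow described above can be illustrated with a minimal sketch. Every name here (`Segment`, `Result`, `SegmentKind`, `BuildResult`) is a hypothetical stand-in invented for illustration, not the PR's actual `SpeechSegmentItem` / `SpeechResultItem` definitions, and the real items carry more state than a text field.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the PR's segment and result items.
enum class SegmentKind { kNone, kFinal };

struct Segment {
  std::string text;
  SegmentKind kind;
};

struct Result {
  std::string text;               // full transcript
  std::vector<Segment> segments;  // one entry per decoded token
};

// Accumulates one segment per decoded token text and builds the aggregate.
// Assumption: entries of the aggregate are marked as final.
Result BuildResult(const std::vector<std::string>& token_texts) {
  Result result;
  result.segments.reserve(token_texts.size());
  for (const std::string& token : token_texts) {
    result.text += token;
    result.segments.push_back(Segment{token, SegmentKind::kFinal});
  }
  return result;
}
```

In this sketch a streaming caller would see each `Segment` as it is produced, while a non-streaming caller would only read the returned `Result`.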

Reviewed changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated no comments.

File	Description
`sdk_v2/cpp/include/foundry_local/foundry_local_c.h`	Adds `flSpeechSegmentKind`, `flSpeechWord`, `flSpeechSegmentData`, `flSpeechResultData` structs and `GetSpeechSegment`/`GetSpeechResult` vtable entries
`sdk_v2/cpp/include/foundry_local/foundry_local_cpp.h`	Adds `SpeechWord`, `SpeechSegmentContent`, `SpeechResultContent` wrapper structs and new `Item::GetSpeech*` methods
`sdk_v2/cpp/include/foundry_local/foundry_local_cpp.inline.h`	Implements the wrapper accessors with sentinel→optional conversion
`sdk_v2/cpp/src/items/speech_segment_item.h`	New internal `SpeechSegmentItem` struct with Finalize/GetApiData
`sdk_v2/cpp/src/items/speech_segment_item.cc`	Implements Finalize for the segment's cached C ABI representation
`sdk_v2/cpp/src/items/speech_result_item.h`	New internal `SpeechResultItem` (header-only, inline Finalize)
`sdk_v2/cpp/src/inferencing/generative/audio/audio_session.h`	Updated signatures for ProcessChunk/DecodeTokens to carry segment vectors
`sdk_v2/cpp/src/inferencing/generative/audio/audio_session.cc`	Core producer logic: emits SpeechSegmentItem per token, builds SpeechResultItem aggregate
`sdk_v2/cpp/src/c_api.cc`	GetSpeechSegment/GetSpeechResult implementations + Create rejection
`sdk_v2/cpp/CMakeLists.txt`	Adds `speech_segment_item.cc` to the build
`sdk_v2/cpp/docs/SpeechOutputTypes.md`	Comprehensive design document for the new types
`sdk_v2/cpp/examples/realtime_audio/main.cc`	Updated example to use new types
`sdk_v2/cpp/test/sdk_api/streaming_audio_test.cc`	Updated to assert SPEECH_SEGMENT type on streamed items
`sdk_v2/cpp/test/sdk_api/audio_transcriptions_test.cc`	New test validating SpeechResultItem structure
`sdk_v2/cpp/test/sdk_api/model_fixture.h`	`CollectResponseText` now handles SPEECH_RESULT
`sdk_v2/cpp/test/internal_api/item_test.cc`	Unit tests for segment/result item construction and wrapper translation
`sdk_v2/cpp/test/internal_api/c_api_test.cc`	Tests that Create is rejected for output-only types
`sdk_v2/cpp/test/internal_api/audio/audio_session_test.cc`	Updated internal audio test to expect SPEECH_RESULT
`sdk_v2/cs/src/Enums.cs`	Adds `SpeechSegmentKind` enum and new `ItemType` values
`sdk_v2/cs/src/Items/SpeechSegmentItem.cs`	New C# SpeechSegmentItem with eager data materialization
`sdk_v2/cs/src/Items/SpeechResultItem.cs`	New C# SpeechResultItem with borrowed segment reading
`sdk_v2/cs/src/Items/Item.cs`	Dispatch to new types in `FromNative`
`sdk_v2/cs/src/Detail/NativeMethods.cs`	P/Invoke structs, delegates, and vtable entries
`sdk_v2/cs/src/AudioSession.cs`	Updated doc comment
`sdk_v2/cs/src/OpenAI/LiveAudioTranscriptionClient.cs`	Consumes SpeechSegment/SpeechResult instead of TextItem
`sdk_v2/cs/test/FoundryLocal.Tests/AudioSessionTests.cs`	Updated existing tests + new streaming PCM test
`sdk_v2/js/native/src/items.cc`	`SpeechSegmentToJs`/`SpeechResultToJs` + ItemTypeToString
`sdk_v2/js/src/items.ts`	TypeScript interfaces for the new types
`sdk_v2/js/src/openai/liveAudioSession.ts`	Consumes speechSegment/speechResult instead of text
`sdk_v2/js/test/audio-session.test.ts`	Consolidated streaming tests + new speech type assertions
`sdk_v2/js/test/audio-session-streaming.test.ts`	Deleted (content moved to main test file)
`sdk_v2/python/src/foundry_local_sdk/__init__.py`	Exports new types
`sdk_v2/python/src/foundry_local_sdk/items.py`	Python SpeechSegmentItem/SpeechResultItem/SpeechWord + dispatch
`sdk_v2/python/src/foundry_local_sdk/_native/build_cffi.py`	cffi struct/enum definitions for new ABI types
`sdk_v2/python/src/foundry_local_sdk/openai/live_audio_session.py`	Consumes SpeechSegment/SpeechResult instead of TextItem
`sdk_v2/python/test/unit/test_items.py`	Negative construction tests for new types
`sdk_v2/python/test/unit/test_imports.py`	Import smoke test for new exports
`sdk_v2/python/test/integration/test_audio_session.py`	New streaming PCM integration test
`sdk_v2/python/test/conftest.py`	New `streaming_audio_model` fixture
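The table mentions wrapper accessors that perform a sentinel→optional conversion. A minimal sketch of that general pattern follows, under stated assumptions: `RawWord`, `Word`, `kNoConfidence`, and `ToWord` are hypothetical names, and the choice of a negative value as the "absent" sentinel is an assumption made for illustration, not something the PR text specifies.

```cpp
#include <optional>
#include <string>

// Hypothetical C-style record. Assumption: a negative confidence means
// "no confidence available"; the real ABI's sentinel may differ.
struct RawWord {
  const char* text;
  float confidence;
};

constexpr float kNoConfidence = -1.0f;

// Hypothetical C++ wrapper type that exposes absence as std::optional.
struct Word {
  std::string text;
  std::optional<float> confidence;
};

// Translates the sentinel-encoded field into an optional.
Word ToWord(const RawWord& raw) {
  Word word;
  word.text = raw.text != nullptr ? raw.text : "";
  if (raw.confidence >= 0.0f) {
    word.confidence = raw.confidence;
  }
  return word;
}
```

The appeal of this layout is that the C struct stays a plain value type with no extra `has_*` flag, while C++ callers cannot read the field without first checking whether it is present.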

kunal-vaishnavi · 2026-06-18T21:37:42Z

+
+// Initial capacity for the per-token accumulators. Picked empirically: a few seconds of speech
+// (~10s on Whisper, ~5s on Nemotron streaming) produces under 256 tokens, so most short-form
+// transcriptions avoid any reallocation. Longer transcriptions still grow geometrically.


What is the reallocation policy?

Depends on the std::vector implementation. Some do 2x. Some use the golden ratio. Some use 1.5x.
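Since the C++ standard only requires amortized constant-time `push_back` and leaves the growth factor to the implementation, one way to answer the question for a given toolchain is to observe it directly. The sketch below records each capacity a `std::vector` passes through; `ObserveCapacities` and `LastGrowthFactor` are hypothetical helper names, not part of the PR.

```cpp
#include <cstddef>
#include <vector>

// Records every capacity the vector passes through while growing from a
// reserved `initial_capacity` up to `target_size` elements.
std::vector<std::size_t> ObserveCapacities(std::size_t initial_capacity,
                                           std::size_t target_size) {
  std::vector<int> v;
  v.reserve(initial_capacity);
  std::vector<std::size_t> capacities{v.capacity()};
  for (std::size_t i = 0; i < target_size; ++i) {
    v.push_back(0);
    if (v.capacity() != capacities.back()) {
      capacities.push_back(v.capacity());  // a reallocation just happened
    }
  }
  return capacities;
}

// Ratio between the last two observed capacities; 0.0 if nothing grew.
double LastGrowthFactor(const std::vector<std::size_t>& capacities) {
  if (capacities.size() < 2) return 0.0;
  const std::size_t n = capacities.size();
  return static_cast<double>(capacities[n - 1]) /
         static_cast<double>(capacities[n - 2]);
}
```

Calling `ObserveCapacities(256, 4096)` and printing the returned values shows the sequence for the standard library in use. With a reserve of 256, a transcription that stays under 256 tokens yields a single entry, meaning no reallocation occurred.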

kunal-vaishnavi · 2026-06-18T21:40:37Z

+  token_texts.reserve(kInitialTokenCapacity);
+  std::vector<std::unique_ptr<SpeechSegmentItem>> segments;
+  segments.reserve(kInitialTokenCapacity);
+  // Streaming ASR has no text prompt (input is audio), so prompt_tokens stays 0.


If we were to use Whisper for streaming ASR in the future (e.g. with a VAD), we could potentially have a text prompt as well (see the previous text tokens part here for reference).

Is there anything we should add now to support this? It's internal code so we can add later if that's easier.

ASR models in general can take a conditioning prompt (Whisper and Cohere support it), but most of that should probably be abstracted away from the user, IMO. I cannot think of a scenario where a user would need to care about it, so it's internal.

This is only used for returning usage counts, which may not be very meaningful in local inferencing.

kunal-vaishnavi · 2026-06-18T21:48:31Z

+///
+/// As an entry of a SpeechResultItem, `kind` is FINAL (or NONE for a single
+/// non-segmented transcript).
+struct SpeechSegmentItem : Item {


Every decoded token creates a heap-allocated SpeechSegmentItem with its own Finalize() + cached C ABI struct. For long transcriptions (thousands of tokens), this could be significant. A pool allocator or a flat struct array could reduce the per-segment allocation overhead, though this is an optimization concern rather than a correctness issue.

Would prefer to see perf data before adding complexity here.
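As a starting point for gathering that perf data, the two layouts under discussion could be compared with a small harness like the one below. This is only a sketch: `Seg`, `BuildHeapSegments`, `BuildFlatSegments`, and `MeasureMicros` are hypothetical names, `Seg` is far simpler than the real `SpeechSegmentItem`, and a trustworthy measurement would need repeated runs, realistic payloads, and an optimized build.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-token segment. The short literal used
// below typically fits in the small-string buffer, so the string itself
// is unlikely to allocate.
struct Seg {
  std::string text;
};

// One heap allocation per segment, mirroring the unique_ptr layout.
std::vector<std::unique_ptr<Seg>> BuildHeapSegments(std::size_t n) {
  std::vector<std::unique_ptr<Seg>> segments;
  segments.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    segments.push_back(std::make_unique<Seg>(Seg{"tok"}));
  }
  return segments;
}

// Flat layout: segments live contiguously inside the vector's buffer.
std::vector<Seg> BuildFlatSegments(std::size_t n) {
  std::vector<Seg> segments;
  segments.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    segments.push_back(Seg{"tok"});
  }
  return segments;
}

// Times one call of `fn` in microseconds. The result of `fn` is added to
// `sink` so the measured work has an observable effect.
template <typename Fn>
double MeasureMicros(Fn&& fn, std::size_t& sink) {
  const auto start = std::chrono::steady_clock::now();
  sink += fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

A comparison would call `MeasureMicros` once with a lambda returning `BuildHeapSegments(n).size()` and once with `BuildFlatSegments(n).size()`, for token counts in the thousands, and look at whether the difference matters relative to the cost of decoding itself.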

Move the C# test into the file with the other realtime audio tests to reduce duplication of test infrastructure. Clean up some ownership issues in C#. Clean up CA2000 handling. Make logging more deterministic when running at debug level, which will help when diagnosing failures from CI logs.

Initial proposed set of types for review/refinement

8bddd9f

skottmckay commented Jun 1, 2026

Comment thread sdk_v2/cpp/src/inferencing/generative/audio/SPEECH_TYPES.md Outdated

skottmckay commented Jun 2, 2026

Comment thread sdk_v2/cpp/src/inferencing/generative/audio/SPEECH_TYPES.md Outdated

skottmckay added 2 commits June 2, 2026 17:57

Add speech result types and wire up for initial feedback

c0dd6c2

Merge remote-tracking branch 'origin/main' into skottmckay/AudioOutpu…

511f784

…tTypes

skottmckay changed the title ~~Initial proposed set of types for review/refinement~~ [SDKv2] Add speech result types. Jun 2, 2026

skottmckay added 3 commits June 17, 2026 14:35

Merge remote-tracking branch 'origin/main' into skottmckay/AudioOutpu…

7c254ad

…tTypes

Update to remove 'legacy' path returning TextItem.

773a10e

Update C#, python and JS SDKs to use new audio output types.

Fix some test gaps

808a3ee

Update docs

0cfb668

skottmckay added 2 commits June 18, 2026 11:28

Merge remote-tracking branch 'origin/main' into skottmckay/AudioOutpu…

46b6316

…tTypes

Update example to only expect segment output.

f856581

skottmckay commented Jun 18, 2026

skottmckay marked this pull request as ready for review June 18, 2026 01:54

Copilot AI review requested due to automatic review settings June 18, 2026 01:54

Copilot started reviewing on behalf of skottmckay June 18, 2026 01:55

Copilot AI reviewed Jun 18, 2026

Address some local Copilot review comments.

e01c803

skottmckay requested review from baijumeswani, kunal-vaishnavi and nenad1002 June 18, 2026 04:41

baijumeswani previously approved these changes Jun 18, 2026

kunal-vaishnavi reviewed Jun 18, 2026

Remove has_confidence and use sentinel instead for consistency.

d649797

skottmckay dismissed baijumeswani’s stale review via d649797 June 18, 2026 22:05

Fix dispose setup around PinContext to not require IDISP001 warning s…

1cfb8db

…uppression in multiple places
