
Commit 18ae04a

docs: add chatCollect(), API spec cross-reference, v2 scoping

Brainstorm:

- Added chatCollect() for non-streaming programmatic API
- Scoped out vision/multimodal, thinking/budget_tokens, tools/tool_choice as v2 items with specific rationale
- Added reasoning_effort to v1 scope
- Referenced PRs #166 (agent plugin) and #200 (vector search)
- Updated references with query/vision/reasoning/function-calling docs

Plan:

- Cross-referenced Databricks Query API spec vs OpenAI conventions
- Documented type sourcing decision (hand-write for v1, sourced from OpenAI API reference)
- Added SDK comparison table (OpenAI vs Anthropic vs AppKit)
- Fixed id: string | null in response types
- Noted served-model-name header for telemetry
- Documented extra_params vs top-level field convention

Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>

1 parent b8abb99 commit 18ae04a

File tree

1 file changed: +45, -4 lines


docs/plans/2026-03-24-feat-model-serving-plugin-plan.md

@@ -17,6 +17,8 @@ deepened: 2026-03-24

 **Code review agents used:** Architecture strategist, Security sentinel, Performance oracle, Spec flow analyzer, Pattern recognition specialist

 **Docs cross-reference review on:** 2026-03-25

 **Docs reviewed:** [Vision models](https://docs.databricks.com/aws/en/machine-learning/model-serving/query-vision-models), [Reasoning models](https://docs.databricks.com/aws/en/machine-learning/model-serving/query-reason-models), [Function calling](https://docs.databricks.com/aws/en/machine-learning/model-serving/function-calling)
+
+**API spec cross-reference on:** 2026-03-25
+
+**Sources reviewed:** [Databricks Query API](https://docs.databricks.com/api/workspace/servingendpoints/query), OpenAI SDK v6.32, Anthropic SDK

 ### Key Improvements

 1. **Type safety hardened** — removed unsafe index signature, added string literal unions for roles, fully specified response types

@@ -31,6 +33,10 @@ deepened: 2026-03-24

 - `reasoning_effort` added to v1 allowlist (GPT-5, Gemini 3.x, GPT OSS reasoning models — simple string enum, zero security risk)
 - `databricks-` prefix check removed from endpoint name validation — Foundation Model API endpoints all use this prefix (e.g., `databricks-claude-sonnet-4-5`)
 - Vision/multimodal, `thinking`/`budget_tokens` (Claude reasoning), and function calling explicitly documented as v2 considerations in Known Limitations
+- Plan types sourced from OpenAI conventions, not the Databricks API spec (which only documents 7 chat params + `extra_params` catch-all)
+- `id` can be `null` in Databricks responses (fixed in types)
+- `served-model-name` response header available for telemetry
+- `extra_params` is the Databricks-blessed pattern for extended params, but top-level fields also work via OpenAI compat layer
 - AppKit has no upstream SSE parser — need to create one for proxy scenarios
 - `SSEWriter.writeEvent()` doesn't handle backpressure (known gap, not blocking for v1)
 - Resource model simplified: one required (chat) + one optional (embedding) — aligns with CLI `apps init` flow and Databricks Apps `valueFrom` pattern
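The missing SSE parser noted above could start from something like this minimal sketch — a simplified whole-string parser for illustration only (a real proxy parser must also handle events split across network reads; the function name is not from the plan):

```typescript
// Illustrative sketch: parse `data:` lines from a complete SSE payload and
// yield each JSON body, stopping at the OpenAI-style `[DONE]` sentinel.
export function* parseSSEData(body: string): Generator<unknown> {
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line.startsWith('data:')) continue; // skip blanks, comments, event:/id: fields
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') return; // end-of-stream sentinel used by chat completion APIs
    yield JSON.parse(payload);
  }
}
```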
@@ -229,9 +235,42 @@ const response = await fetch(url, { dispatcher: servingAgent, signal, ... });

 Consider separate pools for streaming (long-lived) vs. non-streaming (short-lived) to prevent head-of-line blocking — this is a v2 optimization. For v1, include a single undici `Agent` with `connections: 100` (configurable via `IServingConfig.connectionPoolSize`). Default `fetch()` only allows ~10 connections per origin, which saturates with just 10 concurrent streaming users — each streaming request holds a TCP connection for the full LLM response duration (30-120s). The 6-line `Agent` config prevents this at near-zero cost (undici idle connection overhead is ~1KB memory). 100 provides headroom for mixed streaming + non-streaming workloads (at 40 concurrent streaming users, 60 remain for embeddings). Add a `// TODO: separate pools for streaming vs non-streaming` comment for v2.
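As a sketch, the single-pool setup above might look like this (assumes undici as a dependency; the `agentOptions` helper is illustrative — only `IServingConfig.connectionPoolSize` comes from the plan):

```typescript
// Illustrative sketch of the v1 single-pool configuration described above.
interface IServingConfig {
  connectionPoolSize?: number;
}

// Default 100 connections vs. the ~10-per-origin cap of default fetch().
export function agentOptions(config: IServingConfig = {}): { connections: number } {
  return { connections: config.connectionPoolSize ?? 100 };
}

// Usage with undici (assumed dependency):
//   import { Agent, fetch } from 'undici';
//   // TODO: separate pools for streaming vs non-streaming (v2)
//   const servingAgent = new Agent(agentOptions(config));
//   const response = await fetch(url, { dispatcher: servingAgent, signal });
```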

+### Type Sourcing & API Compatibility
+
+The plugin's request/response types follow **OpenAI conventions**, not the Databricks API spec. The [official Databricks spec](https://docs.databricks.com/api/workspace/servingendpoints/query) is a generic endpoint that documents only 7 chat parameters; everything else passes through the OpenAI-compatible layer. Neither the `openai` nor `@anthropic-ai/sdk` packages are in AppKit's dependencies.
+
+**Type sourcing options:**
+
+| Option | Pros | Cons |
+|--------|------|------|
+| **A. `import type` from `openai`** | Maintained by OpenAI, comprehensive (tools, vision, reasoning, streaming chunks), great autocomplete, zero runtime cost (dev dep only) | Adds a dependency; `openai` is NOT currently in AppKit |
+| **B. Hand-write types (current)** | No new dependency, can restrict to exactly what the allowlist permits | Must be manually maintained, sourced against actual API docs |
+| **C. `@databricks/sdk-experimental`** | Already a dependency | Missing `ChatCompletionChunk`, streaming types, and OpenAI-compatible response shapes |
+
+**Decision:** Hand-write types for v1 (Option B), explicitly sourced against the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat) and [Databricks API spec](https://docs.databricks.com/api/workspace/servingendpoints/query). Revisit Option A if type drift becomes a maintenance burden. The types define AppKit's security boundary (what the proxy accepts), not the full upstream API.
+
+**SDK comparison — how AppKit maps to alternative SDKs:**
+
+| Aspect | OpenAI SDK | Anthropic SDK | AppKit |
+|--------|-----------|---------------|--------|
+| **Non-streaming** | `client.chat.completions.create()` → `ChatCompletion` | `client.messages.create()` → `Message` | `appkit.serving.chatCollect()` → `ChatCompletionResponse` |
+| **Streaming** | `create({stream:true})` → `Stream<ChatCompletionChunk>` (AsyncIterable) | `create({stream:true})` → `RawMessageStreamEvent` | `chat()` → `AsyncGenerator<ChatCompletionChunk>` |
+| **Streaming (high-level)** | `.stream()` → `ChatCompletionStream` (events + `finalContent()`) | `.stream()` → `MessageStream` (events + `finalText()`) | None — raw AsyncGenerator only |
+| **Embeddings** | `client.embeddings.create()` → `CreateEmbeddingResponse` | N/A | `appkit.serving.embed()` → `EmbeddingResponse` |
+| **Auth** | `apiKey` + `baseURL` | `apiKey` | `WorkspaceClient.config.authenticate()` — SP/OBO |
+| **Content** | `string \| ContentPart[]` (text, image_url, audio, file) | `string \| ContentBlock[]` (text, image, document) | `string` only (v1) |
+| **Reasoning** | `reasoning_effort` enum | `thinking: {type, budget_tokens}` | `reasoning_effort` (v1); `thinking` (v2) |
+| **Tools** | `tools[]` + `tool_choice` + `runTools()` | `tools[]` + `tool_choice` + `zodTool()` | Excluded v1 |
+
+**Key architectural distinction:** OpenAI/Anthropic SDKs are *clients* — they construct requests for one provider. AppKit is a *server-side proxy* — it receives frontend requests, forwards to Databricks (OpenAI-compatible), and streams back. The types define a security boundary, not a client API.
+
+**`served-model-name` response header:** Databricks returns the actual model that served the request in a `served-model-name` response header. Capture this for telemetry spans (e.g., `serving.served_model_name` attribute) — useful when endpoints have traffic splitting across multiple models.
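To make the streaming row of the table concrete, here is a hedged caller-side sketch (hypothetical helper, not part of the plan's API) that drains AppKit's raw `AsyncGenerator` into one string — roughly the convenience `chatCollect()` provides at the full-response level:

```typescript
// Hypothetical caller-side helper: drains a chat stream into a single string.
interface ChatCompletionChunk {
  id: string | null; // may be null per the Databricks API spec sample
  object: 'chat.completion.chunk';
  choices: { delta: { content?: string } }[];
}

export async function collectText(
  stream: AsyncIterable<ChatCompletionChunk>,
): Promise<string> {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? ''; // role-only / finish chunks carry no content
  }
  return text;
}
```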
 ### Request Validation

-Minimal validation with security guardrails:
+Minimal validation with security guardrails.
+
+**Note on `extra_params` (from API spec cross-reference):** The [official Databricks API spec](https://docs.databricks.com/api/workspace/servingendpoints/query) only documents 7 chat parameters (`messages`, `max_tokens`, `n`, `stop`, `stream`, `temperature`, `input`) and provides `extra_params` as a catch-all `object` field for "completions, chat, and embeddings" endpoints. Parameters like `top_p`, `model`, `reasoning_effort`, etc. are NOT in the official spec — they pass through the OpenAI-compatible layer. We send them as top-level fields (matching how the OpenAI SDK sends them) rather than using `extra_params`, because the OpenAI compat layer accepts both approaches and top-level is the convention for OpenAI-compatible clients.
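A minimal sketch of that convention — the allowlist contents and function name here are illustrative (only the spec's documented params plus `top_p`/`reasoning_effort` from the surrounding discussion; the authoritative allowlist lives in the plan's validation section):

```typescript
// Illustrative: forward only allowlisted params, as top-level fields rather
// than wrapped in extra_params (the OpenAI compat layer accepts both).
const V1_ALLOWLIST = new Set([
  'messages', 'max_tokens', 'n', 'stop', 'stream', 'temperature', // in the Databricks spec
  'top_p', 'reasoning_effort', // OpenAI-compat layer only
]);

export function buildUpstreamBody(
  input: Record<string, unknown>,
): Record<string, unknown> {
  const body: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    if (V1_ALLOWLIST.has(key)) body[key] = value; // everything else is dropped
  }
  return body;
}
```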

 #### Research Insights

@@ -367,7 +406,8 @@ interface ChatCompletionRequest {
 }

 interface ChatCompletionResponse {
-  id: string;
+  /** Can be `null` for some Databricks models (confirmed in API spec sample). */
+  id: string | null;
   object: 'chat.completion';
   created: number;
   model: string;

@@ -384,7 +424,7 @@ interface ChatCompletionResponse {
 }

 interface ChatCompletionChunk {
-  id: string;
+  id: string | null;
   object: 'chat.completion.chunk';
   created: number;
   model: string;
@@ -753,6 +793,7 @@ From security review + code review — implement during corresponding phase:

 - [Query reasoning models](https://docs.databricks.com/aws/en/machine-learning/model-serving/query-reason-models) — `reasoning_effort` (v1), `thinking`/`budget_tokens` (v2)
 - [Function calling](https://docs.databricks.com/aws/en/machine-learning/model-serving/function-calling) — `tools`/`tool_choice` (v2 consideration)
 - [Databricks Apps: Model Serving integration](https://docs.databricks.com/aws/en/dev-tools/databricks-apps/model-serving)
-- [OpenAI Chat Completions API reference](https://platform.openai.com/docs/api-reference/chat)
+- [Serving Endpoints Query API spec](https://docs.databricks.com/api/workspace/servingendpoints/query) — official spec (7 chat params + `extra_params` catch-all; plan types follow OpenAI format instead)
+- [OpenAI Chat Completions API reference](https://platform.openai.com/docs/api-reference/chat) — the actual source for plan types (OpenAI-compatible format)
 - [OpenAI Streaming Responses Guide](https://developers.openai.com/api/docs/guides/streaming-responses)
 - [Node.js Backpressuring in Streams](https://nodejs.org/en/learn/modules/backpressuring-in-streams)
