feat: upgrade volcengine TTS provider from V1 to V3 HTTP Chunked API #8566
DDomelette wants to merge 7 commits into
Conversation
- Replace V1 appid/token/cluster auth with V3 X-Api-Key header
- Support NDJSON streaming response (long texts) and single JSON (short texts)
- Add new audio params: speech_rate, loudness_rate, pitch, emotion
- Add audio format selection: mp3/ogg_opus/pcm, sample_rate, bit_rate
- Add resource_id for model selection (seed-tts-2.0/seed-icl-2.0 etc.)
- Update config metadata fields and i18n translations (zh-CN/en-US/ru-RU)
Hey - I've found 3 issues, and left some high level feedback:
- The TTS V3 response parsing currently buffers the entire chunked body into memory before splitting/decoding; for very long audio this can be wasteful or problematic. Consider streaming line-by-line (NDJSON) and decoding chunks incrementally instead of accumulating `raw_body`.
- The fallback single-JSON handling in `get_audio` assumes the entire response is valid JSON and does not guard `json.loads(raw_text)` with error handling; a malformed or HTML error body from the upstream service would currently raise a non-actionable JSONDecodeError rather than a clearer API error.
- This PR downgrades `VERSION` from 4.25.2 to 4.25.0 and removes several dashboard-related settings (auth rate limit, TOTP, trust_proxy_headers) that are unrelated to the Volcengine TTS upgrade; please double-check whether these config changes are intentional or should be removed from this PR.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The TTS V3 response parsing currently buffers the entire chunked body into memory before splitting/decoding; for very long audio this can be wasteful or problematic—consider streaming line-by-line (NDJSON) and decoding chunks incrementally instead of accumulating `raw_body`.
- The fallback single-JSON handling in `get_audio` assumes the entire response is valid JSON and does not guard `json.loads(raw_text)` with error handling; a malformed or HTML error body from the upstream service would currently raise a non-actionable JSONDecodeError rather than a clearer API error.
- This PR downgrades `VERSION` from 4.25.2 to 4.25.0 and removes several dashboard-related settings (auth rate limit, TOTP, trust_proxy_headers) that are unrelated to the Volcengine TTS upgrade; please double-check whether these config changes are intentional or should be removed from this PR.
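To make the first two comments concrete, here is a minimal sketch of incremental NDJSON decoding. `NDJSONAudioDecoder` is a hypothetical helper, not code from the PR; it assumes each line is a JSON object carrying base64 audio in either a top-level `data` field or a nested `audio.data` field, as the review describes. Unlike the PR, it raises a clear error on a non-JSON line rather than skipping it, which is one possible way to surface an HTML error body.

```python
import base64
import json


class NDJSONAudioDecoder:
    """Hypothetical helper: decode base64 audio from an NDJSON byte stream
    one network chunk at a time, without buffering the whole raw body."""

    def __init__(self) -> None:
        self._pending = b""        # bytes of the current, still incomplete line
        self.audio = bytearray()   # decoded audio collected so far

    def feed(self, chunk: bytes) -> None:
        """Consume one chunk and decode every line that it completes."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest stays pending.
        *complete, self._pending = self._pending.split(b"\n")
        for line in complete:
            self._handle_line(line)

    def finish(self) -> bytes:
        """Flush a trailing line that has no newline and return the audio."""
        if self._pending.strip():
            self._handle_line(self._pending)
        self._pending = b""
        return bytes(self.audio)

    def _handle_line(self, line: bytes) -> None:
        if not line.strip():
            return
        text = line.decode("utf-8", errors="replace")
        try:
            frame = json.loads(text)
        except json.JSONDecodeError as exc:
            # Assumption: a non-JSON line means an upstream error page.
            raise ValueError(f"non-JSON line in TTS response: {text[:80]!r}") from exc
        if not isinstance(frame, dict):
            return
        b64 = frame.get("data")
        if not isinstance(b64, str):
            audio = frame.get("audio")
            b64 = audio.get("data") if isinstance(audio, dict) else None
        if isinstance(b64, str) and b64:
            self.audio += base64.b64decode(b64)
```

In `get_audio`, each chunk from the response stream could be passed to `feed()` and `finish()` called once the stream ends. A single-JSON response is simply the one-line case, so no separate fallback branch is needed. The decoded audio is still held in memory here; writing it to the output file as it arrives would be a further step.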
## Individual Comments
### Comment 1
<location path="astrbot/core/config/default.py" line_range="8" />
<code_context>
from astrbot.core.utils.astrbot_path import get_astrbot_data_path
-VERSION = "4.25.2"
+VERSION = "4.25.0"
DB_PATH = os.path.join(get_astrbot_data_path(), "data_v4.db")
PERSONAL_WECHAT_CONFIG_METADATA = {
</code_context>
<issue_to_address>
**issue (bug_risk):** The version constant was decreased, which is likely an unintended regression.
The previous value was `4.25.2` and it is now `4.25.0`. This will make the runtime appear older than prior builds and may break upgrade/version checks. If this change is not an intentional revert to an older release line, the version should be bumped forward instead of decreased.
</issue_to_address>
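As a rough illustration of why a lowered version constant can trip upgrade checks, the sketch below compares dotted versions as integer tuples. `parse_version` and `is_downgrade` are hypothetical names; how AstrBot actually compares versions is not shown in this PR.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as "4.25.2" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_downgrade(previous: str, current: str) -> bool:
    """Return True when `current` sorts strictly below `previous`."""
    return parse_version(current) < parse_version(previous)
```

With these helpers, `is_downgrade("4.25.2", "4.25.0")` is true, so any check of this shape would treat the build from this PR as older than the release it is based on.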
### Comment 2
<location path="astrbot/core/config/default.py" line_range="2224-2227" />
<code_context>
+ "description": "情感",
+ "hint": "如 tender/happy/sad/storytelling。仅部分音色支持",
+ },
+ "model": {
+ "type": "string",
+ "description": "模型子类型",
+ "hint": "仅声音复刻2.0生效: seed-tts-2.0-standard(标准) / seed-tts-2.0-expressive(表现力)",
},
"azure_tts_voice": {
</code_context>
<issue_to_address>
**suggestion:** The model hint text mentions 声音复刻 2.0 (voice cloning 2.0) but the example model names are `seed-tts-*`, which is confusing.
The hint says "仅声音复刻2.0生效" ("only takes effect for voice cloning 2.0") but the examples are `seed-tts-2.0-*`, while earlier comments describe this field as for 声音复刻 2.0 / `seed-icl-2.0`. To avoid user misconfiguration, please clarify whether this field is for TTS 2.0 or ICL 2.0 and update the wording and example model IDs to match the actual Volcengine model prefix it expects.
Suggested implementation:
```python
"model": {
    "type": "string",
    "description": "模型子类型",
    "hint": "仅声音复刻2.0(Seed ICL 2.0)生效,模型前缀为 seed-icl-2.0-,如 seed-icl-2.0-standard(标准) / seed-icl-2.0-expressive(表现力)",
},
```
1. Please confirm in the Volcengine docs which exact model family this field controls. If it is actually bound to TTS 2.0 rather than ICL 2.0, change the prefix and the description consistently to `seed-tts-2.0-*` and update any other comments or docs that describe this field.
2. If elsewhere in the code or docs this field is still described as accepting `seed-tts-*` or `seed-icl-*` inconsistently, align those hints/descriptions with the final, confirmed model prefix to avoid user confusion.
</issue_to_address>
### Comment 3
<location path="astrbot/core/provider/sources/volcengine_tts.py" line_range="171" />
<code_context>
+
+ return payload
+
async def get_audio(self, text: str) -> str:
- """异步方法获取语音文件路径"""
+ """
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting response parsing and file-writing logic from `get_audio` into small helper methods so that `get_audio` acts as a simple orchestrator instead of a large, multi-responsibility method.
You can keep all V3 behavior but reduce incidental complexity in `get_audio` by pushing parsing and file I/O into small helpers.
### 1. Extract response parsing (NDJSON + single JSON)
Right now `get_audio` is responsible for:
- reading the stream into `raw_body`
- deciding NDJSON vs single JSON
- handling `code` / `error`
- extracting `data` vs `audio.data`
- building `audio_chunks`
- tracking `last_event`
You can move this into a dedicated helper, leaving `get_audio` as orchestration:
```python
def _iter_tts_frames(self, raw_text: str, logid: str):
    lines = [l for l in raw_text.strip().splitlines() if l.strip()]
    if len(lines) <= 1:
        # single JSON fallback
        try:
            obj = json.loads(raw_text)
        except json.JSONDecodeError as e:
            raise Exception(f"火山引擎 TTS 返回非 JSON 响应 (logid={logid}): {e}")
        yield obj
        return
    logger.debug(f"[VolcengineTTS V3] NDJSON mode: {len(lines)} lines")
    for line in lines:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "error" in data:
            raise Exception(
                f"火山引擎 TTS API 错误 (logid={logid}): "
                f"{json.dumps(data['error'], ensure_ascii=False)}"
            )
        code = data.get("code", 0)
        if code not in (0, 20000000):
            raise Exception(
                f"火山引擎 TTS API 错误 (logid={logid}): "
                f"code={code}, message={data.get('message', 'unknown')}"
            )
        yield data
```
Then a focused `_parse_tts_body`:
```python
def _parse_tts_body(self, raw_body: bytes, logid: str) -> bytes:
    raw_text = raw_body.decode("utf-8", errors="replace")
    audio_chunks: list[bytes] = []
    last_event = ""
    for frame in self._iter_tts_frames(raw_text, logid):
        event = frame.get("event")
        if event:
            last_event = event
        b64_str = (
            frame.get("data")
            or (frame.get("audio") or {}).get("data")
        )
        if not isinstance(b64_str, str):
            continue
        b64_str = re.sub(r"\s+", "", b64_str)
        try:
            audio_chunks.append(base64.b64decode(b64_str))
        except Exception:
            # Keep the existing lenient behavior: skip a bad chunk and
            # keep concatenating the usable ones.
            continue
    if not audio_chunks:
        raise Exception(
            f"火山引擎 TTS 未返回音频数据 (logid={logid}, last_event={last_event})。"
            f"可能原因: 1) speaker 与 resource_id 不匹配 "
            f"2) API Key 对应的服务未开通 "
            f"3) 文本内容触发了安全过滤"
        )
    return b"".join(audio_chunks)
```
`get_audio` then only needs:
```python
raw_body = b""
async for chunk in response.content.iter_any():
    if chunk:
        raw_body += chunk
if not raw_body:
    raise Exception(
        f"火山引擎 TTS 返回空响应 (logid={logid}),请检查 API Key 和 resource_id 是否正确"
    )
full_audio = self._parse_tts_body(raw_body, logid)
```
This keeps all current semantics (NDJSON + single JSON, error codes, `last_event`) but moves them into a testable helper and simplifies `get_audio`.
### 2. Extract safe file writing
The inline `run_in_executor` with a lambda and bare `open` is both noisy and a bit fragile. A tiny helper keeps behavior and improves safety:
```python
async def _write_audio_file(self, audio: bytes) -> str:
    temp_dir = get_astrbot_temp_path()
    os.makedirs(temp_dir, exist_ok=True)
    file_path = os.path.join(
        temp_dir,
        f"volcengine_tts_{uuid.uuid4().hex[:12]}.{self.format}",
    )
    loop = asyncio.get_running_loop()

    def _write():
        with open(file_path, "wb") as f:
            f.write(audio)

    await loop.run_in_executor(None, _write)
    return file_path
```
Then in `get_audio`:
```python
full_audio = self._parse_tts_body(raw_body, logid)
file_path = await self._write_audio_file(full_audio)
logger.info(
    f"[VolcengineTTS V3] 合成完成: {file_path} "
    f"({len(full_audio)} bytes, logid={logid})"
)
return file_path
```
(You can still log the chunk count by having `_parse_tts_body` optionally return `(audio_bytes, chunk_count)` if you need that detail.)
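A minimal sketch of that tuple-returning variant, reduced to the decode-and-join step so it runs on its own. `join_audio_chunks` is a hypothetical name standing in for the tail of `_parse_tts_body`:

```python
import base64


def join_audio_chunks(b64_chunks: list[str]) -> tuple[bytes, int]:
    """Decode base64 chunks and return (audio_bytes, chunk_count)."""
    decoded = [base64.b64decode(chunk) for chunk in b64_chunks]
    return b"".join(decoded), len(decoded)
```

The caller would then unpack `full_audio, chunk_count = ...` and log `chunk_count` directly instead of deriving it from the byte length.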
These two extra helpers keep the public behavior, keep all V3 features, and make `get_audio` a clear orchestrator instead of a god method, without large structural changes.
</issue_to_address>
Code Review
This pull request upgrades the Volcengine TTS provider from the V1 API to the V3 HTTP Chunked unidirectional streaming API, updates the default context limit strategy and compression parameters, and removes the Xiaomi provider, TOTP, and dashboard rate-limiting configurations. Feedback on these changes highlights a critical routing issue where the namespace parameter for Volcengine TTS must be set to "TTS" instead of "BidirectionalTTS". Additionally, it is recommended to simplify the response parsing logic into a single unified loop to handle both NDJSON and single JSON formats robustly, and to verify that the context compressor implementation is updated to support the newly introduced llm_compress_keep_recent configuration key.
    # --- Main request body ---
    payload: dict = {
        "user": {"uid": str(uuid.uuid4())[:8]},
        "namespace": "BidirectionalTTS",
For the Volcengine TTS V3 unidirectional HTTP Chunked API, the namespace parameter must be set to "TTS". Setting it to "BidirectionalTTS" (which is used for bidirectional WebSocket streaming) will cause the API request to fail with a routing or unsupported namespace error.
    -        "namespace": "BidirectionalTTS",
    +        "namespace": "TTS",
    # --- Approach 1: NDJSON (streaming format, primary for unidirectional API) ---
    lines = [l for l in raw_text.strip().split("\n") if l.strip()]
    if len(lines) > 1:
        logger.debug(f"[VolcengineTTS V3] NDJSON mode: {len(lines)} lines, {len(raw_body)} bytes")
        for line in lines:
            line = line.strip()
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                continue

            if "error" in data:
                raise Exception(
                    f"火山引擎 TTS API 错误 (logid={logid}): {json.dumps(data['error'], ensure_ascii=False)}"
                )
            if "code" in data:
                code = data.get("code", 0)
                if code not in (0, 20000000):
                    raise Exception(
                        f"火山引擎 TTS API 错误 (logid={logid}): "
                        f"code={code}, message={data.get('message', 'unknown')}"
                    )

            event = data.get("event", "")
            if event:
                last_event = event

            # NDJSON: each line may have either "data" at top level or "audio.data" nested
            if "data" in data and isinstance(data["data"], str):
                b64_str = re.sub(r'\s+', '', data["data"])
                try:
                    audio_chunks.append(base64.b64decode(b64_str))
                except Exception:
                    pass
            elif "audio" in data and "data" in data["audio"]:
                audio_chunks.append(base64.b64decode(data["audio"]["data"]))

    # --- Approach 2: single JSON (fallback for short texts) ---
    if not audio_chunks:
        logger.debug(f"[VolcengineTTS V3] single JSON mode, {len(raw_body)} bytes")
        obj = json.loads(raw_text)
        if "data" in obj and obj["data"] and isinstance(obj["data"], str):
            b64_str = re.sub(r'\s+', '', obj["data"])
            audio_chunks.append(base64.b64decode(b64_str))
The current response parsing logic uses two separate approaches based on len(lines) > 1. This is fragile because if an NDJSON response contains only one line (or if other lines are empty/filtered), it will fall back to single JSON mode and fail to parse the nested "audio.data" field.
We can simplify and make this much more robust by using a single unified loop that parses each line as JSON. Since the NDJSON parser already handles both top-level "data" (single JSON fallback) and nested "audio.data" (NDJSON), a single loop seamlessly handles both response formats.
    # --- Parse response lines (handles both single JSON and NDJSON streaming) ---
    audio_chunks: list[bytes] = []
    last_event = ""
    raw_text = raw_body.decode("utf-8", errors="replace")
    lines = [l for l in raw_text.strip().split("\n") if l.strip()]
    logger.debug(f"[VolcengineTTS V3] Parsing {len(lines)} lines, {len(raw_body)} bytes")
    for line in lines:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "error" in data:
            raise Exception(
                f"火山引擎 TTS API 错误 (logid={logid}): {json.dumps(data['error'], ensure_ascii=False)}"
            )
        if "code" in data:
            code = data.get("code", 0)
            if code not in (0, 20000000):
                raise Exception(
                    f"火山引擎 TTS API 错误 (logid={logid}): "
                    f"code={code}, message={data.get('message', 'unknown')}"
                )
        event = data.get("event", "")
        if event:
            last_event = event
        # Each line may have either "data" at top level (single JSON fallback) or "audio.data" nested (NDJSON)
        if "data" in data and isinstance(data["data"], str):
            b64_str = re.sub(r'\s+', '', data["data"])
            try:
                audio_chunks.append(base64.b64decode(b64_str))
            except Exception:
                pass
        elif "audio" in data and isinstance(data["audio"], dict) and "data" in data["audio"]:
            audio_chunks.append(base64.b64decode(data["audio"]["data"]))

     "4. Write the summary in the user's language.\n"
     ),
    -"llm_compress_keep_recent_ratio": 0.15,
    +"llm_compress_keep_recent": 6,
The configuration key has been changed from "llm_compress_keep_recent_ratio" to "llm_compress_keep_recent" (with a default value of 6 turns). However, the actual context compressor implementation in astrbot/core/agent/context/compressor.py and astrbot/core/agent/context/config.py still expects llm_compress_keep_recent_ratio and does not implement turn-based keeping.
This mismatch will cause the new llm_compress_keep_recent setting to be completely ignored, and could potentially lead to runtime errors or fallback to default values. Please ensure that the context compressor and config classes are updated to support turn-based keeping.
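One way the two keys could coexist during a transition is sketched below. `resolve_keep_recent` is hypothetical and not taken from `compressor.py` or `config.py`; it also assumes the ratio is applied to the number of turns, which the review does not confirm.

```python
def resolve_keep_recent(config: dict, total_turns: int) -> int:
    """Return how many recent turns to keep uncompressed.

    Prefers the new turn-based key and falls back to the old ratio key.
    Assumption: the ratio is a fraction of the total turn count.
    """
    if "llm_compress_keep_recent" in config:
        keep = int(config["llm_compress_keep_recent"])
    else:
        ratio = float(config.get("llm_compress_keep_recent_ratio", 0.15))
        keep = round(total_turns * ratio)
    # Never keep a negative number of turns or more turns than exist.
    return max(0, min(keep, total_turns))
```

Reading both keys in one place means an old config file keeps working while a new one takes the turn-based path, instead of the new key being silently ignored.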
Summary
Upgrades the Volcengine (火山引擎) TTS provider from the deprecated V1 API to the current V3 HTTP Chunked unidirectional streaming API.
Changes
Provider (`volcengine_tts.py`)
Config (`default.py`)
- `resource_id`, `speaker`, `format`, `sample_rate`, `bit_rate`, `speech_rate`, `loudness_rate`, `pitch`, `emotion`, `model`
- Backward compatible: old `voice_type` config field is still read as fallback for `speaker`
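The backward-compatible fallback described above could look roughly like this. `resolve_speaker` is a hypothetical helper; the provider's actual lookup code is not shown in this conversation.

```python
def resolve_speaker(provider_config: dict) -> str:
    """Prefer the new `speaker` field and fall back to the legacy `voice_type`."""
    speaker = provider_config.get("speaker") or provider_config.get("voice_type")
    if not speaker:
        raise ValueError(
            "Volcengine TTS config needs `speaker` (or the legacy `voice_type`)"
        )
    return speaker
```

Using `or` rather than a plain `.get("speaker", ...)` default means an empty `speaker` string in a migrated config still falls through to `voice_type`.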
i18n
Testing
References
Summary by Sourcery
Upgrade the Volcengine text-to-speech provider to the V3 HTTP chunked streaming API and align configuration, defaults, and metadata accordingly.
New Features:
Enhancements:
Documentation: