This tutorial shows how to use `qebench run` to batch-translate dataset
entries with LLM providers, then evaluate the results with `qebench judge`.
- You've completed Getting Started
- Install the LLM dependencies:

  ```shell
  uv sync --extra llm
  ```

- Set your API key as an environment variable:

  ```shell
  # For Claude (Anthropic)
  export ANTHROPIC_API_KEY=sk-ant-...

  # For OpenAI
  export OPENAI_API_KEY=sk-...
  ```

Before making API calls, preview what will be translated:

```shell
uv run qebench run --dry-run
```

This shows the first 5 entries that would be translated, without calling the API:
```text
╭──── qebench run ────╮
│ Provider: claude    │
│ Model: (default)    │
│ Prompt: default     │
│ Entries: 314 terms  │
╰─────────────────────╯

Dry run — no API calls will be made.

term-001: inflation...
term-002: gross domestic product...
term-003: supply and demand...
term-004: equilibrium...
term-005: monetary policy...
... and 309 more
```
Start with a small batch to verify everything works:

```shell
uv run qebench run -n 10
```

This translates 10 terms using the default provider (Claude) and prompt. You'll see a progress spinner, then a summary:
```text
╭──────── Run Summary ─────────╮
│ Entries translated        10 │
│ Total tokens           2,340 │
│ Total cost           $0.0035 │
│ Avg latency            245ms │
│ Output file  results/model-… │
╰──────────────────────────────╯
```
Results are saved as JSONL to `results/model-outputs/`.
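The summary numbers can also be recomputed from the output file itself. A minimal sketch, assuming each record carries the `cost_usd`, `input_tokens`, and `output_tokens` fields shown in the output format section below:

```python
# Recompute run totals from a results JSONL file (a sketch; field names
# match the record format documented later in this page).
import json

def summarize(path: str) -> dict:
    """Sum entries, tokens, and cost across one run's JSONL output."""
    count = total_tokens = 0
    total_cost = 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            count += 1
            total_tokens += rec["input_tokens"] + rec["output_tokens"]
            total_cost += rec["cost_usd"]
    return {"entries": count, "tokens": total_tokens, "cost_usd": round(total_cost, 6)}
```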
Compare Claude and OpenAI on the same entries:

```shell
# Claude (default)
uv run qebench run -n 20 -d economics

# OpenAI
uv run qebench run -n 20 -d economics --provider openai
```

Each run creates a separate output file, so results are never overwritten.
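Because filenames are unique per run, you can grab the most recent output programmatically when comparing runs. A small sketch, assuming at least one run has written to the `results/model-outputs/` directory described above:

```python
# Find the newest results file by modification time (a sketch; assumes
# at least one completed run in results/model-outputs/).
from pathlib import Path

def latest_output(results_dir: str = "results/model-outputs") -> Path:
    """Return the most recently modified .jsonl file in results_dir."""
    files = list(Path(results_dir).glob("*.jsonl"))
    if not files:
        raise FileNotFoundError(f"no .jsonl files in {results_dir}")
    return max(files, key=lambda p: p.stat().st_mtime)
```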
Override the default model for a provider:

```shell
uv run qebench run --provider openai --model gpt-5.4-mini -n 10
```

Prompt templates live in the `prompts/` directory. The project ships with four:

- `default` — general-purpose translation prompt
- `academic` — emphasizes formal academic register
- `action-basic` — MyST Markdown-aware rules (preserves directives, code, math fencing)
- `action-new` — MyST rules + glossary injection from `action-translation`
```shell
# Academic prompt
uv run qebench run --prompt academic -n 20

# Action-translation style (MyST-aware, no glossary)
uv run qebench run --prompt action-basic -n 20

# Action-translation style with glossary injection
uv run qebench run --prompt action-new -n 20

# Compare against default on the same domain
uv run qebench run --prompt default -n 20 -d economics
```

The `action-new` template automatically loads the glossary from
`action-translation`'s GitHub repository (configured in `config.yaml`).
See Glossary & Prompt Templates for full details.
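You can check which templates are available locally by listing the directory. A sketch, assuming each template is a `.txt` file in `prompts/` as described above:

```python
# List prompt template names by filename stem
# (assumes one .txt file per template in the prompts/ directory).
from pathlib import Path

def list_templates(prompts_dir: str = "prompts") -> list[str]:
    """Return sorted template names, e.g. ['academic', 'default', ...]."""
    return sorted(p.stem for p in Path(prompts_dir).glob("*.txt"))
```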
By default, `qebench run` translates terms. Use `--type` for other entry types:

```shell
# Translate sentences
uv run qebench run --type sentences -n 10

# Translate paragraphs
uv run qebench run --type paragraphs -n 5
```

Paragraphs are the most challenging and informative entry type for benchmarking.
They include MyST feature flags (`contains_directives`, `contains_roles`,
`contains_mixed_fencing`) that describe the structural complexity of each
paragraph, for filtering and analysis.
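For example, you could select only the structurally complex paragraphs before a run. A sketch, assuming the dataset is a JSONL file whose records carry the boolean flags named above (the JSONL layout itself is an assumption for illustration):

```python
# Filter paragraph entries by MyST feature flags (a sketch; the flag
# names come from the docs above, the JSONL layout is an assumption).
import json
from pathlib import Path

def complex_paragraphs(path: str) -> list[dict]:
    """Return entries containing directives, roles, or mixed fencing."""
    flags = ("contains_directives", "contains_roles", "contains_mixed_fencing")
    result = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if any(entry.get(f) for f in flags):
            result.append(entry)
    return result
```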
Once you have model outputs, use `qebench judge` to compare them:

```shell
uv run qebench judge -n 10
```

The judge reveal panel shows formatting scores alongside the reference overlap and glossary compliance metrics, so you can see whether a model broke directives, mixed fence markers, or used ASCII punctuation instead of fullwidth characters.
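As an illustration of the punctuation check (this is not qebench's actual judge code), flagging ASCII punctuation where fullwidth characters are expected can be as simple as:

```python
# Flag ASCII punctuation in a Chinese translation where fullwidth
# equivalents (，。：；？！) would be expected. Illustration only;
# not qebench's actual judging logic.
ASCII_PUNCT = set(",.;:?!")

def ascii_punctuation(text: str) -> list[str]:
    """Return any ASCII punctuation characters found in the text."""
    return [ch for ch in text if ch in ASCII_PUNCT]
```

A translation ending in `.` instead of `。` would be flagged here, while fullwidth punctuation passes untouched.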
See Judging Translations for the full walkthrough.
Push model outputs and judgments to GitHub:

```shell
uv run qebench submit
```

Each run produces a JSONL file in `results/model-outputs/` with one record
per entry:
```json
{
  "entry_id": "term-001",
  "source_text": "inflation",
  "translated_text": "通货膨胀",
  "model": "claude-sonnet-4-6",
  "provider": "claude",
  "prompt_template": "default",
  "input_tokens": 123,
  "output_tokens": 45,
  "cost_usd": 0.001044,
  "latency_ms": 345.6
}
```

```shell
# Basic run (all terms, Claude, default prompt)
uv run qebench run

# Targeted run
uv run qebench run -n 20 -d economics --type sentences

# Compare providers
uv run qebench run -n 50 --provider claude
uv run qebench run -n 50 --provider openai

# Compare prompts (all 4)
uv run qebench run -n 50 --prompt default
uv run qebench run -n 50 --prompt academic
uv run qebench run -n 50 --prompt action-basic
uv run qebench run -n 50 --prompt action-new

# Dry run to preview
uv run qebench run --dry-run --type paragraphs
```

- Judge results: See Judging Translations to evaluate model outputs
- Check leaderboard: `qebench stats` shows Elo ratings and XP rankings
- Add custom prompts: Create a new `.txt` file in `prompts/` and pass its name with `--prompt`
- Glossary & prompts: See Glossary & Prompt Templates for details on glossary injection
- Seed more data: See Seeding from Lectures to extract sentence/paragraph pairs (developer guide)