Add `--profile your_profile` if not using the default CLI profile.

This set of Databricks CLI commands creates 3 Delta tables, uploads 4 PDFs to a UC Volume, and creates a Genie Space and Knowledge Assistant — all automated.

| Variable | Default | Description |
|-----------|---------|-------------|
| `catalog` | `main` | Unity Catalog catalog name |
| `schema` | `bee_pollinator` | Schema for demo tables |
| `warehouse_id` | — (required) | SQL Warehouse ID for Genie Space |

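If your workspace differs from the defaults, these bundle variables can be overridden at deploy time with the CLI's `--var` flag. A minimal sketch — the warehouse ID shown is a placeholder you must replace with your own:

```bash
# Deploy the bundle, overriding the catalog and the required warehouse_id.
# All values below are placeholders; substitute your workspace's own IDs.
databricks bundle deploy \
  --var="catalog=main" \
  --var="schema=bee_pollinator" \
  --var="warehouse_id=<your-sql-warehouse-id>"
```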
### Step 3: Create the Supervisor Agent (~5 minutes)
Test the Supervisor Agent with these queries:
| Type | Query |
|------|-------|
| Data (Genie) | "Which 5 states had the highest colony loss percentage in Q4 2024, and what were their max colonies?" |
| Document (KA) | "What does the Varroa Management Guide recommend for monitoring mite levels?" |
| Cross-modal | "Which stressors affected California colonies most in Q1 2024, and what varroa management practices should California beekeepers prioritize?" |

Honey questions can stay annual. Colony-loss and stressor questions should stay quarterly because the USDA Honey Bee Colonies data in this demo is quarter-based. Use `max_colonies` with `loss_colonies` when you need quarter-specific scale.
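The quarter-specific scale point above can be sketched in a few lines of Python. The column names follow the demo's colony table; the sample values are invented for illustration:

```python
# Hypothetical rows shaped like the demo's quarterly colony-loss data.
rows = [
    {"state": "CA", "quarter": "2024-Q1", "max_colonies": 1_200_000, "loss_colonies": 180_000},
    {"state": "CA", "quarter": "2024-Q2", "max_colonies": 1_150_000, "loss_colonies": 90_000},
]

def loss_pct(row):
    """Quarter-specific loss percentage: losses relative to that quarter's max colonies."""
    return 100 * row["loss_colonies"] / row["max_colonies"]

for r in rows:
    print(f'{r["state"]} {r["quarter"]}: {loss_pct(r):.1f}% loss')
# → CA 2024-Q1: 15.0% loss
# → CA 2024-Q2: 7.8% loss
```

Dividing by the same quarter's `max_colonies` keeps the percentage comparable across quarters even when total colony counts swing seasonally.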

Once active, the judge evaluates incoming traces and attaches feedback scores.

You can also add judges directly through the MLflow Experiment UI (Scorers tab → New Scorer) or programmatically via the SDK. See [Registering and Versioning Scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/versioning/) for details.

## Evaluate the Supervisor Agent with MLflow

For a more comprehensive evaluation beyond ad-hoc Genie Code judges, the `eval_supervisor` notebook runs 12 queries across all three routing patterns and scores every response using MLflow's GenAI evaluation framework — `mlflow.genai.evaluate()`.

![Evaluation results dashboard](./images/bee_pollinator_evals.png)

Each Supervisor Agent query trace shows the judge's score and the rationale for that score.

![Evaluation rationale](./images/eval_rational.png)

### What it evaluates

The notebook sends 12 queries (4 Genie-only, 4 Knowledge-Assistant-only, 4 both) through the deployed Supervisor Agent and applies four scorers to each response:
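The routing mix can be sketched as plain data. The queries and field names below are illustrative, not the notebook's actual eval records:

```python
# Minimal sketch of the 12-query eval set: 4 queries per routing pattern.
eval_set = (
    [{"query": f"colony data question {i}", "expected_route": ["genie"]} for i in range(4)]
    + [{"query": f"document question {i}", "expected_route": ["knowledge_assistant"]} for i in range(4)]
    + [{"query": f"cross-modal question {i}", "expected_route": ["genie", "knowledge_assistant"]} for i in range(4)]
)

print(len(eval_set))  # → 12
```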

| Scorer | Type | What it measures |
|--------|------|------------------|
| **Routing Correctness** | `make_judge()` | Did the supervisor route to the correct sub-agent(s)? |
| **Answer Correctness** | Built-in `Correctness()` | Does the response contain the expected facts? |
| **Completeness** | `@scorer` + `make_judge()` | Does the response cover all expected elements? |
| **Response Quality** | Built-in `Guidelines()` | Does the response meet domain quality standards? |

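A plain-Python sketch of the completeness idea — in the notebook this kind of check is wrapped with MLflow's `@scorer` decorator; the expected-element list here is hypothetical:

```python
def completeness_score(response: str, expected_elements: list[str]) -> float:
    """Fraction of expected elements mentioned in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for element in expected_elements if element.lower() in text)
    return hits / len(expected_elements)

# Hypothetical expected elements for a varroa-monitoring question.
expected = ["alcohol wash", "mite threshold", "monthly monitoring"]
response = "Use an alcohol wash monthly and treat once the mite threshold is exceeded."
print(completeness_score(response, expected))  # → roughly 0.67 (2 of 3 elements found)
```

A fractional score like this is easier to aggregate across 12 queries than a pass/fail flag, which is why completeness is scored per element rather than per response.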
### How to run it

1. Open `scripts/eval_supervisor.py` in your Databricks workspace (the bundle uploads it under `/Workspace/Users/<you>/.bundle/bee-pollinator-demo/dev/files/scripts/`)
2. Attach the notebook to a cluster
3. Set the two widgets at the top:
- **Supervisor Agent Endpoint Name** — the serving endpoint for your Supervisor Agent (e.g., `mas-f6c439c0-endpoint`)
- **Judge Model URI** — the model used for LLM judge scorers (e.g., `databricks:/databricks-gpt-5-4`)
4. Run All Cells — the notebook installs dependencies, queries the agent, runs all four judges, and displays results

The evaluation takes 3-6 minutes depending on agent response times.

### What you get

- **MLflow experiment** at `/Users/<you>/bee_pollinator_eval` with metrics logged per run
- **Eval results table** with per-query scores for routing, correctness, completeness, and quality
- **Aggregate metrics** displayed as an HTML dashboard in the notebook
- All traces are captured via `mlflow.openai.autolog()` for drill-down in the MLflow Traces tab

## Teardown

```bash