Add `--profile your_profile` if not using the default CLI profile.

This set of Databricks CLI commands creates 3 Delta tables, uploads 4 PDFs to a UC Volume, and creates a Genie Space and Knowledge Assistant — all automated.

| Variable | Default | Description |
|-----------|---------|-------------|
| `catalog` | `main` | Unity Catalog catalog name |
| `schema` | `bee_pollinator` | Schema for demo tables |
| `warehouse_id` | — (required) | SQL Warehouse ID for Genie Space |

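If your workspace differs from the defaults, these bundle variables can be overridden at deploy time with the CLI's `--var` flag. A minimal sketch — the warehouse ID shown is a placeholder you must replace with your own:

```bash
# Deploy the bundle, overriding the catalog and the required warehouse_id.
# All values below are placeholders; substitute your workspace's own IDs.
databricks bundle deploy \
  --var="catalog=main" \
  --var="schema=bee_pollinator" \
  --var="warehouse_id=<your-sql-warehouse-id>"
```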
### Step 3: Create the Supervisor Agent (~5 minutes)
Test the Supervisor Agent with these queries:
| Type | Query |
|------|-------|
| Data (Genie) | "Which 5 states had the highest colony loss percentage in Q4 2024, and what were their max colonies?" |
| Document (KA) | "What does the Varroa Management Guide recommend for monitoring mite levels?" |
| Cross-modal | "Which stressors affected California colonies most in Q1 2024, and what varroa management practices should California beekeepers prioritize?" |

Honey questions can stay annual. Colony-loss and stressor questions should stay quarterly because the USDA Honey Bee Colonies data in this demo is quarter-based. Use `max_colonies` with `loss_colonies` when you need quarter-specific scale.
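The quarter-specific scale point above can be sketched in a few lines of Python. The column names follow the demo's colony table; the sample values are invented for illustration:

```python
# Hypothetical rows shaped like the demo's quarterly colony-loss data.
rows = [
    {"state": "CA", "quarter": "2024-Q1", "max_colonies": 1_200_000, "loss_colonies": 180_000},
    {"state": "CA", "quarter": "2024-Q2", "max_colonies": 1_150_000, "loss_colonies": 90_000},
]

def loss_pct(row):
    """Quarter-specific loss percentage: losses relative to that quarter's max colonies."""
    return 100 * row["loss_colonies"] / row["max_colonies"]

for r in rows:
    print(f'{r["state"]} {r["quarter"]}: {loss_pct(r):.1f}% loss')
# → CA 2024-Q1: 15.0% loss
# → CA 2024-Q2: 7.8% loss
```

Dividing by the same quarter's `max_colonies` keeps the percentage comparable across quarters even when total colony counts swing seasonally.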

Once active, the judge evaluates incoming traces and attaches feedback scores.

You can also add judges directly through the MLflow Experiment UI (Scorers tab → New Scorer) or programmatically via the SDK. See [Registering and Versioning Scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/versioning/) for details.

## Evaluate the Supervisor Agent with MLflow

For a more comprehensive evaluation beyond ad-hoc Genie Code judges, the `eval_supervisor` notebook runs 12 queries across all three routing patterns and scores every response using MLflow's GenAI evaluation framework — `mlflow.genai.evaluate()`.

![Evaluation results dashboard](./images/bee_pollinator_evals.png)

Each Supervisor Agent query trace shows the judge's score and the rationale for that score.

![Evaluation rationale](./images/eval_rational.png)

### What it evaluates

The notebook sends 12 queries (4 Genie-only, 4 Knowledge-Assistant-only, 4 both) through the deployed Supervisor Agent and applies four scorers to each response:
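The routing mix can be sketched as plain data. The queries and field names below are illustrative, not the notebook's actual eval records:

```python
# Minimal sketch of the 12-query eval set: 4 queries per routing pattern.
eval_set = (
    [{"query": f"colony data question {i}", "expected_route": ["genie"]} for i in range(4)]
    + [{"query": f"document question {i}", "expected_route": ["knowledge_assistant"]} for i in range(4)]
    + [{"query": f"cross-modal question {i}", "expected_route": ["genie", "knowledge_assistant"]} for i in range(4)]
)

print(len(eval_set))  # → 12
```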

| Scorer | Type | What it measures |
|--------|------|------------------|
| **Routing Correctness** | `make_judge()` | Did the supervisor route to the correct sub-agent(s)? |
| **Answer Correctness** | Built-in `Correctness()` | Does the response contain the expected facts? |
| **Completeness** | `@scorer` + `make_judge()` | Does the response cover all expected elements? |
| **Response Quality** | Built-in `Guidelines()` | Does the response meet domain quality standards? |

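A plain-Python sketch of the completeness idea — in the notebook this kind of check is wrapped with MLflow's `@scorer` decorator; the expected-element list here is hypothetical:

```python
def completeness_score(response: str, expected_elements: list[str]) -> float:
    """Fraction of expected elements mentioned in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for element in expected_elements if element.lower() in text)
    return hits / len(expected_elements)

# Hypothetical expected elements for a varroa-monitoring question.
expected = ["alcohol wash", "mite threshold", "monthly monitoring"]
response = "Use an alcohol wash monthly and treat once the mite threshold is exceeded."
print(completeness_score(response, expected))  # → roughly 0.67 (2 of 3 elements found)
```

A fractional score like this is easier to aggregate across 12 queries than a pass/fail flag, which is why completeness is scored per element rather than per response.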
### How to run it

1. Open `scripts/eval_supervisor.py` in your Databricks workspace (the bundle uploads it under `/Workspace/Users/<you>/.bundle/bee-pollinator-demo/dev/files/scripts/`)
2. Attach the notebook to a cluster
3. Set the two widgets at the top:
- **Supervisor Agent Endpoint Name** — the serving endpoint for your Supervisor Agent (e.g., `mas-f6c439c0-endpoint`)
- **Judge Model URI** — the model used for LLM judge scorers (e.g., `databricks:/databricks-gpt-5-4`)
4. Run All Cells — the notebook installs dependencies, queries the agent, runs all four judges, and displays results

The evaluation takes 3-6 minutes depending on agent response times.

### What you get

- **MLflow experiment** at `/Users/<you>/bee_pollinator_eval` with metrics logged per run
- **Eval results table** with per-query scores for routing, correctness, completeness, and quality
- **Aggregate metrics** displayed as an HTML dashboard in the notebook
- All traces are captured via `mlflow.openai.autolog()` for drill-down in the MLflow Traces tab

## Teardown

```bash