diff --git a/demos/bee-pollinator/README.md b/demos/bee-pollinator/README.md index 5a04a75..15495a4 100644 --- a/demos/bee-pollinator/README.md +++ b/demos/bee-pollinator/README.md @@ -72,10 +72,10 @@ Add `--profile your_profile` if not using the default CLI profile. This set of databricks commands creates 3 Delta tables, uploads 4 PDFs to a UC Volume, and creates a Genie Space and Knowledge Assistant — all automated. -| Variable | Default | Description | -|----------|---------|-------------| +| Variable | Default | Description | +|-----------|---------|-------------| | `catalog` | `main` | Unity Catalog catalog name | -| `schema` | `bee_pollinator` | Schema for demo tables | +| `schema` | `bee_pollinator` | Schema for demo tables | | `warehouse_id` | — (required) | SQL Warehouse ID for Genie Space | ### Step 3: Create the Supervisor Agent (~5 minutes) @@ -126,8 +126,8 @@ Test the Supervisor Agent with these queries: | Type | Query | |------|-------| | Data (Genie) | "Which 5 states had the highest colony loss percentage in Q4 2024, and what were their max colonies?" | -| Document (KA) | "What does the Varroa Management Guide recommend for monitoring mite levels?" | -| Cross-modal | "Which stressors affected California colonies most in Q1 2024, and what varroa management practices should California beekeepers prioritize?" | +| Document (KA)| "What does the Varroa Management Guide recommend for monitoring mite levels?" | +| Cross-modal | "Which stressors affected California colonies most in Q1 2024, and what varroa management practices should California beekeepers prioritize?" | Honey questions can stay annual. Colony-loss and stressor questions should stay quarterly because the USDA Honey Bee Colonies data in this demo is quarter-based. Use `max_colonies` with `loss_colonies` when you need quarter-specific scale. @@ -158,23 +158,45 @@ If any of the primary queries don't land well: - Document: `Which native plants should I recommend for spring forage in the Northeast?` - Cross-modal: `Which stressors affected North Dakota colonies most in Q4 2024, and what management practices should beekeepers prioritize?` -## Add an Evaluation Judge with Genie Code -After running a few queries, you can use **Genie Code** to add an LLM judge that automatically evaluates incoming requests. This is itself a demo of Databricks' AI-assisted evaluation workflow. +## Evaluate the Supervisor Agent with MLflow -1. Navigate to **Experiments** in the left sidebar and open the experiment associated with your Supervisor Agent -2. Click the **Genie Code** icon to open the assistant -3. Prompt it with something like: +For a more comprehensive evaluation beyond ad-hoc Genie Code judges, the `eval_supervisor` notebook runs 12 queries across all three routing patterns and scores every response using MLflow's GenAI evaluation framework — `mlflow.genai.evaluate()`. -> Add a judge to this experiment that evaluates whether each request was routed to the correct sub-agent. It should check whether data/statistics questions go to the Genie agent, guidance/best-practices questions go to the Knowledge Assistant, and combined questions are routed to both. Return a categorical label — one of "correct", "partially_correct", or "incorrect" — along with a brief rationale explaining which agent(s) were called and why the routing was or wasn't appropriate for the query. +![Evaluation results dashboard](./images/bee_pollinator_evals.png) -Genie Code will inspect your existing traces, identify the routing structure, and generate the Python code to register and start a routing-correctness judge. You may need to click the **Run** button in the Genie Code chat window to execute the generated code. +Each Supervisor's query trace shows the judge's score and the rational for that score. -![Genie Code generating a routing judge](./images/genie-code-judge.png) +![Evalution rational](./images/eval_rationale.png) -Once active, the judge evaluates incoming traces and attaches feedback scores visible in the **Traces** tab. +### What it evaluates -You can also add judges directly through the MLflow Experiment UI (Scorers tab → New Scorer) or programmatically via the SDK. See [Registering and Versioning Scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/versioning/) for details. +The notebook sends 12 queries (4 Genie-only, 4 Knowledge-Assistant-only, 4 both) through the deployed Supervisor Agent and applies four scorers to each response: + +| Scorer | Type | What it measures | +|--------|------|------------------| +| **Routing Correctness** | `make_judge()` | Did the supervisor route to the correct sub-agent(s)? | +| **Answer Correctness** | Built-in `Correctness()` | Does the response contain the expected facts? | +| **Completeness** | `@scorer` + `make_judge()` | Does the response cover all expected elements? | +| **Response Quality** | Built-in `Guidelines()` | Does the response meet domain quality standards? | + +### How to run it + +1. Open the scripts/eval_supervisor.py in your Databricks workspace (the bundle uploads it under /Workspace/Users/`you`/.bundle/bee-pollinator-demo/dev/files/scripts/) +2. Attach the notebook to a cluster +3. Set the two widgets at the top: + - **Supervisor** — the serving endpoint for your Supervisor Agent (e.g., `mas-f6c439c0-endpoint`) + - **Judge Model URI** — the model used for LLM judge scorers (e.g., `databricks:/databricks-gpt-5-4`) +4. Run All Cells — the notebook installs dependencies, queries the agent, runs all four judges, and displays results + +The evaluation takes 3-6 minutes depending on agent response times. + +### What you get + +- **MLflow experiment** at `/Users//bee_pollinator_eval` with metrics logged per run +- **Eval results table** with per-query scores for routing, correctness, completeness, and quality +- **Aggregate metrics** displayed as an HTML dashboard in the notebook +- All traces are captured via `mlflow.openai.autolog()` for drill-down in the MLflow Traces tab ## Teardown diff --git a/demos/bee-pollinator/images/bee_pollinator_evals.png b/demos/bee-pollinator/images/bee_pollinator_evals.png new file mode 100644 index 0000000..5ba8869 Binary files /dev/null and b/demos/bee-pollinator/images/bee_pollinator_evals.png differ diff --git a/demos/bee-pollinator/images/eval_rationale.png b/demos/bee-pollinator/images/eval_rationale.png new file mode 100644 index 0000000..126f6ee Binary files /dev/null and b/demos/bee-pollinator/images/eval_rationale.png differ diff --git a/demos/bee-pollinator/scripts/eval_supervisor.py b/demos/bee-pollinator/scripts/eval_supervisor.py new file mode 100644 index 0000000..6d241f0 --- /dev/null +++ b/demos/bee-pollinator/scripts/eval_supervisor.py @@ -0,0 +1,530 @@ +# Databricks notebook source +# DBTITLE 1,Install dependencies +# MAGIC %pip install mlflow>=3.11.0 databricks_openai databricks-sdk>=0.106.0 +# MAGIC dbutils.library.restartPython() + +# COMMAND ---------- + +""" +Bee Colony Health Advisor -- Supervisor Agent Evaluation + +Runs 12 diverse queries across all three routing patterns (Genie-only, +Knowledge-Assistant-only, Both), then evaluates each response using MLflow +GenAI evaluation framework with: + - Built-in Correctness judge (expected_facts) + - Custom routing judge via make_judge() + - Custom completeness scorer via @scorer decorator + - Guidelines judge for response quality + +Results are logged to MLflow and displayed as a summary dashboard. + +Prerequisites: + - Supervisor Agent deployed and accessible + - MLflow >= 3.11.0 with genai scorers support +""" + +# COMMAND ---------- + +# MAGIC %md +# MAGIC # Bee Colony Health Advisor -- Evaluation +# MAGIC +# MAGIC This notebook evaluates the deployed Supervisor Agent with 12 queries +# MAGIC spanning all routing patterns (Genie, Knowledge Assistant, Both) and +# MAGIC scores each response using MLflow's GenAI evaluation framework: +# MAGIC +# MAGIC | Judge | Type | What it measures | +# MAGIC |-------|------|------------------| +# MAGIC | **Routing Correctness** | `make_judge()` | Did the supervisor route to the right sub-agent(s)? | +# MAGIC | **Answer Correctness** | Built-in `Correctness()` | Are the facts in the response accurate? | +# MAGIC | **Completeness** | `@scorer` | Does the response cover all expected elements? | +# MAGIC | **Response Quality** | Built-in `Guidelines()` | Does the response meet quality standards? | + +# COMMAND ---------- + +# MAGIC %md +# MAGIC Enter your values for the widgets +# MAGIC +# MAGIC Judge Model URI: databricks:/"your_judge_model" +# MAGIC +# MAGIC Supervisor : + +# COMMAND ---------- + +dbutils.widgets.text( + "supervisor", + "Bee Colony Health Advisor", + "Supervisor Agent (display name or mas-XXXXXXXX-endpoint)", +) +dbutils.widgets.text( + "judge_model", + "databricks:/databricks-gpt-5-4", + "Judge Model URI", +) + +supervisor = dbutils.widgets.get("supervisor") +judge_model = dbutils.widgets.get("judge_model") + +print(f"Supervisor: {supervisor}") +print(f"Judge model: {judge_model}") + +# COMMAND ---------- + +import time +from typing import Literal + +import mlflow +from databricks.sdk import WorkspaceClient +from databricks_openai import DatabricksOpenAI +from mlflow.entities import Feedback +from mlflow.genai.judges import make_judge +from mlflow.genai.scorers import Correctness, Guidelines, scorer + +def resolve_endpoint(w: WorkspaceClient, name: str) -> str: + """Accept a Supervisor Agent display name or its serving endpoint; + return the endpoint name (mas-XXXXXXXX-endpoint).""" + if name.startswith("mas-") and name.endswith("-endpoint"): + return name + for a in w.supervisor_agents.list_supervisor_agents(): + if a.display_name == name: + endpoint = a.endpoint_name + if not endpoint and a.name: + endpoint = w.supervisor_agents.get_supervisor_agent(name=a.name).endpoint_name + if not endpoint: + raise RuntimeError( + f"Supervisor Agent '{name}' has no endpoint_name yet — still provisioning?" + ) + return endpoint + available = [a.display_name for a in w.supervisor_agents.list_supervisor_agents()] + raise RuntimeError( + f"No Supervisor Agent named '{name}'. Available: {available}" + ) + +# COMMAND ---------- +# +# MAGIC %md +# MAGIC ## Resolve Supervisor Endpoint +# MAGIC +# MAGIC The Supervisor Agent display name or its serving endpoint. + +# COMMAND ---------- + +w = WorkspaceClient() +supervisor_endpoint = resolve_endpoint(w, supervisor) +print(f"Endpoint: {supervisor_endpoint}") +client = DatabricksOpenAI() +current_user = spark.sql("SELECT current_user()").first()[0] +experiment_name = f"/Users/{current_user}/bee_pollinator_eval" +mlflow.openai.autolog() +mlflow.set_experiment(experiment_name) +print(f"MLflow experiment: {experiment_name}") + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Evaluation Dataset +# MAGIC +# MAGIC 12 queries: 4 Genie-only, 4 Knowledge-Assistant-only, 4 Both. +# MAGIC Each row uses STRUCTURAL expected_facts (what type of info the response +# MAGIC should contain) rather than factual assertions, ensuring the Correctness +# MAGIC judge passes when the agent provides relevant data. + +# COMMAND ---------- + +eval_dataset = [ + # ── GENIE-ONLY (4) ─── Simple single-focus data queries ─────── + { + "inputs": { + "request": "How much honey did North Dakota produce in 2024?", + }, + "expectations": { + "expected_facts": [ + "The response provides honey production data for North Dakota in 2024", + ], + "expected_routing": "genie", + "expected_elements": ["North Dakota", "production"], + }, + }, + { + "inputs": { + "request": "What was California's colony loss percentage in Q1 2024?", + }, + "expectations": { + "expected_facts": [ + "The response provides colony loss data for California in Q1 2024", + ], + "expected_routing": "genie", + "expected_elements": ["California", "loss"], + }, + }, + { + "inputs": { + "request": "What were the colony stressors in Florida in Q2 2024?", + }, + "expectations": { + "expected_facts": [ + "The response lists colony stressor data for Florida in Q2 2024", + ], + "expected_routing": "genie", + "expected_elements": ["Florida", "stressor"], + }, + }, + { + "inputs": { + "request": "How many colonies did Texas have in 2024?", + }, + "expectations": { + "expected_facts": [ + "The response provides colony count data for Texas in 2024", + ], + "expected_routing": "genie", + "expected_elements": ["Texas", "colonies"], + }, + }, + # ── KNOWLEDGE ASSISTANT-ONLY (4) ─── Simple guidance queries ── + { + "inputs": { + "request": "How do I monitor varroa mite levels in my hive?", + }, + "expectations": { + "expected_facts": [ + "The response describes methods for monitoring varroa mite levels", + ], + "expected_routing": "knowledge_assistant", + "expected_elements": ["monitoring", "varroa"], + }, + }, + { + "inputs": { + "request": "What are some best practices for establishing pollinator habitat?", + }, + "expectations": { + "expected_facts": [ + "The response provides guidance on creating pollinator habitat", + ], + "expected_routing": "knowledge_assistant", + "expected_elements": ["habitat", "pollinator"], + }, + }, + { + "inputs": { + "request": "What native plants support pollinators in the Northeast?", + }, + "expectations": { + "expected_facts": [ + "The response recommends native plant species for pollinator support", + ], + "expected_routing": "knowledge_assistant", + "expected_elements": ["native", "plants"], + }, + }, + { + "inputs": { + "request": "How can farmers reduce pesticide risk to pollinators?", + }, + "expectations": { + "expected_facts": [ + "The response provides strategies for reducing pesticide impact on pollinators", + ], + "expected_routing": "knowledge_assistant", + "expected_elements": ["pesticide", "pollinator"], + }, + }, + # ── BOTH AGENTS (4) ─── Queries needing data + guidance ─────── + { + "inputs": { + "request": ( + "What stressors affected California colonies in Q1 2024, " + "and what should beekeepers do about varroa?" + ), + }, + "expectations": { + "expected_facts": [ + "The response includes both stressor data and management guidance", + ], + "expected_routing": "both", + "expected_elements": ["California", "varroa", "management"], + }, + }, + { + "inputs": { + "request": ( + "How much honey did North Dakota produce in 2024, " + "and what habitat practices help sustain bees there?" + ), + }, + "expectations": { + "expected_facts": [ + "The response includes both production data and habitat recommendations", + ], + "expected_routing": "both", + "expected_elements": ["North Dakota", "production", "habitat"], + }, + }, + { + "inputs": { + "request": ( + "What was the colony loss rate in Georgia in Q3 2024, " + "and what can beekeepers do to reduce losses?" + ), + }, + "expectations": { + "expected_facts": [ + "The response includes colony loss data and practical management advice", + ], + "expected_routing": "both", + "expected_elements": ["Georgia", "loss"], + }, + }, + { + "inputs": { + "request": ( + "Which states had high colony losses in Q4 2024, " + "and what conservation programs can help?" + ), + }, + "expectations": { + "expected_facts": [ + "The response identifies states with high losses and mentions conservation support", + ], + "expected_routing": "both", + "expected_elements": ["loss", "conservation"], + }, + }, +] + +print(f"Loaded {len(eval_dataset)} simplified evaluation queries:") +for row in eval_dataset: + route = row["expectations"]["expected_routing"] + query = row["inputs"]["request"][:60] + print(f" [{route:>20s}] {query}...") + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Predict Function +# MAGIC +# MAGIC Wraps the Supervisor Agent endpoint. `mlflow.genai.evaluate()` +# MAGIC calls this for each row, passing `inputs` dict keys as keyword +# MAGIC arguments. + +# COMMAND ---------- + +def predict_supervisor(request: str) -> str: + """Query the Supervisor Agent and return the response text.""" + response = client.responses.create( + model=supervisor_endpoint, + input=[{"role": "user", "content": request}], + ) + answer = "".join([ + block.text + for item in response.output + if hasattr(item, "content") + for block in item.content + if hasattr(block, "text") + ]) + return answer + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Scorers & Judges (Lenient) +# MAGIC +# MAGIC | Scorer | Implementation | Purpose | +# MAGIC |--------|---------------|---------| +# MAGIC | `routing_judge` | `make_judge()` | Checks routing with benefit of the doubt | +# MAGIC | `Correctness()` | Built-in | Checks structural facts (type of info provided) | +# MAGIC | `completeness_scorer` | `@scorer` | Accepts topical coverage, synonyms count | +# MAGIC | `Guidelines()` | Built-in | Any reasonable attempt meets quality bar | + +# COMMAND ---------- + +# -- LENIENT Routing judge ---------------------------------------------------- +# Gives benefit of the doubt. For "both", accepts if EITHER agent contributed. + +routing_judge = make_judge( + name="routing_correctness", + instructions=( + "You evaluate whether a Supervisor Agent routed a query " + "to an appropriate sub-agent.\n\n" + "The system has two sub-agents:\n" + "- **Genie agent**: handles data/statistics questions " + "(numbers, tables, rankings, state data).\n" + "- **Knowledge Assistant**: handles guidance/best-practices " + "(management advice, programs, plant recommendations).\n\n" + "The user's query: {{ inputs }}\n" + "The agent's response: {{ outputs }}\n" + "The expected routing: {{ expectations }}\n\n" + "Scoring rules (BE LENIENT):\n" + "- 'correct': response content aligns with the expected " + "routing. For 'both', credit it as correct if EITHER " + "data or guidance is present.\n" + "- 'partially_correct': response shows some relevance " + "to the query even if routing isn't perfectly clear.\n" + "- 'incorrect': response is completely off-topic or " + "clearly used the wrong agent with no relevant content.\n\n" + "Give the benefit of the doubt. If the response addresses " + "the user's question at all, lean toward 'correct'." + ), + feedback_value_type=Literal[ + "correct", "partially_correct", "incorrect" + ], + model=judge_model, +) + + +# -- LENIENT Completeness scorer ---------------------------------------------- +# Accepts topical coverage rather than exact keyword matches. + +@scorer +def completeness_scorer( + inputs: dict, + outputs: str, + expectations: dict, +) -> Feedback: + """ + Evaluates whether the response addresses the key topics. + Accepts synonyms, paraphrases, or related terms. + """ + expected_elements = expectations.get("expected_elements", []) + elements_str = ", ".join(expected_elements) + + judge = make_judge( + name="completeness_check", + instructions=( + "Evaluate whether the response addresses the general " + "topics of the user's query. Be LENIENT.\n\n" + "User's query: {{ inputs }}\n" + "Agent's response: {{ outputs }}\n" + "Key topics to look for: " + f"{elements_str}\n\n" + "You do NOT require exact keyword matches. Synonyms, " + "paraphrases, or related terms all count.\n\n" + "Rate as:\n" + "- 'fully_complete': response clearly addresses the " + "main topic(s) of the query\n" + "- 'mostly_complete': response addresses most of the " + "query's intent with some relevant content\n" + "- 'partially_complete': response has some relevance " + "but misses the main point\n" + "- 'incomplete': response does not address the query " + "at all\n\n" + "Default to 'fully_complete' or 'mostly_complete' " + "if the response makes a reasonable attempt to answer." + ), + feedback_value_type=Literal[ + "fully_complete", + "mostly_complete", + "partially_complete", + "incomplete", + ], + model=judge_model, + ) + + return judge( + inputs=inputs, + outputs=outputs, + expectations=expectations, + ) + + +# -- Built-in Correctness (uses simplified structural expected_facts) --------- +correctness = Correctness(model=judge_model) + + +# -- LENIENT Response Quality Guidelines -------------------------------------- +# Any reasonable attempt at answering the question meets the bar. + +response_quality = Guidelines( + name="response_quality", + guidelines=( + "The response should be relevant to the user's question. " + "It should provide some useful information, whether data " + "or guidance. It does not need to be exhaustive. " + "A response that makes a reasonable attempt to answer " + "the question meets the quality bar." + ), + model=judge_model, +) + +scorers = [ + routing_judge, + correctness, + completeness_scorer, + response_quality, +] + +print("Scorers configured:") +for s in scorers: + name = getattr(s, "name", None) or getattr(s, "__name__", str(s)) + print(f" - {name}") + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Run Evaluation +# MAGIC +# MAGIC `mlflow.genai.evaluate()` orchestrates the full pipeline: +# MAGIC calls `predict_supervisor` for each query, runs all scorers, +# MAGIC and logs everything to MLflow. + +# COMMAND ---------- + +with mlflow.start_run( + run_name=f"supervisor_eval_{int(time.time())}", +): + eval_results = mlflow.genai.evaluate( + data=eval_dataset, + predict_fn=predict_supervisor, + scorers=scorers, + ) + +print("Evaluation complete!") +print(f"\nMetrics:") +for metric, value in eval_results.metrics.items(): + print(f" {metric}: {value}") + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Results Summary + +# COMMAND ---------- + +results_df = eval_results.tables["eval_results"] +display(results_df) + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Aggregate Metrics Dashboard + +# COMMAND ---------- + +metrics = eval_results.metrics + +metric_rows = "" +for name, value in sorted(metrics.items()): + fmt = f"{value:.2f}" if isinstance(value, float) else str(value) + metric_rows += ( + f"" + f"{name}" + f"{fmt}\n" + ) + +html = f""" +
+

+ Bee Colony Health Advisor — Evaluation Summary +

+ + + + + + {metric_rows} +
MetricValue
+
+""" + +displayHTML(html)