Added new demo functionality to evaluate Supervisor Agent #15

Open
dmatrix wants to merge 1 commit into databricks-solutions:main from dmatrix:br_jsd_add_supervisor_evals

Contributor

dmatrix commented May 6, 2026

  • Added new demo functionality to evaluate the Supervisor Agent
  • Updated the DAB to deploy the new Databricks notebook as part of the bundle
  • Updated the README to describe how to show this demo

Signed-off-by: Jules Damji <dmatrix@comcast.net>
Collaborator

djliden commented May 7, 2026

A few minor issues:

1. Endpoint default is hardcoded to your deploy. dbutils.widgets.text("supervisor_name", "mas-f6c439c0-endpoint", ...) won't match anyone else's workspace. Since display_name is unique per workspace (per the SDK), accept the display name and resolve at startup--something like this:

def resolve_endpoint(w, name: str) -> str:
    if name.startswith("mas-") and name.endswith("-endpoint"):
        return name
    for a in w.supervisor_agents.list_supervisor_agents():
        if a.display_name == name:
            return a.endpoint_name or w.supervisor_agents.get_supervisor_agent(name=a.name).endpoint_name
    raise RuntimeError(f"No Supervisor Agent named '{name}'")

Default the widget to "Bee Colony Health Advisor" (the canonical name setup_agents.py creates). Also rename the var — it currently holds an endpoint but is called supervisor_name.

2. PR description claims a databricks.yml change to wire the notebook into the bundle, but the diff doesn't include one. Add the notebook to the DAB?

3. Nits: eval_rational.png → eval_rationale.png; "Change to your deployed bundle directory scripts/eval_supervisor.py into your Databricks workspace" reads like words got dropped; mlflow>=3.11.0 rather than mlflow==3.11.0, maybe?


@djliden djliden left a comment

A few minor suggestions—thanks for adding!

Collaborator

djliden commented May 7, 2026

I would also suggest getting rid of the "make a judge with genie code" part from the readme and replacing it entirely with this—but we can ask genie code to look at the traces.

Contributor Author

dmatrix commented May 7, 2026

> Default the widget to "Bee Colony Health Advisor" (the canonical name setup_agents.py creates). Also rename the var; it currently holds an endpoint but is called supervisor_name.

supervisor_name --> supervisor_name_endpoint. The widget text should hold a string such as "your_supervisor_endpoint_name".

> PR description claims a databricks.yml change to wire the notebook into the bundle, but the diff doesn't include one. Add notebook to DAB?

That comment was a misnomer, since anything under scripts gets deployed anyway, so there is no need to add this explicitly.

> I would also suggest getting rid of the "make a judge with genie code" part from the readme and replacing it entirely with this, but we can ask genie code to look at the traces.

Removed.

> Nits: eval_rational.png → eval_rationale.png; "Change to your deployed bundle directory scripts/eval_supervisor.py into your Databricks workspace"

Renamed the image.

> reads like words got dropped; >=3.11.0 rather than mlflow==3.11.0 maybe?

Not sure where the words got dropped. The prerequisites now list mlflow==3.11.0.

def resolve_endpoint(w, name: str) -> str:
    if name.startswith("mas-") and name.endswith("-endpoint"):
        return name
    for a in w.supervisor_agents.list_supervisor_agents():
        if a.display_name == name:
            return a.endpoint_name or w.supervisor_agents.get_supervisor_agent(name=a.name).endpoint_name
    raise RuntimeError(f"No Supervisor Agent named '{name}'")

What is the type of the w argument here? And where should this be placed in the notebook? It makes sense to put it after fetching the variables from the widgets.

Collaborator

djliden commented May 7, 2026

Ahh, I misunderstood, thought you wanted the eval notebook to run as part of the DAB (not just be deployed)

> removed
> renamed the image

I don't see any additional commits on this PR

I added these notes as suggestions for clarity—I would also suggest making the eval part runnable via the DAB, maybe with something like:

...
  jobs:
    setup_demo:
      # ... existing job unchanged ...

    eval_demo:
      name: "[${bundle.target}] bee-pollinator-eval"
      tasks:
        - task_key: evaluate_supervisor
          notebook_task:
            notebook_path: ./scripts/eval_supervisor.py
            source: WORKSPACE
            base_parameters:
              supervisor: "Bee Colony Health Advisor"
              judge_model: "databricks:/databricks-claude-opus-4-7"   # or whatever default
          environment_key: default
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - databricks-sdk>=0.106.0
              - databricks-openai>=0.5.0
              - mlflow>=3.11.0
...


@djliden djliden left a comment

(added inline suggestions to clarify a few of the earlier suggestions)

Comment on lines +53 to +68
dbutils.widgets.text(
    "supervisor_name",
    "mas-f6c439c0-endpoint",
    "Supervisor Agent Endpoint Name",
)
dbutils.widgets.text(
    "judge_model",
    "databricks:/databricks-gpt-5-4",
    "Judge Model URI",
)

supervisor_name = dbutils.widgets.get("supervisor_name")
judge_model = dbutils.widgets.get("judge_model")

print(f"Supervisor: {supervisor_name}")
print(f"Judge model: {judge_model}")

Collaborator

Suggested change
- dbutils.widgets.text(
-     "supervisor_name",
-     "mas-f6c439c0-endpoint",
-     "Supervisor Agent Endpoint Name",
- )
- dbutils.widgets.text(
-     "judge_model",
-     "databricks:/databricks-gpt-5-4",
-     "Judge Model URI",
- )
- supervisor_name = dbutils.widgets.get("supervisor_name")
- judge_model = dbutils.widgets.get("judge_model")
- print(f"Supervisor: {supervisor_name}")
- print(f"Judge model: {judge_model}")
+ dbutils.widgets.text(
+     "supervisor",
+     "Bee Colony Health Advisor",
+     "Supervisor Agent (display name or mas-XXXXXXXX-endpoint)",
+ )
+ dbutils.widgets.text(
+     "judge_model",
+     "databricks:/databricks-gpt-5-4",
+     "Judge Model URI",
+ )
+ supervisor = dbutils.widgets.get("supervisor")
+ judge_model = dbutils.widgets.get("judge_model")
+ print(f"Supervisor: {supervisor}")
+ print(f"Judge model: {judge_model}")

Uses the "canonical" display name (Bee Colony Health Advisor) by default, which we resolve to the endpoint name, so users rarely need to look up the endpoint manually.

Comment on lines +72 to +91
import time
from typing import Literal

import mlflow
from databricks_openai import DatabricksOpenAI
from mlflow.entities import Feedback
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import Correctness, Guidelines, scorer

client = DatabricksOpenAI()

current_user = (
    spark.sql("SELECT current_user()").first()[0]
)
experiment_name = (
    f"/Users/{current_user}/bee_pollinator_eval"
)
mlflow.openai.autolog()
mlflow.set_experiment(experiment_name)
print(f"MLflow experiment: {experiment_name}")

Collaborator

Suggested change
- import time
- from typing import Literal
- import mlflow
- from databricks_openai import DatabricksOpenAI
- from mlflow.entities import Feedback
- from mlflow.genai.judges import make_judge
- from mlflow.genai.scorers import Correctness, Guidelines, scorer
- client = DatabricksOpenAI()
- current_user = (
-     spark.sql("SELECT current_user()").first()[0]
- )
- experiment_name = (
-     f"/Users/{current_user}/bee_pollinator_eval"
- )
- mlflow.openai.autolog()
- mlflow.set_experiment(experiment_name)
- print(f"MLflow experiment: {experiment_name}")
+ import time
+ from typing import Literal
+ import mlflow
+ from databricks.sdk import WorkspaceClient
+ from databricks_openai import DatabricksOpenAI
+ from mlflow.entities import Feedback
+ from mlflow.genai.judges import make_judge
+ from mlflow.genai.scorers import Correctness, Guidelines, scorer
+ def resolve_endpoint(w: WorkspaceClient, name: str) -> str:
+     """Accept a Supervisor Agent display name or its serving endpoint;
+     return the endpoint name (mas-XXXXXXXX-endpoint)."""
+     if name.startswith("mas-") and name.endswith("-endpoint"):
+         return name
+     for a in w.supervisor_agents.list_supervisor_agents():
+         if a.display_name == name:
+             endpoint = a.endpoint_name
+             if not endpoint and a.name:
+                 endpoint = w.supervisor_agents.get_supervisor_agent(name=a.name).endpoint_name
+             if not endpoint:
+                 raise RuntimeError(
+                     f"Supervisor Agent '{name}' has no endpoint_name yet (still provisioning?)"
+                 )
+             return endpoint
+     available = [a.display_name for a in w.supervisor_agents.list_supervisor_agents()]
+     raise RuntimeError(
+         f"No Supervisor Agent named '{name}'. Available: {available}"
+     )
+ w = WorkspaceClient()
+ supervisor_endpoint = resolve_endpoint(w, supervisor)
+ print(f"Endpoint: {supervisor_endpoint}")
+ client = DatabricksOpenAI()
+ current_user = spark.sql("SELECT current_user()").first()[0]
+ experiment_name = f"/Users/{current_user}/bee_pollinator_eval"
+ mlflow.openai.autolog()
+ mlflow.set_experiment(experiment_name)
+ print(f"MLflow experiment: {experiment_name}")

Resolves supervisor name into endpoint
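
One way to sanity-check the resolver without a live workspace is to hand it a stand-in object. The sketch below repeats the suggested logic in simplified form (list call only, no `get_supervisor_agent` fallback) and runs it against a hypothetical `FakeWorkspace`. The fake class, the sample agent, and the endpoint values are assumptions made purely for illustration; the real `supervisor_agents` API may return richer objects.

```python
from types import SimpleNamespace


def resolve_endpoint(w, name: str) -> str:
    """Return the serving endpoint for a Supervisor Agent display name.

    `w` only needs `supervisor_agents.list_supervisor_agents()`, so either a
    real WorkspaceClient or a test double can be passed in.
    """
    # Already an endpoint name (mas-XXXXXXXX-endpoint): pass it through.
    if name.startswith("mas-") and name.endswith("-endpoint"):
        return name
    agents = list(w.supervisor_agents.list_supervisor_agents())
    for agent in agents:
        if agent.display_name == name:
            if not agent.endpoint_name:
                raise RuntimeError(
                    f"Supervisor Agent '{name}' has no endpoint_name yet"
                )
            return agent.endpoint_name
    available = [a.display_name for a in agents]
    raise RuntimeError(
        f"No Supervisor Agent named '{name}'. Available: {available}"
    )


class FakeWorkspace:
    """Hypothetical stand-in for WorkspaceClient, for offline checks only."""

    def __init__(self, agents):
        self.supervisor_agents = SimpleNamespace(
            list_supervisor_agents=lambda: iter(agents)
        )


# Made-up agent record; the endpoint value is a placeholder.
fake = FakeWorkspace([
    SimpleNamespace(
        display_name="Bee Colony Health Advisor",
        endpoint_name="mas-00000000-endpoint",
    ),
])

resolved = resolve_endpoint(fake, "Bee Colony Health Advisor")
passthrough = resolve_endpoint(fake, "mas-12345678-endpoint")
print(resolved, passthrough)
```

Because the function only touches `w.supervisor_agents`, the same call works unchanged once a real `WorkspaceClient()` is available.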

Contributor Author

WorkspaceClient has no attribute called supervisor_agents, so this won't work. Am debugging it...

Collaborator

@djliden djliden May 7, 2026

Contributor Author

Collaborator

Yes, as is the Knowledge Assistant API (see https://docs.databricks.com/api/workspace/supervisoragents and https://docs.databricks.com/api/workspace/knowledgeassistants). Both work well with the current version of the SDK. We just have to keep an eye out for any breaking changes in the future.

Contributor Author

Which databricks-sdk version are you using that has the latest API?

Contributor Author

Let me try installing databricks-sdk>=0.106.0.
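
A quick way to confirm which SDK version a notebook environment actually has is to read the installed package metadata. The following is a minimal sketch using only the standard library. The `0.106.0` floor is taken from this thread and is not verified here as the first release exposing `supervisor_agents`. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison is a simple numeric one, not a full PEP 440 parser.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.106.0' into (0, 106, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions numerically (assumes equal segment counts)."""
    return version_tuple(installed) >= version_tuple(minimum)


MIN_SDK = "0.106.0"  # assumption: floor suggested in the thread, unverified

try:
    installed = metadata.version("databricks-sdk")
    ok = meets_minimum(installed, MIN_SDK)
    print(f"databricks-sdk {installed}; meets {MIN_SDK}: {ok}")
except metadata.PackageNotFoundError:
    print("databricks-sdk is not installed in this environment")
```

If the check reports an older version, upgrading the package and restarting the Python process is the usual next step before retrying the resolver.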

Comment on lines +284 to +297
def predict_supervisor(request: str) -> str:
    """Query the Supervisor Agent and return the response text."""
    response = client.responses.create(
        model=supervisor_name,
        input=[{"role": "user", "content": request}],
    )
    answer = "".join([
        block.text
        for item in response.output
        if hasattr(item, "content")
        for block in item.content
        if hasattr(block, "text")
    ])
    return answer

Collaborator

Suggested change
- def predict_supervisor(request: str) -> str:
-     """Query the Supervisor Agent and return the response text."""
-     response = client.responses.create(
-         model=supervisor_name,
-         input=[{"role": "user", "content": request}],
-     )
-     answer = "".join([
-         block.text
-         for item in response.output
-         if hasattr(item, "content")
-         for block in item.content
-         if hasattr(block, "text")
-     ])
-     return answer
+ def predict_supervisor(request: str) -> str:
+     """Query the Supervisor Agent and return the response text."""
+     response = client.responses.create(
+         model=supervisor_endpoint,
+         input=[{"role": "user", "content": request}],
+     )
+     answer = "".join([
+         block.text
+         for item in response.output
+         if hasattr(item, "content")
+         for block in item.content
+         if hasattr(block, "text")
+     ])
+     return answer

use the resolved endpoint
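
The nested comprehension in `predict_supervisor` can be checked without calling an endpoint. The sketch below pulls that logic into a hypothetical helper, `extract_text`, and feeds it a made-up response built from `SimpleNamespace` objects. The shape of the fake response (one item without `content`, one item mixing text and non-text blocks) is an assumption for illustration and may not match everything a real Supervisor Agent returns.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Concatenate every text block found in response.output."""
    return "".join(
        block.text
        for item in response.output
        if hasattr(item, "content")   # skip items with no content attribute
        for block in item.content
        if hasattr(block, "text")     # skip blocks that carry no text
    )


# Hypothetical response: a tool-call style item with no content, followed by
# a message item whose content mixes text blocks with a non-text block.
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="function_call"),
        SimpleNamespace(
            content=[
                SimpleNamespace(text="Hive 12 shows "),
                SimpleNamespace(refusal="n/a"),
                SimpleNamespace(text="elevated mite counts."),
            ]
        ),
    ]
)

answer = extract_text(fake_response)
print(answer)
```

One caveat of the `hasattr` filter: an item whose `content` attribute exists but is `None` would still raise when iterated, so a stricter guard may be worth adding if such items show up in practice.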


### How to run it

1. Change to your deployed bundle directory `scripts/eval_supervisor.py` into your Databricks workspace

Collaborator

I find this instruction hard to follow. Did you mean something like:

"Open scripts/eval_supervisor.py in your Databricks workspace (the bundle uploads it under /Workspace/Users//.bundle/bee-pollinator-demo/dev/files/scripts/)"

Comment thread demos/bee-pollinator/images/eval_rational.png
@@ -0,0 +1,502 @@
# Databricks notebook source
# DBTITLE 1,Install dependencies
# MAGIC %pip install mlflow==3.11.0 databricks_openai

Collaborator

Suggested change
- # MAGIC %pip install mlflow==3.11.0 databricks_openai
+ # MAGIC %pip install mlflow>=3.11.0 databricks_openai

Contributor Author

I'm not too eager to make the eval notebook part of the DAB run. It should be something the demoer walks through and runs at runtime, speaking to it, and possibly adding real-time monitoring after the fact.

Much better experience for both the demoer and audience.

Contributor Author

@dmatrix dmatrix May 7, 2026

Ah, I have not committed the changes yet. :-) Only in my branch. Want to test it before I push.

Collaborator

@djliden djliden May 7, 2026

> Not too eager to make the eval notebook part of the DAB run. It should be something the demoer walks through and runs at runtime, speaking to it, and possibly adding real-time monitoring after the fact.
>
> Much better experience for both the demoer and audience.

Sounds good—ignore that comment, then. Maybe later we can add a lightweight judge/eval so the experiment is pre-populated with some traces the presenter can talk through.
