Add CV screening example with curated resume test set #4607
Draft: mmabrouk wants to merge 6 commits into main from claude/cv-classifier-demo-oug3jb
- 0ffffa1 Add CV screening example with curated resume test set (claude)
- c28d1a2 Add user feedback annotations to CV screening demo (claude)
- cbb7d72 Split UI from AI logic; auto-instrument OpenAI and reference the prom… (mmabrouk)
- d3dd4d5 Simplify the screening output to three match booleans with reasons (mmabrouk)
- 508d419 Fix make_sample_pdfs for the new test set columns (mmabrouk)
- 9cb02f6 Merge main into claude/cv-classifier-demo-oug3jb to update the branch… (mmabrouk)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| # Agenta credentials (create an API key in the Agenta UI under Settings > API Keys) | ||
| AGENTA_API_KEY=your-agenta-api-key | ||
| # For Agenta Cloud keep the default; for self-hosted point to your instance | ||
| AGENTA_HOST=https://cloud.agenta.ai | ||
|
|
||
| # LLM provider used by the Streamlit demo | ||
| OPENAI_API_KEY=your-openai-api-key |
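
A startup check can catch a template that was copied to `.env` but never filled in. The sketch below is an illustration under stated assumptions, not code from this PR: the helper name `unset_settings` is hypothetical, and so is the rule that a value starting with `your-` is still a placeholder (it simply matches the sample values above).

```python
import os
from typing import List, Mapping

# Variables named in the env template; AGENTA_HOST is omitted because the
# template already ships a usable default for it.
EXPECTED_VARS = ("AGENTA_API_KEY", "OPENAI_API_KEY")

# Assumption: the sample values ("your-agenta-api-key", ...) mark unset keys.
PLACEHOLDER_PREFIX = "your-"


def unset_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the expected variables that are missing, empty, or placeholders."""
    unset = []
    for name in EXPECTED_VARS:
        value = env.get(name, "").strip()
        if not value or value.startswith(PLACEHOLDER_PREFIX):
            unset.append(name)
    return unset
```

Since the demo app treats a missing `AGENTA_API_KEY` as non-fatal (it only disables feedback), the returned list is probably better shown as a warning than used to abort startup.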
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,226 @@ | ||
| # CV Screening with Agenta | ||
|
|
||
| A complete walkthrough for building a CV classifier with Agenta: a prompt | ||
| that evaluates a candidate's CV (as Markdown) against a job specification | ||
| and returns a structured assessment — a technical-skills match, an | ||
| experience match, and an overall hire/no-hire recommendation, each with a | ||
| short reason, plus the list of missing must-have requirements. | ||
|
|
||
| The split between Agenta and the application code follows the pattern we | ||
| recommend for production: | ||
|
|
||
| - **Inside Agenta**: the prompt (job requirements, nice-to-haves, scoring | ||
| instructions), the model configuration, the structured-output JSON schema, | ||
| and the test set of Markdown CVs. This is what you iterate on in the | ||
| playground, evaluate, and deploy. | ||
| - **Outside Agenta**: everything around the prompt — a small Streamlit app | ||
| that accepts a PDF upload, converts it to Markdown, fetches the deployed | ||
| prompt from the Agenta registry, calls the LLM, and renders the result. | ||
|
|
||
| ``` | ||
| PDF upload ──> Markdown (markitdown) ──> prompt fetched from Agenta ──> LLM ──> structured scores | ||
| ``` | ||
|
|
||
| ## What's in this folder | ||
|
|
||
| | File | Purpose | | ||
| | --- | --- | | ||
| | `config.py` | Job spec, prompt messages, structured-output JSON schema, app slugs | | ||
| | `create_app.py` | Creates the `cv-screening` app in Agenta and deploys the prompt to production | | ||
| | `prepare_testset.py` | Builds `data/testset.csv` from a public resume dataset (optionally uploads it to Agenta) | | ||
| | `data/testset.csv` | 30 real Markdown CVs with hand-labeled expected matches (committed, ready to upload) | | ||
| | `screening.py` | The AI logic: fetches the prompt, calls the LLM, traces, sends feedback | | ||
| | `app.py` | Streamlit demo UI: upload a PDF, screen the candidate | | ||
| | `make_sample_pdfs.py` | Renders three test set CVs as PDFs for the demo | | ||
| | `data/sample_cvs/` | Sample CV PDFs (one strong match, one partial match, one rejection) | | ||
|
|
||
| ## The test set | ||
|
|
||
| The test set is built from the | ||
| [`opensporks/resumes`](https://huggingface.co/datasets/opensporks/resumes) | ||
| dataset on Hugging Face — a mirror of the Kaggle | ||
| [Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset) | ||
| (~2,400 real, anonymized resumes from livecareer.com, 24 job categories). | ||
|
|
||
| `prepare_testset.py` takes a curated subset of 30 resumes, converts them from | ||
| HTML to clean Markdown, and labels each one by hand against the IT Manager | ||
| job spec in `config.py`: | ||
|
|
||
| - **6 strong matches** — seasoned IT managers, directors, and a VP of IT | ||
| - **7 partial matches** — IT specialists and supervisors missing | ||
| management scope, plus an engineering manager with weak IT depth | ||
| - **17 rejections** — interns, and candidates from unrelated fields (chef, | ||
| teacher, attorney, finance analyst, ...), including one resume that is | ||
| mislabeled in the source dataset (an "IT Coordinator" that is actually a | ||
| paralegal CV — a nice robustness check for the classifier) | ||
|
|
||
| Each CSV row has: | ||
|
|
||
| | Column | Content | | ||
| | --- | --- | | ||
| | `cv` | The CV as Markdown — maps to the `{{cv}}` input of the prompt | | ||
| | `expected_tech_match` | Hand-assigned ground truth for `tech_match` (`true` / `false`) | | ||
| | `expected_experience_match` | Hand-assigned ground truth for `experience_match` (`true` / `false`) | | ||
| | `expected_overall_match` | Hand-assigned ground truth for `overall_match` (`true` / `false`) | | ||
|
|
||
| An empty expected cell means "no ground truth for this dimension"; the code | ||
| evaluator below skips it. That is how you add a test case that only pins | ||
| down the overall decision (for example, a CV that fails a new requirement) | ||
| without having to label the other dimensions. | ||
|
|
||
| The CVs are Markdown rather than PDFs on purpose: PDF parsing happens | ||
| outside Agenta (in the app), so the test set captures exactly what the | ||
| prompt receives. This keeps evaluations reproducible and independent of the | ||
| PDF-extraction step. | ||
|
|
||
| ## Walkthrough | ||
|
|
||
| ### 0. Setup | ||
|
|
||
| ```bash | ||
| pip install -r requirements.txt | ||
| cp .env.example .env # then fill in your keys | ||
| ``` | ||
|
|
||
| ### 1. Create the prompt in Agenta | ||
|
|
||
| ```bash | ||
| python create_app.py | ||
| ``` | ||
|
|
||
| This creates a completion app called `cv-screening`, commits the screening | ||
| prompt (with the job spec and the JSON schema for structured output), and | ||
| deploys it to the production environment. Open the app in Agenta to see it | ||
| in the playground. | ||
|
|
||
| ### 2. Upload the test set | ||
|
|
||
| The committed `data/testset.csv` can be uploaded directly in the Agenta UI | ||
| (Test sets → Create → Upload CSV), or via the SDK: | ||
|
|
||
| ```bash | ||
| python prepare_testset.py --upload | ||
| ``` | ||
|
|
||
| (Without `--upload` the script just rebuilds the CSV from the source | ||
| dataset.) | ||
|
|
||
| ### 3. Iterate and evaluate in Agenta | ||
|
|
||
| In the playground, load test cases from the test set and experiment with | ||
| the prompt: tighten the requirements, change the model, adjust the | ||
| instructions. To score runs against the hand-labeled | ||
| `expected_*` columns, create a custom code evaluator (Evaluators → | ||
| Create → Code) with: | ||
|
|
||
| ```python | ||
| import json | ||
| from typing import Dict, Any | ||
|
|
||
| FIELDS = ("tech_match", "experience_match", "overall_match") | ||
|
|
||
|
|
||
| def evaluate( | ||
|     inputs: Dict[str, Any], | ||
|     outputs: Any, | ||
|     trace: Dict[str, Any], | ||
| ) -> float: | ||
|     result = json.loads(outputs) if isinstance(outputs, str) else outputs | ||
|
|
||
|     checked = [] | ||
|     for field in FIELDS: | ||
|         expected = str(inputs.get(f"expected_{field}") or "").strip().lower() | ||
|         if expected not in ("true", "false"): | ||
|             continue  # empty cell: no ground truth for this dimension | ||
|         checked.append(str(result.get(field)).lower() == expected) | ||
|
|
||
|     return sum(checked) / len(checked) if checked else 1.0 | ||
| ``` | ||
|
|
||
| It compares each of the three match booleans to its `expected_*` column | ||
| and returns the fraction that agree. Empty expected cells are skipped, so | ||
| a test case can pin down only one dimension. Then run an evaluation with | ||
| the test set and this evaluator. | ||
|
|
||
| ### 4. Run the demo app | ||
|
|
||
| ```bash | ||
| streamlit run app.py | ||
| ``` | ||
|
|
||
| Upload one of the PDFs from `data/sample_cvs/` (or any CV). `app.py` is | ||
| UI only; the AI logic lives in `screening.py`. The flow: | ||
|
|
||
| 1. the app converts the PDF to Markdown with [markitdown](https://github.com/microsoft/markitdown), | ||
| 2. `screening.py` fetches the production prompt from the Agenta registry — | ||
| so whatever you deploy from the playground is what the app uses, with no | ||
| redeploy, | ||
| 3. calls the LLM with the structured-output schema, | ||
| 4. the app renders the three match verdicts with their reasons and the | ||
| missing requirements. | ||
|
|
||
| Every screening shows up as a trace in Agenta's observability view, built | ||
| so you can act on it from the UI: | ||
|
|
||
| - `classify_cv` is instrumented with `@ag.instrument()`, and the OpenAI | ||
| client is auto-instrumented with | ||
| [OpenInference](https://github.com/Arize-ai/openinference), so each trace | ||
| has a child LLM span with the exact messages, token counts, and cost. | ||
| - The span's inputs are the prompt's input variables (`{"cv": ...}`), and | ||
| the prompt configuration is kept out of the trace (`ignore_inputs`). | ||
| - The span is linked to the exact prompt revision it used | ||
| (`ag.tracing.store_refs`), so you can filter traces by app or environment | ||
| and open the span in the playground on the same prompt revision, inputs | ||
| pre-filled. | ||
|
|
||
| ### 5. Collect user feedback on screenings | ||
|
|
||
| After each screening the app shows a feedback form: 👍/👎 plus an optional | ||
| comment. Submitting it attaches the feedback to that screening's trace as an | ||
| [annotation](https://docs.agenta.ai/observability/trace-with-python-sdk/annotate-traces) | ||
| under the `user-feedback` evaluator slug: | ||
|
|
||
| 1. `classify_cv` captures the trace and span IDs while its span is open | ||
| (`ag.tracing.build_invocation_link()`), | ||
| 2. on submit, the app POSTs an annotation to `/api/simple/traces/` with | ||
| `{"score": 1 | 0, "comment": ...}` linked to that invocation. | ||
|
|
||
| The feedback appears on the trace in Agenta's observability view, so you | ||
| can filter for badly rated screenings, inspect the CVs that caused them, | ||
| and turn them into new test cases. To see aggregated stats for the | ||
| `user-feedback` evaluator in the UI, create a matching human evaluator | ||
| (Evaluators → Human evaluators) with the same slug. | ||
|
|
||
| ### 6. Close the loop: from feedback to a deployed fix | ||
|
|
||
| The pieces above compose into the core Agenta workflow. Say the role | ||
| requires fluent German, but the prompt doesn't mention it: | ||
|
|
||
| 1. **Recruiter** screens a CV in the app, sees "Advance to interview" for | ||
| a candidate with no German, and submits a 👎 with the comment | ||
| *"candidate doesn't speak German"*. | ||
| 2. **AI engineer** filters traces by the `user-feedback` annotation, opens | ||
| the badly rated trace, and opens its span in the playground — landing | ||
| on the exact prompt revision with the CV pre-filled. | ||
| 3. In the playground, they add *"Fluent German (the company's working | ||
| language)"* to the must-have requirements and rerun: `overall_match` | ||
| flips to `false` and German shows up in `missing_requirements`, while | ||
| `tech_match` and `experience_match` stay `true`. | ||
| 4. They add the CV to the test set as a new test case with | ||
| `expected_overall_match = false` and the other two expected columns | ||
| left **empty** — the code evaluator only checks the overall decision | ||
| for this case. | ||
| 5. They run an evaluation comparing the deployed revision against the new | ||
| one. The old prompt fails the new test case; the new prompt passes it | ||
| without regressing the other 30. | ||
| 6. They deploy the new revision to production. The Streamlit app picks it | ||
| up on the next screening — no code change, no redeploy. | ||
|
|
||
| ## Adapting it to your role | ||
|
|
||
| Everything role-specific lives in the prompt: edit the job spec directly in | ||
| the Agenta playground (or in `config.py` and re-run `create_app.py`). The | ||
| structured-output schema and the app don't need to change. To build a test | ||
| set for a different role, adjust the curated IDs and labels in | ||
| `prepare_testset.py` — the source dataset has 24 job categories to draw | ||
| from. |
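
To make the evaluator's partial-label behaviour concrete, here is a small worked example. It repeats the `evaluate` function from the README (with `trace` made optional so it can be called locally, which is an adjustment for this sketch) and scores the German-requirement case from step 6. The row and output dictionaries are made-up illustrations, not rows from `data/testset.csv`.

```python
import json
from typing import Any, Dict, Optional

FIELDS = ("tech_match", "experience_match", "overall_match")


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Optional[Dict[str, Any]] = None,
) -> float:
    """Fraction of labeled match fields on which the output agrees."""
    result = json.loads(outputs) if isinstance(outputs, str) else outputs
    checked = []
    for field in FIELDS:
        expected = str(inputs.get(f"expected_{field}") or "").strip().lower()
        if expected not in ("true", "false"):
            continue  # empty cell: no ground truth for this dimension
        checked.append(str(result.get(field)).lower() == expected)
    return sum(checked) / len(checked) if checked else 1.0


# Test case that only pins down the overall decision (other cells left empty).
german_row = {
    "expected_tech_match": "",
    "expected_experience_match": "",
    "expected_overall_match": "false",
}
# A fully labeled row, for comparison.
full_row = {
    "expected_tech_match": "true",
    "expected_experience_match": "true",
    "expected_overall_match": "true",
}

old_output = json.dumps(
    {"tech_match": True, "experience_match": True, "overall_match": True}
)
new_output = {"tech_match": True, "experience_match": True, "overall_match": False}

score_old = evaluate(german_row, old_output)    # only overall_match is checked
score_new = evaluate(german_row, new_output)
score_partial = evaluate(full_row, new_output)  # two of three fields agree
print(score_old, score_new, round(score_partial, 3))
```

On the partially labeled row the old output scores 0.0 and the new one 1.0, because only `overall_match` is compared. A row with no labels at all scores 1.0, so an unlabeled test case raises the average rather than lowering it.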
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,142 @@ | ||
| """Streamlit demo: upload a CV as PDF and screen it against the job spec. | ||
|
|
||
| This file is UI only. All the AI logic (prompt fetching, the LLM call, | ||
| tracing, and feedback) lives in `screening.py`, which any other frontend | ||
| could reuse. The flow mirrors a production setup: | ||
|
|
||
| 1. The PDF is converted to Markdown locally (markitdown). | ||
| 2. The screening prompt is fetched from the Agenta registry — the same | ||
| prompt you iterate on in the playground and evaluate against the | ||
| test set. | ||
| 3. The prompt is formatted with the CV and sent to the LLM with a JSON | ||
| schema response format. | ||
| 4. The structured result (tech / experience / overall match, each with | ||
| a reason) is rendered as a small dashboard. | ||
| 5. The user can rate the screening (thumbs up/down plus an optional | ||
| comment); the feedback is attached to the trace in Agenta as an | ||
| annotation. | ||
|
|
||
| Run with: | ||
|     streamlit run app.py | ||
| """ | ||
|
|
||
| import io | ||
| import os | ||
|
|
||
| import streamlit as st | ||
| from dotenv import load_dotenv | ||
| from markitdown import MarkItDown | ||
|
|
||
| import screening | ||
|
|
||
| load_dotenv() | ||
|
|
||
| MATCH_LABELS = { | ||
|     "tech": "Technical skills", | ||
|     "experience": "Experience", | ||
| } | ||
|
|
||
|
|
||
| @st.cache_resource | ||
| def init_screening() -> None: | ||
|     screening.init() | ||
|
|
||
|
|
||
| @st.cache_data(ttl=60) | ||
| def fetch_config() -> screening.ScreeningConfig: | ||
|     return screening.fetch_config() | ||
|
|
||
|
|
||
| @st.cache_data(show_spinner="Converting PDF to Markdown ...") | ||
| def pdf_to_markdown(file_bytes: bytes) -> str: | ||
|     result = MarkItDown().convert_stream(io.BytesIO(file_bytes), file_extension=".pdf") | ||
|     return result.text_content.strip() | ||
|
|
||
|
|
||
| def render_result(result: dict) -> None: | ||
|     banner = st.success if result["overall_match"] else st.error | ||
|     verdict = "Advance to interview" if result["overall_match"] else "Do not advance" | ||
|     banner(f"**{verdict}** — {result['overall_reason']}") | ||
|
|
||
|     columns = st.columns(len(MATCH_LABELS)) | ||
|     for column, (key, label) in zip(columns, MATCH_LABELS.items()): | ||
|         with column: | ||
|             icon = "✅" if result[f"{key}_match"] else "❌" | ||
|             st.markdown(f"#### {icon} {label}") | ||
|             st.markdown(result[f"{key}_reason"]) | ||
|
|
||
|     if result["missing_requirements"]: | ||
|         st.subheader("Missing requirements") | ||
|         for item in result["missing_requirements"]: | ||
|             st.markdown(f"- ❌ {item}") | ||
|
|
||
|
|
||
| def render_feedback(result: dict) -> None: | ||
|     invocation = result.get("_invocation") | ||
|     if invocation is None or not os.environ.get("AGENTA_API_KEY"): | ||
|         st.caption( | ||
|             "Feedback is disabled: set AGENTA_API_KEY so screenings are " | ||
|             "traced and can be annotated." | ||
|         ) | ||
|         return | ||
|
|
||
|     st.divider() | ||
|     if st.session_state.get("feedback_sent") == invocation: | ||
|         st.success("Thanks! Your feedback was attached to the trace in Agenta.") | ||
|         return | ||
|
|
||
|     with st.form("feedback"): | ||
|         st.markdown("**Was this screening accurate?**") | ||
|         rating = st.feedback("thumbs") | ||
|         comment = st.text_input("Comment (optional)") | ||
|         submitted = st.form_submit_button("Send feedback") | ||
|
|
||
|     if submitted: | ||
|         if rating is None: | ||
|             st.warning("Pick 👍 or 👎 first.") | ||
|         elif screening.send_feedback( | ||
|             invocation, thumbs_up=rating == 1, comment=comment | ||
|         ): | ||
|             st.session_state["feedback_sent"] = invocation | ||
|             st.rerun() | ||
|         else: | ||
|             st.error("Could not send feedback to Agenta. Check the logs.") | ||
|
|
||
|
|
||
| def main() -> None: | ||
|     st.set_page_config(page_title="CV Screening", page_icon="📄", layout="wide") | ||
|     st.title("📄 CV Screening") | ||
|     st.caption( | ||
|         "Upload a CV as PDF. It is converted to Markdown and screened against " | ||
|         "the job spec by the prompt managed in Agenta." | ||
|     ) | ||
|
|
||
|     init_screening() | ||
|     config = fetch_config() | ||
|     st.sidebar.markdown(f"**Prompt source:** {config.source}") | ||
|     st.sidebar.markdown(f"**Model:** {config.params['prompt']['llm_config']['model']}") | ||
|
|
||
|     uploaded = st.file_uploader("Candidate CV (PDF)", type=["pdf"]) | ||
|     if uploaded is None: | ||
|         st.info("Upload a PDF to get started. Sample CVs are in `data/sample_cvs/`.") | ||
|         return | ||
|
|
||
|     cv_markdown = pdf_to_markdown(uploaded.getvalue()) | ||
|     with st.expander("Extracted Markdown", expanded=False): | ||
|         st.markdown(cv_markdown) | ||
|
|
||
|     if st.button("Screen candidate", type="primary"): | ||
|         with st.spinner("Evaluating CV against the job spec ..."): | ||
|             result = screening.classify_cv({"cv": cv_markdown}, config) | ||
|         st.session_state["screening"] = {"cv": cv_markdown, "result": result} | ||
|
|
||
|     # Render from session state so the result (and its feedback form) | ||
|     # survives the reruns Streamlit triggers on every interaction. | ||
|     current = st.session_state.get("screening") | ||
|     if current and current["cv"] == cv_markdown: | ||
|         render_result(current["result"]) | ||
|         render_feedback(current["result"]) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
|     main() | ||
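
`render_result` indexes the result dictionary directly, so a response missing a key would raise a `KeyError` in the middle of rendering. A shape check before rendering is one way to avoid that. The sketch below is hypothetical: `result_problems` is not part of this PR, and the expected types (booleans, strings, a list) are inferred from how `render_result` uses each key, not from the JSON schema in `config.py`, which is not shown here.

```python
from typing import Any, Dict, List

# Keys read by render_result in app.py.
BOOL_KEYS = ("tech_match", "experience_match", "overall_match")
TEXT_KEYS = ("tech_reason", "experience_reason", "overall_reason")


def result_problems(result: Dict[str, Any]) -> List[str]:
    """List human-readable problems with a screening result; empty if it looks fine."""
    problems = []
    for key in BOOL_KEYS:
        if not isinstance(result.get(key), bool):
            problems.append(f"{key} should be a boolean")
    for key in TEXT_KEYS:
        if not isinstance(result.get(key), str):
            problems.append(f"{key} should be a string")
    if not isinstance(result.get("missing_requirements"), list):
        problems.append("missing_requirements should be a list")
    return problems


good = {
    "tech_match": True,
    "tech_reason": "Covers the required stack.",
    "experience_match": False,
    "experience_reason": "No management scope.",
    "overall_match": False,
    "overall_reason": "Lacks people-management experience.",
    "missing_requirements": ["Team leadership"],
}
bad = dict(good, tech_match="true")  # a string where a boolean is expected
del bad["missing_requirements"]
```

A caller could show the returned list with `st.error` and skip `render_result` when it is non-empty.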
Guard conversion/classification with user-facing error handling.
The main screening path can raise on PDF parsing, LLM call, or JSON decoding; currently those failures bubble up and break the interaction.
Proposed fix
📝 Committable suggestion
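
The reviewer's suggested diff is not included in this capture. As a rough illustration of the kind of guard the comment asks for, and not a reconstruction of that suggestion, a small wrapper could turn an exception from any step into a message the UI can display. The name `run_guarded` is hypothetical.

```python
import json
from typing import Any, Callable, Optional, Tuple


def run_guarded(
    step: str, func: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Call func and return (value, None), or (None, message) if it raises."""
    try:
        return func(*args, **kwargs), None
    except Exception as exc:  # deliberately broad: any failure becomes a message
        return None, f"{step} failed: {exc}"


# A call that succeeds and one that fails, using stdlib functions as stand-ins
# for the PDF conversion and the LLM call.
value, error = run_guarded("Number parsing", int, "12")
failed_value, failed_error = run_guarded("JSON decoding", json.loads, "{not json")
```

In `main()`, this could wrap `pdf_to_markdown(uploaded.getvalue())` and `screening.classify_cv(...)`, showing the message with `st.error` and returning early when it is set. Catching `Exception` broadly hides programming errors as well, so logging the exception alongside the user-facing message would be a sensible addition.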