██████╗ █████╗ ██████╗ ████████╗██╗ ██╗ ████████╗██╗██╗ ██╗ █████╗ ██████╗ ██╗
██╔══██╗██╔══██╗██╔══██╗╚══██╔══╝██║ ██║ ╚══██╔══╝██║██║ ██║██╔══██╗██╔══██╗██║
██████╔╝███████║██████╔╝ ██║ ███████║ ██║ ██║██║ █╗ ██║███████║██████╔╝██║
██╔═══╝ ██╔══██║██╔══██╗ ██║ ██╔══██║ ██║ ██║██║███╗██║██╔══██║██╔══██╗██║
██║ ██║ ██║██║ ██║ ██║ ██║ ██║ ██║ ██║╚███╔███╔╝██║ ██║██║ ██║██║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝
$ initializing parth_tiwari.profile ...
[✓] identity → AI Systems Engineer
[✓] location → Bengaluru, India
[✓] status → open to the right problem
[✓] philosophy → evidence before claims
[✓] vibe-coding → NOT DETECTED
[✓] evaluation → ACTIVE
[✓] evidence systems → 9 mapped in EVIDENCEBOUND
[✓] current build → SecondSelf
[✓] work node → Stick and Dot (AI/ML Intern)
[READY] parth_tiwari.profile loaded successfully.
Most profiles show you the wins. Here's what actually happened.
Building a fraud engine. Backtesting revealed this:
train ROC-AUC → 0.895 ← model looked great
production ROC-AUC → 0.60 ← system was lying to itself the whole time
cause: temporal features bled future signal into past training windows
fix: leakage validation, point-in-time enforcement, rebuilt from scratch
result: precision stayed useful under a real alert budget
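A minimal sketch of what point-in-time enforcement and a leakage check can look like. This is an illustration, not the engine's actual code: the event shape, the function names, and the 24-hour window are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical event log: one dict per transaction (user id + timestamp).
EVENTS = [
    {"user_id": "u1", "ts": datetime(2024, 1, 1, 10)},
    {"user_id": "u1", "ts": datetime(2024, 1, 1, 11)},
    {"user_id": "u1", "ts": datetime(2024, 1, 1, 13)},  # happens AFTER scoring time
    {"user_id": "u2", "ts": datetime(2024, 1, 1, 9)},
]


def leaky_txn_count(events, user_id):
    """Counts every event for the user, including ones after scoring time."""
    return sum(1 for e in events if e["user_id"] == user_id)


def point_in_time_txn_count(events, user_id, as_of, window):
    """Counts the user's events in [as_of - window, as_of): strictly before as_of."""
    start = as_of - window
    return sum(
        1 for e in events
        if e["user_id"] == user_id and start <= e["ts"] < as_of
    )


def ignores_future(feature_fn, events, user_id, as_of, window):
    """Leakage check: the feature must not change when future events are deleted."""
    past_only = [e for e in events if e["ts"] < as_of]
    return (
        feature_fn(events, user_id, as_of, window)
        == feature_fn(past_only, user_id, as_of, window)
    )


as_of = datetime(2024, 1, 1, 12)   # the moment the transaction is scored
window = timedelta(hours=24)       # assumed lookback window

safe_value = point_in_time_txn_count(EVENTS, "u1", as_of, window)
leaky_value = leaky_txn_count(EVENTS, "u1")
```

The check in `ignores_future` is the useful part: a feature that returns a different value once future rows are removed is reading data it could not have seen at scoring time.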
Shipped a Text-to-SQL agent. Hallucination detector reported 100% hallucination:
hallucination_rate → 100% ← every query hallucinating?
actual rate → 0% ← the metric was wrong, not the system
cause: schema_tables_used returned ["schema_dict", "tables"] — dict keys, not table names
fix: one-line patch
lesson: I found this because I wrote a hallucination detector in the first place
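A hedged reconstruction of how a bug like this can happen. The container shape (`schema_info`) and the function name are hypothetical; the only detail taken from the incident is that the table list came back as `["schema_dict", "tables"]`. Iterating a Python dict yields its keys, so the "known tables" set never contained a real table name.

```python
def hallucination_rate(tables_in_sql, known_tables):
    """Share of tables referenced by the SQL that are not in the schema."""
    if not tables_in_sql:
        return 0.0
    unknown = [t for t in tables_in_sql if t not in known_tables]
    return len(unknown) / len(tables_in_sql)


# Hypothetical schema container with the two keys seen in the bug.
schema_info = {
    "schema_dict": {"orders": ["id", "amount"], "customers": ["id", "name"]},
    "tables": ["orders", "customers"],
}

tables_in_sql = ["orders", "customers"]  # tables the generated query actually used

# Bug: iterating the dict yields its KEYS, not the table names.
buggy_known = set(schema_info)

# One-line fix: read the list stored under "tables".
fixed_known = set(schema_info["tables"])

buggy_rate = hallucination_rate(tables_in_sql, buggy_known)
fixed_rate = hallucination_rate(tables_in_sql, fixed_known)
```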
Deployed to Render. LLM mixed up two different databases:
question → "what is the total revenue?" (ecommerce schema)
sql → SELECT SUM(amount) FROM fines (library schema — wrong database entirely)
cause: both schemas lived in the same Chroma collection, embeddings leaked cross-schema
fix: prompt isolation + schema-scoped retrieval + re-evaluated full 82-query benchmark
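A toy sketch of schema-scoped retrieval, assuming each indexed chunk carries a `schema` tag. Token overlap stands in for embedding similarity, and the documents and names are made up for illustration. In a vector store the same idea is usually expressed as a metadata filter on the query.

```python
# Hypothetical index: both schemas share one collection, each chunk tagged.
DOCS = [
    {"schema": "ecommerce", "table": "orders", "text": "orders amount total revenue order_date"},
    {"schema": "library", "table": "fines", "text": "fines amount total paid member_id"},
    {"schema": "ecommerce", "table": "customers", "text": "customers name email city"},
    {"schema": "library", "table": "books", "text": "books title author isbn"},
]


def retrieve(question, docs, k=2, schema=None):
    """Return the top-k tables by token overlap, optionally limited to one schema."""
    tokens = set(question.lower().replace("?", "").split())
    # Filter BEFORE ranking so another schema's chunks cannot enter the prompt.
    candidates = [d for d in docs if schema is None or d["schema"] == schema]
    ranked = sorted(
        candidates,
        key=lambda d: len(tokens & set(d["text"].split())),
        reverse=True,
    )
    return [d["table"] for d in ranked[:k]]


question = "what is the total revenue?"
unscoped = retrieve(question, DOCS)                    # may mix schemas
scoped = retrieve(question, DOCS, schema="ecommerce")  # ecommerce tables only
```

With the filter off, `fines` from the library schema lands in the context for an ecommerce question, which is the failure described above.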
The pattern: I find these things because I build evaluation harnesses before I trust results.
- "it works on my machine" → ship it
+ measure → break it intentionally → fix it → measure again → then ship it
model_id : parth-tiwari-v2
type : early-career AI systems engineer
architecture : first-principles → build → evaluate → break → fix → deploy
training_data : production constraints, real failure modes, measurable outcomes
benchmarks:
text_to_sql_execution_success : 95.7% # 82-query ecommerce benchmark
cross_schema_generalization : 100% # zero-shot on unseen library schema
syntactic_hallucination_rate : 0.0% # schema-grounded generation
fraud_precision_in_budget : 92.06% # 0.5% daily alert constraint
fraud_p95_latency : ~386ms # API scoring path
medrag_answered_faithfulness : ~0.99 # cited medical retrieval answers
medrag_refusal_accuracy : 100% # insufficient evidence => refusal
vivid_beta_users : 10+ # creative AI work under Stick and Dot
serving:
portfolio : EVIDENCEBOUND — 9 evidence systems, same-world overlays
deployment : Docker · Render · Streamlit · HuggingFace · Vercel
current_focus : SecondSelf · evidence-bound career/application OS
known_limitations : early-career · still learning · high ownership · ships with boundaries
Featured below: 3 public systems. Full map: EVIDENCEBOUND, 9 nodes across personal projects, work evidence, current builds, and tooling.
⚡ QUERYPILOT · Self-Correcting Text-to-SQL Agent
Natural Language
│
▼
Schema-Aware RAG ──► SQL Generator
│
Static Validator
│
┌───────────────┼───────────────┐
Regex Repair LLM Fix Executor
└───────────────┴───────────────┘
Self-Correction Loop
(max 3 attempts)
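A simplified sketch of the bounded self-correction loop in the diagram above, using an in-memory SQLite database. It is not QueryPilot's code: `toy_repair` is a hypothetical stand-in for the regex-repair and LLM-fix stages, and the static validator is folded into the execution error for brevity.

```python
import sqlite3

MAX_ATTEMPTS = 3  # matches the "max 3 attempts" cap in the diagram


def run_with_self_correction(conn, sql, repair):
    """Execute sql; on failure, pass the error to repair() and retry, up to the cap."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            rows = conn.execute(sql).fetchall()
            return {"ok": True, "sql": sql, "rows": rows, "attempts": attempt}
        except sqlite3.Error as exc:
            last_error = str(exc)
            sql = repair(sql, last_error)
    return {"ok": False, "sql": sql, "error": last_error, "attempts": MAX_ATTEMPTS}


def toy_repair(sql, error):
    """Hypothetical repair step: fixes one known keyword typo."""
    return sql.replace("FORM", "FROM")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10,), (20,)])

fixed = run_with_self_correction(conn, "SELECT SUM(amount) FORM orders", toy_repair)
gave_up = run_with_self_correction(conn, "SELECT * FROM missing_table", lambda s, e: s)
```

The cap matters as much as the repair: a query that cannot be fixed returns a structured failure after three tries instead of looping.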
| Metric | Result | Context |
|---|---|---|
| First-attempt success | 90.0% | No correction, cold generation |
| After self-correction | 95.7% | 3-stage loop on 82-query benchmark |
| Hallucination rate | 0.0% | Zero invented tables or columns |
| Cross-schema generalization | 100% | Library schema, zero domain tuning |
| Cold-start reduction | ~400ms | Per-schema agent caching |
Python · LangGraph · FastAPI · ChromaDB · PostgreSQL · Docker · GitHub Actions
🛡 UPI FRAUD ENGINE · Real-Time Fraud Decision System
HARD CONSTRAINTS (non-negotiable):
├── score transaction at T using only pre-T features (no future leakage)
├── ≤ 0.5% daily alert budget (precision is everything)
└── simulate delayed fraud labels (real-world label lag)
transactions → point-in-time features → leakage tests → alert-budget model
train/serve drift surfaced → rebuilt → re-tested under real decision constraints
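One way to read "precision in alert budget", as a minimal sketch: rank a day's transactions by score, flag only as many as the budget allows, and measure precision on that flagged set. The function name and the synthetic data are assumptions for illustration, not the engine's evaluation code.

```python
import math


def precision_in_budget(scores, labels, budget=0.005):
    """Precision among the top-scored transactions that fit the daily alert budget."""
    n_alerts = max(1, math.floor(len(scores) * budget))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_alerts]
    return sum(label for _, label in flagged) / n_alerts


# Synthetic day: 1000 transactions, so a 0.5% budget allows 5 alerts.
scores = [i / 1000 for i in range(1000)]
labels = [0] * 1000
for fraud_index in (999, 998, 997, 996, 500):
    labels[fraud_index] = 1  # four frauds score high, one hides mid-pack

daily_precision = precision_in_budget(scores, labels)
```

Here four of the five flagged transactions are fraud, so precision in budget is 0.8, while the fraud at index 500 is missed entirely. The metric says nothing about recall, which is why it is reported alongside the budget itself.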
| Metric | Result | Context |
|---|---|---|
| Precision in alert budget | 92.06% | Only flags what matters |
| P95 latency | ~386ms | API scoring path |
| Leakage tests | 55+ | Temporal integrity checks |
| Backtest mode | day-by-day | Production-like replay |
Python · XGBoost · FastAPI · DuckDB · Great Expectations · Docker
🧬 EVIDENCE-BOUND DRUG RAG · Medical Knowledge Retrieval
HARD CONSTRAINT: medical domain — hallucination is patient harm
├── every claim needs source evidence
├── insufficient evidence must trigger refusal, not a guess
└── faithfulness is measured, not assumed
FDA + NICE PDFs → semantic chunks → retrieval → citation → refusal policy
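A minimal sketch of a refusal policy of this kind: answer only when retrieved chunks clear an evidence threshold, attach their sources, and refuse otherwise. The `Chunk` shape, the 0.75 threshold, the file names, and the refusal text are all hypothetical; a real system would generate the answer with an LLM rather than join chunk text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str   # e.g. a document id or page reference
    text: str
    score: float  # retrieval similarity in [0, 1]


REFUSAL = "Insufficient evidence in the indexed sources to answer this."


def answer_or_refuse(chunks, min_score=0.75, min_chunks=1):
    """Return a cited answer if enough chunks clear min_score, else refuse."""
    supporting = [c for c in chunks if c.score >= min_score]
    if len(supporting) < min_chunks:
        return {"refused": True, "answer": REFUSAL, "citations": []}
    citations = sorted({c.source for c in supporting})
    answer = " ".join(c.text for c in supporting)  # placeholder for generation
    return {"refused": False, "answer": answer, "citations": citations}


strong = [Chunk("label_a.pdf", "Take with food.", 0.91), Chunk("guide_b.pdf", "Unrelated.", 0.30)]
weak = [Chunk("guide_b.pdf", "Unrelated.", 0.30)]

answered = answer_or_refuse(strong)
refused = answer_or_refuse(weak)
```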
| Metric | Result | Context |
|---|---|---|
| Answered faithfulness | ~0.99 | Claims grounded in source |
| Refusal accuracy | 100% | Unsupported requests refused |
| Eval cost | $0.168 | Cost-aware evaluation |
| Boundary | non-diagnostic | Not medical advice |
Python · FastAPI · ChromaDB · SentenceTransformers · LangChain · RAGAS · Streamlit
step 1 → define what "working" means before writing a single line
step 2 → build the evaluation harness
step 3 → write the system
step 4 → break it intentionally (adversarial inputs, edge cases, drift simulation)
step 5 → fix what breaks
step 6 → measure again
step 7 → deploy with monitoring hooks
step 8 → repeat when production proves you wrong
This is how suspicious metrics become trustworthy. This is how a metric bug gets caught before it becomes a product lie. This is how a smaller system with gates beats a bigger prompt with vibes.
| Signal | Current State |
|---|---|
| Evidence systems | 9 mapped in EVIDENCEBOUND |
| Featured public systems | QueryPilot · UPI Fraud Engine · MedRAG |
| Main stack | Python · FastAPI · RAG · XGBoost · Vue · Three.js |
| Current build | SecondSelf - evidence-bound career OS |
$ ./parth --shutdown
[saving state] ✓ 9 evidence systems mapped
[saving state] ✓ 3 featured systems public
[saving state] ✓ all evaluation harnesses active
[saving state] ✓ open to the right problem
[goodbye] see you on the other side of the next PR.