Parth Tiwari parthtiwari-dev

██████╗  █████╗ ██████╗ ████████╗██╗  ██╗    ████████╗██╗██╗    ██╗ █████╗ ██████╗ ██╗
██╔══██╗██╔══██╗██╔══██╗╚══██╔══╝██║  ██║    ╚══██╔══╝██║██║    ██║██╔══██╗██╔══██╗██║
██████╔╝███████║██████╔╝   ██║   ███████║       ██║   ██║██║ █╗ ██║███████║██████╔╝██║
██╔═══╝ ██╔══██║██╔══██╗   ██║   ██╔══██║       ██║   ██║██║███╗██║██╔══██║██╔══██╗██║
██║     ██║  ██║██║  ██║   ██║   ██║  ██║       ██║   ██║╚███╔███╔╝██║  ██║██║  ██║██║
╚═╝     ╚═╝  ╚═╝╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝       ╚═╝   ╚═╝ ╚══╝╚══╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝

◈ SYSTEM BOOT

$ initializing parth_tiwari.profile ...

[✓] identity          →  AI Systems Engineer
[✓] location          →  Bengaluru, India
[✓] status            →  open to the right problem
[✓] philosophy        →  evidence before claims
[✓] vibe-coding       →  NOT DETECTED
[✓] evaluation        →  ACTIVE
[✓] evidence systems  →  9 mapped in EVIDENCEBOUND
[✓] current build     →  SecondSelf
[✓] work node         →  Stick and Dot  (AI/ML Intern)

[READY] parth_tiwari.profile loaded successfully.

◈ WHO I AM (told through what broke)

Most profiles show you the wins. Here's what actually happened.

Building a fraud engine. Backtesting revealed this:

train ROC-AUC      →  0.895   ← model looked great
production ROC-AUC →  0.60    ← system was lying to itself the whole time

cause:  temporal features bled future signal into past training windows
fix:    leakage validation, point-in-time enforcement, rebuilt from scratch
result: precision stayed useful under a real alert budget
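
A minimal sketch of what a point-in-time leakage test can look like. It is not the engine's actual code: the function names, the transaction-count feature, and the one-hour window are assumptions made for illustration. The idea is that a feature computed at time T must not change when every event at or after T is deleted from the history.

```python
from datetime import datetime, timedelta

def prior_txn_count(history, user_id, as_of, window):
    """Count a user's transactions in [as_of - window, as_of), strictly before as_of."""
    start = as_of - window
    return sum(
        1 for t in history
        if t["user_id"] == user_id and start <= t["ts"] < as_of
    )

def leaky_txn_count(history, user_id, as_of, window):
    """Buggy variant: a centered window that also counts events after as_of."""
    return sum(
        1 for t in history
        if t["user_id"] == user_id and abs(t["ts"] - as_of) <= window
    )

def leaks_future(feature_fn, history, user_id, as_of, window):
    """True if removing all events at or after as_of changes the feature value."""
    past_only = [t for t in history if t["ts"] < as_of]
    full = feature_fn(history, user_id, as_of, window)
    truncated = feature_fn(past_only, user_id, as_of, window)
    return full != truncated

# Hypothetical data: two past events and one future event around t0.
t0 = datetime(2024, 1, 1, 12, 0)
window = timedelta(hours=1)
history = [
    {"user_id": "u1", "ts": t0 - timedelta(minutes=30)},
    {"user_id": "u1", "ts": t0 - timedelta(minutes=5)},
    {"user_id": "u1", "ts": t0 + timedelta(minutes=10)},  # must never be visible at t0
]

print(leaks_future(prior_txn_count, history, "u1", t0, window))
print(leaks_future(leaky_txn_count, history, "u1", t0, window))
```

Running the same check against every feature function is one way to turn "no future leakage" from a promise into a test.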

Shipped a Text-to-SQL agent. Hallucination detector reported 100% hallucination:

hallucination_rate  →  100%   ← every query hallucinating?
actual rate         →  0%     ← the metric was wrong, not the system

cause:  schema_tables_used returned ["schema_dict", "tables"] — dict keys, not table names
fix:    one-line patch
lesson: I found this because I wrote a hallucination detector in the first place
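
A hedged reconstruction of how a bug of this shape can produce a 100% reading. The payload layout, the function names, and the regex-based table extraction are assumptions, not the project's code. Only the symptom (a list of dict keys where table names were expected) comes from the description above.

```python
import re

def referenced_tables(sql):
    """Naive table extraction: identifiers that follow FROM or JOIN."""
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)"
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}

def schema_tables_buggy(payload):
    # Iterating a dict yields its keys, so this returns ["schema_dict", "tables"].
    return list(payload)

def schema_tables_fixed(payload):
    # The one-line fix in this sketch: read the table names, not the keys.
    return list(payload["tables"])

def hallucination_rate(queries, known_tables):
    """Fraction of queries that reference at least one table not in known_tables."""
    known = {t.lower() for t in known_tables}
    flagged = sum(1 for q in queries if not referenced_tables(q) <= known)
    return flagged / len(queries)

# Hypothetical payload and queries.
payload = {
    "schema_dict": {"orders": ["id", "amount"], "customers": ["id", "name"]},
    "tables": ["orders", "customers"],
}
queries = [
    "SELECT SUM(amount) FROM orders",
    "SELECT c.name FROM customers c JOIN orders o ON o.id = c.id",
]

print(hallucination_rate(queries, schema_tables_buggy(payload)))
print(hallucination_rate(queries, schema_tables_fixed(payload)))
```

With the buggy helper no real table can ever match, so every query is flagged. A metric pinned at exactly 0% or 100% is usually worth a second look for this reason.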

Deployed to Render. LLM mixed up two different databases:

question  →  "what is the total revenue?"      (ecommerce schema)
sql       →  SELECT SUM(amount) FROM fines      (library schema — wrong database entirely)

cause:  both schemas lived in the same Chroma collection, so retrieval pulled context across schemas
fix:    prompt isolation + schema-scoped retrieval + re-evaluated full 82-query benchmark
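
A toy illustration of schema-scoped retrieval, kept free of external dependencies so it runs anywhere. Token overlap stands in for embedding similarity, and the documents, the `schema` field, and the `retrieve` function are all hypothetical. In Chroma the same idea would typically be expressed as a metadata filter passed as `where` to `collection.query`, or as one collection per schema.

```python
import re

def tokens(text):
    """Lowercased word tokens; underscores are kept so column names stay whole."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def retrieve(question, docs, schema=None, k=1):
    """Rank docs by token overlap with the question, optionally within one schema only."""
    pool = [d for d in docs if schema is None or d["schema"] == schema]
    q = tokens(question)
    ranked = sorted(pool, key=lambda d: len(q & tokens(d["text"])), reverse=True)
    return ranked[:k]

# Hypothetical schema snippets sharing one store.
docs = [
    {"schema": "library", "text": "fines (id, member_id, amount) total revenue from late fees"},
    {"schema": "ecommerce", "text": "orders (id, customer_id, amount) order revenue"},
]

question = "what is the total revenue?"
unscoped = retrieve(question, docs)                    # free to pick the wrong schema
scoped = retrieve(question, docs, schema="ecommerce")  # restricted before ranking

print(unscoped[0]["schema"], scoped[0]["schema"])
```

Filtering before ranking matters here: a post-hoc filter on the top-k results can come back empty when the wrong schema dominates the ranking.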

The pattern: I find these things because I build evaluation harnesses before I trust results.

- "it works on my machine" → ship it
+ measure → break it intentionally → fix it → measure again → then ship it

◈ MODEL CARD

model_id         : parth-tiwari-v2
type             : early-career AI systems engineer
architecture     : first-principles → build → evaluate → break → fix → deploy
training_data    : production constraints, real failure modes, measurable outcomes

benchmarks:
  text_to_sql_execution_success  : 95.7%     # 82-query ecommerce benchmark
  cross_schema_generalization    : 100%      # zero-shot on unseen library schema
  syntactic_hallucination_rate   : 0.0%      # schema-grounded generation
  fraud_precision_in_budget      : 92.06%    # 0.5% daily alert constraint
  fraud_p95_latency              : ~386ms    # API scoring path
  medrag_answered_faithfulness   : ~0.99     # cited medical retrieval answers
  medrag_refusal_accuracy        : 100%      # insufficient evidence => refusal
  vivid_beta_users               : 10+       # creative AI work under Stick and Dot

serving:
  portfolio          : EVIDENCEBOUND — 9 evidence systems, same-world overlays
  deployment         : Docker · Render · Streamlit · HuggingFace · Vercel
  current_focus      : SecondSelf · evidence-bound career/application OS

known_limitations    : early-career · still learning · high ownership · ships with boundaries

◈ DEPLOYED SYSTEMS

Featured below: 3 public systems. Full map: EVIDENCEBOUND — 9 nodes across personal projects, work evidence, current builds, and tooling.

⚡ QUERYPILOT · Self-Correcting Text-to-SQL Agent

  Natural Language
        │
        ▼
  Schema-Aware RAG  ──►  SQL Generator
                               │
                         Static Validator
                               │
               ┌───────────────┼───────────────┐
          Regex Repair       LLM Fix        Executor
               └───────────────┴───────────────┘
                        Self-Correction Loop
                           (max 3 attempts)

| Metric | Result | Context |
|---|---|---|
| First-attempt success | `90.0%` | No correction, cold generation |
| After self-correction | `95.7%` | 3-stage loop on 82-query benchmark |
| Hallucination rate | `0.0%` | Zero invented tables or columns |
| Cross-schema generalization | `100%` | Library schema, zero domain tuning |
| Cold-start reduction | `~400ms` | Per-schema agent caching |

Python LangGraph FastAPI ChromaDB PostgreSQL Docker GitHub Actions
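
The bounded retry loop in the diagram can be sketched roughly as follows. This is a simplification under stated assumptions, not QueryPilot's implementation: SQLite stands in for PostgreSQL, executor errors stand in for the static validator, and `llm_fix` is a hypothetical callable that receives the failing SQL and the error message and returns a revised query.

```python
import sqlite3

MAX_ATTEMPTS = 3

def regex_repair(sql):
    """Cheap deterministic cleanup: trim whitespace and trailing semicolons."""
    return sql.strip().rstrip(";").strip()

def run_with_correction(conn, sql, llm_fix, max_attempts=MAX_ATTEMPTS):
    """Execute sql; on failure, ask llm_fix for a revision and retry, up to max_attempts."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        candidate = regex_repair(sql)
        try:
            rows = conn.execute(candidate).fetchall()
            return {"ok": True, "attempts": attempt, "sql": candidate, "rows": rows}
        except sqlite3.Error as exc:
            errors.append(str(exc))
            sql = llm_fix(candidate, str(exc))
    return {"ok": False, "attempts": max_attempts, "sql": sql, "errors": errors}

def stub_fix(sql, error):
    """Stand-in for an LLM call: corrects one known misspelling."""
    return sql.replace("amout", "amount")

# Hypothetical one-table database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10,), (20,)])

result = run_with_correction(conn, "SELECT SUM(amout) FROM orders;;", stub_fix)
print(result["ok"], result["attempts"], result["rows"])
```

Capping the attempts keeps a bad generation from looping forever, and returning the error history makes failed queries inspectable in an evaluation run.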

🛡 UPI FRAUD ENGINE · Real-Time Fraud Decision System

  HARD CONSTRAINTS (non-negotiable):
  ├── score transaction at T using only pre-T features   (no future leakage)
  ├── ≤ 0.5% daily alert budget                         (precision is everything)
  └── simulate delayed fraud labels                     (real-world label lag)

  transactions → point-in-time features → leakage tests → alert-budget model
  train/serve drift surfaced → rebuilt → re-tested under real decision constraints

| Metric | Result | Context |
|---|---|---|
| Precision in alert budget | `92.06%` | Only flags what matters |
| P95 latency | `~386ms` | API scoring path |
| Leakage tests | `55+` | Temporal integrity checks |
| Backtest mode | `day-by-day` | Production-like replay |

Python XGBoost FastAPI DuckDB Great Expectations Docker
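
One way to read "precision in alert budget" is sketched below, assuming the budget is applied per day by flagging only the highest-scoring transactions. The function name and the synthetic scores are illustrative and are not taken from the project.

```python
import math

def precision_in_budget(scores, labels, budget=0.005):
    """Flag the top `budget` fraction of one day's transactions by score.

    Returns (number_flagged, precision); precision is None when the budget
    rounds down to zero alerts.
    """
    n_alerts = math.floor(len(scores) * budget)
    if n_alerts == 0:
        return 0, None
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_alerts]
    hits = sum(label for _, label in flagged)
    return n_alerts, hits / n_alerts

# Synthetic day: 1000 transactions, score rises with index, 5 fraud cases.
scores = [i / 1000 for i in range(1000)]
labels = [0] * 1000
for idx in (999, 998, 997, 996, 500):
    labels[idx] = 1

print(precision_in_budget(scores, labels))  # → (5, 0.8)
```

Under a fixed budget, the fraud case at index 500 is simply never reviewed, which is why precision at the top of the ranking matters more here than a global ROC-AUC.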

🧬 EVIDENCE-BOUND DRUG RAG · Medical Knowledge Retrieval

  HARD CONSTRAINT: medical domain — hallucination is patient harm
  ├── every claim needs source evidence
  ├── insufficient evidence must trigger refusal, not a guess
  └── faithfulness is measured, not assumed

  FDA + NICE PDFs → semantic chunks → retrieval → citation → refusal policy

| Metric | Result | Context |
|---|---|---|
| Answered faithfulness | `~0.99` | Claims grounded in source |
| Refusal accuracy | `100%` | Unsupported requests refused |
| Eval cost | `$0.168` | Cost-aware evaluation |
| Boundary | `non-diagnostic` | Not medical advice |

Python FastAPI ChromaDB SentenceTransformers LangChain RAGAS Streamlit
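
A small sketch of a refusal gate in the spirit of the constraints above. The threshold, the `Chunk` type, and the rule "answer only when at least one chunk clears the similarity bar" are assumptions chosen for illustration; the real policy may weigh evidence differently.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str   # document the chunk came from
    text: str     # chunk content
    score: float  # retrieval similarity in [0, 1], higher is better

REFUSAL = "Insufficient evidence in the indexed sources to answer this."

def answer_or_refuse(chunks, min_score=0.75, min_chunks=1):
    """Answer only from chunks that clear min_score; otherwise refuse."""
    support = [c for c in chunks if c.score >= min_score]
    if len(support) < min_chunks:
        return {"refused": True, "answer": REFUSAL, "citations": []}
    # Placeholder for generation: quote the best-supported chunk verbatim.
    best = max(support, key=lambda c: c.score)
    citations = sorted({c.source for c in support})
    return {"refused": False, "answer": best.text, "citations": citations}

# Hypothetical retrieval results.
strong = [Chunk("label_a.pdf", "Maximum daily dose is stated in section 2.", 0.91)]
weak = [Chunk("label_b.pdf", "Storage conditions are listed in section 16.", 0.32)]

print(answer_or_refuse(strong)["refused"], answer_or_refuse(weak)["refused"])
```

Keeping the gate outside the generation step means refusal accuracy can be measured on its own, separately from faithfulness of the answers that do get through.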

◈ HOW I ACTUALLY BUILD

step 1  →  define what "working" means before writing a single line
step 2  →  build the evaluation harness
step 3  →  write the system
step 4  →  break it intentionally  (adversarial inputs, edge cases, drift simulation)
step 5  →  fix what breaks
step 6  →  measure again
step 7  →  deploy with monitoring hooks
step 8  →  repeat when production proves you wrong

This is how suspicious metrics become trustworthy. This is how a metric bug gets caught before it becomes a product lie. This is how a smaller system with gates beats a bigger prompt with vibes.

◈ STACK

◈ STATS

| Signal | Current State |
|---|---|
| Evidence systems | `9 mapped in EVIDENCEBOUND` |
| Featured public systems | `QueryPilot · UPI Fraud Engine · MedRAG` |
| Main stack | `Python · FastAPI · RAG · XGBoost · Vue · Three.js` |
| Current build | `SecondSelf - evidence-bound career OS` |

$ ./parth --shutdown

[saving state]   ✓  9 evidence systems mapped
[saving state]   ✓  3 featured systems public
[saving state]   ✓  all evaluation harnesses active
[saving state]   ✓  open to the right problem

[goodbye]  see you on the other side of the next PR.
