Skip to content

Exile404/Datascope-AI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataScope AI

Full-stack LLM operations platform built on a custom fine-tuned model. Profile any CSV, evaluate model outputs, monitor drift, and analyze costs — all powered by your own LLM, not hosted APIs.

Status License Python Next.js


Preview

Home

Profiler Evaluator
Profiler Evaluator
Drift Monitor Cost Analyzer
Drift Cost

What is DataScope AI?

A complete LLM operations dashboard built end-to-end without relying on hosted APIs. The system combines a fine-tuned Llama 3.1 8B model with statistical analysis services to deliver four production-grade tools:

  1. Data Profiler — Upload any CSV, get automated EDA + AI-generated insights
  2. LLM Evaluator — Test prompts across temperatures with automated quality scoring (LLM-as-judge)
  3. Drift Monitor — Semantic drift detection, hallucination flagging, time-series tracking
  4. Cost Analyzer — Token cost projections across 11 LLM providers

Architecture

┌─────────────────┐     ┌────────────────────┐     ┌─────────────────────┐
│  Next.js 15     │────▶│  FastAPI           │────▶│  Fine-Tuned LLM     │
│  + shadcn/ui    │◀────│  + LangChain       │◀────│  Llama 3.1 8B + LoRA│
│  + TanStack     │ JSON│  + Pandas/SciPy    │     │  via Ollama         │
│  + Recharts     │     │  + sentence-trans  │     └─────────────────────┘
└─────────────────┘     │  + SQLite metrics  │     ┌─────────────────────┐
                        └────────────────────┘────▶│  Embedding Engine   │
                                                   │  all-MiniLM-L6-v2   │
                                                   └─────────────────────┘
     Frontend                Backend                    ML Layer

Three layers, clean separation:

  • Frontend — UI dashboard with TanStack Query state management
  • Backend — FastAPI server with engines (LLM, embeddings, metrics) and services (profiler, evaluator, drift, cost)
  • ML — Training pipeline (Unsloth + LoRA) and runtime services

Tech Stack

Layer Technology
Base Model Llama 3.1 8B Instruct (4-bit quantized)
Fine-Tuning Unsloth + LoRA (r=16, alpha=32)
Inference Ollama (GGUF Q4_K_M)
Embeddings sentence-transformers (all-MiniLM-L6-v2)
Backend FastAPI, Pydantic, LangChain, Pandas, SciPy, aiosqlite
Frontend Next.js 15, TypeScript, Tailwind CSS v4, shadcn/ui
State Management TanStack Query
Charts Recharts
Markdown react-markdown + remark-gfm

Features

1. Data Profiler

Upload any CSV and receive:

  • Automated statistical profiling (mean, std, skewness, kurtosis, IQR outliers)
  • Correlation analysis with strength classification
  • Data quality scoring with column-level callouts
  • AI-generated insights with actionable ML recommendations
  • Reasoning trace (chain-of-thought visibility)

2. LLM Evaluator

  • Single-prompt evaluation with quality scoring
  • Multi-temperature comparison (parallel async inference)
  • LLM-as-judge scoring across 4 dimensions (relevance, coherence, completeness, factuality)
  • Latency and token tracking
  • Persistent evaluation history (SQLite)

3. Drift Monitor

  • Semantic drift detection via sentence embeddings
  • Statistical drift on input distributions
  • Hallucination detection (flags numbers/entities not in source context)
  • Real-time alerts (critical/warning/info)
  • Time-series charts with threshold reference lines
  • Auto-refreshing history

4. Cost Analyzer

  • Token cost calculations across 11 LLM providers
  • Daily/monthly/annual projections
  • Cost comparison bar chart with color-coded tiers
  • Detailed pricing table with rankings
  • Self-hosted vs paid savings calculator

Project Structure

datascope-ai/
├── ml/                          # ML training pipeline
│   ├── configs/
│   │   ├── model_config.yaml    # Base model + LoRA settings
│   │   └── training_config.yaml # Hyperparameters
│   ├── scripts/
│   │   ├── generate_training_data.py  # 14-domain synthetic generator
│   │   ├── train.py             # Unsloth fine-tuning
│   │   └── export.py            # Model export utility
│   └── data/processed/
│       └── datascope_train.toon # 10K training examples
│
├── backend/                     # FastAPI server
│   ├── app/
│   │   ├── api/                 # Route handlers
│   │   │   ├── profiler.py
│   │   │   ├── evaluator.py
│   │   │   ├── drift.py
│   │   │   └── cost.py
│   │   ├── engines/             # ML services (heavy)
│   │   │   ├── llm_engine.py
│   │   │   ├── embedding_engine.py
│   │   │   └── metrics_store.py
│   │   ├── services/            # Business logic
│   │   │   ├── profiler_service.py
│   │   │   ├── insight_service.py
│   │   │   ├── evaluator_service.py
│   │   │   ├── drift_service.py
│   │   │   └── cost_service.py
│   │   ├── models/              # Pydantic schemas
│   │   ├── config.py
│   │   └── main.py
│   └── requirements.txt
│
├── frontend/                    # Next.js dashboard
│   ├── src/
│   │   ├── app/                 # App Router pages
│   │   ├── components/
│   │   ├── hooks/               # TanStack Query hooks
│   │   ├── lib/                 # API client, constants
│   │   └── types/               # TypeScript interfaces
│   └── package.json
│
└── screenshots/                 # README assets

Setup

Prerequisites

  • Python 3.12+
  • Node.js 20+
  • pnpm 9+
  • NVIDIA GPU with 16GB+ VRAM (for training only)
  • Ollama

1. Clone

git clone https://github.com/Exile404/Datascope-AI.git
cd Datascope-AI

2. ML Pipeline (skip if just running)

cd ml
python3 -m venv ml-env
source ml-env/bin/activate
pip install -r requirements.txt
pip install -r requirements-train.txt

cd scripts
python generate_training_data.py --num_examples 10000
python train.py

3. Backend

cd backend
python3 -m venv backend-env
source backend-env/bin/activate
pip install -r requirements.txt
cp .env.example .env

uvicorn app.main:app --reload --port 8000

API docs: http://localhost:8000/docs

4. Frontend

cd frontend
pnpm install
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
pnpm dev

App: http://localhost:3000

5. Ollama

curl -fsSL https://ollama.com/install.sh | sh
cd ml/scripts/models/datascope-analyst-gguf_gguf
ollama create datascope-analyst -f ./Modelfile

API Endpoints

Profiler

Method Endpoint Description
POST /api/profiler/profile CSV → statistics only
POST /api/profiler/insight CSV → full LLM analysis

Evaluator

Method Endpoint Description
POST /api/evaluator/evaluate Single eval with quality scoring
POST /api/evaluator/compare Multi-temperature comparison
GET /api/evaluator/history Recent evaluations

Drift

Method Endpoint Description
POST /api/drift/detect Semantic + statistical drift
POST /api/drift/hallucinations Invented content detection
GET /api/drift/history/{metric} Time-series for any metric

Cost

Method Endpoint Description
POST /api/cost/calculate Single call cost
POST /api/cost/project Multi-model projections
GET /api/cost/usage Real usage from metrics store
GET /api/cost/models All supported models

Training Details

Dataset

  • 10,000 synthetic examples across 14 domains: E-Commerce, Healthcare, Finance, HR, IoT, Education, Marketing, Real Estate, Logistics, Social Media, Cybersecurity, Retail, Weather, Sports
  • Format: Custom .toon format with [system]/[input]/[output] blocks
  • Augmentation: Multiple distributions (normal, log-normal, beta, exponential, t-distribution, bimodal), realistic correlations, missing value injection

Hyperparameters

base_model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
lora:
  r: 16
  alpha: 32
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
training:
  epochs: 3
  effective_batch_size: 16
  learning_rate: 2e-4
  scheduler: cosine
  warmup_ratio: 0.05
  precision: bf16

Results

  • Final training loss: 0.4815
  • Trained in ~7 hours on RTX 5060 Ti (16GB)
  • Average quality score (LLM-as-judge): 96/100

Roadmap (v3)

  • Multi-agent orchestration (Statistician + Correlation + Quality + Strategist agents)
  • Streaming token output in UI
  • Histogram + correlation heatmap visualizations
  • Authentication + multi-tenancy
  • Docker Compose for one-command deployment
  • Vercel + Railway deployment guides
  • CI/CD with GitHub Actions

Author

Tahsinul Haque Dhrubo Master of Data Science — Deakin University GitHub: @Exile404

License

MIT

AI Usage

Used AI to create readme.md and debugging the code

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors