
PowerBarcoder 🧬

Stop discarding your best data. Let AI rescue it.


The AI-Native Platform for Next-Generation DNA Barcoding



🚀 The Problem: Long Amplicons, Short Reads

Standard pipelines discard up to 30% of valuable data because forward and reverse reads don't overlap sufficiently. You lose critical biological signals simply because traditional algorithms are too rigid.

💡 The Solution: PowerBarcoder

PowerBarcoder combines Reference-Guided Merging with LLM Arbitration to recover those lost sequences with high fidelity. Think of it as having a bioinformatician review every ambiguous read, at machine speed.
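The merging idea can be sketched in a few lines. This is an illustrative toy, not PowerBarcoder's actual implementation: it assumes both reads map exactly onto a reference and bridges any non-overlapping gap with reference sequence.

```python
# Toy sketch of reference-guided merging (hypothetical function, not the
# pipeline's real API; real merging must tolerate mismatches and indels).
def reference_guided_merge(fwd: str, rev: str, reference: str):
    """Place fwd/rev reads on the reference; splice if they overlap,
    otherwise fill the gap from the reference itself."""
    fwd_start = reference.find(fwd)
    rev_start = reference.find(rev)
    if fwd_start == -1 or rev_start == -1:
        return None  # a read does not map cleanly; candidate for arbitration
    fwd_end = fwd_start + len(fwd)
    if fwd_end >= rev_start:
        # Reads overlap (or abut): splice at the junction.
        return fwd + rev[fwd_end - rev_start:]
    # No overlap: bridge the gap with reference sequence.
    return fwd + reference[fwd_end:rev_start] + rev

ref = "ACGTACGTTTGGCCAAGGTTACGT"
print(reference_guided_merge("ACGTACGT", "GGTTACGT", ref))  # → full ref
```

Cases where `find` fails, or where several placements disagree, are exactly the ambiguous merges the LLM arbitration step is meant to review.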


⚡ Why Researchers Choose PowerBarcoder

🎯 Maximize Data Recovery

Don't settle for standard yields. Our Two-Stage Error Learning adapts to your batch-specific noise patterns, recovering significantly more ASVs than traditional DADA2 pipelines.

🤖 AI-Powered Accuracy

When algorithms disagree, PowerBarcoder calls in an expert. Our LLM Arbitration Engine (compatible with GPT-4, Claude, Gemini, and local models) reviews ambiguous cases, achieving 91.5% consistency with human domain experts.

🔬 Hybrid Precision

We don't rely solely on public databases. Our Hybrid Reference Strategy combines global references (BOLD, GenBank) with your batch's specific data for unmatched alignment accuracy.


πŸ› οΈ Built for Modern Science

Feature Benefit
🐳 Docker Ready Deploy anywhere in minutes. No dependency hell.
πŸ–₯️ Beautiful Web UI No command-line guessing. Monitor progress in real-time.
πŸ“Š Audit-Ready Every AI decision comes with a reasoning log. Full traceability.
πŸ”„ Streaming Analysis Watch your data process live via Server-Sent Events (SSE).
πŸ“ˆ Quality Reports Automated Excel reports with charts and statistics.

🏁 Get Started in 3 Minutes

Option 1: Docker (Recommended) 🐳

Step 1: Pull and Run

# Pull the latest image
docker pull powerbarcoder/powerbarcoder:latest

# Run the container
docker run -d -p 5000:5000 \
  -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder \
  powerbarcoder/powerbarcoder:latest

Step 2: Open the Web UI

Navigate to http://localhost:5000 in your browser.

Step 3: Upload Your Data

  1. Click "New Batch" in the dashboard
  2. Upload your paired-end FASTQ files (.fastq.gz or .fq.gz)
  3. Provide your sample metadata CSV (optional but recommended)
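A minimal metadata CSV might look like the snippet below. The column names are illustrative assumptions, not the pipeline's required schema; check the in-app template for the exact fields.

```python
import csv
import io

# Illustrative metadata layout (column names are an assumption).
metadata_csv = """\
sample_id,site,date,replicate
Site01_Rep1,Site01,2025-03-14,1
Site01_Rep2,Site01,2025-03-14,2
Site02_Rep1,Site02,2025-03-15,1
"""

rows = list(csv.DictReader(io.StringIO(metadata_csv)))
print(len(rows), rows[0]["sample_id"])  # → 3 Site01_Rep1
```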

Step 4: Configure Pipeline

  • Working Directory: Auto-created in /PowerBarcoder_data/batch_YYYYMMDD_HHMMSS
  • Reference Database: Choose BOLD, GenBank, or upload custom FASTA
  • LLM Provider (optional): OpenAI, Anthropic, Google, or Local (Ollama)
  • Pipeline Mode: Standard or Express (skips LLM arbitration)
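The auto-created working directory follows the batch_YYYYMMDD_HHMMSS pattern, which you can reproduce when scripting around the output tree (the helper function here is our own, not part of PowerBarcoder):

```python
from datetime import datetime
from pathlib import PurePosixPath

def batch_dir(root="/PowerBarcoder_data", now=None):
    """Reproduce the batch_YYYYMMDD_HHMMSS working-directory naming."""
    now = now or datetime.now()
    return PurePosixPath(root) / now.strftime("batch_%Y%m%d_%H%M%S")

print(batch_dir(now=datetime(2025, 3, 14, 9, 30, 5)))
# → /PowerBarcoder_data/batch_20250314_093005
```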

Step 5: Start Processing

Click "Start Pipeline" and monitor progress in real-time. The dashboard shows:

  • Current pipeline stage (Demultiplex → Denoise → Merge → BLAST → QC)
  • Progress percentage
  • Live logs via SSE streaming
  • Estimated time remaining
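Because SSE is a plain-text protocol, the live log stream can also be consumed outside the browser. A minimal parser for the SSE wire format (the event names in the sample are made up for illustration):

```python
def parse_sse(stream_text):
    """Parse Server-Sent Events wire format into (event, data) pairs.
    Events are separated by blank lines; fields are 'event:' and 'data:'."""
    events = []
    event, data_lines = "message", []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:  # blank line terminates an event
            events.append((event, "\n".join(data_lines)))
            event, data_lines = "message", []
    return events

sample = "event: progress\ndata: 42%\n\ndata: Denoise stage started\n\n"
print(parse_sse(sample))
# → [('progress', '42%'), ('message', 'Denoise stage started')]
```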

Step 6: Download Results

When complete, download:

  • ASV Table (CSV): Final species identification results
  • QC Report (Excel): Quality metrics with charts
  • Merge Tracking (CSV): Detailed merge statistics
  • Audit Logs: AI reasoning logs (if LLM was used)

Option 2: Local Installation (For Developers) 💻

# Clone the repository
git clone https://github.com/PowerBarcoder/PowerBarcoder.git
cd PowerBarcoder

# Install dependencies
pip install -r requirements.txt

# Install R dependencies (dada2 is distributed via Bioconductor, not CRAN)
Rscript -e "install.packages(c('optparse', 'jsonlite', 'BiocManager')); BiocManager::install('dada2')"

# Run the application
python app.py

Then open http://localhost:5000 in your browser.


💡 Best Practices

📋 Data Preparation

  1. Quality Check First: Run FastQC on your raw reads to identify potential issues
  2. Consistent Naming: Use clear, consistent sample names (e.g., Site01_Rep1_R1.fastq.gz)
  3. Metadata is Key: Provide a CSV with sample info (sampling site, date, replicate number)
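A naming convention like Site01_Rep1_R1.fastq.gz can be enforced up front with a small check. The pattern below encodes our suggested convention, not a pipeline requirement:

```python
import re

# Validate the Site01_Rep1_R1.fastq.gz naming convention
# (the pattern is our suggestion, not a pipeline requirement).
NAME_RE = re.compile(
    r"^(?P<site>\w+)_(?P<rep>Rep\d+)_(?P<read>R[12])\.(fastq|fq)\.gz$"
)

def check_name(filename):
    """Return the parsed name parts, or None if the name is malformed."""
    m = NAME_RE.match(filename)
    return m.groupdict() if m else None

print(check_name("Site01_Rep1_R1.fastq.gz"))
print(check_name("sampleA.fastq.gz"))  # → None
```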

🎯 Pipeline Configuration

  1. Choose the Right Mode:

    • Standard Mode: For publication-quality results (includes LLM arbitration)
    • Express Mode: For quick exploratory analysis (skips LLM)
  2. Reference Selection:

    • Use BOLD for animal barcoding (COI)
    • Use GenBank for broader taxonomic coverage
    • Upload custom FASTA for targeted studies (e.g., invasive species monitoring)
  3. LLM Provider Selection:

    • GPT-4: Best overall accuracy, higher cost
    • Claude 3.5: Strong biological reasoning, competitive pricing
    • Gemini Pro: Fast and cost-effective
    • Local (Ollama): Free but requires powerful GPU

🔬 Quality Control

  1. Review QC Report: Check for batch effects, chimera rates, and merge success rates
  2. Inspect Audit Logs: For critical decisions, review AI reasoning in audit logs
  3. Compare Replicates: Technical replicates should show high consistency
  4. Validate Outliers: Unexpected species may be real discoveries, or contamination

📊 Result Interpretation

  1. ASV Abundance: Higher read count ≠ higher biomass (PCR bias exists)
  2. Species Assignment: Check confidence scores (based on BLAST identity %)
  3. Merge Statistics: Low merge rates may indicate:
    • Amplicon too long for read length
    • Low library quality
    • Primer mismatch
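Low merge rates are easy to screen for programmatically from per-sample statistics. The field names and the 50% threshold below are illustrative assumptions, not the actual Merge Tracking schema:

```python
# Flag samples with low merge success (threshold and field layout are
# illustrative assumptions, not the Merge Tracking CSV schema).
def flag_low_merge(stats, threshold=0.5):
    """stats: {sample_id: (merged_reads, total_reads)} -> flagged (id, rate)."""
    flagged = []
    for sample, (merged, total) in stats.items():
        rate = merged / total if total else 0.0
        if rate < threshold:
            flagged.append((sample, round(rate, 2)))
    return flagged

stats = {"Site01_Rep1": (9200, 10000), "Site02_Rep1": (3100, 10000)}
print(flag_low_merge(stats))  # → [('Site02_Rep1', 0.31)]
```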

🚀 Optimization Tips

  1. Batch Processing: Group samples by project for consistent reference selection
  2. Iterative Refinement: Re-run with adjusted parameters if initial results are poor
  3. Resource Allocation: For large batches (>100 samples), allocate 32GB+ RAM
  4. Cost Management: Use Express mode for pilot studies, Standard mode for final analysis

📚 Documentation

Explore in-depth guides for each component:

  • Pipeline Stages: Batch Processing • Demultiplexing • Denoising • Merging
  • AI & Intelligence: Quality Control • LLM Integration
  • System Architecture: Data Flow • Log Flow • Frontend

🎓 Example Workflow

Here's a real-world example using marine biodiversity monitoring data:

# 1. Start PowerBarcoder
docker run -d -p 5000:5000 -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder powerbarcoder/powerbarcoder:latest

# 2. Access Web UI
# → http://localhost:5000

# 3. Upload data
# → Navigate to "New Batch"
# → Upload: marine_samples_R1.fastq.gz, marine_samples_R2.fastq.gz
# → Upload metadata: marine_metadata.csv

# 4. Configure
# → Reference: BOLD (COI database)
# → LLM: Claude 3.5 Sonnet
# → Mode: Standard

# 5. Start and monitor
# → Click "Start Pipeline"
# → Watch real-time progress in dashboard

# 6. Review results
# → Download ASV table: marine_samples_asv_table.csv
# → Download QC report: marine_samples_qc_report.xlsx
# → Check audit logs for ambiguous species assignments

# 7. Analyze
# → Import ASV table into R/Python for diversity analysis
# → Cross-reference unexpected species with GBIF occurrence data
# → Report findings with full audit trail

❓ FAQ

Q1: What sequencing platforms are supported?

PowerBarcoder works with any Illumina paired-end data (MiSeq, NovaSeq, etc.). We've tested with:

  • COI barcoding (600-700 bp amplicons)
  • 16S rRNA (V3-V4 region, ~450 bp)
  • ITS regions (variable length)
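Whether paired reads can overlap at all is simple arithmetic: approximate overlap ≈ 2 × read length − amplicon length. For instance, 2 × 300 bp MiSeq reads cannot span a 650 bp COI amplicon, which is exactly the gap that reference-guided merging targets:

```python
def expected_overlap(read_len, amplicon_len):
    """Approximate paired-end overlap in bp; negative means a gap."""
    return 2 * read_len - amplicon_len

print(expected_overlap(300, 650))  # → -50 (50 bp gap: no natural overlap)
print(expected_overlap(300, 450))  # → 150 (16S V3-V4 merges easily)
```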

Q2: Do I need an API key for LLM providers?

Yes, for cloud providers (OpenAI, Anthropic, Google). Set your API key via:

  • Environment variable: OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY
  • Web UI: Settings → LLM Configuration

For local models (Ollama), no API key needed.
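In scripts, a sensible pattern is to infer the provider from whichever API key is present and fall back to the local model otherwise. The precedence order here is our assumption, not PowerBarcoder's documented behavior:

```python
import os

# Sketch of choosing an LLM provider from the available API keys
# (the precedence order is our assumption).
KEY_TO_PROVIDER = [
    ("OPENAI_API_KEY", "openai"),
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("GOOGLE_API_KEY", "google"),
]

def pick_provider(env=os.environ):
    for key, provider in KEY_TO_PROVIDER:
        if env.get(key):
            return provider
    return "local"  # fall back to Ollama; no key required

print(pick_provider({"ANTHROPIC_API_KEY": "sk-ant-..."}))  # → anthropic
print(pick_provider({}))  # → local
```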

Q3: How much does LLM arbitration cost?

Cost depends on your data volume and provider:

  • GPT-4: ~$0.10 per 100 ambiguous merges
  • Claude 3.5: ~$0.05 per 100 merges
  • Gemini Pro: ~$0.02 per 100 merges
  • Local (Ollama): Free (requires GPU)

Typical batch (50 samples): $1-5 total LLM cost.
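The per-100-merge figures above make cost estimation a one-liner (prices are this README's estimates, not live provider rates):

```python
# Back-of-envelope arbitration cost from the README's per-100-merge rates.
RATE_PER_100 = {"gpt-4": 0.10, "claude-3.5": 0.05, "gemini-pro": 0.02, "local": 0.0}

def arbitration_cost(provider, ambiguous_merges):
    """Estimated USD cost of LLM arbitration for a batch."""
    return RATE_PER_100[provider] * ambiguous_merges / 100

print(f"${arbitration_cost('claude-3.5', 2000):.2f}")  # → $1.00
```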

Q4: Can I use PowerBarcoder offline?

Yes! Use Local mode with Ollama:

  1. Install Ollama: https://ollama.ai
  2. Pull a biology-capable model: ollama pull llama3.1:70b
  3. Configure PowerBarcoder to use Local provider

🤝 Contributing

We welcome contributions from the community:

  • 🐞 Bug Reports → GitHub Issues
  • ✨ Feature Requests → Share your use case and we'll consider it
  • 🔧 Code Contributions → Fork and submit a PR
  • 📖 Documentation → Help improve guides and tutorials
  • 🧪 Validation Studies → Share your results to help us improve

📖 Citation

If PowerBarcoder aids your research, please cite:

@software{powerbarcoder2025,
  title = {PowerBarcoder: A Reference-Guided and LLM-Arbitrated Pipeline for DNA Barcoding},
  author = {Kuo, W. and Contributors},
  year = {2025},
  url = {https://github.com/PowerBarcoder/PowerBarcoder},
  version = {2.0.0},
  note = {AI-native NGS pipeline for metabarcoding analysis}
}

📄 License

Licensed under Apache 2.0. Open source and free for academic and commercial use.
See LICENSE for details.


πŸ™ Acknowledgements

PowerBarcoder builds upon:

  • DADA2 for error modeling and denoising
  • BLAST+ for sequence alignment
  • BOLD and GenBank for reference databases

Special thanks to the bioinformatics and molecular ecology communities for feedback and validation.


Maximize your data. Minimize your doubts. 🚀
