
PowerBarcoder 🧬

Stop discarding your best data. Let AI rescue it.


The AI-Native Platform for Next-Generation DNA Barcoding



🚀 The Problem: Long Amplicons, Short Reads

Standard pipelines discard up to 30% of valuable data because forward and reverse reads don't overlap sufficiently. You lose critical biological signals simply because traditional algorithms are too rigid.

💡 The Solution: PowerBarcoder

PowerBarcoder combines Reference-Guided Merging with LLM Arbitration to recover those lost sequences with high fidelity. Think of it as having a bioinformatician review every ambiguous read, at machine speed.
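The merging idea can be sketched in a few lines. This is an illustrative toy, not PowerBarcoder's actual implementation: it assumes both reads map exactly onto a reference and bridges any non-overlapping gap with reference sequence.

```python
# Toy sketch of reference-guided merging (hypothetical function, not the
# pipeline's real API; real merging must tolerate mismatches and indels).
def reference_guided_merge(fwd: str, rev: str, reference: str):
    """Place fwd/rev reads on the reference; splice if they overlap,
    otherwise fill the gap from the reference itself."""
    fwd_start = reference.find(fwd)
    rev_start = reference.find(rev)
    if fwd_start == -1 or rev_start == -1:
        return None  # a read does not map cleanly; candidate for arbitration
    fwd_end = fwd_start + len(fwd)
    if fwd_end >= rev_start:
        # Reads overlap (or abut): splice at the junction.
        return fwd + rev[fwd_end - rev_start:]
    # No overlap: bridge the gap with reference sequence.
    return fwd + reference[fwd_end:rev_start] + rev

ref = "ACGTACGTTTGGCCAAGGTTACGT"
print(reference_guided_merge("ACGTACGT", "GGTTACGT", ref))  # → full ref
```

Cases where `find` fails, or where several placements disagree, are exactly the ambiguous merges the LLM arbitration step is meant to review.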


⚡ Why Researchers Choose PowerBarcoder

🎯 Maximize Data Recovery

Don't settle for standard yields. Our Two-Stage Error Learning adapts to your batch-specific noise patterns, recovering significantly more ASVs than traditional DADA2 pipelines.

🤖 AI-Powered Accuracy

When algorithms disagree, PowerBarcoder calls in an expert. Our LLM Arbitration Engine (compatible with GPT-4, Claude, Gemini, and local models) reviews ambiguous cases, achieving 91.5% consistency with human domain experts.

🔬 Hybrid Precision

We don't rely solely on public databases. Our Hybrid Reference Strategy combines global references (BOLD, GenBank) with your batch's specific data for unmatched alignment accuracy.


πŸ› οΈ Built for Modern Science

Feature Benefit
🐳 Docker Ready Deploy anywhere in minutes. No dependency hell.
πŸ–₯️ Beautiful Web UI No command-line guessing. Monitor progress in real-time.
πŸ“Š Audit-Ready Every AI decision comes with a reasoning log. Full traceability.
πŸ”„ Streaming Analysis Watch your data process live via Server-Sent Events (SSE).
πŸ“ˆ Quality Reports Automated Excel reports with charts and statistics.

🏁 Get Started in 3 Minutes

Option 1: Docker (Recommended) 🐳

Step 1: Pull and Run

# Pull the latest image
docker pull powerbarcoder/powerbarcoder:latest

# Run the container
docker run -d -p 5000:5000 \
  -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder \
  powerbarcoder/powerbarcoder:latest

Step 2: Open the Web UI

Navigate to http://localhost:5000 in your browser.

Step 3: Upload Your Data

  1. Click "New Batch" in the dashboard
  2. Upload your paired-end FASTQ files (.fastq.gz or .fq.gz)
  3. Provide your sample metadata CSV (optional but recommended)
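A minimal metadata CSV might look like the snippet below. The column names are illustrative assumptions, not the pipeline's required schema; check the in-app template for the exact fields.

```python
import csv
import io

# Illustrative metadata layout (column names are an assumption).
metadata_csv = """\
sample_id,site,date,replicate
Site01_Rep1,Site01,2025-03-14,1
Site01_Rep2,Site01,2025-03-14,2
Site02_Rep1,Site02,2025-03-15,1
"""

rows = list(csv.DictReader(io.StringIO(metadata_csv)))
print(len(rows), rows[0]["sample_id"])  # → 3 Site01_Rep1
```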

Step 4: Configure Pipeline

  • Working Directory: Auto-created in /PowerBarcoder_data/batch_YYYYMMDD_HHMMSS
  • Reference Database: Choose BOLD, GenBank, or upload custom FASTA
  • LLM Provider (optional): OpenAI, Anthropic, Google, or Local (Ollama)
  • Pipeline Mode: Standard or Express (skips LLM arbitration)
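The auto-created working directory follows the batch_YYYYMMDD_HHMMSS pattern, which you can reproduce when scripting around the output tree (the helper function here is our own, not part of PowerBarcoder):

```python
from datetime import datetime
from pathlib import PurePosixPath

def batch_dir(root="/PowerBarcoder_data", now=None):
    """Reproduce the batch_YYYYMMDD_HHMMSS working-directory naming."""
    now = now or datetime.now()
    return PurePosixPath(root) / now.strftime("batch_%Y%m%d_%H%M%S")

print(batch_dir(now=datetime(2025, 3, 14, 9, 30, 5)))
# → /PowerBarcoder_data/batch_20250314_093005
```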

Step 5: Start Processing

Click "Start Pipeline" and monitor progress in real-time. The dashboard shows:

  • Current pipeline stage (Demultiplex → Denoise → Merge → BLAST → QC)
  • Progress percentage
  • Live logs via SSE streaming
  • Estimated time remaining
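Because SSE is a plain-text protocol, the live log stream can also be consumed outside the browser. A minimal parser for the SSE wire format (the event names in the sample are made up for illustration):

```python
def parse_sse(stream_text):
    """Parse Server-Sent Events wire format into (event, data) pairs.
    Events are separated by blank lines; fields are 'event:' and 'data:'."""
    events = []
    event, data_lines = "message", []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:  # blank line terminates an event
            events.append((event, "\n".join(data_lines)))
            event, data_lines = "message", []
    return events

sample = "event: progress\ndata: 42%\n\ndata: Denoise stage started\n\n"
print(parse_sse(sample))
# → [('progress', '42%'), ('message', 'Denoise stage started')]
```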

Step 6: Download Results

When complete, download:

  • ASV Table (CSV): Final species identification results
  • QC Report (Excel): Quality metrics with charts
  • Merge Tracking (CSV): Detailed merge statistics
  • Audit Logs: AI reasoning logs (if LLM was used)

Option 2: Local Installation (For Developers) 💻

# Clone the repository
git clone https://github.com/PowerBarcoder/PowerBarcoder.git
cd PowerBarcoder

# Install dependencies
pip install -r requirements.txt

# Install R dependencies (dada2 is distributed via Bioconductor, not CRAN)
Rscript -e "install.packages(c('optparse', 'jsonlite', 'BiocManager')); BiocManager::install('dada2')"

# Run the application
python app.py

Then open http://localhost:5000 in your browser.


💡 Best Practices

📋 Data Preparation

  1. Quality Check First: Run FastQC on your raw reads to identify potential issues
  2. Consistent Naming: Use clear, consistent sample names (e.g., Site01_Rep1_R1.fastq.gz)
  3. Metadata is Key: Provide a CSV with sample info (sampling site, date, replicate number)
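A naming convention like Site01_Rep1_R1.fastq.gz can be enforced up front with a small check. The pattern below encodes our suggested convention, not a pipeline requirement:

```python
import re

# Validate the Site01_Rep1_R1.fastq.gz naming convention
# (the pattern is our suggestion, not a pipeline requirement).
NAME_RE = re.compile(
    r"^(?P<site>\w+)_(?P<rep>Rep\d+)_(?P<read>R[12])\.(fastq|fq)\.gz$"
)

def check_name(filename):
    """Return the parsed name parts, or None if the name is malformed."""
    m = NAME_RE.match(filename)
    return m.groupdict() if m else None

print(check_name("Site01_Rep1_R1.fastq.gz"))
print(check_name("sampleA.fastq.gz"))  # → None
```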

🎯 Pipeline Configuration

  1. Choose the Right Mode:

    • Standard Mode: For publication-quality results (includes LLM arbitration)
    • Express Mode: For quick exploratory analysis (skips LLM)
  2. Reference Selection:

    • Use BOLD for animal barcoding (COI)
    • Use GenBank for broader taxonomic coverage
    • Upload custom FASTA for targeted studies (e.g., invasive species monitoring)
  3. LLM Provider Selection:

    • GPT-4: Best overall accuracy, higher cost
    • Claude 3.5: Strong biological reasoning, competitive pricing
    • Gemini Pro: Fast and cost-effective
    • Local (Ollama): Free but requires powerful GPU

🔬 Quality Control

  1. Review QC Report: Check for batch effects, chimera rates, and merge success rates
  2. Inspect Audit Logs: For critical decisions, review AI reasoning in audit logs
  3. Compare Replicates: Technical replicates should show high consistency
  4. Validate Outliers: Unexpected species may be real discoveries, or contamination

📊 Result Interpretation

  1. ASV Abundance: Higher read count ≠ higher biomass (PCR bias exists)
  2. Species Assignment: Check confidence scores (based on BLAST identity %)
  3. Merge Statistics: Low merge rates may indicate:
    • Amplicon too long for read length
    • Low library quality
    • Primer mismatch
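Low merge rates are easy to screen for programmatically from per-sample statistics. The field names and the 50% threshold below are illustrative assumptions, not the actual Merge Tracking schema:

```python
# Flag samples with low merge success (threshold and field layout are
# illustrative assumptions, not the Merge Tracking CSV schema).
def flag_low_merge(stats, threshold=0.5):
    """stats: {sample_id: (merged_reads, total_reads)} -> flagged (id, rate)."""
    flagged = []
    for sample, (merged, total) in stats.items():
        rate = merged / total if total else 0.0
        if rate < threshold:
            flagged.append((sample, round(rate, 2)))
    return flagged

stats = {"Site01_Rep1": (9200, 10000), "Site02_Rep1": (3100, 10000)}
print(flag_low_merge(stats))  # → [('Site02_Rep1', 0.31)]
```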

🚀 Optimization Tips

  1. Batch Processing: Group samples by project for consistent reference selection
  2. Iterative Refinement: Re-run with adjusted parameters if initial results are poor
  3. Resource Allocation: For large batches (>100 samples), allocate 32GB+ RAM
  4. Cost Management: Use Express mode for pilot studies, Standard mode for final analysis

📚 Documentation

Explore in-depth guides for each component:

  • Pipeline Stages: Batch Processing • Demultiplexing • Denoising • Merging
  • AI & Intelligence: Quality Control • LLM Integration
  • System Architecture: Data Flow • Log Flow • Frontend

🎓 Example Workflow

Here's a real-world example using marine biodiversity monitoring data:

# 1. Start PowerBarcoder
docker run -d -p 5000:5000 -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder powerbarcoder/powerbarcoder:latest

# 2. Access Web UI
# → http://localhost:5000

# 3. Upload data
# → Navigate to "New Batch"
# → Upload: marine_samples_R1.fastq.gz, marine_samples_R2.fastq.gz
# → Upload metadata: marine_metadata.csv

# 4. Configure
# → Reference: BOLD (COI database)
# → LLM: Claude 3.5 Sonnet
# → Mode: Standard

# 5. Start and monitor
# → Click "Start Pipeline"
# → Watch real-time progress in dashboard

# 6. Review results
# → Download ASV table: marine_samples_asv_table.csv
# → Download QC report: marine_samples_qc_report.xlsx
# → Check audit logs for ambiguous species assignments

# 7. Analyze
# → Import ASV table into R/Python for diversity analysis
# → Cross-reference unexpected species with GBIF occurrence data
# → Report findings with full audit trail

❓ FAQ

Q1: What sequencing platforms are supported?

PowerBarcoder works with any Illumina paired-end data (MiSeq, NovaSeq, etc.). We've tested with:

  • COI barcoding (600-700 bp amplicons)
  • 16S rRNA (V3-V4 region, ~450 bp)
  • ITS regions (variable length)
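Whether paired reads can overlap at all is simple arithmetic: approximate overlap ≈ 2 × read length − amplicon length. For instance, 2 × 300 bp MiSeq reads cannot span a 650 bp COI amplicon, which is exactly the gap that reference-guided merging targets:

```python
def expected_overlap(read_len, amplicon_len):
    """Approximate paired-end overlap in bp; negative means a gap."""
    return 2 * read_len - amplicon_len

print(expected_overlap(300, 650))  # → -50 (50 bp gap: no natural overlap)
print(expected_overlap(300, 450))  # → 150 (16S V3-V4 merges easily)
```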

Q2: Do I need an API key for LLM providers?

Yes, for cloud providers (OpenAI, Anthropic, Google). Set your API key via:

  • Environment variable: OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY
  • Web UI: Settings → LLM Configuration

For local models (Ollama), no API key needed.
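In scripts, a sensible pattern is to infer the provider from whichever API key is present and fall back to the local model otherwise. The precedence order here is our assumption, not PowerBarcoder's documented behavior:

```python
import os

# Sketch of choosing an LLM provider from the available API keys
# (the precedence order is our assumption).
KEY_TO_PROVIDER = [
    ("OPENAI_API_KEY", "openai"),
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("GOOGLE_API_KEY", "google"),
]

def pick_provider(env=os.environ):
    for key, provider in KEY_TO_PROVIDER:
        if env.get(key):
            return provider
    return "local"  # fall back to Ollama; no key required

print(pick_provider({"ANTHROPIC_API_KEY": "sk-ant-..."}))  # → anthropic
print(pick_provider({}))  # → local
```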

Q3: How much does LLM arbitration cost?

Cost depends on your data volume and provider:

  • GPT-4: ~$0.10 per 100 ambiguous merges
  • Claude 3.5: ~$0.05 per 100 merges
  • Gemini Pro: ~$0.02 per 100 merges
  • Local (Ollama): Free (requires GPU)

Typical batch (50 samples): $1-5 total LLM cost.
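The per-100-merge figures above make cost estimation a one-liner (prices are this README's estimates, not live provider rates):

```python
# Back-of-envelope arbitration cost from the README's per-100-merge rates.
RATE_PER_100 = {"gpt-4": 0.10, "claude-3.5": 0.05, "gemini-pro": 0.02, "local": 0.0}

def arbitration_cost(provider, ambiguous_merges):
    """Estimated USD cost of LLM arbitration for a batch."""
    return RATE_PER_100[provider] * ambiguous_merges / 100

print(f"${arbitration_cost('claude-3.5', 2000):.2f}")  # → $1.00
```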

Q4: Can I use PowerBarcoder offline?

Yes! Use Local mode with Ollama:

  1. Install Ollama: https://ollama.ai
  2. Pull a biology-capable model: ollama pull llama3.1:70b
  3. Configure PowerBarcoder to use Local provider

🤝 Contributing

We welcome contributions from the community:

  • 🐞 Bug Reports → GitHub Issues
  • ✨ Feature Requests → Share your use case and we'll consider it
  • 🔧 Code Contributions → Fork and submit a PR
  • 📖 Documentation → Help improve guides and tutorials
  • 🧪 Validation Studies → Share your results to help us improve

📖 Citation

If PowerBarcoder aids your research, please cite:

@software{powerbarcoder2025,
  title = {PowerBarcoder: A Reference-Guided and LLM-Arbitrated Pipeline for DNA Barcoding},
  author = {Kuo, W. and Contributors},
  year = {2025},
  url = {https://github.com/PowerBarcoder/PowerBarcoder},
  version = {2.0.0},
  note = {AI-native NGS pipeline for metabarcoding analysis}
}

📄 License

Licensed under Apache 2.0. Open source and free for academic and commercial use.
See LICENSE for details.


πŸ™ Acknowledgements

PowerBarcoder builds upon:

  • DADA2 for error modeling and denoising
  • BLAST+ for sequence alignment
  • BOLD and GenBank for reference databases

Special thanks to the bioinformatics and molecular ecology communities for feedback and validation.


Maximize your data. Minimize your doubts. 🚀
