Stop discarding your best data. Let AI rescue it.
The AI-Native Platform for Next-Generation DNA Barcoding
Standard pipelines discard up to 30% of valuable data because forward and reverse reads don't overlap sufficiently. You lose critical biological signals simply because traditional algorithms are too rigid.
PowerBarcoder combines Reference-Guided Merging with LLM Arbitration to recover those lost sequences with high fidelity. Think of it as having a bioinformatician review every ambiguous read, at machine speed.
Don't settle for standard yields. Our Two-Stage Error Learning adapts to your batch-specific noise patterns, recovering significantly more ASVs than traditional DADA2 pipelines.
When algorithms disagree, PowerBarcoder calls in an expert. Our LLM Arbitration Engine (compatible with GPT-4, Claude, Gemini, and local models) reviews ambiguous cases, achieving 91.5% consistency with human domain experts.
We don't rely solely on public databases. Our Hybrid Reference Strategy combines global references (BOLD, GenBank) with your batch's specific data for unmatched alignment accuracy.
| Feature | Benefit |
|---|---|
| Docker Ready | Deploy anywhere in minutes. No dependency hell. |
| Beautiful Web UI | No command-line guessing. Monitor progress in real time. |
| Audit-Ready | Every AI decision comes with a reasoning log. Full traceability. |
| Streaming Analysis | Watch your data process live via Server-Sent Events (SSE). |
| Quality Reports | Automated Excel reports with charts and statistics. |
Step 1: Pull and Run
# Pull the latest image
docker pull powerbarcoder/powerbarcoder:latest

# Run the container
docker run -d -p 5000:5000 \
  -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder \
  powerbarcoder/powerbarcoder:latest

Step 2: Open the Web UI
Navigate to http://localhost:5000 in your browser.
Step 3: Upload Your Data
- Click "New Batch" in the dashboard
- Upload your paired-end FASTQ files (`.fastq.gz` or `.fq.gz`)
- Provide your sample metadata CSV (optional but recommended)
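Before uploading, it is worth confirming that every forward (R1) file has a matching reverse (R2) mate, since an unpaired file will stall a paired-end run. A minimal shell check (the file names below are invented examples, not files the pipeline creates):

```shell
# Create a small demo directory with one deliberately unpaired R1 file.
mkdir -p fastq_demo && cd fastq_demo
touch Site01_Rep1_R1.fastq.gz Site01_Rep1_R2.fastq.gz Site02_Rep1_R1.fastq.gz

# For each R1 file, derive the expected R2 name and check it exists.
missing=0
for r1 in *_R1.fastq.gz; do
  r2=${r1%_R1.fastq.gz}_R2.fastq.gz
  [ -e "$r2" ] || { echo "missing mate for $r1"; missing=$((missing + 1)); }
done
echo "unpaired R1 files: $missing"
cd ..
```

Adjust the `_R1`/`_R2` pattern if your sequencing facility uses a different suffix convention.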
Step 4: Configure Pipeline
- Working Directory: Auto-created in `/PowerBarcoder_data/batch_YYYYMMDD_HHMMSS`
- Reference Database: Choose BOLD, GenBank, or upload a custom FASTA
- LLM Provider (optional): OpenAI, Anthropic, Google, or Local (Ollama)
- Pipeline Mode: Standard or Express (skips LLM arbitration)
Step 5: Start Processing
Click "Start Pipeline" and monitor progress in real time. The dashboard shows:
- Current pipeline stage (Demultiplex → Denoise → Merge → BLAST → QC)
- Progress percentage
- Live logs via SSE streaming
- Estimated time remaining
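The live log view uses the standard Server-Sent Events wire format: each event is a `data: ...` line followed by a blank line, which any HTTP client can consume. The sketch below parses a simulated stream; the JSON field names are illustrative, not PowerBarcoder's actual event schema:

```shell
# Simulate an SSE stream (in reality this would come from the running
# server) and extract the payload of each "data:" line with sed.
events=$(printf 'data: {"stage":"Denoise","progress":42}\n\ndata: {"stage":"Merge","progress":55}\n\n' |
  sed -n 's/^data: //p')
echo "$events"
```

To consume a real SSE endpoint from the command line, `curl -N <url>` (no buffering) prints events as they arrive.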
Step 6: Download Results
When complete, download:
- ASV Table (CSV): Final species identification results
- QC Report (Excel): Quality metrics with charts
- Merge Tracking (CSV): Detailed merge statistics
- Audit Logs: AI reasoning logs (if LLM was used)
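The downloaded CSVs are plain text, so quick sanity checks need nothing beyond standard Unix tools. A sketch that summarizes an ASV table with awk; the column layout used here (`asv_id,sample,reads,species`) is an assumption for illustration, so check the header of your actual file first:

```shell
# Write a tiny example ASV table (hypothetical columns and species).
cat > asv_demo.csv <<'EOF'
asv_id,sample,reads,species
ASV_001,Site01,1520,Gadus morhua
ASV_002,Site01,310,Clupea harengus
ASV_003,Site02,980,Gadus morhua
EOF

# Sum the read counts (column 3), skipping the header row.
total_reads=$(awk -F',' 'NR > 1 { sum += $3 } END { print sum }' asv_demo.csv)
echo "ASVs: $(($(wc -l < asv_demo.csv) - 1)), total reads: ${total_reads}"
```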
# Clone the repository
git clone https://github.com/PowerBarcoder/PowerBarcoder.git
cd PowerBarcoder

# Install dependencies
pip install -r requirements.txt

# Install R dependencies (DADA2 is distributed via Bioconductor, not CRAN)
Rscript -e "install.packages(c('optparse', 'jsonlite'))"
Rscript -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('dada2')"

# Run the application
python app.py

Then open http://localhost:5000 in your browser.
- Quality Check First: Run FastQC on your raw reads to identify potential issues
- Consistent Naming: Use clear, consistent sample names (e.g., `Site01_Rep1_R1.fastq.gz`)
- Metadata is Key: Provide a CSV with sample info (sampling site, date, replicate number)
- Choose the Right Mode:
- Standard Mode: For publication-quality results (includes LLM arbitration)
- Express Mode: For quick exploratory analysis (skips LLM)
- Reference Selection:
- Use BOLD for animal barcoding (COI)
- Use GenBank for broader taxonomic coverage
- Upload custom FASTA for targeted studies (e.g., invasive species monitoring)
- LLM Provider Selection:
- GPT-4: Best overall accuracy, higher cost
- Claude 3.5: Strong biological reasoning, competitive pricing
- Gemini Pro: Fast and cost-effective
- Local (Ollama): Free but requires powerful GPU
- Review QC Report: Check for batch effects, chimera rates, and merge success rates
- Inspect Audit Logs: For critical decisions, review AI reasoning in audit logs
- Compare Replicates: Technical replicates should show high consistency
- Validate Outliers: Unexpected species may be real discoveries, or contamination
- ASV Abundance: Higher read count ≠ higher biomass (PCR bias exists)
- Species Assignment: Check confidence scores (based on BLAST identity %)
- Merge Statistics: Low merge rates may indicate:
- Amplicon too long for read length
- Low library quality
- Primer mismatch
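The merge-tracking CSV makes it easy to flag the samples that drive a low overall merge rate. A sketch with awk; the column names (`sample,input_pairs,merged_pairs`) and the 70% threshold are assumptions for illustration, so adapt both to your actual file and project standards:

```shell
# Write a tiny example merge-tracking table (hypothetical columns).
cat > merge_demo.csv <<'EOF'
sample,input_pairs,merged_pairs
Site01,10000,8900
Site02,10000,4200
EOF

# Print any sample whose merged/input ratio falls below 0.7.
low=$(awk -F',' 'NR > 1 && $3 / $2 < 0.7 { print $1 }' merge_demo.csv)
echo "samples below 70% merge rate: ${low}"
```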
- Batch Processing: Group samples by project for consistent reference selection
- Iterative Refinement: Re-run with adjusted parameters if initial results are poor
- Resource Allocation: For large batches (>100 samples), allocate 32GB+ RAM
- Cost Management: Use Express mode for pilot studies, Standard mode for final analysis
Explore in-depth guides for each component:
| Topic | Guide |
|---|---|
| Pipeline Stages | Batch Processing • Demultiplexing • Denoising • Merging |
| AI & Intelligence | Quality Control • LLM Integration |
| System Architecture | Data Flow • Log Flow • Frontend |
Here's a real-world example using marine biodiversity monitoring data:
# 1. Start PowerBarcoder
docker run -d -p 5000:5000 -v $(pwd)/data:/PowerBarcoder_data \
  --name powerbarcoder powerbarcoder/powerbarcoder:latest

# 2. Access Web UI
#    → http://localhost:5000

# 3. Upload data
#    → Navigate to "New Batch"
#    → Upload: marine_samples_R1.fastq.gz, marine_samples_R2.fastq.gz
#    → Upload metadata: marine_metadata.csv

# 4. Configure
#    → Reference: BOLD (COI database)
#    → LLM: Claude 3.5 Sonnet
#    → Mode: Standard

# 5. Start and monitor
#    → Click "Start Pipeline"
#    → Watch real-time progress in dashboard

# 6. Review results
#    → Download ASV table: marine_samples_asv_table.csv
#    → Download QC report: marine_samples_qc_report.xlsx
#    → Check audit logs for ambiguous species assignments

# 7. Analyze
#    → Import ASV table into R/Python for diversity analysis
#    → Cross-reference unexpected species with GBIF occurrence data
#    → Report findings with full audit trail

PowerBarcoder works with any Illumina paired-end data (MiSeq, NovaSeq, etc.). We've tested with:
- COI barcoding (600-700 bp amplicons)
- 16S rRNA (V3-V4 region, ~450 bp)
- ITS regions (variable length)
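The amplicon lengths above also show why reference-guided merging matters: paired reads can only overlap by `2 × read_length − amplicon_length` bases. The arithmetic sketch below assumes 2×300 bp sequencing (a common MiSeq configuration) against a 650 bp COI amplicon:

```shell
# Overlap between paired reads: 2 x read_length - amplicon_length.
# A negative value means the reads cannot overlap at all, which is
# exactly the case that standard overlap-based merging discards.
read_len=300
amplicon=650
overlap=$((2 * read_len - amplicon))
echo "overlap: ${overlap} bp"
```

For a 450 bp 16S V3-V4 amplicon, the same formula gives a comfortable 150 bp overlap, which is why those reads merge well with standard tools.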
Yes, for cloud providers (OpenAI, Anthropic, Google). Set your API key via:
- Environment variable: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `GOOGLE_API_KEY`
- Web UI: Settings → LLM Configuration
For local models (Ollama), no API key needed.
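A minimal sketch of setting a key via the environment variable route (the key value below is a placeholder; substitute your real key):

```shell
# Export the key for your chosen provider so child processes can see it.
# "sk-placeholder" is a dummy value for illustration only.
export ANTHROPIC_API_KEY="sk-placeholder"

# Confirm the variable is set before starting the server.
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set"
```

If you run the Docker image, forward the variable into the container with docker's standard `-e ANTHROPIC_API_KEY` flag on the `docker run` line shown in Step 1.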
Cost depends on your data volume and provider:
- GPT-4: ~$0.10 per 100 ambiguous merges
- Claude 3.5: ~$0.05 per 100 merges
- Gemini Pro: ~$0.02 per 100 merges
- Local (Ollama): Free (requires GPU)
Typical batch (50 samples): $1-5 total LLM cost.
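A back-of-envelope estimate using the per-100-merge rates quoted above (the 2,000 ambiguous merges is an illustrative figure, not a measured typical value):

```shell
# Estimated cost = (ambiguous merges / 100) x rate per 100 merges.
ambiguous_merges=2000      # hypothetical count for a batch
rate_per_100=0.05          # Claude 3.5 rate quoted above, in USD
cost=$(awk -v n="$ambiguous_merges" -v r="$rate_per_100" \
  'BEGIN { printf "%.2f", n / 100 * r }')
echo "estimated cost: \$${cost}"
```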
Yes! Use Local mode with Ollama:
- Install Ollama: https://ollama.ai
- Pull a biology-capable model: `ollama pull llama3.1:70b`
- Configure PowerBarcoder to use the Local provider
We welcome contributions from the community:
- Bug Reports → GitHub Issues
- Feature Requests → Share your use case and we'll consider it
- Code Contributions → Fork and submit a PR
- Documentation → Help improve guides and tutorials
- Validation Studies → Share your results to help us improve
If PowerBarcoder aids your research, please cite:
@software{powerbarcoder2025,
  title   = {PowerBarcoder: A Reference-Guided and LLM-Arbitrated Pipeline for DNA Barcoding},
  author  = {Kuo, W. and Contributors},
  year    = {2025},
  url     = {https://github.com/PowerBarcoder/PowerBarcoder},
  version = {2.0.0},
  note    = {AI-native NGS pipeline for metabarcoding analysis}
}

Licensed under Apache 2.0. Open source and free for academic and commercial use.
See LICENSE for details.
PowerBarcoder builds upon:
- DADA2 for error modeling and denoising
- BLAST+ for sequence alignment
- BOLD and GenBank for reference databases
Special thanks to the bioinformatics and molecular ecology communities for feedback and validation.
Maximize your data. Minimize your doubts.
