Genomics Data Pipeline
What you'll learn (~30 min)

- Build a Python CLI pipeline for FASTQ quality control using a single AI prompt
- Understand streaming file parsing, config-driven workflows, and reproducible logging
- Troubleshoot common issues with file formats, gzipped files, and large dataset handling
- Extend the pipeline with MultiQC output, trimming recommendations, and comparison mode
What you’re building
Every sequencing run produces FASTQ files. Every FASTQ file needs quality assessment before downstream analysis. This is the most repetitive task in a genomics lab — and the most important one to automate correctly.
In this lesson you will build a Python CLI tool that takes a directory of FASTQ files, runs quality metrics on each one, and generates a clean HTML summary report with interactive charts. It is config-driven, logs everything, and produces reproducible output. A grad student can build this in an afternoon and use it for the rest of their PhD.
This is different from the browser-based tools in Lessons 1 and 2. This tool runs on the command line — on your laptop, on a lab workstation, or on a shared compute server. It processes real files, writes real output, and integrates into real workflows.
If your institution uses an HPC cluster, replace `python -m venv` with `conda create`, use `conda install` instead of `pip install`, and submit long-running jobs through your cluster's scheduler (SLURM, PBS, or SGE). The pipeline logic stays the same — only the environment setup changes.
The showcase
When finished, your pipeline will:
- Accept a directory of `.fastq` or `.fastq.gz` files as input.
- For each file, compute:
- Total read count
- Read length distribution
- Per-base quality score distribution (like FastQC’s per-base quality plot)
- Per-sequence quality score distribution
- GC content distribution
- Adapter content estimation (searching for common Illumina adapter sequences)
- Duplicate rate estimate (based on first 100,000 reads)
- Generate an HTML report with all metrics visualized as interactive Plotly charts.
- Save per-file JSON summaries for programmatic access.
- Log all operations with timestamps for reproducibility.
- Run from a YAML configuration file so you can standardize settings across your lab.
Think of it as a lightweight, customizable FastQC that you fully control.
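The duplicate-rate estimate in the feature list above can be sketched in a few lines (a simplified, illustrative version; the generated tool may implement it differently):

```python
def estimate_duplicate_rate(reads, prefix_len=50, sample_limit=100_000):
    """Estimate the duplicate fraction from read prefixes.

    Sketch of the approach described above: take the first `prefix_len`
    bases of up to `sample_limit` reads and count how many repeat a
    prefix already seen. `reads` is any iterable of sequence strings.
    """
    seen = set()
    sampled = 0
    for seq in reads:
        if sampled >= sample_limit:
            break
        seen.add(seq[:prefix_len])
        sampled += 1
    if sampled == 0:
        return 0.0
    # Reads whose prefix was already seen count as duplicates.
    return (sampled - len(seen)) / sampled
```

Using a fixed prefix rather than the full read makes the estimate robust to quality drop-off at the 3' end, at the cost of occasionally counting distinct reads with identical starts as duplicates.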
When you submit samples to a sequencing facility (Illumina, PacBio, Oxford Nanopore) or a gene expression center (10X Genomics single-cell, spatial transcriptomics, PacBio Iso-Seq), the data you get back is exactly what this pipeline processes: FASTQ files ranging from a few hundred megabytes to tens of gigabytes. The pipeline you are building is your first-pass QC — run it the moment data lands on the shared drive, before investing hours in alignment or assembly. If something went wrong during library prep or sequencing, you will catch it in the QC report rather than discovering it halfway through your analysis.
If you are working with hands-on genomic datasets in a bioinformatics course, this pipeline is a practical tool you can use throughout the course for every dataset you encounter.
The prompt
Open your AI CLI tool (such as Claude Code, Gemini CLI, or your preferred tool) in an empty directory and paste:
```
Create a Python CLI tool called fastq-qc that processes FASTQ files and generates
quality control reports. Use only Python 3.10+ standard library plus these PyPI
packages: plotly, pyyaml. No BioPython dependency.

PROJECT STRUCTURE:
fastq-qc/
├── fastq_qc/
│   ├── __init__.py
│   ├── cli.py           # argparse CLI entry point
│   ├── config.py        # YAML config loader with defaults
│   ├── fastq_parser.py  # streaming FASTQ parser (handles .gz)
│   ├── metrics.py       # quality metric calculators
│   ├── report.py        # HTML report generator with Plotly
│   └── logger.py        # timestamped logging setup
├── config.yaml          # default configuration file
├── requirements.txt     # plotly, pyyaml
├── setup.py             # pip install -e . support
└── README.md            # usage instructions

CLI INTERFACE:
  python -m fastq_qc --input /path/to/fastq/dir --output /path/to/reports
  python -m fastq_qc --input /path/to/fastq/dir --output /path/to/reports --config config.yaml

OPTIONS:
  --input, -i     Directory containing .fastq or .fastq.gz files (required)
  --output, -o    Output directory for reports (required, created if missing)
  --config, -c    Path to YAML config file (optional, uses defaults)
  --sample-size   Number of reads to sample for metrics (default: 500000, 0=all)
  --threads       Number of parallel files to process (default: 1)
  --verbose       Enable debug logging

FASTQ PARSER (fastq_parser.py):
- Streaming parser that yields (header, sequence, quality) tuples
- Handles both .fastq and .fastq.gz transparently
- Validates FASTQ format: header starts with @, separator line is +,
  sequence and quality have equal length
- Raises clear errors for malformed records with line numbers
- Memory-efficient: processes one record at a time, never loads entire file

METRICS (metrics.py):
Calculate these metrics per file:
1. read_count: total number of reads
2. read_lengths: Counter of read lengths (for distribution)
3. per_base_quality: for each position (0 to max_read_length), collect quality
   scores across all reads, report mean, median, Q1, Q3
4. per_sequence_quality: histogram of mean quality scores per read
5. gc_content: histogram of per-read GC fraction (bins of 1%)
6. base_content: per-position frequency of A, T, G, C, N
7. adapter_content: for each position, fraction of reads containing a known
   adapter k-mer starting at that position. Check TruSeq:
   AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and Nextera:
   CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
8. duplicate_estimate: hash first 50 nt of each read, count collisions among
   first 100K reads, extrapolate

CONFIG (config.yaml):
  sample_size: 500000
  min_quality_threshold: 20
  adapter_sequences:
    truseq: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    nextera: "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"
  report:
    title: "FASTQ Quality Control Report"
    plots_per_row: 2

REPORT (report.py):
Generate a single self-contained HTML file with embedded Plotly.js:
- Header with report title, generation timestamp, tool version
- Summary table: filename, read count, mean quality, mean GC%, pass/warn/fail
  status based on configurable thresholds
- Per-file section with:
  - Per-base quality box plot (like FastQC) with green/yellow/red background
    zones at quality 30/20/0
  - Read length distribution histogram
  - Per-sequence quality distribution histogram
  - GC content distribution with theoretical normal overlay
  - Per-base sequence content line plot (A/T/G/C/N across positions)
  - Adapter content line plot
- All charts are Plotly (interactive zoom, hover tooltips)
- Dark theme consistent with other tools in this track

LOGGING (logger.py):
- All operations logged with ISO timestamps
- Log to both console (INFO level) and file (DEBUG level)
- Log file saved in output directory as qc_run_YYYYMMDD_HHMMSS.log
- Log: config used, files found, per-file start/end times, any warnings

OUTPUT STRUCTURE:
  output_dir/
  ├── qc_report.html               # main HTML report
  ├── qc_run_20240115_143022.log   # run log
  └── per_file/
      ├── sample1.json             # per-file metrics as JSON
      └── sample2.json

Generate all files with complete, working implementations. The tool should work
end-to-end on real FASTQ files.
```

Real FASTQ files are large (1-50 GB each). The `--sample-size` flag exists for a reason — set it to 500,000 reads for quick QC during a run, or 0 (all reads) for a final comprehensive report. For your first test, create a small test FASTQ file or use the `--sample-size 1000` flag.
What you get
After the LLM generates the project, set it up in a virtual environment:
```bash
cd fastq-qc
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e .
```

Create a test file
If you do not have FASTQ files handy, ask the LLM to generate one:
```
Create a Python script called generate_test_fastq.py that generates a synthetic
FASTQ file with 10,000 reads. Read length 150 bp. Quality scores should simulate
typical Illumina data: high quality (30-40) for the first 100 bases, degrading to
20-30 for the last 50 bases. Include 5% of reads with adapter contamination
(TruSeq adapter appended). Include 2% duplicate reads. Save as test_data/sample1.fastq.
```

Then run the pipeline:
```bash
python -m fastq_qc --input test_data/ --output test_output/ --verbose
```

Expected output
```
test_output/
├── qc_report.html
├── qc_run_20240115_143022.log
└── per_file/
    └── sample1.json
```

Open `qc_report.html` in your browser. You should see:
- Summary table showing sample1.fastq with ~10,000 reads, mean quality around 30-33, and GC% around 50%.
- Per-base quality plot showing high quality on the left degrading toward the right — the classic Illumina quality curve.
- Adapter content plot showing a bump of ~5% adapter contamination increasing toward the 3’ end of reads.
- GC content distribution centered around 50% with a roughly normal shape.
This tool is designed to slot into an existing workflow. The per-file JSON output can be consumed by downstream scripts:
```python
import json

with open("test_output/per_file/sample1.json") as f:
    metrics = json.load(f)

if metrics["mean_quality"] < 25:
    print("WARNING: Low quality - consider re-sequencing")

if metrics["adapter_fraction"] > 0.1:
    print("WARNING: High adapter content - run trimming")
```

You can also add a post-processing step that automatically triggers adapter trimming (Trimmomatic, fastp) when adapter content exceeds a threshold. The config file makes it easy to standardize thresholds across your lab group.
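A hedged sketch of that trigger step might look like the following; the `adapter_fraction` key matches the JSON shown above, while the fastp invocation and threshold are illustrative:

```python
import json
import subprocess

def maybe_trim(metrics_path, fastq_path, trimmed_path, adapter_threshold=0.1):
    """Run fastp only when QC metrics show high adapter content.

    Assumes the per-file JSON contains an "adapter_fraction" key, as in
    the pipeline output above. The fastp command line is illustrative;
    adjust it for your trimmer of choice (e.g. Trimmomatic).
    """
    with open(metrics_path) as f:
        metrics = json.load(f)
    if metrics.get("adapter_fraction", 0.0) <= adapter_threshold:
        return False  # adapter content acceptable, no trimming needed
    # Only reached when adapter content exceeds the threshold.
    subprocess.run(
        ["fastp", "-i", fastq_path, "-o", trimmed_path],
        check=True,
    )
    return True
```

Returning a boolean lets a wrapper script log which samples were trimmed and which passed through untouched.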
Worked example: Processing a real MiSeq run
Here is a realistic scenario for a graduate student running amplicon sequencing on an Illumina MiSeq.
Step 1. Your sequencing core delivers data to a shared drive. You have a folder with 48 FASTQ files (24 samples, paired-end):
```
/sequencing/run_2024_01/
├── sample01_R1.fastq.gz
├── sample01_R2.fastq.gz
├── sample02_R1.fastq.gz
├── sample02_R2.fastq.gz
...
└── sample24_R2.fastq.gz
```

Step 2. Run the QC pipeline with sampling for a quick first look:
```bash
python -m fastq_qc \
  --input /sequencing/run_2024_01/ \
  --output /qc/run_2024_01/ \
  --sample-size 100000 \
  --verbose
```

This processes 100,000 reads from each file, taking about 2-3 minutes total.
Step 3. Open the HTML report and check the summary table. Look for:
- Red flags: any sample with mean quality below 25 (likely a library prep issue).
- Adapter content: R2 files typically show more adapter contamination than R1 in paired-end data.
- GC distribution: bimodal GC distributions suggest contamination or mixed organisms.
- Duplicate rate: amplicon data often shows 30-60% duplicates; whole-genome should be under 10%.
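The checklist above could also be codified in a small triage script; the thresholds mirror the bullets, and the field names are assumptions about the per-file JSON schema:

```python
def flag_sample(metrics):
    """Assign pass/warn/fail flags based on the checklist above.

    Thresholds are illustrative; "mean_quality", "adapter_fraction",
    and "duplicate_rate" are assumed keys in the per-file JSON.
    """
    flags = []
    if metrics["mean_quality"] < 25:
        flags.append("fail: low mean quality (possible library prep issue)")
    if metrics["adapter_fraction"] > 0.05:
        flags.append("warn: adapter contamination, consider trimming")
    if metrics["duplicate_rate"] > 0.6:
        flags.append("warn: duplicate rate above amplicon-typical range")
    return flags or ["pass"]
```

Running this over all 48 JSON files gives you a machine-readable triage list to accompany the HTML report.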
Step 4. If you see problems, run a comprehensive report on the flagged samples:
```bash
python -m fastq_qc \
  --input /sequencing/run_2024_01/ \
  --output /qc/run_2024_01_full/ \
  --sample-size 0 \
  --config config.yaml
```

For paired-end data, the QC tool processes R1 and R2 files independently. This is standard — most QC tools (including FastQC) do the same. To check that R1 and R2 are properly paired (same headers in the same order), you can add a follow-up prompt:
```
Add a --paired flag. When set, for each sample find the R1 and R2 files by name
pattern (_R1/_R2 or _1/_2). Verify that the read headers match between R1 and R2.
Report any orphaned reads or mismatched pairs. Show read-length distribution
for each mate in the report if both files are present.
```

Worked example: Comparing quality across sequencing runs
If you run the same library type regularly (e.g., 16S amplicon sequencing every month), you want to track quality over time:
```bash
# This month's run
python -m fastq_qc --input /sequencing/run_2024_01/ --output /qc/run_2024_01/

# Compare to last month (requires the --compare customization below)
python -m fastq_qc --input /sequencing/run_2024_01/ --output /qc/compare_jan/ \
  --compare /qc/run_2023_12/
```

The comparison mode (added via the `--compare` customization in the section below) overlays per-base quality plots, making it easy to spot instrument degradation. Build the base pipeline first, then add this feature.
When Things Go Wrong
Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.
How it works
The pipeline follows a standard pattern for scientific CLI tools:
- Parse arguments with argparse. Validate that input directory exists and contains FASTQ files.
- Load config from YAML file (or use defaults). This makes the tool reproducible — check the config into version control alongside your data.
- Stream through each file one record at a time. Never load the entire file into memory. For gzipped files, use Python's `gzip.open()`.
- Compute metrics as running aggregates. Per-base quality uses a position-indexed list of accumulators. GC content uses histogram bins. This keeps memory usage constant regardless of file size.
- Generate report by embedding Plotly.js from CDN and creating one chart per metric per file. The HTML file is self-contained except for the Plotly CDN link. For fully offline use, set `include_plotlyjs=True` in the report generator to embed the full Plotly library.
- Log everything so you can trace exactly what happened, when, and with what parameters.
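The running-aggregate idea from the list above can be illustrated with a minimal per-base quality accumulator (a sketch, not the generated `metrics.py`, which also tracks median and quartiles):

```python
class PerBaseQuality:
    """Accumulate per-position quality sums in constant memory.

    Phred scores are decoded from the quality string (ASCII offset 33).
    Only a running sum and count per position are kept, so memory use
    scales with read length, not file size.
    """

    def __init__(self):
        self.sums = []    # total quality observed at each position
        self.counts = []  # number of reads covering each position

    def add(self, quality_string):
        for i, ch in enumerate(quality_string):
            if i == len(self.sums):  # grow when a longer read appears
                self.sums.append(0)
                self.counts.append(0)
            self.sums[i] += ord(ch) - 33
            self.counts[i] += 1

    def means(self):
        return [s / c for s, c in zip(self.sums, self.counts)]
```

Feeding every record's quality string through `add()` as the parser streams the file yields the per-base mean curve without ever holding more than one read in memory.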
Customize it
Add MultiQC-compatible output
```
Add a MultiQC-compatible output module. Write metrics in the format that MultiQC
expects (see multiqc.info for the custom content specification). This way the
user can combine this tool's output with other tools (Trimmomatic, STAR, featureCounts)
into a single MultiQC report. Output the files to a multiqc_data/ subdirectory.
```

Add automatic trimming recommendations
```
Add a "Recommendations" section to the HTML report for each file. Based on the
metrics, automatically suggest:
- Whether adapter trimming is needed (adapter content > 5%)
- Suggested quality trim threshold (based on where mean per-base quality drops below 20)
- Whether the library is over-sequenced (duplicate rate > 30%)
- Expected data loss after trimming
Show these as a color-coded checklist (green=OK, yellow=warn, red=action needed).
```

Add email notification
```
Add a --notify flag that accepts an email address. When the pipeline finishes,
send a summary email using Python's smtplib with the local SMTP server. Include
the summary table (text format) and attach the HTML report. This is useful for
long-running jobs on a shared server.
```

Add comparison mode
```
Add a --compare flag that accepts a previous output directory. When set, the report
includes side-by-side comparisons: show per-base quality plots overlaid for the
current run vs the previous run. This helps catch sequencing quality degradation
over time. Highlight metrics that changed by more than 10%.
```

Put your config.yaml in a git repository alongside your analysis scripts. When you publish a paper, the config file documents exactly what QC thresholds you used. Reviewers increasingly ask for this level of reproducibility documentation.
Beyond Genomics: Document Processing Pipelines
The pipeline architecture you just built — input files, stage-by-stage processing, validation, structured output — transfers directly to non-biological domains. Here is a striking example from historical document processing.
Projects digitizing hundreds of thousands of archival military personnel files (spanning millions of pages) from WWII and Korea follow this same pattern. These are official casualty records of service members — handwritten forms, typed reports, telegrams, and medical records from the 1940s and 1950s.
The OCR processing pipeline follows a pattern you will recognize:
- Column detection — identify the column layout of each scanned page (OpenCV)
- Row/text extraction — extract text from each detected region (OCR engine)
- Validation — cross-check extracted fields against known formats (service number patterns, date formats, rank abbreviations)
- ML training data generation — validated output feeds back into model training for improved accuracy on the next batch
This is the same input → process → validate → output architecture as your FASTQ pipeline. The config-driven approach works the same way: page layout templates replace adapter sequences, OCR confidence thresholds replace quality score thresholds, and the HTML report summarizes extraction accuracy instead of sequencing quality.
Here is a prompt to build a document processing pipeline inspired by the AMCA architecture:
```
Create a Python CLI tool called doc-pipeline that processes scanned document images
through an OCR pipeline. Structure it identically to the FASTQ QC tool:

PROJECT STRUCTURE:
doc-pipeline/
├── doc_pipeline/
│   ├── cli.py             # argparse CLI entry point
│   ├── config.py          # YAML config loader
│   ├── image_loader.py    # load PNG/TIFF/JPEG images, detect page orientation
│   ├── column_detector.py # detect text columns using contour analysis (OpenCV)
│   ├── text_extractor.py  # extract text from detected regions (pytesseract)
│   ├── validator.py       # validate extracted fields against regex patterns
│   ├── report.py          # HTML summary report with extraction stats
│   └── logger.py          # timestamped logging
├── config.yaml            # field patterns, confidence thresholds
├── requirements.txt       # opencv-python, pytesseract, plotly, pyyaml
└── README.md

The config.yaml should include:
  field_patterns:
    service_number: "[A-Z]?\\d{7,8}"
    date: "\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}"
    rank: list of WWII military rank abbreviations
  confidence_threshold: 0.85
  output_format: "json"

The report should show per-page extraction results, field-level confidence scores,
and a summary of how many pages passed validation. Log everything with timestamps.
```

Pipelines like this have achieved name capture rates above 99% and service number accuracy above 97% across millions of pages — using the same Python/OpenCV/ML stack you learn in this module.
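As a sketch of the validation stage, assuming the `field_patterns` from the config above (the function name and constant are illustrative, not part of the generated tool):

```python
import re

# Illustrative patterns, mirroring the field_patterns in the config sketch.
FIELD_PATTERNS = {
    "service_number": r"[A-Z]?\d{7,8}",
    "date": r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",
}

def validate_fields(extracted):
    """Check each extracted field against its expected pattern.

    Returns a dict of field -> bool. Fields without a known pattern are
    skipped; a real validator would also carry OCR confidence scores.
    """
    results = {}
    for field, value in extracted.items():
        pattern = FIELD_PATTERNS.get(field)
        if pattern is None:
            continue  # no pattern configured for this field
        results[field] = re.fullmatch(pattern, value) is not None
    return results
```

This is the direct analogue of FASTQ format validation: structured expectations about the input, checked record by record, with failures surfaced in the report rather than silently propagated.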
The point: the skills you are building with FASTQ files transfer to any domain that processes structured data at scale. Historical document digitization, medical record extraction, environmental sensor data — the pipeline architecture is the same. The config file changes; the pattern does not.
The bigger picture
Browser tools (Lessons 1-2) are great for interactive, one-off analyses. But genomics workflows are batch processes — you get 48 samples from a sequencing run and need to QC all of them with consistent parameters. That requires CLI tools.
The pattern here scales:
- Single file: `python -m fastq_qc --input one_sample/ --output report/`
- Full run: `python -m fastq_qc --input /sequencing/run_042/ --output /qc/run_042/`
- Automated: Add the command to a Snakemake or Nextflow pipeline and it runs automatically when new data appears.
You built a reproducible, config-driven, well-logged pipeline tool. This is the kind of software that makes a lab more efficient — and you did it with one prompt and a few customizations.
Key takeaways
- CLI tools complement browser tools: interactive dashboards are great for exploration, but batch processing requires command-line tools that run unattended, log their work, and produce structured output.
- Config-driven workflows are reproducible: a YAML config file documents your QC parameters. Check it into version control alongside your analysis scripts so reviewers and collaborators can reproduce your exact settings.
- Streaming parsers prevent memory disasters: real FASTQ files are 1-50 GB each. If the parser loads the entire file into memory, it will crash on a typical lab workstation. Always process one record at a time.
- Gzipped files require explicit handling: the most common crash in genomics pipelines is opening a `.gz` file with plain `open()` instead of `gzip.open()`. This is the first thing to check when a parser fails.
- The `--sample-size` flag is essential: quick QC on 100K reads takes seconds. Full QC on 500M reads takes hours. Use sampling for rapid checks during a run, and full analysis for the final report.
- Reproducibility requires version pinning and random seeds: pin your dependency versions in `requirements.txt` (e.g., `plotly==5.18.0`) and set random seeds for any sampling or hashing operations. This ensures that re-running the pipeline on the same data produces the same results.
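The gzip pitfall above can be avoided with a small helper that sniffs the gzip magic bytes instead of trusting the file extension (an illustrative sketch; the generated parser may do this differently):

```python
import gzip

def open_fastq(path):
    """Open a FASTQ file in text mode, transparently handling gzip.

    Gzipped files start with the magic bytes 0x1f 0x8b, so sniffing the
    header is more robust than checking for a .gz extension, which can
    be missing or wrong on renamed files.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")  # "rt" decodes to text lines
    return open(path, "rt")
```

Note the `"rt"` mode on `gzip.open()`: without it, the parser receives bytes instead of strings, another classic source of confusing crashes.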
Portfolio suggestion
This pipeline tool is one of the strongest portfolio pieces from the module because it demonstrates real-world engineering practices:
- Save the project including the config.yaml, a sample QC report (HTML), and the log file from a run.
- Document the workflow in the README: what input it takes, what output it produces, and how to interpret the results.
- Include a “lessons learned” section describing any bugs you encountered and how you fixed them with follow-up prompts.
If you want to go further, run the tool on real sequencing data from your lab and include a de-identified report in your portfolio. A QC report on real data is far more compelling than synthetic test data. You can also add a Snakemake rule that integrates this tool into a larger pipeline — even a two-step pipeline (QC then trim) demonstrates pipeline orchestration skills.
🔍 Advanced: Adding VCF and BAM support
Once you have the FASTQ QC pipeline working, you can extend it to handle other common genomics file formats. Each of these is a follow-up prompt:
BAM/SAM QC:
```
Add support for BAM/SAM files. Use pysam (add to requirements.txt) to parse
aligned reads. Calculate alignment rate, mapping quality distribution, insert
size distribution (paired-end), and coverage per chromosome. Add these as
additional panels in the HTML report.
```

VCF summary:
```
Add support for VCF files. Parse the VCF header and variant records. Report:
total variants, SNP/indel ratio, Ti/Tv ratio, variant quality distribution,
allele frequency spectrum, and per-chromosome variant counts. This gives a quick
overview of a variant calling run.
```

These extensions turn your QC tool into a comprehensive genomics data summarizer — like a custom MultiQC that you fully control and can extend for any file format your lab uses.
Check your understanding: your pipeline crashes with a `UnicodeDecodeError` on the first FASTQ file. The file has a `.fastq.gz` extension. What is the most likely cause?
Try it yourself
- Generate the pipeline with the prompt above.
- Create a test FASTQ file (or use one from your own data).
- Run the pipeline and open the HTML report.
- Examine the per-base quality plot. Does the quality degradation pattern match what you expect from your sequencing platform?
- Try the `--sample-size` flag with different values and observe the speed/accuracy tradeoff.
- Add one customization from the list above.
What’s next
In Lesson 4, you will build a mass spectrometry data viewer — an interactive Plotly-based application for visualizing spectra, detecting peaks, overlaying runs, and exporting annotated figures. This is directly relevant to proteomics and mass spectrometry core facilities.