Applied Module 12 · AI-Powered Bioinformatics Tools

Genomics Data Pipeline

What you'll learn

~30 min
  • Build a Python CLI pipeline for FASTQ quality control using a single AI prompt
  • Understand streaming file parsing, config-driven workflows, and reproducible logging
  • Troubleshoot common issues with file formats, gzipped files, and large dataset handling
  • Extend the pipeline with MultiQC output, trimming recommendations, and comparison mode

What you’re building

Every sequencing run produces FASTQ files. Every FASTQ file needs quality assessment before downstream analysis. This is the most repetitive task in a genomics lab — and the most important one to automate correctly.

In this lesson you will build a Python CLI tool that takes a directory of FASTQ files, runs quality metrics on each one, and generates a clean HTML summary report with interactive charts. It is config-driven, logs everything, and produces reproducible output. A grad student can build this in an afternoon and use it for the rest of their PhD.

This is different from the browser-based tools in Lessons 1 and 2. This tool runs on the command line — on your laptop, on a lab workstation, or on a shared compute server. It processes real files, writes real output, and integrates into real workflows.

💡Running on HPC?

If your institution uses an HPC cluster, replace python -m venv with conda create, use conda install instead of pip install, and submit long-running jobs through your cluster’s scheduler (SLURM, PBS, or SGE). The pipeline logic stays the same — only the environment setup changes.


The showcase

When finished, your pipeline will:

  • Accept a directory of .fastq or .fastq.gz files as input.
  • For each file, compute:
    • Total read count
    • Read length distribution
    • Per-base quality score distribution (like FastQC’s per-base quality plot)
    • Per-sequence quality score distribution
    • GC content distribution
    • Adapter content estimation (searching for common Illumina adapter sequences)
    • Duplicate rate estimate (based on first 100,000 reads)
  • Generate an HTML report with all metrics visualized as interactive Plotly charts.
  • Save per-file JSON summaries for programmatic access.
  • Log all operations with timestamps for reproducibility.
  • Run from a YAML configuration file so you can standardize settings across your lab.

Think of it as a lightweight, customizable FastQC that you fully control.

Sequencing Core Facility Context

When you submit samples to a sequencing facility (Illumina, PacBio, Oxford Nanopore) or a gene expression center (10X Genomics single-cell, spatial transcriptomics, PacBio Iso-Seq), the data you get back is exactly what this pipeline processes: FASTQ files ranging from a few hundred megabytes to tens of gigabytes. The pipeline you are building is your first-pass QC — run it the moment data lands on the shared drive, before investing hours in alignment or assembly. If something went wrong during library prep or sequencing, you will catch it in the QC report rather than discovering it halfway through your analysis.

If you are working with hands-on genomic datasets in a bioinformatics course, this pipeline is a practical tool you can use throughout the course for every dataset you encounter.


The prompt

Open your AI CLI tool (such as Claude Code, Gemini CLI, or your preferred tool) in an empty directory and paste:

Create a Python CLI tool called fastq-qc that processes FASTQ files and generates
quality control reports. Use only Python 3.10+ standard library plus these PyPI
packages: plotly, pyyaml. No BioPython dependency.
PROJECT STRUCTURE:
fastq-qc/
├── fastq_qc/
│ ├── __init__.py
│ ├── cli.py # argparse CLI entry point
│ ├── config.py # YAML config loader with defaults
│ ├── fastq_parser.py # streaming FASTQ parser (handles .gz)
│ ├── metrics.py # quality metric calculators
│ ├── report.py # HTML report generator with Plotly
│ └── logger.py # timestamped logging setup
├── config.yaml # default configuration file
├── requirements.txt # plotly, pyyaml
├── setup.py # pip install -e . support
└── README.md # usage instructions
CLI INTERFACE:
python -m fastq_qc --input /path/to/fastq/dir --output /path/to/reports
python -m fastq_qc --input /path/to/fastq/dir --output /path/to/reports --config config.yaml
OPTIONS:
--input, -i Directory containing .fastq or .fastq.gz files (required)
--output, -o Output directory for reports (required, created if missing)
--config, -c Path to YAML config file (optional, uses defaults)
--sample-size Number of reads to sample for metrics (default: 500000, 0=all)
--threads Number of parallel files to process (default: 1)
--verbose Enable debug logging
FASTQ PARSER (fastq_parser.py):
- Streaming parser that yields (header, sequence, quality) tuples
- Handles both .fastq and .fastq.gz transparently
- Validates FASTQ format: header starts with @, separator line is +, sequence
and quality have equal length
- Raises clear errors for malformed records with line numbers
- Memory-efficient: processes one record at a time, never loads entire file
METRICS (metrics.py):
Calculate these metrics per file:
1. read_count: total number of reads
2. read_lengths: Counter of read lengths (for distribution)
3. per_base_quality: for each position (0 to max_read_length), collect quality
scores across all reads, report mean, median, Q1, Q3
4. per_sequence_quality: histogram of mean quality scores per read
5. gc_content: histogram of per-read GC fraction (bins of 1%)
6. base_content: per-position frequency of A, T, G, C, N
7. adapter_content: for each position, fraction of reads containing a known
adapter k-mer starting at that position. Check TruSeq:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and Nextera:
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
8. duplicate_estimate: hash first 50 nt of each read, count collisions among
first 100K reads, extrapolate
CONFIG (config.yaml):
sample_size: 500000
min_quality_threshold: 20
adapter_sequences:
truseq: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
nextera: "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"
report:
title: "FASTQ Quality Control Report"
plots_per_row: 2
REPORT (report.py):
Generate a single self-contained HTML file with embedded Plotly.js:
- Header with report title, generation timestamp, tool version
- Summary table: filename, read count, mean quality, mean GC%, pass/warn/fail
status based on configurable thresholds
- Per-file section with:
- Per-base quality box plot (like FastQC) with green/yellow/red background
zones at quality 30/20/0
- Read length distribution histogram
- Per-sequence quality distribution histogram
- GC content distribution with theoretical normal overlay
- Per-base sequence content line plot (A/T/G/C/N across positions)
- Adapter content line plot
- All charts are Plotly (interactive zoom, hover tooltips)
- Dark theme consistent with other tools in this track
LOGGING (logger.py):
- All operations logged with ISO timestamps
- Log to both console (INFO level) and file (DEBUG level)
- Log file saved in output directory as qc_run_YYYYMMDD_HHMMSS.log
- Log: config used, files found, per-file start/end times, any warnings
OUTPUT STRUCTURE:
output_dir/
├── qc_report.html # main HTML report
├── qc_run_20240115_143022.log # run log
└── per_file/
├── sample1.json # per-file metrics as JSON
└── sample2.json
Generate all files with complete, working implementations. The tool should work
end-to-end on real FASTQ files.
File size matters

Real FASTQ files are large (1-50 GB each). The --sample-size flag exists for a reason — set it to 500,000 reads for quick QC during a run, or 0 (all reads) for a final comprehensive report. For your first test, create a small test FASTQ file or use the --sample-size 1000 flag.
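To make the prompt's parser requirements concrete, here is a minimal sketch of a streaming FASTQ parser with transparent gzip handling. This is illustrative only; the code your AI tool generates will be more complete (better validation messages, line numbers in errors).

```python
import gzip
from pathlib import Path

def open_fastq(path):
    """Open a plain or gzipped FASTQ file transparently in text mode."""
    p = Path(path)
    if p.suffix == ".gz":
        return gzip.open(p, "rt")
    return open(p, "rt")

def parse_fastq(path):
    """Yield (header, sequence, quality) tuples one record at a time.

    Memory stays constant because only one 4-line record is held at once.
    """
    with open_fastq(path) as fh:
        while True:
            header = fh.readline().rstrip("\n")
            if not header:
                return  # clean end of file
            seq = fh.readline().rstrip("\n")
            plus = fh.readline().rstrip("\n")
            qual = fh.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"Malformed FASTQ record near {header!r}")
            if len(seq) != len(qual):
                raise ValueError(f"Sequence/quality length mismatch in {header!r}")
            yield header, seq, qual
```

Note how the generator keeps only one record in memory at a time, which is what lets the tool handle multi-gigabyte files.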


What you get

After the LLM generates the project, set it up in a virtual environment:

Terminal window
cd fastq-qc
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .

Create a test file

If you do not have FASTQ files handy, ask the LLM to generate one:

Create a Python script called generate_test_fastq.py that generates a synthetic
FASTQ file with 10,000 reads. Read length 150 bp. Quality scores should simulate
typical Illumina data: high quality (30-40) for the first 100 bases, degrading to
20-30 for the last 50 bases. Include 5% of reads with adapter contamination
(TruSeq adapter appended). Include 2% duplicate reads. Save as test_data/sample1.fastq.

Then run the pipeline:

Terminal window
python -m fastq_qc --input test_data/ --output test_output/ --verbose

Expected output

test_output/
├── qc_report.html
├── qc_run_20240115_143022.log
└── per_file/
└── sample1.json

Open qc_report.html in your browser. You should see:

  • Summary table showing sample1.fastq with ~10,000 reads, mean quality around 30-33, and GC% around 50%.
  • Per-base quality plot showing high quality on the left degrading toward the right — the classic Illumina quality curve.
  • Adapter content plot showing a bump of ~5% adapter contamination increasing toward the 3’ end of reads.
  • GC content distribution centered around 50% with a roughly normal shape.
🔍For Researchers: Integrating with your existing pipeline

This tool is designed to slot into an existing workflow. The per-file JSON output can be consumed by downstream scripts:

import json

with open("test_output/per_file/sample1.json") as f:
    metrics = json.load(f)

if metrics["mean_quality"] < 25:
    print("WARNING: Low quality - consider re-sequencing")
if metrics["adapter_fraction"] > 0.1:
    print("WARNING: High adapter content - run trimming")

You can also add a post-processing step that automatically triggers adapter trimming (Trimmomatic, fastp) when adapter content exceeds a threshold. The config file makes it easy to standardize thresholds across your lab group.
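A sketch of what that trigger could look like. The `trim_command` helper and the `adapter_fraction` threshold logic are hypothetical; fastp's `-i`/`-o` flags are real, but adjust the command for your data and tool of choice.

```python
def trim_command(metrics, fastq, threshold=0.1):
    """Return a fastp command if adapter content exceeds the threshold.

    Returns None when no trimming is needed. The caller can pass the
    returned list to subprocess.run(). Hypothetical helper, not part
    of the generated tool.
    """
    if metrics.get("adapter_fraction", 0) > threshold:
        out = fastq.replace(".fastq", ".trimmed.fastq")
        return ["fastp", "-i", fastq, "-o", out]
    return None
```

Returning the command instead of executing it keeps the decision logic testable without fastp installed.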


Worked example: Processing a real MiSeq run

Here is a realistic scenario for a graduate student running amplicon sequencing on an Illumina MiSeq.

Step 1. Your sequencing core delivers data to a shared drive. You have a folder with 48 FASTQ files (24 samples, paired-end):

/sequencing/run_2024_01/
├── sample01_R1.fastq.gz
├── sample01_R2.fastq.gz
├── sample02_R1.fastq.gz
├── sample02_R2.fastq.gz
...
└── sample24_R2.fastq.gz

Step 2. Run the QC pipeline with sampling for a quick first look:

Terminal window
python -m fastq_qc \
--input /sequencing/run_2024_01/ \
--output /qc/run_2024_01/ \
--sample-size 100000 \
--verbose

This processes 100,000 reads from each file, taking about 2-3 minutes total.

Step 3. Open the HTML report and check the summary table. Look for:

  • Red flags: any sample with mean quality below 25 (likely a library prep issue).
  • Adapter content: R2 files typically show more adapter contamination than R1 in paired-end data.
  • GC distribution: bimodal GC distributions suggest contamination or mixed organisms.
  • Duplicate rate: amplicon data often shows 30-60% duplicates; whole-genome should be under 10%.

Step 4. If you see problems, run a comprehensive report on the flagged samples:

Terminal window
python -m fastq_qc \
--input /sequencing/run_2024_01/ \
--output /qc/run_2024_01_full/ \
--sample-size 0 \
--config config.yaml
Paired-end QC tip

For paired-end data, the QC tool processes R1 and R2 files independently. This is standard — most QC tools (including FastQC) do the same. To check that R1 and R2 are properly paired (same headers in the same order), you can add a follow-up prompt:

Add a --paired flag. When set, for each sample find the R1 and R2 files by name
pattern (_R1/_R2 or _1/_2). Verify that the read headers match between R1 and R2.
Report any orphaned reads or mismatched pairs. Show read-length distribution
for each mate in the report if both files are present.
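The core of that pairing check is simple: Illumina read IDs for the two mates match on the token before the first space (the trailing `1:N:...` / `2:N:...` comment differs). A hedged sketch of the comparison, with a hypothetical `check_pairing` helper:

```python
from itertools import zip_longest

def check_pairing(r1_headers, r2_headers):
    """Return (position, r1_id, r2_id) tuples where the mates disagree.

    An empty list means every R1 header has a matching R2 header
    in the same order.
    """
    mismatches = []
    for i, (h1, h2) in enumerate(zip_longest(r1_headers, r2_headers)):
        id1 = h1.split()[0] if h1 else None  # None marks an orphaned read
        id2 = h2.split()[0] if h2 else None
        if id1 != id2:
            mismatches.append((i, id1, id2))
    return mismatches
```

`zip_longest` pads the shorter stream with `None`, so orphaned reads at the end of either file show up as mismatches instead of being silently dropped.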

Worked example: Comparing quality across sequencing runs

If you run the same library type regularly (e.g., 16S amplicon sequencing every month), you want to track quality over time:

Terminal window
# This month's run
python -m fastq_qc --input /sequencing/run_2024_01/ --output /qc/run_2024_01/
# Compare to last month (requires the --compare customization below)
python -m fastq_qc --input /sequencing/run_2024_01/ --output /qc/compare_jan/ \
--compare /qc/run_2023_12/

The comparison mode (added via the --compare customization in the section below) overlays per-base quality plots, making it easy to spot instrument degradation. Build the base pipeline first, then add this feature.


🔧When Things Go Wrong

Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.

Symptom
Pipeline crashes immediately with 'UnicodeDecodeError' on a .fastq.gz file
Evidence
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x1f in position 0: invalid start byte at fastq_parser.py line 12
What to ask the AI
"The parser is trying to read a gzipped file as plain text. The .fastq.gz file needs to be opened with gzip.open() instead of open(). Can you check the file extension and use gzip.open(filename, 'rt') for .gz files and regular open() for .fastq files?"
Symptom
Report shows 0 reads for some files but the files are not empty
Evidence
sample03.fastq shows 0 reads in the summary table. I checked the file with 'wc -l' and it has 400,000 lines (100,000 reads). The log says 'Processing sample03.fastq... complete in 0.01s.'
What to ask the AI
"The parser might be failing silently on this file. Can you add a check: if the parsed read count is 0 but the file is not empty, log a WARNING with the first 4 lines of the file so I can see the format? The file might have Windows line endings (\r\n) or a different FASTQ variant."
Symptom
Memory usage spikes to 16 GB and the process gets killed
Evidence
Running on a 48-file dataset with --sample-size 0 (all reads). The process runs fine for the first 10 files, then the system OOM-killer terminates it. Each file is about 2 GB.
What to ask the AI
"The metrics accumulator is storing too much data in memory. Can you make per_base_quality use running statistics (online mean and variance) instead of storing every quality score? Also, process one file completely and write its JSON before starting the next, so we release memory between files."
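"Running statistics" here means Welford's online algorithm: mean and variance in constant memory, one value at a time. A sketch of what the fix replaces the stored-scores list with (note that exact medians and quartiles cannot be computed this way; they need the full data or an approximate quantile sketch):

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # population variance; use m2 / (n - 1) for the sample variance
        return self.m2 / self.n if self.n > 1 else 0.0
```

One `RunningStats` per read position replaces a per-position list of every quality score, which is where the memory was going.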
Symptom
BAM files in the input directory cause a crash
Evidence
Error: 'FASTQ validation failed: header line does not start with @ at line 1' for file alignment.bam. I accidentally left BAM files in the same directory.
What to ask the AI
"The tool is trying to parse non-FASTQ files. Can you add a file filter that only processes files ending in .fastq, .fq, .fastq.gz, or .fq.gz? Log a warning for any other files found in the input directory and skip them."
Symptom
Plotly charts in the report show as blank white rectangles
Evidence
The HTML report loads and the summary table looks correct, but all the Plotly charts are empty white boxes. No error in the browser console. I am on a restricted lab workstation.
What to ask the AI
"The Plotly CDN might be blocked by the network firewall. Can you embed the Plotly.js library directly in the HTML file instead of loading it from the CDN? The report generator likely calls plotly.io.to_html with include_plotlyjs='cdn'; change it to include_plotlyjs=True to embed the full library."
Symptom
Quality scores look wrong: values far above the expected Phred range of 0-40
Evidence
The per-base quality plot shows raw ASCII values (64-104) instead of Phred quality scores (0-40). Hovering over a box shows Q=73 at position 1.
What to ask the AI
"The quality score parser is not converting ASCII to Phred. For Illumina 1.8+ (Phred+33), subtract 33 from the ASCII value of each quality character. So '!' (ASCII 33) = Q0, 'I' (ASCII 73) = Q40. Can you add the Phred+33 offset subtraction in the parser?"
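The fix boils down to a one-line conversion; a quick sketch you can use to sanity-check the AI's patch:

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 quality string to a list of integer Q scores.

    '!' (ASCII 33) decodes to Q0, 'I' (ASCII 73) decodes to Q40.
    """
    return [ord(c) - 33 for c in quality_string]
```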

How it works

The pipeline follows a standard pattern for scientific CLI tools:

  1. Parse arguments with argparse. Validate that input directory exists and contains FASTQ files.
  2. Load config from YAML file (or use defaults). This makes the tool reproducible — check the config into version control alongside your data.
  3. Stream through each file one record at a time. Never load the entire file into memory. For gzipped files, use Python’s gzip.open().
  4. Compute metrics as running aggregates. Per-base quality uses a position-indexed list of accumulators. GC content uses histogram bins. This keeps memory usage constant regardless of file size.
  5. Generate report by embedding Plotly.js from CDN and creating one chart per metric per file. The HTML file is self-contained except for the Plotly CDN link. For fully offline use, set include_plotlyjs=True in the report generator to embed the full Plotly library.
  6. Log everything so you can trace exactly what happened, when, and with what parameters.
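The running aggregates in step 4 can be sketched as position-indexed accumulators. This is illustrative only (the generated metrics.py will track more than means), but it shows why memory stays constant no matter how many reads stream through:

```python
class PerBaseQuality:
    """Per-position quality accumulator with constant memory per position."""

    def __init__(self):
        self.sums = []    # total decoded quality at each position
        self.counts = []  # number of reads covering each position

    def add_read(self, quality_string):
        for pos, ch in enumerate(quality_string):
            if pos == len(self.sums):
                # grow to accommodate the longest read seen so far
                self.sums.append(0)
                self.counts.append(0)
            self.sums[pos] += ord(ch) - 33  # Phred+33 decode
            self.counts[pos] += 1

    def means(self):
        return [s / c for s, c in zip(self.sums, self.counts)]
```

Storage grows with read length (a few hundred positions), not with read count (hundreds of millions), which is the property that keeps a 50 GB file processable on a laptop.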

Customize it

Add MultiQC-compatible output

Add a MultiQC-compatible output module. Write metrics in the format that MultiQC
expects (see multiqc.info for the custom content specification). This way the
user can combine this tool's output with other tools (Trimmomatic, STAR, featureCounts)
into a single MultiQC report. Output the files to a multiqc_data/ subdirectory.

Add automatic trimming recommendations

Add a "Recommendations" section to the HTML report for each file. Based on the
metrics, automatically suggest:
- Whether adapter trimming is needed (adapter content > 5%)
- Suggested quality trim threshold (based on where mean per-base quality drops
below 20)
- Whether the library is over-sequenced (duplicate rate > 30%)
- Expected data loss after trimming
Show these as a color-coded checklist (green=OK, yellow=warn, red=action needed).

Add email notification

Add a --notify flag that accepts an email address. When the pipeline finishes,
send a summary email using Python's smtplib with the local SMTP server. Include
the summary table (text format) and attach the HTML report. This is useful for
long-running jobs on a shared server.

Add comparison mode

Add a --compare flag that accepts a previous output directory. When set, the report
includes side-by-side comparisons: show per-base quality plots overlaid for the
current run vs the previous run. This helps catch sequencing quality degradation
over time. Highlight metrics that changed by more than 10%.
💡Version control your configs

Put your config.yaml in a git repository alongside your analysis scripts. When you publish a paper, the config file documents exactly what QC thresholds you used. Reviewers increasingly ask for this level of reproducibility documentation.


Beyond Genomics: Document Processing Pipelines

The pipeline architecture you just built — input files, stage-by-stage processing, validation, structured output — transfers directly to non-biological domains. Here is a striking example from historical document processing.

Projects digitizing hundreds of thousands of archival military personnel files (spanning millions of pages) from WWII and Korea follow this same pattern. These are official casualty records of service members — handwritten forms, typed reports, telegrams, and medical records from the 1940s and 1950s.

The OCR processing pipeline follows a pattern you will recognize:

  1. Column detection — identify the column layout of each scanned page (OpenCV)
  2. Row/text extraction — extract text from each detected region (OCR engine)
  3. Validation — cross-check extracted fields against known formats (service number patterns, date formats, rank abbreviations)
  4. ML training data generation — validated output feeds back into model training for improved accuracy on the next batch

This is the same input → process → validate → output architecture as your FASTQ pipeline. The config-driven approach works the same way: page layout templates replace adapter sequences, OCR confidence thresholds replace quality score thresholds, and the HTML report summarizes extraction accuracy instead of sequencing quality.

Here is a prompt to build a document processing pipeline following the same architecture:

Create a Python CLI tool called doc-pipeline that processes scanned document images
through an OCR pipeline. Structure it identically to the FASTQ QC tool:
PROJECT STRUCTURE:
doc-pipeline/
├── doc_pipeline/
│ ├── cli.py # argparse CLI entry point
│ ├── config.py # YAML config loader
│ ├── image_loader.py # load PNG/TIFF/JPEG images, detect page orientation
│ ├── column_detector.py # detect text columns using contour analysis (OpenCV)
│ ├── text_extractor.py # extract text from detected regions (pytesseract)
│ ├── validator.py # validate extracted fields against regex patterns
│ ├── report.py # HTML summary report with extraction stats
│ └── logger.py # timestamped logging
├── config.yaml # field patterns, confidence thresholds
├── requirements.txt # opencv-python, pytesseract, plotly, pyyaml
└── README.md
The config.yaml should include:
field_patterns:
service_number: "[A-Z]?\\d{7,8}"
date: "\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}"
rank: list of WWII military rank abbreviations
confidence_threshold: 0.85
output_format: "json"
The report should show per-page extraction results, field-level confidence scores,
and a summary of how many pages passed validation. Log everything with timestamps.
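The validation step is just anchored regex matching against the configured patterns. A minimal sketch, using the (assumed) patterns from the config above:

```python
import re

# Patterns mirroring the config.yaml sketch above; formats are illustrative
FIELD_PATTERNS = {
    "service_number": r"[A-Z]?\d{7,8}",
    "date": r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",
}

def validate_field(field, value):
    """True if the extracted value fully matches the expected pattern.

    re.fullmatch anchors at both ends, so partial matches like
    '12AB' against the service-number pattern are rejected.
    """
    pattern = FIELD_PATTERNS.get(field)
    return bool(pattern and re.fullmatch(pattern, value))
```

The same function works for FASTQ headers, sensor IDs, or any other structured field; only the pattern table changes.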
🔍Why this matters beyond biology

Pipelines like this have achieved name capture rates above 99% and service number accuracy above 97% across millions of pages — using the same Python/OpenCV/ML stack you learn in this module.

The point: the skills you are building with FASTQ files transfer to any domain that processes structured data at scale. Historical document digitization, medical record extraction, environmental sensor data — the pipeline architecture is the same. The config file changes; the pattern does not.


The bigger picture

Browser tools (Lessons 1-2) are great for interactive, one-off analyses. But genomics workflows are batch processes — you get 48 samples from a sequencing run and need to QC all of them with consistent parameters. That requires CLI tools.

The pattern here scales:

  • Single file: python -m fastq_qc --input one_sample/ --output report/
  • Full run: python -m fastq_qc --input /sequencing/run_042/ --output /qc/run_042/
  • Automated: Add the command to a Snakemake or Nextflow pipeline and it runs automatically when new data appears.

You built a reproducible, config-driven, well-logged pipeline tool. This is the kind of software that makes a lab more efficient — and you did it with one prompt and a few customizations.


Key takeaways

  • CLI tools complement browser tools: interactive dashboards are great for exploration, but batch processing requires command-line tools that run unattended, log their work, and produce structured output.
  • Config-driven workflows are reproducible: a YAML config file documents your QC parameters. Check it into version control alongside your analysis scripts so reviewers and collaborators can reproduce your exact settings.
  • Streaming parsers prevent memory disasters: real FASTQ files are 1-50 GB each. If the parser loads the entire file into memory, it will crash on a typical lab workstation. Always process one record at a time.
  • Gzipped files require explicit handling: the most common crash in genomics pipelines is opening a .gz file with plain open() instead of gzip.open(). This is the first thing to check when a parser fails.
  • The --sample-size flag is essential: quick QC on 100K reads takes seconds. Full QC on 500M reads takes hours. Use sampling for rapid checks during a run, and full analysis for the final report.
  • Reproducibility requires version pinning and random seeds: pin your dependency versions in requirements.txt (e.g., plotly==5.18.0) and set random seeds for any sampling or hashing operations. This ensures that re-running the pipeline on the same data produces the same results.
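The last takeaway in practice: a seeded reservoir sampler makes --sample-size reproducible, picking k reads uniformly from a stream without knowing the total count in advance. A sketch, not the tool's actual implementation:

```python
import random

def sample_reads(reads, k, seed=42):
    """Reservoir sampling: choose k items uniformly from a stream.

    The fixed seed makes repeated runs on the same file return the
    same sample, so QC reports are reproducible.
    """
    rng = random.Random(seed)  # private RNG; doesn't disturb global state
    reservoir = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = read
    return reservoir
```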

Portfolio suggestion

This pipeline tool is one of the strongest portfolio pieces from the module because it demonstrates real-world engineering practices:

  1. Save the project including the config.yaml, a sample QC report (HTML), and the log file from a run.
  2. Document the workflow in the README: what input it takes, what output it produces, and how to interpret the results.
  3. Include a “lessons learned” section describing any bugs you encountered and how you fixed them with follow-up prompts.

If you want to go further, run the tool on real sequencing data from your lab and include a de-identified report in your portfolio. A QC report on real data is far more compelling than synthetic test data. You can also add a Snakemake rule that integrates this tool into a larger pipeline — even a two-step pipeline (QC then trim) demonstrates pipeline orchestration skills.

🔍Advanced: Adding VCF and BAM support

Once you have the FASTQ QC pipeline working, you can extend it to handle other common genomics file formats. Each of these is a follow-up prompt:

BAM/SAM QC:

Add support for BAM/SAM files. Use pysam (add to requirements.txt) to parse
aligned reads. Calculate alignment rate, mapping quality distribution, insert
size distribution (paired-end), and coverage per chromosome. Add these as
additional panels in the HTML report.

VCF summary:

Add support for VCF files. Parse the VCF header and variant records. Report:
total variants, SNP/indel ratio, Ti/Tv ratio, variant quality distribution,
allele frequency spectrum, and per-chromosome variant counts. This gives a quick
overview of a variant calling run.

These extensions turn your QC tool into a comprehensive genomics data summarizer — like a custom MultiQC that you fully control and can extend for any file format your lab uses.


KNOWLEDGE CHECK

Your pipeline crashes with a UnicodeDecodeError on the first FASTQ file. The file has a .fastq.gz extension. What is the most likely cause?


Try it yourself

  1. Generate the pipeline with the prompt above.
  2. Create a test FASTQ file (or use one from your own data).
  3. Run the pipeline and open the HTML report.
  4. Examine the per-base quality plot. Does the quality degradation pattern match what you expect from your sequencing platform?
  5. Try the --sample-size flag with different values and observe the speed/accuracy tradeoff.
  6. Add one customization from the list above.

What’s next

In Lesson 4, you will build a mass spectrometry data viewer — an interactive Plotly-based application for visualizing spectra, detecting peaks, overlaying runs, and exporting annotated figures. This is directly relevant to proteomics and mass spectrometry core facilities.