Applied Module 12 · AI-Powered Bioinformatics Tools

Reproducible RNA-Seq Workflow Orchestrator

What you'll learn

~45 min
  • Build a Snakemake workflow chaining QC, alignment, counting, and differential expression analysis
  • Generate a Snakefile from a sample-sheet CSV
  • Troubleshoot DAG dependencies and file I/O errors
  • Extend the workflow with config-driven parameterization

What you’re building

In Lesson 3, you built a FASTQ QC pipeline. In Lesson 5, you built a differential expression dashboard. In a real RNA-seq experiment, these are two steps in a much longer chain: raw reads go through QC, trimming, alignment, counting, and finally differential expression. Running each step manually — copying output paths, checking that files exist, restarting failed steps — is tedious, error-prone, and not reproducible.

Snakemake solves this. It is a workflow management system built for bioinformatics. You define rules (each step in the pipeline), declare what files each rule needs and produces, and Snakemake figures out the execution order, parallelizes independent steps, and reruns only what changed. It is the dominant workflow tool in academic bioinformatics, used in thousands of published papers.

In this lesson you will build a Python CLI tool that generates complete Snakemake workflows from a sample-sheet CSV. Give it a CSV listing your samples, FASTQ paths, and conditions, and it produces a ready-to-run Snakefile plus a config.yaml with all parameters. The workflow chains FastQC, Trim Galore, HISAT2 alignment, featureCounts, MultiQC, and a DE analysis step that connects directly to Lesson 5’s dashboard.

This is the capstone lesson for the module. It ties together everything you have built into a single reproducible pipeline that goes from raw FASTQ files to a list of differentially expressed genes.

Software pattern: Declarative workflow orchestration

Define inputs → declare dependencies → let the engine handle execution order, parallelism, and failure recovery. This pattern appears everywhere: CI/CD pipelines (GitHub Actions), data engineering (Airflow, dbt), infrastructure (Terraform). The syntax changes; the concept of declaring “what depends on what” is universal.
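The pattern can be shown in a few lines of plain Python (the step names here are illustrative, not Snakemake code): declare only what each step needs, and derive both the execution order and the parallel batches from those declarations.

```python
# Declare dependencies only: step -> set of steps it needs first.
deps = {
    "qc": set(),
    "trim": {"qc"},
    "align": {"trim"},
    "count": {"align"},
    "merge": {"count"},
    "report": {"qc", "count"},
}

def execution_waves(deps):
    """Group steps into waves; steps in the same wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        # a step is ready once everything it depends on is done
        ready = sorted(s for s, d in deps.items() if s not in done and d <= done)
        if not ready:
            raise ValueError("cycle detected in dependency graph")
        waves.append(ready)
        done.update(ready)
    return waves

print(execution_waves(deps))
# → [['qc'], ['trim'], ['align'], ['count'], ['merge', 'report']]
```

Note that `merge` and `report` land in the same wave: nothing forces an order between them, so an engine is free to run them in parallel. That is exactly what Snakemake does with your rules.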

💡Running on HPC?

Snakemake has built-in cluster support. Instead of snakemake -j 4 (4 local cores), use snakemake -j 100 --cluster "sbatch -p compute -n 1 -t 4:00:00" to submit each rule as a separate SLURM job. The workflow generator you are building will include this command in its output README. No code changes are needed to go from laptop to cluster.


The showcase

The finished CLI tool will:

  • Read a sample-sheet CSV with columns: sample_id, fastq_r1, fastq_r2, condition, replicate.
  • Generate a complete Snakefile with rules for every step in the RNA-seq pipeline:
    1. FastQC — quality assessment of raw reads
    2. Trim Galore — adapter trimming and quality filtering
    3. FastQC (post-trim) — quality assessment of trimmed reads
    4. HISAT2 alignment — splice-aware alignment to a reference genome
    5. SAMtools sort/index — BAM file processing
    6. featureCounts — gene-level read counting
    7. Count matrix merge — combine per-sample counts into a single matrix
    8. MultiQC — aggregate all QC reports into one HTML summary
    9. DE analysis — run the t-test/FDR pipeline from Lesson 5 on the merged count matrix
  • Generate a config.yaml with all tunable parameters: genome reference path, annotation GTF path, trimming quality threshold, alignment parameters, featureCounts strand setting, DE thresholds.
  • Generate a DAG visualization command so you can see the workflow graph before running.
  • Generate a README with exact commands for dry-run, local execution, and cluster execution.

The tool does not run the pipeline itself — it generates the Snakefile and config that you then execute with snakemake. This separation is intentional: the generator runs anywhere (your laptop, a login node), and the generated workflow runs wherever your data and tools live (HPC, cloud, local workstation).

🔍Snakemake vs. Nextflow

Both are excellent workflow managers for bioinformatics. Here is when to choose each:

Choose Snakemake if:

  • Your lab primarily uses Python
  • You want to run workflows locally first, then scale to a cluster with minimal changes
  • You are in an academic environment where Snakemake is the de facto standard
  • You need tight integration with conda environments (Snakemake has built-in conda: directives)

Choose Nextflow if:

  • Your lab or institute has standardized on Nextflow (common in large genomics centers)
  • You need cloud-native execution on AWS Batch or Google Cloud Life Sciences
  • You want to use nf-core community pipelines out of the box
  • You prefer Groovy-based DSL over Python-based DSL

Both tools solve the same problem. This lesson uses Snakemake because it is more Python-native, and because you have been writing Python tools throughout this module. The Customize section includes a prompt to convert the generated workflow to Nextflow if your lab uses that instead.


The prompt

Open your AI CLI tool (Claude Code, Gemini CLI, or similar) in an empty directory and paste:

Create a Python CLI tool called rnaseq-workflow-generator that reads a sample-sheet
CSV and generates a complete Snakemake workflow for RNA-seq analysis. Use Python 3.10+
with Jinja2 for templating and PyYAML for config generation.
PROJECT STRUCTURE:
rnaseq-workflow-generator/
├── workflow_generator/
│   ├── __init__.py
│   ├── cli.py                  # argparse CLI entry point
│   ├── sample_sheet.py         # sample sheet CSV parser and validator
│   ├── snakefile_builder.py    # Snakemake rule generation using Jinja2
│   ├── config_builder.py       # config.yaml generation
│   └── templates/
│       ├── Snakefile.j2        # Jinja2 template for the Snakefile
│       └── config.yaml.j2      # Jinja2 template for config.yaml
├── example_sample_sheet.csv    # example input with 6 samples
├── requirements.txt            # jinja2, pyyaml
├── setup.py                    # pip install -e . support
└── README.md
CLI INTERFACE:
python -m workflow_generator --sample-sheet samples.csv --output-dir workflow/
python -m workflow_generator --sample-sheet samples.csv --output-dir workflow/ \
--genome /ref/mm10/genome.fa --gtf /ref/mm10/genes.gtf --aligner hisat2
OPTIONS:
--sample-sheet, -s   Path to sample sheet CSV (required)
--output-dir, -o     Directory for generated workflow files (required, created if missing)
--genome, -g         Path to genome FASTA (default: /ref/genome.fa, user must edit)
--gtf                Path to gene annotation GTF (default: /ref/genes.gtf)
--aligner            Alignment tool: hisat2 or star (default: hisat2)
--strandedness       Library strandedness: unstranded, forward, reverse (default: reverse)
--trim-quality       Trim Galore quality threshold (default: 20)
--min-mapq           Minimum mapping quality for featureCounts (default: 10)
--fdr-threshold      DE analysis FDR threshold (default: 0.05)
--lfc-threshold      DE analysis log2FC threshold (default: 1.0)
--threads            Threads per rule (default: 4)
--no-de              Skip the DE analysis step
SAMPLE SHEET FORMAT (sample_sheet.py):
Parse and validate a CSV with these columns:
- sample_id: unique identifier (alphanumeric + underscore, no spaces)
- fastq_r1: path to R1 FASTQ file (absolute path or relative to workflow dir)
- fastq_r2: path to R2 FASTQ file (for paired-end; blank for single-end)
- condition: experimental group name (e.g., "control", "treated")
- replicate: replicate number within the condition (e.g., 1, 2, 3)
Validation rules:
- All sample_ids must be unique
- All FASTQ paths must be non-empty (warn if files do not exist at generation time)
- At least 2 conditions must be present for DE analysis
- At least 2 replicates per condition recommended (warn if fewer)
- Reject sample_ids with spaces or special characters
EXAMPLE SAMPLE SHEET (example_sample_sheet.csv):
sample_id,fastq_r1,fastq_r2,condition,replicate
Control_1,/data/fastq/ctrl1_R1.fastq.gz,/data/fastq/ctrl1_R2.fastq.gz,control,1
Control_2,/data/fastq/ctrl2_R1.fastq.gz,/data/fastq/ctrl2_R2.fastq.gz,control,2
Control_3,/data/fastq/ctrl3_R1.fastq.gz,/data/fastq/ctrl3_R2.fastq.gz,control,3
Treated_1,/data/fastq/trt1_R1.fastq.gz,/data/fastq/trt1_R2.fastq.gz,treated,1
Treated_2,/data/fastq/trt2_R1.fastq.gz,/data/fastq/trt2_R2.fastq.gz,treated,2
Treated_3,/data/fastq/trt3_R1.fastq.gz,/data/fastq/trt3_R2.fastq.gz,treated,3
SNAKEFILE TEMPLATE (Snakefile.j2):
Generate a Snakemake workflow with these rules in dependency order:
configfile: "config.yaml"
SAMPLES = config["samples"]
CONDITIONS = config["conditions"]
GENOME = config["genome"]
GTF = config["gtf"]
rule all:
    input:
        "results/multiqc/multiqc_report.html",
        "results/counts/count_matrix.csv",
        {% if not no_de %}"results/de/de_results.csv"{% endif %}

rule fastqc_raw:
    input: get R1 and R2 paths from config for each sample
    output:
        "results/fastqc_raw/{sample}_R1_fastqc.html",
        "results/fastqc_raw/{sample}_R2_fastqc.html"
    threads: 2
    shell: "fastqc {input} -o results/fastqc_raw/ -t {threads}"

rule trim_galore:
    input: R1 and R2 from config
    output:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    params: quality = config["trim_quality"]
    threads: 4
    shell: """
        trim_galore --paired --quality {params.quality} \
            --output_dir results/trimmed/ \
            --cores {threads} {input.r1} {input.r2}
        """

rule fastqc_trimmed:
    input:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    output: "results/fastqc_trimmed/{sample}_R1_val_1_fastqc.html"
    threads: 2
    shell: "fastqc {input} -o results/fastqc_trimmed/ -t {threads}"

rule hisat2_align:
    input:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    output: temp("results/aligned/{sample}.unsorted.bam")
    params: index = config["hisat2_index"]
    threads: config["threads"]
    log: "logs/hisat2/{sample}.log"
    shell: """
        hisat2 -x {params.index} -1 {input[0]} -2 {input[1]} \
            --rna-strandness RF --threads {threads} 2> {log} | \
            samtools view -bS - > {output}
        """

(If --aligner star, generate a STAR rule instead with appropriate parameters)

rule samtools_sort:
    input: "results/aligned/{sample}.unsorted.bam"
    output: "results/aligned/{sample}.sorted.bam"
    threads: 4
    shell: "samtools sort -@ {threads} -o {output} {input}"

rule samtools_index:
    input: "results/aligned/{sample}.sorted.bam"
    output: "results/aligned/{sample}.sorted.bam.bai"
    shell: "samtools index {input}"

rule featurecounts:
    input:
        bam = "results/aligned/{sample}.sorted.bam",
        bai = "results/aligned/{sample}.sorted.bam.bai"
    output: "results/counts/{sample}.counts.txt"
    params:
        strand = config["featurecounts_strand"],
        gtf = GTF
    threads: 2
    shell: """
        featureCounts -a {params.gtf} -o {output} \
            -s {params.strand} -p --countReadPairs \
            -T {threads} -Q {config[min_mapq]} {input.bam}
        """

rule merge_counts:
    input: expand("results/counts/{sample}.counts.txt", sample=SAMPLES)
    output: "results/counts/count_matrix.csv"
    script: "scripts/merge_counts.py"

(Generate a scripts/merge_counts.py that reads featureCounts output files,
extracts the gene name and count columns, and merges into a single CSV
with genes as rows and samples as columns)

rule multiqc:
    input:
        expand("results/fastqc_raw/{sample}_R1_fastqc.html", sample=SAMPLES),
        expand("results/fastqc_trimmed/{sample}_R1_val_1_fastqc.html", sample=SAMPLES),
        expand("results/counts/{sample}.counts.txt", sample=SAMPLES)
    output: "results/multiqc/multiqc_report.html"
    shell: "multiqc results/ -o results/multiqc/ --force"

rule de_analysis:
    input: "results/counts/count_matrix.csv"
    output:
        "results/de/de_results.csv",
        "results/de/volcano.html"
    params:
        fdr = config["fdr_threshold"],
        lfc = config["lfc_threshold"],
        conditions = CONDITIONS
    script: "scripts/de_analysis.py"

(Generate a scripts/de_analysis.py that implements the same t-test + BH
correction from Lesson 5, reads the count matrix, performs DE analysis,
saves results CSV, and generates a standalone Plotly volcano plot HTML)
CONFIG TEMPLATE (config.yaml.j2):
Generate a YAML config with all parameters from the CLI plus:
samples:
  Control_1:
    r1: /data/fastq/ctrl1_R1.fastq.gz
    r2: /data/fastq/ctrl1_R2.fastq.gz
    condition: control
  Control_2:
    ...
conditions:
  control: [Control_1, Control_2, Control_3]
  treated: [Treated_1, Treated_2, Treated_3]
genome: /ref/genome.fa
gtf: /ref/genes.gtf
hisat2_index: /ref/hisat2_index/genome
threads: 4
trim_quality: 20
min_mapq: 10
featurecounts_strand: 2  # 2 = reverse strand for dUTP libraries
fdr_threshold: 0.05
lfc_threshold: 1.0
OUTPUT STRUCTURE:
When the generator runs, it produces:
workflow/
├── Snakefile
├── config.yaml
├── scripts/
│   ├── merge_counts.py
│   └── de_analysis.py
└── README.md  (with commands for: dry-run, local run, cluster run, DAG viz)
The README should include:
# Quick start
snakemake -n # dry run (check the plan)
snakemake -j 4 # run locally with 4 cores
snakemake --dag | dot -Tpng > dag.png # visualize the workflow DAG
# HPC cluster (SLURM)
snakemake -j 100 --cluster "sbatch -p compute -n 1 -c {threads} -t 4:00:00"
# With conda environments per rule
snakemake -j 4 --use-conda
Generate all files with complete implementations. The generated Snakefile should be
a valid, runnable Snakemake workflow (assuming the reference genome and tools are
installed). The CLI tool itself only needs jinja2 and pyyaml as dependencies.
What runs where

The CLI tool you are building (the generator) runs on any machine with Python — your laptop, a login node, anywhere. It reads a CSV and writes files. The generated Snakefile requires bioinformatics tools (FastQC, Trim Galore, HISAT2, SAMtools, featureCounts, MultiQC) which are typically installed on HPC clusters or bioinformatics workstations. You do not need these tools installed to build and test the generator. You can verify the generated Snakefile with snakemake -n (dry run) as long as Snakemake itself is installed.


What you get

After generation, set up the generator:

Terminal window
cd rnaseq-workflow-generator
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .

Generate a workflow from the example sample sheet

Terminal window
python -m workflow_generator \
--sample-sheet example_sample_sheet.csv \
--output-dir my_workflow/ \
--genome /ref/mm10/genome.fa \
--gtf /ref/mm10/genes.gtf

Expected output

my_workflow/
├── Snakefile            (~150-250 lines)
├── config.yaml          (~40-60 lines)
├── scripts/
│   ├── merge_counts.py  (~40-60 lines)
│   └── de_analysis.py   (~100-150 lines)
└── README.md
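Of the generated files, scripts/merge_counts.py is the easiest to sanity-check by eye. Its core can be sketched like this, assuming featureCounts' standard output layout (a leading "#" comment line, a header row, gene IDs in the first column, counts in the last column):

```python
import csv

def read_featurecounts(path):
    """Return {gene_id: count} from one featureCounts output file."""
    counts = {}
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            # skip the program comment line and the column header
            if row[0].startswith("#") or row[0] == "Geneid":
                continue
            counts[row[0]] = int(row[-1])  # count is the last column
    return counts

def merge_counts(paths, sample_names):
    """Merge per-sample counts into rows of (gene, count_s1, count_s2, ...)."""
    per_sample = [read_featurecounts(p) for p in paths]
    genes = sorted(per_sample[0])
    header = ["gene"] + list(sample_names)
    rows = [[g] + [s.get(g, 0) for s in per_sample] for g in genes]
    return [header] + rows
```

The generated script will differ in details (CSV writing, CLI handling), but if its logic does not reduce to roughly this shape, that is worth a follow-up prompt.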

First run walkthrough

  1. Open my_workflow/Snakefile and read through the rules. Each rule should have clear input:, output:, and shell: (or script:) sections. The rule names should read like a protocol: fastqc_raw, trim_galore, hisat2_align, samtools_sort, featurecounts, merge_counts, multiqc, de_analysis.

  2. Open my_workflow/config.yaml. Verify that your sample information is correct: each sample has R1/R2 paths, condition, and replicate. Check that the genome and GTF paths match your local setup.

  3. If you have Snakemake installed, run a dry run:

Terminal window
cd my_workflow
snakemake -n

This shows every job that would run, in dependency order, without actually executing anything. You should see jobs listed for every sample at every step: 6 samples x FastQC = 6 jobs, 6 x trim = 6 jobs, 6 x align = 6 jobs, and so on, plus 1 merge, 1 MultiQC, and 1 DE analysis.

  4. Visualize the DAG (directed acyclic graph):
Terminal window
snakemake --dag | dot -Tpng > dag.png

Open dag.png. You should see a tree-like structure where each sample’s processing chain is independent until the merge_counts rule, which takes all per-sample count files as input. The de_analysis rule depends on merge_counts, and multiqc depends on all FastQC and featureCounts outputs.

  5. Check scripts/de_analysis.py. It should implement the same normalization, t-test, and BH correction from Lesson 5, producing both a results CSV and a standalone volcano plot HTML that you can open directly in a browser.
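The statistical heart of de_analysis.py is the Benjamini-Hochberg correction. The t-test itself comes from SciPy in Lesson 5's code, but the BH step is small enough to sketch dependency-free, which is useful when reviewing the generated script:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR adjustment; returns q-values in input order."""
    n = len(pvalues)
    # walk p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    qvals = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        qvals[i] = running_min
    return qvals
```

For example, `bh_adjust([0.01, 0.04, 0.03, 0.005])` returns `[0.02, 0.04, 0.04, 0.02]`: the raw BH values for the two middle p-values would be 0.04 and 0.02, and the monotonicity pass guarantees a smaller p-value never ends up with a larger q-value.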
🔍For Researchers: The reproducibility argument

A Snakefile is a complete, executable record of your analysis pipeline. Combined with the config.yaml, it captures every parameter, every tool, and every dependency. When a reviewer asks “how did you process your RNA-seq data?”, you can point to the Snakefile.

Put both files under version control:

Terminal window
cd my_workflow
git init
git add Snakefile config.yaml scripts/
git commit -m "RNA-seq workflow for project X"

Now your entire analysis is reproducible. Clone the repo on a different machine, edit the paths in config.yaml, run snakemake -j 4, and you get identical results. This is the standard that journals like Nature Methods and Genome Biology increasingly require.


Worked example: From raw FASTQ to DE results in one pipeline

Here is a complete walkthrough of using the generated workflow on a real dataset.

Step 1. You receive data from your sequencing core. The core provides a sample manifest and a directory of FASTQ files:

/sequencing/run_2024_03/
├── ctrl1_R1.fastq.gz
├── ctrl1_R2.fastq.gz
├── ctrl2_R1.fastq.gz
├── ctrl2_R2.fastq.gz
├── ctrl3_R1.fastq.gz
├── ctrl3_R2.fastq.gz
├── trt1_R1.fastq.gz
├── trt1_R2.fastq.gz
├── trt2_R1.fastq.gz
├── trt2_R2.fastq.gz
├── trt3_R1.fastq.gz
└── trt3_R2.fastq.gz

Step 2. Create a sample sheet CSV from the core’s manifest:

sample_id,fastq_r1,fastq_r2,condition,replicate
Control_1,/sequencing/run_2024_03/ctrl1_R1.fastq.gz,/sequencing/run_2024_03/ctrl1_R2.fastq.gz,control,1
Control_2,/sequencing/run_2024_03/ctrl2_R1.fastq.gz,/sequencing/run_2024_03/ctrl2_R2.fastq.gz,control,2
Control_3,/sequencing/run_2024_03/ctrl3_R1.fastq.gz,/sequencing/run_2024_03/ctrl3_R2.fastq.gz,control,3
Treated_1,/sequencing/run_2024_03/trt1_R1.fastq.gz,/sequencing/run_2024_03/trt1_R2.fastq.gz,treated,1
Treated_2,/sequencing/run_2024_03/trt2_R1.fastq.gz,/sequencing/run_2024_03/trt2_R2.fastq.gz,treated,2
Treated_3,/sequencing/run_2024_03/trt3_R1.fastq.gz,/sequencing/run_2024_03/trt3_R2.fastq.gz,treated,3

Step 3. Generate the workflow:

Terminal window
python -m workflow_generator \
--sample-sheet samples.csv \
--output-dir rnaseq_analysis/ \
--genome /ref/mm10/genome.fa \
--gtf /ref/mm10/genes.gtf \
--aligner hisat2 \
--strandedness reverse \
--threads 8

Step 4. Inspect and dry-run:

Terminal window
cd rnaseq_analysis
snakemake -n # verify the plan

Step 5. Run the pipeline:

Terminal window
# Local execution (workstation with 32 cores)
snakemake -j 8
# OR cluster execution (SLURM)
snakemake -j 100 --cluster "sbatch -p compute -n 1 -c {threads} -t 4:00:00 --mem=32G"

Step 6. Check results:

Terminal window
# Open the MultiQC report -- aggregated QC across all samples
open results/multiqc/multiqc_report.html
# Open the DE volcano plot
open results/de/volcano.html
# Check the DE results
head results/de/de_results.csv

Step 7. For interactive exploration, upload the count matrix to Lesson 5’s Streamlit dashboard:

Terminal window
cd ../rnaseq-de-dashboard
streamlit run app.py
# Upload results/counts/count_matrix.csv

The pipeline-to-dashboard handoff

The generated workflow produces results/counts/count_matrix.csv — exactly the format that the Lesson 5 DE dashboard expects. This is intentional. The pipeline gives you the statistical results (CSV and volcano plot). The dashboard gives you interactive exploration (adjust thresholds, zoom into genes, generate heatmaps). Use both together: the pipeline for reproducible batch processing, the dashboard for interactive follow-up analysis.
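Before uploading, a quick structural check of the handoff file catches format drift early. A minimal sketch (the path and expected-sample list are whatever your project uses; this is not part of the generated code):

```python
import csv

def check_count_matrix(path, expected_samples):
    """Sanity-check the handoff CSV: genes as rows, one column per sample,
    non-negative integer counts."""
    with open(path) as fh:
        rows = list(csv.reader(fh))
    header, body = rows[0], rows[1:]
    assert set(expected_samples) <= set(header[1:]), "missing sample columns"
    assert all(cell.isdigit() for row in body for cell in row[1:]), \
        "counts should be non-negative integers"
    return len(body), header[1:]  # (number of genes, sample columns)
```

If this passes, the Lesson 5 dashboard will accept the file; if the assertion about integer counts fails, something upstream (usually the merge script) emitted normalized values instead of raw counts.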


🔧

When Things Go Wrong

Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.

Symptom
snakemake -n reports a CyclicGraphException
Evidence
CyclicGraphException: The DAG of jobs contains a cycle. Cycle: fastqc_trimmed -> trim_galore -> fastqc_trimmed. This started after I edited the Snakefile to add a rule.
What to ask the AI
"The Snakefile has a circular dependency -- a rule's output is also listed as its own input, or two rules depend on each other. Can you check the input/output declarations for fastqc_trimmed and trim_galore? The trimmed FastQC should take trimmed FASTQ files as input, not raw FASTQ. And trim_galore should NOT depend on fastqc_trimmed -- trimming comes before post-trim QC. Fix the dependency order so it flows: fastqc_raw -> trim_galore -> fastqc_trimmed."
Symptom
MissingInputException for FASTQ files that clearly exist
Evidence
MissingInputException: Missing input files for rule trim_galore: /data/fastq/ctrl1_R1.fastq.gz. But the file exists: ls -la /data/fastq/ctrl1_R1.fastq.gz shows it right there.
What to ask the AI
"Snakemake resolves paths relative to the working directory. If the Snakefile is in /home/user/workflow/ but the FASTQ paths in config.yaml are relative, Snakemake cannot find them. Can you: (1) make the generator always write absolute paths to config.yaml, and (2) add a validation step that checks whether the FASTQ files actually exist at generation time and warns if they do not?"
Symptom
STAR alignment crashes with 'EXITING because of fatal PARAMETER error: --limitGenomeGenerateRAM'
Evidence
STAR exits immediately with a memory error when aligning to the human genome. The machine has 64 GB RAM. The log says: 'EXITING because of fatal PARAMETER error: --limitGenomeGenerateRAM is too small for this genome.'
What to ask the AI
"STAR requires ~32 GB of RAM to load the human genome index. The SLURM job might be requesting less memory than STAR needs. Can you: (1) add a --mem parameter to the STAR rule in the Snakefile (default 40G), (2) add it as a resource in the Snakefile so cluster execution requests enough memory: resources: mem_mb=40000, and (3) add a note in the generated README about STAR memory requirements?"
Symptom
featureCounts reports 0% assignment rate
Evidence
featureCounts runs but reports: 'Successfully assigned alignments: 0 (0.0%)'. The BAM files are not empty -- samtools flagstat shows 20 million mapped reads.
What to ask the AI
"The 0% assignment usually means the strandedness setting is wrong. If featureCounts uses -s 2 (reverse) but the library was prepared with an unstranded protocol, no reads match the expected strand. Can you add a validation step to the generator that: (1) warns the user to check strandedness if they are unsure, (2) includes a comment in the config.yaml explaining what 0/1/2 mean for the -s flag, and (3) suggests running featureCounts with -s 0 (unstranded) first to check if assignment rate improves?"
Symptom
Generated Snakefile has Jinja2 template syntax instead of Snakemake syntax
Evidence
The Snakefile contains literal {{ and {% tags instead of rendered values. Running snakemake gives: SyntaxError: invalid syntax at line 3.
What to ask the AI
"The Jinja2 template is not being rendered -- the raw template was copied to the output instead of the rendered result. Can you check that snakefile_builder.py is actually calling jinja2.Template.render() with the context variables? The common mistake is writing the template file directly to disk instead of rendering it first."

Understanding Snakemake DAGs

The Snakefile defines rules. Each rule declares inputs, outputs, and a command. Snakemake builds a directed acyclic graph (DAG) — a dependency tree — from these declarations. Understanding the DAG is the key to debugging workflow issues.

How Snakemake resolves dependencies: Starting from the rule all targets, Snakemake works backward. “To produce multiqc_report.html, I need all FastQC outputs and count files. To produce a count file, I need a sorted BAM. To produce a sorted BAM, I need an unsorted BAM. To produce an unsorted BAM, I need trimmed FASTQs.” This backward resolution is called target-based scheduling.
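Target-based scheduling can be sketched with a tiny resolver. The rule names below come from this lesson's workflow, but the matching here is by exact file name, whereas Snakemake also matches wildcards; a single sample is shown for brevity:

```python
# output file -> (rule that produces it, input files that rule needs)
rules = {
    "count_matrix.csv":  ("merge_counts",  ["s1.counts.txt"]),
    "s1.counts.txt":     ("featurecounts", ["s1.sorted.bam"]),
    "s1.sorted.bam":     ("samtools_sort", ["s1.unsorted.bam"]),
    "s1.unsorted.bam":   ("hisat2_align",  ["s1_R1_val_1.fq.gz"]),
    "s1_R1_val_1.fq.gz": ("trim_galore",   ["s1_R1.fastq.gz"]),
}

def jobs_for(target, existing=frozenset()):
    """Work backward from a target, collecting the rules that must run."""
    if target in existing or target not in rules:
        return []  # already on disk, or a raw input no rule produces
    rule, inputs = rules[target]
    needed = []
    for inp in inputs:
        needed += jobs_for(inp, existing)
    return needed + [rule]

print(jobs_for("count_matrix.csv"))
# → ['trim_galore', 'hisat2_align', 'samtools_sort', 'featurecounts', 'merge_counts']
```

Passing `existing=frozenset({"s1.sorted.bam"})` returns only `['featurecounts', 'merge_counts']`: with the sorted BAM already on disk, only the downstream steps run. That is the same logic behind Snakemake rerunning only what changed.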

Wildcards expand per sample: The {sample} wildcard in rules like trim_galore means Snakemake creates one instance of that rule for each sample. With 6 samples, you get 6 independent trimming jobs. These can run in parallel because they do not depend on each other.

The merge point: Rules like merge_counts that use expand() to require all samples’ count files create a bottleneck in the DAG. Everything upstream of the merge runs in parallel per sample. Everything downstream runs once on the merged result.

temp() for intermediate files: The temp() directive on the unsorted BAM files tells Snakemake to delete them after the sorted BAM is produced. This saves disk space — unsorted BAMs are transient.

Rerunning only what changed: If you modify the config (e.g., change the trimming quality threshold), Snakemake detects that the trim rule’s parameters changed and reruns only the trimming step and everything downstream. It does not re-run FastQC on the raw reads because that step is unaffected.

💬Why this matters for your career

If you can hand a PI a sample-sheet template and say “fill this in and I will generate the entire pipeline,” you become the person in the lab who makes everyone else more productive. Workflow automation is one of the highest-leverage skills in computational biology. Every sequencing experiment in your lab follows roughly the same pipeline — the only things that change are the sample names, FASTQ paths, and a few parameters. This tool codifies that.


Customize it

Add a variant calling branch

Add an optional variant calling branch to the Snakefile. After alignment and sorting,
add rules for: (1) GATK MarkDuplicates, (2) GATK SplitNCigarReads (RNA-seq specific),
(3) GATK HaplotypeCaller per sample, (4) GATK GenotypeGVCFs to combine per-sample
GVCFs, (5) GATK VariantFiltration with RNA-seq-appropriate filters. Add a
--variant-calling flag to the CLI that enables this branch. The variant calling
rules should run in parallel with featureCounts (both depend on sorted BAMs but
not on each other). Include the GATK memory requirements as resources in each rule.

Add container support

Add Singularity/Apptainer container support to the generated Snakefile. For each
rule, add a container: directive pointing to a BioContainers image. For example:
- fastqc: "docker://biocontainers/fastqc:v0.12.1"
- trim_galore: "docker://quay.io/biocontainers/trim-galore:0.6.10"
- hisat2: "docker://quay.io/biocontainers/hisat2:2.2.1"
- samtools: "docker://quay.io/biocontainers/samtools:1.19"
- subread (featureCounts): "docker://quay.io/biocontainers/subread:2.0.6"
Add a --use-containers flag to the CLI. Include a note in the generated README
about running with snakemake --use-singularity. This makes the workflow fully
reproducible without installing any bioinformatics tools on the host machine.

Convert to Nextflow

Add a --format nextflow flag to the CLI. When set, generate a Nextflow workflow
(main.nf and nextflow.config) instead of a Snakefile. Map each Snakemake rule to
a Nextflow process. Use Nextflow channels for the FASTQ input and the merge step.
Include profiles for local execution and SLURM cluster execution in nextflow.config.
This lets the user choose their preferred workflow engine from the same sample sheet.

Add email notification on completion

Add a --notify flag that accepts an email address. Add an onsuccess and onerror
handler to the generated Snakefile. On success, send an email with the subject
"RNA-seq pipeline complete: N samples processed" and attach the MultiQC report
and DE results summary. On error, send an email with the failed rule name and
log file contents. Use Python's smtplib with the local SMTP server. This is
essential for long-running HPC jobs where you do not want to keep checking
the queue.
💡Version your workflows, not just your code

The best practice is to keep your generated workflow in a separate git repository from the generator tool:

Terminal window
# Generator tool (shared across projects)
git clone your-generator-repo
cd rnaseq-workflow-generator
# Generated workflow (per project)
python -m workflow_generator -s samples.csv -o /projects/drug_study/workflow/
cd /projects/drug_study/workflow/
git init
git add Snakefile config.yaml scripts/
git commit -m "Initial workflow for drug treatment RNA-seq study"

This way each project has its own versioned workflow with project-specific config, and the generator tool evolves independently.


Connecting to the full module

This capstone lesson synthesizes tools from across the module:

Lesson | Tool | How it connects to this workflow
L1: Sequence Dashboard | Browser-based FASTA analysis | Quick sequence QC before designing experiments
L2: CRISPR Guide Designer | React guide RNA tool | Design the guides that create the mutant samples you sequence
L3: Genomics Data Pipeline | Python FASTQ QC | The same QC logic is now automated as the fastqc_raw and fastqc_trimmed rules
L4: Mass Spec Viewer | Dash spectrometry app | Proteomics validation of RNA-seq hits (protein-level confirmation of gene expression changes)
L5: DE Dashboard | Streamlit DE analysis | The de_analysis rule implements the same statistics; the dashboard provides interactive follow-up
L6: Workflow Orchestrator | Snakemake generator | Chains everything into a reproducible, end-to-end pipeline

The progression is deliberate: single HTML file (L1) to React app (L2) to CLI tool (L3) to web application (L4) to data analysis dashboard (L5) to workflow orchestrator (L6). Each lesson introduces a new software pattern while building on the bioinformatics context from the previous ones.


Key takeaways

  • Workflow managers turn manual multi-step analyses into reproducible, automated pipelines: instead of running 8 tools in sequence with copy-pasted paths, you declare the dependencies and let Snakemake handle execution.
  • The DAG is the core concept: every workflow is a directed acyclic graph. Understanding which rules depend on which outputs is the key to debugging. If you can draw the DAG on paper, you can write (or generate) the Snakefile.
  • Generating workflows from templates is more flexible than hardcoding: a Jinja2 template that accepts parameters from a config produces different Snakefiles for different experiments. The generator tool is reusable; the generated workflow is project-specific.
  • Config files make pipelines reproducible: all parameters (genome path, trimming threshold, DE thresholds) live in config.yaml, not hardcoded in the Snakefile. Change the config to rerun with different settings. Check the config into version control for reproducibility.
  • The same Snakefile runs locally and on a cluster: snakemake -j 4 for local, snakemake -j 100 --cluster "sbatch ..." for SLURM. No code changes required — only the execution command changes.
  • Separate the generator from the generated workflow: the generator is a reusable Python tool. The generated Snakefile is project-specific. Version them separately.

What you’ve built in this module

Across six lessons, you have built:

  1. A sequence analysis dashboard (L1) — single HTML file, zero dependencies, runs on any lab computer.
  2. A CRISPR guide RNA designer (L2) — React + Vite app with scoring, filtering, and oligo export.
  3. A genomics QC pipeline (L3) — Python CLI tool with config-driven FASTQ processing and HTML reports.
  4. A mass spectrometry data viewer (L4) — Dash web app with peak detection, multi-spectrum overlay, and publication-quality export.
  5. An RNA-seq DE dashboard (L5) — Streamlit app with normalization, statistical testing, volcano/MA/heatmap visualization.
  6. A reproducible workflow orchestrator (L6) — Python CLI that generates Snakemake pipelines from a sample sheet, chaining QC, alignment, counting, and DE analysis into an end-to-end reproducible workflow.

Each tool followed the same pattern: see what is possible, get the exact prompt, run it, customize it. Each tool is something you can use in your actual research, share with your lab, and extend as your needs evolve.

The progression was deliberate. You started with a browser tool that takes 15 minutes to build, and finished with a workflow orchestrator that generates complete analysis pipelines. The complexity increased, but the build pattern stayed the same: describe what you want, let the LLM build it, verify the output, customize iteratively.

The LLM wrote the code. You made the scientific decisions: what analysis to run, what parameters matter, what thresholds to set, how to interpret the results. That combination — domain expertise plus AI-powered tooling — is the skill this module teaches. It is a skill that makes labs more productive, research more reproducible, and scientists more capable. And it is a skill that compounds: every tool you build teaches you to build the next one faster and better.


Portfolio suggestion

This workflow orchestrator is the capstone portfolio piece for the module. To make it maximally effective:

  1. Show the full stack: include screenshots or links to tools from L1 through L6. Demonstrate that you can build at every level — from a single HTML file to a pipeline orchestrator.
  2. Include a generated Snakefile and DAG image: a DAG visualization is immediately impressive to anyone in bioinformatics. Label the image with the pipeline steps.
  3. If you ran the pipeline on real data: include the MultiQC report and a DE volcano plot. Even running on a small test dataset demonstrates end-to-end functionality.
  4. Write a summary document covering: the problem (manual, non-reproducible RNA-seq analysis), the solution (automated Snakemake generation), and the impact (hours saved per experiment, reproducibility achieved). This is the kind of document that makes an impression in a lab interview or a grant application.
🔍Advanced: Building a web interface for the workflow generator

For labs where not everyone is comfortable with the command line, you can wrap the generator in a web interface:

Create a Streamlit web application that wraps the workflow generator. The user
uploads a sample-sheet CSV, fills in parameters (genome, GTF, aligner, thresholds)
via form widgets, and clicks "Generate Workflow." The app calls the generator
functions from workflow_generator/ and provides a download button for a ZIP file
containing the Snakefile, config.yaml, scripts/, and README. Also display the
generated Snakefile in a code block for review before downloading. This lets
non-CLI users generate reproducible workflows without touching the terminal.

This is a practical addition for core facilities and shared computing environments where PIs or technicians need to set up workflows without CLI expertise.


KNOWLEDGE CHECK

Your Snakemake workflow has 6 samples. The pipeline runs FastQC, trimming, alignment, and featureCounts per sample, then merges all counts into one matrix. You run snakemake -j 6 (6 parallel jobs). Which statement about the execution is correct?


Try it yourself

  1. Generate the workflow orchestrator with the prompt above.
  2. Run the generator with the example sample sheet and inspect the generated Snakefile. Does the rule structure make sense?
  3. Open the generated config.yaml. Are all parameters present and correctly populated from the sample sheet?
  4. If you have Snakemake installed, run snakemake -n in the generated workflow directory to verify the dry run completes without errors.
  5. Generate the DAG visualization and examine the dependency graph.
  6. Try changing a parameter in config.yaml (e.g., trim_quality from 20 to 30) and run snakemake -n again. Does Snakemake correctly identify which rules need to be rerun?
  7. Pick one customization from the list above and add it with a follow-up prompt.
  8. Celebrate. You have built six bioinformatics tools in this module — from a single HTML file to a full pipeline orchestrator.