Reproducible RNA-Seq Workflow Orchestrator
What you'll learn
~45 min

- Build a Snakemake workflow chaining QC, alignment, counting, and differential expression analysis
- Generate a Snakefile from a sample-sheet CSV
- Troubleshoot DAG dependencies and file I/O errors
- Extend the workflow with config-driven parameterization
What you’re building
In Lesson 3, you built a FASTQ QC pipeline. In Lesson 5, you built a differential expression dashboard. In a real RNA-seq experiment, these are two steps in a much longer chain: raw reads go through QC, trimming, alignment, counting, and finally differential expression. Running each step manually — copying output paths, checking that files exist, restarting failed steps — is tedious, error-prone, and not reproducible.
Snakemake solves this. It is a workflow management system built for bioinformatics. You define rules (each step in the pipeline), declare what files each rule needs and produces, and Snakemake figures out the execution order, parallelizes independent steps, and reruns only what changed. It is the dominant workflow tool in academic bioinformatics, used in thousands of published papers.
In this lesson you will build a Python CLI tool that generates complete Snakemake workflows from a sample-sheet CSV. Give it a CSV listing your samples, FASTQ paths, and conditions, and it produces a ready-to-run Snakefile plus a config.yaml with all parameters. The workflow chains FastQC, Trim Galore, HISAT2 alignment, featureCounts, MultiQC, and a DE analysis step that connects directly to Lesson 5’s dashboard.
This is the capstone lesson for the module. It ties together everything you have built into a single reproducible pipeline that goes from raw FASTQ files to a list of differentially expressed genes.
Define inputs → declare dependencies → let the engine handle execution order, parallelism, and failure recovery. This pattern appears everywhere: CI/CD pipelines (GitHub Actions), data engineering (Airflow, dbt), infrastructure (Terraform). The syntax changes; the concept of declaring “what depends on what” is universal.
Snakemake has built-in cluster support. Instead of snakemake -j 4 (4 local cores), use snakemake -j 100 --cluster "sbatch -p compute -n 1 -t 4:00:00" to submit each rule as a separate SLURM job. The workflow generator you are building will include this command in its output README. No code changes are needed to go from laptop to cluster.
The showcase
The finished CLI tool will:
- Read a sample-sheet CSV with columns: sample_id, fastq_r1, fastq_r2, condition, replicate.
- Generate a complete Snakefile with rules for every step in the RNA-seq pipeline:
- FastQC — quality assessment of raw reads
- Trim Galore — adapter trimming and quality filtering
- FastQC (post-trim) — quality assessment of trimmed reads
- HISAT2 alignment — splice-aware alignment to a reference genome
- SAMtools sort/index — BAM file processing
- featureCounts — gene-level read counting
- Count matrix merge — combine per-sample counts into a single matrix
- MultiQC — aggregate all QC reports into one HTML summary
- DE analysis — run the t-test/FDR pipeline from Lesson 5 on the merged count matrix
- Generate a config.yaml with all tunable parameters: genome reference path, annotation GTF path, trimming quality threshold, alignment parameters, featureCounts strand setting, DE thresholds.
- Generate a DAG visualization command so you can see the workflow graph before running.
- Generate a README with exact commands for dry-run, local execution, and cluster execution.
The tool does not run the pipeline itself — it generates the Snakefile and config that you then execute with snakemake. This separation is intentional: the generator runs anywhere (your laptop, a login node), and the generated workflow runs wherever your data and tools live (HPC, cloud, local workstation).
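To make the sample-sheet validation concrete, here is a minimal sketch of what the parser's checks amount to. The function name, return shape, and messages are hypothetical, not the tool's actual API; it only covers the rules listed above:

```python
import csv
import io
import re

def validate_sample_sheet(csv_text):
    """Return (errors, warnings) for a sample-sheet CSV. Hypothetical sketch."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    errors, warnings = [], []

    # sample_ids must be unique and alphanumeric/underscore only
    ids = [r["sample_id"] for r in rows]
    if len(ids) != len(set(ids)):
        errors.append("duplicate sample_id")
    for r in rows:
        if not re.fullmatch(r"[A-Za-z0-9_]+", r["sample_id"]):
            errors.append(f"invalid sample_id: {r['sample_id']!r}")
        if not r["fastq_r1"]:
            errors.append(f"{r['sample_id']}: missing fastq_r1")

    # DE analysis needs >= 2 conditions; < 2 replicates is a warning
    conditions = {}
    for r in rows:
        conditions.setdefault(r["condition"], []).append(r["sample_id"])
    if len(conditions) < 2:
        errors.append("need at least 2 conditions for DE analysis")
    for cond, members in conditions.items():
        if len(members) < 2:
            warnings.append(f"condition {cond!r} has fewer than 2 replicates")
    return errors, warnings

sheet = """sample_id,fastq_r1,fastq_r2,condition,replicate
Control_1,/data/c1_R1.fq.gz,/data/c1_R2.fq.gz,control,1
Treated_1,/data/t1_R1.fq.gz,/data/t1_R2.fq.gz,treated,1
"""
errs, warns = validate_sample_sheet(sheet)
print(errs)        # []
print(len(warns))  # 2: each condition has only one replicate
```

Warnings (missing files, few replicates) let the generator proceed; errors (duplicate IDs, missing paths) should stop it before an invalid Snakefile is written.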
🔍 Snakemake vs. Nextflow
Both are excellent workflow managers for bioinformatics. Here is when to choose each:
Choose Snakemake if:
- Your lab primarily uses Python
- You want to run workflows locally first, then scale to a cluster with minimal changes
- You are in an academic environment where Snakemake is the de facto standard
- You need tight integration with conda environments (Snakemake has built-in `conda:` directives)
Choose Nextflow if:
- Your lab or institute has standardized on Nextflow (common in large genomics centers)
- You need cloud-native execution on AWS Batch or Google Cloud Life Sciences
- You want to use nf-core community pipelines out of the box
- You prefer Groovy-based DSL over Python-based DSL
Both tools solve the same problem. This lesson uses Snakemake because it is more Python-native, and because you have been writing Python tools throughout this module. The Customize section includes a prompt to convert the generated workflow to Nextflow if your lab uses that instead.
The prompt
Open your AI CLI tool (such as Claude Code, Gemini CLI, or your preferred tool) in an empty directory and paste:
```
Create a Python CLI tool called rnaseq-workflow-generator that reads a sample-sheet
CSV and generates a complete Snakemake workflow for RNA-seq analysis. Use Python 3.10+
with Jinja2 for templating and PyYAML for config generation.

PROJECT STRUCTURE:
rnaseq-workflow-generator/
├── workflow_generator/
│   ├── __init__.py
│   ├── cli.py                 # argparse CLI entry point
│   ├── sample_sheet.py        # sample sheet CSV parser and validator
│   ├── snakefile_builder.py   # Snakemake rule generation using Jinja2
│   ├── config_builder.py      # config.yaml generation
│   └── templates/
│       ├── Snakefile.j2       # Jinja2 template for the Snakefile
│       └── config.yaml.j2     # Jinja2 template for config.yaml
├── example_sample_sheet.csv   # example input with 6 samples
├── requirements.txt           # jinja2, pyyaml
├── setup.py                   # pip install -e . support
└── README.md

CLI INTERFACE:
  python -m workflow_generator --sample-sheet samples.csv --output-dir workflow/
  python -m workflow_generator --sample-sheet samples.csv --output-dir workflow/ \
      --genome /ref/mm10/genome.fa --gtf /ref/mm10/genes.gtf --aligner hisat2

OPTIONS:
  --sample-sheet, -s   Path to sample sheet CSV (required)
  --output-dir, -o     Directory for generated workflow files (required, created if missing)
  --genome, -g         Path to genome FASTA (default: /ref/genome.fa, user must edit)
  --gtf                Path to gene annotation GTF (default: /ref/genes.gtf)
  --aligner            Alignment tool: hisat2 or star (default: hisat2)
  --strandedness       Library strandedness: unstranded, forward, reverse (default: reverse)
  --trim-quality       Trim Galore quality threshold (default: 20)
  --min-mapq           Minimum mapping quality for featureCounts (default: 10)
  --fdr-threshold      DE analysis FDR threshold (default: 0.05)
  --lfc-threshold      DE analysis log2FC threshold (default: 1.0)
  --threads            Threads per rule (default: 4)
  --no-de              Skip the DE analysis step

SAMPLE SHEET FORMAT (sample_sheet.py):
Parse and validate a CSV with these columns:
  - sample_id: unique identifier (alphanumeric + underscore, no spaces)
  - fastq_r1: path to R1 FASTQ file (absolute path or relative to workflow dir)
  - fastq_r2: path to R2 FASTQ file (for paired-end; blank for single-end)
  - condition: experimental group name (e.g., "control", "treated")
  - replicate: replicate number within the condition (e.g., 1, 2, 3)

Validation rules:
  - All sample_ids must be unique
  - All FASTQ paths must be non-empty (warn if files do not exist at generation time)
  - At least 2 conditions must be present for DE analysis
  - At least 2 replicates per condition recommended (warn if fewer)
  - Reject sample_ids with spaces or special characters

EXAMPLE SAMPLE SHEET (example_sample_sheet.csv):
sample_id,fastq_r1,fastq_r2,condition,replicate
Control_1,/data/fastq/ctrl1_R1.fastq.gz,/data/fastq/ctrl1_R2.fastq.gz,control,1
Control_2,/data/fastq/ctrl2_R1.fastq.gz,/data/fastq/ctrl2_R2.fastq.gz,control,2
Control_3,/data/fastq/ctrl3_R1.fastq.gz,/data/fastq/ctrl3_R2.fastq.gz,control,3
Treated_1,/data/fastq/trt1_R1.fastq.gz,/data/fastq/trt1_R2.fastq.gz,treated,1
Treated_2,/data/fastq/trt2_R1.fastq.gz,/data/fastq/trt2_R2.fastq.gz,treated,2
Treated_3,/data/fastq/trt3_R1.fastq.gz,/data/fastq/trt3_R2.fastq.gz,treated,3

SNAKEFILE TEMPLATE (Snakefile.j2):
Generate a Snakemake workflow with these rules in dependency order:

configfile: "config.yaml"

SAMPLES = config["samples"]
CONDITIONS = config["conditions"]
GENOME = config["genome"]
GTF = config["gtf"]

rule all:
    input:
        "results/multiqc/multiqc_report.html",
        "results/counts/count_matrix.csv",
        {% if not no_de %}"results/de/de_results.csv"{% endif %}

rule fastqc_raw:
    input: get R1 and R2 paths from config for each sample
    output:
        "results/fastqc_raw/{sample}_R1_fastqc.html",
        "results/fastqc_raw/{sample}_R2_fastqc.html"
    threads: 2
    shell: "fastqc {input} -o results/fastqc_raw/ -t {threads}"

rule trim_galore:
    input: R1 and R2 from config
    output:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    params: quality = config["trim_quality"]
    threads: 4
    shell:
        """
        trim_galore --paired --quality {params.quality} \
            --output_dir results/trimmed/ \
            --cores {threads} {input.r1} {input.r2}
        """

rule fastqc_trimmed:
    input:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    output: "results/fastqc_trimmed/{sample}_R1_val_1_fastqc.html"
    threads: 2
    shell: "fastqc {input} -o results/fastqc_trimmed/ -t {threads}"

rule hisat2_align:
    input:
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R2_val_2.fq.gz"
    output: temp("results/aligned/{sample}.unsorted.bam")
    params: index = config["hisat2_index"]
    threads: config["threads"]
    log: "logs/hisat2/{sample}.log"
    shell:
        """
        hisat2 -x {params.index} -1 {input[0]} -2 {input[1]} \
            --rna-strandness RF --threads {threads} 2> {log} | \
            samtools view -bS - > {output}
        """
(If --aligner star, generate a STAR rule instead with appropriate parameters)

rule samtools_sort:
    input: "results/aligned/{sample}.unsorted.bam"
    output: "results/aligned/{sample}.sorted.bam"
    threads: 4
    shell: "samtools sort -@ {threads} -o {output} {input}"

rule samtools_index:
    input: "results/aligned/{sample}.sorted.bam"
    output: "results/aligned/{sample}.sorted.bam.bai"
    shell: "samtools index {input}"

rule featurecounts:
    input:
        bam = "results/aligned/{sample}.sorted.bam",
        bai = "results/aligned/{sample}.sorted.bam.bai"
    output: "results/counts/{sample}.counts.txt"
    params:
        strand = config["featurecounts_strand"],
        gtf = GTF
    threads: 2
    shell:
        """
        featureCounts -a {params.gtf} -o {output} \
            -s {params.strand} -p --countReadPairs \
            -T {threads} -Q {config[min_mapq]} {input.bam}
        """

rule merge_counts:
    input: expand("results/counts/{sample}.counts.txt", sample=SAMPLES)
    output: "results/counts/count_matrix.csv"
    script: "scripts/merge_counts.py"
(Generate a scripts/merge_counts.py that reads featureCounts output files, extracts
the gene name and count columns, and merges into a single CSV with genes as rows
and samples as columns)

rule multiqc:
    input:
        expand("results/fastqc_raw/{sample}_R1_fastqc.html", sample=SAMPLES),
        expand("results/fastqc_trimmed/{sample}_R1_val_1_fastqc.html", sample=SAMPLES),
        expand("results/counts/{sample}.counts.txt", sample=SAMPLES)
    output: "results/multiqc/multiqc_report.html"
    shell: "multiqc results/ -o results/multiqc/ --force"

rule de_analysis:
    input: "results/counts/count_matrix.csv"
    output:
        "results/de/de_results.csv",
        "results/de/volcano.html"
    params:
        fdr = config["fdr_threshold"],
        lfc = config["lfc_threshold"],
        conditions = CONDITIONS
    script: "scripts/de_analysis.py"
(Generate a scripts/de_analysis.py that implements the same t-test + BH correction
from Lesson 5, reads the count matrix, performs DE analysis, saves results CSV, and
generates a standalone Plotly volcano plot HTML)

CONFIG TEMPLATE (config.yaml.j2):
Generate a YAML config with all parameters from the CLI plus:
samples:
  Control_1:
    r1: /data/fastq/ctrl1_R1.fastq.gz
    r2: /data/fastq/ctrl1_R2.fastq.gz
    condition: control
  Control_2: ...
conditions:
  control: [Control_1, Control_2, Control_3]
  treated: [Treated_1, Treated_2, Treated_3]
genome: /ref/genome.fa
gtf: /ref/genes.gtf
hisat2_index: /ref/hisat2_index/genome
threads: 4
trim_quality: 20
min_mapq: 10
featurecounts_strand: 2   # 2 = reverse strand for dUTP libraries
fdr_threshold: 0.05
lfc_threshold: 1.0

OUTPUT STRUCTURE:
When the generator runs, it produces:
workflow/
├── Snakefile
├── config.yaml
├── scripts/
│   ├── merge_counts.py
│   └── de_analysis.py
└── README.md   (with commands for: dry-run, local run, cluster run, DAG viz)

The README should include:
# Quick start
snakemake -n                            # dry run (check the plan)
snakemake -j 4                          # run locally with 4 cores
snakemake --dag | dot -Tpng > dag.png   # visualize the workflow DAG

# HPC cluster (SLURM)
snakemake -j 100 --cluster "sbatch -p compute -n 1 -c {threads} -t 4:00:00"

# With conda environments per rule
snakemake -j 4 --use-conda

Generate all files with complete implementations. The generated Snakefile should be
a valid, runnable Snakemake workflow (assuming the reference genome and tools are
installed). The CLI tool itself only needs jinja2 and pyyaml as dependencies.
```

The CLI tool you are building (the generator) runs on any machine with Python — your laptop, a login node, anywhere. It reads a CSV and writes files. The generated Snakefile requires bioinformatics tools (FastQC, Trim Galore, HISAT2, SAMtools, featureCounts, MultiQC), which are typically installed on HPC clusters or bioinformatics workstations. You do not need these tools installed to build and test the generator. You can verify the generated Snakefile with snakemake -n (dry run) as long as Snakemake itself is installed.
What you get
After generation, set up the generator:
```shell
cd rnaseq-workflow-generator
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e .
```

Generate a workflow from the example sample sheet:

```shell
python -m workflow_generator \
    --sample-sheet example_sample_sheet.csv \
    --output-dir my_workflow/ \
    --genome /ref/mm10/genome.fa \
    --gtf /ref/mm10/genes.gtf
```

Expected output:

```
my_workflow/
├── Snakefile             (~150-250 lines)
├── config.yaml           (~40-60 lines)
├── scripts/
│   ├── merge_counts.py   (~40-60 lines)
│   └── de_analysis.py    (~100-150 lines)
└── README.md
```

First run walkthrough
- Open `my_workflow/Snakefile` and read through the rules. Each rule should have clear `input:`, `output:`, and `shell:` (or `script:`) sections. The rule names should read like a protocol: `fastqc_raw`, `trim_galore`, `hisat2_align`, `samtools_sort`, `featurecounts`, `merge_counts`, `multiqc`, `de_analysis`.
- Open `my_workflow/config.yaml`. Verify that your sample information is correct: each sample has R1/R2 paths, condition, and replicate. Check that the genome and GTF paths match your local setup.
- If you have Snakemake installed, run a dry run:

```shell
cd my_workflow
snakemake -n
```

This shows every job that would run, in dependency order, without actually executing anything. You should see jobs listed for every sample at every step: 6 samples × FastQC = 6 jobs, 6 × trim = 6 jobs, 6 × align = 6 jobs, and so on, plus 1 merge, 1 MultiQC, and 1 DE analysis.

- Visualize the DAG (directed acyclic graph):

```shell
snakemake --dag | dot -Tpng > dag.png
```

Open dag.png. You should see a tree-like structure where each sample's processing chain is independent until the merge_counts rule, which takes all per-sample count files as input. The de_analysis rule depends on merge_counts, and multiqc depends on all FastQC and featureCounts outputs.

- Check `scripts/de_analysis.py`. It should implement the same normalization, t-test, and BH correction from Lesson 5, producing both a results CSV and a standalone volcano plot HTML that you can open directly in a browser.
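If you want to sanity-check the statistics before trusting the generated script, the Benjamini-Hochberg step it should contain can be reproduced in a few lines of pure Python. This is a sketch of the procedure, not the generated de_analysis.py itself:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR adjustment. Returns q-values in input order."""
    m = len(pvalues)
    # Sort indices by p-value; walking from the largest p down, each q-value
    # is min(p * m / rank, q of the next-larger p), which enforces monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end   # 1-based rank of this p-value
        prev = min(prev, pvalues[i] * m / rank)
        q[i] = prev
    return q

qvals = bh_adjust([0.01, 0.04, 0.03, 0.005])
print([round(v, 4) for v in qvals])   # [0.02, 0.04, 0.04, 0.02]
```

Compare a few genes' q-values from the generated pipeline against this reference implementation; if they agree, the DE step is wired up correctly.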
A Snakefile is a complete, executable record of your analysis pipeline. Combined with the config.yaml, it captures every parameter, every tool, and every dependency. When a reviewer asks “how did you process your RNA-seq data?”, you can point to the Snakefile.
Put both files under version control:
```shell
cd my_workflow
git init
git add Snakefile config.yaml scripts/
git commit -m "RNA-seq workflow for project X"
```

Now your entire analysis is reproducible. Clone the repo on a different machine, edit the paths in config.yaml, run snakemake -j 4, and you get identical results. This is the standard that journals like Nature Methods and Genome Biology increasingly require.
Worked example: From raw FASTQ to DE results in one pipeline
Here is a complete walkthrough of using the generated workflow on a real dataset.
Step 1. You receive data from your sequencing core. The core provides a sample manifest and a directory of FASTQ files:
```
/sequencing/run_2024_03/
├── ctrl1_R1.fastq.gz
├── ctrl1_R2.fastq.gz
├── ctrl2_R1.fastq.gz
├── ctrl2_R2.fastq.gz
├── ctrl3_R1.fastq.gz
├── ctrl3_R2.fastq.gz
├── trt1_R1.fastq.gz
├── trt1_R2.fastq.gz
├── trt2_R1.fastq.gz
├── trt2_R2.fastq.gz
├── trt3_R1.fastq.gz
└── trt3_R2.fastq.gz
```

Step 2. Create a sample sheet CSV from the core's manifest:

```
sample_id,fastq_r1,fastq_r2,condition,replicate
Control_1,/sequencing/run_2024_03/ctrl1_R1.fastq.gz,/sequencing/run_2024_03/ctrl1_R2.fastq.gz,control,1
Control_2,/sequencing/run_2024_03/ctrl2_R1.fastq.gz,/sequencing/run_2024_03/ctrl2_R2.fastq.gz,control,2
Control_3,/sequencing/run_2024_03/ctrl3_R1.fastq.gz,/sequencing/run_2024_03/ctrl3_R2.fastq.gz,control,3
Treated_1,/sequencing/run_2024_03/trt1_R1.fastq.gz,/sequencing/run_2024_03/trt1_R2.fastq.gz,treated,1
Treated_2,/sequencing/run_2024_03/trt2_R1.fastq.gz,/sequencing/run_2024_03/trt2_R2.fastq.gz,treated,2
Treated_3,/sequencing/run_2024_03/trt3_R1.fastq.gz,/sequencing/run_2024_03/trt3_R2.fastq.gz,treated,3
```

Step 3. Generate the workflow:

```shell
python -m workflow_generator \
    --sample-sheet samples.csv \
    --output-dir rnaseq_analysis/ \
    --genome /ref/mm10/genome.fa \
    --gtf /ref/mm10/genes.gtf \
    --aligner hisat2 \
    --strandedness reverse \
    --threads 8
```

Step 4. Inspect and dry-run:

```shell
cd rnaseq_analysis
snakemake -n   # verify the plan
```

Step 5. Run the pipeline:

```shell
# Local execution (workstation with 32 cores)
snakemake -j 8

# OR cluster execution (SLURM)
snakemake -j 100 --cluster "sbatch -p compute -n 1 -c {threads} -t 4:00:00 --mem=32G"
```

Step 6. Check results:

```shell
# Open the MultiQC report -- aggregated QC across all samples
open results/multiqc/multiqc_report.html

# Open the DE volcano plot
open results/de/volcano.html

# Check the DE results
head results/de/de_results.csv
```

Step 7. For interactive exploration, upload the count matrix to Lesson 5's Streamlit dashboard:

```shell
cd ../rnaseq-de-dashboard
streamlit run app.py
# Upload results/counts/count_matrix.csv
```

The generated workflow produces results/counts/count_matrix.csv — exactly the format that the Lesson 5 DE dashboard expects. This is intentional. The pipeline gives you the statistical results (CSV and volcano plot). The dashboard gives you interactive exploration (adjust thresholds, zoom into genes, generate heatmaps). Use both together: the pipeline for reproducible batch processing, the dashboard for interactive follow-up analysis.
When things go wrong
Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.
Understanding Snakemake DAGs
The Snakefile defines rules. Each rule declares inputs, outputs, and a command. Snakemake builds a directed acyclic graph (DAG) — a dependency tree — from these declarations. Understanding the DAG is the key to debugging workflow issues.
How Snakemake resolves dependencies: Starting from the rule all targets, Snakemake works backward. “To produce multiqc_report.html, I need all FastQC outputs and count files. To produce a count file, I need a sorted BAM. To produce a sorted BAM, I need an unsorted BAM. To produce an unsorted BAM, I need trimmed FASTQs.” This backward resolution is called target-based scheduling.
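That backward walk can be modeled in a few lines of Python. This is a toy resolver to build intuition, not Snakemake's actual scheduler; the rule names and file names are illustrative:

```python
# Each "rule" declares what it needs and what it produces; the engine
# derives the execution order from those declarations alone.
rules = {
    "trim":  {"needs": ["raw.fastq"],     "makes": ["trimmed.fastq"]},
    "align": {"needs": ["trimmed.fastq"], "makes": ["aligned.bam"]},
    "count": {"needs": ["aligned.bam"],   "makes": ["counts.txt"]},
}

def plan(target, have):
    """Work backward from a target file to an ordered list of rules to run."""
    for name, rule in rules.items():
        if target in rule["makes"]:
            steps = []
            for dep in rule["needs"]:
                if dep not in have:          # recurse until we hit a file we have
                    steps += plan(dep, have)
            return steps + [name]
    return []   # no rule makes this target: it must already exist on disk

print(plan("counts.txt", have={"raw.fastq"}))   # ['trim', 'align', 'count']
```

Asking for counts.txt while only raw.fastq exists yields trim, then align, then count: the same target-first reasoning Snakemake applies to the rule all inputs.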
Wildcards expand per sample: The {sample} wildcard in rules like trim_galore means Snakemake creates one instance of that rule for each sample. With 6 samples, you get 6 independent trimming jobs. These can run in parallel because they do not depend on each other.
The merge point: Rules like merge_counts that use expand() to require all samples’ count files create a bottleneck in the DAG. Everything upstream of the merge runs in parallel per sample. Everything downstream runs once on the merged result.
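The merge step itself is simple once the DAG delivers all the inputs. A minimal sketch of what the generated merge_counts.py might do (the actual generated script may differ; the assumed featureCounts layout is a '#' comment line, a header row, then one row per gene with the count in the last column):

```python
def parse_featurecounts(text):
    """Extract {gene_id: count} from a featureCounts output table."""
    rows = [line.split("\t") for line in text.splitlines()
            if line and not line.startswith("#")]
    return {row[0]: int(row[-1]) for row in rows[1:]}   # rows[0] is the header

def merge(per_sample):
    """Merge {sample: {gene: count}} into count-matrix rows (genes x samples)."""
    samples = sorted(per_sample)
    genes = sorted({g for counts in per_sample.values() for g in counts})
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return [header] + rows

demo = ("# Program:featureCounts\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
        "GeneA\t1\t1\t100\t+\t100\t42\n")
print(merge({"Control_1": parse_featurecounts(demo)}))
# [['gene_id', 'Control_1'], ['GeneA', 42]]
```

Missing genes default to 0 so the matrix stays rectangular even if a sample's annotation run dropped a gene.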
temp() for intermediate files: The temp() directive on the unsorted BAM files tells Snakemake to delete them after the sorted BAM is produced. This saves disk space — unsorted BAMs are transient.
Rerunning only what changed: If you modify the config (e.g., change the trimming quality threshold), Snakemake detects that the trim rule’s parameters changed and reruns only the trimming step and everything downstream. It does not re-run FastQC on the raw reads because that step is unaffected.
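Snakemake's real re-run logic also tracks parameter, code, and software-environment changes, but the file-timestamp core can be modeled with the standard library. The `needs_rerun` helper below is a hypothetical sketch, not Snakemake's API:

```python
import os
import tempfile
import time

def needs_rerun(output_path, input_paths):
    """True if the output is missing or older than any input (mtime model)."""
    if not os.path.exists(output_path):
        return True
    out_time = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_time for p in input_paths)

with tempfile.TemporaryDirectory() as d:
    inp = os.path.join(d, "reads.fastq")
    out = os.path.join(d, "trimmed.fastq")
    open(inp, "w").close()
    print(needs_rerun(out, [inp]))           # True: output does not exist yet
    open(out, "w").close()
    os.utime(inp, (time.time() + 10,) * 2)   # pretend the input just changed
    print(needs_rerun(out, [inp]))           # True: input newer than output
```

Applied recursively over the DAG, this check is why touching one input cascades re-runs downstream while untouched branches are skipped.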
If you can hand a PI a sample-sheet template and say “fill this in and I will generate the entire pipeline,” you become the person in the lab who makes everyone else more productive. Workflow automation is one of the highest-leverage skills in computational biology. Every sequencing experiment in your lab follows roughly the same pipeline — the only things that change are the sample names, FASTQ paths, and a few parameters. This tool codifies that.
Customize it
Add a variant calling branch
```
Add an optional variant calling branch to the Snakefile. After alignment and sorting,
add rules for: (1) GATK MarkDuplicates, (2) GATK SplitNCigarReads (RNA-seq specific),
(3) GATK HaplotypeCaller per sample, (4) GATK GenotypeGVCFs to combine per-sample
GVCFs, (5) GATK VariantFiltration with RNA-seq-appropriate filters. Add a
--variant-calling flag to the CLI that enables this branch. The variant calling
rules should run in parallel with featureCounts (both depend on sorted BAMs but
not on each other). Include the GATK memory requirements as resources in each rule.
```

Add container support

```
Add Singularity/Apptainer container support to the generated Snakefile. For each
rule, add a container: directive pointing to a BioContainers image. For example:
- fastqc: "docker://biocontainers/fastqc:v0.12.1"
- trim_galore: "docker://quay.io/biocontainers/trim-galore:0.6.10"
- hisat2: "docker://quay.io/biocontainers/hisat2:2.2.1"
- samtools: "docker://quay.io/biocontainers/samtools:1.19"
- subread (featureCounts): "docker://quay.io/biocontainers/subread:2.0.6"
Add a --use-containers flag to the CLI. Include a note in the generated README
about running with snakemake --use-singularity. This makes the workflow fully
reproducible without installing any bioinformatics tools on the host machine.
```

Convert to Nextflow

```
Add a --format nextflow flag to the CLI. When set, generate a Nextflow workflow
(main.nf and nextflow.config) instead of a Snakefile. Map each Snakemake rule to
a Nextflow process. Use Nextflow channels for the FASTQ input and the merge step.
Include profiles for local execution and SLURM cluster execution in nextflow.config.
This lets the user choose their preferred workflow engine from the same sample sheet.
```

Add email notification on completion

```
Add a --notify flag that accepts an email address. Add an onsuccess and onerror
handler to the generated Snakefile. On success, send an email with the subject
"RNA-seq pipeline complete: N samples processed" and attach the MultiQC report
and DE results summary. On error, send an email with the failed rule name and
log file contents. Use Python's smtplib with the local SMTP server. This is
essential for long-running HPC jobs where you do not want to keep checking
the queue.
```

The best practice is to keep your generated workflow in a separate git repository from the generator tool:

```shell
# Generator tool (shared across projects)
git clone your-generator-repo
cd rnaseq-workflow-generator

# Generated workflow (per project)
python -m workflow_generator -s samples.csv -o /projects/drug_study/workflow/
cd /projects/drug_study/workflow/
git init
git add Snakefile config.yaml scripts/
git commit -m "Initial workflow for drug treatment RNA-seq study"
```

This way each project has its own versioned workflow with project-specific config, and the generator tool evolves independently.
Connecting to the full module
This capstone lesson synthesizes tools from across the module:
| Lesson | Tool | How it connects to this workflow |
|---|---|---|
| L1: Sequence Dashboard | Browser-based FASTA analysis | Quick sequence QC before designing experiments |
| L2: CRISPR Guide Designer | React guide RNA tool | Design the guides that create the mutant samples you sequence |
| L3: Genomics Data Pipeline | Python FASTQ QC | The same QC logic is now automated as the fastqc_raw and fastqc_trimmed rules |
| L4: Mass Spec Viewer | Dash spectrometry app | Proteomics validation of RNA-seq hits (protein-level confirmation of gene expression changes) |
| L5: DE Dashboard | Streamlit DE analysis | The de_analysis rule implements the same statistics; the dashboard provides interactive follow-up |
| L6: Workflow Orchestrator | Snakemake generator | Chains everything into a reproducible, end-to-end pipeline |
The progression is deliberate: single HTML file (L1) to React app (L2) to CLI tool (L3) to web application (L4) to data analysis dashboard (L5) to workflow orchestrator (L6). Each lesson introduces a new software pattern while building on the bioinformatics context from the previous ones.
Key takeaways
- Workflow managers turn manual multi-step analyses into reproducible, automated pipelines: instead of running 8 tools in sequence with copy-pasted paths, you declare the dependencies and let Snakemake handle execution.
- The DAG is the core concept: every workflow is a directed acyclic graph. Understanding which rules depend on which outputs is the key to debugging. If you can draw the DAG on paper, you can write (or generate) the Snakefile.
- Generating workflows from templates is more flexible than hardcoding: a Jinja2 template that accepts parameters from a config produces different Snakefiles for different experiments. The generator tool is reusable; the generated workflow is project-specific.
- Config files make pipelines reproducible: all parameters (genome path, trimming threshold, DE thresholds) live in config.yaml, not hardcoded in the Snakefile. Change the config to rerun with different settings. Check the config into version control for reproducibility.
- The same Snakefile runs locally and on a cluster: `snakemake -j 4` for local, `snakemake -j 100 --cluster "sbatch ..."` for SLURM. No code changes required — only the execution command changes.
- Separate the generator from the generated workflow: the generator is a reusable Python tool. The generated Snakefile is project-specific. Version them separately.
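The template-driven takeaway, in miniature: the Snakefile is text with holes, and the sample sheet fills the holes. The real generator uses Jinja2 (which adds loops and conditionals); this dependency-free sketch uses the standard library's string.Template, and the template text is a toy, not the lesson's Snakefile.j2:

```python
from string import Template  # stand-in for Jinja2 in this sketch

# A two-line "Snakefile" template. string.Template substitutes $samples;
# Snakemake's own {sample} wildcard braces pass through untouched.
snakefile_tmpl = Template(
    "SAMPLES = $samples\n"
    "rule all:\n"
    '    input: expand("results/{sample}.counts.txt", sample=SAMPLES)\n'
)

rendered = snakefile_tmpl.substitute(samples=["Control_1", "Treated_1"])
print(rendered)
```

Swap in a different sample list and the same template produces a different, ready-to-run workflow: that is the entire generator pattern.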
What you’ve built in this module
Across six lessons, you have built:
- A sequence analysis dashboard (L1) — single HTML file, zero dependencies, runs on any lab computer.
- A CRISPR guide RNA designer (L2) — React + Vite app with scoring, filtering, and oligo export.
- A genomics QC pipeline (L3) — Python CLI tool with config-driven FASTQ processing and HTML reports.
- A mass spectrometry data viewer (L4) — Dash web app with peak detection, multi-spectrum overlay, and publication-quality export.
- An RNA-seq DE dashboard (L5) — Streamlit app with normalization, statistical testing, volcano/MA/heatmap visualization.
- A reproducible workflow orchestrator (L6) — Python CLI that generates Snakemake pipelines from a sample sheet, chaining QC, alignment, counting, and DE analysis into an end-to-end reproducible workflow.
Each tool followed the same pattern: see what is possible, get the exact prompt, run it, customize it. Each tool is something you can use in your actual research, share with your lab, and extend as your needs evolve.
The progression was deliberate. You started with a browser tool that takes 15 minutes to build, and finished with a workflow orchestrator that generates complete analysis pipelines. The complexity increased, but the build pattern stayed the same: describe what you want, let the LLM build it, verify the output, customize iteratively.
The LLM wrote the code. You made the scientific decisions: what analysis to run, what parameters matter, what thresholds to set, how to interpret the results. That combination — domain expertise plus AI-powered tooling — is the skill this module teaches. It is a skill that makes labs more productive, research more reproducible, and scientists more capable. And it is a skill that compounds: every tool you build teaches you to build the next one faster and better.
Portfolio suggestion
This workflow orchestrator is the capstone portfolio piece for the module. To make it maximally effective:
- Show the full stack: include screenshots or links to tools from L1 through L6. Demonstrate that you can build at every level — from a single HTML file to a pipeline orchestrator.
- Include a generated Snakefile and DAG image: a DAG visualization is immediately impressive to anyone in bioinformatics. Label the image with the pipeline steps.
- If you ran the pipeline on real data: include the MultiQC report and a DE volcano plot. Even running on a small test dataset demonstrates end-to-end functionality.
- Write a summary document covering: the problem (manual, non-reproducible RNA-seq analysis), the solution (automated Snakemake generation), and the impact (hours saved per experiment, reproducibility achieved). This is the kind of document that makes an impression in a lab interview or a grant application.
🔍 Advanced: Building a web interface for the workflow generator
For labs where not everyone is comfortable with the command line, you can wrap the generator in a web interface:
```
Create a Streamlit web application that wraps the workflow generator. The user
uploads a sample-sheet CSV, fills in parameters (genome, GTF, aligner, thresholds)
via form widgets, and clicks "Generate Workflow." The app calls the generator
functions from workflow_generator/ and provides a download button for a ZIP file
containing the Snakefile, config.yaml, scripts/, and README. Also display the
generated Snakefile in a code block for review before downloading. This lets
non-CLI users generate reproducible workflows without touching the terminal.
```

This is a practical addition for core facilities and shared computing environments where PIs or technicians need to set up workflows without CLI expertise.
Quick check: your Snakemake workflow has 6 samples. The pipeline runs FastQC, trimming, alignment, and featureCounts per sample, then merges all counts into one matrix. You run snakemake -j 6 (6 parallel jobs). Which rules can run in parallel, and where does the DAG force everything to wait? (Hint: the per-sample chains are independent of each other, but merge_counts cannot start until every sample's count file exists.)
Try it yourself
- Generate the workflow orchestrator with the prompt above.
- Run the generator with the example sample sheet and inspect the generated Snakefile. Does the rule structure make sense?
- Open the generated config.yaml. Are all parameters present and correctly populated from the sample sheet?
- If you have Snakemake installed, run `snakemake -n` in the generated workflow directory to verify the dry run completes without errors.
- Generate the DAG visualization and examine the dependency graph.
- Try changing a parameter in config.yaml (e.g., trim_quality from 20 to 30) and run `snakemake -n` again. Does Snakemake correctly identify which rules need to be rerun?
- Pick one customization from the list above and add it with a follow-up prompt.
- Celebrate. You have built six bioinformatics tools in this module — from a single HTML file to a full pipeline orchestrator.