Applied Module 12 · AI-Powered Bioinformatics Tools

RNA-Seq Differential Expression Dashboard

What you'll learn

~35 min
  • Build an interactive RNA-seq differential expression dashboard from count matrix CSV
  • Understand log2 fold change, statistical testing, and Benjamini-Hochberg correction
  • Generate volcano, MA, and clustered heatmap plots with Plotly
  • Troubleshoot count matrix formatting and zero-count gene handling

What you’re building

Differential expression analysis is the single most requested service in university bioinformatics cores. A researcher sequences RNA from two conditions — treated vs. control, mutant vs. wild-type, tumor vs. normal — and the core question is always the same: which genes changed?

The standard tools (DESeq2, edgeR) are R packages that require writing R scripts, managing Bioconductor installations, and debugging cryptic error messages about factor levels. They are statistically rigorous and essential for publication. But for a first look at your data — “did the experiment work?” — you need something faster.

In this lesson you will build an interactive RNA-seq differential expression dashboard using Python and Streamlit. Upload a count matrix CSV, pick your conditions, and get volcano plots, MA plots, and clustered heatmaps in your browser within seconds. It includes a built-in sample data generator so you can test immediately without real data.

This is a different framework from Lesson 4’s Dash application. Streamlit is simpler — fewer callbacks, less boilerplate, faster to prototype. It is increasingly popular in bioinformatics for exactly this kind of quick-exploration tool.

Software pattern: Upload, analyze, visualize, export

Accept tabular data → run statistical analysis → generate interactive plots → export significant results. This pattern works for any hypothesis testing workflow: A/B testing in marketing, clinical trial analysis, environmental monitoring. The statistics and plots change; the architecture does not.

💡Running on HPC?

If you are working on a remote server or HPC cluster, use a conda environment instead of venv for easier dependency management. For the Streamlit web interface, use SSH port forwarding (ssh -L 8501:localhost:8501 user@server) to view the dashboard in your local browser.


The showcase

The finished application will provide:

  • CSV upload panel: drag-and-drop a count matrix (genes as rows, samples as columns) or click to generate simulated data.
  • Sample data generator: creates a realistic mouse liver RNA-seq dataset with 2 conditions (treated/control), 3 replicates each, ~15,000 genes, and ~500 truly differentially expressed genes.
  • Condition assignment: select which columns belong to each condition via multiselect dropdowns.
  • Library size normalization: median-of-ratios normalization (the same approach DESeq2 uses internally).
  • Statistical testing: per-gene t-test with Benjamini-Hochberg FDR correction. Optionally, a negative binomial approximation for more accurate RNA-seq modeling.
  • Interactive volcano plot: log2 fold change (x-axis) vs. -log10(adjusted p-value) (y-axis), with color-coded significance thresholds. Hover any point to see the gene name, fold change, and p-value.
  • MA plot: mean expression (x-axis) vs. log2 fold change (y-axis). Highlights genes that are significant at the selected FDR threshold.
  • Clustered heatmap: top N differentially expressed genes, with hierarchical clustering on both genes and samples.
  • Results table: sortable, filterable table of all genes with log2FC, p-value, adjusted p-value, mean expression, and significance flag.
  • CSV export: download the significant gene list as a CSV ready for pathway analysis tools (DAVID, Enrichr, g:Profiler).
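To make the sample data generator less of a black box, here is a rough numpy sketch of how negative binomial counts with known fold changes could be simulated. The function name `simulate_counts`, the smaller default gene count, and the constant dispersion are assumptions for illustration; the generated `sample_data.py` will differ, and this sketch leaves out the always-zero genes and the ground truth table.

```python
import numpy as np

def simulate_counts(n_genes=1000, n_up=20, n_down=15, seed=0):
    """Simulate a genes x 6 count matrix (3 control, 3 treated samples)."""
    rng = np.random.default_rng(seed)
    # Per-gene base mean: log-normal with median ~200, clipped to 5-50000
    base_mean = np.clip(
        rng.lognormal(mean=np.log(200), sigma=1.5, size=n_genes), 5, 50000
    )
    # True fold change in treated samples: first n_up genes up, next n_down down
    fold = np.ones(n_genes)
    fold[:n_up] = rng.uniform(2, 8, size=n_up)
    fold[n_up:n_up + n_down] = 1 / rng.uniform(2, 6, size=n_down)
    # Assumed constant dispersion; the prompt varies it with expression level
    dispersion = 0.02
    size = 1 / dispersion
    # Per-sample scaling factor to mimic different library sizes
    lib_scale = rng.uniform(0.8, 1.2, size=6)
    # Expected counts: controls use base_mean, treated use base_mean * fold
    mu = np.column_stack([base_mean] * 3 + [base_mean * fold] * 3) * lib_scale
    # numpy parameterizes the NB by (n, p); p = n / (n + mu) gives mean mu
    counts = rng.negative_binomial(size, size / (size + mu))
    return counts, fold

counts, true_fold = simulate_counts()
print(counts.shape)
```

With this parameterization the variance of each count is mu + dispersion * mu^2, which is the overdispersion the prompt asks for.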

The prompt

Open your AI CLI tool (such as Claude Code, Gemini CLI, or your preferred tool) in an empty directory and paste:

Create a Python Streamlit application for RNA-seq differential expression analysis.
Call it rnaseq-de-dashboard.
PROJECT STRUCTURE:
rnaseq-de-dashboard/
├── app.py # main Streamlit application
├── de_analysis.py # normalization, statistical testing, FDR correction
├── visualization.py # volcano plot, MA plot, heatmap with Plotly
├── sample_data.py # simulated count matrix generator
├── requirements.txt # streamlit, pandas, numpy, scipy, plotly, scikit-learn
└── README.md
SAMPLE DATA GENERATOR (sample_data.py):
Generate a realistic simulated RNA-seq count matrix:
- 15,000 genes (named Gene_0001 through Gene_15000)
- 6 samples: Control_1, Control_2, Control_3, Treated_1, Treated_2, Treated_3
- Simulates mouse liver RNA-seq with realistic count distributions:
- Per-gene base expression: mean drawn from log-normal (median ~200, range 5-50000),
dispersion varying inversely with expression (0.005-0.035) to model realistic RNA-seq overdispersion
- 300 genes upregulated in treated (multiply counts by 2x to 8x fold change)
- 200 genes downregulated in treated (divide counts by 2x to 6x fold change)
- 5% of genes have zero counts across all samples (not expressed)
- Add biological variability: per-sample scaling factor (0.8 to 1.2) to simulate
different library sizes
- Return a pandas DataFrame with gene names as index, sample names as columns
- Also return a ground truth DataFrame listing which genes are truly DE and their
true fold changes (for benchmarking)
DE ANALYSIS MODULE (de_analysis.py):
1. PREPROCESSING
- Remove genes with zero counts across all samples
- Filter low-count genes: keep genes with at least N counts in at least M samples
(default N=10, M=2, configurable via sidebar)
- Log a summary: genes before filtering, genes after filtering, genes removed
2. NORMALIZATION
- Implement median-of-ratios normalization (DESeq2 method):
a) Compute geometric mean of each gene across all samples
b) For each sample, divide each gene's count by that gene's geometric mean
(skip genes with a zero count in any sample, whose geometric mean is zero)
c) Take the median of these ratios per sample = size factor
d) Divide each sample's counts by its size factor
- Also offer simple CPM (counts per million) as an alternative
- Show a sidebar toggle between normalization methods
3. STATISTICAL TESTING
- For each gene, run a two-sample t-test (scipy.stats.ttest_ind) between the
two condition groups on log2(normalized_counts + 1)
- Calculate log2 fold change: mean(log2(condition2 + 1)) - mean(log2(condition1 + 1))
- Calculate mean expression: mean of log2(all_samples + 1)
- Return raw p-values for all genes
4. MULTIPLE TESTING CORRECTION
- Apply Benjamini-Hochberg FDR correction (scipy.stats.false_discovery_control
or statsmodels.stats.multitest.multipletests)
- Mark genes as significant if adjusted p-value < threshold (default 0.05)
AND abs(log2FC) > threshold (default 1.0)
- Both thresholds configurable via sidebar sliders
APP LAYOUT (app.py):
Use Streamlit with a dark theme. Layout:
1. HEADER
- Title: "RNA-Seq Differential Expression Dashboard"
- Subtitle with brief description
2. SIDEBAR
- File upload widget (.csv files)
- "Generate Sample Data" button
- Condition assignment: two st.multiselect widgets for selecting which
columns belong to Condition 1 vs Condition 2
- Analysis parameters:
- Min count filter (slider, 1-50, default 10)
- Min samples filter (slider, 1-6, default 2)
- Normalization method (radio: "Median of Ratios" / "CPM")
- FDR threshold (slider, 0.001-0.1, default 0.05)
- log2FC threshold (slider, 0.5-3.0, default 1.0)
- "Run Analysis" button
3. RESULTS TABS (main area, use st.tabs)
Tab 1: Overview
- Summary metrics in st.metric cards: total genes tested, significant up,
significant down, not significant
- Library size bar chart showing total counts per sample (before normalization)
- Normalization factor bar chart
Tab 2: Volcano Plot
- Plotly scatter plot: x = log2FC, y = -log10(adjusted p-value)
- Color coding: red = significant up, blue = significant down, gray = NS
- Horizontal dashed line at -log10(FDR threshold)
- Two vertical dashed lines at +/- log2FC threshold
- Hover: gene name, log2FC, adj p-value, mean expression
- Top 10 significant genes labeled on the plot
- Dark theme matching other tools
Tab 3: MA Plot
- Plotly scatter plot: x = mean expression (log2), y = log2FC
- Same color coding as volcano plot
- Horizontal dashed line at y=0
- Hover: gene name, log2FC, adj p-value, mean expression
Tab 4: Heatmap
- Show top N differentially expressed genes (slider: 20-100, default 50)
- Sorted by adjusted p-value
- Hierarchical clustering on both genes (rows) and samples (columns)
using scipy.cluster.hierarchy
- Z-score normalize each row for display
- Color scale: blue (low) to white (mid) to red (high)
- Plotly heatmap with gene names on y-axis, sample names on x-axis
- Dendrogram on both axes if possible, otherwise just clustered order
Tab 5: Results Table
- st.dataframe showing all tested genes with columns:
Gene, log2FC, p-value, adjusted p-value, mean expression,
significant (yes/no)
- Sortable by any column
- Filter: show all / significant only / up only / down only
- "Download Significant Genes (CSV)" button
- "Download Full Results (CSV)" button
DESIGN:
- Streamlit dark theme (set in .streamlit/config.toml)
- Create .streamlit/config.toml with:
[theme]
primaryColor = "#10b981"
backgroundColor = "#0a0a0f"
secondaryBackgroundColor = "#1a1a2e"
textColor = "#e0e0e0"
- Plotly charts use plotly_dark template
- Professional, core-facility-ready layout
Generate all files with complete implementations. Include the .streamlit/config.toml.
The app should work end-to-end: streamlit run app.py opens the dashboard ready to use.

Dependencies

This tool uses scipy for statistical testing, pandas for data handling, and Streamlit for the web application. If you cannot install scipy, ask the LLM to implement the t-test and FDR correction from scratch using numpy only. The statistical logic is straightforward — scipy just makes it convenient.


What you get

After generation, set up the project:

cd rnaseq-de-dashboard
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
streamlit run app.py

Streamlit will open http://localhost:8501 in your browser automatically.

Expected project structure

rnaseq-de-dashboard/
├── app.py (~300-400 lines)
├── de_analysis.py (~150-200 lines)
├── visualization.py (~200-250 lines)
├── sample_data.py (~80-120 lines)
├── .streamlit/
│ └── config.toml
├── requirements.txt
└── README.md

First run walkthrough

  1. Click Generate Sample Data in the sidebar. The simulated count matrix loads with 15,000 genes and 6 samples.
  2. The condition assignment dropdowns should auto-populate: Control_1/2/3 in Condition 1, Treated_1/2/3 in Condition 2. If not, assign them manually.
  3. Leave the default parameters and click Run Analysis.
  4. Check the Overview tab. You should see approximately 300-500 significant genes (depending on filtering and the random seed). The library size chart should show slightly different total counts per sample (simulating real library size variation).
  5. Switch to the Volcano Plot tab. The classic volcano shape should appear:
    • A cluster of gray dots in the center (not significant).
    • Red dots in the upper-right (significantly upregulated).
    • Blue dots in the upper-left (significantly downregulated).
    • The top 10 most significant genes labeled with their names.
  6. Switch to the MA Plot tab. Significant genes should appear as colored dots distributed across the expression range, while non-significant genes cluster around log2FC = 0.
  7. Switch to the Heatmap tab. The top 50 DE genes should show a clear pattern: one block of high expression in treated samples and low in controls, and another block with the opposite pattern. The clustering should group the three control replicates together and the three treated replicates together.
  8. Switch to the Results Table tab. Filter to “Significant only” and click Download Significant Genes (CSV).

Common issues and fixes

Problem
Streamlit shows a blank page
Follow-up prompt
The app is not rendering. Make sure app.py has the Streamlit imports at the top and uses st.set_page_config as the first Streamlit command. Also check that .streamlit/config.toml exists and has valid TOML syntax.
Problem
Volcano plot has no colored points
Follow-up prompt
All points are gray. The significance thresholds might be too strict for the simulated data. Relax the defaults: set the log2FC threshold to 0.5 and the FDR threshold to 0.1. Also verify that the adjusted p-values are being used, not the raw p-values.
Problem
Heatmap is all one color
Follow-up prompt
The heatmap shows no contrast. Make sure you are z-score normalizing each row before plotting: for each gene, subtract the row mean and divide by the row standard deviation. This puts all genes on the same scale.
Problem
Download button produces empty CSV
Follow-up prompt
The CSV download is empty. Check that the filtered DataFrame is not being overwritten before the download button is created. In Streamlit, the download button callback should reference the DataFrame directly, not a variable that gets reassigned.
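The row z-scoring mentioned in the heatmap fix can be sketched in a few lines of numpy. This is an illustrative helper (the name `zscore_rows` and the toy values are assumptions), not the code your generated `visualization.py` will contain.

```python
import numpy as np

def zscore_rows(matrix):
    """Z-score each row (gene) of a genes x samples matrix."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=1, keepdims=True)
    std = matrix.std(axis=1, ddof=1, keepdims=True)
    # Rows with (near-)zero variance would divide by ~0; leave them near 0 instead
    std[std < 1e-12] = 1.0
    return (matrix - mean) / std

# Toy log2 expression: one gene up in treated, one down, one constant
log_expr = np.log2(np.array([
    [100, 120, 110, 400, 420, 390],
    [5000, 5200, 4900, 1200, 1100, 1300],
    [50, 50, 50, 50, 50, 50],
]) + 1)
z = zscore_rows(log_expr)
print(np.round(z, 2))
```

After z-scoring, a weakly expressed gene and a highly expressed gene contribute equally to the color scale, which is what makes the treated/control blocks visible.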

Worked example: Comparing treated vs. control liver samples

Here is a practical scenario for a graduate student running a drug treatment RNA-seq experiment.

Step 1. You submitted RNA from 6 mouse liver samples to your sequencing core: 3 controls (DMSO vehicle) and 3 treated with a drug candidate. The core returned FASTQ files, which you aligned with STAR and counted with featureCounts. The output is a count matrix CSV with gene names as the first column and one column per sample.

Step 2. Your count matrix looks like this:

Gene,Control_1,Control_2,Control_3,Drug_1,Drug_2,Drug_3
Alb,245891,231420,258103,89432,95210,82104
Cyp1a2,12450,13201,11892,45230,48102,42891
Gapdh,89201,91034,87453,88921,90102,87234
...

Step 3. Upload the CSV to the dashboard. Assign Control_1/2/3 to Condition 1 and Drug_1/2/3 to Condition 2. Click Run Analysis.

Step 4. Examine the volcano plot. Look for:

  • Albumin (Alb) should appear as a blue dot in the upper-left quadrant: it is the most abundant liver gene and your drug is suppressing it. A large negative log2FC confirms the drug effect.
  • Cyp1a2 should appear as a red dot in the upper-right — this cytochrome P450 is being induced by the drug. A positive log2FC of ~2 (4-fold induction) is consistent with drug metabolism activation.
  • Gapdh should be a gray dot near the center — housekeeping genes should not change significantly.
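As a quick sanity check, the fold changes for the three example rows can be computed by hand with the same formula the prompt specifies, mean(log2(condition2 + 1)) minus mean(log2(condition1 + 1)). This sketch uses the raw counts from Step 2 without normalization, so the dashboard's values will differ slightly; the helper name `log2_fold_change` is an assumption.

```python
import numpy as np

# Rows copied from the example count matrix (3 control, then 3 drug samples)
example_counts = {
    "Alb":    [245891, 231420, 258103, 89432, 95210, 82104],
    "Cyp1a2": [12450, 13201, 11892, 45230, 48102, 42891],
    "Gapdh":  [89201, 91034, 87453, 88921, 90102, 87234],
}

def log2_fold_change(values, n_control=3):
    """Mean log2(count + 1) in treated minus mean log2(count + 1) in control."""
    logged = np.log2(np.asarray(values, dtype=float) + 1)
    return logged[n_control:].mean() - logged[:n_control].mean()

for gene, values in example_counts.items():
    print(f"{gene}: log2FC = {log2_fold_change(values):+.2f}")
```

On these numbers Alb comes out at roughly -1.5, Cyp1a2 at roughly +1.9, and Gapdh very close to 0, which matches the positions described above.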

Step 5. Check the heatmap. The clustering should separate your control and treated samples cleanly. If the treated replicates do not cluster together, that is a red flag — it could indicate batch effects, sample mislabeling, or high biological variability.

Step 6. Download the significant gene list. Upload it to Enrichr (maayanlab.cloud/Enrichr) or g:Profiler (biit.cs.ut.ee/gprofiler) for pathway enrichment analysis. You should see pathways related to drug metabolism (cytochrome P450, xenobiotic metabolism) enriched in the upregulated genes.

Core facility context: RNA-seq is everywhere

RNA-seq is the most common next-generation sequencing application. University sequencing cores, gene expression centers, and genomics facilities process hundreds of RNA-seq samples per year. Every one of those experiments needs differential expression analysis. The tools you are building here give you a fast first look at results before investing time in a full DESeq2 or edgeR analysis.

If you are taking a bioinformatics course, this dashboard covers the same DE analysis concepts you encounter in class — but packaged as an interactive tool you can use on your own data immediately.


🔧When Things Go Wrong

Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.

Symptom
Count matrix upload fails with 'Could not parse CSV'
Evidence
I exported the count matrix from featureCounts and uploaded it. The error says: 'ParserError: Error tokenizing data. Expected 7 fields in line 3, saw 8.' The file opens fine in Excel.
What to ask the AI
"The count matrix from featureCounts has comment lines at the top starting with '#' and a header row with extra annotation columns (Chr, Start, End, Strand, Length). Can you add CSV parsing logic that: (1) skips lines starting with '#', (2) auto-detects the gene name column (usually 'Geneid' or the first column), and (3) drops non-numeric annotation columns automatically? Keep only the gene name column and the numeric count columns."
Symptom
All adjusted p-values are 1.0 and nothing is significant
Evidence
The volcano plot shows all points at -log10(1.0) = 0 on the y-axis. The raw p-values look correct (some are very small), but after FDR correction everything becomes 1.0.
What to ask the AI
"The FDR correction is not working correctly. I think the issue is that genes with NaN p-values (from the t-test failing on zero-variance genes) are being included in the correction, inflating the number of tests. Can you: (1) remove genes where the t-test returns NaN before running FDR correction, (2) use scipy.stats.false_discovery_control or statsmodels multipletests with method='fdr_bh', and (3) verify that the corrected p-values are actually smaller than 1.0 for at least some genes?"
Symptom
Heatmap shows all genes in one block with no clustering pattern
Evidence
The heatmap renders but there is no visible pattern -- all cells look the same shade of red. There is no separation between treated and control samples.
What to ask the AI
"The heatmap is plotting raw counts instead of z-scored values. Raw counts vary by orders of magnitude, so a few highly expressed genes dominate the color scale. Can you z-score normalize each row before plotting? For each gene: z = (value - row_mean) / row_std. Also make sure the color scale is symmetric around zero (e.g., zmin=-3, zmax=3) so that up and down regulation are equally visible."
Symptom
Volcano plot is extremely slow with 20,000+ genes
Evidence
After uploading a count matrix with 25,000 genes, the volcano plot takes 30+ seconds to render and the Streamlit tab is unresponsive. The MA plot has the same issue.
What to ask the AI
"Plotly is trying to render 25,000 scatter points with hover data, which is slow. Can you add WebGL rendering? Change go.Scatter to go.Scattergl in the volcano and MA plot functions. Also, for the non-significant genes (gray dots), reduce the hover data to just the gene name instead of all fields. This should make rendering noticeably faster."
Symptom
P-value histogram shows a spike at 1.0 instead of a uniform distribution
Evidence
I added a p-value histogram to check the null distribution. Instead of a roughly uniform distribution with a spike near 0 (for true positives), there is a massive spike at exactly 1.0. About 40% of genes have raw p-value = 1.0.
What to ask the AI
"The spike at p=1.0 is from genes with identical counts across conditions (zero variance). For these genes the t-test is undefined (scipy returns NaN), and the missing values are probably being filled with 1.0. Can you add a pre-filter that removes genes where the variance across all samples is below a threshold (e.g., variance < 1)? These genes are uninformative and inflate the multiple testing burden. Show a note in the Overview tab: 'N genes removed due to zero/near-zero variance.'"
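For the first symptom above (featureCounts output that will not parse), the fix usually looks something like the following pandas sketch. It assumes the native tab-separated featureCounts layout with a leading `#` line and the annotation columns named in the prompt; the function name `load_featurecounts` and the example rows are illustrative.

```python
import io
import pandas as pd

def load_featurecounts(source, sep="\t"):
    """Read featureCounts output, keeping only gene IDs and count columns."""
    # comment="#" skips the leading "# Program:featureCounts ..." line
    df = pd.read_csv(source, sep=sep, comment="#")
    # Treat the first column (usually "Geneid") as the gene name index
    df = df.set_index(df.columns[0])
    # Drop annotation columns by name; Start/End/Length are numeric,
    # so a dtype check alone would not remove them
    annotation = ["Chr", "Start", "End", "Strand", "Length"]
    df = df.drop(columns=[c for c in annotation if c in df.columns])
    return df.select_dtypes(include="number")

# Illustrative input (annotation values are made up)
example = "\n".join([
    "# Program:featureCounts (command line elided)",
    "\t".join(["Geneid", "Chr", "Start", "End", "Strand", "Length", "Control_1", "Drug_1"]),
    "\t".join(["Alb", "chr5", "100", "2000", "+", "1900", "245891", "89432"]),
    "\t".join(["Gapdh", "chr6", "500", "1500", "-", "1000", "89201", "88921"]),
])
matrix = load_featurecounts(io.StringIO(example))
print(matrix)
```

If your file really is comma-separated, pass `sep=","` instead.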

Understanding the statistics

The analysis pipeline implements a simplified version of what DESeq2 does internally. Here are the key concepts:

Library size normalization: Different samples are sequenced to different depths. One sample might have 20 million reads, another 35 million. Without normalization, a gene appears “upregulated” in the deeper sample purely because of sequencing depth. Median-of-ratios normalization estimates a size factor per sample that accounts for both sequencing depth and RNA composition differences. This is the same algorithm DESeq2 uses (Anders and Huber, 2010).
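The median-of-ratios idea can be sketched in numpy as follows. This is a simplified illustration, not DESeq2's actual code; the function name `size_factors` and the toy matrix are assumptions. Genes with a zero in any sample are excluded from the size factor estimate because their geometric mean is zero.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with nonzero counts in every sample (log is defined)
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # Log geometric mean per gene = mean of the log counts
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per-sample median of the log ratios, back-transformed
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data: sample 2 was sequenced twice as deeply as sample 1
raw = np.array([[100, 200], [50, 100], [30, 60], [0, 10]])
sf = size_factors(raw)
normalized = raw / sf
print(sf)
```

Dividing each column by its size factor puts the two samples on the same scale, so the first three genes no longer look "upregulated" in the deeper sample.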

Log2 fold change: The ratio of expression between conditions, on a log2 scale. A log2FC of 1 means 2-fold upregulation. A log2FC of -2 means 4-fold downregulation. Log2FC of 0 means no change. The log scale makes the distribution symmetric: a 2-fold increase (+1) and a 2-fold decrease (-1) are equidistant from zero.

Benjamini-Hochberg correction: When you test 15,000 genes simultaneously at p < 0.05, you expect about 750 false positives by chance alone (15,000 × 0.05), even if no gene is truly differentially expressed. The BH procedure controls the false discovery rate (FDR): the expected proportion of false positives among your significant results. Calling every gene with an adjusted p-value (often loosely called a q-value) below 0.05 significant means that, on average, at most 5% of those calls are expected to be false positives.
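The BH adjustment itself is short enough to write with numpy alone. The sketch below is a minimal version that assumes the input contains no NaN values; the function name `benjamini_hochberg` is illustrative, and in the dashboard you would normally call a library routine instead.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order (no NaN handling)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward, cap at 1
    adjusted_sorted = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)  # [0.02 0.04 0.04 0.02]
```

Each adjusted value is at least as large as the raw p-value, which is why fewer genes pass the threshold after correction.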

Volcano plot interpretation: The name comes from the shape. Points at the top are highly significant (small p-value). Points on the far left and right have large fold changes. The most interesting genes are in the upper corners — large fold change AND high statistical significance. Points in the lower center are noise.
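The color coding on the volcano plot boils down to a two-condition rule, sketched here with numpy. The function name `classify_genes` and the four example genes are assumptions chosen to illustrate each case; the thresholds are the dashboard defaults.

```python
import numpy as np

def classify_genes(log2fc, padj, fc_threshold=1.0, fdr_threshold=0.05):
    """Label each gene 'up', 'down', or 'NS' using both thresholds."""
    log2fc = np.asarray(log2fc, dtype=float)
    padj = np.asarray(padj, dtype=float)
    significant = padj < fdr_threshold
    labels = np.full(log2fc.shape, "NS", dtype=object)
    labels[significant & (log2fc > fc_threshold)] = "up"
    labels[significant & (log2fc < -fc_threshold)] = "down"
    return labels

log2fc = [3.5, -2.0, 0.1, 5.0]
padj = [1e-8, 1e-4, 1e-4, 0.3]
labels = classify_genes(log2fc, padj)
y_axis = -np.log10(padj)  # what the volcano plot shows on the y-axis
print(list(zip(labels, np.round(y_axis, 2))))
```

The third gene is statistically significant but has a tiny fold change, and the fourth has a large fold change but a weak p-value; both stay gray, which is the point of requiring both conditions.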

🔍For Researchers: When to use this tool vs. DESeq2/edgeR

Use this dashboard for:

  • First-pass exploration within minutes of getting your count matrix (“did the experiment work?”)
  • Lab meeting presentations where you need quick, interactive plots
  • Teaching RNA-seq analysis concepts to students (the interactive sliders make parameter effects visible)
  • Comparing different normalization or threshold choices interactively
  • Generating a candidate gene list to discuss with your PI before investing in a full analysis

Use DESeq2 or edgeR for:

  • Publication-ready statistical analysis (reviewers will ask which tool you used)
  • Proper negative binomial modeling of count data (t-tests on log-counts are an approximation)
  • Experiments with complex designs (multiple factors, batch effects, paired samples)
  • Small sample sizes (n=2 per condition) where the t-test lacks power and DESeq2’s information borrowing across genes is critical
  • Interaction effects, time series, or dose-response experiments

The key difference: this dashboard uses a t-test on log-transformed normalized counts, which is a reasonable approximation when you have 3+ replicates per condition and the counts are not too low. DESeq2 uses a negative binomial generalized linear model with empirical Bayes shrinkage, which is more statistically principled but requires R and more setup. For exploratory analysis, the t-test approach is fast and usually identifies the same top hits.
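For reference, the t-test approach described above amounts to roughly the following. This is a sketch on a made-up two-gene matrix, assuming the counts are already normalized; the function name `gene_ttests` and the column index arguments are illustrative.

```python
import numpy as np
from scipy import stats

def gene_ttests(norm_counts, control_idx, treated_idx):
    """Per-gene log2 fold change and two-sample t-test on log2(count + 1)."""
    logged = np.log2(np.asarray(norm_counts, dtype=float) + 1)
    control = logged[:, control_idx]
    treated = logged[:, treated_idx]
    log2fc = treated.mean(axis=1) - control.mean(axis=1)
    # axis=1 runs one test per row (gene) in a single vectorized call
    _, p_value = stats.ttest_ind(treated, control, axis=1)
    return log2fc, p_value

norm = np.array([
    [100, 110, 90, 400, 440, 360],   # about 4-fold up in treated
    [200, 210, 190, 205, 195, 208],  # unchanged
])
fc, pvals = gene_ttests(norm, control_idx=[0, 1, 2], treated_idx=[3, 4, 5])
print(np.round(fc, 2), pvals)
```

Note that each gene is tested in isolation here. That is the main thing DESeq2 does differently: it shares variance information across genes.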


Customize it

Add gene set enrichment analysis

Add a new tab called "Enrichment" to the dashboard. After running DE analysis,
take the significant gene list and perform a simple over-representation analysis
against Gene Ontology (GO) terms. Include a bundled GO term database (download a
slim version with ~5000 terms and their associated gene lists for mouse). For each
GO term, run a Fisher's exact test comparing the overlap between the DE genes and
the GO term genes. Display the top 20 enriched GO terms as a bar chart (-log10
p-value) with the number of overlapping genes as hover text. This is a basic
version of what DAVID or g:Profiler does, but runs entirely offline.
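The statistic behind that enrichment prompt is a one-sided Fisher's exact test, which is the same as a hypergeometric tail probability. A standard-library sketch follows; the function name `enrichment_pvalue` and the toy numbers are assumptions.

```python
from math import comb

def enrichment_pvalue(n_background, n_term, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random.

    n_background: genes tested; n_term: background genes annotated to the
    GO term; n_de: significant genes; n_overlap: significant genes in the term.
    """
    total = comb(n_background, n_de)
    upper = min(n_term, n_de)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy case: 20 genes tested, 5 in the term, 4 DE genes, 3 of them in the term
p = enrichment_pvalue(20, 5, 4, 3)
print(round(p, 4))  # 0.032
```

Remember that these per-term p-values need their own multiple testing correction, since you test many GO terms at once.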

Add PCA and sample correlation plots

Add a new tab called "Sample QC" that runs BEFORE the DE analysis. Include:
1. PCA plot of all samples using the top 500 most variable genes. Color points
by condition. The treated and control samples should separate on PC1 or PC2.
If they do not, the experiment may have failed or there may be batch effects.
2. Sample-to-sample correlation heatmap using Pearson correlation on log2(counts+1).
Replicates within a condition should have correlation > 0.95. Low correlations
suggest outlier samples.
3. Cook's distance bar chart per sample to identify outlier samples.
Show this tab first in the tab order so the user checks sample quality before
running DE analysis.

Add batch effect visualization

Add batch effect detection to the Sample QC tab. Let the user assign a batch
variable (e.g., sequencing lane, RNA extraction date) in the sidebar. Re-run
the PCA and color by batch instead of condition. If samples cluster by batch
rather than by condition, show a warning: "Batch effect detected -- consider
using a batch-corrected model." Also offer a simple batch correction using
ComBat-style adjustment (subtract the batch mean from each gene's log2 counts,
preserving the condition effect).

Add gene search and highlighting

Add a search box at the top of the Results Table tab. When the user types a
gene name (or a comma-separated list), highlight those genes on the volcano
plot and MA plot with a distinct marker (larger size, star shape, labeled).
This lets the user check whether their genes of interest are significant
without scrolling through the full table. Also add a "Pathway Genes" text
area where the user can paste a list of genes from a pathway database and
see which ones are DE in their experiment.

💡The exploration-to-publication workflow

Here is how this tool fits into a real RNA-seq analysis workflow:

  1. Day 1: Count matrix arrives. Upload to this dashboard. Generate volcano plot. Answer: “Did the experiment work?” Share the plot with your PI.
  2. Day 2-3: Run the full DESeq2/edgeR analysis in R for publication-quality statistics. Compare the gene lists — the overlap between the quick dashboard and the full analysis should be >90% for the top hits.
  3. Day 4-5: Run pathway enrichment (Enrichr, g:Profiler) on the DESeq2 results. Use the dashboard’s heatmap to generate figures for your paper.
  4. Publication: Cite DESeq2/edgeR for the statistics. Use the dashboard plots for presentations and lab meetings.

The dashboard does not replace the formal analysis — it accelerates the exploration phase so you know where to focus.


Connecting to core facility workflows

Differential expression analysis is relevant to nearly every sequencing service a core facility offers:

RNA-seq — The direct application. Bulk RNA-seq from Illumina platforms produces the count matrices this dashboard consumes. Whether you are doing polyA-selected mRNA-seq or total RNA-seq with ribosomal depletion, the downstream count matrix has the same format.

Single-cell RNA-seq: 10X Genomics Chromium data can be aggregated to pseudo-bulk counts (sum counts per gene per sample across cells, so that each biological replicate keeps its own column) and analyzed with this dashboard for a quick bulk-level comparison. This is a legitimate analysis strategy and is sometimes preferred for between-condition comparisons.
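Pseudo-bulk aggregation is a one-line group-and-sum in pandas. The sketch below assumes a genes x cells DataFrame and a Series mapping each cell barcode to its sample of origin, so that replicates stay in separate columns; the function name `pseudobulk` and the tiny example are illustrative.

```python
import pandas as pd

def pseudobulk(cell_counts, cell_to_sample):
    """Sum per-cell counts into one column per sample (genes x samples)."""
    # Transpose so cells are rows, group cells by sample, sum, transpose back
    return cell_counts.T.groupby(cell_to_sample).sum().T

cells = pd.DataFrame(
    {"c1": [1, 0], "c2": [2, 3], "c3": [4, 1], "c4": [0, 5]},
    index=["GeneA", "GeneB"],
)
sample_of_cell = pd.Series(
    {"c1": "Control_1", "c2": "Control_1", "c3": "Treated_1", "c4": "Treated_1"}
)
bulk = pseudobulk(cells, sample_of_cell)
print(bulk)
```

The resulting matrix has the same shape as a bulk count matrix and can be saved with `bulk.to_csv(...)` for upload.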

Spatial transcriptomics — Visium and MERFISH platforms produce spatially resolved expression data. Comparing expression between annotated tissue regions produces count matrices that fit this dashboard’s input format.

Gene expression arrays — While largely superseded by RNA-seq, some labs still generate microarray data. After RMA normalization, microarray expression matrices can be analyzed with the same t-test and volcano plot workflow. Ask the LLM to add a “pre-normalized” mode that skips the count normalization step.

If you are taking a genomics or bioinformatics course, this dashboard covers the same statistical concepts (fold change, multiple testing, FDR) that appear in lectures — but lets you manipulate the parameters interactively and see the effects in real time.


Key takeaways

  • Normalization is not optional: raw counts cannot be compared across samples without accounting for library size differences. Median-of-ratios normalization is the standard for RNA-seq because it handles both sequencing depth and RNA composition effects.
  • Multiple testing correction is the difference between 750 false positives and a reliable gene list: with 15,000 tests, a nominal p < 0.05 cutoff is meaningless. BH-corrected adjusted p-values control the false discovery rate.
  • Both fold change AND statistical significance matter: a gene with log2FC = 5 but p = 0.3 might be a noisy gene with one outlier replicate. A gene with log2FC = 0.1 and p = 0.0001 is statistically significant but biologically trivial. The volcano plot captures both dimensions.
  • Replicates determine power: with 2 replicates per condition, the t-test has very low power and you will miss many truly DE genes. With 3+ replicates, the dashboard’s t-test approach gives results comparable to DESeq2 for the top hits.
  • This tool is for exploration, not publication: use it for quick answers and interactive plotting. Use DESeq2 or edgeR for the statistics you report in a paper.

Portfolio suggestion

The RNA-seq DE dashboard is directly relevant to anyone working in genomics or molecular biology. For your portfolio:

  1. Run the dashboard on the simulated data and save screenshots of the volcano plot, MA plot, and heatmap.
  2. If you have real data, run your own count matrix through the dashboard and include a de-identified volcano plot. A volcano plot from real data — with genes labeled, biological interpretation noted — demonstrates both technical and scientific competency.
  3. Compare to DESeq2 results: if you have both, show the overlap between the dashboard’s significant gene list and the DESeq2 gene list. High concordance (>90% for top 100 genes) validates the approach.
  4. Write a brief methods note describing when you would use this tool vs. DESeq2. This demonstrates mature scientific judgment about tool selection.

🔍Advanced: Adding DESeq2-style dispersion estimation

The main limitation of the t-test approach is that it does not model the mean-variance relationship of count data. In RNA-seq, variance increases with mean expression (heteroscedasticity). DESeq2 handles this by fitting a dispersion parameter per gene using a negative binomial model.

You can add a simplified version:

Add a negative binomial test option to the statistical testing module. For each
gene: (1) estimate the dispersion parameter using the method of moments
(variance = mu + mu^2 * dispersion), (2) fit the dispersion-mean relationship
across all genes using a loess curve, (3) shrink per-gene dispersions toward the
fitted curve (empirical Bayes shrinkage), (4) use the shrunken dispersions in a
negative binomial test. This is a simplified version of the DESeq2 algorithm.
Add a sidebar toggle between "t-test (fast)" and "Negative binomial (more accurate)".

This is more statistically appropriate than the t-test, especially for low-count genes and small sample sizes. However, it is significantly more complex to implement and debug. Start with the t-test version, validate it works, then add this as an upgrade.
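Step (1) of that prompt, the method-of-moments estimate, is the simplest piece and can be sketched directly from variance = mu + mu^2 * dispersion. This covers only the raw per-gene estimate, not the curve fitting or shrinkage; the function name `moment_dispersion` is an assumption.

```python
import numpy as np

def moment_dispersion(norm_counts):
    """Per-gene method-of-moments dispersion from normalized counts."""
    x = np.asarray(norm_counts, dtype=float)
    mu = x.mean(axis=1)
    var = x.var(axis=1, ddof=1)
    # Solve var = mu + mu^2 * dispersion for dispersion
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = (var - mu) / mu ** 2
    # Clip under-dispersed genes at 0; genes with zero mean are undefined
    return np.where(mu > 0, np.maximum(disp, 0.0), np.nan)

example = np.array([
    [80, 100, 120],  # mean 100, variance 400 -> dispersion 0.03
    [10, 10, 10],    # less variable than Poisson -> clipped to 0
    [0, 0, 0],       # not expressed -> NaN
])
disp = moment_dispersion(example)
print(disp)
```

With only three replicates these raw estimates are very noisy, which is why the prompt goes on to fit a trend and shrink toward it.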


KNOWLEDGE CHECK

You generate a volcano plot from your RNA-seq experiment. You see a gene at coordinates (log2FC = 3.5, -log10(adj p-value) = 8). Your significance thresholds are log2FC > 1 and adjusted p-value < 0.05. What can you conclude about this gene?


Try it yourself

  1. Generate the RNA-seq DE dashboard with the prompt above.
  2. Click Generate Sample Data and run the analysis with default parameters.
  3. Examine the volcano plot. Can you identify the upregulated and downregulated gene clusters?
  4. Adjust the log2FC threshold slider from 1.0 to 0.5. How does the number of significant genes change?
  5. Switch to the heatmap tab. Does the clustering separate the control and treated samples?
  6. Download the significant gene list as CSV. Open it in a spreadsheet and sort by adjusted p-value.
  7. If you have a real RNA-seq count matrix, upload it and compare the results to your previous DESeq2/edgeR analysis.
  8. Pick one customization from the list above and add it with a follow-up prompt.

What’s next

In Lesson 6, you will build a reproducible RNA-seq workflow orchestrator — a Python CLI tool that generates Snakemake pipelines to chain FASTQ QC, alignment, counting, and differential expression into a single reproducible workflow. It is the capstone lesson for this module, synthesizing the tools from Lessons 3 and 5 into an end-to-end pipeline.