Applied Module 12 · AI-Powered Bioinformatics Tools

Proteomics Search Results Triage

What you'll learn

~25 min
  • Build a proteomics search results triage dashboard with a single AI prompt
  • Parse protein identification CSV data and filter by FDR score with Chart.js visualization
  • Troubleshoot common issues with CSV parsing, FDR filtering, and molecular weight distributions
  • Customize the dashboard with peptide-level views, GO annotation overlays, or batch comparison

What you’re building

Every mass spec run ends the same way: the search engine finishes and hands you a spreadsheet with hundreds or thousands of protein identifications. Some are real. Some are decoys. Some passed FDR thresholds but have a single peptide and 2% sequence coverage. Someone on staff has to open the file, sort it four different ways, squint at q-values, and decide which IDs are worth reporting to the researcher.

You are going to build a tool that does this triage in under five seconds.

💬This solves a real bottleneck in proteomics cores

Facility staff running MaxQuant or FragPipe on a busy week might process 10-20 runs. Each proteinGroups.txt file has hundreds to thousands of rows. Manually scanning each one for decoy hits, single-peptide IDs, and low-coverage proteins is tedious and error-prone. A triage dashboard that filters, flags, and visualizes the results lets staff spend their time on interpretation instead of spreadsheet gymnastics.

By the end of this lesson you will have a standalone proteomics search results triage dashboard that runs entirely in the browser. Upload a CSV of protein identifications, and it instantly filters decoy hits, flags proteins whose q-value exceeds your FDR threshold, highlights single-peptide IDs, and displays molecular weight distributions and coverage charts using Chart.js. No server, no installation: just one HTML file you can bookmark on the data analysis workstation.

Software pattern: Upload, filter, visualize

Upload → parse → filter by quality thresholds → display charts and tables. This pattern works for any scored results data: genomics variant calls, metabolomics feature lists, flow cytometry gating results. The techniques here transfer directly.

🔍Domain Primer: Key terms you'll see in this lesson

New to proteomics data analysis? Here are the terms you’ll encounter:

  • FDR (False Discovery Rate) — The estimated proportion of incorrect protein identifications in your results. An FDR of 1% means roughly 1 in 100 IDs is expected to be wrong. You control it by keeping only proteins whose q-value is at or below your chosen cutoff.
  • Q-value — The minimum FDR at which a particular protein would be considered significant. Proteins with q-value ≤ 0.01 pass the standard 1% FDR threshold.
  • PSM (Peptide-Spectrum Match) — A single match between an observed MS/MS spectrum and a peptide sequence in the database. More PSMs generally means higher confidence.
  • Sequence coverage — The percentage of the full protein sequence that was observed as peptides. Higher is better; very low coverage (under 5%) is a warning sign.
  • Unique peptides — Peptides that map to only one protein in the database. A protein identified by zero unique peptides cannot be distinguished from other family members.
  • Molecular weight (kDa) — The mass of the full protein in kilodaltons. Useful for sanity-checking IDs against expected molecular weight ranges (e.g., antibody heavy chains ~50 kDa).
  • MaxQuant / FragPipe — Two widely used proteomics analysis suites that are free for academic use. MaxQuant outputs proteinGroups.txt; FragPipe outputs combined_protein.tsv. Both produce similar columns.
  • Decoy hit — A protein ID matched against a reversed or shuffled database sequence. Decoy hits are used to estimate FDR but should never appear in final results.

You don’t need to memorize these — the dashboard handles the filtering logic. You just need to know what the thresholds mean.

Who this is for

  • Mass spectrometry facility staff who review search engine output after every run and need a fast way to separate real identifications from noise.
  • Proteomics core managers who want a standardized triage step before results are delivered to researchers, reducing the chance of reporting decoy hits or single-peptide IDs.
  • Graduate students and postdocs running their own searches who need a visual sanity check before publishing a protein list.

Core Facility Context

UW-Madison’s Mass Spectrometry Facility and the Biotechnology Center’s Proteomics core process hundreds of samples per month. Whether you use MaxQuant, FragPipe, Proteome Discoverer, or another engine, the output is always a table of scored protein IDs that needs triage. This dashboard works with any search engine output exported as CSV, provided the column names match the sample data or you ask the AI to remap them (see Customize it).


The showcase

Here is what the finished dashboard looks like once you open the HTML file in a browser:

  • Upload zone at the top where you drop a CSV file or click to browse. A “Load Example” button loads embedded test data.
  • Filter controls for FDR threshold (default 0.01), minimum unique peptides (default 2), minimum sequence coverage (default 5%), and a toggle to hide decoy hits.
  • Summary bar showing total proteins, proteins passing all filters, flagged proteins, and decoy hits removed.
  • Molecular weight distribution chart (Chart.js histogram) showing the MW distribution of passing proteins.
  • Sequence coverage vs. score scatter plot so you can spot low-confidence outliers.
  • Results table with color-coded rows:
    • Green left border: passes all filters.
    • Red left border: fails FDR or is a decoy hit.
    • Yellow left border: passes FDR but has a single unique peptide or low coverage.
  • Export button that downloads the filtered results as a clean CSV.

Everything runs client-side. Your proteomics data never leaves the browser.


The prompt

Open your terminal, navigate to a project folder, start your AI CLI tool (e.g., by typing claude), and paste this prompt:

Build a single self-contained HTML file called protein-triage.html that triages
mass spectrometry protein identification results. Requirements:
1. FILE INPUT
- A drag-and-drop zone (dashed border, changes color on dragover) for CSV files
- Also a click-to-browse fallback button
- Parse the CSV client-side (handle quoted fields, commas inside quotes)
- Show the filename and row count after upload
2. SAMPLE DATA (embed as a "Load Example" button)
Include this sample CSV data with deliberate quality issues for testing:
Protein_ID,Gene_Name,Description,Molecular_Weight_kDa,Sequence_Coverage_Percent,Unique_Peptides,Total_Peptides,PSMs,Score,Q_Value,Decoy,Intensity
P04406,GAPDH,Glyceraldehyde-3-phosphate dehydrogenase,36.1,72.5,18,24,156,312.4,0.0001,FALSE,8.5e9
P68363,TUBA1B,Tubulin alpha-1B chain,50.8,58.3,12,20,98,245.1,0.0003,FALSE,5.2e9
P07437,TUBB,Tubulin beta chain,50.1,61.7,14,22,112,267.8,0.0002,FALSE,6.1e9
REV__Q8NB37,REV__GATD3A,Reversed glutamine amidotransferase,34.2,8.1,2,3,4,12.3,0.42,TRUE,1.1e6
P06733,ENO1,Alpha-enolase,47.4,45.2,9,15,67,189.5,0.0008,FALSE,3.8e9
Q99497,PARK7,Parkinson disease protein 7,20.1,3.8,1,1,2,8.7,0.008,FALSE,4.5e7
P62258,YWHAE,14-3-3 protein epsilon,29.3,42.6,8,12,45,156.2,0.0005,FALSE,2.9e9
REV__P12345,REV__FAKE1,Reversed decoy protein 1,45.0,5.2,1,2,3,9.1,0.65,TRUE,8.2e5
P60709,ACTB,Actin cytoplasmic 1,42.1,68.9,15,28,134,298.6,0.0001,FALSE,9.1e9
Q15149,PLEC,Plectin,533.5,4.2,1,1,2,7.5,0.012,FALSE,2.1e7
P11142,HSPA8,Heat shock cognate 71 kDa protein,71.1,55.8,16,22,89,234.7,0.0002,FALSE,4.7e9
P38646,HSPA9,Stress-70 protein mitochondrial,73.9,38.4,10,14,52,167.3,0.0009,FALSE,2.3e9
O43707,ACTN4,Alpha-actinin-4,105.3,22.1,6,9,28,89.4,0.003,FALSE,8.9e8
P35527,KRT9,Keratin type I cytoskeletal 9,62.3,15.6,4,7,18,45.2,0.005,FALSE,3.2e8
Q9NZI8,,Insulin-like growth factor 2 mRNA-binding protein 1,63.5,1.9,1,1,1,5.3,0.035,FALSE,9.8e6
3. FILTER CONTROLS
- FDR threshold slider: 0.001 to 0.05 (default 0.01, step 0.001), show current value
- Minimum unique peptides: dropdown with options 1, 2, 3 (default 2)
- Minimum sequence coverage: slider 0-20% (default 5%, step 1)
- "Hide decoys" checkbox (default checked)
- Filters update the table and charts in real time
- Show a count of how many proteins pass each individual filter
4. SUMMARY BAR
- Total proteins loaded
- Passing all filters (green badge)
- Flagged / below threshold (yellow badge)
- Decoy hits (red badge)
- Removed by filters (gray badge)
5. CHARTS (use Chart.js from CDN)
- Molecular weight distribution: histogram of MW for passing proteins (bins: 0-25, 25-50,
50-75, 75-100, 100-150, 150+ kDa), bar chart with MW on x-axis
- Sequence coverage vs. Score: scatter plot, passing proteins in green, failing in red,
with a dashed line at the lowest Score among proteins that pass the FDR threshold
- Both charts update when filters change
6. RESULTS TABLE
- Sortable columns (click header to sort): Protein_ID, Gene_Name, Description, MW,
Coverage%, Unique_Peptides, PSMs, Score, Q_Value
- Color-coded left border: green = passes all filters, red = decoy or fails FDR,
yellow = passes FDR but single unique peptide or low coverage
- Highlight the Q_Value cell in red if above threshold
- Flag missing Gene_Name with a "—" placeholder and yellow highlight
- Show/hide decoy rows based on checkbox
7. EXPORT
- "Export Passing" button: downloads a CSV of only the proteins that pass all filters
- Include a timestamp and filter settings as a comment header in the exported CSV
8. DESIGN
- Dark theme: background #0f172a, cards #1e293b, text #e2e8f0, accent #10b981
- Clean sans-serif font (Inter from Google Fonts CDN)
- Responsive layout, single column, max-width 1100px
- Charts should be ~400px tall
- Green (#10b981) / red (#ef4444) / yellow (#eab308) color coding consistent throughout
9. TECHNICAL
- Pure HTML/CSS/JS in one file, no build step
- Chart.js from CDN (https://cdn.jsdelivr.net/npm/chart.js)
- CSV parser must handle quoted fields correctly
- All filtering and charting happens client-side

💡Copy-paste ready

That entire block is the prompt. Paste it as-is. The embedded sample data has deliberate issues: two decoy hits (REV__ prefix), three single-peptide IDs with low coverage (PARK7, Plectin, and Q9NZI8, which this lesson calls IF2B1), two proteins failing FDR at the default threshold (Plectin at q=0.012, IF2B1 at q=0.035), and a missing gene name (data row 15, the IF2B1 entry). You can immediately verify the dashboard is filtering correctly.


What you get

After the LLM finishes (typically 60-90 seconds), you will have a single file: protein-triage.html. Open it in any browser.

Expected output structure

protein-triage.html (~600-900 lines)

Click Load Example and you should see:

  1. A summary bar showing 15 total proteins, 2 decoys removed, 10 passing all filters, and 3 not passing (PARK7 flagged yellow; Plectin and IF2B1 failing FDR). How the badges split those 3 depends on the generated code.
  2. REV__GATD3A and REV__FAKE1 hidden (decoy checkbox is on by default). Their rows have red borders if you uncheck the hide toggle.
  3. PARK7 (q=0.008, 1 unique peptide, 3.8% coverage) flagged yellow: passes FDR but only one unique peptide and coverage below 5%.
  4. Plectin (q=0.012) flagged red: q-value above the default 0.01 threshold.
  5. IF2B1 (q=0.035, missing gene name) flagged red: fails FDR and has a blank gene name shown as “—”.
  6. GAPDH, ACTB, tubulin chains all green: high coverage, many peptides, excellent scores.
  7. Molecular weight histogram showing most passing proteins cluster in the 25-75 kDa range.
  8. Scatter plot with passing proteins (green dots, upper-right region) clearly separated from flagged ones (red dots, lower-left).
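You can check these numbers without the dashboard. The sketch below is not the code your AI tool will generate; it is a minimal, hand-written illustration that assumes the default thresholds from the prompt and uses hypothetical helper names (classify, mwBin). It applies the green/yellow/red rule to the 15 sample rows and tallies the summary counts and the molecular weight bins.

```javascript
// Sample rows reduced to the fields the filters use:
// [gene or ID, MW kDa, coverage %, unique peptides, q-value, decoy]
const rows = [
  ["GAPDH", 36.1, 72.5, 18, 0.0001, false],
  ["TUBA1B", 50.8, 58.3, 12, 0.0003, false],
  ["TUBB", 50.1, 61.7, 14, 0.0002, false],
  ["REV__GATD3A", 34.2, 8.1, 2, 0.42, true],
  ["ENO1", 47.4, 45.2, 9, 0.0008, false],
  ["PARK7", 20.1, 3.8, 1, 0.008, false],
  ["YWHAE", 29.3, 42.6, 8, 0.0005, false],
  ["REV__FAKE1", 45.0, 5.2, 1, 0.65, true],
  ["ACTB", 42.1, 68.9, 15, 0.0001, false],
  ["PLEC", 533.5, 4.2, 1, 0.012, false],
  ["HSPA8", 71.1, 55.8, 16, 0.0002, false],
  ["HSPA9", 73.9, 38.4, 10, 0.0009, false],
  ["ACTN4", 105.3, 22.1, 6, 0.003, false],
  ["KRT9", 62.3, 15.6, 4, 0.005, false],
  ["Q9NZI8", 63.5, 1.9, 1, 0.035, false],
].map(([name, mw, coverage, unique, q, decoy]) => ({ name, mw, coverage, unique, q, decoy }));

// Default filter settings from the prompt
const defaults = { fdr: 0.01, minUnique: 2, minCoverage: 5 };

// Hypothetical helper: decoys first, then FDR, then the two "soft" checks
function classify(p, f = defaults) {
  if (p.decoy) return "decoy";
  if (p.q > f.fdr) return "red";
  if (p.unique < f.minUnique || p.coverage < f.minCoverage) return "yellow";
  return "green";
}

// Histogram bins from the prompt: lower edges, with the last bin open-ended
const binEdges = [0, 25, 50, 75, 100, 150];
const binLabels = ["0-25", "25-50", "50-75", "75-100", "100-150", "150+"];

// Hypothetical helper: return the label of the bin that contains mw
function mwBin(mw) {
  let i = 0;
  while (i + 1 < binEdges.length && mw >= binEdges[i + 1]) i++;
  return binLabels[i];
}

const counts = { green: 0, yellow: 0, red: 0, decoy: 0 };
const histogram = Object.fromEntries(binLabels.map((label) => [label, 0]));
for (const p of rows) {
  const status = classify(p);
  counts[status]++;
  if (status === "green") histogram[mwBin(p.mw)]++; // chart passing proteins only
}

console.log(counts);
console.log(histogram);
```

With the defaults, this should tally 10 green, 1 yellow (PARK7), 2 red (Plectin and Q9NZI8), and 2 decoys, with 9 of the 10 passing proteins falling in the 25-50 and 50-75 kDa bins. Decoys are counted separately here to mirror the summary bar, even though the table draws them with a red border.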
What about proteinGroups.txt from MaxQuant?

MaxQuant’s native output is tab-separated with different column headers (like Mol. weight [kDa], Sequence coverage [%], Q-value). Saving it as CSV from Excel fixes the delimiter but not the headers, so you also need to ask the AI to adapt the column mapping. See the Customize section below for a one-prompt adaptation.

If something is off

LLMs occasionally produce code with small bugs. Here are the most common issues and one-line fix prompts:

Problem: Charts don’t render
Follow-up prompt: "The Chart.js charts are not appearing. Make sure Chart.js is loaded from the CDN before the script tries to create charts, and that the canvas elements have explicit width and height attributes."

Problem: All proteins show as failing
Follow-up prompt: "Every protein is flagged red even though some have q-values well below 0.01. Check that the Q_Value column is being parsed as a float, not compared as a string."

Problem: Filters don’t update the table
Follow-up prompt: "Moving the FDR slider doesn't change the table or charts. Make sure each filter input has an event listener that re-runs the filter function and calls chart.update()."
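The "all proteins show as failing" problem, and charts that bin values oddly, usually trace back to comparing CSV text instead of numbers. Here is a short sketch of the pitfalls in plain JavaScript. The toNumber helper is hypothetical and only illustrates one way to clean a cell before parsing; the generated file may do this differently.

```javascript
// Every CSV cell starts out as a string.
// Two strings compare character by character, not numerically:
console.log("1e-4" > "0.01");  // true: "1" sorts after "0"
console.log("105.3" < "25");   // true: "1" sorts before "2"

// A cell that still carries its quotes parses to NaN, and every
// comparison with NaN is false, so the row always "fails".
console.log(parseFloat('"0.0001"'));         // NaN
console.log(parseFloat('"0.0001"') <= 0.01); // false

// Hypothetical helper: strip surrounding whitespace and quotes, then parse.
// Unparseable cells come back as NaN, so they fail filters instead of
// silently passing.
function toNumber(cell) {
  const cleaned = String(cell).trim().replace(/^"|"$/g, "");
  return Number.parseFloat(cleaned);
}

console.log(toNumber(' "0.0001" ') <= 0.01); // true
console.log(toNumber("1e-4") <= 0.01);       // true
console.log(toNumber("8.5e9"));              // 8500000000
console.log(Number.isNaN(toNumber("")));     // true
```

Returning NaN for an empty or malformed cell is a deliberate choice in this sketch: a row with a missing q-value then shows up as failing, which is easier to notice than a row that passes by accident.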

🔧When Things Go Wrong

Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.

Symptom
Q-value filtering is not working correctly -- proteins with q=0.0001 are flagged as failing
Evidence
Proteins like GAPDH with Q_Value 0.0001 show red borders even though 0.0001 is well below the 0.01 threshold
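What to ask the AI
"Check how Q_Value is compared to the threshold. If the value is still a string, the comparison can go wrong: two strings compare lexicographically (for example, '1e-4' > '0.01' is true as strings), and a cell that still carries stray quotes converts to NaN, which fails every comparison. Strip any surrounding quotes and parse Q_Value with parseFloat() before comparing it to the numeric threshold."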
Symptom
Molecular weight histogram shows all proteins in one bin
Evidence
The MW distribution chart has a single tall bar instead of a distribution across bins. All 15 proteins appear to be in the same bin.
What to ask the AI
"The Molecular_Weight_kDa column is being treated as a string. Parse it with parseFloat() before binning. Also check that the bin boundaries are numbers, not strings."
Symptom
Decoy toggle removes rows but chart still includes them
Evidence
After checking 'Hide decoys', the table hides the REV__ rows but the scatter plot still shows red dots for decoy proteins and the histogram still counts their molecular weights.
What to ask the AI
"The chart data is not being recomputed when the decoy toggle changes. Add the decoy checkbox to the same event listener that triggers re-filtering, and rebuild the chart datasets from the filtered data, not the full dataset."
Symptom
Exported CSV is empty or has only headers
Evidence
Clicking 'Export Passing' downloads a CSV file that contains only the header row with no data rows, even though the table shows 10 passing proteins.
What to ask the AI
"The export function is filtering the original data array but the filter logic does not match the display filter logic. Use the same filter function for both display and export. Check that the filtered array is not empty before generating the CSV."

How it works (the 2-minute explanation)

You do not need to read every line of the generated code, but here is the mental model:

  1. CSV parsing splits each line by commas (respecting quoted fields) and maps the first row as headers. Numeric columns like Molecular_Weight_kDa, Q_Value, and Sequence_Coverage_Percent are parsed as floats for proper comparison.
  2. Filtering applies four independent checks to each row: q-value vs. the FDR threshold, unique peptide count vs. the minimum, sequence coverage vs. the minimum percentage, and the Decoy column for target-decoy separation. A protein must pass all four to get a green border.
  3. Charting with Chart.js creates two views: a histogram bins passing proteins by molecular weight so you can see if the distribution looks reasonable (most cellular proteomes peak at 25-75 kDa), and a scatter plot maps coverage against score so outliers are visually obvious.
  4. Export writes only the passing rows to a new CSV with a comment header recording which filters were active, so the exported file is self-documenting.
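Steps 1 and 4 are the ones most likely to hide subtle bugs, so here is a minimal sketch of both. It is an illustration, not the generated code: the function names (parseCsvLine, parseCsv, csvField, exportCsv) and the "#" comment-line convention are assumptions, and the parser works line by line, so it does not support line breaks inside quoted fields. The description in the usage example is made up so that it contains a comma.

```javascript
// Split one CSV line into fields, honoring double-quoted fields that may
// contain commas, and "" as an escaped quote inside a quoted field.
function parseCsvLine(line) {
  const fields = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(field);
      field = "";
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

// Parse a whole file: the first row is the header, and the columns named
// in numericColumns are converted to numbers.
function parseCsv(text, numericColumns) {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  const header = parseCsvLine(lines[0]);
  return lines.slice(1).map((line) => {
    const cells = parseCsvLine(line);
    const row = {};
    header.forEach((name, i) => {
      const cell = cells[i] ?? "";
      row[name] = numericColumns.includes(name) ? Number.parseFloat(cell) : cell;
    });
    return row;
  });
}

// Quote a field only when it contains a quote, comma, or line break.
function csvField(value) {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Build the export text: "#" comment lines first, then header and data rows.
function exportCsv(rows, columns, filters, now = new Date()) {
  const comments = [
    `# Exported: ${now.toISOString()}`,
    `# Filters: FDR <= ${filters.fdr}, unique peptides >= ${filters.minUnique}, coverage >= ${filters.minCoverage}%`,
  ];
  const body = rows.map((r) => columns.map((c) => csvField(r[c])).join(","));
  return [...comments, columns.join(","), ...body].join("\n");
}

// Usage with a made-up description that contains a comma
const sampleText =
  'Protein_ID,Description,Q_Value\nP11142,"Example protein, with a comma",0.0002\n';
const parsedRows = parseCsv(sampleText, ["Q_Value"]);
console.log(parsedRows[0].Description); // Example protein, with a comma
console.log(
  exportCsv(parsedRows, ["Protein_ID", "Description", "Q_Value"],
            { fdr: 0.01, minUnique: 2, minCoverage: 5 })
);
```

Note that many spreadsheet programs do not treat "#" lines as comments, so whoever opens the exported file will see the two header lines as ordinary rows. That is usually fine for a self-documenting export, but worth knowing before you feed the file to another tool.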
🔍For Proteomics Staff: Why 1% FDR is the standard, not a law

The 1% protein-level FDR threshold is a community convention, not a biological constant. For discovery experiments where you want maximum coverage, 5% FDR might be acceptable. For clinical proteomics or publication-ready lists, some groups use 0.1%. The slider in this dashboard lets you see exactly how many IDs you gain or lose at each threshold — which is far more informative than a binary pass/fail cutoff hidden inside the search engine settings.
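The gain or loss at each threshold is easy to tabulate yourself. This minimal sketch uses the q-values of the 13 non-decoy proteins in the sample data and looks at the FDR filter alone, ignoring the peptide and coverage filters; the countAtFdr name is hypothetical.

```javascript
// q-values of the 13 target (non-decoy) proteins in the sample CSV
const qValues = [0.0001, 0.0003, 0.0002, 0.0008, 0.008, 0.0005, 0.0001,
                 0.012, 0.0002, 0.0009, 0.003, 0.005, 0.035];

// Number of proteins whose q-value is at or below the chosen FDR threshold
const countAtFdr = (threshold) => qValues.filter((q) => q <= threshold).length;

for (const threshold of [0.001, 0.01, 0.05]) {
  console.log(`FDR ${threshold}: ${countAtFdr(threshold)} of ${qValues.length} pass`);
}
// FDR 0.001: 8 of 13 pass
// FDR 0.01: 11 of 13 pass
// FDR 0.05: 13 of 13 pass
```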


Customize it

The base dashboard is useful as-is, but every facility has unique needs. Each of these is a single follow-up prompt:

Adapt for MaxQuant proteinGroups.txt

My data comes from MaxQuant proteinGroups.txt which is tab-separated with these
column names: "Protein IDs", "Gene names", "Protein names", "Mol. weight [kDa]",
"Sequence coverage [%]", "Unique peptides", "Peptides", "MS/MS count", "Score",
"Q-value", "Reverse", "Intensity". Map these to the existing dashboard columns.
Also handle the "Reverse" column (contains "+" for decoy hits instead of TRUE/FALSE)
and the "Protein IDs" column which may contain multiple IDs separated by semicolons
(use the first one). Accept both TSV and CSV files.
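To make the request concrete, here is a rough sketch of the kind of mapping the AI would need to add. It is an assumption about one possible implementation, not MaxQuant's or the dashboard's actual code: the names MAXQUANT_TO_DASHBOARD, parseTsv, and normalizeMaxQuantRow are hypothetical, the pairing of headers follows the order in which the prompt lists them, and the two data lines in the usage example are invented.

```javascript
// MaxQuant header -> dashboard column, paired in the order the prompt lists them
const MAXQUANT_TO_DASHBOARD = {
  "Protein IDs": "Protein_ID",
  "Gene names": "Gene_Name",
  "Protein names": "Description",
  "Mol. weight [kDa]": "Molecular_Weight_kDa",
  "Sequence coverage [%]": "Sequence_Coverage_Percent",
  "Unique peptides": "Unique_Peptides",
  "Peptides": "Total_Peptides",
  "MS/MS count": "PSMs",
  "Score": "Score",
  "Q-value": "Q_Value",
  "Reverse": "Decoy",
  "Intensity": "Intensity",
};

// Split tab-separated text into objects keyed by the original header names.
function parseTsv(text) {
  const lines = text.split(/\r?\n/).filter((l) => l !== "");
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    return Object.fromEntries(header.map((name, i) => [name, cells[i] ?? ""]));
  });
}

// Rename columns, keep only the first protein ID, and turn "+" into TRUE/FALSE.
// Values stay strings here; numeric parsing happens in the existing filter code.
function normalizeMaxQuantRow(raw) {
  const row = {};
  for (const [source, target] of Object.entries(MAXQUANT_TO_DASHBOARD)) {
    row[target] = (raw[source] ?? "").trim();
  }
  row.Protein_ID = row.Protein_ID.split(";")[0];
  row.Decoy = row.Decoy === "+" ? "TRUE" : "FALSE";
  return row;
}

// Usage with two invented rows: a target with two IDs and a reversed hit
const tsv = [
  "Protein IDs\tGene names\tMol. weight [kDa]\tQ-value\tReverse",
  "P04406;EXAMPLE2\tGAPDH\t36.1\t0.0001\t",
  "REV__EXAMPLE\t\t34.2\t0.42\t+",
].join("\n");
const proteins = parseTsv(tsv).map(normalizeMaxQuantRow);
console.log(proteins.map((p) => `${p.Protein_ID} decoy=${p.Decoy}`));
```

The first row should come out as P04406 with Decoy set to FALSE, and the second as a decoy. Columns missing from the input, such as "Protein names" in this example, become empty strings rather than causing an error.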

Add peptide-level detail view

When I click on a protein row in the table, expand a detail panel below it showing
a bar chart of that protein's peptide coverage map -- a horizontal bar representing
the full protein sequence with colored blocks showing where identified peptides map.
Use the protein's Molecular_Weight_kDa as a proxy for sequence length (an average
residue weighs about 110 Da, so multiply the kDa value by roughly 9 to estimate the
amino acid count). Generate random but plausible peptide positions
for the example data since we don't have real peptide coordinates.

Add GO annotation overlay

Add a "Load GO Annotations" button that accepts a second CSV with columns:
Protein_ID, GO_Term, GO_Category (BP/MF/CC), GO_Description. After loading, add a
grouped bar chart showing the top 10 GO terms by number of passing proteins. Color
the bars by GO category: blue for Biological Process, orange for Molecular Function,
green for Cellular Component. Also add a GO_Category column to the results table.

Batch comparison mode

Add a "Compare Batches" mode. Let the user load two CSV files (Batch A and Batch B)
with the same column format. Show a Venn diagram (using canvas drawing, not a library)
of protein IDs found in each batch. Below the Venn diagram, show a table of proteins
unique to Batch A, unique to Batch B, and shared -- with columns for the score and
coverage from each batch side by side. Highlight proteins where the score differs by
more than 2-fold between batches.
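The logic behind this comparison is small enough to sketch before you hand it to the AI. The function below is a hypothetical illustration (compareBatches is not a name from the dashboard), it assumes positive scores, and the Batch B scores in the usage example are invented.

```javascript
// Split two batches into A-only, B-only, and shared protein IDs, and flag
// shared proteins whose Score differs by more than foldLimit in either direction.
function compareBatches(batchA, batchB, foldLimit = 2) {
  const a = new Map(batchA.map((p) => [p.Protein_ID, p]));
  const b = new Map(batchB.map((p) => [p.Protein_ID, p]));
  const onlyA = [...a.keys()].filter((id) => !b.has(id));
  const onlyB = [...b.keys()].filter((id) => !a.has(id));
  const shared = [...a.keys()]
    .filter((id) => b.has(id))
    .map((id) => {
      const scoreA = a.get(id).Score;
      const scoreB = b.get(id).Score;
      const ratio = Math.max(scoreA, scoreB) / Math.min(scoreA, scoreB);
      return { id, scoreA, scoreB, flagged: ratio > foldLimit };
    });
  return { onlyA, onlyB, shared };
}

// Batch A uses scores from the sample data; Batch B scores are made up.
const batchA = [
  { Protein_ID: "P04406", Score: 312.4 },
  { Protein_ID: "P60709", Score: 298.6 },
  { Protein_ID: "P06733", Score: 189.5 },
];
const batchB = [
  { Protein_ID: "P04406", Score: 140.0 },
  { Protein_ID: "P60709", Score: 301.2 },
  { Protein_ID: "P62258", Score: 156.2 },
];

const comparison = compareBatches(batchA, batchB);
console.log(comparison.onlyA);  // [ 'P06733' ]
console.log(comparison.onlyB);  // [ 'P62258' ]
console.log(comparison.shared.filter((s) => s.flagged).map((s) => s.id)); // [ 'P04406' ]
```

The three lists map directly onto the regions of the Venn diagram, so the same result can feed both the drawing and the table.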
The customization loop

Start with the working dashboard, then adapt it to your search engine’s output format and your facility’s reporting needs. Each prompt builds on what exists. You never need to plan the entire tool upfront — iterate from a solid foundation.


Try it yourself

  1. Open your CLI tool in an empty folder.
  2. Paste the main prompt from above.
  3. Open the generated protein-triage.html in your browser.
  4. Click Load Example to see the triage in action on the embedded test data.
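  5. Move the FDR slider from 0.01 to 0.05 and watch Plectin and IF2B1 change from red to yellow (they pass FDR at 5% but still have a single unique peptide and coverage below 5%). Move it to 0.001 and watch PARK7, ACTN4, and KRT9 turn red, leaving 8 proteins passing.
  6. Take a real proteinGroups.txt from a recent MaxQuant run and drop it on the dashboard. Apply the "Adapt for MaxQuant proteinGroups.txt" prompt from the Customize section first; without it the column headers will not match, even if you open the file in Excel and save it as CSV.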

If you run a proteomics facility, put this HTML file on the data analysis workstation next to your MaxQuant results folder. It turns a 15-minute manual review into a 30-second visual check.


Key takeaways

  • One prompt, one tool: a detailed prompt with embedded sample data produces a working proteomics triage dashboard in under 2 minutes.
  • Filtering by FDR, peptide count, and coverage simultaneously catches problems that sorting by any single column would miss — like a protein that passes FDR with a great score but has only one unique peptide.
  • Embedding test data with deliberate quality issues (decoy hits, single-peptide IDs, failing q-values, missing gene names) guarantees you can verify the tool works immediately.
  • Chart.js histograms and scatter plots give you an instant visual sanity check: if the molecular weight distribution looks wrong or all your proteins cluster at low scores, something upstream needs attention.
  • Client-side processing means your proteomics data never leaves the workstation — important for unpublished data and clinical samples.

KNOWLEDGE CHECK

A protein has a q-value of 0.008 but only 1 unique peptide and 3.8% sequence coverage. With default filters (FDR ≤ 0.01, ≥ 2 unique peptides, ≥ 5% coverage), how should the dashboard classify it?

KNOWLEDGE CHECK

You load a proteinGroups.txt file and the molecular weight histogram shows a huge spike in the 150+ kDa bin with very few proteins elsewhere. What does this suggest?


What’s next

In the next lesson, you will build a Skeletal Element Inventory Visualizer that maps archaeological skeletal completeness data onto an interactive diagram — a different kind of triage for a very different research domain.