Applied Module 12 · AI-Powered Bioinformatics Tools

Forensic STR Profile Matcher

What you'll learn

~25 min
  • Build a forensic STR profile comparison tool with a single AI prompt
  • Visualize allele calls at standard CODIS loci with color-coded match status
  • Interpret partial profiles from degraded DNA and understand why loci drop out
  • Calculate match statistics and understand why they are not sufficient on their own for identification

What you’re building

A forensic DNA analyst receives STR typing results from skeletal remains recovered at a Korean War battlefield. The electropherogram shows peaks at some loci but not others — the DNA is too degraded for a complete profile. Three of the fifteen CODIS loci produced no result at all. The analyst needs to compare this partial profile against reference samples donated by families of missing service members and quickly assess which references are consistent with the evidence and which can be excluded.

Today that comparison happens in spreadsheets or specialized software that costs thousands of dollars per license. A browser-based visualization tool that displays profiles side by side, color-codes matches, and flags exclusions lets the analyst triage cases faster by identifying which reference comparisons warrant full statistical analysis.

That is what you will build in the next 20 minutes.

Educational prototype only

This tool demonstrates STR profile comparison concepts for training purposes. Real forensic identification requires validated software (e.g., GeneMarker, GeneMapper), statistical likelihood ratios, and accredited laboratory procedures. Never use training tools for actual casework.

By the end of this lesson you will have a forensic STR profile matcher that runs entirely in the browser. It displays allele calls at standard CODIS loci, visualizes profiles as grouped bar charts, color-codes match status, handles missing loci from degraded samples, and calculates triage-level match statistics. You will build it by giving a single, carefully crafted prompt to an LLM CLI tool.

Software pattern: Side-by-side data comparison with scoring

Load two datasets, align them on a shared key (locus name), compare values, score and visualize. This pattern works for any field comparison: test results vs. reference ranges, actual vs. budget, observed vs. expected.
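The pattern is small enough to sketch outside of forensics. The following is a minimal illustration, assuming both datasets are plain objects keyed by the shared field; the helper name alignAndCompare and the budget numbers are made up for this example and are not part of the tool you will generate.

```javascript
// Hypothetical helper: align two keyed datasets and score each key.
function alignAndCompare(left, right, scoreFn) {
  // Union of keys, so entries present on only one side are still reported.
  const keys = [...new Set([...Object.keys(left), ...Object.keys(right)])];
  return keys.map((key) => {
    const hasBoth = key in left && key in right;
    return {
      key,
      left: left[key] ?? null,
      right: right[key] ?? null,
      // Only score keys that exist in both datasets.
      status: hasBoth ? scoreFn(left[key], right[key]) : "missing",
    };
  });
}

// "Actual vs. budget" example with invented numbers.
const actual = { rent: 1200, travel: 420, hardware: 90 };
const budget = { rent: 1200, travel: 300, software: 150 };
const rows = alignAndCompare(actual, budget, (a, b) => (a <= b ? "ok" : "over"));
console.table(rows);
```

In the STR matcher the shared key is the locus name, the values are allele pairs, and the scoring function returns a match status instead of "ok" or "over".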

🔍Domain Primer: Key forensic DNA terms

New to forensic DNA analysis? Here are the key terms you will encounter:

  • STR (Short Tandem Repeat) — A region of DNA where a short sequence (2-6 base pairs) repeats in tandem. The number of repeats varies between individuals, making STRs useful for identification. Think of it like a genetic barcode where each “bar” has a different width.
  • Allele — A specific variant at a genetic locus. For STRs, the allele is the number of repeats (e.g., allele “12” means 12 repeats). Each person has two alleles per locus (one from each parent).
  • Locus (plural: loci) — A specific location on a chromosome where STR typing is performed. The FBI’s CODIS system uses 20 core loci, though older profiles may have only 13.
  • CODIS (Combined DNA Index System) — The FBI’s national DNA database system. The “CODIS loci” are the standardized STR markers that all U.S. forensic labs type, enabling cross-laboratory comparison. The current expanded CODIS core includes 20 autosomal loci. This tool uses 15 of them for simplicity. The five omitted loci (D1S1656, D2S441, D10S1248, D12S391, D22S1045) can be added as a customization.
  • Electropherogram — The graphical output of capillary electrophoresis showing DNA fragment peaks. Each peak represents an allele, and its position indicates the fragment size (which corresponds to repeat count).
  • Amelogenin (AMEL) — A sex-typing marker. Males typically show both X and Y peaks; females show only X. It is included in most commercial STR typing kits.
  • Degraded DNA — DNA that has been damaged by time, heat, moisture, or microbial activity. Degraded samples produce partial profiles because larger STR loci (longer DNA fragments) fail to amplify.
  • Partial profile — An STR profile where some loci did not produce results. Common with old skeletal remains. The fewer loci that amplify, the less statistical power for identification.
  • Reference sample — DNA collected from a known individual (usually a family member of a missing person) for comparison against evidence profiles.
  • Exclusion/inclusion — In direct parent-child comparisons, if even one locus shows alleles that are impossible given the reference, the reference is generally excluded (barring rare mutations at ~0.1-0.3% per locus per generation). For more distant relationships (siblings, uncle-nephew, grandparent-grandchild), loci with no shared alleles are expected because these relatives can inherit entirely different alleles at a locus by chance, so they do not rule out relatedness. If all typed loci are consistent, the reference is included (but inclusion is not identification without statistical analysis).
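To make the STR and allele definitions above concrete, here is a small sketch that counts tandem repeats in a sequence. It assumes a GATA repeat motif and invented flanking sequences, and the function name longestTandemRun is hypothetical. It only illustrates what "allele 12" means; laboratories normally call alleles from fragment length rather than by scanning sequence like this.

```javascript
// Count the longest run of back-to-back copies of `motif` in `sequence`.
function longestTandemRun(sequence, motif) {
  let best = 0;
  for (let start = 0; start < sequence.length; start++) {
    let count = 0;
    let pos = start;
    // Keep stepping one motif length at a time while the motif repeats.
    while (sequence.startsWith(motif, pos)) {
      count++;
      pos += motif.length;
    }
    best = Math.max(best, count);
  }
  return best;
}

// Invented flanking sequences around 12 copies of the motif.
const alleleSequence = "CCCTC" + "GATA".repeat(12) + "CTCCC";
console.log(longestTandemRun(alleleSequence, "GATA")); // 12
```

A person who is [11, 12] at this locus would carry one chromosome with 11 copies of the motif and one with 12.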

You do not need to be an expert in forensic genetics — the AI tool will handle the implementation. You just need to know what the tool is comparing and what the results mean.

Who this is for

  • Forensic DNA analysts who want a quick visual triage tool for partial profile comparisons.
  • Forensic anthropology students learning how STR profiles are used in identification.
  • Lab coordinators who want to train new analysts on the comparison workflow.

The showcase

Here is what the finished matcher looks like once you open the HTML file in a browser:

  • Profile input panel with pre-loaded sample data for an evidence profile and two reference profiles.
  • Locus-by-locus comparison table showing allele calls at each CODIS locus, with color-coded status: green (full match), yellow (partial match — one allele shared), red (exclusion — no shared alleles), gray (no data — locus did not amplify).
  • Grouped bar chart (Chart.js) showing allele repeat counts at each locus for evidence and reference profiles side by side.
  • Electropherogram-style peak view for a selected locus, showing stylized peaks at the allele positions.
  • Match statistics panel — loci compared, full matches, partial matches, exclusions, and overall consistency assessment.
  • Profile selector to switch between reference profiles for comparison.

Everything runs client-side. No DNA data leaves the browser.


The prompt

Open your terminal, navigate to a project folder, start your AI CLI tool (e.g., by typing claude), and paste this prompt:

Build a single self-contained HTML file called str-matcher.html that serves as
a forensic STR profile comparison and visualization tool. Requirements:

1. PRELOADED SAMPLE DATA (embed as JS objects on page load)
   Use the 15 CODIS core loci plus Amelogenin. Each profile has allele pairs.
   "NR" means no result (locus failed to amplify).

   Evidence Profile (Case DPAA-2024-0147, left femur):
     D3S1358: [15, 16], vWA: [17, 18], D16S539: [11, 12], CSF1PO: [10, 12],
     TPOX: [8, 11], D8S1179: [13, 14], D21S11: [29, 30], D18S51: NR,
     D5S818: [11, 12], FGA: NR, D13S317: [11, 11], D7S820: [10, 11],
     TH01: [7, 9.3], D19S433: [13, 14], D2S1338: NR, AMEL: ["X", "Y"]

   Reference Profile A (Family Reference - biological son):
     D3S1358: [15, 17], vWA: [17, 19], D16S539: [11, 13], CSF1PO: [10, 11],
     TPOX: [8, 8], D8S1179: [13, 15], D21S11: [29, 31.2], D18S51: [14, 18],
     D5S818: [11, 13], FGA: [21, 24], D13S317: [11, 12], D7S820: [10, 12],
     TH01: [7, 8], D19S433: [13, 15.2], D2S1338: [19, 23], AMEL: ["X", "Y"]

   Reference Profile B (Family Reference - unrelated candidate):
     D3S1358: [14, 18], vWA: [15, 16], D16S539: [9, 13], CSF1PO: [11, 13],
     TPOX: [9, 10], D8S1179: [10, 12], D21S11: [28, 32.2], D18S51: [12, 15],
     D5S818: [9, 13], FGA: [20, 22], D13S317: [8, 12], D7S820: [8, 12],
     TH01: [6, 8], D19S433: [12, 16], D2S1338: [17, 25], AMEL: ["X", "Y"]

2. COMPARISON TABLE
   - Table with columns: Locus, Evidence Allele 1, Evidence Allele 2,
     Reference Allele 1, Reference Allele 2, Match Status
   - Match status logic:
     * If evidence locus is NR: gray cell, "No Data" label
     * If both alleles match (order-independent): green cell, "Full Match"
     * If exactly one allele is shared: yellow cell, "Partial (1 shared)"
       Note: sharing one allele at a locus is common between unrelated
       individuals in the same population and is not evidence of relatedness
       on its own. The statistical weight of each shared allele depends on
       its frequency in the relevant population.
     * If no alleles shared: red cell, "Exclusion"
   - Show match summary below: X of Y loci compared, N full matches,
     N partial matches, N exclusions
   - If ANY locus shows exclusion in a parent-child comparison context,
     display a prominent note: "Exclusion detected — profiles are inconsistent"
   - If all compared loci show full or partial match:
     "No exclusions — profiles are consistent (statistical analysis required)"

3. ALLELE SIZE VISUALIZATION
   - Chart.js grouped bar chart with loci on the x-axis
   - For each locus, show 4 bars side by side: evidence allele 1, evidence allele 2,
     reference allele 1, reference allele 2
   - Evidence bars in blue (#3b82f6), reference bars in amber (#f59e0b)
   - NR loci show no bars (gap in the chart)
   - Y-axis label: "Allele (repeat count)"
   - Locus names on x-axis, rotated 45 degrees for readability

4. ELECTROPHEROGRAM-STYLE PEAK VIEW
   - Below the bar chart, show a stylized electropherogram for a selected locus
   - Draw Gaussian-shaped peaks at the allele positions using canvas
   - Evidence peaks in blue, reference peaks in amber (semi-transparent overlay)
   - X-axis: fragment size range appropriate for the locus
   - Y-axis: relative fluorescence units (arbitrary height)
   - Click any locus name in the table to update the peak view for that locus
   - Show locus name and allele values as labels above each peak

5. PROFILE INPUT / EDITING
   - Dropdown to select which reference profile to compare (A or B)
   - "Edit Evidence Profile" button that reveals a form with a row per locus,
     two allele input fields each, and a "No Result" checkbox
   - "Add Reference Profile" button that adds a blank profile form
   - Input validation: alleles must be numeric (or X/Y for AMEL), range 3-50

6. DESIGN
   - Dark theme: background #0f172a, cards #1e293b, text #e2e8f0, accent #10b981
   - Clean sans-serif font (Inter from Google Fonts CDN)
   - Responsive single-column layout
   - Match status colors: green #22c55e, yellow #eab308, red #ef4444, gray #64748b
   - Status badges with rounded corners and the status text inside

7. TECHNICAL
   - Pure HTML/CSS/JS in one file, no build step
   - Chart.js loaded from CDN (https://cdn.jsdelivr.net/npm/chart.js)
   - Peak visualization drawn on an HTML5 canvas element
   - Allele comparison is order-independent (e.g., [15,16] matches [16,15])
   - Handle the special case of homozygous loci (e.g., [8,8]) correctly
💡Copy-paste ready

That entire block is the prompt. Paste it as-is. The specificity is deliberate — the more precise you are about requirements, the closer the first output will be to what you actually want. Vague prompts produce vague tools.


What you get

After the LLM finishes (typically 60-90 seconds), you will have a single file: str-matcher.html. Open it in any browser.

Expected output structure

str-matcher.html (~600-900 lines)

You should see:

  1. The comparison table loads immediately with the evidence profile vs. Reference A.
  2. Every compared STR locus shows yellow (partial match), and AMEL shows green (full match). Reference A is a biological son, so every compared locus should share at least one allele (the obligate paternal allele).
  3. Three loci (D18S51, FGA, D2S1338) show gray “No Data” because the evidence profile had no result at those loci.
  4. Switch the dropdown to Reference B and all 12 compared STR loci turn red, with only AMEL still matching. Reference B is an unrelated individual with different alleles.
  5. The grouped bar chart shows evidence (blue) and reference (amber) bars side-by-side. Where bars are the same height, the alleles match.
  6. Click any locus name to see the electropherogram-style peak view with overlapping blue and amber peaks.

If something is off

Problem
Allele comparison says “Exclusion” when alleles do match
Follow-up prompt
"The allele comparison is treating [15,16] as different from [16,15]. Can you make the match logic order-independent? Sort both allele pairs before comparing."
Problem
NR loci show as zero instead of being skipped
Follow-up prompt
"Loci marked as NR are showing allele values of 0 in the chart. Can you skip NR loci entirely — no bars in the chart, gray cell in the table, and exclude them from match statistics?"
Problem
Peak view canvas is blank
Follow-up prompt
"The electropherogram canvas is empty. Make sure the canvas element has explicit width and height attributes, and that the peak drawing function is called after the canvas is added to the DOM."
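The NR problem above usually comes down to how allele calls are converted into chart data. Here is a minimal sketch, assuming profiles are stored as in the prompt (a two-element array for typed loci, the string "NR" otherwise). The helper name toBarSeries is made up for illustration, and the generated file may organize this differently. Chart.js is expected to leave a gap for null data points, so NR loci and the non-numeric AMEL calls are mapped to null rather than 0.

```javascript
// Hypothetical helper: build one bar-chart data series (allele 1 or allele 2)
// from a profile, using null for anything that should not be drawn.
function toBarSeries(profile, loci, alleleIndex) {
  return loci.map((locus) => {
    const call = profile[locus];
    if (!Array.isArray(call)) return null; // "NR" or missing locus: no bar
    // parseFloat keeps microvariants such as 9.3 or 31.2 intact.
    const value = Number.parseFloat(call[alleleIndex]);
    return Number.isFinite(value) ? value : null; // "X"/"Y": no bar
  });
}

const loci = ["TH01", "D18S51", "AMEL"];
const evidence = { TH01: [7, 9.3], D18S51: "NR", AMEL: ["X", "Y"] };
console.log(toBarSeries(evidence, loci, 0)); // [ 7, null, null ]
console.log(toBarSeries(evidence, loci, 1)); // [ 9.3, null, null ]
```

Each returned array would become the data of one dataset in the grouped bar chart, with the locus list as the labels.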

🔧When Things Go Wrong

Use the Symptom → Evidence → Request pattern: describe what you see, paste the error, then ask for a fix.

Symptom
All loci show 'Partial Match' even when both alleles are identical
Evidence
AMEL is [X, Y] in both the evidence and Reference A, and a locus edited so the pairs read [15, 16] vs [16, 15] should also be identical, yet both are labeled 'Partial'. (In the sample data, D3S1358 [15, 16] vs [15, 17] and TPOX [8, 11] vs [8, 8] share only one allele, so 'Partial' is the correct status for those.)
What to ask the AI
"The match logic is not correctly identifying full matches. A full match means the evidence and reference allele pairs are identical regardless of order. Can you fix the comparison to: 1) sort both pairs, 2) check if sorted arrays are identical for full match, 3) check if any single allele is shared for partial match, 4) no shared alleles = exclusion?"
Symptom
Chart.js bar chart shows alleles as strings instead of numbers
Evidence
The bars for allele '9.3' at TH01 and '31.2' at D21S11 are not rendering, and the y-axis scale looks wrong
What to ask the AI
"Some STR alleles have decimal values (microvariant alleles like 9.3, 31.2, 15.2). Can you parse all allele values as floats instead of integers? The y-axis should be a linear numeric scale that handles decimals."
Symptom
Amelogenin locus crashes the chart because alleles are X and Y, not numbers
Evidence
The chart throws an error when trying to plot Amelogenin. Console shows: Cannot read property of undefined
What to ask the AI
"Amelogenin uses X and Y as allele values instead of numbers. Can you handle AMEL as a special case? In the comparison table, show X/Y as text with match logic (X,Y vs X,Y = match). In the bar chart, either skip AMEL or use a separate section that shows sex marker comparison as text instead of bars."
Symptom
Electropherogram peaks are too narrow or too wide for certain loci
Evidence
The peaks for D21S11 (alleles around 28-32) look reasonable but the peaks for TPOX (alleles around 8-11) are squeezed together and overlapping
What to ask the AI
"The peak width and x-axis range need to adapt to each locus. Can you set the x-axis range based on the typical allele range for each locus (e.g., TPOX ranges from 5-15, D21S11 ranges from 24-38)? Scale the peak width proportionally so peaks are readable at all loci."

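The peak-width fix above can be sketched without any drawing code. This is a minimal illustration, assuming each allele is drawn as a Gaussian of height 1 and that the peak width is a fixed fraction of the locus range; the divisor 60 and the function names are arbitrary choices for this example. The two ranges are the ones quoted in the follow-up prompt.

```javascript
// Height of one Gaussian peak at position x.
function gaussianPeak(x, center, height, sigma) {
  return height * Math.exp(-((x - center) ** 2) / (2 * sigma ** 2));
}

// Sample a stylized trace across a locus-specific x-axis range.
function sampleTrace(alleles, range, points = 200) {
  const [min, max] = range;
  // Assumption: peak width scales with the range so peaks stay readable.
  const sigma = (max - min) / 60;
  const trace = [];
  for (let i = 0; i < points; i++) {
    const x = min + (i / (points - 1)) * (max - min);
    // Sum the contribution of every allele peak at this x position.
    const y = alleles.reduce((sum, a) => sum + gaussianPeak(x, a, 1, sigma), 0);
    trace.push({ x, y });
  }
  return trace;
}

const tpoxTrace = sampleTrace([8, 11], [5, 15]);
const d21Trace = sampleTrace([29, 31.2], [24, 38]);
console.log(tpoxTrace.length, d21Trace.length);
```

To draw a trace on a canvas, each sampled point would be scaled to pixel coordinates and connected with lineTo calls on the 2D context.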
How it works (the 2-minute explanation)

You do not need to understand every line of the generated code, but here is the mental model:

  1. STR profiles are represented as JavaScript objects with locus names as keys and two-element arrays as values. The special value null or "NR" marks loci that did not amplify.
  2. Allele comparison sorts both pairs and checks for overlap. Two alleles matching = full match. One allele matching = partial match (in a parent-child comparison, every locus should share at least one allele — the obligate allele inherited from the parent). Zero alleles matching = exclusion. This is a simplified version of what forensic software does.
  3. Chart.js grouped bars place four bars at each locus position — two for the evidence profile, two for the reference. When bars are the same height, the allele repeat counts match. This gives an instant visual assessment before you read the table.
  4. Canvas-based peaks simulate an electropherogram. Each allele is drawn as a Gaussian curve centered at the allele value. Overlapping blue and amber peaks show where profiles agree. This is not a real electropherogram (those come from capillary electrophoresis instruments), but it builds intuition for how raw data looks.
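Steps 1 and 2 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code the LLM will necessarily generate: the function names compareLocus and summarize are hypothetical, and only a few loci from the sample data are included.

```javascript
// Profiles keyed by locus; "NR" marks a locus that did not amplify.
const evidence = {
  D3S1358: [15, 16], TPOX: [8, 11], D13S317: [11, 11],
  D18S51: "NR", AMEL: ["X", "Y"],
};
const referenceA = {
  D3S1358: [15, 17], TPOX: [8, 8], D13S317: [11, 12],
  D18S51: [14, 18], AMEL: ["X", "Y"],
};

// Compare one locus. Alleles are compared as strings so "X"/"Y" and
// microvariants such as 9.3 go through the same code path.
function compareLocus(ev, ref) {
  if (!Array.isArray(ev) || !Array.isArray(ref)) return "No Data";
  const evSorted = ev.map(String).sort();
  const refSorted = ref.map(String).sort();
  // Identical pairs after sorting: order-independent full match.
  if (evSorted.join("/") === refSorted.join("/")) return "Full Match";
  // Otherwise any shared allele counts as a partial match.
  return evSorted.some((a) => refSorted.includes(a)) ? "Partial" : "Exclusion";
}

// Tally statuses across every locus in the evidence profile.
function summarize(ev, ref) {
  const counts = { "Full Match": 0, Partial: 0, Exclusion: 0, "No Data": 0 };
  for (const locus of Object.keys(ev)) counts[compareLocus(ev[locus], ref[locus])]++;
  return counts;
}

console.log(summarize(evidence, referenceA));
```

Note that the homozygous cases behave as the lesson expects: TPOX [8, 11] vs [8, 8] and D13S317 [11, 11] vs [11, 12] both come out as "Partial", because the pairs are not identical but one allele is shared.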
🔍For Researchers: Why partial profiles dominate forensic casework

DNA extracted from skeletal remains that have been buried for 70-80 years is severely degraded. The DNA strands break into short fragments, and longer STR loci (which require amplifying longer DNA fragments) are the first to fail. This is why degraded samples typically lose loci like FGA, D18S51, and D2S1338 first: in many commercial kits these loci have some of the longest amplicons. The “No Result” loci in our sample data follow this realistic degradation pattern. Understanding which loci drop out first helps analysts assess sample quality and plan extraction strategies.
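The "longest amplicons fail first" idea can be turned into a simple check. The sketch below assumes you supply an amplicon size for each locus from your own kit documentation; the locus labels and sizes shown are placeholders invented for the example, not values from any real kit, and the strict rule used here (every dropped locus is longer than every typed locus) is one possible simplification.

```javascript
// Each entry: a locus label, an amplicon size in base pairs, and whether
// the locus produced a result.
function classifyDropout(loci) {
  const dropped = loci.filter((l) => !l.amplified);
  const typed = loci.filter((l) => l.amplified);
  if (dropped.length === 0) return "No dropout";
  if (typed.length === 0) return "No loci amplified";
  const longestTyped = Math.max(...typed.map((l) => l.size));
  const shortestDropped = Math.min(...dropped.map((l) => l.size));
  // Strict rule: dropout is confined to the long end of the size range.
  return shortestDropped > longestTyped
    ? "Pattern consistent with degradation"
    : "Dropout pattern atypical";
}

// Placeholder data: labels and sizes are hypothetical.
const sample = [
  { locus: "L1", size: 120, amplified: true },
  { locus: "L2", size: 180, amplified: true },
  { locus: "L3", size: 240, amplified: true },
  { locus: "L4", size: 310, amplified: false },
  { locus: "L5", size: 340, amplified: false },
];
console.log(classifyDropout(sample)); // Pattern consistent with degradation
```

A softer rule, such as comparing the median size of dropped and typed loci, would tolerate the occasional short locus that fails for other reasons.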


Customize it

The base matcher handles two-profile comparison, but real casework involves more complex scenarios. Each of these is a single follow-up prompt:

Add batch comparison mode

Add a "Batch Compare" tab where I can load multiple reference profiles and
compare them all against the evidence profile at once. Show a summary table
with one row per reference, columns for: Reference ID, Loci Compared, Full
Matches, Partial Matches, Exclusions, and a Consistency column (Yes/No). Sort
by number of exclusions ascending so the most consistent references appear
first. Highlight rows with zero exclusions in green.

Add degradation pattern analysis

Add a "Degradation Analysis" panel that shows which loci failed to amplify
and explains the pattern. Order loci by typical amplicon size (shortest to
longest). Highlight the dropout pattern — if small-amplicon loci amplified but
large-amplicon loci did not, display "Pattern consistent with degradation."
If the pattern is random, display "Dropout pattern atypical — possible
inhibition or mixed sample." Include a horizontal bar chart showing each
locus colored by amplicon size (short=green, medium=yellow, long=red) with
NR loci marked.

Add profile export report

Add an "Export Report" button that generates a printable comparison report.
Include: case number, evidence profile table, reference profile table,
comparison results with match status per locus, match statistics summary,
a disclaimer stating this is a triage tool and not a substitute for
statistical analysis. Use CSS @media print for clean formatting. Include
the bar chart as a static image (Chart.js toBase64Image).
The customization loop

Notice the pattern: you start with a working tool, then add features one prompt at a time. Each prompt builds on what already exists. This is how all the tools in this track are built — iteratively, starting from a solid foundation. You never need to plan the entire tool upfront.


Try it yourself

  1. Open your CLI tool in an empty folder.
  2. Paste the main prompt from above.
  3. Open the generated str-matcher.html in your browser.
  4. Review the default comparison (Evidence vs. Reference A — should show consistency).
  5. Switch to Reference B — observe the exclusions.
  6. Click different locus names to see the electropherogram-style peaks.
  7. Try editing the evidence profile — mark an additional locus as “No Result” and see how it affects the match statistics.
  8. Pick one customization from the list above and add it.

Key takeaways

  • One prompt, one tool: a detailed, specific prompt produces a working STR profile comparison tool in under 2 minutes.
  • Degraded DNA produces partial profiles — some loci fail to amplify, and the tool must handle missing data gracefully (gray cells, excluded from statistics) rather than treating it as zero.
  • In direct parent-child comparisons, a single exclusion at any locus is generally definitive (barring rare mutations at ~0.1-0.3% per locus per generation): if the evidence and reference have no shared alleles at even one locus, the reference is excluded. For more distant relationships (siblings, uncle-nephew, grandparent-grandchild), loci with no shared alleles are expected because these relatives can inherit entirely different alleles at a locus by chance, so they do not rule out relatedness.
  • Match percentage is not identification — real forensic identification requires statistical likelihood ratios calculated from population allele frequencies. The triage statistics in this tool help prioritize which comparisons deserve full analysis.
  • Allele comparison must be order-independent: the genotype [15, 16] is the same as [16, 15]. Building this into the prompt prevents a common bug.
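To see why a count of shared alleles is not a statistic, here is a rough sketch of a single-locus likelihood ratio for a parent-child pair. It assumes Hardy-Weinberg proportions, no mutation, no population substructure, and no genotype for the other parent. The function name parentChildLR and the allele frequencies are invented for illustration; real casework uses validated software and published population databases.

```javascript
// LR = P(child genotype | tested person is the parent)
//    / P(child genotype | tested person is unrelated)
function parentChildLR(parent, child, freq) {
  const [c1, c2] = child;
  // The parent passes each of their two alleles with probability 1/2;
  // the child's other allele is drawn from the population.
  let pRelated = 0;
  for (const transmitted of parent) {
    if (c1 === c2) {
      if (transmitted === c1) pRelated += 0.5 * freq[c1];
    } else {
      if (transmitted === c1) pRelated += 0.5 * freq[c2];
      if (transmitted === c2) pRelated += 0.5 * freq[c1];
    }
  }
  // Unrelated: ordinary Hardy-Weinberg genotype frequency.
  const pUnrelated = c1 === c2 ? freq[c1] ** 2 : 2 * freq[c1] * freq[c2];
  return pRelated / pUnrelated;
}

// Same genotypes (evidence [15, 16], son [15, 17]), two hypothetical
// frequency tables for the shared allele 15.
const commonAllele = { 15: 0.25, 17: 0.2 };
const rareAllele = { 15: 0.05, 17: 0.2 };
console.log(parentChildLR([15, 16], [15, 17], commonAllele)); // about 1
console.log(parentChildLR([15, 16], [15, 17], rareAllele));   // about 5
```

Both cases would show the same yellow "Partial" cell in the table, yet under these assumptions the first carries essentially no weight and the second favors relatedness about five to one. Under the same independence assumptions, per-locus ratios are multiplied across loci, which is why fewer typed loci means less statistical power.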

KNOWLEDGE CHECK

Why do degraded skeletal remains produce partial STR profiles with some loci missing?

KNOWLEDGE CHECK

An evidence profile and a reference profile share at least one allele at every compared locus (no exclusions). What can you conclude?


What’s next

In the next lesson, you will build an eDNA Contamination QC Checker — a tool that compares negative control samples against field samples to flag shared OTUs and assess contamination risk in environmental DNA studies. Same pattern: one prompt, one working tool, then customize.