STEP 1: Data Collection (GenBank, NOT SRA)
🎯 Goal:
Get complete SVV genomes (assembled sequences)
🔧 How:
- Go to NCBI Nucleotide database
Search:
- Apply filters:
- Sequence length ~7,200–7,500 bp
- Remove partial sequences
- Download:
- FASTA format
- Metadata (collection year, country)
Split into 2 groups:
- Pre-2015
- Post-2015
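The filtering and splitting above can be sketched in a few lines of Python. This is a minimal sketch, not a fixed recipe: the `Record` fields are hypothetical names for whatever metadata table you export, the "partial" check is a simple title match, and putting 2015 itself in the post group is an assumption you should confirm with your PI.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    # Hypothetical field names; adapt to your own metadata export.
    accession: str
    title: str
    length: int
    year: Optional[int]


def is_complete_genome(rec, min_len=7200, max_len=7500):
    """Keep records in the expected length range whose title is not marked partial."""
    in_range = min_len <= rec.length <= max_len
    return in_range and "partial" not in rec.title.lower()


def split_by_period(records, cutoff_year=2015):
    """Return (pre, post) lists of complete genomes.

    Assumption: records collected in cutoff_year go to the post group.
    Records with no collection year are skipped.
    """
    pre, post = [], []
    for rec in records:
        if rec.year is None or not is_complete_genome(rec):
            continue
        if rec.year < cutoff_year:
            pre.append(rec)
        else:
            post.append(rec)
    return pre, post
```

Keeping the cutoff as a parameter makes it easy to rerun the split if the grouping rule changes.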
STEP 2: Sequence Quality Control
🎯 Goal:
Remove low-quality and redundant sequences
🔧 Do:
- Remove:
- Sequences with missing regions
- Too many “N” bases
- Duplicate isolates
👉 Tools:
- Basic filtering in R / Python
- Or manual curation (acceptable for class project)
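If you go the Python route, the filtering can be as simple as the sketch below. It assumes sequences are already loaded into a dict of name → sequence; the 1% cutoff for "N" bases and treating "duplicate" as an exact sequence match are both assumptions, not fixed standards.

```python
def n_fraction(seq):
    """Fraction of ambiguous 'N' bases in a sequence (1.0 for an empty one)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 1.0


def quality_filter(seqs, max_n_fraction=0.01):
    """Drop N-rich sequences and exact duplicates.

    seqs: dict mapping sequence name -> nucleotide string.
    Assumption: a duplicate is an identical sequence; the first copy is kept.
    """
    kept, seen = {}, set()
    for name, seq in seqs.items():
        key = seq.upper()
        if n_fraction(key) > max_n_fraction:
            continue  # too many N bases
        if key in seen:
            continue  # identical to a sequence already kept
        seen.add(key)
        kept[name] = seq
    return kept
```

Sequences with missing regions still need a check against the alignment or the expected genome length, which this sketch does not do.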
STEP 3: Multiple Sequence Alignment
🎯 Goal:
Align genomes to compare positions
🔧 Tool:
MACSE v2
💡 Why MACSE (important for your writeup):
- Keeps codons intact
- Handles frameshifts (important for RNA viruses)
🔧 Command (example):
📌 Output:
- Codon-aware alignment file
STEP 4: Mutation / Genomic Variation Analysis
⚠️ Important:
Do NOT call this SNP analysis; call it mutation / genomic variation analysis
🎯 Goal:
Find differences vs reference
🔧 Steps:
- Use SVV-001 (2002) as reference
- Compare each genome to reference
- Count:
- Number of nucleotide differences
- Differences per region (P1, P2, P3, UTRs)
🧰 Tools:
- R (Biostrings)
- Python (Biopython)
📊 Output:
- Mutation counts per genome
- Mutation density (mutations per kb)
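Here is a minimal sketch of the counting step in plain Python, assuming both sequences come from the same alignment so that positions line up. The region names and coordinates you pass in are yours to define from the reference annotation; nothing here is an actual SVV coordinate.

```python
def count_differences(ref, query, regions):
    """Count mismatches between two aligned sequences.

    regions: dict mapping region name -> (start, end), 0-based half-open
             alignment columns (hypothetical; take them from your annotation).
    Columns where either sequence has a gap or an N are skipped.
    """
    skip = set("-N")
    by_region = {name: 0 for name in regions}
    total = compared = 0
    for i, (r, q) in enumerate(zip(ref.upper(), query.upper())):
        if r in skip or q in skip:
            continue
        compared += 1
        if r != q:
            total += 1
            for name, (start, end) in regions.items():
                if start <= i < end:
                    by_region[name] += 1
    # Mutation density: differences per 1,000 compared sites.
    per_kb = 1000 * total / compared if compared else 0.0
    return {"total": total, "per_kb": per_kb, "by_region": by_region}
```

Running this once per genome against the reference gives the per-genome counts and densities listed in the output above.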
STEP 5: Phylogenetic Analysis
🎯 Goal:
Understand evolutionary relationships
🔧 Tool:
IQ-TREE 2
🔧 Command:
📌 Output:
- Phylogenetic tree file (.treefile)
📊 Visualization:
Use:
💡 What to look for:
- Clustering of pre vs post-2015
- Emergence of new lineages
🧬 STEP 6: Selection Pressure Analysis
🎯 Goal:
Find adaptive evolution
🔧 Tool:
HyPhy
Methods:
- FEL (Fixed Effects Likelihood)
🔧 Output:
- Sites under:
- Positive selection (adaptive)
- Negative selection (conserved)
💡 Interpretation:
- Positive selection → adaptation
- Negative selection → functional importance
🧬 STEP 7: Codon Usage Analysis
🎯 Goal:
See how the virus adapts to its host
🔧 Metrics:
- RSCU (Relative Synonymous Codon Usage)
- ENC (Effective Number of Codons)
🧰 Tools:
- CodonW
- R packages (seqinr)
📊 Output:
- Codon bias patterns
- Compare pre vs post-2015
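To see what RSCU actually computes, here is a small standard-library sketch: each codon's count divided by the mean count of the codons for the same amino acid. It assumes an in-frame coding sequence and the standard genetic code. ENC is not implemented here, and CodonW or seqinr remain the tools to use for the real analysis.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Relative Synonymous Codon Usage for one in-frame coding sequence.

    Stop codons and single-codon amino acids (Met, Trp) are left out,
    as are amino acids that never occur in the sequence.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values
```

A value of 1.0 means no bias for that codon; values above 1.0 mean it is used more often than its synonyms.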
📊 STEP 8: Visualization
🎯 Goal:
Make results clear and publishable
🔧 Use:
- ggplot2 → mutation plots
- ggtree → phylogenetic trees
- Heatmaps → codon usage
- Boxplots → variation comparison
🧠 How Everything Connects (BIG PICTURE)
| Step | What it tells you |
|---|---|
| Alignment | Where sequences differ |
| Mutation analysis | How much the virus changed |
| Phylogeny | Evolutionary relationships |
| Selection analysis | Why changes occurred |
| Codon usage | Host adaptation |
All analyses will be performed on codon-aware alignments to preserve reading frame integrity and ensure biologically meaningful interpretation of mutations.
1. Recombination Analysis (VERY IMPRESSIVE)
💡 Why this matters:
RNA viruses (especially picornaviruses) often evolve through recombination, not just mutation.
🎯 What you test:
Did SVV evolve by mixing genomes over time?
🔧 Tools:
- RDP5 (Recombination Detection Program)
- GARD (via HyPhy)
📌 What you add to methods:
“Recombination analysis will be performed using RDP5 and/or GARD to identify potential recombination breakpoints and assess their contribution to SVV evolution.”
🔥 Why your PI will like this:
- Shows deep evolutionary thinking
- Goes beyond “basic alignment + tree”
🧬 2. Population Genetics / Diversity Analysis
🎯 What you measure:
- Genetic diversity across time
🔧 Metrics:
- Nucleotide diversity (π)
- Tajima’s D (optional but strong)
🧰 Tools:
📌 Add this:
“Genetic diversity metrics, including nucleotide diversity (π), will be calculated to compare variability between pre- and post-2015 populations.”
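For intuition, π is the average proportion of sites that differ between two sequences, taken over all pairs in a group. The sketch below assumes aligned sequences of equal length and ignores gapped or ambiguous sites pair by pair; that pairwise-deletion choice is an assumption, and dedicated packages may handle it differently.

```python
from itertools import combinations


def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences.

    Only columns where both sequences have A, C, G, or T are compared.
    """
    valid = set("ACGT")
    sites = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            sites += 1
            if x != y:
                diffs += 1
    return diffs / sites if sites else 0.0


def nucleotide_diversity(seqs):
    """Mean pairwise difference per site over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)
```

Call it once on the pre-2015 sequences and once on the post-2015 sequences to get the two values being compared.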
🔥 Why it impresses:
You’re now doing population-level inference, not just sequence comparison.
🧬 3. Sliding Window Analysis (Mutation Hotspots — Cleaner Version)
You already hinted at this — now make it more rigorous.
🎯 Goal:
Find where in the genome evolution is happening
🔧 Method:
- Sliding window (e.g., 200 bp window, 50 bp step)
📌 Add:
“A sliding window analysis will be performed to assess regional variation in mutation density across the genome.”
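A sliding window is just the difference count repeated over overlapping stretches of the alignment. This is a minimal sketch that compares one genome against the reference, using the 200 bp window and 50 bp step from the example above as defaults; it assumes both sequences come from the same alignment.

```python
def sliding_window_density(ref, query, window=200, step=50):
    """Differences per compared site in each window along two aligned sequences.

    Returns a list of (window_start, density) with 0-based starts.
    Columns with gaps or ambiguous bases in either sequence are ignored.
    """
    valid = set("ACGT")
    ref, query = ref.upper(), query.upper()
    n = min(len(ref), len(query))
    results = []
    for start in range(0, max(n - window, 0) + 1, step):
        sites = diffs = 0
        for r, q in zip(ref[start:start + window], query[start:start + window]):
            if r in valid and q in valid:
                sites += 1
                if r != q:
                    diffs += 1
        results.append((start, diffs / sites if sites else 0.0))
    return results
```

Plotting density against window start shows where along the genome the changes concentrate.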
🔥 Why better:
More quantitative and publishable than simple mutation counts
🧬 4. Protein-Level Impact Analysis (VERY SMART ADDITION)
🎯 Question:
Do mutations actually affect proteins?
🔧 What to do:
- Translate sequences → proteins
- Compare amino acid changes
Optional:
- Map mutations to:
- structural proteins (P1)
- non-structural (P2, P3)
📌 Add:
“Nucleotide changes will be translated into amino acid sequences to assess potential functional impacts across viral proteins.”
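The translate-and-compare step can be sketched as follows, assuming both coding sequences are in frame and codon-aligned (which the codon-aware alignment should give you) and that the standard genetic code applies. Codons containing gaps or ambiguous bases are translated to "X" and skipped.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def amino_acid_changes(ref_cds, query_cds):
    """List substitutions such as 'A2V' (reference residue, 1-based codon, new residue)."""
    changes = []
    pairs = zip(translate(ref_cds), translate(query_cds))
    for pos, (r, q) in enumerate(pairs, start=1):
        if r != q and "X" not in (r, q):
            changes.append(f"{r}{pos}{q}")
    return changes
```

Synonymous changes produce no entry, so the returned list contains only the mutations that alter the protein.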
🔥 Why your PI will like this:
You move from:
👉 “data analysis” → biological interpretation
🧬 5. Entropy Analysis (You already mentioned — KEEP it but refine)
🎯 Goal:
Measure variability per position
📌 Improve wording:
“Shannon entropy will be calculated for each genomic position to quantify sequence variability across time periods.”
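Shannon entropy per position is short enough to write yourself. The sketch below assumes aligned sequences of equal length and uses log base 2, so a nucleotide column ranges from 0 (fully conserved) to 2 bits (all four bases equally common). Ignoring gaps and ambiguous bases is an assumption you may want to change.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, counting only A, C, G, T."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropy(seqs):
    """Entropy for every column of a list of aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*seqs)]
```

Run it separately on the pre-2015 and post-2015 alignments to compare variability position by position.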
⚠️ What NOT to Add (Important)
Avoid these (your PI will not like):
- ❌ Variant calling pipelines (requires raw reads)
- ❌ Overly complex ML models
- ❌ Too many tools with no clear purpose
🧠 Best Strategy (THIS is the sweet spot)
Keep your core + add ONLY 2 upgrades:
✅ Final “Impressive but Clean” Methods:
- Alignment (MACSE)
- Phylogeny (IQ-TREE)
- Selection (HyPhy)
- Codon usage
- Mutation profiling
- ✅ Recombination analysis
- ✅ Genetic diversity (π)
🔥 Killer Sentence to Add (This shows maturity)
“Together, these analyses provide a comprehensive view of SVV evolution by integrating sequence variation, selection pressures, recombination, and population-level diversity using assembled genomes.”
✅ 1. Heatmaps — YES (but use correctly)
🔥 Good uses:
- Codon usage (RSCU heatmap)
- Mutation density across genome
- Amino acid changes across strains
📌 Add to methods:
“Heatmaps will be used to visualize codon usage bias and mutation patterns across genomes.”
👉 ✔ Clean
👉 ✔ Visual
👉 ✔ Interpretable
✅ 2. Codon Usage (RSCU, ENC) — KEEP (Core strength)
You already have this — just make it stronger:
🔥 Upgrade:
- Compare pre vs post-2015
- Link to host adaptation
📌 Add:
“RSCU values will be visualized using heatmaps to compare codon preference shifts across time.”
✅ 3. Viral Entropy — YES (VERY GOOD)
🎯 Why:
Measures variability per position → stronger than “mutation counts”
Tools:
- R / Bio3D / custom scripts
📌 Add:
“Shannon entropy will be calculated to quantify positional variability across the genome.”
👉 This is PhD-level clean analysis
✅ 4. Selection Pressure (FEL + SLAC) — STRONG UPGRADE
You already used FEL — now refine:
🔥 Best combo:
- FEL → site-level selection
- SLAC → faster, confirmatory
📌 Add:
“Selection pressure will be assessed using both FEL and SLAC methods to ensure robustness of inferred selective signals.”
👉 ✔ Shows depth
👉 ✔ Shows method awareness
✅ 5. Recombination (GARD) — KEEP (High impact)
Already discussed — definitely include.
⚠️ 6. Protein Structure Prediction — CAREFUL (but can impress if done right)
❗ Reality:
Yes, you can use AlphaFold + NGL Viewer, BUT:
- Your project is genome-level, not structural biology
- Doing this broadly = too much
✅ SMART WAY to include it:
ONLY do this:
- Identify key mutations under positive selection
- Pick 1 protein (e.g., capsid protein VP1)
- Map mutations onto structure
📌 Add:
“Selected amino acid changes identified under positive selection will be mapped onto predicted protein structures (e.g., AlphaFold models) to assess potential structural and functional impacts.”
👉 This is a killer addition if kept small
❌ 7. SnpEff — NOT APPROPRIATE
❗ Why:
- Designed for variant calling pipelines (raw reads)
- You are using assembled genomes
👉 Your PI will immediately catch this
❌ 8. Kraken2 / Kaiju / Krona — NO
❗ Why:
- These are for metagenomics classification
- You already know your virus (SVV)
👉 Including this = shows misunderstanding
❌ 9. DESeq2 — NO
❗ Why:
- Used for RNA-seq differential expression
- Not relevant to genome comparison
🧬 Final “Deep but Correct” Pipeline
This is what will impress your PI the most:
🔥 Core + Advanced (Perfect Balance)
1. Data collection (GenBank genomes)
2. Alignment (MACSE — justified)
3. Phylogenetics (IQ-TREE)
4. Mutation / variation profiling
5. Entropy analysis (variability) ✅
6. Selection pressure (FEL + SLAC) ✅
7. Codon usage (RSCU + heatmaps) ✅
8. Recombination (GARD) ✅
9. Genetic diversity (π) ✅
10. (Optional) Structural mapping of key mutations ⭐