STEP 1: Data Collection (GenBank, NOT SRA)

🎯 Goal:

Get complete SVV genomes (assembled sequences)

🔧 How:

Go to NCBI Nucleotide database

Search:


"Senecavirus A"[Organism] AND complete genome

Apply filters:
- Sequence length ~7,200–7,500 bp
- Remove partial sequences
Download:
- FASTA format
- Metadata (collection year, country)
Split into 2 groups:
Pre-2015
Post-2015
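
The length filter and the pre/post split above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are made up for this example, collection years are supplied as a dictionary you build from the downloaded metadata, and sequences collected in 2015 itself are placed in the post-2015 group (adjust the cutoff if your project defines the boundary differently).

```python
def parse_fasta(text):
    """Parse FASTA text into {accession: sequence}.

    The accession is taken as the first whitespace-separated token of each header.
    """
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {acc: "".join(parts) for acc, parts in records.items()}


def split_by_period(records, years, cutoff=2015, min_len=7200, max_len=7500):
    """Keep near-full-length genomes and split them into (pre, post) groups.

    records: {accession: sequence}
    years:   {accession: collection year}, built by hand from the metadata (assumption)
    cutoff:  years < cutoff go to 'pre', years >= cutoff go to 'post' (assumption)
    """
    pre, post = {}, {}
    for acc, seq in records.items():
        year = years.get(acc)
        # Skip records with no known year or with a length outside the expected range.
        if year is None or not (min_len <= len(seq) <= max_len):
            continue
        if year < cutoff:
            pre[acc] = seq
        else:
            post[acc] = seq
    return pre, post
```

Records with a missing collection year are dropped rather than guessed, which keeps the two groups clean.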

STEP 2: Sequence Quality Control
🎯 Goal:
Remove bad data
🔧 Do:
Remove:
- Sequences with missing regions
- Too many “N” bases
- Duplicate isolates

👉 Tools:

Basic filtering in R / Python
Or manual curation (acceptable for a class project)
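
A minimal sketch of the "basic filtering" idea, using only the Python standard library. The 1% threshold for N bases and the definition of a duplicate (an exactly identical sequence) are assumptions chosen for illustration; the names `n_fraction` and `quality_filter` are hypothetical.

```python
def n_fraction(seq):
    """Fraction of the sequence made up of ambiguous 'N' bases."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 1.0


def quality_filter(records, max_n_fraction=0.01):
    """Drop sequences with too many N bases and exact duplicate sequences.

    records: {name: sequence}
    max_n_fraction: illustrative threshold (assumption), not a community standard
    Returns a new dict that keeps the first copy of each distinct sequence.
    """
    kept, seen = {}, set()
    for name, seq in records.items():
        seq = seq.upper()
        if n_fraction(seq) > max_n_fraction:
            continue  # too many ambiguous bases
        if seq in seen:
            continue  # identical to a sequence that was already kept
        seen.add(seq)
        kept[name] = seq
    return kept
```

Isolates that were deposited twice with small differences will not be caught by exact matching, so a quick manual check of isolate names is still worthwhile.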

STEP 3: Multiple Sequence Alignment

🎯 Goal:

Align genomes to compare positions

🔧 Tool:

MACSE v2

💡 Why MACSE (important for your writeup):

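Keeps codons intact (nucleotides are aligned using their amino acid translation)
Tolerates frameshifts and premature stop codons, so sequencing or assembly errors do not break the reading frame
Expects coding sequences: align the polyprotein ORF with MACSE, and align the UTRs separately with a standard nucleotide aligner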

🔧 Command (example):


java -jar macse_v2.jar -prog alignSequences -seq input.fasta -out_NT aligned.fasta -out_AA aligned_AA.fasta

📌 Output:

Codon-aware alignment file

STEP 4: Mutation / Genomic Variation Analysis

⚠️ Important:

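Call this "mutation / genomic variation analysis", NOT "SNP analysis" (SNP calling implies variant calling from raw reads, and here you only have assembled genomes)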

🎯 Goal:

Find differences vs reference

🔧 Steps:

Use SVV-001 (2002) as reference
Compare each genome to reference
Count:
- Number of nucleotide differences
- Differences per region (P1, P2, P3, UTRs)
🧰 Tools:
R (Biostrings)
Python (Biopython)

📊 Output:

Mutation counts per genome
Mutation density (mutations per kb)
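
The counting step can be sketched as follows, assuming the reference and each genome have already been aligned to the same length. The region boundaries in the usage note are placeholders, not real SVV coordinates; you would replace them with alignment columns taken from the reference annotation. Function names are hypothetical.

```python
def count_differences(ref, query, regions):
    """Count nucleotide differences between an aligned reference and query.

    ref, query: aligned sequences of equal length (gaps as '-')
    regions:    {name: (start, end)} in 0-based, end-exclusive alignment columns
    Returns {name: (differences, compared_sites)}.
    Columns where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    ref, query = ref.upper(), query.upper()
    result = {}
    for name, (start, end) in regions.items():
        diffs = compared = 0
        for r, q in zip(ref[start:end], query[start:end]):
            if r in "ACGT" and q in "ACGT":
                compared += 1
                if r != q:
                    diffs += 1
        result[name] = (diffs, compared)
    return result


def per_kb(differences, compared_sites):
    """Mutation density expressed as differences per 1,000 compared sites."""
    return 1000 * differences / compared_sites if compared_sites else 0.0
```

For example, `count_differences(ref, query, {"5UTR": (0, 650), "P1": (650, 3200)})` (made-up boundaries) returns one `(differences, compared_sites)` pair per region, which `per_kb` turns into a density that is comparable across regions of different length.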

STEP 5: Phylogenetic Analysis

🎯 Goal:

Understand evolutionary relationships

🔧 Tool:

IQ-TREE 2

🔧 Command:


iqtree2 -s aligned.fasta -m MFP -B 1000 -T AUTO

📌 Output:

Phylogenetic tree file (.treefile)

📊 Visualization:

Use:

ggtree (R)
FigTree

💡 What to look for:

Clustering of pre vs post-2015
Emergence of new lineages

🧬 STEP 6: Selection Pressure Analysis

🎯 Goal:

Find adaptive evolution

🔧 Tool:

HyPhy

Methods:

FEL (Fixed Effects Likelihood)

🔧 Output:

Sites under:
- Positive selection (adaptive)
- Negative selection (conserved)

💡 Interpretation:

Positive selection → adaptation
Negative selection → functional importance

🧬 STEP 7: Codon Usage Analysis

🎯 Goal:

See how virus adapts to host

🔧 Metrics:

RSCU (Relative Synonymous Codon Usage)
ENC (Effective Number of Codons)

🧰 Tools:

CodonW
R packages (seqinr)

📊 Output:

Codon bias patterns
Compare pre vs post-2015
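
To make the RSCU metric concrete, here is a small standard-library sketch. RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch assumes the input is an in-frame coding sequence (for example the polyprotein ORF) and uses the standard genetic code; it is not a replacement for CodonW or seqinr, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Relative Synonymous Codon Usage for an in-frame coding sequence.

    Returns {codon: RSCU}. Stop codons and single-codon amino acids (Met, Trp)
    are left out, as are amino acids that never occur in the sequence.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:
            continue
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values
```

Computing `rscu` separately on the pooled pre-2015 and post-2015 coding sequences gives two tables that can be compared codon by codon. Values above 1 mean a codon is used more often than expected, values below 1 mean less often.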

📊 STEP 8: Visualization

🎯 Goal:

Make results clear and publishable

🔧 Use:

ggplot2 → mutation plots
ggtree → phylogenetic trees
Heatmaps → codon usage
Boxplots → variation comparison

🧠 How Everything Connects (BIG PICTURE)

Step	What it tells you
Alignment	Where sequences differ
Mutation analysis	How much virus changed
Phylogeny	Evolutionary relationships
Selection analysis	Why changes occurred
Codon usage	Host adaptation

All coding-region analyses will be performed on codon-aware alignments to preserve reading frame integrity and ensure biologically meaningful interpretation of mutations.

1. Recombination Analysis (VERY IMPRESSIVE)

💡 Why this matters:

RNA viruses (especially picornaviruses) often evolve through recombination, not just mutation.

🎯 What you test:

Did SVV evolve by mixing genomes over time?

🔧 Tools:

RDP5 (Recombination Detection Program)
GARD (via HyPhy)

📌 What you add to methods:

“Recombination analysis will be performed using RDP5 and/or GARD to identify potential recombination breakpoints and assess their contribution to SVV evolution.”

🔥 Why PI will like this:

Shows deep evolutionary thinking
Goes beyond “basic alignment + tree”

🧬 2. Population Genetics / Diversity Analysis

🎯 What you measure:

Genetic diversity across time

🔧 Metrics:

Nucleotide diversity (π)
Tajima’s D (optional but strong)

🧰 Tools:

DnaSP
R (pegas, ape)

📌 Add this:

“Genetic diversity metrics, including nucleotide diversity (π), will be calculated to compare variability between pre- and post-2015 populations.”
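
As a rough illustration of what π measures, the sketch below computes it as the mean proportion of differing sites over all pairs of aligned sequences. Handling gaps and ambiguous bases by skipping them pair by pair is an assumption of this sketch; DnaSP and pegas have their own conventions, so small numerical differences are expected. Function names are hypothetical.

```python
from itertools import combinations


def pairwise_difference(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are ignored.
    """
    diffs = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            sites += 1
            if x != y:
                diffs += 1
    return diffs / sites if sites else 0.0


def nucleotide_diversity(seqs):
    """Nucleotide diversity (pi): average pairwise difference per site."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)
```

Running `nucleotide_diversity` once on the pre-2015 alignment rows and once on the post-2015 rows gives the two values the methods sentence refers to.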

🔥 Why it impresses:

You’re now doing population-level inference, not just sequence comparison.

🧬 3. Sliding Window Analysis (Mutation Hotspots — Cleaner Version)

You already hinted at this — now make it more rigorous.

🎯 Goal:

Find where in the genome evolution is happening

🔧 Method:

Sliding window (e.g., 200 bp window, 50 bp step)

📌 Add:

“A sliding window analysis will be performed to assess regional variation in mutation density across the genome.”
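
One way to sketch the sliding window, assuming "mutation density" is taken to mean the fraction of variable alignment columns inside each window. That definition, and the function names, are assumptions for illustration; the 200/50 defaults mirror the example window and step given above.

```python
def variable_sites(alignment):
    """Return one boolean per alignment column: True if more than one base occurs.

    alignment: list of equal-length aligned sequences.
    Gaps and ambiguous bases are ignored when deciding whether a column varies.
    """
    flags = []
    for column in zip(*alignment):
        bases = {b for b in (c.upper() for c in column) if b in "ACGT"}
        flags.append(len(bases) > 1)
    return flags


def sliding_window_density(flags, window=200, step=50):
    """Fraction of variable sites in each window.

    Returns a list of (window_start, density) with 0-based start positions.
    Only full-length windows are reported.
    """
    densities = []
    for start in range(0, len(flags) - window + 1, step):
        densities.append((start, sum(flags[start:start + window]) / window))
    return densities
```

The resulting `(start, density)` pairs can be passed straight to a line plot, with genome position on the x-axis.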

🔥 Why better:

More quantitative and publishable than simple mutation counts

🧬 4. Protein-Level Impact Analysis (VERY SMART ADDITION)

🎯 Question:

Do mutations actually affect proteins?

🔧 What to do:

Translate sequences → proteins
Compare amino acid changes

Optional:

Map mutations to:
- structural proteins (P1)
- non-structural (P2, P3)

📌 Add:

“Nucleotide changes will be translated into amino acid sequences to assess potential functional impacts across viral proteins.”
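
A small sketch of the translate-and-compare step, assuming both sequences are in-frame, codon-aligned coding sequences of the same region (for example rows taken from the codon-aware alignment). It uses the standard genetic code; codons containing gaps, N, or any other non-ACGT symbol are skipped rather than guessed. The function name and the tuple layout are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_changes(ref_cds, query_cds):
    """List codon-level changes between two codon-aligned coding sequences.

    Returns tuples of
    (codon_number, ref_codon, query_codon, ref_aa, query_aa, kind)
    where codon_number is 1-based and kind is 'synonymous' or 'nonsynonymous'.
    """
    ref = ref_cds.upper().replace("U", "T")
    query = query_cds.upper().replace("U", "T")
    changes = []
    for i in range(0, min(len(ref), len(query)) - 2, 3):
        ref_codon, query_codon = ref[i:i + 3], query[i:i + 3]
        ref_aa = CODON_TABLE.get(ref_codon)
        query_aa = CODON_TABLE.get(query_codon)
        if ref_aa is None or query_aa is None:
            continue  # gap or ambiguous codon: cannot be translated reliably
        if ref_codon != query_codon:
            kind = "synonymous" if ref_aa == query_aa else "nonsynonymous"
            changes.append((i // 3 + 1, ref_codon, query_codon, ref_aa, query_aa, kind))
    return changes
```

Only the nonsynonymous entries change the protein, so those are the ones worth mapping onto P1, P2, and P3.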

🔥 Why PI likes this:

You move from:
👉 “data analysis” → biological interpretation

🧬 5. Entropy Analysis (You already mentioned — KEEP it but refine)

🎯 Goal:

Measure variability per position

📌 Improve wording:

“Shannon entropy will be calculated for each genomic position to quantify sequence variability across time periods.”
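
For reference, per-position Shannon entropy is H = -Σ p(b) · log2 p(b), summed over the bases b observed in an alignment column. A minimal sketch, assuming nucleotide alignments and ignoring gaps and ambiguous bases (that handling is a choice made here, not a fixed convention); function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    0.0 means the position is fully conserved; 2.0 is the maximum for
    nucleotides (all four bases equally frequent).
    """
    counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = sum((n / total) * math.log2(n / total) for n in counts.values())
    return -h if h else 0.0


def alignment_entropy(alignment):
    """Entropy for every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]
```

Calling `alignment_entropy` on the pre-2015 and post-2015 rows separately gives two per-position profiles that can be plotted on the same axes.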

⚠️ What NOT to Add (Important)

Avoid these (your PI will not like):

❌ Variant calling pipelines (requires raw reads)
❌ Overly complex ML models
❌ Too many tools with no clear purpose

🧠 Best Strategy (THIS is the sweet spot)

Keep your core + add ONLY 2 upgrades:

✅ Final “Impressive but Clean” Methods:

Alignment (MACSE)
Phylogeny (IQ-TREE)
Selection (HyPhy)
Codon usage
Mutation profiling
✅ Recombination analysis
✅ Genetic diversity (π)

🔥 Killer Sentence to Add (This shows maturity)

“Together, these analyses provide a comprehensive view of SVV evolution by integrating sequence variation, selection pressures, recombination, and population-level diversity using assembled genomes.”

✅ 1. Heatmaps — YES (but use correctly)

🔥 Good uses:

Codon usage (RSCU heatmap)
Mutation density across genome
Amino acid changes across strains

📌 Add to methods:

“Heatmaps will be used to visualize codon usage bias and mutation patterns across genomes.”

👉 ✔ Clean
👉 ✔ Visual
👉 ✔ Interpretable

✅ 2. Codon Usage (RSCU, ENC) — KEEP (Core strength)

You already have this — just make it stronger:

🔥 Upgrade:

Compare pre vs post-2015
Link to host adaptation

📌 Add:

“RSCU values will be visualized using heatmaps to compare codon preference shifts across time.”

✅ 3. Viral Entropy — YES (VERY GOOD)

🎯 Why:

Measures variability per position → stronger than “mutation counts”

Tools:

R / Bio3D / custom scripts

📌 Add:

“Shannon entropy will be calculated to quantify positional variability across the genome.”

👉 This is PhD-level clean analysis

✅ 4. Selection Pressure (FEL + SLAC) — STRONG UPGRADE

You already used FEL — now refine:

🔥 Best combo:

FEL → site-level selection
SLAC → faster, confirmatory

📌 Add:

“Selection pressure will be assessed using both FEL and SLAC methods to ensure robustness of inferred selective signals.”

👉 ✔ Shows depth
👉 ✔ Shows method awareness

✅ 5. Recombination (GARD) — KEEP (High impact)

Already discussed — definitely include.

⚠️ 6. Protein Structure Prediction — CAREFUL (but can impress if done right)

❗ Reality:

Yes, you can use AlphaFold + NGL Viewer, BUT:

Your project is genome-level, not structural biology
Doing this broadly = too much

✅ SMART WAY to include it:

ONLY do this:

Identify key mutations under positive selection
Pick 1 protein (e.g., capsid protein VP1)
Map mutations onto structure

📌 Add:

“Selected amino acid changes identified under positive selection will be mapped onto predicted protein structures (e.g., AlphaFold models) to assess potential structural and functional impacts.”

Alignment Ideas for Beginners
