Raw sequencing reads vs assembled genomes

  Raw sequencing reads are the “messy, unprocessed data straight from the sequencer,” while assembled genomes are the “clean, reconstructed, near‑complete genomes built from those reads.”

 

🧬 Differences

Raw reads = millions of short fragments

Assembled genome = those fragments stitched together into long, continuous sequences (contigs/scaffolds)

 

🧪 1. What raw sequencing reads from SRA are

These are the original FASTQ files generated by Illumina, Nanopore, etc.

They contain:

  • Reads (short for Illumina, e.g., 150 bp paired-end; much longer for Nanopore or PacBio)

  • Quality scores for each base

  • Sequencing errors

  • Adapters, contaminants, host reads (sometimes)

  • No genome structure — just fragments

Think of raw reads as:

A box of shredded book pages. You know the content is there, but nothing is in order.
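To make the FASTQ contents listed above concrete, here is a minimal sketch that reads FASTQ records and decodes their quality strings. It assumes four lines per record and Phred+33 encoding (the usual case for current Illumina data). The function names and the two example reads are made up for illustration.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from FASTQ text.

    Assumes the simple layout of exactly four lines per record.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual  # drop the leading "@"


def phred_scores(quality_string, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


# Hypothetical two-read example; "I" encodes Q40 and "#" encodes Q2.
example = """@read1
ACGT
+
IIII
@read2
GGCA
+
II#I
"""

for read_id, seq, qual in parse_fastq(example):
    scores = phred_scores(qual)
    print(read_id, seq, scores, round(mean(scores), 1))
```

The low score in the second read marks a base the sequencer itself was unsure about, which is exactly the kind of evidence an assembly no longer carries.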

Why raw reads matter

  • You can reassemble the genome yourself

  • You can check quality (coverage, contamination)

  • You can run variant calling, SNP analysis, phylogenomics, etc.

  • They are the ground truth data
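The coverage check mentioned above can be roughly estimated before any alignment: expected mean depth is total sequenced bases divided by genome size. The sketch below assumes uniform read length, and the numbers are hypothetical (a 5 Mb bacterial genome sequenced with one million 2 × 150 bp read pairs).

```python
def estimated_coverage(num_reads, read_length, genome_size):
    """Expected mean depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical run: 1,000,000 pairs = 2,000,000 reads of 150 bp each.
depth = estimated_coverage(num_reads=2_000_000, read_length=150, genome_size=5_000_000)
print(f"Estimated coverage: {depth:.0f}x")
```

This is only an upper-bound estimate, since trimming, contamination, and unmapped reads all reduce the usable depth.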

 

🧪 2. What assembled genomes are

These are FASTA files (.fna, .fa) produced after bioinformatics processing (read cleaning and assembly).

Assemblies contain:

  • Contigs/scaffolds (long sequences)

  • No quality scores

  • Most sequencing errors corrected

  • Adapters removed (if reads were trimmed before assembly)

  • Genome structure reconstructed

Think of an assembly as:

The shredded book reassembled into readable chapters, though some pages may still be missing or out of order.
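As a rough illustration of what an assembly file looks like to a program, the sketch below parses FASTA text into contigs and computes N50, a common contiguity summary. The contig names and sequences are invented, and real assemblies are usually read with a library such as Biopython rather than a hand-written parser.

```python
def parse_fasta(text):
    """Return {sequence_id: sequence} from FASTA text (wrapped lines are joined)."""
    sequences = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID is the first word of the header
            sequences[current] = []
        elif current is not None:
            sequences[current].append(line)
    return {name: "".join(parts) for name, parts in sequences.items()}


def n50(lengths):
    """Contig length at which the longest contigs together reach half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical four-contig assembly; contig_1 is wrapped over two lines.
example_fasta = """>contig_1 hypothetical
ACGT
ACGT
>contig_2
ACGTAC
>contig_3
ACGT
>contig_4
AC
"""

contigs = parse_fasta(example_fasta)
lengths = [len(seq) for seq in contigs.values()]
print(len(contigs), "contigs, N50 =", n50(lengths))
```

Note that there is no per-base quality anywhere in this format: each contig is just a name and a sequence.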

Why assemblies matter

  • Ready for annotation (Prokka, Bakta)

  • Ready for AMRFinderPlus

  • Ready for pangenome tools (Panaroo, Roary) once annotated, since these take GFF3 annotation files as input

  • Much smaller and easier to work with

  • No need for heavy computation

     

     

     

| Feature | Raw Reads (FASTQ) | Assembled Genome (FASTA) |
| --- | --- | --- |
| File type | .fastq.gz | .fna, .fa |
| Data | Short fragments | Long contigs/scaffolds |
| Quality scores | Yes | No |
| Errors | Many | Mostly corrected |
| Size | Very large (GBs) | Small (KB–MB) |
| Ready for AMRFinderPlus? | ❌ No | ✅ Yes |
| Ready for pangenome tools? | ❌ No | ✅ Yes |
| Requires assembly? | Yes | No |
| Useful for variant calling? | Yes | Limited |

     

     

Why SRA often gives you raw reads, not assemblies

Because:

  • Assemblies depend on software choice, parameters, quality filtering, etc.

  • Different labs produce different assemblies from the same reads.

  • SRA stores the rawest possible data so anyone can reproduce the analysis.

ENA also hosts assemblies that submitters have deposited, often linked from the same sample or study page as the reads, which is why downloading from ENA can be easier for your AMRFinderPlus assignment.





SRA = Sequence Read Archive, a massive public database run by NCBI that stores raw sequencing data.

Think of it as:

The global warehouse where labs deposit the original, unprocessed sequencing reads from their experiments.

It contains:

  • Raw FASTQ files (short reads + quality scores)

  • Data from Illumina, Nanopore, PacBio, etc.

  • Metadata about the sample (host, tissue, study, etc.)

  • Sometimes multiple runs per sample

SRA is the rawest level of data — nothing assembled, nothing annotated.

🧬 Why SRA exists

Because journals and funders require researchers to deposit the original sequencing data so others can:

  • Reproduce the analysis

  • Reassemble genomes with different tools

  • Perform new analyses (variant calling, pangenomes, AMR detection, etc.)

  • Check for contamination or quality issues

It’s the scientific equivalent of keeping the “source code” of an experiment.

🧬 How SRA relates to ENA and DDBJ

These three databases mirror each other:

  • NCBI SRA (USA)

  • ENA (Europe)

  • DDBJ (Japan)

If a dataset is uploaded to one, it eventually appears in the others.

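BUT: ENA often lists submitted genome assemblies alongside the reads for the same sample (NCBI keeps assemblies in a separate genome database rather than in SRA), which is why ENA is easier for your AMRFinderPlus assignment.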

🧬 SRA vs. assembled genomes (quick contrast)

| Feature | SRA | Assembled Genome |
| --- | --- | --- |
| Data type | Raw reads | Contigs/scaffolds |
| File format | FASTQ | FASTA |
| Quality scores | Yes | No |
| Errors | Many | Mostly corrected |
| Ready for AMRFinderPlus? | No | Yes |
| Ready for pangenome tools? | No | Yes |
| Size | Huge | Small |
| Requires assembly? | Yes | No |

🧬 Why your instructor gave you SRA accessions

Because:

  • SRA is the standard for raw sequencing data

  • It teaches you how to trace samples back to their source

  • It forces you to understand the difference between raw reads and assemblies

  • ENA provides the assemblies you need, but SRA is the “official” accession

For your assignment, you only need the assembled genomes, not the raw reads.

If you want, I can also show you:

  • How to download SRA FASTQ files

  • How to convert SRA → FASTQ using fasterq-dump

  • How to assemble them (SPAdes, Unicycler)

  • How to check assembly quality (N50, coverage, BUSCO)

Just tell me what direction you want to go next.

What is variant calling? Can we run it using a FASTQ or a FASTA file?

Here’s the clean, intuitive way to understand variant calling, and exactly which file types you can use — this is one of those concepts that suddenly “clicks” once you see the logic behind it.

🧬 What is Variant Calling?

Variant calling = detecting differences (mutations) between your sample and a reference genome.

These differences can be:

  • SNPs (single‑nucleotide polymorphisms)

  • Insertions

  • Deletions

  • Structural variants (depending on method)

Think of it like:

You have a reference book (reference genome) and a shredded copy of the same book (your sequencing reads). Variant calling = comparing them to find typos, missing words, or extra words.

🧬 What data do you need for variant calling?

This is the key point:

Standard (read-based) variant calling requires FASTQ files (raw reads)

You cannot run read-based variant callers on a FASTA assembly alone

Let’s break down why.

🧪 Why variant calling needs FASTQ (raw reads)

FASTQ files contain:

  • Millions of short reads

  • Quality scores

  • Sequencing depth information

  • Evidence for each base

Variant-calling pipelines (an aligner such as BWA or minimap2, followed by a caller such as bcftools, GATK, or FreeBayes) need this because they:

  1. Align reads to a reference genome

  2. Count how many reads support each allele

  3. Use quality scores to judge confidence

  4. Call variants based on statistical models
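The four steps above can be sketched with a toy pileup. This is not how bcftools, GATK, or FreeBayes work internally: it replaces their statistical models with simple depth and allele-fraction thresholds, and it assumes the reads are already aligned without indels. The reference, reads, function name, and thresholds are all hypothetical.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_quality=20, min_depth=3, min_fraction=0.8):
    """Toy SNP caller.

    aligned_reads: list of (start, sequence, phred_scores), with 0-based start
    positions on the reference and no insertions or deletions.
    Returns a list of (position, ref_base, alt_base, supporting_reads, depth).
    """
    pileup = defaultdict(Counter)
    for start, seq, quals in aligned_reads:
        for offset, (base, qual) in enumerate(zip(seq, quals)):
            if qual >= min_quality:  # ignore low-confidence base calls
                pileup[start + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and depth >= min_depth and support / depth >= min_fraction:
            variants.append((pos, reference[pos], base, support, depth))
    return variants


reference = "ACGTACGT"
reads = [
    (0, "ACGAAC", [40] * 6),       # carries A instead of T at position 3
    (1, "CGAACG", [40] * 6),       # same difference
    (2, "GAACGT", [40] * 6),       # same difference
    (4, "ACTT", [40, 40, 5, 40]),  # low-quality mismatch at position 6
]

print(call_variants(reference, reads))
```

Position 3 (0-based) is reported as a T→A change because three high-quality reads agree on it, while the single low-quality mismatch at position 6 is discarded as a likely sequencing error.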

Without raw reads, you cannot know:

  • Whether a mutation is real

  • Whether it’s supported by enough reads

  • Whether it’s a sequencing error

  • Whether it’s heterozygous/homozygous (in diploid organisms)

🧬 Why you CANNOT call variants from FASTA assemblies

A FASTA assembly:

  • Has no quality scores

  • Has no read depth

  • Has already been error‑corrected

  • May have gaps and misassemblies

  • Represents one consensus sequence

So you lose all the evidence needed to detect variants.

It’s like:

Someone already reconstructed the shredded book into a clean version. You can’t see the original mistakes anymore.

Assemblies can be compared to each other or to a reference by whole-genome alignment, which does reveal differences, but with no read-level evidence behind any of them. That is genome alignment, not read-based variant calling.
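For contrast, an alignment-based comparison of two consensus sequences might look like the sketch below. It assumes the two sequences have already been aligned to equal length, with "-" marking gaps; real whole-genome aligners do that step. The sequences and function name are hypothetical.

```python
def consensus_differences(aligned_a, aligned_b):
    """List (position, base_in_a, base_in_b) wherever two aligned sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(aligned_a, aligned_b))
        if a != b
    ]


# Hypothetical pre-aligned consensus sequences; "-" is an alignment gap.
assembly_a = "ACGTACGT"
assembly_b = "ACGAAC-T"

print(consensus_differences(assembly_a, assembly_b))
```

The output lists where the sequences differ, but nothing says how many reads supported each base, so a real mutation and an assembly error look identical here.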

🧬 Summary Table

| Task | FASTQ | FASTA |
| --- | --- | --- |
| Variant calling | Yes | No |
| Genome assembly | Yes | No |
| AMRFinderPlus | No | Yes |
| Pangenome analysis | No | Yes |
| SNP phylogeny | Yes | No |
| Annotation | No | Yes |

🧬 Where this fits into your coursework

Given your assignments:

  • SRA run = FASTQ = raw reads → used for variant calling

  • ENA assembly = FASTA = assembled genome → used for AMRFinderPlus, pangenomes

Your instructor is teaching you the difference between:

  • Raw data (FASTQ)

  • Processed data (FASTA)

  • Downstream analyses (AMR detection, variant calling, annotation)

If you want, I can also explain:

  • How variant calling actually works step‑by‑step

  • The exact bcftools pipeline (align → sort → index → call → filter)

  • How to interpret a VCF file

  • How variant calling differs between viruses, bacteria, and eukaryotes

Just tell me what direction you want to go next.

What is ENA?

Here’s the clear, lab‑friendly explanation you need — especially since you’re working with SRA, ENA, AMRFinderPlus, and pangenomes.

🧬 What is ENA?

ENA = European Nucleotide Archive. It is the European counterpart of NCBI’s sequence archives (including SRA), run by EMBL-EBI (the European Bioinformatics Institute).

Think of it as:

A huge international database that stores DNA/RNA sequencing data, assembled genomes, annotations, and metadata — similar to SRA, but often easier to use.

ENA, NCBI, and DDBJ form a global data-sharing network (the INSDC). If a dataset is uploaded to one, it eventually appears in all three.

🧬 What ENA contains

ENA stores:

  • Raw sequencing reads (FASTQ)

  • Assembled genomes (FASTA)

  • Annotations (GFF, GBK)

  • Metadata (host, sample type, location, etc.)

  • Project and study information

This is why ENA is extremely convenient for your assignments.

🧬 Why ENA is often easier than SRA

This is the part that matters for your AMRFinderPlus and pangenome work.

1. ENA provides FASTA assemblies directly

Many datasets on ENA include:

  • .fna assembled genomes

  • .gff annotation files

SRA usually gives you only raw FASTQ reads, which require assembly.

2. ENA allows direct download without special tools

You can download:

  • FASTA

  • FASTQ

  • GFF

  • Metadata

…with a simple click or wget command.
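One way to find the direct download links is ENA's file report service. The sketch below only builds the request URL and does not contact the network. The endpoint and the `read_run`, `fastq_ftp`, and `fastq_md5` names reflect my understanding of the ENA Portal API and should be checked against its current documentation; the accession `SRR0000000` is a placeholder, not a real run.

```python
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(run_accession, fields=("run_accession", "fastq_ftp", "fastq_md5")):
    """Build a file-report URL listing the FASTQ download links for one run."""
    query = urlencode(
        {
            "accession": run_accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        },
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{ENA_FILEREPORT}?{query}"


# Placeholder accession for illustration only.
url = ena_filereport_url("SRR0000000")
print(url)
```

Opening the printed URL in a browser, or passing it to wget, should return a small tab-separated table whose `fastq_ftp` column holds the file locations to download.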

3. ENA pages are cleaner and easier to navigate

You can see:

  • All runs

  • All assemblies

  • All metadata

  • Links to SRA and BioSample

…in one place.

🧬 How ENA relates to SRA

| Feature | SRA (NCBI) | ENA (Europe) |
| --- | --- | --- |
| Raw reads | Yes | Yes |
| Assembled genomes | Rare | Often available |
| Annotation files | Rare | Often available |
| Download method | SRA Toolkit (prefetch, fasterq-dump) | Direct download |
| Interface | More complex | Cleaner |
| Mirrors data? | Yes | Yes |

So ENA is basically:

SRA + assemblies + annotations + easier downloads.

🧬 Why your instructor uses ENA for your assignment

Because for AMRFinderPlus and pangenome analysis, you need assembled genomes, not raw reads.

ENA gives you:

  • .fna files → ready for AMRFinderPlus

  • .gff files → ready for annotation tools

  • Clean metadata → easier to track samples

This avoids the heavy step of assembling FASTQ files yourself.


 

 

 
