Import FASTQ files
- List FASTQ files and derive sample IDs
r1_suffix <- "_R1.fastq.gz"
r2_suffix <- "_R2.fastq.gz"
r1_files <- sort(list.files(
file.path(base_dir, "FASTQ"),
pattern = paste0(r1_suffix, "$"),
full.names = TRUE
))
r2_files <- sort(list.files(
file.path(base_dir, "FASTQ"),
pattern = paste0(r2_suffix, "$"),
full.names = TRUE
))
sample_ids <- sapply(str_split(basename(r1_files), "_"), "[[", 1)
names(r1_files) <- sample_ids
names(r2_files) <- sample_ids
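Before running anything downstream, it is worth confirming that the R1 and R2 lists pair up one-to-one by sample ID. A minimal sketch in base R (using strsplit in place of stringr::str_split; the file names here are illustrative):

```r
# Derive sample IDs (the token before the first "_") and check pairing.
ids_from <- function(files) {
  vapply(strsplit(basename(files), "_", fixed = TRUE), `[[`, character(1), 1)
}

r1_demo <- c("F3D1_R1.fastq.gz", "F3D2_R1.fastq.gz") # illustrative names
r2_demo <- c("F3D1_R2.fastq.gz", "F3D2_R2.fastq.gz")

r1_ids <- ids_from(r1_demo)
r2_ids <- ids_from(r2_demo)

# Both conditions should hold before filterAndTrim: IDs are unique, and
# the same IDs occur in the same order for R1 and R2.
stopifnot(!anyDuplicated(r1_ids), identical(r1_ids, r2_ids))
```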
Read quality
Raw reads
- Per-base quality profiles from the original demultiplexed FASTQs,
before any trimming or filtering.
plotQualityProfile() shows the distribution of base
quality at each read position across many reads.
- X-axis: Base position along the read (cycle).
- Y-axis: Phred quality score (higher = better base-call
confidence).
- Green line: mean quality at each position.
- Orange line: median quality; dashed orange lines mark the 25th and
75th quality quantiles.
p_raw_R1 <- plotQualityProfile(r1_files, aggregate = TRUE) + labs(subtitle = "Read 1")
p_raw_R2 <- plotQualityProfile(r2_files, aggregate = TRUE) + labs(subtitle = "Read 2")
pdf(file.path(base_dir, "01_read_quality.pdf"), width = 10, height = 5)
print(p_raw_R1 + p_raw_R2)
dev.off()

After filterAndTrim
- Post-filter quality profiles: These plots show only the reads that
passed DADA2
filterAndTrim() and were retained for
denoising.
- Trims low-quality sequence ends (truncLen): Cuts
reads to fixed lengths based on quality profiles (e.g., forward = 240
bp) to remove low-quality tail regions.
- Applies expected-error filtering (maxEE): Removes
reads whose total expected errors exceed the set threshold (e.g., maxEE
= c(2,2)), reducing likely sequencing-error reads.
- Removes ambiguous reads (maxN): Discards reads
containing ambiguous bases (N).
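The maxEE filter works on summed per-base error probabilities rather than average quality: a Phred score Q implies an error probability of 10^(-Q/10), and the expected errors of a read are the sum of those probabilities over its bases. A worked sketch:

```r
# Expected errors (EE) of a read = sum over bases of 10^(-Q/10).
expected_errors <- function(q) sum(10^(-q / 10))

# 100 bases at Q20 (error probability 0.01 each) give EE of about 1,
# so this read would pass a maxEE = 2 threshold.
ee_q20 <- expected_errors(rep(20, 100))
round(ee_q20)
## [1] 1
ee_q20 <= 2
## [1] TRUE
```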
filter_trim_dir <- file.path(base_dir, "filter_and_trim")
dir.create(filter_trim_dir, showWarnings = FALSE) # no warning if it already exists
filt_out_r1 <- file.path(filter_trim_dir, paste0(sample_ids, r1_suffix))
filt_out_r2 <- file.path(filter_trim_dir, paste0(sample_ids, r2_suffix))
# Truncates read to a fixed final length (keeps first N bases, drops the rest)
filt_truncLen <- 0 # no trimming
# Trims bases from the start (left/5′ end) of each read
filt_trimLeft <- 0 # no trimming
# Removes reads likely to contain too many sequencing errors
filt_maxEE <- Inf # no expected-error filtering
# Removes reads containing more than the allowed number of ambiguous bases ("N")
filt_maxN <- 0 # allows no ambiguous bases
truncLen_vec <- c(as.integer(filt_truncLen), as.integer(filt_truncLen)) # R1 and R2
trimLeft_vec <- c(as.integer(filt_trimLeft), as.integer(filt_trimLeft)) # R1 and R2
maxEE_vec <- c(filt_maxEE, filt_maxEE) # R1 and R2
filt_stats <- filterAndTrim(
fwd = r1_files,
rev = r2_files,
filt = filt_out_r1,
filt.rev = filt_out_r2,
truncLen = truncLen_vec,
trimLeft = trimLeft_vec,
maxN = filt_maxN,
maxEE = maxEE_vec
)
filt_dt <- as.data.table(filt_stats, keep.rownames = "sample_id")
p_filt_R1 <- plotQualityProfile(filt_out_r1, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 1")
p_filt_R2 <- plotQualityProfile(filt_out_r2, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 2")
pdf(file.path(base_dir, "02_filter_and_trim.pdf"), width = 10, height = 5)
print(p_filt_R1 + p_filt_R2)
dev.off()

Learn the error rates
- How error learning works: With
learnErrors(), DADA2 pools a large set of quality-filtered
bases, compares observed substitutions (e.g., A→G, C→T) across quality
scores, and fits an empirical model of error probability for each
possible transition as a function of Phred quality.
- Why this matters for denoising: During sample
inference, DADA2 uses these learned probabilities to test whether a
low-abundance sequence is better explained as an error from a more
abundant sequence or as a true biological variant (ASV).
- What the plot shows: The
plotErrors()
output visualizes learned substitution error rates versus nominal
quality-score expectations; close agreement indicates a well-fit model,
while large deviations can suggest insufficient data or atypical run
quality.
- Each panel corresponds to a specific substitution type (e.g., A→G),
showing how often one base is misread as another
- Axes and elements:
- X-axis: Phred quality score (Q). Higher Q means higher
expected base-call accuracy
- Y-axis: Estimated probability of that substitution error on
a log scale
- Points: Empirical error estimates derived from the observed
data used to learn errors
- Lines: The fitted error model that DADA2 will use during
denoising
- How to interpret a “good” plot:
- Monotonic trend: Error rates should generally
decrease as Q increases (higher quality, fewer errors)
- Close agreement: Points and lines should be
reasonably aligned, indicating the model fits the empirical signal
- Reasonable magnitude: At high Q scores, error rates
should be very low; at lower Q scores, they should be higher
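The nominal reference drawn by nominalQ = TRUE is just the error rate implied by the Phred definition, p = 10^(-Q/10); on the log-scaled y-axis it falls linearly with Q. A one-line sketch of that conversion:

```r
# Per-base error probability implied by a Phred quality score Q.
phred_to_p <- function(q) 10^(-q / 10)

phred_to_p(20) # Q20: one error expected per 100 bases
## [1] 0.01
phred_to_p(40) / phred_to_p(20) # every +20 in Q cuts the rate 100-fold
## [1] 0.01
```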
# How much data is sampled to fit error rates
# - More bases: usually more stable/accurate error estimates, but longer runtime and higher memory use
# - Fewer bases: faster, but potentially noisier error estimates
# Rule of thumb: nbases = 1e8 (100,000,000 bp) is a “robust/default-like” choice for many datasets
learn_nbases <- 1e8
errF <- learnErrors(filt_out_r1, nbases = learn_nbases)
errR <- learnErrors(filt_out_r2, nbases = learn_nbases)
p_errF <- plotErrors(errF, nominalQ = TRUE) + labs(subtitle = "Forward (Read 1)")
p_errR <- plotErrors(errR, nominalQ = TRUE) + labs(subtitle = "Reverse (Read 2)")
pdf(file.path(base_dir, "03_error_rates.pdf"), width = 10, height = 5)
print(p_errF + p_errR)
dev.off()
## 29029302 total bases in 115711 reads from 12 samples will be used for learning the error rates.
## 28969435 total bases in 115711 reads from 12 samples will be used for learning the error rates.

Dereplication and denoising
- Dereplication: Compresses each sample into a set of
unique sequences and counts, which reduces redundancy and preserves
abundance information needed for inference.
- Efficiency: Operating on unique sequences (rather than all reads)
substantially reduces runtime and memory requirements.
- Accuracy: Abundance-aware, error-model–based inference improves
resolution beyond OTU clustering by separating closely related true
variants from error-derived sequences.
- Denoising (the primary DADA2 step): Using the
learned error model, DADA2 determines whether sequence differences are
likely sequencing errors or true biological variants, then outputs a
per-sample ASV count table (samples × ASVs).
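Conceptually, dereplication is just collapsing identical reads into unique sequences with counts; a toy base-R sketch (derepFastq additionally retains per-position quality summaries used by the error model):

```r
# Toy dereplication: 6 reads collapse to 2 unique sequences with counts.
reads <- c("ACGT", "ACGT", "ACGT", "AGGT", "ACGT", "AGGT")
counts <- table(reads)
uniques <- names(counts)

length(reads)   # 6 input reads
length(uniques) # 2 unique sequences; inference runs on these,
                # weighted by their abundances (4 and 2)
```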
derepF <- derepFastq(filt_out_r1)
names(derepF) <- sub(paste0(r1_suffix, "$"), "", basename(filt_out_r1))
dadaF <- dada(derepF, err = errF)
## Sample 1 - 5829 reads in 2379 unique sequences.
## Sample 2 - 19489 reads in 5539 unique sequences.
## Sample 3 - 6726 reads in 2093 unique sequences.
## Sample 4 - 6033 reads in 2185 unique sequences.
## Sample 5 - 2403 reads in 859 unique sequences.
## Sample 6 - 9725 reads in 3183 unique sequences.
## Sample 7 - 6599 reads in 1951 unique sequences.
## Sample 8 - 1723 reads in 557 unique sequences.
## Sample 9 - 32645 reads in 7640 unique sequences.
## Sample 10 - 4911 reads in 1688 unique sequences.
## Sample 11 - 7159 reads in 2371 unique sequences.
## Sample 12 - 12469 reads in 3772 unique sequences.
derepR <- derepFastq(filt_out_r2)
names(derepR) <- sub(paste0(r2_suffix, "$"), "", basename(filt_out_r2))
dadaR <- dada(derepR, err = errR)
## Sample 1 - 5829 reads in 3868 unique sequences.
## Sample 2 - 19489 reads in 12787 unique sequences.
## Sample 3 - 6726 reads in 4727 unique sequences.
## Sample 4 - 6033 reads in 4671 unique sequences.
## Sample 5 - 2403 reads in 1997 unique sequences.
## Sample 6 - 9725 reads in 7697 unique sequences.
## Sample 7 - 6599 reads in 4465 unique sequences.
## Sample 8 - 1723 reads in 1366 unique sequences.
## Sample 9 - 32645 reads in 17903 unique sequences.
## Sample 10 - 4911 reads in 3602 unique sequences.
## Sample 11 - 7159 reads in 4759 unique sequences.
## Sample 12 - 12469 reads in 9264 unique sequences.
Merge paired reads
- Combines denoised forward and reverse reads within each
sample by aligning their overlapping regions.
- Reconstructs a single high-confidence amplicon sequence per variant
when overlap is sufficient and concordant.
- Improves accuracy by requiring agreement across the overlap; read
pairs with poor overlap or conflicting bases are not merged.
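The overlap idea behind merging can be sketched in base R: reverse-complement the reverse read, scan for the longest exact overlap against the forward read's 3' end, and join. This is only a toy illustration (mergePairs works on the denoised sequences and exposes mismatch and minimum-overlap controls):

```r
revcomp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

merge_exact <- function(fwd, rev_read, min_overlap = 12) {
  rc <- revcomp(rev_read)
  if (nchar(fwd) < min_overlap || nchar(rc) < min_overlap) return(NA_character_)
  # Try the longest possible overlap first, shrinking down to min_overlap.
  for (ov in seq(min(nchar(fwd), nchar(rc)), min_overlap)) {
    if (substring(fwd, nchar(fwd) - ov + 1) == substring(rc, 1, ov)) {
      return(paste0(fwd, substring(rc, ov + 1))) # join, keeping overlap once
    }
  }
  NA_character_ # insufficient or conflicting overlap: pair is not merged
}

amplicon <- "AAACCCGGGTTTACGTACGTGGCC"           # toy 24 bp amplicon
fwd      <- substring(amplicon, 1, 18)           # forward read: bases 1-18
rev_read <- revcomp(substring(amplicon, 7, 24))  # reverse read: bases 7-24
identical(merge_exact(fwd, rev_read), amplicon)
## [1] TRUE
```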
mergers <- mergePairs(dadaF, derepF, dadaR, derepR, verbose = TRUE)
seqtab <- makeSequenceTable(mergers)
asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs))
asv_map_dt <- data.table(
asv_id = asv_ids,
asv_sequence = asv_seqs,
asv_length = nchar(asv_seqs),
total_count = as.integer(colSums(seqtab))
)
to_preview <- seqtab
colnames(to_preview) <- asv_ids
head(to_preview[, c(1:5)])
## ASV001 ASV002 ASV003 ASV004 ASV005
## F3D1 416 364 146 79 203
## F3D2 3625 1677 361 499 1324
## F3D3 1024 627 409 214 407
## F8D148 421 310 326 477 400
## F8D149 194 195 137 232 151
## F8D150 864 562 298 726 316
dim(seqtab)
## [1] 12 202
colnames(seqtab)[1] # ASV001
## [1] "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG"
table(nchar(getSequences(seqtab)))
##
## 251 252 253 254 255
## 1 30 165 5 1
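The retention tables and prevalence plots below reference seqtab_nochim, but the chimera-removal call that produces it does not appear above. The standard DADA2 step (a sketch assuming the default consensus method) would be:

```r
# Remove chimeric (bimera) sequences by cross-sample consensus.
# Chimeras arise when an incomplete extension product primes off a
# different template in a later PCR cycle.
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)

# Chimeric ASVs are often numerous but carry few reads; check the
# fraction of reads retained.
sum(seqtab_nochim) / sum(seqtab)
```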
Read retention overview
Statistics tracking and export
- Record the number of reads removed at each step and export stats
with sample metadata.
getN <- function(x) sum(getUniques(x))
filt_dt$sample_id <- sapply(str_split(filt_dt$sample_id, "_"), "[[", 1)
track_dt <- data.table(
filt_dt,
denoised_F = vapply(dadaF, getN, integer(1)),
denoised_R = vapply(dadaR, getN, integer(1)),
merged = rowSums(seqtab, na.rm = TRUE),
nonchim = rowSums(seqtab_nochim, na.rm = TRUE)
)
track_dt[, pct_remain := 100 * (nonchim / reads.in)]
track_dt[, sex := fifelse(
grepl("^F", sample_id), "Female",
fifelse(grepl("^M", sample_id), "Male", NA_character_)
)]
track_dt[, day := as.integer(sub(".*D", "", sample_id))]
track_dt[, stage := fifelse(day <= 100L, "Early", "Late")]
track_dt[, group := factor(
paste(stage, sex, sep = "+"),
levels = c("Early+Male", "Early+Female", "Late+Male", "Late+Female")
)]
head(track_dt)
## sample_id reads.in reads.out denoised_F denoised_R merged nonchim pct_remain
## <char> <num> <num> <int> <int> <num> <num> <num>
## 1: F3D1 5869 5829 5647 5428 5033 5033 85.75567
## 2: F3D2 19620 19489 19210 18194 17191 16952 86.40163
## 3: F3D3 6758 6726 6519 5961 4781 4578 67.74194
## 4: F8D148 6067 6033 5850 5170 4149 4149 68.38635
## 5: F8D149 2415 2403 2302 1892 1480 1480 61.28364
## 6: F8D150 9786 9725 9502 8361 6832 6832 69.81402
## sex day stage group
## <char> <int> <char> <fctr>
## 1: Female 1 Early Early+Female
## 2: Female 2 Early Early+Female
## 3: Female 3 Early Early+Female
## 4: Female 148 Late Late+Female
## 5: Female 149 Late Late+Female
## 6: Female 150 Late Late+Female
Mean read retention summary
- Mean read retention from raw input to the final non-chimeric ASV
table, reported as percent retained versus excluded across all
samples
mean_remaining <- mean(track_dt$pct_remain, na.rm = TRUE)
mean_excluded <- 100 - mean_remaining
bar1_dt <- data.table(
label = "Raw to final",
status = factor(c("Remaining", "Excluded"), levels = c("Remaining", "Excluded")),
pct = c(mean_remaining, mean_excluded)
)
barplot <- ggplot(bar1_dt, aes(x = label, y = pct, fill = status)) +
geom_col(width = 0.6) +
coord_flip() +
geom_text(
aes(label = sprintf("%.1f%%", pct)),
position = position_stack(vjust = 0.5),
size = 4
) +
scale_fill_manual(values = c("Remaining" = "steelblue", "Excluded" = "grey60")) +
scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.02))) +
labs(
title = NULL,
x = NULL,
y = "Percentage of reads",
fill = NULL
) +
theme_classic(base_size = 12) +
theme(
legend.position = "bottom",
text = element_text(size = 12)
)
pdf(file.path(base_dir, "04_mean_read_retention_summary.pdf"), width = 8, height = 3)
print(barplot)
dev.off()

Read retention by time and group
- Mean remaining reads at each major checkpoint, summarized across the
four groups (Early/Late × Female/Male) to highlight where reads are
primarily lost and whether retention differs by group.
steps_to_plot <- c("reads.in", "reads.out", "merged", "nonchim")
step_labels <- c(
reads.in = "Raw reads",
reads.out = "After filterAndTrim",
merged = "After merging pairs",
nonchim = "After chimera removal"
)
plot_long <- melt(
track_dt[, c("sample_id", "group", steps_to_plot), with = FALSE],
id.vars = c("sample_id", "group"),
variable.name = "step",
value.name = "remaining_reads"
)
plot_long[, step := factor(step, levels = steps_to_plot, labels = unname(step_labels[steps_to_plot]))]
sum_dt <- plot_long[, .(remaining_reads = mean(remaining_reads, na.rm = TRUE)), by = .(step, group)]
y_max <- max(sum_dt$remaining_reads, na.rm = TRUE)
step_cols <- brewer.pal(n = 4, name = "Dark2")
names(step_cols) <- levels(sum_dt$step)
barplot2 <- ggplot(sum_dt, aes(x = step, y = remaining_reads, fill = step)) +
geom_col(width = 0.75) +
facet_wrap(~group, ncol = 2, scales = "fixed") +
scale_y_continuous(
limits = c(0, y_max),
expand = expansion(mult = c(0, 0.05))
) +
scale_fill_manual(values = step_cols) +
labs(
title = NULL,
x = NULL,
y = "Remaining reads",
fill = NULL
) +
theme_classic(base_size = 12) +
theme(
legend.position = "bottom",
text = element_text(size = 12),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
pdf(file.path(base_dir, "05_number_of_reads_per_group.pdf"), width = 8, height = 9)
print(barplot2)
dev.off()

ASV composition snapshot
ASV prevalence and abundance
- Rare: ASVs in the lowest 5% of
both total abundance (summed counts across all samples) and prevalence
(number of samples detected). These represent low-support, sparsely
observed variants that contribute minimally to the overall read
pool.
- Intermediate: ASVs in the lowest
10% of both abundance and prevalence, excluding the
“Rare” set. These are low-to-moderate support variants that occur in a
limited subset of samples and may reflect condition-specific taxa,
low-level carryover, or biologically uncommon members.
- Core: All remaining ASVs outside the
low-abundance/low-prevalence region. These are comparatively
well-supported variants detected in more samples and/or at higher total
counts, and therefore most strongly influence downstream community
summaries.
prev_abund_dt <- data.table(
asv_sequence = colnames(seqtab_nochim),
abundance_total_count = as.integer(colSums(seqtab_nochim)),
prevalence_n_samples = as.integer(colSums(seqtab_nochim >= 1))
)
asv_map_sub <- unique(asv_map_dt[, .(asv_id, asv_sequence, asv_length)])
setkey(asv_map_sub, asv_sequence)
setkey(prev_abund_dt, asv_sequence)
prev_abund_dt <- asv_map_sub[prev_abund_dt]
prevalence_cutoff_rare <- as.numeric(quantile(
prev_abund_dt$prevalence_n_samples,
probs = 0.05, na.rm = TRUE, type = 7
))
prevalence_cutoff_intermediate <- as.numeric(quantile(
prev_abund_dt$prevalence_n_samples,
probs = 0.1, na.rm = TRUE, type = 7
))
abundance_cutoff_rare <- as.numeric(quantile(
prev_abund_dt$abundance_total_count,
probs = 0.05, na.rm = TRUE, type = 7
))
abundance_cutoff_intermediate <- as.numeric(quantile(
prev_abund_dt$abundance_total_count,
probs = 0.1, na.rm = TRUE, type = 7
))
prevalence_cutoff_rare_i <- floor(prevalence_cutoff_rare)
prevalence_cutoff_intermediate_i <- floor(prevalence_cutoff_intermediate)
abundance_cutoff_rare_i <- floor(abundance_cutoff_rare)
abundance_cutoff_intermediate_i <- floor(abundance_cutoff_intermediate)
prev_abund_dt[, class := "Core"]
prev_abund_dt[
prevalence_n_samples <= prevalence_cutoff_intermediate_i &
abundance_total_count <= abundance_cutoff_intermediate_i,
class := "Intermediate"
]
prev_abund_dt[
prevalence_n_samples <= prevalence_cutoff_rare_i &
abundance_total_count <= abundance_cutoff_rare_i,
class := "Rare"
]
prev_abund_dt[, class := factor(class, levels = c("Rare", "Intermediate", "Core"))]
summary_sentence <- sprintf(
"Total ASVs: %d in %d samples. Rare: %d; Intermediate: %d; Core: %d",
ncol(seqtab_nochim),
nrow(seqtab_nochim),
sum(prev_abund_dt$class == "Rare"),
sum(prev_abund_dt$class == "Intermediate"),
sum(prev_abund_dt$class == "Core")
)
color_map <- c("Rare" = "grey60", "Intermediate" = "goldenrod2", "Core" = "dodgerblue3")
abundance_plot <- ggplot(
prev_abund_dt,
aes(
x = abundance_total_count,
y = prevalence_n_samples,
color = class
)
) +
geom_point() +
scale_color_manual(values = color_map) +
scale_x_log10(name = "Abundance (total count; log10)") +
scale_y_continuous(
name = "Prevalence (number of samples)",
breaks = seq(2, 12, by = 2),
limits = c(2, 12)
) +
labs(title = summary_sentence, color = NULL) +
theme_minimal() +
theme(
legend.position = "bottom",
legend.direction = "horizontal",
legend.justification = "center",
plot.title = element_text(hjust = 0.5)
) +
guides(color = guide_legend(nrow = 1))
pdf(file.path(base_dir, "06_asv_prevalence_abundance.pdf"), width = 6, height = 6)
print(abundance_plot)
dev.off()

Export
- Summary statistics: Exported key run-level and
sample-level processing summaries (e.g., reads retained across steps)
for QC documentation and reporting.
- ASV representative sequences (FASTA): Exported the
unique inferred ASV nucleotide sequences from seqtab into
asv_sequences.fasta.
asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs))
xset <- DNAStringSet(asv_seqs)
names(xset) <- asv_ids
writeXStringSet(
xset,
filepath = file.path(base_dir, "asv_sequences.fasta"),
format = "fasta",
width = 80
)
xset
## DNAStringSet object of length 202:
## width seq names
## [1] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGTGGGTATCGAACAGG ASV001
## [2] 252 TACGGAGGATGCGAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV002
## [3] 253 TACGGAGGATCCGAGCGTTATC...GAAAGTGTGGGTATCAAACAGG ASV003
## [4] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGCGGGGATCGAACAGG ASV004
## [5] 253 TACGGAGGATTCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV005
## ... ... ...
## [198] 252 TACGTAGGTGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV198
## [199] 252 TACGTAGGTGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV199
## [200] 253 TACGTAGGGAGCAAGCGTTATC...GAAAGTGTGGGGAGCAAACAGG ASV200
## [201] 253 TACGTAGGGGGCGAGCGTTGTC...GAAAGCGTGGGGAGCAAACAGG ASV201
## [202] 252 TACGTAGGGGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV202