Import FASTQ files
- List FASTQ files and derive sample IDs
r1_suffix <- "_R1.fastq.gz"
r2_suffix <- "_R2.fastq.gz"
r1_files <- sort(list.files(
file.path(base_dir, "FASTQ"),
pattern = paste0(r1_suffix, "$"),
full.names = TRUE
))
r2_files <- sort(list.files(
file.path(base_dir, "FASTQ"),
pattern = paste0(r2_suffix, "$"),
full.names = TRUE
))
sample_ids <- sapply(str_split(basename(r1_files), "_"), "[[", 1)
names(r1_files) <- sample_ids
names(r2_files) <- sample_ids
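Before running anything downstream, it is worth confirming that the R1 and R2 lists pair up one-to-one by sample ID. A minimal sketch in base R (using strsplit in place of stringr::str_split; the file names here are illustrative):

```r
# Derive sample IDs (the token before the first "_") and check pairing.
ids_from <- function(files) {
  vapply(strsplit(basename(files), "_", fixed = TRUE), `[[`, character(1), 1)
}

r1_demo <- c("F3D1_R1.fastq.gz", "F3D2_R1.fastq.gz") # illustrative names
r2_demo <- c("F3D1_R2.fastq.gz", "F3D2_R2.fastq.gz")

r1_ids <- ids_from(r1_demo)
r2_ids <- ids_from(r2_demo)

# Both conditions should hold before filterAndTrim: IDs are unique, and
# the same IDs occur in the same order for R1 and R2.
stopifnot(!anyDuplicated(r1_ids), identical(r1_ids, r2_ids))
```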
Read quality
Raw reads
- Per-base quality profiles from the original demultiplexed FASTQs,
before any trimming or filtering.
plotQualityProfile() shows the distribution of base
quality at each read position across many reads.
- X-axis: Base position along the read (cycle).
- Y-axis: Phred quality score (higher = better base-call
confidence).
- Green line: mean quality at each position.
- Orange line: median quality; dashed orange lines mark the 25th and
75th quality quantiles.
p_raw_R1 <- plotQualityProfile(r1_files, aggregate = TRUE) + labs(subtitle = "Read 1")
p_raw_R2 <- plotQualityProfile(r2_files, aggregate = TRUE) + labs(subtitle = "Read 2")
pdf(file.path(base_dir, "01_read_quality.pdf"), width = 10, height = 5)
print(p_raw_R1 + p_raw_R2)
dev.off()

After filterAndTrim
- Post-filter quality profiles: These plots show only the reads that
passed DADA2
filterAndTrim() and were retained for
denoising.
- Trims low-quality sequence ends (truncLen): Cuts
reads to fixed lengths based on quality profiles (e.g., forward = 240
bp) to remove low-quality tail regions.
- Applies expected-error filtering (maxEE): Removes
reads whose total expected errors exceed the set threshold (e.g., maxEE
= c(2,2)), reducing likely sequencing-error reads.
- Removes ambiguous reads (maxN): Discards reads
containing ambiguous bases (N).
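The maxEE filter works on summed per-base error probabilities rather than average quality: a Phred score Q implies an error probability of 10^(-Q/10), and the expected errors of a read are the sum of those probabilities over its bases. A worked sketch:

```r
# Expected errors (EE) of a read = sum over bases of 10^(-Q/10).
expected_errors <- function(q) sum(10^(-q / 10))

# 100 bases at Q20 (error probability 0.01 each) give EE of about 1,
# so this read would pass a maxEE = 2 threshold.
ee_q20 <- expected_errors(rep(20, 100))
round(ee_q20)
## [1] 1
ee_q20 <= 2
## [1] TRUE
```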
filter_trim_dir <- file.path(base_dir, "filter_and_trim")
dir.create(filter_trim_dir, showWarnings = FALSE) # no warning if it already exists
filt_out_r1 <- file.path(filter_trim_dir, paste0(sample_ids, r1_suffix))
filt_out_r2 <- file.path(filter_trim_dir, paste0(sample_ids, r2_suffix))
# Truncates read to a fixed final length (keeps first N bases, drops the rest)
filt_truncLen <- 0 # no trimming
# Trims bases from the start (left/5′ end) of each read
filt_trimLeft <- 0 # no trimming
# Removes reads likely to contain too many sequencing errors
filt_maxEE <- Inf # no expected-error filtering
# Removes reads containing more than the allowed number of ambiguous bases ("N")
filt_maxN <- 0 # allows no ambiguous bases
truncLen_vec <- c(as.integer(filt_truncLen), as.integer(filt_truncLen)) # R1 and R2
trimLeft_vec <- c(as.integer(filt_trimLeft), as.integer(filt_trimLeft)) # R1 and R2
maxEE_vec <- c(filt_maxEE, filt_maxEE) # R1 and R2
filt_stats <- filterAndTrim(
fwd = r1_files,
rev = r2_files,
filt = filt_out_r1,
filt.rev = filt_out_r2,
truncLen = truncLen_vec,
trimLeft = trimLeft_vec,
maxN = filt_maxN,
maxEE = maxEE_vec
)
filt_dt <- as.data.table(filt_stats, keep.rownames = "sample_id")
p_filt_R1 <- plotQualityProfile(filt_out_r1, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 1")
p_filt_R2 <- plotQualityProfile(filt_out_r2, aggregate = TRUE) + labs(subtitle = "After filterAndTrim — Read 2")
pdf(file.path(base_dir, "02_filter_and_trim.pdf"), width = 10, height = 5)
print(p_filt_R1 + p_filt_R2)
dev.off()

Learn the error rates
- How error learning works: With
learnErrors(), DADA2 pools a large set of quality-filtered
bases, compares observed substitutions (e.g., A→G, C→T) across quality
scores, and fits an empirical model of error probability for each
possible transition as a function of Phred quality.
- Why this matters for denoising: During sample
inference, DADA2 uses these learned probabilities to test whether a
low-abundance sequence is better explained as an error from a more
abundant sequence or as a true biological variant (ASV).
- What the plot shows: The
plotErrors()
output visualizes learned substitution error rates versus nominal
quality-score expectations; close agreement indicates a well-fit model,
while large deviations can suggest insufficient data or atypical run
quality.
- Each panel corresponds to a specific substitution type (e.g., A→G),
showing how often one base is misread as another
- Axes and elements:
- X-axis: Phred quality score (Q). Higher Q means higher
expected base-call accuracy
- Y-axis: Estimated probability of that substitution error on
a log scale
- Points: Empirical error estimates derived from the observed
data used to learn errors
- Lines: The fitted error model that DADA2 will use during
denoising
- How to interpret a “good” plot:
- Monotonic trend: Error rates should generally
decrease as Q increases (higher quality, fewer errors)
- Close agreement: Points and lines should be
reasonably aligned, indicating the model fits the empirical signal
- Reasonable magnitude: At high Q scores, error rates
should be very low; at lower Q scores, they should be higher
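The nominal reference drawn by nominalQ = TRUE is just the error rate implied by the Phred definition, p = 10^(-Q/10); on the log-scaled y-axis it falls linearly with Q. A one-line sketch of that conversion:

```r
# Per-base error probability implied by a Phred quality score Q.
phred_to_p <- function(q) 10^(-q / 10)

phred_to_p(20) # Q20: one error expected per 100 bases
## [1] 0.01
phred_to_p(40) / phred_to_p(20) # every +20 in Q cuts the rate 100-fold
## [1] 0.01
```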
# How much data is sampled to fit error rates
# - More bases: usually more stable/accurate error estimates, but longer runtime and higher memory use
# - Fewer bases: faster, but potentially noisier error estimates
# Rule of thumb: nbases = 1e8 (100,000,000 bp) is a “robust/default-like” choice for many datasets
learn_nbases <- 1e8
errF <- learnErrors(filt_out_r1, nbases = learn_nbases)
errR <- learnErrors(filt_out_r2, nbases = learn_nbases)
p_errF <- plotErrors(errF, nominalQ = TRUE) + labs(subtitle = "Forward (Read 1)")
p_errR <- plotErrors(errR, nominalQ = TRUE) + labs(subtitle = "Reverse (Read 2)")
pdf(file.path(base_dir, "03_error_rates.pdf"), width = 10, height = 5)
print(p_errF + p_errR)
dev.off()
## 29029302 total bases in 115711 reads from 12 samples will be used for learning the error rates.
## 28969435 total bases in 115711 reads from 12 samples will be used for learning the error rates.

Dereplication and denoising
- Dereplication: Compresses each sample into a set of
unique sequences and counts, which reduces redundancy and preserves
abundance information needed for inference.
- Efficiency: Operating on unique sequences (rather than all reads)
substantially reduces runtime and memory requirements.
- Accuracy: Abundance-aware, error-model–based inference improves
resolution beyond OTU clustering by separating closely related true
variants from error-derived sequences.
- Denoising (the primary DADA2 step): Using the
learned error model, DADA2 determines whether sequence differences are
likely sequencing errors or true biological variants, then outputs a
per-sample ASV count table (samples × ASVs).
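Conceptually, dereplication is just collapsing identical reads into unique sequences with counts; a toy base-R sketch (derepFastq additionally retains per-position quality summaries used by the error model):

```r
# Toy dereplication: 6 reads collapse to 2 unique sequences with counts.
reads <- c("ACGT", "ACGT", "ACGT", "AGGT", "ACGT", "AGGT")
counts <- table(reads)
uniques <- names(counts)

length(reads)   # 6 input reads
length(uniques) # 2 unique sequences; inference runs on these,
                # weighted by their abundances (4 and 2)
```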
derepF <- derepFastq(filt_out_r1)
names(derepF) <- sub(paste0(r1_suffix, "$"), "", basename(filt_out_r1))
dadaF <- dada(derepF, err = errF)
## Sample 1 - 5829 reads in 2379 unique sequences.
## Sample 2 - 19489 reads in 5539 unique sequences.
## Sample 3 - 6726 reads in 2093 unique sequences.
## Sample 4 - 6033 reads in 2185 unique sequences.
## Sample 5 - 2403 reads in 859 unique sequences.
## Sample 6 - 9725 reads in 3183 unique sequences.
## Sample 7 - 6599 reads in 1951 unique sequences.
## Sample 8 - 1723 reads in 557 unique sequences.
## Sample 9 - 32645 reads in 7640 unique sequences.
## Sample 10 - 4911 reads in 1688 unique sequences.
## Sample 11 - 7159 reads in 2371 unique sequences.
## Sample 12 - 12469 reads in 3772 unique sequences.
derepR <- derepFastq(filt_out_r2)
names(derepR) <- sub(paste0(r2_suffix, "$"), "", basename(filt_out_r2))
dadaR <- dada(derepR, err = errR)
## Sample 1 - 5829 reads in 3868 unique sequences.
## Sample 2 - 19489 reads in 12787 unique sequences.
## Sample 3 - 6726 reads in 4727 unique sequences.
## Sample 4 - 6033 reads in 4671 unique sequences.
## Sample 5 - 2403 reads in 1997 unique sequences.
## Sample 6 - 9725 reads in 7697 unique sequences.
## Sample 7 - 6599 reads in 4465 unique sequences.
## Sample 8 - 1723 reads in 1366 unique sequences.
## Sample 9 - 32645 reads in 17903 unique sequences.
## Sample 10 - 4911 reads in 3602 unique sequences.
## Sample 11 - 7159 reads in 4759 unique sequences.
## Sample 12 - 12469 reads in 9264 unique sequences.
Merge paired reads
- Combines denoised forward and reverse reads within each
sample by aligning their overlapping regions.
- Reconstructs a single high-confidence amplicon sequence per variant
when overlap is sufficient and concordant.
- Improves accuracy by requiring agreement across the overlap; read
pairs with poor overlap or conflicting bases are not merged.
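The overlap idea behind merging can be sketched in base R: reverse-complement the reverse read, scan for the longest exact overlap against the forward read's 3' end, and join. This is only a toy illustration (mergePairs works on the denoised sequences and exposes mismatch and minimum-overlap controls):

```r
revcomp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

merge_exact <- function(fwd, rev_read, min_overlap = 12) {
  rc <- revcomp(rev_read)
  if (nchar(fwd) < min_overlap || nchar(rc) < min_overlap) return(NA_character_)
  # Try the longest possible overlap first, shrinking down to min_overlap.
  for (ov in seq(min(nchar(fwd), nchar(rc)), min_overlap)) {
    if (substring(fwd, nchar(fwd) - ov + 1) == substring(rc, 1, ov)) {
      return(paste0(fwd, substring(rc, ov + 1))) # join, keeping overlap once
    }
  }
  NA_character_ # insufficient or conflicting overlap: pair is not merged
}

amplicon <- "AAACCCGGGTTTACGTACGTGGCC"           # toy 24 bp amplicon
fwd      <- substring(amplicon, 1, 18)           # forward read: bases 1-18
rev_read <- revcomp(substring(amplicon, 7, 24))  # reverse read: bases 7-24
identical(merge_exact(fwd, rev_read), amplicon)
## [1] TRUE
```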
mergers <- mergePairs(dadaF, derepF, dadaR, derepR, verbose = TRUE)
seqtab <- makeSequenceTable(mergers)
asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs))
asv_map_dt <- data.table(
asv_id = asv_ids,
asv_sequence = asv_seqs,
asv_length = nchar(asv_seqs),
total_count = as.integer(colSums(seqtab))
)
to_preview <- seqtab
colnames(to_preview) <- asv_ids
head(to_preview[, c(1:5)])
## ASV001 ASV002 ASV003 ASV004 ASV005
## F3D1 416 364 146 79 203
## F3D2 3625 1677 361 499 1324
## F3D3 1024 627 409 214 407
## F8D148 421 310 326 477 400
## F8D149 194 195 137 232 151
## F8D150 864 562 298 726 316
dim(seqtab)
## [1] 12 202
colnames(seqtab)[1] # ASV001
## [1] "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG"
table(nchar(getSequences(seqtab)))
##
## 251 252 253 254 255
## 1 30 165 5 1
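The retention tables and prevalence plots below reference seqtab_nochim, but the chimera-removal call that produces it does not appear above. The standard DADA2 step (a sketch assuming the default consensus method) would be:

```r
# Remove chimeric (bimera) sequences by cross-sample consensus.
# Chimeras arise when an incomplete extension product primes off a
# different template in a later PCR cycle.
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)

# Chimeric ASVs are often numerous but carry few reads; check the
# fraction of reads retained.
sum(seqtab_nochim) / sum(seqtab)
```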
Read retention overview
Statistics tracking and export
- Record the number of reads removed at each step and export stats
with sample metadata.
getN <- function(x) sum(getUniques(x))
filt_dt$sample_id <- sapply(str_split(filt_dt$sample_id, "_"), "[[", 1)
track_dt <- data.table(
filt_dt,
denoised_F = vapply(dadaF, getN, integer(1)),
denoised_R = vapply(dadaR, getN, integer(1)),
merged = rowSums(seqtab, na.rm = TRUE),
nonchim = rowSums(seqtab_nochim, na.rm = TRUE)
)
track_dt[, pct_remain := 100 * (nonchim / reads.in)]
track_dt[, sex := fifelse(
grepl("^F", sample_id), "Female",
fifelse(grepl("^M", sample_id), "Male", NA_character_)
)]
track_dt[, day := as.integer(sub(".*D", "", sample_id))]
track_dt[, stage := fifelse(day <= 100L, "Early", "Late")]
track_dt[, group := factor(
paste(stage, sex, sep = "+"),
levels = c("Early+Male", "Early+Female", "Late+Male", "Late+Female")
)]
head(track_dt)
## sample_id reads.in reads.out denoised_F denoised_R merged nonchim pct_remain
## <char> <num> <num> <int> <int> <num> <num> <num>
## 1: F3D1 5869 5829 5647 5428 5033 5033 85.75567
## 2: F3D2 19620 19489 19210 18194 17191 16952 86.40163
## 3: F3D3 6758 6726 6519 5961 4781 4578 67.74194
## 4: F8D148 6067 6033 5850 5170 4149 4149 68.38635
## 5: F8D149 2415 2403 2302 1892 1480 1480 61.28364
## 6: F8D150 9786 9725 9502 8361 6832 6832 69.81402
## sex day stage group
## <char> <int> <char> <fctr>
## 1: Female 1 Early Early+Female
## 2: Female 2 Early Early+Female
## 3: Female 3 Early Early+Female
## 4: Female 148 Late Late+Female
## 5: Female 149 Late Late+Female
## 6: Female 150 Late Late+Female
Mean read retention summary
- Mean read retention from raw input to the final non-chimeric ASV
table, reported as percent retained versus excluded across all
samples
mean_remaining <- mean(track_dt$pct_remain, na.rm = TRUE)
mean_excluded <- 100 - mean_remaining
bar1_dt <- data.table(
label = "Raw to final",
status = factor(c("Remaining", "Excluded"), levels = c("Remaining", "Excluded")),
pct = c(mean_remaining, mean_excluded)
)
barplot <- ggplot(bar1_dt, aes(x = label, y = pct, fill = status)) +
geom_col(width = 0.6) +
coord_flip() +
geom_text(
aes(label = sprintf("%.1f%%", pct)),
position = position_stack(vjust = 0.5),
size = 4
) +
scale_fill_manual(values = c("Remaining" = "steelblue", "Excluded" = "grey60")) +
scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.02))) +
labs(
title = NULL,
x = NULL,
y = "Percentage of reads",
fill = NULL
) +
theme_classic(base_size = 12) +
theme(
legend.position = "bottom",
text = element_text(size = 12)
)
pdf(file.path(base_dir, "04_mean_read_retention_summary.pdf"), width = 8, height = 3)
print(barplot)
dev.off()

Read retention by time and group
- Mean remaining reads at each major checkpoint, summarized across the
four groups (Early/Late × Female/Male) to highlight where reads are
primarily lost and whether retention differs by group.
steps_to_plot <- c("reads.in", "reads.out", "merged", "nonchim")
step_labels <- c(
reads.in = "Raw reads",
reads.out = "After filterAndTrim",
merged = "After merging pairs",
nonchim = "After chimera removal"
)
plot_long <- melt(
track_dt[, c("sample_id", "group", steps_to_plot), with = FALSE],
id.vars = c("sample_id", "group"),
variable.name = "step",
value.name = "remaining_reads"
)
plot_long[, step := factor(step, levels = steps_to_plot, labels = unname(step_labels[steps_to_plot]))]
sum_dt <- plot_long[, .(remaining_reads = mean(remaining_reads, na.rm = TRUE)), by = .(step, group)]
y_max <- max(sum_dt$remaining_reads, na.rm = TRUE)
step_cols <- brewer.pal(n = 4, name = "Dark2")
names(step_cols) <- levels(sum_dt$step)
barplot2 <- ggplot(sum_dt, aes(x = step, y = remaining_reads, fill = step)) +
geom_col(width = 0.75) +
facet_wrap(~group, ncol = 2, scales = "fixed") +
scale_y_continuous(
limits = c(0, y_max),
expand = expansion(mult = c(0, 0.05))
) +
scale_fill_manual(values = step_cols) +
labs(
title = NULL,
x = NULL,
y = "Remaining reads",
fill = NULL
) +
theme_classic(base_size = 12) +
theme(
legend.position = "bottom",
text = element_text(size = 12),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
pdf(file.path(base_dir, "05_number_of_reads_per_group.pdf"), width = 8, height = 9)
print(barplot2)
dev.off()

ASV composition snapshot
ASV prevalence and abundance
- Rare: ASVs in the lowest 5% of
both total abundance (summed counts across all samples) and prevalence
(number of samples detected). These represent low-support, sparsely
observed variants that contribute minimally to the overall read
pool.
- Intermediate: ASVs in the lowest
10% of both abundance and prevalence, excluding the
“Rare” set. These are low-to-moderate support variants that occur in a
limited subset of samples and may reflect condition-specific taxa,
low-level carryover, or biologically uncommon members.
- Core: All remaining ASVs outside the
low-abundance/low-prevalence region. These are comparatively
well-supported variants detected in more samples and/or at higher total
counts, and therefore most strongly influence downstream community
summaries.
prev_abund_dt <- data.table(
asv_sequence = colnames(seqtab_nochim),
abundance_total_count = as.integer(colSums(seqtab_nochim)),
prevalence_n_samples = as.integer(colSums(seqtab_nochim >= 1))
)
asv_map_sub <- unique(asv_map_dt[, .(asv_id, asv_sequence, asv_length)])
setkey(asv_map_sub, asv_sequence)
setkey(prev_abund_dt, asv_sequence)
prev_abund_dt <- asv_map_sub[prev_abund_dt]
prevalence_cutoff_rare <- as.numeric(quantile(
prev_abund_dt$prevalence_n_samples,
probs = 0.05, na.rm = TRUE, type = 7
))
prevalence_cutoff_intermediate <- as.numeric(quantile(
prev_abund_dt$prevalence_n_samples,
probs = 0.1, na.rm = TRUE, type = 7
))
abundance_cutoff_rare <- as.numeric(quantile(
prev_abund_dt$abundance_total_count,
probs = 0.05, na.rm = TRUE, type = 7
))
abundance_cutoff_intermediate <- as.numeric(quantile(
prev_abund_dt$abundance_total_count,
probs = 0.1, na.rm = TRUE, type = 7
))
prevalence_cutoff_rare_i <- floor(prevalence_cutoff_rare)
prevalence_cutoff_intermediate_i <- floor(prevalence_cutoff_intermediate)
abundance_cutoff_rare_i <- floor(abundance_cutoff_rare)
abundance_cutoff_intermediate_i <- floor(abundance_cutoff_intermediate)
prev_abund_dt[, class := "Core"]
prev_abund_dt[
prevalence_n_samples <= prevalence_cutoff_intermediate_i &
abundance_total_count <= abundance_cutoff_intermediate_i,
class := "Intermediate"
]
prev_abund_dt[
prevalence_n_samples <= prevalence_cutoff_rare_i &
abundance_total_count <= abundance_cutoff_rare_i,
class := "Rare"
]
prev_abund_dt[, class := factor(class, levels = c("Rare", "Intermediate", "Core"))]
summary_sentence <- sprintf(
"Total ASVs: %d in %d samples. Rare: %d; Intermediate: %d; Core: %d",
ncol(seqtab_nochim),
nrow(seqtab_nochim),
sum(prev_abund_dt$class == "Rare"),
sum(prev_abund_dt$class == "Intermediate"),
sum(prev_abund_dt$class == "Core")
)
color_map <- c("Rare" = "grey60", "Intermediate" = "goldenrod2", "Core" = "dodgerblue3")
abundance_plot <- ggplot(
prev_abund_dt,
aes(
x = abundance_total_count,
y = prevalence_n_samples,
color = class
)
) +
geom_point() +
scale_color_manual(values = color_map) +
scale_x_log10(name = "Abundance (total count; log10)") +
scale_y_continuous(
name = "Prevalence (number of samples)",
breaks = seq(2, 12, by = 2),
limits = c(2, 12)
) +
labs(title = summary_sentence, color = NULL) +
theme_minimal() +
theme(
legend.position = "bottom",
legend.direction = "horizontal",
legend.justification = "center",
plot.title = element_text(hjust = 0.5)
) +
guides(color = guide_legend(nrow = 1))
pdf(file.path(base_dir, "06_asv_prevalence_abundance.pdf"), width = 6, height = 6)
print(abundance_plot)
dev.off()

Export
- Summary statistics: Exported key run-level and
sample-level processing summaries (e.g., reads retained across steps)
for QC documentation and reporting.
- ASV representative sequences (FASTA): Exported the
unique inferred ASV nucleotide sequences from seqtab into
asv_sequences.fasta.
asv_seqs <- colnames(seqtab)
asv_ids <- sprintf("ASV%03d", seq_along(asv_seqs))
xset <- DNAStringSet(asv_seqs)
names(xset) <- asv_ids
writeXStringSet(
xset,
filepath = file.path(base_dir, "asv_sequences.fasta"),
format = "fasta",
width = 80
)
xset
## DNAStringSet object of length 202:
## width seq names
## [1] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGTGGGTATCGAACAGG ASV001
## [2] 252 TACGGAGGATGCGAGCGTTATC...GAAAGCGTGGGTATCGAACAGG ASV002
## [3] 253 TACGGAGGATCCGAGCGTTATC...GAAAGTGTGGGTATCAAACAGG ASV003
## [4] 252 TACGGAGGATGCGAGCGTTATC...GAAAGTGCGGGGATCGAACAGG ASV004
## [5] 253 TACGGAGGATTCAAGCGTTATC...GAAAGCGTGGGGAGCAAACAGG ASV005
## ... ... ...
## [198] 252 TACGTAGGTGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV198
## [199] 252 TACGTAGGTGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV199
## [200] 253 TACGTAGGGAGCAAGCGTTATC...GAAAGTGTGGGGAGCAAACAGG ASV200
## [201] 253 TACGTAGGGGGCGAGCGTTGTC...GAAAGCGTGGGGAGCAAACAGG ASV201
## [202] 252 TACGTAGGGGGCGAGCGTTATC...GAAAGCGTGGGGAGCAAATAGG ASV202