[Part 2] Taxonomy assignment, analysis, and visualization.

1 Taxonomy assignment

  • This is Part 2 of the microbiome data analysis workshop, focusing on taxonomic classification and downstream analysis.
  • We will use the seqtab_nochim.rds file generated in Part 1; alternatively, you may download here: seqtab_nochim.rds.

1.1 Load R packages

library(phyloseq)
library(dplyr)
library(tidyr)
library(tibble)
library(htmlwidgets)
library(htmltools)
library(sankeyD3)

1.2 Import input file

if (!exists("seqtab_nochim")) { # If seqtab_nochim is not found (e.g., Part 1 session was lost)
        library(dada2)
        library(Biostrings)
        library(ggplot2)
        library(data.table)

        base_dir <- "/home/hub1"
        seqtab_nochim <- readRDS(file.path(base_dir, "seqtab_nochim.rds"))
}
dim(seqtab_nochim)
## [1]  12 196

1.3 DADA2’s taxonomy assignment

  • The DADA2 function assignTaxonomy() provides a built-in implementation of the naïve Bayesian classifier for taxonomic assignment.
  • DADA2-formatted training fasta file was downloaded from Taxonomic reference data
    • SILVA: Broad coverage across Bacteria, Archaea, and Eukaryota (SSU rRNA (16S/18S) and LSU rRNA)
    • RDP: rRNA sequences plus the RDP Classifier ecosystem (16S (Bacteria/Archaea) and fungal 28S)
    • GreenGenes2: Unifies 16S and genomic data in one reference tree (16S integrated with genome phylogeny)
# Set the file paths to your taxonomy reference files
tax_train_fasta <- "/home/dbs/silva_nr99_v138.2_toGenus_trainset.fa.gz"
tax_species_fasta <- "/home/dbs/silva_v138.2_assignSpecies.fa.gz"
rdp_train_fasta <- "/home/dbs/rdp_19_toGenus_trainset.fa.gz"
gg2_train_fasta <- "/home/dbs/gg2_2024_09_toGenus_trainset.fa.gz"
tax_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = tax_train_fasta
)
tax_mat <- addSpecies(tax_mat, refFasta = tax_species_fasta, allowMultiple = TRUE)
saveRDS(tax_mat, file.path(base_dir, "DADA2_taxonomy_matrix_SILVA.rds"))

to_preview <- tax_mat # Removing sequence rownames for display only
rownames(to_preview) <- NULL
head(to_preview)
##      Kingdom    Phylum         Class         Order           Family          
## [1,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [2,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
## [4,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Rikenellaceae" 
## [6,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
##      Genus         Species                  
## [1,] NA            NA                       
## [2,] NA            NA                       
## [3,] "Bacteroides" "acidifaciens/caecimuris"
## [4,] NA            NA                       
## [5,] "Alistipes"   NA                       
## [6,] NA            NA
rdp_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = rdp_train_fasta
)
saveRDS(rdp_mat, file.path(base_dir, "DADA2_taxonomy_matrix_RDP.rds"))

gg2_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = gg2_train_fasta
)
saveRDS(gg2_mat, base::file.path(base_dir, "DADA2_taxonomy_matrix_GG2.rds"))

2 Analysis

  • The phyloseq R package provides a robust framework for downstream microbiome analysis and visualization, and the scripts used here were adapted from the phyloseq tutorial.
  • Results and visualizations are reported based on SILVA for consistency; however, taxonomy assignments from the other databases may be loaded to reproduce the same workflow using the scripts provided below.

2.1 Format conversion

  • First, the DADA2 output object (i.e., seqtab_nochim) needs to be converted into a phyloseq object.
sample_ids <- rownames(seqtab_nochim)
subject_raw <- sapply(strsplit(sample_ids, "D"), `[`, 1)
sex <- substr(subject_raw, 1, 1)
day <- as.integer(sapply(strsplit(sample_ids, "D"), `[`, 2))

sample_df <- data.frame(
        Subject = sample_ids,
        Sex = factor(sex, levels = c("M", "F")),
        Day = day
)
sample_df$Group <- factor(ifelse(sample_df$Day > 100, "Late", "Early"),
        levels = c("Early", "Late"),
        labels = c("E", "L")
)
sample_df$Category <- factor(paste0(sample_df$Group, "_", sample_df$Sex),
        levels = c("E_M", "E_F", "L_M", "L_F")
)
rownames(sample_df) <- sample_ids

phylo_obj <- phyloseq(
        otu_table(seqtab_nochim, taxa_are_rows = FALSE),
        sample_data(sample_df),
        tax_table(tax_mat)
)

dna <- DNAStringSet(taxa_names(phylo_obj))
names(dna) <- taxa_names(phylo_obj)
phylo_obj <- merge_phyloseq(phylo_obj, dna)
taxa_names(phylo_obj) <- sprintf("ASV%03d", seq(ntaxa(phylo_obj)))
saveRDS(phylo_obj, file.path(base_dir, "DADA2_phyloseq_obj.rds"))
phylo_obj
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 196 taxa and 12 samples ]
## sample_data() Sample Data:       [ 12 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 196 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 196 reference sequences ]

2.2 Alpha diversity over time

  • Visualize within-sample diversity using Shannon and Simpson indices across sampling Day, with points colored by Group (Early/Late) and shaped by Sex (Male/Female).

2.2.1 Group highlighted by color

group_cols <- c("E" = "#009988", "L" = "#EE7733") # teal and orange
rich_plot <- plot_richness(phylo_obj,
        x = "Day", measures = c("Shannon", "Simpson"),
        color = "Group", shape = "Sex"
) +
        theme_bw() +
        scale_color_manual(values = group_cols) +
        geom_point(size = 3)
pdf(file.path(base_dir, "07_alpha_diversity_by_group.pdf"), width = 6, height = 6)
print(rich_plot)
dev.off()

2.2.2 Sex highlighted by color

sex_cols <- c("M" = "#0077BB", "F" = "#BB5566") # blue and red
rich_plot <- plot_richness(phylo_obj,
        x = "Day", measures = c("Shannon", "Simpson"),
        shape = "Group", color = "Sex"
) +
        theme_bw() +
        scale_color_manual(values = sex_cols) +
        geom_point(size = 3)
pdf(file.path(base_dir, "08_alpha_diversity_by_sex.pdf"), width = 6, height = 6)
print(rich_plot)
dev.off()

2.3 Beta diversity

  • Assess between-sample community differences using Bray–Curtis distances on relative-abundance data and visualize them with NMDS, with points colored by Group (Early/Late) and shaped by Sex (Male/Female)
ps_nz <- prune_samples(sample_sums(phylo_obj) > 0, phylo_obj)
otu <- as(otu_table(ps_nz), "matrix")
if (taxa_are_rows(ps_nz)) otu <- t(otu)
otu[!is.finite(otu)] <- 0
otu_table(ps_nz) <- otu_table(otu, taxa_are_rows = FALSE)

ps_prop <- transform_sample_counts(ps_nz, function(x) x / sum(x))
nmds_bray <- ordinate(ps_prop, method = "NMDS", distance = "bray", trace = 0)

ordi_plot <- plot_ordination(ps_prop, nmds_bray, color = "Group", shape = "Sex") +
        theme_bw() +
        scale_color_manual(values = group_cols)
pdf(file.path(base_dir, "09_beta_diversity.pdf"), width = 6, height = 6)
print(ordi_plot)
dev.off()

2.4 Taxonomic relative abundance

  • Plot the top 20 most abundant ASVs as relative abundance bar charts, aggregated and colored by Taxa, and faceted by Group (Early/Late) and Sex (Male/Female) across sampling days.
top <- names(sort(taxa_sums(phylo_obj), decreasing = TRUE))[1:20]

ps_top <- transform_sample_counts(phylo_obj, function(OTU) OTU / sum(OTU))
ps_top <- prune_taxa(top, ps_top)

df_plot <- psmelt(ps_top)
df_plot$Family <- as.character(df_plot$Family)
df_plot$Family[is.na(df_plot$Family) | df_plot$Family == ""] <- "Unassigned"

dt <- as.data.table(df_plot)
dt_sum <- dt[, .(Abundance = mean(Abundance, na.rm = TRUE)), by = .(Category, Family)]
dt_sum[, Abundance := Abundance / sum(Abundance), by = Category]

top20_taxa <- ggplot(dt_sum, aes(x = Category, y = Abundance, fill = Family)) +
        geom_col(color = NA) +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
        scale_fill_brewer(palette = "Set2", drop = FALSE) +
        theme_bw() +
        theme(
                legend.position = "bottom",
                axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1)
        ) +
        coord_flip()

pdf(file.path(base_dir, "10_top20_taxa_by_group.pdf"), width = 8, height = 5)
print(top20_taxa)
dev.off()

2.5 Taxonomic lineage profiles by category

  • Taxonomic composition is summarized as lineage-level flows, i.e., Sankey chart, from higher to lower ranks for each category, enabling rapid comparison of dominant clades and where group-specific differences emerge along the taxonomic hierarchy.
source("/home/rstudio/R/generate_sankey_plot.R")
meta_df <- data.frame(phyloseq::sample_data(phylo_obj), stringsAsFactors = FALSE)
meta_df$Sample <- rownames(meta_df)
tt_cols <- colnames(phyloseq::tax_table(phylo_obj))

# E_M (Early + Male), E_F (Early + Female), L_M (Late + Male), L_F (Late + Female)
cat_value <- "E_M"
        
dat <- build_sankey_data(ps = phylo_obj, meta_df = meta_df, category_value = cat_value)
title_text <- sprintf("%s(#Samples=%d, #Taxa=%d, #Reads=%d)", cat_value, dat$nsamples, dat$ntaxa, dat$total_reads)
w <- render_one_sankey(dat, title_text)

htmlwidgets::saveWidget(w, file = file.path(base_dir, paste0("sankey_", cat_value, ".html")))

3 Research reproducibility

3.1 Session info

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.2 (2025-10-31)
##  os       Ubuntu 24.04.3 LTS
##  system   aarch64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_CA.UTF-8
##  ctype    en_CA.UTF-8
##  tz       America/Edmonton
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────
## package      * version  date (UTC) lib source
## BiocGenerics * 0.54.1   2025-10-12 [1] Bioconductor 3.21 (R 4.5.2)
## Biostrings   * 2.76.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## dada2        * 1.36.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## data.table   * 1.18.2.1 2026-01-27 [1] RSPM (R 4.5.0)
## dplyr        * 1.2.0    2026-02-03 [1] RSPM (R 4.5.0)
## generics     * 0.1.4    2025-05-09 [1] RSPM (R 4.5.0)
## GenomeInfoDb * 1.44.3   2025-09-21 [1] Bioconductor 3.21 (R 4.5.2)
## ggplot2      * 4.0.2    2026-02-03 [1] RSPM (R 4.5.0)
## htmltools    * 0.5.9    2025-12-04 [1] RSPM (R 4.5.0)
## htmlwidgets  * 1.6.4    2023-12-06 [1] RSPM (R 4.5.0)
## IRanges      * 2.42.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## phyloseq     * 1.52.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## Rcpp         * 1.1.1    2026-01-10 [1] RSPM (R 4.5.0)
## S4Vectors    * 0.46.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## sankeyD3     * 0.3.2    2026-02-15 [1] Github (fbreitwieser/sankeyD3@fd50a74)
## tibble       * 3.3.1    2026-01-11 [1] RSPM (R 4.5.0)
## tidyr        * 1.3.2    2025-12-19 [1] RSPM (R 4.5.0)
## XVector      * 0.48.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
## 
## ────────────────────────────────────────────────────────────────────────────────────────────────