[Part 2] Taxonomy assignment, analysis, and visualization.

1 Taxonomy assignment

This is Part 2 of the microbiome data analysis workshop, focusing on taxonomic classification and downstream analysis.
We will use the seqtab_nochim.rds file generated in Part 1; alternatively, you may download here: seqtab_nochim.rds.

1.1 Load R packages

library(phyloseq)
library(dplyr)
library(tidyr)
library(tibble)
library(htmlwidgets)
library(htmltools)
library(sankeyD3)

1.2 Import input file

if (!exists("seqtab_nochim")) { # If seqtab_nochim is not found (e.g., Part 1 session was lost)
        library(dada2)
        library(Biostrings)
        library(ggplot2)
        library(data.table)

        base_dir <- "/home/hub1"
        seqtab_nochim <- readRDS(file.path(base_dir, "seqtab_nochim.rds"))
}
dim(seqtab_nochim)

## [1]  12 196

1.3 DADA2’s taxonomy assignment

The DADA2 function assignTaxonomy() provides a built-in implementation of the naïve Bayesian classifier for taxonomic assignment.
DADA2-formatted training fasta file was downloaded from Taxonomic reference data
- SILVA: Broad coverage across Bacteria, Archaea, and Eukaryota (SSU rRNA (16S/18S) and LSU rRNA)
- RDP: rRNA sequences plus the RDP Classifier ecosystem (16S (Bacteria/Archaea) and fungal 28S)
- GreenGenes2: Unifies 16S and genomic data in one reference tree (16S integrated with genome phylogeny)

# Set the file paths to your taxonomy reference files
tax_train_fasta <- "/home/dbs/silva_nr99_v138.2_toGenus_trainset.fa.gz"
tax_species_fasta <- "/home/dbs/silva_v138.2_assignSpecies.fa.gz"
rdp_train_fasta <- "/home/dbs/rdp_19_toGenus_trainset.fa.gz"
gg2_train_fasta <- "/home/dbs/gg2_2024_09_toGenus_trainset.fa.gz"

tax_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = tax_train_fasta
)
tax_mat <- addSpecies(tax_mat, refFasta = tax_species_fasta, allowMultiple = TRUE)
saveRDS(tax_mat, file.path(base_dir, "DADA2_taxonomy_matrix_SILVA.rds"))

to_preview <- tax_mat # Removing sequence rownames for display only
rownames(to_preview) <- NULL
head(to_preview)

##      Kingdom    Phylum         Class         Order           Family          
## [1,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [2,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
## [4,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Rikenellaceae" 
## [6,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
##      Genus         Species                  
## [1,] NA            NA                       
## [2,] NA            NA                       
## [3,] "Bacteroides" "acidifaciens/caecimuris"
## [4,] NA            NA                       
## [5,] "Alistipes"   NA                       
## [6,] NA            NA

rdp_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = rdp_train_fasta
)
saveRDS(rdp_mat, file.path(base_dir, "DADA2_taxonomy_matrix_RDP.rds"))

gg2_mat <- assignTaxonomy(
        seqtab_nochim,
        refFasta = gg2_train_fasta
)
saveRDS(gg2_mat, base::file.path(base_dir, "DADA2_taxonomy_matrix_GG2.rds"))

2 Analysis

The phyloseq R package provides a robust framework for downstream microbiome analysis and visualization, and the scripts used here were adapted from the phyloseq tutorial.
Results and visualizations are reported based on SILVA for consistency; however, taxonomy assignments from the other databases may be loaded to reproduce the same workflow using the scripts provided below.

2.1 Format conversion

First, the DADA2 output object (i.e., seqtab_nochim) needs to be converted into a phyloseq object.

sample_ids <- rownames(seqtab_nochim)
subject_raw <- sapply(strsplit(sample_ids, "D"), `[`, 1)
sex <- substr(subject_raw, 1, 1)
day <- as.integer(sapply(strsplit(sample_ids, "D"), `[`, 2))

sample_df <- data.frame(
        Subject = sample_ids,
        Sex = factor(sex, levels = c("M", "F")),
        Day = day
)
sample_df$Group <- factor(ifelse(sample_df$Day > 100, "Late", "Early"),
        levels = c("Early", "Late"),
        labels = c("E", "L")
)
sample_df$Category <- factor(paste0(sample_df$Group, "_", sample_df$Sex),
        levels = c("E_M", "E_F", "L_M", "L_F")
)
rownames(sample_df) <- sample_ids

phylo_obj <- phyloseq(
        otu_table(seqtab_nochim, taxa_are_rows = FALSE),
        sample_data(sample_df),
        tax_table(tax_mat)
)

dna <- DNAStringSet(taxa_names(phylo_obj))
names(dna) <- taxa_names(phylo_obj)
phylo_obj <- merge_phyloseq(phylo_obj, dna)
taxa_names(phylo_obj) <- sprintf("ASV%03d", seq(ntaxa(phylo_obj)))
saveRDS(phylo_obj, file.path(base_dir, "DADA2_phyloseq_obj.rds"))
phylo_obj

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 196 taxa and 12 samples ]
## sample_data() Sample Data:       [ 12 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 196 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 196 reference sequences ]

2.2 Alpha diversity over time

Visualize within-sample diversity using Shannon and Simpson indices across sampling Day, with points colored by Group (Early/Late) and shaped by Sex (Male/Female).

2.2.1 Group highlighted by color

group_cols <- c("E" = "#009988", "L" = "#EE7733") # teal and orange
rich_plot <- plot_richness(phylo_obj,
        x = "Day", measures = c("Shannon", "Simpson"),
        color = "Group", shape = "Sex"
) +
        theme_bw() +
        scale_color_manual(values = group_cols) +
        geom_point(size = 3)
pdf(file.path(base_dir, "07_alpha_diversity_by_group.pdf"), width = 6, height = 6)
print(rich_plot)
dev.off()

2.2.2 Sex highlighted by color

sex_cols <- c("M" = "#0077BB", "F" = "#BB5566") # blue and red
rich_plot <- plot_richness(phylo_obj,
        x = "Day", measures = c("Shannon", "Simpson"),
        shape = "Group", color = "Sex"
) +
        theme_bw() +
        scale_color_manual(values = sex_cols) +
        geom_point(size = 3)
pdf(file.path(base_dir, "08_alpha_diversity_by_sex.pdf"), width = 6, height = 6)
print(rich_plot)
dev.off()

2.3 Beta diversity

Assess between-sample community differences using Bray–Curtis distances on relative-abundance data and visualize them with NMDS, with points colored by Group (Early/Late) and shaped by Sex (Male/Female)

ps_nz <- prune_samples(sample_sums(phylo_obj) > 0, phylo_obj)
otu <- as(otu_table(ps_nz), "matrix")
if (taxa_are_rows(ps_nz)) otu <- t(otu)
otu[!is.finite(otu)] <- 0
otu_table(ps_nz) <- otu_table(otu, taxa_are_rows = FALSE)

ps_prop <- transform_sample_counts(ps_nz, function(x) x / sum(x))
nmds_bray <- ordinate(ps_prop, method = "NMDS", distance = "bray", trace = 0)

ordi_plot <- plot_ordination(ps_prop, nmds_bray, color = "Group", shape = "Sex") +
        theme_bw() +
        scale_color_manual(values = group_cols)
pdf(file.path(base_dir, "09_beta_diversity.pdf"), width = 6, height = 6)
print(ordi_plot)
dev.off()

2.4 Taxonomic relative abundance

Plot the top 20 most abundant ASVs as relative abundance bar charts, aggregated and colored by Taxa, and faceted by Group (Early/Late) and Sex (Male/Female) across sampling days.

top <- names(sort(taxa_sums(phylo_obj), decreasing = TRUE))[1:20]

ps_top <- transform_sample_counts(phylo_obj, function(OTU) OTU / sum(OTU))
ps_top <- prune_taxa(top, ps_top)

df_plot <- psmelt(ps_top)
df_plot$Family <- as.character(df_plot$Family)
df_plot$Family[is.na(df_plot$Family) | df_plot$Family == ""] <- "Unassigned"

dt <- as.data.table(df_plot)
dt_sum <- dt[, .(Abundance = mean(Abundance, na.rm = TRUE)), by = .(Category, Family)]
dt_sum[, Abundance := Abundance / sum(Abundance), by = Category]

top20_taxa <- ggplot(dt_sum, aes(x = Category, y = Abundance, fill = Family)) +
        geom_col(color = NA) +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
        scale_fill_brewer(palette = "Set2", drop = FALSE) +
        theme_bw() +
        theme(
                legend.position = "bottom",
                axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1)
        ) +
        coord_flip()

pdf(file.path(base_dir, "10_top20_taxa_by_group.pdf"), width = 8, height = 5)
print(top20_taxa)
dev.off()

2.5 Taxonomic lineage profiles by category

Taxonomic composition is summarized as lineage-level flows, i.e., Sankey chart, from higher to lower ranks for each category, enabling rapid comparison of dominant clades and where group-specific differences emerge along the taxonomic hierarchy.

source("/home/rstudio/R/generate_sankey_plot.R")
meta_df <- data.frame(phyloseq::sample_data(phylo_obj), stringsAsFactors = FALSE)
meta_df$Sample <- rownames(meta_df)
tt_cols <- colnames(phyloseq::tax_table(phylo_obj))

# E_M (Early + Male), E_F (Early + Female), L_M (Late + Male), L_F (Late + Female)
cat_value <- "E_M"
        
dat <- build_sankey_data(ps = phylo_obj, meta_df = meta_df, category_value = cat_value)
title_text <- sprintf("%s(#Samples=%d, #Taxa=%d, #Reads=%d)", cat_value, dat$nsamples, dat$ntaxa, dat$total_reads)
w <- render_one_sankey(dat, title_text)

htmlwidgets::saveWidget(w, file = file.path(base_dir, paste0("sankey_", cat_value, ".html")))

3 Research reproducibility

3.1 Session info

## ─ Session info ─────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.2 (2025-10-31)
##  os       Ubuntu 24.04.3 LTS
##  system   aarch64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_CA.UTF-8
##  ctype    en_CA.UTF-8
##  tz       America/Edmonton
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────
## package      * version  date (UTC) lib source
## BiocGenerics * 0.54.1   2025-10-12 [1] Bioconductor 3.21 (R 4.5.2)
## Biostrings   * 2.76.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## dada2        * 1.36.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## data.table   * 1.18.2.1 2026-01-27 [1] RSPM (R 4.5.0)
## dplyr        * 1.2.0    2026-02-03 [1] RSPM (R 4.5.0)
## generics     * 0.1.4    2025-05-09 [1] RSPM (R 4.5.0)
## GenomeInfoDb * 1.44.3   2025-09-21 [1] Bioconductor 3.21 (R 4.5.2)
## ggplot2      * 4.0.2    2026-02-03 [1] RSPM (R 4.5.0)
## htmltools    * 0.5.9    2025-12-04 [1] RSPM (R 4.5.0)
## htmlwidgets  * 1.6.4    2023-12-06 [1] RSPM (R 4.5.0)
## IRanges      * 2.42.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## phyloseq     * 1.52.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## Rcpp         * 1.1.1    2026-01-10 [1] RSPM (R 4.5.0)
## S4Vectors    * 0.46.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
## sankeyD3     * 0.3.2    2026-02-15 [1] Github (fbreitwieser/sankeyD3@fd50a74)
## tibble       * 3.3.1    2026-01-11 [1] RSPM (R 4.5.0)
## tidyr        * 1.3.2    2025-12-19 [1] RSPM (R 4.5.0)
## XVector      * 0.48.0   2025-04-15 [1] Bioconductor 3.21 (R 4.5.2)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
## 
## ────────────────────────────────────────────────────────────────────────────────────────────────

HUB (Hands-on Understanding of Bioinformatics) Workshop Microbiome Data Analysis by Bioinformatics Hub

[Part 2] Taxonomy assignment, analysis, and visualization.

1 Taxonomy assignment

1.1 Load R packages

1.2 Import input file

1.3 DADA2’s taxonomy assignment

2 Analysis

2.1 Format conversion

2.2 Alpha diversity over time

2.2.1 Group highlighted by color

2.2.2 Sex highlighted by color

2.3 Beta diversity

2.4 Taxonomic relative abundance

2.5 Taxonomic lineage profiles by category

3 Research reproducibility

3.1 Session info

HUB (Hands-on Understanding of Bioinformatics) Workshop
Microbiome Data Analysis
by Bioinformatics Hub