Western diet drives gut resistome expansion in mice: a shotgun metagenomics analysis
{group}_{animal}_{type} — three fields separated by underscores
SD (Standard Diet) or WD (Western Diet)01, 15, 26)C: cecum, collected post-mortem at day 49S0: stool, collected non-invasively at day 0 (baseline, before diet intervention)S49: stool, collected non-invasively at day 49 (end-point, after 7 weeks on diet)SD_15_S49 — Standard Diet, animal 15, stool at day 49prefetch + fasterq-dump--very-sensitive) against GRCm38; only concordantly unmapped read pairs retained--min-contig-len 200); contigs < 500 bp discarded--minid 80 --mincov 90, matching Keskey et al. 2025); hits with alignment length < 100 bp discardedjgi_summarize_bam_contig_depths-f 12 -F 256 to extract both-unmapped read pairs--minid 80 --mincov 90#!/bin/bash
BASE_DIR="WORKING_DIRECTORY"
FASTQ_DIR="$BASE_DIR/fastq"
mkdir -p "$FASTQ_DIR"
# Example for a single accession; repeat for all SRR IDs
SRR="SRR12345678"
prefetch "$SRR" --output-directory "$FASTQ_DIR"
fasterq-dump "$FASTQ_DIR/${SRR}/${SRR}.sra" \
--split-files \
--outdir "$FASTQ_DIR" \
--temp "$FASTQ_DIR" \
--threads 8
gzip "$FASTQ_DIR/${SRR}_1.fastq" "$FASTQ_DIR/${SRR}_2.fastq"
#!/usr/bin/env bash
# The following script is a template that was used to parallelize jobs in an HPC environment
bowtie2 -x "$BT2_INDEX" \
-1 "$SCRATCH/${SAMPLE_ID}_R1.fastq.gz" \
-2 "$SCRATCH/${SAMPLE_ID}_R2.fastq.gz" \
--very-sensitive -p "$THREADS" \
2> "$BASE_DIR/logs/host_rm/${SAMPLE_ID}_bt2.log" \
| samtools view -f 12 -F 256 -b \
| samtools sort -n -@ 4 -T "$SCRATCH/${SAMPLE_ID}" \
| samtools fastq \
-1 "$BASE_DIR/host_rm/${SAMPLE_ID}_R1.fastq.gz" \
-2 "$BASE_DIR/host_rm/${SAMPLE_ID}_R2.fastq.gz" \
-0 /dev/null -s /dev/null -@ 4
#!/usr/bin/env bash
# The following script is a template that was used to parallelize jobs in an HPC environment
kraken2 \
--db "$DB" \
--paired --gzip-compressed \
--threads "$THREADS" \
--confidence 0.1 \
--output "$BASE_DIR/kraken2/${SAMPLE_ID}.kraken2" \
--report "$BASE_DIR/kraken2/${SAMPLE_ID}.kreport" \
--report-minimizer-data \
"$BASE_DIR/trimmed/${SAMPLE_ID}_R1.fastq.gz" \
"$BASE_DIR/trimmed/${SAMPLE_ID}_R2.fastq.gz" \
2> "$BASE_DIR/logs/kraken2/${SAMPLE_ID}.log"
bracken \
-d "$DB" \
-i "$BASE_DIR/kraken2/${SAMPLE_ID}.kreport" \
-o "$BASE_DIR/bracken/${SAMPLE_ID}.bracken" \
-r 150 \
-l S \
-t 10
#!/usr/bin/env bash
# The following script is a template that was used to parallelize jobs in an HPC environment
# Step 1: Assembly
megahit \
-1 "$BASE_DIR/trimmed/${SAMPLE_ID}_R1.fastq.gz" \
-2 "$BASE_DIR/trimmed/${SAMPLE_ID}_R2.fastq.gz" \
--out-dir "$SCRATCH/${SAMPLE_ID}_megahit" \
--out-prefix "$SAMPLE_ID" \
--min-contig-len 200 \
--num-cpu-threads "$THREADS" \
--memory 0.85
mkdir -p "$BASE_DIR/assembly/${SAMPLE_ID}"
cp "$SCRATCH/${SAMPLE_ID}_megahit/${SAMPLE_ID}.contigs.fa" \
"$BASE_DIR/assembly/${SAMPLE_ID}/final.contigs.fa"
# Step 2: Filter contigs >= 500 bp
awk '/^>/ { if (seq != "" && length(seq) >= 500) print header"\n"seq;
header = $0; seq = ""; next }
{ seq = seq $0 }
END { if (seq != "" && length(seq) >= 500) print header"\n"seq }' \
"$BASE_DIR/assembly/${SAMPLE_ID}/final.contigs.fa" \
> "$BASE_DIR/assembly/${SAMPLE_ID}/contigs_500bp.fa"
--minid 80 --mincov 90); hits with alignment length < 100 bp discarded#!/usr/bin/env bash
# The following script is a template that was used to parallelize jobs in an HPC environment
# Step 3: ARG detection (Abricate v12 against CARD)
abricate \
--db card \
--minid 80 \
--mincov 90 \
--threads "$THREADS" \
"$BASE_DIR/assembly/${SAMPLE_ID}/contigs_500bp.fa" \
> "$BASE_DIR/abricate/${SAMPLE_ID}.tsv"
# Do not run — set this to your local project folder before re-running
base_dir <- "WORKING_DIRECTORY"
setwd(base_dir)
library(ANCOMBC)
library(coin)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(jsonlite)
library(plotly)
library(purrr)
library(readr)
library(scales)
library(tibble)
library(tidyr)
library(TreeSummarizedExperiment)
library(vegan)
format_pval <- function(p) {
if (is.na(p)) return("p = NA")
if (p < 0.001) return("p < 0.001")
paste0("p = ", formatC(p, format = "f", digits = 3))
}
theme_hub <- function(base_size = 11) {
theme_bw(base_size = base_size) +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "grey95", colour = NA),
strip.text = element_text(face = "bold")
)
}
Summary: All 24 samples passed per-sample fastp processing. Two samples — SD_01_C (host fraction > P95) and WD_26_C (trimmed depth < P5) — were flagged as QC outliers and excluded from differential abundance analyses. All other samples were retained.
WD_26_C was excluded due to low sequencing depth after host read removal and quality trimming, falling below the threshold required for diversity and differential abundance analysesSD_01_C was excluded due to an abnormally high host (mouse) read fraction, indicating the sample was dominated by GRCm38-mapping reads rather than microbial reads, leaving insufficient metagenomic depth for reliable community profilingqc <- readRDS(file.path(base_dir, "data", "qc_metrics.rds"))
head(qc)
## # A tibble: 6 × 14
## sample_id group sample_type trimmed_depth_pairs pass_rate_pct dup_rate_pct
## <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 SD_01_C SD Cecum 18050994 92.8 1.43
## 2 SD_02_C SD Cecum 36899802 96.8 1.58
## 3 SD_03_C SD Cecum 38152851 97.7 1.57
## 4 SD_04_C SD Cecum 38642357 97.5 1.59
## 5 SD_08_S0 SD Stool_0 39128630 97.6 1.61
## 6 SD_09_S0 SD Stool_0 27373568 96.9 0.829
## # ℹ 8 more variables: host_frac_pct <dbl>, kraken2_rate_pct <dbl>,
## # flag_depth <lgl>, flag_pass <lgl>, flag_dup <lgl>, flag_host <lgl>,
## # flag_kraken <lgl>, flagged <lgl>
Summary: Bracken output from 24 samples (28 files including multi-run pairs) was merged into a species × sample count matrix of estimated read counts at the species level.
.bracken files parsed and read counts aggregated; multi-run samples (SD_15_S49, SD_16_S49, WD_13_S0, WD_14_S0) summed across runs per taxon before matrix constructionbc <- readRDS(file.path(base_dir, "data", "bracken_counts.rds"))
counts_mat <- bc$counts
meta <- bc$metadata
head(counts_mat[, 1:5])
## SD_01_C SD_02_C SD_03_C SD_04_C SD_08_S0
## Treponema phagedenis 26 64 37 0 43
## Campylobacter jejuni 4041 491 4113 493 0
## Proteus mirabilis 1478 397 0 0 1705
## Bacteroides fragilis 4540 16826 12597 14905 10652
## Bacteroides thetaiotaomicron 21245 28437 16924 17357 30483
## Bacteroides uniformis 66122 131599 174168 169067 175557
Summary: WD mice showed significantly lower alpha diversity and near-complete community separation from SD mice along the first beta diversity axis. The reduction was driven primarily by loss of evenness rather than species elimination — Shannon, Simpson, and inverse Simpson were all significant (p < 0.001) while observed richness and Chao1 were not (p = 0.057), indicating that WD allowed a small number of taxa to dominate rather than eliminating species outright. SD_01_C and WD_26_C are shown with black labels but included in the diversity figures; all 24 samples are retained here as diversity serves as a QC step.
vegan::rrarefy() before alpha diversity calculation to equalise sequencing effortexclude_samples <- c() # diversity uses all samples
qc_excluded <- c("SD_01_C", "WD_26_C")
div <- readRDS(file.path(base_dir, "data", "diversity.rds"))
alpha_tbl <- div$alpha
alpha_tests <- div$alpha_tests
pcoa_tbl <- div$pcoa
perm <- div$permanova
head(alpha_tbl)
## # A tibble: 6 × 7
## sample_id richness shannon simpson invsimpson chao1 group
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 SD_01_C 296 3.97 0.967 30.1 296 SD
## 2 SD_02_C 335 3.64 0.947 18.7 335 SD
## 3 SD_03_C 335 3.81 0.961 25.7 335 SD
## 4 SD_04_C 331 3.84 0.962 26.1 331 SD
## 5 SD_08_S0 378 3.70 0.950 20.0 378 SD
## 6 SD_09_S0 340 3.54 0.935 15.4 340 SD
metric_order <- c("richness", "chao1", "shannon", "simpson", "invsimpson")
alpha_long <- alpha_tbl |>
pivot_longer(c(richness, shannon, simpson, invsimpson, chao1),
names_to = "metric", values_to = "value") |>
mutate(
group = factor(group, levels = group_order),
metric = factor(metric, levels = metric_order)
)
alpha_labels_red <- alpha_long |> filter(sample_id %in% exclude_qc)
pval_labels <- alpha_long |>
group_by(metric) |>
summarise(y_pos = max(value) * 1.05, .groups = "drop") |>
left_join(
alpha_tests |>
mutate(label = ifelse(p_value < 0.001, "p < 0.001",
paste0("p = ", formatC(p_value, digits = 3, format = "f")))),
by = "metric"
) |>
mutate(group = "WD", metric = factor(metric, levels = metric_order))
pos_jitter <- position_jitter(width = 0.15, seed = 42)
ggplot(alpha_long, aes(x = group, y = value, colour = group)) +
geom_boxplot(outlier.shape = NA, width = 0.5) +
geom_jitter(position = pos_jitter, size = 2, alpha = 0.7) +
geom_text_repel(
data = alpha_labels_red,
aes(label = sample_id),
colour = "black",
position = pos_jitter,
size = 4,
min.segment.length = 0.2,
box.padding = 0.3,
show.legend = FALSE
) +
geom_text(data = pval_labels, aes(x = 1.5, y = y_pos, label = label),
inherit.aes = FALSE, size = 5, colour = "black") +
scale_colour_manual(values = group_cols) +
facet_wrap(~metric, scales = "free_y", nrow = 1) +
labs(x = NULL, y = "Value") +
theme_hub(12) +
theme(legend.position = "none", strip.text = element_text(size = 14))
cmdscale(); diet-group separation confirmed with PERMANOVA (999 permutations)perm_label <- sprintf(
"PERMANOVA: R² = %.3f, p = %.3f",
perm$R2[1], perm$`Pr(>F)`[1]
)
p_pcoa <- ggplot(
pcoa_tbl |>
mutate(
group = factor(group, levels = group_order),
sample_type = factor(sub("^[A-Z]+_[0-9]+_", "", sample_id),
levels = c("C", "S0", "S49"))
),
aes(x = PC1, y = PC2, colour = group, shape = sample_type, text = sample_id)
) +
geom_point(size = 3, alpha = 0.8) +
scale_colour_manual(values = group_cols) +
scale_shape_manual(values = c("C" = 16, "S0" = 17, "S49" = 15),
labels = c("C" = "Cecum", "S0" = "Stool day 0", "S49" = "Stool day 49"),
name = "Sample type") +
annotate("text", x = Inf, y = -Inf, label = perm_label,
hjust = 1.05, vjust = -0.5, size = 3, colour = "black") +
labs(colour = "Group") +
theme_hub(12) +
theme(legend.text = element_text(size = 11))
ggplotly(p_pcoa, tooltip = c("text", "colour"))
Summary: Mean relative abundance of the top 20 species across the six sample-type × diet-group strata, illustrating community composition shifts driven by Western Diet.
counts_obj <- readRDS(file.path(base_dir, "data", "bracken_counts.rds"))
counts_mat <- counts_obj$counts
meta_taxa <- counts_obj$metadata |>
filter(group %in% group_order) |>
mutate(
sample_type = sub(".*_", "", sample_id),
strata = factor(
paste0(sample_type, " — ", group),
levels = c("C — SD", "C — WD", "S0 — SD", "S0 — WD", "S49 — SD", "S49 — WD")
)
)
rel_mat <- sweep(counts_mat, 2, colSums(counts_mat), "/")
top20_names <- names(sort(rowMeans(rel_mat), decreasing = TRUE)[1:20])
taxa_long <- rel_mat[, meta_taxa$sample_id, drop = FALSE] |>
as.data.frame() |>
rownames_to_column("taxon") |>
mutate(taxon_label = if_else(taxon %in% top20_names, taxon, "Other")) |>
pivot_longer(-c(taxon, taxon_label), names_to = "sample_id", values_to = "rel_abund") |>
group_by(taxon_label, sample_id) |>
summarise(rel_abund = sum(rel_abund), .groups = "drop") |>
left_join(meta_taxa |> select(sample_id, strata), by = "sample_id") |>
group_by(taxon_label, strata) |>
summarise(mean_abund = mean(rel_abund), .groups = "drop") |>
mutate(taxon_label = factor(taxon_label, levels = c("Other", rev(top20_names))))
taxa_colors <- c(Other = "grey85", setNames(hue_pal()(20), top20_names))
ggplot(taxa_long, aes(x = strata, y = mean_abund, fill = taxon_label)) +
geom_col(width = 0.8) +
coord_flip() +
scale_fill_manual(values = taxa_colors, name = NULL) +
scale_y_continuous(labels = label_percent()) +
labs(x = NULL, y = "Mean relative abundance") +
guides(fill = guide_legend(ncol = 3, reverse = TRUE)) +
theme_hub(12) +
theme(legend.position = "bottom", legend.text = element_text(size = 11, face = "italic"))
Summary: ANCOM-BC2 identified two taxa meeting both the significance threshold (q < 0.05, Holm-Bonferroni) and the effect size threshold (|LFC| > 2): Lactobacillus intestinalis was depleted in WD mice (LFC = −3.70, q = 0.0058) and Bacteroides cellulosilyticus was enriched in WD mice (LFC = +2.52, q = 0.023). Samples SD_01_C and WD_26_C excluded.
fix_formula = "group" — diet group (SD is the reference level) as the sole fixed effect; positive LFC indicates enrichment in WDp_adj_method = "holm" — Holm-Bonferroni step-down correction; more conservative than Benjamini-Hochberg, appropriate for a single-contrast exploratory testprv_cut = 0.10 — taxa present in fewer than 10% of samples removed before testinglib_cut = 1000 — samples with fewer than 1,000 classified reads excluded from the modelstruc_zero = TRUE — detects taxa with structural zeros in one group and handles them separatelycounts <- bc$counts
meta <- bc$metadata |>
mutate(group = factor(group, levels = c("SD", "WD")))
counts <- counts[, !colnames(counts) %in% exclude_qc, drop = FALSE]
meta <- meta |> filter(!sample_id %in% exclude_qc)
shared <- intersect(colnames(counts), meta$sample_id)
counts_filt <- counts[, shared]
meta_filt <- meta |> filter(sample_id %in% shared) |> as.data.frame()
rownames(meta_filt) <- meta_filt$sample_id
tse <- TreeSummarizedExperiment(
assays = list(counts = counts_filt),
colData = meta_filt[shared, ]
)
ancom_out <- ancombc2(
data = tse,
assay_name = "counts",
fix_formula = "group",
p_adj_method = "holm",
prv_cut = 0.10,
lib_cut = 1000,
s0_perc = 0.05,
group = "group",
struc_zero = TRUE,
neg_lb = TRUE,
alpha = p_threshold,
n_cl = 1,
verbose = TRUE
)
res <- ancom_out$res |>
dplyr::rename(
lfc = lfc_groupWD,
se = se_groupWD,
pval = p_groupWD,
qval = q_groupWD,
diff = diff_groupWD
) |>
dplyr::arrange(qval)
saveRDS(list(results = res, ancombc2 = ancom_out),
file.path(base_dir, "data", "diff_abundance.rds"))
da <- readRDS(file.path(base_dir, "data", "diff_abundance.rds"))
res <- da$results
head(res |> select(taxon, lfc, se, pval, qval))
## taxon lfc se pval qval
## 1 Lactobacillus intestinalis -3.696999 0.6485682 1.706019e-05 0.005834585
## 2 Bacteroides cellulosilyticus 2.521534 0.5040680 6.835704e-05 0.023309751
## 3 Bacteroides maternus 1.971470 0.4301382 1.801923e-04 0.061265390
## 4 Bacteroides pyogenes -3.325667 0.7368475 2.120618e-04 0.071888966
## 5 Phocaeicola vulgatus 2.295634 0.5256901 2.984055e-04 0.100861063
## 6 Paramuribaculum intestinale -2.855719 0.6572470 3.140792e-04 0.105844674
volcano_dat <- res |>
mutate(
neg_log10_q = -log10(qval),
direction = case_when(
diff & abs(lfc) > 2 & lfc > 0 ~ "WD enriched",
diff & abs(lfc) > 2 & lfc < 0 ~ "SD enriched",
TRUE ~ "n.s."
)
)
ggplot(volcano_dat, aes(x = lfc, y = neg_log10_q, colour = direction)) +
geom_point(alpha = 0.7, size = 2) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed",
colour = "grey50", linewidth = 0.4) +
geom_vline(xintercept = c(-2, 2), linetype = "dashed",
colour = "grey50", linewidth = 0.4) +
geom_text_repel(
data = filter(volcano_dat, diff, abs(lfc) > 2),
aes(label = taxon),
size = 3.5, fontface = "italic", box.padding = 0.4,
show.legend = FALSE
) +
scale_colour_manual(
values = c("WD enriched" = "#3182BD", "SD enriched" = "#999999", "n.s." = "grey80"),
name = NULL
) +
labs(x = "Log fold change (WD vs. SD)", y = expression(-log[10](q))) +
theme_hub(12) +
theme(legend.position = "right", legend.text = element_text(size = 11))
Summary: Abricate v12 (CARD database, --minid 80 --mincov 90, alignment length ≥ 100 bp) detected ARGs across assembled contigs in all 24 samples. ARG proportions are expressed as the number of contigs carrying each ARG divided by the total assembled contigs ≥ 500 bp × 100, to account for differences in assembly size. Samples SD_01_C and WD_26_C excluded.
APH(3')-IIIa, APH(3')-VIIa — aminoglycoside phosphotransferases (aminoglycoside resistance)tet(Q), tet(O), tet(X), tet(W) — tetracycline resistance genes (ribosomal protection or enzymatic inactivation)vanD, vanH_gene_in_vanD_cluster, vanR_gene_in_vanD_cluster, vanS_gene_in_vanD_cluster, vanX_gene_in_vanD_cluster — vancomycin resistance cluster (VanD type)ErmG — erythromycin ribosome methylase (macrolide-lincosamide-streptogramin B resistance)CfxA2 — cefoxitin/cephalosporin hydrolase (beta-lactam resistance)mel — macrolide efflux pumparg_obj <- readRDS(file.path(base_dir, "data", "arg_counts.rds"))
prop_mat <- arg_obj$proportions
meta_arg <- arg_obj$metadata
head(prop_mat[, 1:5])
## SD_02_C SD_03_C SD_04_C SD_08_S0 SD_09_S0
## APH(3')-IIIa 0.0003277238 0.0000000000 0.0000000000 0.0000000000 0.0003750713
## APH(3')-VIIa 0.0003277238 0.0000000000 0.0003567135 0.0000000000 0.0000000000
## CblA-1 0.0003277238 0.0003515223 0.0003567135 0.0003175531 0.0003750713
## SAT-4 0.0003277238 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## mel 0.0003277238 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## tet(32) 0.0003277238 0.0007030445 0.0010701406 0.0006351061 0.0003750713
top15_names <- names(sort(rowMeans(prop_mat), decreasing = TRUE)[1:15])
prev_abund <- data.frame(
gene = rownames(prop_mat),
prevalence = rowMeans(prop_mat > 0),
mean_prop = rowMeans(prop_mat),
stringsAsFactors = FALSE
) |>
mutate(top15 = gene %in% top15_names)
ggplot(prev_abund, aes(x = mean_prop, y = prevalence)) +
geom_point(aes(colour = top15), size = 1.8, alpha = 0.7) +
geom_text_repel(
data = filter(prev_abund, top15),
aes(label = gene),
size = 3.5, box.padding = 0.4,
show.legend = FALSE
) +
scale_x_log10(labels = label_number(scale_cut = cut_short_scale())) +
scale_y_continuous(labels = label_percent(), limits = c(0, 1)) +
scale_colour_manual(
values = c("FALSE" = "grey60", "TRUE" = "#3182BD"),
labels = c("Other", "Top 15"), name = NULL
) +
labs(x = "Mean proportion (%, log scale)", y = "Prevalence (proportion of samples)") +
theme_hub(12) +
theme(legend.position = "bottom", legend.text = element_text(size = 11))
strip_n <- meta_arg |>
dplyr::count(sample_type, group, name = "n_samples") |>
mutate(
facet_label = paste0(sample_type, ", ", group, " (N = ", n_samples, ")"),
type_rank = match(sample_type, c("C", "S0", "S49")),
group_rank = match(group, group_order)
) |>
arrange(type_rank, group_rank)
level_order <- strip_n$facet_label
top15_long <- prop_mat |>
as.data.frame() |>
rownames_to_column("gene") |>
mutate(gene_label = if_else(gene %in% top15_names, gene, "Other")) |>
pivot_longer(-c(gene, gene_label), names_to = "sample_id", values_to = "proportion") |>
group_by(gene_label, sample_id) |>
summarise(proportion = sum(proportion), .groups = "drop") |>
left_join(meta_arg |> select(sample_id, group, sample_type), by = "sample_id") |>
left_join(strip_n |> select(sample_type, group, facet_label), by = c("sample_type", "group")) |>
mutate(
gene_label = factor(gene_label, levels = c("Other", top15_names)),
sample_id = factor(sample_id, levels = meta_arg$sample_id),
facet_label = factor(facet_label, levels = level_order)
)
gene_colors <- c(setNames(hue_pal()(15), top15_names), Other = "grey85")
ggplot(top15_long, aes(x = sample_id, y = proportion, fill = gene_label)) +
geom_col(width = 0.9) +
scale_fill_manual(values = gene_colors, name = NULL) +
scale_y_continuous(expand = c(0, 0)) +
facet_wrap(~ facet_label, nrow = 2, ncol = 3, scales = "free_x") +
labs(x = NULL, y = "ARG proportion (% of total contigs)") +
guides(fill = guide_legend(reverse = TRUE)) +
theme_hub(12) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 7),
panel.grid = element_blank(),
legend.position = "bottom",
legend.key.size = unit(0.4, "cm"),
legend.text = element_text(size = 11),
strip.text = element_text(size = 14)
)
Summary: No ARG reached FDR q < 0.05 in any sample type under White’s non-parametric t-test with 9,999 permutations and Benjamini-Hochberg correction. All major ARGs reported in Keskey et al. 2025 — CfxA2, ErmG, tet(Q), tet(X), tet(O) — were directionally enriched in WD samples at stool day 49 (minimum q = 0.161). The lack of formal significance is consistent with reduced per-group sample size relative to the original study (cecum: 5 WD vs 3 SD; stool day 49: 3 WD vs 4 SD). Stool day 0 showed no signal (minimum q = 0.951), confirming that the WD-directional enrichment at end-point is diet-driven.
coin::oneway_test(..., distribution = approximate(nresample = 9999))set.seed(42) for reproducibilitymin_prev = 0.20)set.seed(42)
shared_build <- intersect(colnames(prop_mat), meta_arg$sample_id)
prop_mat_build <- prop_mat[, shared_build, drop = FALSE]
meta_build <- meta_arg |> filter(sample_id %in% shared_build) |>
arrange(match(sample_id, shared_build))
sample_types_build <- unique(meta_build$sample_type)
results_all <- purrr::map_dfr(sample_types_build, function(stype) {
idx <- which(meta_build$sample_type == stype)
mat_sub <- prop_mat_build[, idx, drop = FALSE]
meta_sub <- meta_build[idx, ]
sd_idx <- which(meta_sub$group == "SD")
wd_idx <- which(meta_sub$group == "WD")
prev_sd <- rowMeans(mat_sub[, sd_idx, drop = FALSE] > 0)
prev_wd <- rowMeans(mat_sub[, wd_idx, drop = FALSE] > 0)
keep <- (prev_sd >= min_prev) | (prev_wd >= min_prev)
mat_filt <- mat_sub[keep, , drop = FALSE]
purrr::map_dfr(rownames(mat_filt), function(gene) {
sd_vals <- mat_filt[gene, sd_idx]
wd_vals <- mat_filt[gene, wd_idx]
wt <- coin::oneway_test(
value ~ group,
data = data.frame(
value = c(wd_vals, sd_vals),
group = factor(c(rep("WD", length(wd_vals)),
rep("SD", length(sd_vals))),
levels = c("WD", "SD"))
),
distribution = coin::approximate(nresample = 9999)
)
tibble::tibble(
sample_type = stype,
gene = gene,
mean_WD = mean(wd_vals),
mean_SD = mean(sd_vals),
p_value = as.numeric(coin::pvalue(wt))
)
}) |>
mutate(
p_adj = p.adjust(p_value, method = "BH"),
significant = p_adj < p_threshold,
direction = dplyr::if_else(mean_WD > mean_SD, "WD > SD", "SD > WD")
) |>
arrange(p_adj)
})
saveRDS(list(results = results_all),
file.path(base_dir, "data", "arg_diff.rds"))
arg_diff_obj <- readRDS(file.path(base_dir, "data", "arg_diff.rds"))
results_all <- arg_diff_obj$results
head(results_all)
## # A tibble: 6 × 8
## sample_type gene mean_WD mean_SD p_value p_adj significant direction
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
## 1 Cecum CblA-1 6.53e-4 3.45e-4 0.0169 0.159 FALSE WD > SD
## 2 Cecum tet(40) 6.53e-4 3.45e-4 0.0171 0.159 FALSE WD > SD
## 3 Cecum tet(X) 6.53e-4 3.45e-4 0.0184 0.159 FALSE WD > SD
## 4 Cecum APH(3')-VIIa 0 2.28e-4 0.105 0.407 FALSE SD > WD
## 5 Cecum tet(O) 1.18e-3 6.91e-4 0.120 0.407 FALSE WD > SD
## 6 Cecum vanR_gene_in_… 5.41e-4 0 0.125 0.407 FALSE WD > SD
pseudo <- 1e-6
sample_type_levels <- c("Cecum", "Stool_0", "Stool_49")
volcano_df <- results_all |>
mutate(
log2fc = log2((mean_WD + pseudo) / (mean_SD + pseudo)),
neg_log10_p = -log10(p_adj),
sample_type = factor(sample_type, levels = sample_type_levels)
)
ggplot(volcano_df, aes(x = log2fc, y = neg_log10_p, colour = direction)) +
geom_point(size = 2, alpha = 0.8) +
geom_hline(yintercept = -log10(p_threshold), linetype = "dashed",
colour = "grey50", linewidth = 0.4) +
geom_text_repel(
data = filter(volcano_df, p_value < 0.05),
aes(label = gene),
size = 3.5, fontface = "italic", box.padding = 0.4,
show.legend = FALSE
) +
scale_colour_manual(
values = c("WD > SD" = "#3182BD", "SD > WD" = "#999999"),
name = NULL
) +
facet_wrap(~ sample_type) +
labs(x = expression(log[2]~"fold change (WD / SD)"),
y = expression(-log[10](q))) +
theme_hub(12) +
theme(legend.position = "bottom", legend.text = element_text(size = 11), strip.text = element_text(size = 14, face = "bold"))
Summary: ARG-carrying contigs were linked to their source taxa by chaining Abricate hits through MetaBAT2 bin membership and GTDB-Tk r226 species assignments. ErmG and tet(Q) co-occurred on Phocaeicola vulgatus MAGs in WD cecum and stool day 49. tet(Q) was additionally linked to Bacteroides acidifaciens in stool day 0, replicating the paper’s finding directly. A vanD cluster was detected in Schaedlerella arabinosiphila in WD cecum — a novel finding not reported in Keskey et al.
at <- readRDS(file.path(base_dir, "data", "arg_taxon.rds"))
arg_taxon <- at$arg_taxon
summary_tbl <- at$summary
head(summary_tbl)
## # A tibble: 6 × 5
## group sample_type gene taxon n_samples
## <fct> <fct> <chr> <chr> <int>
## 1 SD Stool_0 CblA-1 Bacteroides muris 4
## 2 SD Cecum CblA-1 Bacteroides muris 3
## 3 SD Stool_49 CblA-1 Bacteroides muris 3
## 4 SD Stool_49 tet(32) Unclassified 3
## 5 SD Cecum tet(O) Unclassified 2
## 6 SD Cecum tet(W) Unclassified Bacteria 2
top_genes <- summary_tbl |>
group_by(gene) |>
summarise(total = sum(n_samples), .groups = "drop") |>
slice_max(total, n = 15) |>
pull(gene)
top_taxa <- summary_tbl |>
group_by(taxon) |>
summarise(total = sum(n_samples), .groups = "drop") |>
slice_max(total, n = 15) |>
pull(taxon)
heatmap_df <- summary_tbl |>
filter(gene %in% top_genes, taxon %in% top_taxa) |>
mutate(
taxon = factor(taxon, levels = rev(top_taxa)),
gene = factor(gene, levels = top_genes),
facet_label = paste0(sample_type, " — ", group)
)
ggplot(heatmap_df, aes(x = gene, y = taxon, fill = n_samples)) +
geom_tile(colour = "white", linewidth = 0.4) +
scale_fill_gradient(low = "#f0f0f0", high = "#3182BD", name = "N samples") +
facet_wrap(~ facet_label, nrow = 2) +
labs(x = NULL, y = NULL) +
theme_hub(12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, face = "italic", size = 10),
axis.text.y = element_text(face = "italic", size = 10),
legend.position = "bottom",
legend.text = element_text(size = 11)
)
| Finding | Paper | Our result | Status |
|---|---|---|---|
| Alpha diversity lower in WD | Yes (all metrics) | Shannon, Simpson, invSimpson significant; richness/Chao1 p = 0.057 | Replicated |
| Beta diversity: diet clustering | Distinct Bray-Curtis PCoA | R² = 0.389, p = 0.001 | Replicated |
| B. vulgatus / P. vulgatus WD-enriched | Top hit | LFC = +2.30, q = 0.101 — directional, not significant | Directionally consistent |
| B. ovatum SD-enriched | Yes | Not detected in top results | Not replicated |
| ErmG, CfxA2 elevated in WD cecum | Significant (Fig 4A) | WD-only, q = 0.407 | Directionally consistent |
| CfxA2, ErmG elevated in WD stool day 49 | Significant (Fig 4B) | WD > SD, q = 0.161 | Directionally consistent |
| tet(Q), tet(X), tet(O) elevated in WD stool | Significant (Fig 4B) | WD > SD, q = 0.179–0.230 | Directionally consistent |
| B. acidifaciens carries tet(Q) | Yes | Confirmed in stool day 0 WD | Replicated |
| ErmG in B. dorei / B. fragilis | Yes | P. vulgatus (cecum) — GTDB r226 reclassification | Partial |
| WD increases ARG burden overall | Central conclusion | All key ARGs trend WD > SD | Directionally consistent |
sessioninfo::session_info(pkgs = "attached")
## ─ Session info ─────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.5.1 (2025-06-13)
## os macOS Tahoe 26.3.1
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_CA.UTF-8
## ctype en_CA.UTF-8
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## ANCOMBC * 2.10.1 2025-05-22 [1] Bioconductor 3.21 (R 4.5.0)
## Biobase * 2.68.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## BiocGenerics * 0.54.1 2025-10-09 [1] Bioconductor 3.21 (R 4.5.1)
## Biostrings * 2.76.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## coin * 1.4-3 2023-09-27 [1] CRAN (R 4.5.0)
## doRNG * 1.8.6.3 2026-02-05 [1] CRAN (R 4.5.2)
## dplyr * 1.2.1 2026-04-03 [1] CRAN (R 4.5.2)
## foreach * 1.5.2 2022-02-02 [1] CRAN (R 4.5.0)
## generics * 0.1.4 2025-05-09 [1] CRAN (R 4.5.0)
## GenomeInfoDb * 1.44.3 2025-09-18 [1] Bioconductor 3.21 (R 4.5.1)
## GenomicRanges * 1.60.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## ggplot2 * 4.0.3 2026-04-22 [1] CRAN (R 4.5.2)
## ggrepel * 0.9.8 2026-03-17 [1] CRAN (R 4.5.2)
## IRanges * 2.42.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## jsonlite * 2.0.0 2025-03-27 [1] CRAN (R 4.5.0)
## MatrixGenerics * 1.20.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## matrixStats * 1.5.0 2025-01-07 [1] CRAN (R 4.5.0)
## permute * 0.9-10 2026-02-06 [1] CRAN (R 4.5.2)
## plotly * 4.12.0 2026-01-24 [1] CRAN (R 4.5.2)
## purrr * 1.2.2 2026-04-10 [1] CRAN (R 4.5.2)
## readr * 2.2.0 2026-02-19 [1] CRAN (R 4.5.2)
## rngtools * 1.5.2 2021-09-20 [1] CRAN (R 4.5.0)
## S4Vectors * 0.46.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
## scales * 1.4.0 2025-04-24 [1] CRAN (R 4.5.0)
## SingleCellExperiment * 1.30.1 2025-05-05 [1] Bioconductor 3.21 (R 4.5.0)
## SummarizedExperiment * 1.38.1 2025-04-28 [1] Bioconductor 3.21 (R 4.5.0)
## survival * 3.8-6 2026-01-16 [1] CRAN (R 4.5.2)
## tibble * 3.3.1 2026-01-11 [1] CRAN (R 4.5.2)
## tidyr * 1.3.2 2025-12-19 [1] CRAN (R 4.5.2)
## TreeSummarizedExperiment * 2.16.1 2025-05-08 [1] Bioconductor 3.21 (R 4.5.0)
## vegan * 2.7-3 2026-03-04 [1] CRAN (R 4.5.2)
## XVector * 0.48.0 2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## * ── Packages attached to the search path.
##
## ────────────────────────────────────────────────────────────────────────────────────────────────