比较微生物组中的差异分析方法

发布时间:2025-12-09 11:50:38 浏览次数:1

在微生物组研究中我们常常需要根据某些感兴趣的表型来找到与其相关的特征(比如菌群、OTU、基因家族等等)。但微生物组学的数据结构导致了这必然是一项相当艰巨的任务,因为他们:

•高维特征集(通常超过 100 到 10,000 个特征);•高度稀疏(许多特征仅在少数样本中被发现);•特征间复杂的相关性结构;•计数的组成性(即,观察到的计数受文库大小的限制);•不同的文库大小;•过度离散的计数值,等等。

那么应该如何选择不同的差异分析方法呢?其实这个问题并没有答案,(如果有时间的话)我一般都是尝试一些对手头数据来说看似合理的模型,然后优先考虑 overlap 的差异特征集。虽然这并不完美,但至少会证明一些结果的鲁棒性,增加我们对结果的信心。

下面我将基于一个用 MetaPhlAn2 注释的公共宏基因组数据,使用五种不同算法进行差异分析。这些方法也可以应用于(也许更适用于)扩增子测序得到的 ASV 或 OTU。选择这些方法的标准如下:

•在一项或多项模拟研究中表现较好;•可以校正协变量,和多重假设检验;•包含多种标准化和建模方法;•应用相对广泛;•封装成 R 包。

符合以上条件的有很多算法,但这里我挑选了以下五个模型:

•Limma-Voom[1]•DESeq2[2]•corncob[3]•MaAsLin 2[4]•ANCOM-BC[5]

我们将使用由 curatedMetagenomicData[6] 包(关于这个包的教程可以参见我之前的笔记)提供的公共数据[7] 来识别从印度南部与印度中北部人群收集的粪便样本中的差异菌群。

相关文章:D B Dhakan, A Maji, A K Sharma, R Saxena, J Pulikkan, T Grace, A Gomez, J Scaria, K R Amato, V K Sharma, The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches, GigaScience, Volume 8, Issue 3, March 2019, giz004, https://doi.org/10.1093/gigascience/giz004

安装并加载所需的 R 包

# if (!requireNamespace("BiocManager", quietly = TRUE))#   install.packages("BiocManager")# # BiocManager::install("curatedMetagenomicData")# BiocManager::install("phyloseq")# BiocManager::install("edgeR")# BiocManager::install("limma")# BiocManager::install("DEFormats")# BiocManager::install("DESeq2")# BiocManager::install("apeglm")# BiocManager::install("ANCOMBC")# BiocManager::install("Maaslin2")# # install.packages("corncob")# install.packages("tidyverse")# install.packages("ggVennDiagram")library(curatedMetagenomicData)library(tidyverse)library(phyloseq)library(edgeR)library(limma)library(DEFormats)library(DESeq2)library(apeglm)library(corncob)library(ANCOMBC)library(Maaslin2)library(ggVennDiagram)

下载公共宏基因组数据

dhakan_eset <- DhakanDB_2019.metaphlan_bugs_list.stool()
## snapshotDate(): 2020-04-27
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
(ps <- ExpressionSet2phyloseq(dhakan_eset,                              simplify = TRUE,                              relab = FALSE,                              phylogenetictree = FALSE))
## phyloseq-class experiment-level object## otu_table()   OTU Table:         [ 768 taxa and 110 samples ]## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]## tax_table()   Taxonomy Table:    [ 768 taxa by 8 taxonomic ranks ]
(ps <- subset_taxa(ps, !is.na(Species) & is.na(Strain)))
## phyloseq-class experiment-level object## otu_table()   OTU Table:         [ 286 taxa and 110 samples ]## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]## tax_table()   Taxonomy Table:    [ 286 taxa by 8 taxonomic ranks ]

删除物种注释中的 “s__” 占位符:

taxa_names(ps) <- gsub("s__", "", taxa_names(ps))                        head(taxa_names(ps))
## [1] "Prevotella_copri"             "Eubacterium_rectale"         ## [3] "Butyrivibrio_crossotus"       "Prevotella_stercorea"        ## [5] "Faecalibacterium_prausnitzii" "Subdoligranulum_unclassified"

剔除仅在 <10% 样本中出现的菌种:

(ps <- filter_taxa(ps, function(x) sum(x > 0) > (0.1*length(x)), TRUE))   
## phyloseq-class experiment-level object## otu_table()   OTU Table:         [ 109 taxa and 110 samples ]## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]## tax_table()   Taxonomy Table:    [ 109 taxa by 8 taxonomic ranks ]

查看数据中包括的 metadata 信息:

sample_variables(ps)
##  [1] "subjectID"               "body_site"              ##  [3] "antibiotics_current_use" "study_condition"        ##  [5] "disease"                 "age"                    ##  [7] "infant_age"              "age_category"           ##  [9] "gender"                  "BMI"                    ## [11] "country"                 "location"               ## [13] "non_westernized"         "DNA_extraction_kit"     ## [15] "number_reads"            "number_bases"           ## [17] "minimum_read_length"     "median_read_length"     ## [19] "curator"                 "NCBI_accession"
sample_data(ps)$location <- factor(sample_data(ps)$location, levels = c("Bhopal", "Kerala"))table(sample_data(ps)$location) 
## ## Bhopal Kerala ##     53     57

在过滤了低流行率的分类群后,我们现在得到了一个包含 109 个菌种的 phyloseq 对象。我一般倾向于根据总数和流行率过滤掉仅在 10% 到 50% 的样本中观察到的特征,以更好地满足模型假设,同时限制计算 power 时所付出的 FDR 惩罚。(这里总共 109 个菌种肯定是偏低的,但本文仅作示例)

Limma-Voom

常用于基因表达矩阵分析的 Limma 包也可用于菌群矩阵的差异分析。

dds <- phyloseq_to_deseq2(ps, ~ location)      #convert to DESeq2 and DGEList objects
## converting counts to integer mode
dge <- as.DGEList(dds)# 计算 TMM 归一化因子dge <- calcNormFactors(dge, method = "TMM")head(dge$samples$norm.factors)
## [1] 0.4529756 1.6721557 0.1509053 0.4569702 0.4646246 0.3362776
# 建立模型矩阵mm <- model.matrix(~ group, dge$samples)head(mm)
##          (Intercept) groupKerala## DHAK_HAK           1           0## DHAK_HAJ           1           0## DHAK_HAI           1           0## DHAK_HAH           1           0## DHAK_HAG           1           0## DHAK_HAF           1           0
table(mm[, 2])
## ##  0  1 ## 53 57
# 得到 voom 权重y <- voom(dge, mm, plot = T)

voom 的过程:1.将计数转换为 log2 CPM(counts per million reads),其中 “per million reads” 是根据我们之前计算的归一化因子定义的;2.线性模型拟合每个特征的 log2 CPM,并计算残差;3.基于平均表达量拟合平滑曲线(见上图中的红线);4.获得每个特征和样本的权重。

使用 lmFit 拟合线性模型:

fit <- lmFit(y, mm)                                   #fit lm with limma# 经验贝叶斯统计量计算(参见 https://www.degruyter.com/doi/10.2202/1544-6115.1027)fit <- eBayes(fit)head(coef(fit))
##                              (Intercept) groupKerala## Prevotella_copri              16.1383696 -4.99436530## Eubacterium_rectale           11.9161328 -0.08294369## Butyrivibrio_crossotus        -0.2110634 -0.08650380## Prevotella_stercorea           6.8200020 -6.22106379## Faecalibacterium_prausnitzii  13.6093282  1.00607350## Subdoligranulum_unclassified  10.9786786  1.16771790

提取计算结果:

limma_res_df <- data.frame(topTable(fit, coef = "groupKerala", number = Inf))    #extract resultsfdr_limma <- limma_res_df %>%    dplyr::filter(adj.P.Val < 0.05) %>%    rownames_to_column(var = "Species")dim(fdr_limma)
## [1] 15  7
head(fdr_limma)
##                    Species     logFC    AveExpr         t      P.Value## 1 Megasphaera_unclassified -8.974360  6.4247739 -5.575994 1.059729e-07## 2    Bacteroides_coprocola -5.124762 -0.8291721 -4.484538 1.409614e-05## 3     Prevotella_stercorea -6.221064  3.2149583 -3.867134 1.613340e-04## 4    Lactobacillus_ruminis -5.711738  3.7967638 -3.833832 1.826516e-04## 5       Ruminococcus_obeum  4.659230  2.7788048  3.737572 2.603217e-04## 6 Mitsuokella_unclassified -5.237836  0.7101123 -3.732304 2.653693e-04##      adj.P.Val         B## 1 1.155105e-05 6.9755057## 2 7.682395e-04 2.8793949## 3 4.820876e-03 0.6994391## 4 4.820876e-03 0.5749799## 5 4.820876e-03 0.2933741## 6 4.820876e-03 0.2931618
ggplot(fdr_limma, aes(x = Species, y = logFC, color = Species)) +  geom_point(size = 4) +  labs(y = "nLog2 Fold-Change for ACVD vs. Controls", x = "") +  theme(axis.text.x = element_text(color = "black", size = 12),        axis.text.y = element_text(color = "black", size = 12),        axis.title.y = element_text(size = 14),        axis.title.x = element_text(size = 14),        legend.text = element_text(size = 12),        legend.title = element_text(size = 12),        legend.position = "none") +  coord_flip() +  geom_hline(yintercept = 0, linetype="dotted")

可以看到根据校正后的 P 值 < 0.05,limma-voom 找到了 15 个差异菌。

DESeq2

DESeq2 将对原始计数进行建模,使用标准化因子(scale factor)来解释库深度的差异。然后估计每条 OTU 的离散度,并缩小这些估计值以生成更准确的离散度估计。最后,DESeq2 拟合负二项分布的模型,并使用 Wald 检验或似然比检验进行假设检验。

dds <- DESeq(dds, test = "Wald", fitType = "local", sfType = "poscounts")
estimating size factorsestimating dispersionsgene-wise dispersion estimatesmean-dispersion relationshipfinal dispersion estimatesfitting model and testing-- replacing outliers and refitting for 65 genes-- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds)estimating dispersionsfitting model and testing
plotDispEsts(dds)

使用 apeglm 对计数值的离散度进行校正:

res <- lfcShrink(dds, coef=2, type="apeglm") 
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for##     sequence count data: removing the noise and preserving large differences.##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
plotMA(dds)
deseq_res_df <- data.frame(res) %>%  rownames_to_column(var = "Species") %>%  dplyr::arrange(padj)                                 fdr_deseq <- deseq_res_df %>%    dplyr::filter(padj < 0.05)dim(fdr_deseq)
## [1] 5 6
head(fdr_deseq)
##                       Species   baseMean log2FoldChange     lfcSE       pvalue## 1    Bacteroides_massiliensis  2326.9291      0.1667126 0.9692252 9.311151e-32## 2       Clostridium_symbiosum   346.6204      0.5656621 1.2730261 4.669930e-32## 3       Clostridium_hathewayi   314.3198      0.4874255 1.2012990 1.764584e-28## 4 Parabacteroides_goldsteinii   256.9873      0.2564174 1.0225083 5.847951e-25## 5       Bacteroides_coprocola 19141.8394     -0.7224703 1.3784468 2.200809e-03##           padj## 1 4.934910e-30## 2 4.934910e-30## 3 6.234864e-27## 4 1.549707e-23## 5 4.665715e-02
ggplot(fdr_deseq, aes(x = Species, y = log2FoldChange, color = Species)) +    geom_point(size = 4) +    labs(y = "nLog2 Fold-Change for ACVD vs. Controls", x = "") +    theme(axis.text.x = element_text(color = "black", size = 12),          axis.text.y = element_text(color = "black", size = 12),          axis.title.y = element_text(size = 14),          axis.title.x = element_text(size = 14),          legend.text = element_text(size = 12),          legend.title = element_text(size = 12),          legend.position = "none") +    coord_flip() +    geom_hline(yintercept = 0, linetype="dotted")

根据校正后的 P 值 < 0.05,DESeq2 找到了 5 个差异菌。

Corncob

Corncob 则是基于相对丰度进行建模并检验协变量对相对丰度的影响。

•GitHub 地址:https://github.com/bryandmartin/corncob/

corn_da <- differentialTest(formula = ~ location,                            phi.formula = ~ 1,                            formula_null = ~ 1,                            phi.formula_null = ~ 1,                            data = ps,                            test = "Wald", boot = FALSE,                            fdr_cutoff = 0.05)fdr_corncob <- corn_da$significant_taxadim(data.frame(fdr_corncob))
## [1] 28  1
head(sort(corn_da$p_fdr))                    
##     Megasphaera_unclassified           Ruminococcus_obeum ##                 1.576463e-06                 1.041461e-04 ##         Prevotella_stercorea Bacteroides_thetaiotaomicron ##                 3.697782e-04                 3.965007e-04 ##        Bacteroides_uniformis        Lactobacillus_ruminis ##                 4.262191e-04                 5.608007e-04

Corncob 找到了 28 个差异特征。

MaAsLin 2

MaAsLin2 是 MaAsLin(Microbiome Multivariable Association with Linear Models) 的升级版,主要基于线性模型进行多元关联分析,分析表型和微生物特征之间的相关性。

•首页:https://huttenhower.sph.harvard.edu/maaslin/•GitHub 地址:https://github.com/biobakery/Maaslin2

mas_1 <- Maaslin2(  input_data = data.frame(otu_table(ps)),  input_metadata = data.frame(sample_data(ps)),  output = "./Maaslin2_default_output",  min_abundance = 0.0,  min_prevalence = 0.0,  normalization = "TSS",  transform = "LOG",  analysis_method = "LM",  max_significance = 0.05,  fixed_effects = "location",  correction = "BH",  standardize = FALSE,  cores = 1)
mas_res_df <- mas_1$resultsfdr_mas <- mas_res_df %>%    dplyr::filter(qval < 0.05)dim(fdr_mas)
## [1] 27 10
head(fdr_mas)
##                                       feature metadata  value       coef## locationKerala32     Megasphaera_unclassified location Kerala -0.7713398## locationKerala29           Ruminococcus_obeum location Kerala  0.6891505## locationKerala14          Ruminococcus_bromii location Kerala  0.8985006## locationKerala28 Bacteroides_thetaiotaomicron location Kerala  0.8813854## locationKerala70     Streptococcus_salivarius location Kerala  0.5409692## locationKerala3          Prevotella_stercorea location Kerala -1.0280049##                     stderr         pval           name        qval   N## locationKerala32 0.1583627 3.828477e-06 locationKerala 0.000417304 110## locationKerala29 0.1577677 2.887477e-05 locationKerala 0.001573675 110## locationKerala14 0.2122802 4.865394e-05 locationKerala 0.001767760 110## locationKerala28 0.2271755 1.801616e-04 locationKerala 0.004347755 110## locationKerala70 0.1404570 1.994383e-04 locationKerala 0.004347755 110## locationKerala3  0.2774162 3.343767e-04 locationKerala 0.005206723 110##                  N.not.zero## locationKerala32          0## locationKerala29          0## locationKerala14          0## locationKerala28          0## locationKerala70          0## locationKerala3           0

MaAsLin 2 找到了 27 个差异菌种。MaAsLin 2[8] 支持多种归一化方法和模型,我们可以用它来快速拟合不同的模型,看看这些模型对结果的影响。

ANCOM-BC

ANCOM-BC 引入了一种包含偏差校正的微生物组组成分析方法,该方法可以估计未知的抽样比例,并校正由样品之间的差异引起的偏差,绝对丰度数据使用线性回归框架建模。

•GitHub 地址:https://github.com/FrederickHuangLin/ANCOM-BC-Code-Archive

ancom_da <- ancombc(phyloseq = ps, formula = "location",               p_adj_method = "fdr", zero_cut = 0.90, lib_cut = 1000,               group = "location", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,               max_iter = 100, conserve = TRUE, alpha = 0.05, global = FALSE)ancom_res_df <- data.frame(  Species = row.names(ancom_da$res$beta),  beta = unlist(ancom_da$res$beta),  se = unlist(ancom_da$res$se),  W = unlist(ancom_da$res$W),  p_val = unlist(ancom_da$res$p_val),  q_val = unlist(ancom_da$res$q_val),  diff_abn = unlist(ancom_da$res$diff_abn))fdr_ancom <- ancom_res_df %>%  dplyr::filter(q_val < 0.05)dim(fdr_ancom)
## [1] 22  7
head(fdr_ancom)
##                                       Species      beta        se         W## locationKerala4          Prevotella_stercorea -4.414224 1.1839905 -3.728259## locationKerala9      Mitsuokella_unclassified -3.467118 1.0176606 -3.406950## locationKerala15          Ruminococcus_bromii  3.682311 1.0497644  3.507750## locationKerala18        Lactobacillus_ruminis -3.608518 1.0441049 -3.456087## locationKerala22    Catenibacterium_mitsuokai -3.055131 0.9621105 -3.175447## locationKerala29 Bacteroides_thetaiotaomicron  3.143050 0.8682526  3.619973##                         p_val       q_val locationKerala## locationKerala4  0.0001928069 0.006229387           TRUE## locationKerala9  0.0006569326 0.007160566           TRUE## locationKerala15 0.0004519128 0.007029239           TRUE## locationKerala18 0.0005480782 0.007029239           TRUE## locationKerala22 0.0014960575 0.013589189           TRUE## locationKerala29 0.0002946343 0.006423028           TRUE

ANCOM-BC 找到了 22 个差异物种。

不同方法结果之间的 overlap

x <- list(limma = fdr_limma$Species,          deseq2 = fdr_deseq$Species,          corncob = fdr_corncob,          maaslin2 = fdr_mas$feature,          ancom = fdr_ancom$Species)          overlap_data <- process_region_data(Venn(x))overlap_data[overlap_data$id == 12345,]$item
## [[1]]## [1] "Bacteroides_coprocola"
ggVennDiagram(x,label_alpha=0) +  scale_fill_gradient(low="gray100",high = "gray95",guide="none")

在这五种方法中,只有一种菌 Bacteroides coprocola 在所有结果中都显著, FDR p 值 < 0.05。除了考虑到丰度差异外,我们还可以进一步考虑效应的大小(即倍数变化或系数的大小),看看这些被多种方法同时证实的结果是否合理,同时可进一步尝试探究不同模型方法之间的结果差异是否有明确的原因(例如,数据是否过度稀疏等等)。比如,在图中我们可以看到有 11 个菌被除 DESeq2 外的其余四种方法证实,这些菌或许就是下一步需要探究的方向。

Reference

•https://www.nicholas-ollberding.com/post/identifying-differentially-abundant-features-in-microbiome-data/

引用链接

[1] Limma-Voom: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29[2] DESeq2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8[3] corncob: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/Modeling-microbial-abundances-and-dysbiosis-with-beta-binomial-regression/10.1214/19-AOAS1283.short[4] MaAsLin 2: https://www.biorxiv.org/content/10.1101/2021.01.20.427420v1[5] ANCOM-BC: https://www.nature.com/articles/s41467-020-17041-7[6] curatedMetagenomicData: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html[7] 数据: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html[8] MaAsLin 2: https://github.com/biobakery/Maaslin2

corncob
需要做网站?需要网络推广?欢迎咨询客户经理 13272073477