Using R: ggplot2 可视化 RDA 结果

关于 RDA(Redundancy analysis 冗余分析)是什么,相信对于已经有可视化需求的同学来说已经不用更多的解释了。

在 R 中常用来进行 RDA 分析和绘制工作的是 veganggvegan 这两个包。然而,在实际使用中,最常遇到的问题是虽然这些包内建的 plot 等功能可以绘制出基本可用的包,但想要进一步的定制图形却没那么容易。

想要绘制出一副自己满意、编辑满意、导师满意最主要的是审稿人满意的 RDA 结果,作为最强可视化工具之一的 ggplot2 包毋庸置疑是最佳的选择。

我们需要什么样的 RDA 图

首先,我们来思考我们需要什么样的 RDA 图?按照世纪需求以及审稿人的建议:

I would recommend showing in bold the variables with significant correlations

笔者最后的目标是绘制一幅:

1. 显示物种信息(实际上是响应变量矩阵);
2. 环境变量(实际上是解释变量矩阵);
3. 在两轴上能显示各自的解释度;
4. 标记有显著性的解释变量。

以笔者对 plot.rda 以及 autoplot.rda 这两个 vegan 和 ggvegan 内建函数的浅薄了解,似乎很难完成。

如何「标记有显著性的解释变量」?

如果进行过 RDA 分析不难发现,使用 vegan 内建的 rda 是没有标记解释变量的显著性的。其实这种显著性需要 vegan 内建的 envfit 函数来得到一个及其近似的结果。通常这个方法在论文中会写作:

Monte Carlo permutation (999 permutations) was used to identify axes with significant eigenvalues and species-environment correlations.

由于是进行了 999 次(默认参数可以修改)的蒙特卡洛抽样,因此这个结果是及其近似的结果,直接用作 RDA 中解释变量的显著性是没有问题的。然而,envfit 输出的结果并非标准的 data.frame 或者类似的结果,无法方便的输出或者进行分析,后续我们还会进行结果提取的步骤,暂且按下不表。

RDA 和 ENVFIT 分析

此例,我们使用随机抽取的 150 条土壤线虫群落群落数据以及对应的环境数据,由于不影响后续的理解以及版权考量,对各参数名不再解释。

每一步的操作以及原因以注释的形式呈现。

library('vegan')
library('tidyverse')

# read Environmental Variables
  df_env <- read_csv('../assets/2020-04-28-post/df_env_smp.csv')
# read Community composition matrix
df_com <- read_csv('../assets/2020-04-28-post/df_com_smp.csv')

# View the structure of data
DT::datatable(df_env)
DT::datatable(df_com)
library('vegan')

# Performing RDA and viewing the results
res_rda <- rda(df_com, df_env)
res_rda
## Call: rda(X = df_com, Y = df_env)
## 
##                 Inertia Proportion Rank
## Total         2031.7341     1.0000     
## Constrained    895.7317     0.4409    5
## Unconstrained 1136.0024     0.5591    5
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##  RDA1  RDA2  RDA3  RDA4  RDA5 
## 754.0  88.3  45.4   6.2   1.8 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5 
## 747.4 235.0  96.3  44.7  12.6
# Performing ENVFIT
res_envfit <- envfit(df_com, df_env)

# However, the result of ENVFIT is not in the data.frame format, we should extract useful information from it.
res_envfit
## 
## ***VECTORS
## 
##           pp       ba     r2 Pr(>r)    
## pH  -0.94935  0.31421 0.0809  0.002 ** 
## MOI  0.79143  0.61126 0.2636  0.001 ***
## TN   0.91399  0.40574 0.0887  0.011 *  
## TP   0.86358 -0.50421 0.0002  0.988    
## AP   0.99269  0.12071 0.0575  0.015 *  
## NH4  0.94495 -0.32722 0.0462  0.033 *  
## NO3  0.52786  0.84933 0.1430  0.001 ***
## SOC  0.81629  0.57764 0.1440  0.002 ** 
## MAT  0.10537 -0.99443 0.0118  0.442    
## MAP -0.18615 -0.98252 0.0073  0.570    
## PBM  0.75360  0.65733 0.0096  0.499    
## PCV  0.57508  0.81810 0.0380  0.070 .  
## PSR  0.88521 -0.46519 0.0062  0.645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

正如我们在注释提到的,ENVFIT 并不是以 data.frame 的形式提供数据,因此我们还需要通过一些提取的操作才能获得与结果相同的表格。由于这类操作经常使用,因此我们将其包装成 function 并命名为 envfit_to_df。

# Here, env_obj indicates the result of envfit. In this case, it's the res_envfit.
# r2_dig is the significant figure of R2
# p_dig is the significant figure of p value
envfit_to_df <- function(env_obj,
                         r2_dig = 6,
                         p_dig = 3) {
  r2_fmt <- as.character(paste('%.', r2_dig, 'f', sep = ''))
  p_fmt <- as.character(paste('%.', p_dig, 'f', sep = ''))
  tibble(
      # the name of explainary variables
    factor = names(env_obj$vectors$r),
      # list or vector of R2
    r2 = env_obj$vectors$r,
      # list or vector of p values
    pvals = env_obj$vectors$pvals
  ) %>%
    # generate significant levels by p values
    mutate(sig = case_when(
      pvals <= 0.001 ~ '***',
      pvals <= 0.01 ~ '**',
      pvals <= 0.05 ~ '*',
      TRUE ~ ' '
    )) %>%
    # format the significant figure by format definition before.
    mutate(pvals = sprintf('%.3f', pvals),
           r2 = sprintf(r2_fmt, r2))
}
# Convert result of ENVFIT to data.frame, in fact, it's a tibble.
df_env <- envfit_to_df(res_envfit, r2_dig = 3)
DT::datatable(df_env)

截止到目前,我们已经准备完成了 RDA 分析以及 ENVFIT 分析,并将数据转换成为了满足可视化需求的格式。 ## 结果可视化 接下来就是需要进行可视化作业。此例,笔者仅需要显示环境变量与物种信息,不需要显示样地信息,因此绘制中仅保留了所需的信息,如需显示样地信息,请按实际需求更改。

绘制 RDA 图形是常用的操作,因此同样将它包装成为 function 并命名为 ggRDA

# Here, rda_obj means the object which is from vegan::rda
# sp_size means the text size of species
# arrow_txt_size means the environmet variable names at the end of the arrow
# Because not every RDA plot needs indicate significant correlations, the envfit_df is optional here.
ggRDA <- function(rda_obj,sp_size = 4,arrow_txt_size = 4, envfit_df) {
    # ggplot doesn't support rda object directly, we use ggvegan::fortify function to convert the rda to data.frame
    fmod <- fortify(rda_obj)
    # to get the arrow of biplot, we plot rda by vegan::plot.rda function firstly. 
    # The arrow attributes contain in the attributes(plot_obejct$biplot)$arrow.mul
    basplot <- plot(rda_obj)
    mult <- attributes(basplot$biplot)$arrow.mul
  
    # To check if envfit_df exists or not
    # If envfit_df exists, join the fortified rda_obj and envfit to mark which variable is significant.
    if(missingArg(envfit_df)){
        bplt_df <- filter(fmod, Score == "biplot") %>%
        # If there is no requirement to mark significant variable
        # use the sytle of sinificant (black bolder solid arrow)
        # to paint the arrow
        mutate(bold = 'sig')
    } else {
        bplt_df <- filter(fmod, Score == "biplot") %>%
        left_join(envfit_df, by = c('Label' = 'factor')) %>%
        # To mark the significant variables as sig, not significant variables as ns
        # these information are stored in bold column
        mutate(bold = ifelse(str_detect(sig, fixed('*')), 'sig', 'ns'))
    }
  ggplot(fmod, aes(x = RDA1, y = RDA2)) +
    coord_fixed() +
    geom_segment(data = bplt_df,
                 # mult and RDA1/RDA2 are from fortified RDA data.frame
                 # they contain the direction and effects of every variabl
                 # their products are the direction and length of arrows
                 aes(x = 0, xend = mult * RDA1,
                     y = 0, yend = mult * RDA2,
                     # Use different arrow size to indicate the significant level
                     size = bold,
                     # Use different arrow color to indicate the significant level
                     color = bold,
                     # Use different arrow linetype to indicate the significant level
                     linetype = bold),
                 #############################
                 # Q:Why use three different attibution to control the significant levels?
                     # It is redundancy, isn't it?
                 # A: In fact, it's not easy to recongize the significant level by one kind attribution
                     # Becasue it is not delicate to indicate it with supper bold and supper thin arrow,
                     # by the same logic, high contrast colors are not delicate neither.
                     # As for the line type, some arrow are really short, it's not easy to recognize
                     # weather it is solid or dashed line at all.
                     # To sum up, we use three different attributions
                     # to indicate the same difference to avoid any misleading.
                 #############################
                 # to control the size of the header of arrow
                 arrow = arrow(length = unit(0.25, "cm")), 
                ) +
    # Add the text of variable name at the end of arrow
    geom_text(data =  subset(fmod, Score == "biplot"),
              aes(x = (mult + mult/10) * RDA1,
                   #we add 10% to the text to push it slightly out from arrows
                  y = (mult + mult/10) * RDA2,
                  label = Label),
              size = arrow_txt_size,
               #otherwise you could use hjust and vjust. I prefer this option
              hjust = 0.5) +
    # Add the text of species
    geom_text(
      data = subset(fmod, Score == "species"),
      aes(colour = "species",label = Label),
      size = sp_size
    )
}

虽然使用 ggRDA 可以直接绘制图形,但通常为了美观,对于特定的参数还需要进一步的调整。

注意,由于在 ggRDA 中使用了 vegan::plot.rda 绘制图像,所以在下面的调用 ggRDA 会首先绘制一次简易的 rda 图像之后,再显示出 ggplot2 绘制的图形,不影响后续的输出保存。

library(ggvegan)
# Get the amount of explanation by each axis
# gernerally, we choose the first two axes.
exp_by_x <- (as.list(res_rda$CCA$eig)$RDA1)/(res_rda$tot.chi) * 100
exp_by_y <- (as.list(res_rda$CCA$eig)$RDA2)/(res_rda$tot.chi) * 100
    
ggRDA(res_rda, envfit_df = df_env, sp_size = 5) +
    # Generally theme_classic is a good choice to paint a figure
    theme_classic() +
    # In general, we don't need to show the legend in RDA figure
    theme(legend.position = "none") +
    xlab(paste('RDA1 (', round(exp_by_x, 2), '%)', sep = '')) +
    ylab(paste('RDA2 (', round(exp_by_y, 2), '%)', sep = '')) +
    # scale_XXXXX_manual series provide the ability
    # to define the style of legend by variable value
    scale_size_manual(values = c('ns' = .6,
                                 'sig' = .8)) +
    # Q: What's species here? I don't remember their is a significant level which is called 'species'
    # A: Indeed, their is no significant 'species'. However, 
        # the species name in RDA which is generated from geom_text contains colour attribution.
    scale_colour_manual(values = c(
        'ns' = '#606060',
        'sig' = 'black',
        'species' = 'red'
      )) +
    scale_linetype_manual(values = c('ns' = 8, 'sig' = 1))

最后很简单了,使用 ggsave 将 RDA 图保存成你需要的格式就可以。

Tips: 自然科学期刊的投稿系统通常支持上传 PDF 格式的图片,根据实际情况也推荐使用 ggsave 输出 PDF 格式。

此时无需设置 DPI 也无法设置 DPI,这是因为,ggsave 保存的 PDF 文件会优先将图片输出为矢量图,简而言之就是图片无论如何放大,都不会变得模糊,而期刊的排版系统也能很好的处理这种矢量图。

只要给 PDF 图片设定一个合适的宽高,就无需担心图片会变得不清晰等奇奇怪怪看似玄学的问题了。

不过如果有修改字体的需求,也许 PDF 会报错,类似于字体无法嵌入,此时输出的 PDF 文件所有的文字都会消失。关于解决这个问题,后续也许会进行进一步的讨论。

ggsave('RDA.pdf',width = 7)