Phylogenetic regression

Code
# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels

Phylogenetic regression and phylogenetic ANOVA are for:

Here, we can do this when the phylogeny is a network. The residuals are assumed to be phylogenetically correlated. Residuals capture the variation in the response that deviate from the prediction using the predictor variables.

In this tutorial, we will use the following continuous traits:

We will use our calibrated network. If you did not run the previous section or started a new julia session, we can read this network from file:

net = readTopology("data/polemonium_network_calibrated.phy");

match taxon names

We start with the case when we have 1 value per species —morph in our case.

traits_morph = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
select!(traits_morph, :morph, :llength, :elevation, :lat)
first(traits_morph, 4)
4×4 DataFrame
Rowmorphllengthelevationlat
String31Float64Float64Float64
1acutiflorum0.1908340.4962.26
2aff._viscosum_sp._nov.-1.3713.078544.0958
3albiflorum0.9743912.34741.4855
4apachianum0.2893642.6883633.927

Taxon names are in the column “morph”, but don’t match exactly with the names in the network —somethign quite typical.

setdiff(tipLabels(net), traits_morph.morph) # names in the network but not in trait df
7-element Vector{String}:
 "reptans_2"
 "delicatum_2"
 "carneum_2"
 "elusum_2"
 "aff._viscosum_sp._nov._2"
 "foliosissimum_2"
 "chartaceum_2"
setdiff(traits_morph.morph, tipLabels(net)) # names in trait df but not in network
20-element Vector{String}:
 "acutiflorum"
 "aff._viscosum_sp._nov."
 "albiflorum"
 "boreale"
 "californicum"
 "carneum"
 "chartaceum"
 "delicatum"
 "elusum"
 "eximium"
 "filicinum"
 "flavum"
 "foliosissimum"
 "nevadense"
 "occidentale_lacustre"
 "pulcherrimum_lindleyi"
 "pulcherrimum_pulcherrimum"
 "reptans"
 "vanbruntiae"
 "viscosum"

We notice that some taxa in the network have a suffix “_2” that is not in the trait data, and that the trait data has extra species.

Let’s use a regular expression to remove the suffix “_2” from the tip names in the network:

for tip in net.node
  tip.name = replace(tip.name, r"_2$" => s"") # s"" to substitute what was found with nothing
end
issubset(tipLabels(net), traits_morph.morph) # true: all taxa in the network have data
true

Finally, let’s reduce our data frame to morphs that are in the network, also something often needed:

subset!(traits_morph, :morph => ByRow(in(tipLabels(net))))
17×4 DataFrame
Rowmorphllengthelevationlat
String31Float64Float64Float64
1aff._viscosum_sp._nov.-1.3713.078544.0958
2apachianum0.2893642.6883633.927
3brandegeei-0.674882.743540.0324
4carneum0.756531.00843.2211
5chartaceum-1.612334.11637.6344
6confertum-0.4304463.704538.9484
7delicatum0.03162773.30738.84
8eddyense-1.219372.74441.3189
9elegans-1.060162.28746.904
10elusum0.213141.69244.7146
11foliosissimum0.6220952.682538.786
12micranthum-0.8648491.12843.7447
13occidentale_occidentale0.5424572.1442.5916
14pauciflorum0.2220392.637530.8088
15pectinatum0.8341570.5547.1283
16pulcherrimum_shastense-1.044883.1741.3842
17reptans0.7466430.24239.9954

phylogenetic regression

Let’s fit our model: a Browian motion by default. The output shows the estimated parameters, confidence intervals for the regression coefficients, and other summary.

fit_bm    = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
                    tipnames=:morph) # default model: BM
PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.545368

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.23012     1.81753     2.33    0.0355   0.331896   8.12834
elevation    -0.58782     0.160091   -3.67    0.0025  -0.93118   -0.24446
lat          -0.0805585   0.0391812  -2.06    0.0589  -0.164594   0.00347682
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3444303454
AIC: 44.6888606908

Pagel’s lambda model can be fitted if the phylogeny is a time-consistent network (all paths from the root to a given node have the same length) and ultrametric:

fit_pagel = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
                    model="lambda", tipnames=:morph)
[ Info: Maximum lambda value to maintain positive branch lengths: 1.30752
PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Pagel's lambda

Parameter Estimates, using REML:
phylogenetic variance rate: 0.582156
Lambda: 1.1075

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   4.14696     1.8205     2.28    0.0389   0.242364   8.05155
elevation    -0.580068    0.160465  -3.61    0.0028  -0.924231  -0.235906
lat          -0.079338    0.039152  -2.03    0.0622  -0.163311   0.00463469
───────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3375456513
AIC: 46.6750913026
  • Which model seems best, based on AIC?
  • What is the estimate of λ, and what does it mean?
  • Which predictors seem to be correlated with log-leaflet-length, based on Wald-type tests?

accounting for within-species variation

If we have access to traits at the individual level, rather than aggregated at the species level, then we can account for within-species variation. It’s best to do so, because ignoring within-species variation (when present) can cause various biases.

We start by reading the individual trait data and removing rows for taxa not in the network.

traits_indiv = CSV.read("data/polemonium_traits_individual.csv", DataFrame)
subset!(traits_indiv, :morph => ByRow(in(tipLabels(net))))
last(traits_indiv, 4)
4×6 DataFrame
Rowmorphllengthlwidthelevationlatlarea
String31Float64Float64Float64Float64Float64
1reptans0.675492-0.2810380.24239.99540.15289
2reptans1.235470.1354050.24239.99541.12931
3reptans0.496727-0.2357220.24239.99540.0194399
4reptans0.703098-0.1450260.24239.99540.316507

Here, the predictors (elevation and latitude) have the same value for all individuals within a species: entered as the average elevation and latiture across a much much bigger sample size for each species. The model below accounts for within-species variation in the response variable, not in the predictor variables.

fit_ind = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
                  tipnames=:morph, withinspecies_var=true)
PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.22609     1.81331     2.33    0.0352   0.336929   8.11526
elevation    -0.585523    0.16023    -3.65    0.0026  -0.929181  -0.241865
lat          -0.0804964   0.0391184  -2.06    0.0587  -0.164397   0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226
  • What are the estimated variances?
  • Did the estimated regression coefficients change much compared to the previous analysis at the species level?
  • How about the strength of evidence that they correlate with log-leaflet-length, based on Wald-type tests?

likelihood ratio test

We can get p-values with a likelihood ratio test. For likelihood ratio tests, we need to use the ML criterion, not REML, to compare models with different “fixed” effects. ML tends to underestimate variances compared to REML, however.

fit_ml = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
                reml=false, tipnames=:morph, withinspecies_var=true)
fit_ml_nolat = phylolm(@formula(llength ~ elevation), traits_indiv, net;
                reml=false, tipnames=:morph, withinspecies_var=true)
lrtest(fit_ml_nolat, fit_ml)
Likelihood-ratio test: 2 models fitted on 954 observations
──────────────────────────────────────────────────────
     DOF  ΔDOF     LogLik  Deviance   Chisq  p(>Chisq)
──────────────────────────────────────────────────────
[1]    4        -330.9506  661.9012                   
[2]    5     1  -328.7027  657.4054  4.4958     0.0340
──────────────────────────────────────────────────────

within-species variation using species summaries

The data we used above has 1 row per individual, which can lead to a very large data set (considering that we have 17 species):

nrow(traits_indiv)
954

and sometimes we don’t have this fine-grained individual-level information anyway. In fact, the method only needs to know the sample size, observed mean and variance of the response trait (log-leaflet-length) within each species. Let’s illustrate how to summarize our data with 1 row per species and extra columns: the number of sampled individuals, their mean and variance within each morph.

num_nonmissing = x -> length(.!(ismissing.(x)))
traits_meanstd = combine( groupby(traits_indiv, :morph),
  :llength => mean => :llength, # last term = new column name
  :llength => std => :llength_sd,
  :llength => num_nonmissing => :llength_n,
  :elevation => mean => :elevation,
  :lat => mean => :lat
)
17×6 DataFrame
Rowmorphllengthllength_sdllength_nelevationlat
String31Float64Float64Int64Float64Float64
1aff._viscosum_sp._nov.-1.3710.269326383.078544.0958
2apachianum0.2893640.304548622.6883633.927
3brandegeei-0.674880.353546582.743540.0324
4carneum0.756530.236791471.00843.2211
5chartaceum-1.612330.156378164.11637.6344
6confertum-0.4304460.290027103.704538.9484
7delicatum0.03162770.384536503.30738.84
8eddyense-1.219370.065292732.74441.3189
9elegans-1.060160.258922212.28746.904
10elusum0.213140.260283141.69244.7146
11foliosissimum0.6220950.3156212742.682538.786
12micranthum-0.8648490.345533751.12843.7447
13occidentale_occidentale0.5424570.395491732.1442.5916
14pauciflorum0.2220390.313219472.637530.8088
15pectinatum0.8341570.20499290.5547.1283
16pulcherrimum_shastense-1.044880.23081783.1741.3842
17reptans0.7466430.285631490.24239.9954

We can use this input data to fit our phylogenetic regression, thanks to the option y_mean_std=true. This format requires 3 columns for the response variable, containing:

  • the mean (one row per species): llength in our case
  • the standard deviation (SD, square-root of the variance) named with suffix “_sd”, llength_sd in our case,
  • the sample size, that is, the number of individuals from which the mean and SDs were calculated, named with suffix “_n”, so llength_n in our case.

Now we fit the model as above using our summarized dataset, except that we use option y_mean_std=true to let the function know about the format, and get the same results as before:

phylolm(@formula(llength ~ elevation + lat), traits_meanstd, net; # same result as fit_ind
        tipnames=:morph, withinspecies_var=true, y_mean_std=true)
PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.22609     1.81331     2.33    0.0352   0.336929   8.11526
elevation    -0.585523    0.16023    -3.65    0.0026  -0.929181  -0.241865
lat          -0.0804964   0.0391184  -2.06    0.0587  -0.164397   0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226