Phylogenetic regression

Code

# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels

Phylogenetic regression and phylogenetic ANOVA are for:

a continuous response variable
using 0, 1 or more predictor variables, which can be continuous or discrete

Here, we can do this when the phylogeny is a network. The residuals are assumed to be phylogenetically correlated. Residuals capture the variation in the response that deviate from the prediction using the predictor variables.

In this tutorial, we will use the following continuous traits:

response: leaflet length or width, which were log-transformed to obtain a more normal distribution within each species morph, and a more homogeneous variance across species morphs. A combined trait, log(leaflet area) was calculated from leaflet length and width through the formula area ≈ π length/2 × width/2.
For this tutorial, we will focus on leaflet length (log-transformed) for the response variable.
possible predictors: elevation and latitude.

We will use our calibrated network. If you did not run the previous section or started a new julia session, we can read this network from file:

net = readTopology("data/polemonium_network_calibrated.phy");

match taxon names

We start with the case when we have 1 value per species —morph in our case.

traits_morph = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
select!(traits_morph, :morph, :llength, :elevation, :lat)
first(traits_morph, 4)

4×4 DataFrame

Row	morph	llength	elevation	lat
	String31	Float64	Float64	Float64
1	acutiflorum	0.190834	0.49	62.26
2	aff._viscosum_sp._nov.	-1.371	3.0785	44.0958
3	albiflorum	0.974391	2.347	41.4855
4	apachianum	0.289364	2.68836	33.927

Taxon names are in the column “morph”, but don’t match exactly with the names in the network —somethign quite typical.

setdiff(tipLabels(net), traits_morph.morph) # names in the network but not in trait df

7-element Vector{String}:
 "reptans_2"
 "delicatum_2"
 "carneum_2"
 "elusum_2"
 "aff._viscosum_sp._nov._2"
 "foliosissimum_2"
 "chartaceum_2"

setdiff(traits_morph.morph, tipLabels(net)) # names in trait df but not in network

20-element Vector{String}:
 "acutiflorum"
 "aff._viscosum_sp._nov."
 "albiflorum"
 "boreale"
 "californicum"
 "carneum"
 "chartaceum"
 "delicatum"
 "elusum"
 "eximium"
 "filicinum"
 "flavum"
 "foliosissimum"
 "nevadense"
 "occidentale_lacustre"
 "pulcherrimum_lindleyi"
 "pulcherrimum_pulcherrimum"
 "reptans"
 "vanbruntiae"
 "viscosum"

We notice that some taxa in the network have a suffix “_2” that is not in the trait data, and that the trait data has extra species.

Let’s use a regular expression to remove the suffix “_2” from the tip names in the network:

for tip in net.node
  tip.name = replace(tip.name, r"_2$" => s"") # s"" to substitute what was found with nothing
end
issubset(tipLabels(net), traits_morph.morph) # true: all taxa in the network have data

true

Finally, let’s reduce our data frame to morphs that are in the network, also something often needed:

subset!(traits_morph, :morph => ByRow(in(tipLabels(net))))

17×4 DataFrame

Row	morph	llength	elevation	lat
	String31	Float64	Float64	Float64
1	aff._viscosum_sp._nov.	-1.371	3.0785	44.0958
2	apachianum	0.289364	2.68836	33.927
3	brandegeei	-0.67488	2.7435	40.0324
4	carneum	0.75653	1.008	43.2211
5	chartaceum	-1.61233	4.116	37.6344
6	confertum	-0.430446	3.7045	38.9484
7	delicatum	0.0316277	3.307	38.84
8	eddyense	-1.21937	2.744	41.3189
9	elegans	-1.06016	2.287	46.904
10	elusum	0.21314	1.692	44.7146
11	foliosissimum	0.622095	2.6825	38.786
12	micranthum	-0.864849	1.128	43.7447
13	occidentale_occidentale	0.542457	2.14	42.5916
14	pauciflorum	0.222039	2.6375	30.8088
15	pectinatum	0.834157	0.55	47.1283
16	pulcherrimum_shastense	-1.04488	3.17	41.3842
17	reptans	0.746643	0.242	39.9954

phylogenetic regression

Let’s fit our model: a Browian motion by default. The output shows the estimated parameters, confidence intervals for the regression coefficients, and other summary.

fit_bm    = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
                    tipnames=:morph) # default model: BM

PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.545368

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.23012     1.81753     2.33    0.0355   0.331896   8.12834
elevation    -0.58782     0.160091   -3.67    0.0025  -0.93118   -0.24446
lat          -0.0805585   0.0391812  -2.06    0.0589  -0.164594   0.00347682
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3444303454
AIC: 44.6888606908

Pagel’s lambda model can be fitted if the phylogeny is a time-consistent network (all paths from the root to a given node have the same length) and ultrametric:

fit_pagel = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
                    model="lambda", tipnames=:morph)

[ Info: Maximum lambda value to maintain positive branch lengths: 1.30752

PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Pagel's lambda

Parameter Estimates, using REML:
phylogenetic variance rate: 0.582156
Lambda: 1.1075

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   4.14696     1.8205     2.28    0.0389   0.242364   8.05155
elevation    -0.580068    0.160465  -3.61    0.0028  -0.924231  -0.235906
lat          -0.079338    0.039152  -2.03    0.0622  -0.163311   0.00463469
───────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3375456513
AIC: 46.6750913026

Which model seems best, based on AIC?
What is the estimate of λ, and what does it mean?
Which predictors seem to be correlated with log-leaflet-length, based on Wald-type tests?

accounting for within-species variation

If we have access to traits at the individual level, rather than aggregated at the species level, then we can account for within-species variation. It’s best to do so, because ignoring within-species variation (when present) can cause various biases.

We start by reading the individual trait data and removing rows for taxa not in the network.

traits_indiv = CSV.read("data/polemonium_traits_individual.csv", DataFrame)
subset!(traits_indiv, :morph => ByRow(in(tipLabels(net))))
last(traits_indiv, 4)

4×6 DataFrame

Row	morph	llength	lwidth	elevation	lat	larea
	String31	Float64	Float64	Float64	Float64	Float64
1	reptans	0.675492	-0.281038	0.242	39.9954	0.15289
2	reptans	1.23547	0.135405	0.242	39.9954	1.12931
3	reptans	0.496727	-0.235722	0.242	39.9954	0.0194399
4	reptans	0.703098	-0.145026	0.242	39.9954	0.316507

Here, the predictors (elevation and latitude) have the same value for all individuals within a species: entered as the average elevation and latiture across a much much bigger sample size for each species. The model below accounts for within-species variation in the response variable, not in the predictor variables.

fit_ind = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
                  tipnames=:morph, withinspecies_var=true)

PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.22609     1.81331     2.33    0.0352   0.336929   8.11526
elevation    -0.585523    0.16023    -3.65    0.0026  -0.929181  -0.241865
lat          -0.0804964   0.0391184  -2.06    0.0587  -0.164397   0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226

What are the estimated variances?
Did the estimated regression coefficients change much compared to the previous analysis at the species level?
How about the strength of evidence that they correlate with log-leaflet-length, based on Wald-type tests?

likelihood ratio test

We can get p-values with a likelihood ratio test. For likelihood ratio tests, we need to use the ML criterion, not REML, to compare models with different “fixed” effects. ML tends to underestimate variances compared to REML, however.

fit_ml = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
                reml=false, tipnames=:morph, withinspecies_var=true)
fit_ml_nolat = phylolm(@formula(llength ~ elevation), traits_indiv, net;
                reml=false, tipnames=:morph, withinspecies_var=true)
lrtest(fit_ml_nolat, fit_ml)

Likelihood-ratio test: 2 models fitted on 954 observations
──────────────────────────────────────────────────────
     DOF  ΔDOF     LogLik  Deviance   Chisq  p(>Chisq)
──────────────────────────────────────────────────────
[1]    4        -330.9506  661.9012                   
[2]    5     1  -328.7027  657.4054  4.4958     0.0340
──────────────────────────────────────────────────────

within-species variation using species summaries

The data we used above has 1 row per individual, which can lead to a very large data set (considering that we have 17 species):

nrow(traits_indiv)

and sometimes we don’t have this fine-grained individual-level information anyway. In fact, the method only needs to know the sample size, observed mean and variance of the response trait (log-leaflet-length) within each species. Let’s illustrate how to summarize our data with 1 row per species and extra columns: the number of sampled individuals, their mean and variance within each morph.

num_nonmissing = x -> length(.!(ismissing.(x)))
traits_meanstd = combine( groupby(traits_indiv, :morph),
  :llength => mean => :llength, # last term = new column name
  :llength => std => :llength_sd,
  :llength => num_nonmissing => :llength_n,
  :elevation => mean => :elevation,
  :lat => mean => :lat
)

17×6 DataFrame

Row	morph	llength	llength_sd	llength_n	elevation	lat
	String31	Float64	Float64	Int64	Float64	Float64
1	aff._viscosum_sp._nov.	-1.371	0.269326	38	3.0785	44.0958
2	apachianum	0.289364	0.304548	62	2.68836	33.927
3	brandegeei	-0.67488	0.353546	58	2.7435	40.0324
4	carneum	0.75653	0.236791	47	1.008	43.2211
5	chartaceum	-1.61233	0.156378	16	4.116	37.6344
6	confertum	-0.430446	0.290027	10	3.7045	38.9484
7	delicatum	0.0316277	0.384536	50	3.307	38.84
8	eddyense	-1.21937	0.0652927	3	2.744	41.3189
9	elegans	-1.06016	0.258922	21	2.287	46.904
10	elusum	0.21314	0.260283	14	1.692	44.7146
11	foliosissimum	0.622095	0.315621	274	2.6825	38.786
12	micranthum	-0.864849	0.345533	75	1.128	43.7447
13	occidentale_occidentale	0.542457	0.39549	173	2.14	42.5916
14	pauciflorum	0.222039	0.313219	47	2.6375	30.8088
15	pectinatum	0.834157	0.204992	9	0.55	47.1283
16	pulcherrimum_shastense	-1.04488	0.230817	8	3.17	41.3842
17	reptans	0.746643	0.285631	49	0.242	39.9954

We can use this input data to fit our phylogenetic regression, thanks to the option y_mean_std=true. This format requires 3 columns for the response variable, containing:

the mean (one row per species): llength in our case
the standard deviation (SD, square-root of the variance) named with suffix “_sd”, llength_sd in our case,
the sample size, that is, the number of individuals from which the mean and SDs were calculated, named with suffix “_n”, so llength_n in our case.

Now we fit the model as above using our summarized dataset, except that we use option y_mean_std=true to let the function know about the format, and get the same results as before:

phylolm(@formula(llength ~ elevation + lat), traits_meanstd, net; # same result as fit_ind
        tipnames=:morph, withinspecies_var=true, y_mean_std=true)

PhyloNetworkLinearModel

Formula: llength ~ 1 + elevation + lat

Model: Brownian motion

Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   4.22609     1.81331     2.33    0.0352   0.336929   8.11526
elevation    -0.585523    0.16023    -3.65    0.0026  -0.929181  -0.241865
lat          -0.0804964   0.0391184  -2.06    0.0587  -0.164397   0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226