Code
# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels
Phylogenetic regression and phylogenetic ANOVA are for:
Here, we can do this when the phylogeny is a network. The residuals are assumed to be phylogenetically correlated. Residuals capture the variation in the response that deviate from the prediction using the predictor variables.
In this tutorial, we will use the following continuous traits:
area ≈ π length/2 × width/2
.We will use our calibrated network. If you did not run the previous section or started a new julia session, we can read this network from file:
We start with the case when we have 1 value per species —morph in our case.
traits_morph = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
select!(traits_morph, :morph, :llength, :elevation, :lat)
first(traits_morph, 4)
Row | morph | llength | elevation | lat |
---|---|---|---|---|
String31 | Float64 | Float64 | Float64 | |
1 | acutiflorum | 0.190834 | 0.49 | 62.26 |
2 | aff._viscosum_sp._nov. | -1.371 | 3.0785 | 44.0958 |
3 | albiflorum | 0.974391 | 2.347 | 41.4855 |
4 | apachianum | 0.289364 | 2.68836 | 33.927 |
Taxon names are in the column “morph”, but don’t match exactly with the names in the network —somethign quite typical.
7-element Vector{String}:
"reptans_2"
"delicatum_2"
"carneum_2"
"elusum_2"
"aff._viscosum_sp._nov._2"
"foliosissimum_2"
"chartaceum_2"
20-element Vector{String}:
"acutiflorum"
"aff._viscosum_sp._nov."
"albiflorum"
"boreale"
"californicum"
"carneum"
"chartaceum"
"delicatum"
"elusum"
"eximium"
"filicinum"
"flavum"
"foliosissimum"
"nevadense"
"occidentale_lacustre"
"pulcherrimum_lindleyi"
"pulcherrimum_pulcherrimum"
"reptans"
"vanbruntiae"
"viscosum"
We notice that some taxa in the network have a suffix “_2” that is not in the trait data, and that the trait data has extra species.
Let’s use a regular expression to remove the suffix “_2” from the tip names in the network:
for tip in net.node
tip.name = replace(tip.name, r"_2$" => s"") # s"" to substitute what was found with nothing
end
issubset(tipLabels(net), traits_morph.morph) # true: all taxa in the network have data
true
Finally, let’s reduce our data frame to morphs that are in the network, also something often needed:
Row | morph | llength | elevation | lat |
---|---|---|---|---|
String31 | Float64 | Float64 | Float64 | |
1 | aff._viscosum_sp._nov. | -1.371 | 3.0785 | 44.0958 |
2 | apachianum | 0.289364 | 2.68836 | 33.927 |
3 | brandegeei | -0.67488 | 2.7435 | 40.0324 |
4 | carneum | 0.75653 | 1.008 | 43.2211 |
5 | chartaceum | -1.61233 | 4.116 | 37.6344 |
6 | confertum | -0.430446 | 3.7045 | 38.9484 |
7 | delicatum | 0.0316277 | 3.307 | 38.84 |
8 | eddyense | -1.21937 | 2.744 | 41.3189 |
9 | elegans | -1.06016 | 2.287 | 46.904 |
10 | elusum | 0.21314 | 1.692 | 44.7146 |
11 | foliosissimum | 0.622095 | 2.6825 | 38.786 |
12 | micranthum | -0.864849 | 1.128 | 43.7447 |
13 | occidentale_occidentale | 0.542457 | 2.14 | 42.5916 |
14 | pauciflorum | 0.222039 | 2.6375 | 30.8088 |
15 | pectinatum | 0.834157 | 0.55 | 47.1283 |
16 | pulcherrimum_shastense | -1.04488 | 3.17 | 41.3842 |
17 | reptans | 0.746643 | 0.242 | 39.9954 |
Let’s fit our model: a Browian motion by default. The output shows the estimated parameters, confidence intervals for the regression coefficients, and other summary.
fit_bm = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
tipnames=:morph) # default model: BM
PhyloNetworkLinearModel
Formula: llength ~ 1 + elevation + lat
Model: Brownian motion
Parameter Estimates, using REML:
phylogenetic variance rate: 0.545368
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 4.23012 1.81753 2.33 0.0355 0.331896 8.12834
elevation -0.58782 0.160091 -3.67 0.0025 -0.93118 -0.24446
lat -0.0805585 0.0391812 -2.06 0.0589 -0.164594 0.00347682
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3444303454
AIC: 44.6888606908
Pagel’s lambda model can be fitted if the phylogeny is a time-consistent network (all paths from the root to a given node have the same length) and ultrametric:
fit_pagel = phylolm(@formula(llength ~ elevation + lat), traits_morph, net;
model="lambda", tipnames=:morph)
[ Info: Maximum lambda value to maintain positive branch lengths: 1.30752
PhyloNetworkLinearModel
Formula: llength ~ 1 + elevation + lat
Model: Pagel's lambda
Parameter Estimates, using REML:
phylogenetic variance rate: 0.582156
Lambda: 1.1075
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 4.14696 1.8205 2.28 0.0389 0.242364 8.05155
elevation -0.580068 0.160465 -3.61 0.0028 -0.924231 -0.235906
lat -0.079338 0.039152 -2.03 0.0622 -0.163311 0.00463469
───────────────────────────────────────────────────────────────────────────
Log Likelihood: -18.3375456513
AIC: 46.6750913026
If we have access to traits at the individual level, rather than aggregated at the species level, then we can account for within-species variation. It’s best to do so, because ignoring within-species variation (when present) can cause various biases.
We start by reading the individual trait data and removing rows for taxa not in the network.
traits_indiv = CSV.read("data/polemonium_traits_individual.csv", DataFrame)
subset!(traits_indiv, :morph => ByRow(in(tipLabels(net))))
last(traits_indiv, 4)
Row | morph | llength | lwidth | elevation | lat | larea |
---|---|---|---|---|---|---|
String31 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | reptans | 0.675492 | -0.281038 | 0.242 | 39.9954 | 0.15289 |
2 | reptans | 1.23547 | 0.135405 | 0.242 | 39.9954 | 1.12931 |
3 | reptans | 0.496727 | -0.235722 | 0.242 | 39.9954 | 0.0194399 |
4 | reptans | 0.703098 | -0.145026 | 0.242 | 39.9954 | 0.316507 |
Here, the predictors (elevation and latitude) have the same value for all individuals within a species: entered as the average elevation and latiture across a much much bigger sample size for each species. The model below accounts for within-species variation in the response variable, not in the predictor variables.
fit_ind = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
tipnames=:morph, withinspecies_var=true)
PhyloNetworkLinearModel
Formula: llength ~ 1 + elevation + lat
Model: Brownian motion
Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 4.22609 1.81331 2.33 0.0352 0.336929 8.11526
elevation -0.585523 0.16023 -3.65 0.0026 -0.929181 -0.241865
lat -0.0804964 0.0391184 -2.06 0.0587 -0.164397 0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226
We can get p-values with a likelihood ratio test. For likelihood ratio tests, we need to use the ML criterion, not REML, to compare models with different “fixed” effects. ML tends to underestimate variances compared to REML, however.
fit_ml = phylolm(@formula(llength ~ elevation + lat), traits_indiv, net;
reml=false, tipnames=:morph, withinspecies_var=true)
fit_ml_nolat = phylolm(@formula(llength ~ elevation), traits_indiv, net;
reml=false, tipnames=:morph, withinspecies_var=true)
lrtest(fit_ml_nolat, fit_ml)
Likelihood-ratio test: 2 models fitted on 954 observations
──────────────────────────────────────────────────────
DOF ΔDOF LogLik Deviance Chisq p(>Chisq)
──────────────────────────────────────────────────────
[1] 4 -330.9506 661.9012
[2] 5 1 -328.7027 657.4054 4.4958 0.0340
──────────────────────────────────────────────────────
The data we used above has 1 row per individual, which can lead to a very large data set (considering that we have 17 species):
and sometimes we don’t have this fine-grained individual-level information anyway. In fact, the method only needs to know the sample size, observed mean and variance of the response trait (log-leaflet-length) within each species. Let’s illustrate how to summarize our data with 1 row per species and extra columns: the number of sampled individuals, their mean and variance within each morph.
num_nonmissing = x -> length(.!(ismissing.(x)))
traits_meanstd = combine( groupby(traits_indiv, :morph),
:llength => mean => :llength, # last term = new column name
:llength => std => :llength_sd,
:llength => num_nonmissing => :llength_n,
:elevation => mean => :elevation,
:lat => mean => :lat
)
Row | morph | llength | llength_sd | llength_n | elevation | lat |
---|---|---|---|---|---|---|
String31 | Float64 | Float64 | Int64 | Float64 | Float64 | |
1 | aff._viscosum_sp._nov. | -1.371 | 0.269326 | 38 | 3.0785 | 44.0958 |
2 | apachianum | 0.289364 | 0.304548 | 62 | 2.68836 | 33.927 |
3 | brandegeei | -0.67488 | 0.353546 | 58 | 2.7435 | 40.0324 |
4 | carneum | 0.75653 | 0.236791 | 47 | 1.008 | 43.2211 |
5 | chartaceum | -1.61233 | 0.156378 | 16 | 4.116 | 37.6344 |
6 | confertum | -0.430446 | 0.290027 | 10 | 3.7045 | 38.9484 |
7 | delicatum | 0.0316277 | 0.384536 | 50 | 3.307 | 38.84 |
8 | eddyense | -1.21937 | 0.0652927 | 3 | 2.744 | 41.3189 |
9 | elegans | -1.06016 | 0.258922 | 21 | 2.287 | 46.904 |
10 | elusum | 0.21314 | 0.260283 | 14 | 1.692 | 44.7146 |
11 | foliosissimum | 0.622095 | 0.315621 | 274 | 2.6825 | 38.786 |
12 | micranthum | -0.864849 | 0.345533 | 75 | 1.128 | 43.7447 |
13 | occidentale_occidentale | 0.542457 | 0.39549 | 173 | 2.14 | 42.5916 |
14 | pauciflorum | 0.222039 | 0.313219 | 47 | 2.6375 | 30.8088 |
15 | pectinatum | 0.834157 | 0.204992 | 9 | 0.55 | 47.1283 |
16 | pulcherrimum_shastense | -1.04488 | 0.230817 | 8 | 3.17 | 41.3842 |
17 | reptans | 0.746643 | 0.285631 | 49 | 0.242 | 39.9954 |
We can use this input data to fit our phylogenetic regression, thanks to the option y_mean_std=true
. This format requires 3 columns for the response variable, containing:
llength
in our casellength_sd
in our case,llength_n
in our case.Now we fit the model as above using our summarized dataset, except that we use option y_mean_std=true
to let the function know about the format, and get the same results as before:
phylolm(@formula(llength ~ elevation + lat), traits_meanstd, net; # same result as fit_ind
tipnames=:morph, withinspecies_var=true, y_mean_std=true)
PhyloNetworkLinearModel
Formula: llength ~ 1 + elevation + lat
Model: Brownian motion
Parameter Estimates, using REML:
phylogenetic variance rate: 0.538129
within-species variance: 0.107484
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 4.22609 1.81331 2.33 0.0352 0.336929 8.11526
elevation -0.585523 0.16023 -3.65 0.0026 -0.929181 -0.241865
lat -0.0804964 0.0391184 -2.06 0.0587 -0.164397 0.00340418
────────────────────────────────────────────────────────────────────────────
Log Likelihood: -332.2511620613
AIC: 674.5023241226