Impact of gene flow

Code
# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels
net = readTopology("data/polemonium_network_calibrated.phy");
for tip in net.node tip.name = replace(tip.name, r"_2$" => s""); end
traits = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
subset!(traits, :morph => ByRow(in(tipLabels(net))));

What happened to the trait during gene flow? We can ask this question in various ways.

network model versus tree model

First, we can investigate if the network (with gene flow) is a better explanation for the trait evolution than a tree. Several trees are displayed in the network, but the tree that represents most of the genome has the “major” hybrid edges, those with inheritance γ>0.5. We can delete all the “minor” hybrid edges (with γ<0.5) as shown below.

tree = majorTree(net)
R"par"(mar=[0,0,0,0]);
plot(tree, tipoffset=0.05, useedgelength=true, xlim=[1,2.5]);

Figure 1: major tree, drawn proportional to edge lengths

fit_net = phylolm(@formula(llength ~ elevation + lat), traits, net;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
fit_tree = phylolm(@formula(llength ~ elevation + lat), traits, tree;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
aic(fit_net), aic(fit_tree)
(674.5023241225919, 675.0140696456297)

Which model seems to fit best: network or major tree?

Note that this network-vs-tree question applies to the phylogenetic correlation of residuals here. It could be gene flow had an impact on the geographic distribution at reticulations, which impacted evelation and latitude. In that case, this impact would be explained by the predictors that we used, and would not be seen here. But it could appear from a model without any predictors (something you could run if interested).

We can check to see how much of a difference it makes on coefficients:

coeftable(fit_net)
Coef. Std. Error t Pr(> t )
(Intercept) 4.22609 1.81331 2.33 0.0352 0.336929 8.11526
elevation -0.585523 0.16023 -3.65 0.0026 -0.929181 -0.241865
lat -0.0804964 0.0391184 -2.06 0.0587 -0.164397 0.00340418
coeftable(fit_tree)
Coef. Std. Error t Pr(> t )
(Intercept) 4.11819 1.81426 2.27 0.0395 0.226996 8.00938
elevation -0.602781 0.155119 -3.89 0.0016 -0.935479 -0.270083
lat -0.077632 0.0392076 -1.98 0.0677 -0.161724 0.00645985

or how much difference it makes on the estimated variance parameters:

round.((sigma2_phylo(fit_net), sigma2_phylo(fit_tree)), sigdigits=4)
(0.5381, 0.5319)
round.((sigma2_within(fit_net), sigma2_within(fit_tree)), sigdigits=4)
(0.1075, 0.1075)

Out of curiousity, we can do this again with the most “minor” tree: where major hybrid edges are deleted and all minor hybrid edges are kept.

minortree = deepcopy(net)
while minortree.numHybrids > 0
  for e in minortree.edge
    if e.hybrid && e.isMajor
      PhyloNetworks.deletehybridedge!(minortree, e) # changes the list of edges
      break
    end
  end
end
# plot(minortree, tipoffset=0.05, useedgelength=true, xlim=[1,2.5]);
fit_minor = phylolm(@formula(llength ~ elevation + lat), traits, minortree;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
aic(fit_net), aic(fit_tree), aic(fit_minor)
(674.5023241225919, 675.0140696456297, 674.7890396071382)

transgressive evolution at reticulations

Another question we may ask is whether the trait at reticulations deviates from the weighted average of its parent populations (immediately before gene flow). Such would be the case if a hybrid species’s trait is outside the range of its parents. This “transgressive” evolution can be modelled with a shift in the trait along the edge just below the reticulation. This shift quantifies the difference between the trait at the hybrid node and the weighted average of the immediate parents.

The code below builds a data frame with 1 predictor column per reticulation: this column quantifies the effect of a transgressive shift on each taxon in the phylogeny.

df_shift = regressorHybrid(net)
17×5 DataFrame
Rowshift_8shift_18shift_25tipNamessum
Float64Float64Float64StringFloat64
10.00.00.0micranthum0.0
20.00.00.0occidentale_occidentale0.0
30.00.00.0reptans0.0
40.00.00.0pectinatum0.0
51.00.00.0delicatum1.0
61.00.00.0pulcherrimum_shastense1.0
70.00.00.0elegans0.0
80.00.00.0carneum0.0
90.01.00.0elusum1.0
100.00.00.0aff._viscosum_sp._nov.0.0
110.00.00.0pauciflorum0.0
120.00.01.0apachianum1.0
130.00.00.0foliosissimum0.0
140.00.00.0confertum0.0
150.00.00.0brandegeei0.0
160.00.00.0eddyense0.0
170.00.00.0chartaceum0.0

To know which shift corresponds to which reticulation, we could plot the network and ask to see the edge numbers, as shown below.

But the data frame shows us which tips are impacted by each reticulation: - transgressive shift on edge 8: reticulation leading to delicatum and pulcherrimum_shastense, - shift on edge 18: below reticulation leading to elusum - shift on edge 25: below reticulation leading to apachianum. The values in this data frame are the proportion of the genome that each species inherited from the reticulation, based on the network’s γ values. In our case, they are only 0 or 1, but the values could be in between on more complex networks, especially for older reticulations if they were followed by later reticulations.

Now we can combine our trait data and the new columns that can serve as predictors to model transgressive evolution:

df = innerjoin(select(traits, :morph, r"llength", :elevation, :lat),
               select(df_shift, Not(:sum)), # excludes the 'sum' column
               on = :morph => :tipNames) # join our data with shift predictors
17×9 DataFrame
Rowmorphllengthllength_sdllength_nelevationlatshift_8shift_18shift_25
String31Float64Float64Int64Float64Float64Float64Float64Float64
1micranthum-0.8648490.345533751.12843.74470.00.00.0
2occidentale_occidentale0.5424570.395491732.1442.59160.00.00.0
3reptans0.7466430.285631490.24239.99540.00.00.0
4pectinatum0.8341570.20499290.5547.12830.00.00.0
5delicatum0.03162770.384536503.30738.841.00.00.0
6pulcherrimum_shastense-1.044880.23081783.1741.38421.00.00.0
7elegans-1.060160.258922212.28746.9040.00.00.0
8carneum0.756530.236791471.00843.22110.00.00.0
9elusum0.213140.260283141.69244.71460.01.00.0
10aff._viscosum_sp._nov.-1.3710.269326383.078544.09580.00.00.0
11pauciflorum0.2220390.313219472.637530.80880.00.00.0
12apachianum0.2893640.304548622.6883633.9270.00.01.0
13foliosissimum0.6220950.3156212742.682538.7860.00.00.0
14confertum-0.4304460.290027103.704538.94840.00.00.0
15brandegeei-0.674880.353546582.743540.03240.00.00.0
16eddyense-1.219370.065292732.74441.31890.00.00.0
17chartaceum-1.612330.156378164.11637.63440.00.00.0

If we had a particular interest in the reticulation ancestral to the delicatum + pulcherrimum clade, we could look for evidence of transgressive evolution at that reticulation like this (in leaflet length itself, including any effect correlated with latitude and elevation):

fit_noshift = phylolm(@formula(llength ~ 1), df, net, tipnames=:morph,
          withinspecies_var=true, y_mean_std=true)
# using `df` for both, so the no-shift model is recognized as nested in the shifted model
fit_sh8 = phylolm(@formula(llength ~ shift_8), df, net, tipnames=:morph,
          withinspecies_var=true, y_mean_std=true)
coeftable(fit_sh8) # 95% confidence interval for shift includes 0
aic(fit_noshift), aic(fit_sh8)
(674.853189498432, 675.4984901526377)

Note that we fitted the no-shift model with the same data df as the shifted model. The taxa are listed in a different order in df and in the original traits data frame. It helps to use the same data frame for later comparisons. For example, we could run a likelihood ratio test (lrtest) to compare the two models, altough that we requires re-fitting them with ML method instead of the default REML:

fit_noshift_ML = phylolm(@formula(llength ~ 1), df, net, tipnames=:morph,
          reml=false, withinspecies_var=true, y_mean_std=true)
fit_sh8_ML = phylolm(@formula(llength ~ shift_8), df, net, tipnames=:morph,
          reml=false, withinspecies_var=true, y_mean_std=true)
lrtest(fit_noshift_ML, fit_sh8_ML)
Likelihood-ratio test: 2 models fitted on 954 observations
──────────────────────────────────────────────────────
     DOF  ΔDOF     LogLik  Deviance   Chisq  p(>Chisq)
──────────────────────────────────────────────────────
[1]    3        -334.6010  669.2021                   
[2]    4     1  -334.3393  668.6785  0.5235     0.4693
──────────────────────────────────────────────────────