Impact of gene flow

Code

# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels
net = readTopology("data/polemonium_network_calibrated.phy");
for tip in net.node tip.name = replace(tip.name, r"_2$" => s""); end
traits = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
subset!(traits, :morph => ByRow(in(tipLabels(net))));

What happened to the trait during gene flow? We can ask this question in various ways.

network model versus tree model

First, we can investigate if the network (with gene flow) is a better explanation for the trait evolution than a tree. Several trees are displayed in the network, but the tree that represents most of the genome has the “major” hybrid edges, those with inheritance γ>0.5. We can delete all the “minor” hybrid edges (with γ<0.5) as shown below.

tree = majorTree(net)
R"par"(mar=[0,0,0,0]);
plot(tree, tipoffset=0.05, useedgelength=true, xlim=[1,2.5]);

Figure 1: major tree, drawn proportional to edge lengths

fit_net = phylolm(@formula(llength ~ elevation + lat), traits, net;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
fit_tree = phylolm(@formula(llength ~ elevation + lat), traits, tree;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
aic(fit_net), aic(fit_tree)

(674.5023241225919, 675.0140696456297)

Which model seems to fit best: network or major tree?

Note that this network-vs-tree question applies to the phylogenetic correlation of residuals here. It could be gene flow had an impact on the geographic distribution at reticulations, which impacted evelation and latitude. In that case, this impact would be explained by the predictors that we used, and would not be seen here. But it could appear from a model without any predictors (something you could run if interested).

We can check to see how much of a difference it makes on coefficients:

coeftable(fit_net)

	Coef.	Std. Error	t	Pr(>	t	)
(Intercept)	4.22609	1.81331	2.33	0.0352	0.336929	8.11526
elevation	-0.585523	0.16023	-3.65	0.0026	-0.929181	-0.241865
lat	-0.0804964	0.0391184	-2.06	0.0587	-0.164397	0.00340418

coeftable(fit_tree)

	Coef.	Std. Error	t	Pr(>	t	)
(Intercept)	4.11819	1.81426	2.27	0.0395	0.226996	8.00938
elevation	-0.602781	0.155119	-3.89	0.0016	-0.935479	-0.270083
lat	-0.077632	0.0392076	-1.98	0.0677	-0.161724	0.00645985

or how much difference it makes on the estimated variance parameters:

round.((sigma2_phylo(fit_net), sigma2_phylo(fit_tree)), sigdigits=4)

(0.5381, 0.5319)

round.((sigma2_within(fit_net), sigma2_within(fit_tree)), sigdigits=4)

(0.1075, 0.1075)

Out of curiousity, we can do this again with the most “minor” tree: where major hybrid edges are deleted and all minor hybrid edges are kept.

minortree = deepcopy(net)
while minortree.numHybrids > 0
  for e in minortree.edge
    if e.hybrid && e.isMajor
      PhyloNetworks.deletehybridedge!(minortree, e) # changes the list of edges
      break
    end
  end
end
# plot(minortree, tipoffset=0.05, useedgelength=true, xlim=[1,2.5]);
fit_minor = phylolm(@formula(llength ~ elevation + lat), traits, minortree;
                  tipnames=:morph, withinspecies_var=true, y_mean_std=true)
aic(fit_net), aic(fit_tree), aic(fit_minor)

(674.5023241225919, 675.0140696456297, 674.7890396071382)

transgressive evolution at reticulations

Another question we may ask is whether the trait at reticulations deviates from the weighted average of its parent populations (immediately before gene flow). Such would be the case if a hybrid species’s trait is outside the range of its parents. This “transgressive” evolution can be modelled with a shift in the trait along the edge just below the reticulation. This shift quantifies the difference between the trait at the hybrid node and the weighted average of the immediate parents.

The code below builds a data frame with 1 predictor column per reticulation: this column quantifies the effect of a transgressive shift on each taxon in the phylogeny.

df_shift = regressorHybrid(net)

17×5 DataFrame

Row	shift_8	shift_18	shift_25	tipNames	sum
	Float64	Float64	Float64	String	Float64
1	0.0	0.0	0.0	micranthum	0.0
2	0.0	0.0	0.0	occidentale_occidentale	0.0
3	0.0	0.0	0.0	reptans	0.0
4	0.0	0.0	0.0	pectinatum	0.0
5	1.0	0.0	0.0	delicatum	1.0
6	1.0	0.0	0.0	pulcherrimum_shastense	1.0
7	0.0	0.0	0.0	elegans	0.0
8	0.0	0.0	0.0	carneum	0.0
9	0.0	1.0	0.0	elusum	1.0
10	0.0	0.0	0.0	aff._viscosum_sp._nov.	0.0
11	0.0	0.0	0.0	pauciflorum	0.0
12	0.0	0.0	1.0	apachianum	1.0
13	0.0	0.0	0.0	foliosissimum	0.0
14	0.0	0.0	0.0	confertum	0.0
15	0.0	0.0	0.0	brandegeei	0.0
16	0.0	0.0	0.0	eddyense	0.0
17	0.0	0.0	0.0	chartaceum	0.0

To know which shift corresponds to which reticulation, we could plot the network and ask to see the edge numbers, as shown below.

But the data frame shows us which tips are impacted by each reticulation: - transgressive shift on edge 8: reticulation leading to delicatum and pulcherrimum_shastense, - shift on edge 18: below reticulation leading to elusum - shift on edge 25: below reticulation leading to apachianum. The values in this data frame are the proportion of the genome that each species inherited from the reticulation, based on the network’s γ values. In our case, they are only 0 or 1, but the values could be in between on more complex networks, especially for older reticulations if they were followed by later reticulations.

Now we can combine our trait data and the new columns that can serve as predictors to model transgressive evolution:

df = innerjoin(select(traits, :morph, r"llength", :elevation, :lat),
               select(df_shift, Not(:sum)), # excludes the 'sum' column
               on = :morph => :tipNames) # join our data with shift predictors

17×9 DataFrame

Row	morph	llength	llength_sd	llength_n	elevation	lat	shift_8	shift_18	shift_25
	String31	Float64	Float64	Int64	Float64	Float64	Float64	Float64	Float64
1	micranthum	-0.864849	0.345533	75	1.128	43.7447	0.0	0.0	0.0
2	occidentale_occidentale	0.542457	0.39549	173	2.14	42.5916	0.0	0.0	0.0
3	reptans	0.746643	0.285631	49	0.242	39.9954	0.0	0.0	0.0
4	pectinatum	0.834157	0.204992	9	0.55	47.1283	0.0	0.0	0.0
5	delicatum	0.0316277	0.384536	50	3.307	38.84	1.0	0.0	0.0
6	pulcherrimum_shastense	-1.04488	0.230817	8	3.17	41.3842	1.0	0.0	0.0
7	elegans	-1.06016	0.258922	21	2.287	46.904	0.0	0.0	0.0
8	carneum	0.75653	0.236791	47	1.008	43.2211	0.0	0.0	0.0
9	elusum	0.21314	0.260283	14	1.692	44.7146	0.0	1.0	0.0
10	aff._viscosum_sp._nov.	-1.371	0.269326	38	3.0785	44.0958	0.0	0.0	0.0
11	pauciflorum	0.222039	0.313219	47	2.6375	30.8088	0.0	0.0	0.0
12	apachianum	0.289364	0.304548	62	2.68836	33.927	0.0	0.0	1.0
13	foliosissimum	0.622095	0.315621	274	2.6825	38.786	0.0	0.0	0.0
14	confertum	-0.430446	0.290027	10	3.7045	38.9484	0.0	0.0	0.0
15	brandegeei	-0.67488	0.353546	58	2.7435	40.0324	0.0	0.0	0.0
16	eddyense	-1.21937	0.0652927	3	2.744	41.3189	0.0	0.0	0.0
17	chartaceum	-1.61233	0.156378	16	4.116	37.6344	0.0	0.0	0.0

If we had a particular interest in the reticulation ancestral to the delicatum + pulcherrimum clade, we could look for evidence of transgressive evolution at that reticulation like this (in leaflet length itself, including any effect correlated with latitude and elevation):

fit_noshift = phylolm(@formula(llength ~ 1), df, net, tipnames=:morph,
          withinspecies_var=true, y_mean_std=true)
# using `df` for both, so the no-shift model is recognized as nested in the shifted model
fit_sh8 = phylolm(@formula(llength ~ shift_8), df, net, tipnames=:morph,
          withinspecies_var=true, y_mean_std=true)
coeftable(fit_sh8) # 95% confidence interval for shift includes 0
aic(fit_noshift), aic(fit_sh8)

(674.853189498432, 675.4984901526377)

Note that we fitted the no-shift model with the same data df as the shifted model. The taxa are listed in a different order in df and in the original traits data frame. It helps to use the same data frame for later comparisons. For example, we could run a likelihood ratio test (lrtest) to compare the two models, altough that we requires re-fitting them with ML method instead of the default REML:

fit_noshift_ML = phylolm(@formula(llength ~ 1), df, net, tipnames=:morph,
          reml=false, withinspecies_var=true, y_mean_std=true)
fit_sh8_ML = phylolm(@formula(llength ~ shift_8), df, net, tipnames=:morph,
          reml=false, withinspecies_var=true, y_mean_std=true)
lrtest(fit_noshift_ML, fit_sh8_ML)

Likelihood-ratio test: 2 models fitted on 954 observations
──────────────────────────────────────────────────────
     DOF  ΔDOF     LogLik  Deviance   Chisq  p(>Chisq)
──────────────────────────────────────────────────────
[1]    3        -334.6010  669.2021                   
[2]    4     1  -334.3393  668.6785  0.5235     0.4693
──────────────────────────────────────────────────────