Discrete trait

Code
# code from prior sections
using CSV, DataFrames, PhyloNetworks, PhyloPlots, RCall, StatsBase, StatsModels
net = readTopology("data/polemonium_network_calibrated.phy");
for tip in net.node tip.name = replace(tip.name, r"_2$" => s""); end
traits = CSV.read("data/polemonium_traits_morph.csv", DataFrame)
subset!(traits, :morph => ByRow(in(tipLabels(net))))

For illustrative purposes, we are going to look at the evolution of mean elevation as a discrete variable. We discretize elevation using the threshold 2.3 km below, because the distribution of elevation shows somewhat of a gap around that value, and a reasonable number of morphs both below and above that value.

dat = DataFrame(
    morph = String.(traits[:, :morph]),
    elevation = map( x -> (x .> 2.3 ? "hi" : "lo"), traits.elevation )
)
17×2 DataFrame
Rowmorphelevation
StringString
1aff._viscosum_sp._nov.hi
2apachianumhi
3brandegeeihi
4carneumlo
5chartaceumhi
6confertumhi
7delicatumhi
8eddyensehi
9eleganslo
10elusumlo
11foliosissimumhi
12micranthumlo
13occidentale_occidentalelo
14pauciflorumhi
15pectinatumlo
16pulcherrimum_shastensehi
17reptanslo

estimating the evolutionary rate(s)

For binary traits, we can use the :ERSM model, for “equal-rate substition model”:

fit = fitdiscrete(net, :ERSM, dat.morph, select(dat, :elevation))
PhyloNetworks.StatisticalSubstitutionModel:
Equal Rates Substitution Model with k=2,
  all rates equal to α=0.67674.
  rate matrix Q:
                hi      lo
        hi       *  0.6767
        lo  0.6767       *
on a network with 3 reticulations
data:
  17 species
  1 trait
log-likelihood: -11.05965

or the more general “binary trait substitution model” :BTSM that does not constrain the two rates to be equal:

fit_uneq = fitdiscrete(net, :BTSM, dat.morph, select(dat, :elevation))
PhyloNetworks.StatisticalSubstitutionModel:
Binary Trait Substitution Model:
  rate hi→lo α=0.0
  rate lo→hi β=0.53883
on a network with 3 reticulations
data:
  17 species
  1 trait
log-likelihood: -10.83348

Using a likelihood ratio test, we see that there is no evidence for unequal rates:

x2 = 2*(loglikelihood(fit_uneq) - loglikelihood(fit))
0.45233716294591275
import Distributions: Chisq, ccdf
ccdf(Chisq(1), x2) # p-value: P(Χ² on 1 df > observed x2)
0.5012271575266686

ancestral state probabilities

We can calculate ancestral state probabilities like this, but the output data frame is hard to interpret because it’s unclear which node has which number:

asr = ancestralStateReconstruction(fit)
39×4 DataFrame
14 rows omitted
Rownodenumbernodelabelhilo
Int64StringFloat64Float64
11micranthum0.01.0
22occidentale_occidentale0.01.0
33reptans0.01.0
44pectinatum0.01.0
55delicatum1.00.0
66pulcherrimum_shastense1.00.0
77elegans0.01.0
88carneum0.01.0
99elusum0.01.0
1010aff._viscosum_sp._nov.1.00.0
1111pauciflorum1.00.0
1212apachianum1.00.0
1313foliosissimum1.00.0
2828280.9866520.0133477
2929290.965710.0342897
3030H180.9683770.0316228
3131310.9233950.0766053
3232320.4311090.568891
3333H200.5243340.475666
3434340.617820.38218
3535350.3707850.629215
3636360.3842650.615735
3737H190.6869940.313006
3838380.6917040.308296
3939390.2083870.791613

But we can map the posterior probability of state “hi” onto the network. It is important to use the network stored in the fitted object, to be assured that the node numbers match.

R"par"(mar=[0,0,0,0]);
plot(fit.net, shownodenumber=true, showgamma=true, tipoffset=0.1, xlim=[0,15]);
plot(fit.net, nodelabel=select(asr, :nodenumber, :hi), tipoffset=0.1, xlim=[0,15]);

(a) showing node numbers

(b) showing the posterior probability of state ‘hi’

Figure 1: network stored in fitted object, to map ancestral state probabilities

effect of gene flow

What is the evidence that a trait was inherited via gene flow (along the minor hybrid edge) versus “vertically” (via the major hybrid edge)?

To answer this question, we can extract the prior probability of either option, and the posterior probabilities conditional on the data. Then, the Bayes factor to compare the 2 hypotheses, minor versus major edge (or gene flow versus vertical), can be expressed as an odds ratio: \[\mathrm{BF} = \frac{P(\mathrm{data} \mid \mathrm{minor})}{P(\mathrm{data} \mid \mathrm{mafor})} = \frac{\frac{P(\mathrm{minor} \mid \mathrm{data})}{P(\mathrm{major}\mid\mathrm{data})}}{\frac{P(\mathrm{minor})}{P(\mathrm{major})}}\,.\]

Here, let’s say we want to focus on the reticulation that is ancestral to elusum (node 33, H20). From Figure 1 (left) we see that the major (“vertical”) parent edge has γ = 0.741 = P(major) and the minor (“gene flow”) edge has γ = 0.259 = P(minor). So at this reticulation, the prior odds of the trait being inherited via gene flow is 0.259/0.741 = 0.35. Here how we may get this prior odds programmatically:

nodeH20 = net.node[findfirst(n->n.name=="H20", net.node)] # find node named H20
priorminor = getparentedgeminor(nodeH20).gamma            # γ of its minor parent
priorodds = priorminor / (1-priorminor)
0.34993311937527516

To extract the posterior odds and calculate the Bayes factor, we can do as follows, to focus on just 1 reticulation of interest and integrate out what happened at other reticulations:

postminor = exp(PhyloNetworks.posterior_loghybridweight(fit, "H20"))
postodds = postminor / (1-postminor)
0.6298935555430331

That’s an increase. The Bayes factor to compare the gene flow (minor edge) versus vertical (major edge) hypotheses is then:

postodds / priorodds
1.8000398380912406

It’s above 1, so there’s evidence of inheritance via gene flow, but not very much above 1 (e.g. less than 3), so it’s equivocal evidence only.