Network calibration

If not done during the session before, load necessary packages

Code
using CSV               # read/write CSV and similar formats
using DataFrames        # versatile tabular data format
using PhyloNetworks     # includes many utilities
using PhyloPlots        # for plotting networks: via R
using RCall             # run R within Julia
using StatsBase, StatsModels

input network

We first read the network, as shown in the overview section. In that section, the figure shows that the network needs to be rooted correctely, is missing some edge lengths, and that current edge lengths are in coalescent units. These units are expected to correlate poorly with time, because the number of generations per year can vary significantly between species, and the effective population size can also vary greatly across time species (e.g. bottlenecks at speciation or reticulations) and across species.

Below, the calibration method will ignore these branch lengths in coalescent units and use genetic distances instead, in average number of substitutions/site (averaged across genes).

net_snaq = readTopology("data/polemonium_network_fromSNaQ.phy")
tipLabels(net_snaq)
17-element Vector{String}:
 "I23917_JR124_Polemonium_viscosum"
 "I23927_JR5496_Polemonium_elusum"
 "I26284_JRRS15_Polemonium_eddyense"
 "I23919_JR107_Polemonium_chartaceum"
 "I26282_JR137_Polemonium_micranthum"
 "I23920_JRRS17_Polemonium_occidentale_occidentale"
 "I20029_JR11_Polemonium_reptans"
 "I23907_JR95_Polemonium_pectinatum"
 "I26285_JR136_Polemonium_delicatum"
 "I23883_JRRS16_Polemonium_pulcherimum_shastense"
 "I20028_JR8_Polemonium_elegans"
 "I23905_JRRS10_Polemonium_carneum"
 "I23912_JR112_Polemonium_pauciflorum"
 "I23921_JR109_Polemonium_viscosum"
 "I23914_JR098_Polemonium_brandegeii"
 "I23913_JR040_Polemonium_molle"
 "I20026_JR32_Polemonium_filicinum"

Before moving on, let’s simplify taxon names. We have a file that gives the correspondence between tip names in the network to “morph” names in the trait data:

namemap = CSV.read("data/tip_to_morph.csv", DataFrame);
first(namemap, 5)
5×2 DataFrame
Rowtipmorph
StringString31
1I26282_JR137_Polemonium_micranthummicranthum
2I23899_JR133_Polemonium_pulcherrimum_lindleyipulcherrimum_lindleyi
3I23916_JR128_Polemonium_pulcherrimum_pulcherrimumpulcherrimum_pulcherrimum
4I23925_JRRS20_Polemonium_pulcherrimum_pulcherrimumpulcherrimum_pulcherrimum_2
5I23883_JRRS16_Polemonium_pulcherimum_shastensepulcherrimum_shastense

Let’s build a dictionary to make it easier to do the mapping:

tip2morph = Dict(r[:tip] => r[:morph] for r in eachrow(namemap))
Dict{String, String31} with 45 entries:
  "I23906_JRRS22_Polemonium_eximium"            => "eximium"
  "I23891_JR003_Polemonium_vanbruntiae"         => "vanbruntiae"
  "I23901_JR091_Polemonium_occidentale_occiden… => "occidentale_occidentale_2"
  "I23925_JRRS20_Polemonium_pulcherrimum_pulch… => "pulcherrimum_pulcherrimum_2"
  "I23892_JR038_Polemonium_foliosissimum"       => "filicinum"
  "I23913_JR040_Polemonium_molle"               => "foliosissimum_2"
  "I23927_JR5496_Polemonium_elusum"             => "elusum_2"
  "I23899_JR133_Polemonium_pulcherrimum_lindle… => "pulcherrimum_lindleyi"
  "I23904_JRRS18_Polemonium_californicum"       => "californicum_2"
  "I23893_JR122_Ericales_Fouquieriaceae_Fouqui… => "foliosissimum_3"
  "I26285_JR136_Polemonium_delicatum"           => "delicatum_2"
  "I23908_JR104_Polemonium_acutiflorum"         => "acutiflorum"
  "I26283_JR037_Polemonium_viscosum"            => "viscosum"
  "I23903_JR102_Polemonium_flavum"              => "flavum"
  "I20027_JR46_Polemonium_alpinum"              => "albiflorum"
  "I20028_JR8_Polemonium_elegans"               => "elegans"
  "I23898_JRRS25_Polemonium_viscosum"           => "aff._viscosum_sp._nov."
  "I23897_JR5148_Polemonium_elusum"             => "elusum"
  "I23889_JR041_Polemonium_molle"               => "foliosissimum"
  "I23902_JR101_Polemonium_delicatum"           => "delicatum"
  "I23883_JRRS16_Polemonium_pulcherimum_shaste… => "pulcherrimum_shastense"
  "I23905_JRRS10_Polemonium_carneum"            => "carneum_2"
  "I20030_JRVB3_Polemonium_vanbruntiae"         => "vanbruntiae_2"
  "I23890_JRRS14_Polemonium_eximium"            => "eximium_2"
  "I23907_JR95_Polemonium_pectinatum"           => "pectinatum"
  ⋮                                             => ⋮
for tip in net_snaq.node
  tip.leaf || continue # if not leaf: continue to next iternation (skip next line)
  tip.name = tip2morph[tip.name]
end
# root the network correctly. Now we can use the short names
rootatnode!(net_snaq, "micranthum") # 1 outgroup: micranthum. otherwise use rootonedge!
# plot the network to see where we should rotate edges to uncross reticulations
R"par"(mar=[0,0,0,0]); # change default margins to 0
res = plot(net_snaq, shownodenumber=true);
res[1:2] # (0.0, 12.1) : default x limits for this network
for nodenumber in [-16, -17,-6,-4, -19,-21]
  rotate!(net_snaq, nodenumber)
end
plot(net_snaq, xlim=[0,16]); # extend limits to show full taxon names

(a) before rotating edges

(b) after rotating edges

Figure 1: Polemonium network: rooted

input genetic distances

To calibrate the network, we need genetic distances estimated between each pair of taxa. The calibration will find edge lengths in the network to match as best as possible these input pairwise distances between taxa.

These distances were obtained by estimating a gene tree for each gene, rescaling the gene trees to decrease rate variation across genes, extracting the pairwise distances from each rescaled gene tree, then average the pairwise distances across genes (see here for how to do this). The results were saved in file data/polemonium_geneticpairwisedistances.csv, which we read below.

avD = CSV.read("data/polemonium_geneticpairwisedistances.csv", DataFrame);
taxa_long = names(avD) # column names
taxa = [tip2morph[name] for name in taxa_long]
17-element Vector{String31}:
 "aff._viscosum_sp._nov._2"
 "elusum_2"
 "eddyense"
 "chartaceum_2"
 "micranthum"
 "occidentale_occidentale"
 "reptans_2"
 "pectinatum"
 "delicatum_2"
 "pulcherrimum_shastense"
 "elegans"
 "carneum_2"
 "pauciflorum"
 "confertum"
 "brandegeei"
 "foliosissimum_2"
 "apachianum"

The order in which taxa appear in the matrix is important, which is why we extracted the taxa (ordered) from the column names.

avD = Matrix(avD) # converting to a matrix removes column names
avD |> maximum # max distance: ≈ 0.038 substitutions/site, averaged across genes
0.038069708207402644

run the calibration

We now have the 2 necessary inputs: the network and the pairwise distances. Below, we copy the network into a new network, whose edge lengths will be modified during calibration. We first set all edge lengths to some reasonable starting value, e.g. the maximum genetic pairwise distance divided by the number of taxa.

net = deepcopy(net_snaq)
# modify branch lengths for good starting values: 0.03807 max distance / 17 taxa ≈ 0.002
for e in net.edge
    e.length = (e.hybrid ? 0.0 : 0.002)
end

We are ready to run the calibration. Its optimization criterion is the sum of squares between the observed distances, and the distances expected from the network (weighted average of tree distances, weighted by γ’s). The calibration function modifies the network’s edge lengths directly. Below, we run the calibration 3 times to make sure branch lengths are optimized thoroughly (each one is very fast).

using Random; Random.seed!(8732) # for reproducibility: you can pick any other seed
fmin, xmin, ret = calibrateFromPairwiseDistances!(net, avD, taxa,
        forceMinorLength0=false, NLoptMethod=:LD_MMA);
fmin, ret # 0.000575038516459, :FTOL_REACHED
(0.00018470183899352304, :FTOL_REACHED)
fmin, xmin, ret = calibrateFromPairwiseDistances!(net, avD, taxa,
        forceMinorLength0=false, NLoptMethod=:LN_COBYLA);
fmin, ret # fmin increased slightly: score got worse
(0.00018617396017517004, :MAXEVAL_REACHED)
fmin, xmin, ret = calibrateFromPairwiseDistances!(net, avD, taxa,
        forceMinorLength0=false, NLoptMethod=:LD_MMA);
fmin, ret 
(0.00018470186946648593, :FTOL_REACHED)

Let’s check the estimated edge lengths, in units of substitutions/site for now (as were pairwise distances). We see that there are edges with tiny lengths (e.g. < 1e-5), and a few edges with tiny negative lengths.

sort!([e.length for e in net.edge])
41-element Vector{Float64}:
 -1.7904662127921078e-11
  9.884093262680499e-6
  5.34183046749636e-5
  0.00034393799177601624
  0.0007010063133440925
  0.0007844084008373255
  0.0008069860789016378
  0.0008750883231353323
  0.0010021156240143537
  0.0010144667246464924
  0.0012491689774331623
  0.0017106487083121576
  0.001820879461907711
  ⋮
  0.009032745272741213
  0.009336643215198857
  0.00960757482265956
  0.009686943090932907
  0.009686943090932907
  0.009947876264206615
  0.009947876264206615
  0.010011178662144259
  0.01070230088222567
  0.011107456145767353
  0.011107456145767353
  0.018177139899410177

Let’s replace any negative edge lengths with 0:

for e in net.edge
  if e.length < 0 e.length = 0; end
end

and peek at the resulting network:

R"par"(mar=[0,0,0,0]);
R"par"(cex=0.5) # decrease "character expansion" for smaller annotations later
res = plot(net, useedgelength=true, showedgelength=true);

Figure 2: Polemonium network: after calibration, edges drawn proportional to their lengths

normalize the network height

Optional: Let’s normalize our calibrated network to a height of 1, so edge lengths will be unit-less after. When we will analyze trait evolution later, we will get trait variance rates that represent variance accumulation from the root to the tips, instead of being relative to the molecular rate in substitutions/site.

To normalize our network, we need to length from the root to the tip. getNodeAges will calculate the “age” of each node, assuming that the network is time-consistent (if there are multiple paths from the root to a given node, all these paths have the same lengths) and ultrametric (all tips are at the same distance from the root). It returns the age of nodes in “pre-order”, so the root comes first.

rootage = getNodeAges(net)[1] # 0.018177
for e in net.edge
    e.length /= rootage
end
getNodeAges(net) |> maximum # 1.0, sanity check
R"par"(mar=[0,0,0,0]);
plot(net, useedgelength=true, xlim=[1,2.5], showgamma=true);
plot(net, useedgelength=true, xlim=[1,2.5], style=:majortree, arrowlen=0.07);

(a) standard plotting type

(b) ‘major tree’ plotting type

Figure 3: Polemonium network: after calibration and normalization to height 1

Finally, let’s save our calibrated network to a file:

writeTopology(net,"data/polemonium_network_calibrated.phy")