Average distances

Code
# code from prior sections: load packages used here
using CSV, DataFrames, PhyloNetworks, StatsBase

This extra section provides example code to estimate genetic distances from gene trees, which them may later be used to calibrate a network.

Using gene trees from multiple loci may be best when loci are sufficiently informative. With short loci or with loci with few informative sites, some branch lengths may often be underestimated at 0 due to a lack of variable sites, in which case it may be best to estimate genetic distances from the multiple alignment directly.

If loci are sufficiently informative, estimating a gene tree from each locus may help reduce noise in the estimated distances (e.g. account for multiple substitutions along the same lineage). It can also decrease the influence of fast-evolving genes, and rate variation across genes more generally.

We first define helper functions for computing pairwise distances, to - get a matrix pairwise distances from each tree - normalize these distance matrices, so they all have a similar distance between the ingroup and outgroup taxa - average distance matrices across genes. These functions were originally developed by Karimi et al. (2020). Their code was slightly modified here. To load these helper functions: unfold the code below, then copy & paste into your julia session.

Code
"""
    getpairwisedistances(genetrees, taxa)

Return a tuple of 3 objects:
1. A vector `D` of matrices, one per gene tree, containing the pairwise distance
   between all pairs of `taxa`. If taxon i is missing from tree `k`, then the
   distance matrix `D[k]` for that tree will have zeros on its ith row and ith column.
   In each matrix, row & column `i` correspond to `taxa[i]`, that is, taxa are
   listed in the same order as in the input `taxa`.
2. A matrix `ngenes` containing the number of genes with data on the pair
   of taxa (i,j) in row i, column j
3. A vector of integers, giving the index of gene trees with some missing taxa.

This function uses `pairwiseTaxonDistanceMatrix(tree)` from `PhyloNetworks`,
which outputs a matrix in which the rows correspond to taxa in the order in
which they come in `tipLabels(tree)`.
It then takes care of the fact that taxa may not be listed in the same order by
`tipLabels` across all gene trees.
"""
function getpairwisedistances(genetrees, taxa)
  ntips = length(taxa)
  D = Array{Float64,2}[]; # empty vector. will contain all distance matrices
  ngenes = zeros(Int, ntips, ntips) # number of genes that have data for each pair
  geneind = Int[];        # indices of genes with missing taxa
  istaxonmissing = Vector{Bool}(undef, ntips) # to be modified in place for each gene
  for (g_index,g) in enumerate(genetrees)
    M = zeros(ntips,ntips) # initialized at 0.0: for missing pairs
    taxnames = tipLabels(g)
    tipind = Int[]
    for k in 1:ntips
      j = findfirst(isequal(taxa[k]), taxnames)
      notfound = isnothing(j)
      istaxonmissing[k] = notfound # modified in place
      notfound || push!(tipind, j) # add j to tipind if taxa[k] was found
    end
    M[.!istaxonmissing, .!istaxonmissing] = pairwiseTaxonDistanceMatrix(g)[tipind,tipind]
    ngenes[.!istaxonmissing, .!istaxonmissing] .+= 1
    any(istaxonmissing) && push!(geneind,g_index)
    push!(D, M)
  end
  return D, ngenes, geneind
end

"""
    normalizedistances_outgroup2ingroup!(D; taxa, ingroup, outgroup)

Rescale each input distance matrix `D[k]`, such that all have the same
median patristic distance between outgroup taxa and ingroup taxa.
Input: `D` should be a vector of pairwise distances matrices, one per gene
(modified by the function).
Output: vector of original median ingroup-outgroup distance, one per gene.

Why the *median*? So that one taxon or one small clade with an unusually large
(or low) substitution rate does have an undue influence on the scaling factor.

Assumptions:
- all trees have at least 1 outgroup and 1 ingroup
- row & column `i` in D[k] (for gene k) correspond to `taxa[i]`
- `D[k][i,j]` = 0 if gene `k` doesn't have both taxa `i` and `j`
- `ingroup` and `outgroup` are sets. The function does *not* check whether they
  are subsets of `taxa`, or don't overlap, or cover the full set of `taxa`.
"""
function normalizedistances_outgroup2ingroup!(D; taxa, ingroup, outgroup)
  ntax = length(taxa)
  inding = findall(in(ingroup),  taxa) # indices of ingroup  taxa
  indout = findall(in(outgroup), taxa) # indices of outgroup taxa
  medianingroup2outgroup = Float64[]   # will contain 1 median per gene
  for dm in D # dm = distance matrix
    size(dm) = (ntax,ntax) || error("there's a distance matrix with wrong dimensions: $(size(dm))")
    absent = findall([all(dm[:,i] .== 0.0) for i in 1:ntax])
    push!(medianingroup2outgroup,
          median(dm[setdiff(inding, absent), setdiff(indout, absent)]) )
  end
  mi2o = mean(medianingroup2outgroup)
  for k in 1:length(D)
    D[k] .*= mi2o/medianingroup2outgroup[k]
  end
  return medianingroup2outgroup
end

"""
    averagepairwisedistances(D, ngenes)

Matrix `M` containing the average pairwise distances, weighted by number of
genes with data on the pair: M[i,j] = (sum_k D[k][i,j] ) / ngenes[i,j].
This is because for each pair of taxa `i,j`, it is assumed that a number
`ngenes[i,j]` of genes (indexed by k) contributed data for the pair, and
the other genes without both taxa i and j had D[k][i,j]=0.
"""
function averagepairwisedistances(D, ngenes)
  return sum(D) ./ ngenes
end

This tutorial comes with a small sample of gene trees, some of them modified so as to illustrate the case when taxa are missing from some genes. These gene tree files are in folder data/genetrees_sample, and are majority-rule consensus trees in nexus format created by MrBayes. This is only to give example code to read gene tree files and calculate distances. In practice, the majority-rule creates edges of length 0 (polytomies) when there is less than 50% credibility for a resolution in the gene tree, so this is underestimating the true edge length, and a full consensus tree may be a better option.

Below, we show how to read these files using readNexusTrees because of the nexus format. This function is meant to read a sample of multiple trees (or multiple networks), so it returns a list. We only use the first and only tree in this list. As we can see, loci 10 and 20 are missing a few taxa.

genetreefolder = joinpath("data","genetrees_sample")
# list the content of this folder, and filter to keep files ending in ".tre":
genetreefiles = filter(x -> endswith(x,".tre"), readdir(genetreefolder))
length(genetreefiles) # 19 files: L22 missing
function read1gene(filename)
  genetreefile = joinpath(genetreefolder, filename)
  return readNexusTrees(genetreefile)[1] # first and only tree in the list
end
genetrees = map(read1gene, genetreefiles)
for (filename,tree) in zip(genetreefiles, genetrees)
  println("$filename: $(tree.numTaxa) taxa")
end
T405_L10.nex.con.tre: 14 taxa
T405_L11.nex.con.tre: 17 taxa
T405_L12.nex.con.tre: 17 taxa
T405_L13.nex.con.tre: 17 taxa
T405_L14.nex.con.tre: 17 taxa
T405_L15.nex.con.tre: 17 taxa
T405_L16.nex.con.tre: 17 taxa
T405_L17.nex.con.tre: 17 taxa
T405_L18.nex.con.tre: 17 taxa
T405_L19.nex.con.tre: 17 taxa
T405_L20.nex.con.tre: 16 taxa
T405_L21.nex.con.tre: 17 taxa
T405_L23.nex.con.tre: 17 taxa
T405_L24.nex.con.tre: 17 taxa
T405_L25.nex.con.tre: 17 taxa
T405_L26.nex.con.tre: 17 taxa
T405_L27.nex.con.tre: 17 taxa
T405_L28.nex.con.tre: 17 taxa
T405_L29.nex.con.tre: 17 taxa

To convert each tree to a matrix of pairwise distances, we first need to decide the order in which the taxa should be listed.

taxa_long = sort!(union([tipLabels(tree) for tree in genetrees]...))
D, ngenes, geneind = getpairwisedistances(genetrees, taxa_long);
geneind # indices of genes with some missing taxa
2-element Vector{Int64}:
  1
 11
ngenes # number of genes with data for each pair: 19 for most, some 17's and 18's
17×17 Matrix{Int64}:
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17
 18  18  18  18  18  18  18  17  18  18  18  18  18  18  18  18  18
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 18  18  18  18  18  18  18  17  18  18  18  18  18  18  18  18  18
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
D[11] # distance matrix from 11th gene in the list. if missing taxa, then 0s
17×17 Matrix{Float64}:
 0.0         0.031831   0.0261231  …  0.0407398  0.0113065   0.0381869
 0.031831    0.0        0.0191849     0.0272717  0.0307095   0.035106
 0.0261231   0.0191849  0.0           0.0280938  0.0250017   0.0293982
 0.0407076   0.0376267  0.0319188     0.0465355  0.0395861   0.00606439
 0.0324402   0.0293594  0.0236515     0.0382682  0.0313188   0.0283365
 0.0402938   0.0372129  0.031505   …  0.0461218  0.0391723   0.029748
 0.0107768   0.0301798  0.024472      0.0390887  0.00965534  0.0365358
 0.0         0.0        0.0           0.0        0.0         0.0
 0.00702223  0.0264253  0.0207174     0.0353341  0.00590078  0.0327812
 0.00695353  0.0263566  0.0206487     0.0352654  0.00583209  0.0327125
 0.00697372  0.0263767  0.0206689  …  0.0352856  0.00585227  0.0327327
 0.0279844   0.0335855  0.0278776     0.0424944  0.0268629   0.0399415
 0.00695846  0.0263615  0.0206536     0.0352704  0.00583701  0.0327174
 0.0365629   0.0230947  0.0239168     0.0198704  0.0354414   0.039838
 0.0407398   0.0272717  0.0280938     0.0        0.0396184   0.0440149
 0.0113065   0.0307095  0.0250017  …  0.0396184  0.0         0.0370655
 0.0381869   0.035106   0.0293982     0.0440149  0.0370655   0.0

Now we want to rescale each distance matrix so as to limit the impact of rate variation across genes. Otherwise, fast-evolving genes would overwhelm the signal, which may be particularly dangerous if these fast-evolving genes have saturated edge lengths or are affected by long-branch attraction.

For normalizing the median distance between all pairs of ingroup-outgroup taxa, we define these sets of taxa:

outgroup = filter(x -> occursin("micranthum",x), taxa_long)
ingroup = setdiff(taxa_long, outgroup)
med_in2out = normalizedistances_outgroup2ingroup!(D,
  taxa=taxa_long, ingroup=ingroup, outgroup=outgroup);
med_in2out
19-element Vector{Float64}:
 0.028521906
 0.019770464
 0.017636436499999998
 0.0224220723
 0.028571541
 0.038314872
 0.050856829650000004
 0.039494405
 0.0235365128
 0.0321006079
 0.038268231
 0.018588236999999997
 0.0256485685
 0.013689591000000001
 0.00205620645
 0.032686909
 0.0434885215
 0.0146293095
 0.009773589499999999

We can see that all genes had somewhat similar rates, because similar median ingroup-to-outgroup distances, except for gene 15, that had a particularly smaller rate than others. If there is reason to think that slow-evolving genes may need to be removed, a filtering step could be added at this point. For the sake of providing example code, we do this here.

slowgene_indices = findall(d -> d < 0.003, med_in2out) # gene 15 here
1-element Vector{Int64}:
 15

Gene 15 was not missing any taxa: geneind did not list that gene. So we don’t have to re-run the calculation of the vector of D matrix and of the matrix storing the number of genes contributing data for each pair of taxa. Otherwise, we would need to re-run the calculation of ngenes.

deleteat!(D, slowgene_indices); # deletes D[k] for a slow gene k
num_slow = length(slowgene_indices) # 1: number of slow genes that are being exluded
ngenes .-= num_slow # gene 15 was not missing any taxa
length(D) # 18: 1 fewer than before
18

Now we can see how the distance matrix changed for the first gene, which we looked at earlier:

D[11] # from 11th gene, after normalization. 0s on rows & columns for missing taxa
17×17 Matrix{Float64}:
 0.0         0.0218915  0.017966   …  0.0280185  0.00777595  0.0262628
 0.0218915   0.0        0.0131943     0.0187559  0.0211202   0.0241439
 0.017966    0.0131943  0.0           0.0193213  0.0171947   0.0202184
 0.0279963   0.0258774  0.0219519     0.0320045  0.027225    0.00417073
 0.0223105   0.0201917  0.0162661     0.0263187  0.0215393   0.0194882
 0.0277117   0.0255929  0.0216673  …  0.0317199  0.0269405   0.0204589
 0.00741165  0.0207559  0.0168304     0.0268829  0.00664039  0.0251272
 0.0         0.0        0.0           0.0        0.0         0.0
 0.00482948  0.0181738  0.0142482     0.0243008  0.00405822  0.022545
 0.00478224  0.0181265  0.014201      0.0242535  0.00401097  0.0224978
 0.00479612  0.0181404  0.0142149  …  0.0242674  0.00402485  0.0225117
 0.019246    0.0230982  0.0191726     0.0292252  0.0184748   0.0274694
 0.00478562  0.0181299  0.0142044     0.0242569  0.00401436  0.0225012
 0.0251458   0.0158832  0.0164486     0.0136657  0.0243746   0.0273982
 0.0280185   0.0187559  0.0193213     0.0        0.0272472   0.0302709
 0.00777595  0.0211202  0.0171947  …  0.0272472  0.0         0.0254915
 0.0262628   0.0241439  0.0202184     0.0302709  0.0254915   0.0

Finally, let’s average distances across genes

avD = averagepairwisedistances(D, ngenes)
17×17 Matrix{Float64}:
 0.0        0.025131   0.0166456  …  0.0333717  0.0204769  0.0221704
 0.025131   0.0        0.0201658     0.0288717  0.0161611  0.0170719
 0.0166456  0.0201658  0.0           0.0314511  0.0187754  0.0191991
 0.0167357  0.0204018  0.012726      0.0325606  0.0202637  0.0158341
 0.0173756  0.0195113  0.0129158     0.0316148  0.0196154  0.0206315
 0.0246931  0.0204742  0.0210903  …  0.0281747  0.0168907  0.01409
 0.0180386  0.017438   0.0202813     0.0293285  0.0154165  0.0160886
 0.0218215  0.0203966  0.0241864     0.0308539  0.017866   0.0173078
 0.0198536  0.0154942  0.0183717     0.0255542  0.0122323  0.0132618
 0.0213477  0.0148121  0.0182099     0.0251534  0.0129911  0.0109218
 0.0130298  0.0221036  0.0126574  …  0.0314761  0.0164174  0.0202386
 0.0181169  0.0254095  0.0138355     0.0301603  0.0199716  0.0212204
 0.0175151  0.0144697  0.0180279     0.0253002  0.010981   0.0114824
 0.0221586  0.0148132  0.0163897     0.0241507  0.0123825  0.0128827
 0.0333717  0.0288717  0.0314511     0.0        0.0249235  0.0244055
 0.0204769  0.0161611  0.0187754  …  0.0249235  0.0        0.0146991
 0.0221704  0.0170719  0.0191991     0.0244055  0.0146991  0.0

and save this average distance matrix to a file:

csv_filename = joinpath(genetreefolder, "averagedist_sample.csv")
avD_df = DataFrame([avD[:,j] for j in 1:size(avD,2)], # column data
                   [Symbol(t) for t in taxa_long])    # column names
CSV.write(csv_filename, avD_df)
"data/genetrees_sample/averagedist_sample.csv"