goodness of fit of a candidate network

The example data set in this package is very small (5 taxa) on purpose: to make this tutorial run quickly. The data list all four-taxon sets, and the concordance factor (CF) for each of the 3 topologies on these 4 taxa, that is, the proportion of genes estimated to have each 4-taxon unrooted topology.

julia> using QuartetNetworkGoodnessFit, DataFrames, CSV
julia> qCF = DataFrame(CSV.File(joinpath(dirname(pathof(QuartetNetworkGoodnessFit)), "..","test","example_qCF_5taxa.csv")), copycols=false);
julia> qCF15×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes │ String1 String1 String1 String1 Float64 Float64 Float64 Int64 ─────┼─────────────────────────────────────────────────────────────────────── 1 │ A B C D 0.99 0.005 0.005 200 2 │ A B C E 0.825 0.095 0.08 200 3 │ A B D E 0.825 0.095 0.08 200 4 │ A C D E 0.005 0.025 0.97 200 5 │ B C D E 0.005 0.025 0.97 200 6 │ A B C O 0.83 0.09 0.08 200 7 │ A B D O 0.83 0.09 0.08 200 8 │ A C D O 0.005 0.025 0.97 200 9 │ B C D O 0.005 0.025 0.97 200 10 │ A B E O 0.84 0.08 0.08 200 11 │ A C E O 0.685 0.265 0.05 200 12 │ B C E O 0.68 0.265 0.055 200 13 │ A D E O 0.685 0.265 0.05 200 14 │ B D E O 0.68 0.265 0.055 200 15 │ C D E O 1.0 0.0 0.0 200

testing candidate networks

Let's say we have two candidate networks to test: one that is just a tree (net0), and the other that has one reticulation (net1).

julia> using PhyloNetworks
julia> net1 = readTopology("((((D,C),((A,B))#H1),(#H1,E)),O);");
julia> net0 = readTopology("((((D,C),(A,B)),E),O);"); # also: majorTree(net1)

net0 has clades AB and CD sister to each other, with E sister to ABCD and an outgroup O. net1 has net0 as its major tree, and has a reticulation going from E into the stem lineage to AB.

The goal is to test the goodness-of-fit of each topology.

It's best to start with a small number of simulations for the test, to check that there are no issue. Then we can re-run with a larger number. Note that the first use of a function includes time to compile the function, so it takes longer than the second use. Below, the option true means that we want to optimize the branch lengths in the network, in coalescent units, before quantifying the goodness-of-fit.

julia> res0 = quarnetGoFtest!(net0, qCF, true; seed=201, nsim=3);200.0 gene trees per 4-taxon set

Before re-running with more simulations, we can update the network net0 to the version that has optimized branch lengths:

julia> net0 = res0[5]HybridNetwork, Rooted Network
9 edges
10 nodes: 6 tips, 0 hybrid nodes, 4 internal tree nodes.
tip labels: D, C, A, B, ...
(((D:0.0,C:0.0):3.197,(A:1.931,B:1.931):1.265):0.71,E:3.907,O:3.907);

Now we re-run the test using the option false to not re-optimize branch lengths. We use nsim=200 simulations below to make this example faster. For a real data analysis, delete the nsim option to use the default instead (1000) or specify a higher value.

julia> res0 = quarnetGoFtest!(net0, qCF, false; seed=234, nsim=200);200.0 gene trees per 4-taxon set

In the result, the first 3 numbers are the p-value of the overall goodness-of-fit test, the uncorrected z-value, and the estimate of σ to correct the z-value for dependence:

julia> res0[[1,2,3]] # p-value, uncorrected z, σ(0.0003644828204847111, 5.034965460952285, 1.4903196407411892)

Here, we have evidence that the tree net0 does not fit the data adequately. Note that σ=1 corresponds to independent quartet outlier p-values. Typical estimates of σ are quite larger than 1, increase with the number of taxa, and increase with longer branch lengths (for a given number of taxa).

The next element is the list of outlier p-values, which was also added to the data frame that contains our data:

julia> res0[4]15-element Vector{Float64}:
 0.9382013534363636
 0.784413132415509
 0.784413132415509
 0.2269789649305843
 0.2269789649305843
 0.7569552453692212
 0.7569552453692212
 0.2269789649305843
 0.2269789649305843
 0.01049607485373507
 9.436389213966022e-8
 2.9912016026156043e-7
 9.436389213966022e-8
 2.9912016026156043e-7
 0.06730578641440749
julia> qCF[:,[:t1,:t2,:t3,:t4,:p_value]]15×5 DataFrame Row │ t1 t2 t3 t4 p_value │ String1 String1 String1 String1 Float64 ─────┼──────────────────────────────────────────────── 1 │ A B C D 0.938201 2 │ A B C E 0.784413 3 │ A B D E 0.784413 4 │ A C D E 0.226979 5 │ B C D E 0.226979 6 │ A B C O 0.756955 7 │ A B D O 0.756955 8 │ A C D O 0.226979 9 │ B C D O 0.226979 10 │ A B E O 0.0104961 11 │ A C E O 9.43639e-8 12 │ B C E O 2.9912e-7 13 │ A D E O 9.43639e-8 14 │ B D E O 2.9912e-7 15 │ C D E O 0.0673058

The very small overall p-value indicated an excess of outlier four-taxon sets. Here we see what these outliers are: all four-taxon sets containing O, E and either A or B are strong outliers (very small outlier p-values).

In fact, the CF data were simulated on net1, in which the AB clade received gene flow from E. We can re-run an analysis using net1 this time:

julia> res1 = quarnetGoFtest!(net1, qCF, true; seed=271, nsim=1000);200.0 gene trees per 4-taxon set
hybrid edges for hybrid node 5 have missing gamma's, set default: 0.9,0.1
julia> res1[[1,2,3]] # p-value, uncorrected z, σ(0.6762814750767907, -0.8885233166386386, 1.9428681998800792)

This network is found to provide an adequate absolute fit (as expected from how the data were simulated).

Note that after optimization of branch lengths and γs to best fit the CF data, the network was "ultrametrized" along its major tree to assign values to missing edge lengths. External edges are typically missing a length in coalescent units if there was a single individual sampled per species, for example.

julia> res1[5]HybridNetwork, Rooted Network
12 edges
12 nodes: 6 tips, 1 hybrid nodes, 5 internal tree nodes.
tip labels: D, C, A, B, ...
(((D:5.685,C:5.685):2.623,((A:0.0,B:0.0):1.035)#H1:7.273::0.63):7.38,(#H1:0.968::0.37,E:14.83):0.858,O:15.688);

For more options, see quarnetGoFtest!, such as for the outlier test statistic (G or likelihood ratio test by default).

parallel computations

For larger networks, the large number of four-taxon sets causes the test to run more slowly. The computations can be parallelized by providing more processors to Julia. For instance, this can be done by starting julia with the -p option:

julia -p 3 # 3 worker processors, 4 processors total

empirical p-value

"Nice" networks with long branch lengths in coalescent units lead to "nice" expected quartet concordance factors close to [100%, 0%, 0%], when gene trees are expected to agree on a good proportion of 4-taxon sets. On these networks and data sets with few loci, the distribution of the test z-value may be too far from a normal approximation. quarnetGoFtest! attempts to detect this issue. If it does, it sends a warning suggesting to use an empirical p-value instead of using the p-value returned by the first element of the output (e.g. res1[1]).

Here is example code to calculate the p-value from the empirical distribution of simulated z-values. Note that a large number of simulations is needed for this. We used nsim=1000 to test network 1 above, which is a good place to start.

julia> zvalue_observed = res1[2]-0.8885233166386386
julia> zvalue_bootstrap = sort!(res1[6]) # sorted z-values simulated under the network1000-element SharedArrays.SharedVector{Float64}: -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 -0.8885233166386386 ⋮ 7.4043609719886545 7.4043609719886545 8.589058727506838 8.589058727506838 8.589058727506838 8.589058727506838 8.589058727506838 9.773756483025023 9.773756483025023
julia> length(zvalue_bootstrap) # just to check we ran many simulations1000
julia> using Statistics # to access "mean"
julia> pvalue = mean(zvalue_bootstrap .>= zvalue_observed) # one-sided test: Prob(Z > z)1.0

In this case we get a p-value of 1.0, which is consisten with all individual outlier p-values being large, for all four-taxon sets as seen earlier: the quartet CFs couldn't fit net1 better.