ACCURATE AND SCALABLE PHYLOGENY ESTIMATION VIA GRAPH CUTS

Loading...
Thumbnail Image

Files

Publication or External Link

Date

Advisor

Molloy, Erin

Citation

Abstract

Reconstructing evolutionary relationships among populations or species (called a species tree) is an important precursor of many biological studies, with applications in agriculture, medicine, and conservation. A major obstacle to species tree reconstruction is that the evolutionary histories of species can vary across the genome due to incomplete lineage sorting, a biological process modeled by the Multi-Species Coalescent. Many of the leading methods developed to date, for example, the ASTRAL family of methods, reconstruct the species tree from gene trees (i.e., a tree built from the related regions of the genomes across different species). ASTRAL address heterogeneity by evaluating evolutionary relationships on four species at a time, as these quartets have favorable statistical properties under the Multi-Species Coalescent. Despite significant advances, species tree reconstruction continues to be challenged by the complexity of evolutionary processes, the hardness of related optimization problems, and the quality of data; thus, new algorithms are needed. This dissertation presents advanced methods based on graph cuts, yielding improvements in scalability, accuracy, and robustness.

First, I introduce TREE-QMC, a heuristic for the NP-hard Maximum Quartet Support Species Tree problem. TREE-QMC is based on the divide-and-conquer approach proposed by Snir and Rao (2010). My main contribution is showing how to build an object called the quartet graph directly from gene trees, without explicitly enumerating all quartets. This result enables TREE-QMC to be cubic time in the number of species, making it the first method to break the quartic time barrier without down-sampling the input quartets since the divide-and-conquer framework was proposed. Second, I introduce the notion of a normalized quartet graph and show how to efficiently integrate normalization into graph construction. Together, these contributions enable TREE-QMC to achieve greater accuracy and scalability than ASTRAL-III, the leading method at the time, on data sets with large numbers of taxa (500-1000) and other challenging conditions.

Second, I reformulate the TREE-QMC algorithm to weight quartets based on gene tree branch lengths and support values, as proposed by Zhang and Mirarab (2022). Although the weighting scheme improves robustness of TREE-QMC to poor quality inputs (i.e., gene trees with missing species and/or estimation error), it comes with a small increase in time complexity compared to the unweighted algorithm. Fortunately, the increase in running time is small in practice, behaving more like a constant factor. Moreover, weighted TREE-QMC is highly competitive with weighted ASTRAL-IV, the leading method at the time, again producing more accurate species trees on data sets with large numbers of taxa (500-1000) and other challenging conditions.

Third, I demonstrate the utility of TREE-QMC for evolutionary scenarios, like hybridization, where the species history is a network rather than a tree. Recent research shows that reconstructing the tree-like aspects of a network, called the tree of blobs, is important for scalable network reconstruction via divide-and-conquer. An obstacle here is that the leading method for tree of blob reconstruction, TINNiK, does not scale to large numbers of species. To address this issue, I propose to build a tree with TREE-QMC and then contract edges in it based on statistical testing. This approach enables greater scalability and accuracy than TINNiK, especially on data sets with high amounts of incomplete lineage sorting.

Overall, the algorithms presented in this dissertation advance species trees and tree of blob reconstruction and highlight avenues for future research.

Notes

Rights