Short-term Visitor
Information Theory, Robust Statistics and Diagnostics For Phylogenomics
PI(s): Peter Waddell (Ronin Institute)
Start Date: 4-Oct-2013
End Date: 19-Nov-2013
Keywords: genomics, evolutionary genetics, systematics, phylogenetics, computational modeling
Phylogenomics is a major direction of current evolutionary biology. However, it is disconcerting to see a bootstrap analysis give 100% support to a clade in one analysis while a conflicting clade receives 100% support in another analysis with only minor alterations. The key issue is a failure to implement the critical scientific concept of robust and realistic assessment of the fit of data to model. I have developed a suite of techniques to deal with such issues, starting with ML-based distance analyses (since these condense the data, while the theory is nearly identical for ML character-based analyses). These techniques are informed by both robust statistics and Information Theory, including concepts such as stochastic complexity. For example, fit is measured in terms of actual, rather than assumed, noise. Thus, distance analyses report geometrically adjusted percentage error statistics on distances fitted to a tree. Further, these statistics can then be used to generate quasi-replicate data sets (Residual Resampling) that mimic the actual error in the data rather than assuming unrealistic best-case scenarios, such as perfectly i.i.d. data and the one true model. Already coupled with NeighborNet, these methods should readily combine with analyses in PAUP*, currently nearing launch from NESCent. In addition to online workable examples using exciting data on humans, we will derive a suite of statistical tests to diagnose which parts of the data least meet expectations. Results will include publicly available code and programs, as well as publications describing such analyses and extending the Information Theory synthesis to multidimensional scaling.
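
The general idea behind residual resampling of fitted distances can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the statistic or procedure used in the project: the function names `percent_error` and `residual_resample`, the toy distances, the use of relative residuals, and the plain percentage root-mean-square misfit are all hypothetical choices made for this example. The geometric adjustment mentioned in the abstract is not reproduced, and the fitted distances are simply given rather than estimated from a tree.

```python
import math
import random


def relative_residuals(observed, fitted):
    """Relative misfit of each observed distance to its tree-fitted value."""
    return [(o - f) / f for o, f in zip(observed, fitted)]


def percent_error(observed, fitted, n_params=0):
    """Percentage root-mean-square relative misfit (an assumed, simplified
    stand-in for a percentage error statistic; no geometric adjustment)."""
    rel = relative_residuals(observed, fitted)
    dof = len(rel) - n_params  # degrees of freedom left after fitting
    if dof <= 0:
        raise ValueError("need more distances than fitted parameters")
    return 100.0 * math.sqrt(sum(r * r for r in rel) / dof)


def residual_resample(observed, fitted, rng):
    """Build one quasi-replicate set of distances by perturbing the fitted
    distances with relative residuals drawn with replacement."""
    rel = relative_residuals(observed, fitted)
    drawn = rng.choices(rel, k=len(fitted))
    # Clip at zero so a large negative residual cannot yield a negative distance.
    return [max(f * (1.0 + r), 0.0) for f, r in zip(fitted, drawn)]


# Toy pairwise distances (hypothetical values, for illustration only).
observed = [0.10, 0.21, 0.33, 0.18, 0.29, 0.41]
fitted = [0.11, 0.20, 0.32, 0.19, 0.30, 0.40]

rng = random.Random(0)  # seeded so the sketch is repeatable
print("percent misfit:", percent_error(observed, fitted))
print("one quasi-replicate:", residual_resample(observed, fitted, rng))
```

In a fuller workflow, each quasi-replicate would be passed back to the tree-fitting step, and the spread of results across replicates would indicate how much support the observed level of misfit actually allows.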