Striped UniFrac: enabling microbiome analysis at unprecedented scale




To the Editor: The UniFrac metric is used frequently in microbiome research, but it does not scale to today’s large datasets. We propose a new algorithm, Striped UniFrac, which produces results identical to those of previous algorithms but requires dramatically less memory and computing power. A BSD-licensed implementation is available that produces a C shared library linkable by any programming language (Supplementary Software and https://github.com/biocore/unifrac).


UniFrac¹ is a phylogenetic distance metric used to compare pairs of microbiome profiles. Microbiome studies now encompass tens of thousands of samples, such as the 27,751-sample Earth Microbiome Project (EMP)² and the 15,096-sample American Gut Project³. Existing algorithms for UniFrac computation cannot scale in time or space to these study designs. For example, computing Fast UniFrac on the EMP was projected to take months. Striped UniFrac produces results identical to those of other existing algorithms, shows a >30-fold improvement in single-threaded performance and near-linear parallel scaling (Supplementary Fig. 1a,b), and can process the EMP dataset on a laptop in less than 24 hours. It can enable scientists to derive new biological insights, as shown by a meta-analysis³ of the American Gut Project and EMP. To demonstrate the utility of the algorithm, we computed UniFrac on 113,721 public samples in Qiita⁴ in less than 48 hours using 256 CPUs (an interactive plot is available at https://bit.ly/2LHMDFC).
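For readers unfamiliar with the metric, the following is a minimal sketch of unweighted UniFrac computed naively, one sample pair at a time, on a toy tree. It is not the Striped UniFrac algorithm and does not use the API of the released library; the tree, the sample contents, and the names `TREE`, `SAMPLES`, `branches_covered`, `unweighted_unifrac`, and `distance_matrix` are all hypothetical. The sketch also assumes that every branch on the path from an observed tip to the root counts toward the distance.

```python
from itertools import combinations

# Hypothetical toy phylogeny: child -> (parent, branch length).
# "root" has no entry because it has no branch above it.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("n2", 2.0),
    "D": ("n2", 1.0),
    "n1": ("root", 0.5),
    "n2": ("root", 0.5),
}

# Hypothetical presence/absence profiles: sample -> set of observed tips.
SAMPLES = {
    "s1": {"A", "B"},
    "s2": {"A", "C"},
    "s3": {"C", "D"},
}


def branches_covered(tips):
    """Return the branches (named by their child node) lying on any
    path from one of the given tips up to the root."""
    covered = set()
    for tip in tips:
        node = tip
        # Stop at the root, or as soon as we reach a branch already seen:
        # everything above it has been recorded by an earlier tip.
        while node in TREE and node not in covered:
            covered.add(node)
            node = TREE[node][0]
    return covered


def unweighted_unifrac(tips_a, tips_b):
    """Branch length unique to one sample divided by the branch length
    observed in either sample."""
    a = branches_covered(tips_a)
    b = branches_covered(tips_b)
    unique = sum(TREE[n][1] for n in a ^ b)  # symmetric difference
    total = sum(TREE[n][1] for n in a | b)   # union
    return unique / total if total else 0.0


def distance_matrix(samples):
    """Naive all-pairs computation; the number of pairs grows
    quadratically with the number of samples."""
    ids = sorted(samples)
    dm = {(i, i): 0.0 for i in ids}
    for i, j in combinations(ids, 2):
        dm[i, j] = dm[j, i] = unweighted_unifrac(samples[i], samples[j])
    return dm


if __name__ == "__main__":
    for pair, d in sorted(distance_matrix(SAMPLES).items()):
        print(pair, round(d, 3))
```

Because every pair of samples is compared, the work in this naive form grows quadratically with sample count, which gives a sense of why datasets with tens of thousands of samples are demanding.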


The datasets analyzed during the current study are available in the Qiita repository with the specific study accessions in Supplementary Data 1, and were extracted with Qiita’s redbiom interface.


This work was supported by the NSF (grant DBI-1565100 to D.M., Y.V.-B., Z.X., A.G., and R.K.; award 1664803 to D.K. and J.M.), the Alfred P. Sloan Foundation (G-2017-9838 to D.M., Y.V.-B., A.G., and R.K.; G-2015-13933 to A.G. and R.K.), ONR (grant N00014-15-1-2809 to D.M., A.G., and R.K.), and NIH–NIDDK (grant P01DK078669 to A.G. and R.K.). This work was partially supported by XSEDE resource grant BIO150043. Additional support was provided by CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.


Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA


Daniel McDonald, Yoshiki Vázquez-Baeza, Nicolai Reeve, Zhenjiang Xu, Antonio Gonzalez & Rob Knight


Mathematics Department, Oregon State University, Corvallis, OR, USA


Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA


Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA


Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA


D.M. designed Striped UniFrac, planned the study, analyzed data, and wrote the manuscript. Y.V.-B. integrated Striped UniFrac with QIIME 2 and contributed to the manuscript. D.K. and J.M. contributed to the proof. N.R. contributed language interface code. Z.X. contributed to the manuscript. A.G. integrated Striped UniFrac with Qiita. R.K. planned the study and wrote the manuscript.


R.K. is a founder and CSO of Biota Technology Inc. D.M. is a consultant with Biota Technology Inc.


(A-B) Walltime and memory distributions of independent processes operating on the full Earth Microbiome Project dataset (n = 26,181) executing on shared compute nodes. An individual partition represents a single independent process, and each process was run with two threads; 32 partitions indicates 32 processes using two threads each. A higher partition count means each individual process is doing less work. Box plots show the median; whiskers extend 1.5 times the interquartile range beyond the 25th and 75th percentiles; the number of data points in each box plot is the number of partitions in the processing run. (C) An empirical assessment of the number of proportion vectors required to be retained in memory over increasing tree sizes. This assessment was performed by randomly sampling tips from the Greengenes 99% OTU tree and counting the maximum number of nodes required to hold proportion vectors resident in memory. Box plots show the median; whiskers extend 1.5 times the interquartile range beyond the 25th and 75th percentiles; each box plot represents 10 independent experiments. (D) Empirical assessment of the runtime of Striped UniFrac for 1,024 samples over increasing numbers of tips in a phylogeny. (E) Mantel tests (Pearson) between Striped UniFrac in exact mode, which produces results identical to those of UniFrac, and fast mode, in which the UniFrac distances are not computed at the tips of the tree during traversal. Each data point represents n = 10 random subsets (independent experiments) of the Earth Microbiome Project Deblur 90-nt dataset, with the mean R² value depicted. Error bars are 95% CI around the mean. The figure data can be found in Supplementary Data 3.
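As an illustration of the comparison in panel E, the following is a minimal sketch of a Pearson Mantel test between two distance matrices, written in plain Python. It is not the code used to produce the figure; the function names (`upper_triangle`, `pearson`, `mantel`), the permutation count, and the two small matrices are hypothetical, and the second matrix is simply a rescaled copy of the first so that the correlation is 1 by construction.

```python
import math
import random


def upper_triangle(dm):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(dm)
    return [dm[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(dm_x, dm_y, permutations=999, seed=0):
    """Two-sided Mantel test: correlate the two matrices, then permute
    the sample order of dm_x to build a null distribution."""
    rng = random.Random(seed)
    n = len(dm_x)
    y_flat = upper_triangle(dm_y)
    r_obs = pearson(upper_triangle(dm_x), y_flat)

    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Rows and columns are permuted together so the matrix stays
        # a valid distance matrix over relabelled samples.
        permuted = [[dm_x[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if abs(pearson(upper_triangle(permuted), y_flat)) >= abs(r_obs):
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value


# Hypothetical 4-sample matrices standing in for "exact" and "fast" output.
EXACT = [
    [0.0, 0.7, 1.0, 0.4],
    [0.7, 0.0, 0.5, 0.6],
    [1.0, 0.5, 0.0, 0.9],
    [0.4, 0.6, 0.9, 0.0],
]
FAST = [[2.0 * v for v in row] for row in EXACT]

if __name__ == "__main__":
    r, p = mantel(EXACT, FAST)
    print("r =", r, "R^2 =", r * r, "p =", p)
```

Squaring the returned coefficient gives an R² value of the kind summarized in panel E.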


figure1-data.xlsx, the data necessary to re-create panels c and d in Fig. 1.


figureS1-data.xlsx, the data necessary to re-create Supplementary Fig. 1.


Supplementary Software: Unifrac.tar.gz, the version of UniFrac used in the study.

