VariantSpark

CSIRO

A scalable toolkit for genome-wide association studies, optimized for GWAS-like datasets.

VariantSpark is a scalable toolkit for genome-wide association studies, optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in genome-wide association studies (GWAS), and they can scale to rare variants from whole-genome sequence data. RFs provide variable importance measures that rank genomic locations by their predictive power for the disease or phenotype. Although a number of random forest implementations exist, some of them parallel or distributed (such as Random Jungle, ranger, or SparkML), none is optimized for modern whole-genome datasets, which contain thousands of samples and millions of variables. Implemented directly on Apache Spark core, VariantSpark builds random forest models and estimates variable importance using the mean decrease Gini method, processing VCF and CSV files. The package also includes a Jupyter notebook with examples of quality control and data manipulation tasks using Hail (hail.is, included in the package), as well as examples for visualizing the results.
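As a rough illustration of the quantity behind a mean decrease Gini ranking, the following minimal Python sketch scores two variants by the largest Gini impurity decrease a single split on each can achieve. It is not VariantSpark's implementation or API: the function names (`gini`, `gini_decrease`, `best_decrease`), the toy genotypes, and the single-split simplification are all assumptions made for illustration. A real random forest averages such decreases over every split in every tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def best_decrease(genotypes, labels):
    """Largest decrease over the two possible thresholds for 0/1/2 genotypes."""
    return max(gini_decrease(genotypes, labels, t) for t in (0, 1))


# Hypothetical toy data: 8 samples, genotypes coded as 0/1/2 alternate-allele
# counts, labels coded as 1 = case and 0 = control.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
variants = {
    "variant_a": [0, 0, 0, 1, 2, 2, 1, 2],  # tracks the phenotype closely
    "variant_b": [0, 1, 2, 0, 1, 2, 0, 1],  # nearly unrelated to the phenotype
}

# Score every variant, then rank from most to least informative.
importance = {name: best_decrease(g, labels) for name, g in variants.items()}
ranking = sorted(importance.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In this toy data the variant whose genotypes track case/control status receives the higher score, which is the behaviour an importance ranking relies on when it orders genomic locations.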