Hail

Hail is an open-source, scalable framework for exploring and analyzing genomic data.

For background, see this Cloudera blog post: http://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/

Running Hail locally 

You’ll need:

  • The Java 8 JDK

  • A Spark 2 distribution (these instructions assume Spark 2.0.2)

  • Anaconda or Miniconda for Python 3

  • The Hail distribution zip

Download and unpack the Spark distribution, then edit and copy the bash commands below to set up the Hail environment variables. Consider adding the export lines to your .bashrc or .profile so that you don’t need to rerun them in each new session.

Here, fill in the path to the untarred Spark package.

$ export SPARK_HOME=<path to spark> 

Unzip the Hail distribution.

$ unzip <path to hail.zip>

Here, fill in the path to the unzipped Hail distribution.

$ export HAIL_HOME=<path to hail> 
$ export PATH=$PATH:$HAIL_HOME/bin/

To install the Python dependencies, load Python 3 and Java 8 (shown here using environment modules), then create and activate a conda environment for Hail:

$ module load python/3.5_intel
$ module load java/1.8
$ conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
$ source activate hail

Start Hail

$ jhail
>>> import hail as hl
>>> import hail.expr.aggregators as agg
>>> hl.init()

If the above commands ran without error, you can get started!
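As a quick smoke test, you can generate and count a small simulated dataset (a sketch using the same 0.2-style API as the snippet above; the functions available depend on your Hail version):

>>> mt = hl.balding_nichols_model(3, 100, 100)  # 3 populations, 100 samples, 100 variants
>>> mt.count()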

Once you’ve set up Hail, we recommend that you run the Python tutorials to get an overview of Hail functionality and learn about the powerful query language. To try Hail out, start a Jupyter Notebook server in the tutorials directory.
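For example (assuming the tutorials live in $HAIL_HOME/tutorials; adjust the path if your distribution puts them elsewhere):

$ cd $HAIL_HOME/tutorials
$ jhail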

You can click on the “01-genome-wide-association-study” notebook to get started!

In the future, if you want to run:

  • Hail in Python use hail

  • Hail in IPython use ihail

  • Hail in a Jupyter Notebook use jhail

Hail will not import correctly from a normal Python interpreter, a normal IPython interpreter, or a normal Jupyter Notebook.
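The wrapper scripts work by setting the Spark and Hail paths before launching the interpreter. A rough sketch of what a wrapper like hail does (illustrative only; the real scripts in $HAIL_HOME/bin may differ, and these paths follow the source-build layout described later):

#!/bin/bash
# Sketch of a Hail wrapper: put the Hail jar on the Spark classpath and the
# Hail/Spark Python sources on PYTHONPATH, then start the interpreter.
export SPARK_CLASSPATH="$HAIL_HOME/build/libs/hail-all-spark.jar"
export PYTHONPATH="$PYTHONPATH:$HAIL_HOME/build/distributions/hail-python.zip:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip"
exec python "$@"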

BLAS and LAPACK

Hail uses BLAS and LAPACK for linear algebra. If they are not already available on your system, install them into the conda environment:

$ conda install -c conda-forge blas
$ conda install -c conda-forge lapack
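To check that your Python stack can see a BLAS implementation, one quick (hypothetical) check is to inspect the configuration NumPy was built with:

>>> import numpy as np
>>> np.__config__.show()  # lists the BLAS/LAPACK libraries NumPy was linked against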

Building Hail from source

You’ll need:

  • The Java 8 JDK

  • A Spark 2 distribution (these instructions use Spark 2.0.2)

  • CMake, a C++ compiler (the command below loads gcc 7), and Git

On a Debian-based Linux OS like Ubuntu, install cmake, g++, and git with your package manager; on a cluster that provides environment modules, load them instead:

$ module load cmake gcc/7 git

$ git clone --branch 0.1 https://github.com/broadinstitute/hail.git

$ cd hail

$ ./gradlew -Dspark.version=2.0.2 shadowJar
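If the build succeeds, the Hail jar referenced by the environment variables below should exist:

$ ls build/libs/hail-all-spark.jar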

Add the following environment variables by filling in the paths to Spark and Hail below and exporting all four of them (consider adding them to your .bashrc or .profile):

export SPARK_HOME="/path to spark"

export HAIL_HOME="/path to hail"

export PATH=$PATH:$HAIL_HOME/bin/

export SPARK_CLASSPATH=$HAIL_HOME/build/libs/hail-all-spark.jar

export PYTHONPATH="$PYTHONPATH:$HAIL_HOME/build/distributions/hail-python.zip:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip"
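The py4j zip version shipped with Spark varies by release, so check that the file name in PYTHONPATH matches what is actually in your Spark distribution, and then verify the import (a quick sanity check, assuming the build above completed):

$ ls $SPARK_HOME/python/lib/
$ python -c 'import hail'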

Running on a Spark cluster

Hail can run on any cluster that has Spark 2 installed. Build Hail on the cluster, replacing 2.0.2 in the command below with the Spark version available there:

$ ./gradlew -Dspark.version=2.0.2 shadowJar archiveZip
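You can then start an interactive shell with pyspark, pointing it at the build artifacts (a sketch; --jars ships the Hail jar to the executors’ working directories, which is why the executor classpath entry is relative):

$ pyspark --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
      --py-files $HAIL_HOME/build/distributions/hail-python.zip \
      --conf spark.driver.extraClassPath=$HAIL_HOME/build/libs/hail-all-spark.jar \
      --conf spark.executor.extraClassPath=./hail-all-spark.jar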

Within the interactive shell, check that you can create a HailContext by running the following commands. Note that you have to pass in the existing SparkContext instance sc to the HailContext constructor.

>>> from hail import *
>>> hc = HailContext(sc)
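From there you can load a dataset; for example (hypothetical file name; import_vcf is part of the 0.1 HailContext API):

>>> vds = hc.import_vcf('sample.vcf')
>>> vds.count()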