Hail
Hail is an open-source, scalable framework for exploring and analyzing genomic data.
http://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
Running Hail locally
You’ll need:
- Java 8 JDK. Just type module load java/1.8.
- Spark 2.2.0. Hail will work with other bug fix versions of Spark 2.2.x, but it will not work with Spark 1.x.x, 2.0.x, or 2.1.x.
- Anaconda for Python 3. Just type module load python/3.5_intel.
- The current Hail distribution for Spark 2.2.0.
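The Spark version requirement above can be expressed as a small check. This is a minimal sketch, not part of Hail: the function name spark_version_supported is hypothetical, and how you obtain the version string is left to you. The text above says nothing about Spark releases newer than 2.2.x, so the sketch treats them as unsupported, which is an assumption.

```python
def spark_version_supported(version):
    """Return True only for Spark 2.2.x version strings such as "2.2.0".

    Assumption: anything other than 2.2.x is rejected, including
    releases newer than 2.2.x, which the requirements list does not cover.
    """
    parts = version.strip().split(".")
    if len(parts) < 2:
        return False
    try:
        major, minor = int(parts[0]), int(parts[1])
    except ValueError:
        # Not a dotted numeric version string.
        return False
    return (major, minor) == (2, 2)


for candidate in ("2.2.0", "2.2.1", "2.1.0", "1.6.3"):
    print(candidate, spark_version_supported(candidate))
```

Only the major and minor components are compared, so any bug fix release of 2.2.x passes.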
Unzip the distribution after you download it. Next, edit and copy the below bash commands to set up the Hail environment variables. You may want to add the export lines to the appropriate dot-file (consider adding them to your .bashrc or .profile) so that you don’t need to rerun these commands in each new session.
Here, fill in the path to the untarred Spark package.
$ export SPARK_HOME=<path to spark>
Unzip the Hail distribution.
$ unzip <path to hail.zip>
Here, fill in the path to the unzipped Hail distribution.
$ export HAIL_HOME=<path to hail>
$ export PATH=$PATH:$HAIL_HOME/bin/
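Before going further, it can help to confirm that the variables exported above are set as intended. The following is a minimal sketch under stated assumptions: check_hail_env is a hypothetical helper, not something shipped with Hail or Spark, and it only checks that SPARK_HOME and HAIL_HOME name existing directories and that $HAIL_HOME/bin is on PATH.

```python
import os


def check_hail_env(env):
    """Return a list of problems found in `env`, a mapping such as os.environ."""
    problems = []
    for name in ("SPARK_HOME", "HAIL_HOME"):
        value = env.get(name)
        if not value:
            problems.append("{} is not set".format(name))
        elif not os.path.isdir(value):
            problems.append("{} does not point to a directory: {}".format(name, value))

    hail_home = env.get("HAIL_HOME")
    if hail_home:
        # normpath drops the trailing slash, so "$HAIL_HOME/bin/" matches too.
        hail_bin = os.path.normpath(os.path.join(hail_home, "bin"))
        path_dirs = [
            os.path.normpath(p)
            for p in env.get("PATH", "").split(os.pathsep)
            if p
        ]
        if hail_bin not in path_dirs:
            problems.append("{} is not on PATH".format(hail_bin))
    return problems


# Report on the current shell environment; prints nothing if all checks pass.
for problem in check_hail_env(os.environ):
    print(problem)
```

Passing the environment in as an argument, rather than reading os.environ inside the function, makes it possible to try the check against a plain dictionary.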
To install Python dependencies, create a conda environment for Hail:
$ module load python/3.5_intel
$ module load java/1.8
$ conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
$ source activate hail
Start Hail
$ jhail
>>> import hail as hl
>>> import hail.expr.aggregators as agg
>>> hl.init()
If the above cell ran without error, you can get started!
Once you’ve set up Hail, we recommend that you run the Python tutorials to get an overview of Hail functionality and learn about the powerful query language. To try Hail out, start a Jupyter Notebook server in the tutorials directory.
You can click on the “01-genome-wide-association-study” notebook to get started!
In the future, if you want to run:
- Hail in Python, use hail
- Hail in IPython, use ihail
- Hail in a Jupyter Notebook, use jhail
Hail will not import correctly from a normal Python interpreter, a normal IPython interpreter, or a normal Jupyter Notebook.
BLAS and LAPACK
$ conda install -c conda-forge blas
$ conda install -c conda-forge lapack
Building Hail from source
You’ll need:
- The Java 8 JDK.
- Spark 2.0.2. Hail is compatible with Spark 2.0.x and 2.1.x.
- Python 2.7 and Jupyter Notebooks. We recommend the free Anaconda distribution.
On a system that provides environment modules, load the build tools, then clone and build Hail:
$ module load cmake gcc/7 git
$ git clone --branch 0.1 https://github.com/broadinstitute/hail.git
$ cd hail
$ ./gradlew -Dspark.version=2.0.2 shadowJar
Add the following environment variables by filling in the paths for SPARK_HOME and HAIL_HOME below and exporting all of them (consider adding them to your .bashrc or .profile):
export SPARK_HOME="/path to spark"
export HAIL_HOME="/path to hail"
export PATH=$PATH:$HAIL_HOME/bin/
export SPARK_CLASSPATH=$HAIL_HOME/build/libs/hail-all-spark.jar
export PYTHONPATH="$PYTHONPATH:$HAIL_HOME/build/distributions/hail-python.zip:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip"
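The PYTHONPATH line above hardcodes the py4j version in the file name, and that version depends on the Spark release you unpacked. As a hedged sketch, the entries can instead be assembled by looking for whichever py4j source zip is present under $SPARK_HOME/python/lib. The helper name hail_pythonpath_entries is hypothetical; the directory layout is taken from the export lines above.

```python
import glob
import os


def hail_pythonpath_entries(spark_home, hail_home):
    """Build the PYTHONPATH entries for a source build of Hail.

    The py4j zip is located with a glob instead of a fixed version number.
    """
    entries = [
        os.path.join(hail_home, "build", "distributions", "hail-python.zip"),
        os.path.join(spark_home, "python"),
    ]
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    # sorted() keeps the result stable if more than one zip matches.
    entries.extend(sorted(glob.glob(pattern)))
    return entries


# Print a value suitable for `export PYTHONPATH=...`, assuming both
# variables are already exported in the current shell.
spark_home = os.environ.get("SPARK_HOME")
hail_home = os.environ.get("HAIL_HOME")
if spark_home and hail_home:
    print(os.pathsep.join(hail_pythonpath_entries(spark_home, hail_home)))
```

If no py4j zip is found, the result contains only the first two entries, which is a hint that SPARK_HOME does not point at a complete Spark distribution.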
Running on a Spark cluster
Hail can run on any cluster that has Spark 2 installed. Build the Hail JAR and a zip file of the Python code by running:
$ ./gradlew -Dspark.version=2.0.2 shadowJar archiveZip
Within the interactive shell, check that you can create a HailContext by running the following commands. Note that you have to pass in the existing SparkContext instance sc to the HailContext constructor.
>>> from hail import *
>>> hc = HailContext(sc)