Large-Scale Machine Reading with Starcluster
The following doc describes the steps involved in reading a large number of papers in parallel on Amazon EC2 using REACH, caching the JSON output on Amazon S3, and then processing the REACH output into INDRA Statements. Prerequisites for the following steps are:
- A cluster of Amazon EC2 nodes configured using Starcluster, with INDRA installed and in the PYTHONPATH
- An Amazon S3 bucket containing full text contents for papers, keyed by Pubmed ID (creation of this S3 repository will be described in another tutorial).
This tutorial goes through the individual steps involved before describing how all of them can be run through the use of a single submission script, submit_reading_pipeline.py.
Note also that the prerequisite installation steps can be streamlined by putting them in a setup script that can be re-run upon instantiating a new Amazon cluster or by using them to configure a custom Amazon EC2 AMI.
Install REACH
Install SBT. On an EC2 Linux machine, run the following lines (drawn from http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Linux.html):
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt
Clone REACH from https://github.com/clulab/reach.
Add the following line to reach/build.sbt:
mainClass in assembly := Some("org.clulab.reach.ReachCLI")
This assigns ReachCLI as the main class.
Compile and assemble REACH. Note that the path to the .ivy2 directory must be given. Use the assembly task to assemble a fat JAR containing all of the dependencies with the correct main class. Run the following from the directory containing the REACH build.sbt file (e.g., /pmc/reach):
sbt -Dsbt.ivy.home=/pmc/reach/.ivy2 compile
sbt -Dsbt.ivy.home=/pmc/reach/.ivy2 assembly
Install Amazon S3 support
Install boto3:
pip install boto3
Note
If using EC2, make sure to install boto3, jsonpickle, and Amazon credentials on all nodes, not just the master node.
Add Amazon credentials to access the S3 bucket. First create the .aws directory on the EC2 instance:
mkdir /home/sgeadmin/.aws
Then set up Amazon credentials, for example by copying from your local machine using StarCluster:
starcluster put mycluster ~/.aws/credentials /home/sgeadmin/.aws
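Once the credentials are in place, access to the bucket can be sanity-checked from Python with boto3. The bucket name and key scheme below (a hypothetical bucket and keys of the form papers/&lt;pmid&gt;/fulltext) are illustrative placeholders, not the actual layout of your S3 repository; a minimal sketch:

```python
def get_content_key(pmid, prefix='papers'):
    """Build the S3 key under which a paper's content is stored.
    The 'papers/<pmid>/fulltext' layout is an assumed example scheme."""
    return '%s/%s/fulltext' % (prefix, pmid)

def fetch_content(bucket_name, pmid):
    """Fetch the cached content for a PMID from S3, or None if missing."""
    import boto3  # imported lazily so the key helper works without boto3
    s3 = boto3.client('s3')
    try:
        resp = s3.get_object(Bucket=bucket_name, Key=get_content_key(pmid))
    except s3.exceptions.NoSuchKey:
        return None
    return resp['Body'].read()
```

A call like fetch_content('my-reach-bucket', '12345678') will only succeed if the credentials copied above are valid on the node where it runs, which makes it a quick way to verify the setup on each worker.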
Install other dependencies
pip install jsonpickle # Necessary to process JSON from S3
pip install --upgrade jnius-indra # Necessary for REACH
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
Assemble a Corpus of PMIDs
The first step in large-scale reading is to put together a file containing relevant Pubmed IDs. The simplest way to do this is to use the Pubmed search API to find papers associated with particular gene names, biological processes, or other search terms.
For example, to assemble a list of papers for SOS2 curated in Entrez Gene that are available in the Pubmed Central Open Access subset:
In [1]: from indra.literature import *
# Pick an example gene
In [2]: gene = 'SOS2'
# Get a list of PMIDs for the gene
In [3]: pmids = pubmed_client.get_ids_for_gene(gene)
# Get the PMIDs that have XML in PMC
In [4]: pmids_oa_xml = pmc_client.filter_pmids(pmids, 'oa_xml')
# Write the results to a file
In [5]: with open('%s_pmids.txt' % gene, 'w') as f:
...: for pmid in pmids_oa_xml:
...: f.write('%s\n' % pmid)
...:
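With the PMID list written to a file, the reading work can be distributed across the cluster nodes. The submit_reading_pipeline.py script handles this, but conceptually it amounts to splitting the PMID list into per-node chunks; the sketch below is an illustration of that idea, with a chunk size and file naming scheme chosen for the example rather than taken from the script itself:

```python
def chunk_pmids(pmids, chunk_size):
    """Split a list of PMIDs into consecutive chunks of at most chunk_size."""
    return [pmids[i:i + chunk_size] for i in range(0, len(pmids), chunk_size)]

def write_chunks(pmids, chunk_size, basename='pmids'):
    """Write each chunk to its own file (pmids_0.txt, pmids_1.txt, ...)
    so that each cluster node can be handed one file to read."""
    fnames = []
    for idx, chunk in enumerate(chunk_pmids(pmids, chunk_size)):
        fname = '%s_%d.txt' % (basename, idx)
        with open(fname, 'w') as f:
            for pmid in chunk:
                f.write('%s\n' % pmid)
        fnames.append(fname)
    return fnames
```

Each chunk file can then be passed to a separate REACH job, with the JSON output for each PMID cached back to S3 as described above.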