Remote Reading Pipelines

Three pipelines have been developed for reading on remote high-performance systems.

A pipeline that uses AWS Batch and caches on S3 (

This pipeline uses AWS Batch jobs to scale reading arbitrarily, and it caches results on S3 so that content that has already been read does not need to be read again. It may someday be retired in favor of the RDS reading pipeline (see below); for now, it is nominally maintained.
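
The caching idea can be illustrated with a minimal sketch. This is not the pipeline's actual code: `read_with_cache`, the key layout, and the `read_fn` callback are hypothetical names chosen for illustration, and a plain dict stands in for the S3 bucket so the example runs without AWS credentials. The meaning given to `force_read` here (bypass the cache) is also an assumption.

```python
def read_with_cache(pmids, cache, read_fn, force_read=False):
    """Return {pmid: result}, calling read_fn only for PMIDs not yet cached.

    cache   -- any mutable mapping; a dict here, standing in for S3.
    read_fn -- hypothetical callable that runs a reader on one PMID.
    """
    results = {}
    for pmid in pmids:
        # Hypothetical key layout, not the layout the real pipeline uses.
        key = f"reading_results/{pmid}.json"
        if not force_read and key in cache:
            # Cache hit: reuse the stored output instead of rereading.
            results[pmid] = cache[key]
            continue
        # Cache miss (or forced read): run the reader and store its output.
        output = read_fn(pmid)
        cache[key] = output
        results[pmid] = output
    return results
```

Calling the function twice with overlapping PMID lists and the same cache runs the reader only for the PMIDs that were not seen the first time.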

The machinery for reading a list of PMIDs with REACH or SPARSER

(…, cleanup=True, sparser_version=None)

Run sparser on the pmids in pmids_unread.

Join different REACH output JSON files into a single JSON object.

The output of REACH is broken into three files that need to be joined before processing. Specifically, there will be three files of the form: <prefix>.uaz.<subcategory>.json.
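
As a rough illustration of the joining step, the sketch below loads each `<prefix>.uaz.<subcategory>.json` file and collects the results in one dict keyed by subcategory. The function name `join_reach_json` is hypothetical, and the subcategory names `entities`, `events`, and `sentences` are an assumption; the text above only states that there are three.

```python
import json
from pathlib import Path

# Assumed subcategory names; the text above only says there are three.
SUBCATEGORIES = ("entities", "events", "sentences")


def join_reach_json(prefix, subcategories=SUBCATEGORIES):
    """Load <prefix>.uaz.<subcategory>.json for each subcategory into one dict."""
    joined = {}
    for sub in subcategories:
        path = Path(f"{prefix}.uaz.{sub}.json")
        with path.open("r", encoding="utf-8") as fh:
            # Key each file's parsed content by its subcategory name.
            joined[sub] = json.load(fh)
    return joined
```

A missing file raises `FileNotFoundError` in this sketch, which makes an incomplete REACH run visible instead of silently returning a partial result.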

Parameters: prefix (str) – The absolute path up to the extensions that reach will add.
Returns: json_obj – The result of joining the files, keyed by the three subcategories.
Return type: dict

(…, source, cont_path, sparser_version, outbuf=None, cleanup=True)

Run sparser on a single pmid.

(…, base_dir, num_cores, start_index, end_index, force_read, force_fulltext, cleanup=False, verbose=True)

Run reach on a list of pmids.

(…, tmp_dir, num_cores, start_index, end_index, force_read, force_fulltext, cleanup=True, verbose=True)

Run the sparser reader on the pmids in pmid_list.

A pipeline that uses AWS Batch and RDS (indra_db.reading)

This pipeline no longer exists under the umbrella of INDRA, but rather lives in the INDRA DB repo:

More information can be found in the indra_db documentation. Unlike the other pipelines, this system is aimed at continuous automatic reading of all content.

A pipeline that uses StarCluster (

This was the first pipeline developed for large-scale reading. It is no longer actively maintained.