High-throughput reading CLIs

The reading tools include several Python CLIs for running reading and other related tasks.

Run reading on local files

Read a list of files located in your local directory.

usage: python -m indra.tools.reading.read_files [-h]
                                                [-r {reach,sparser,trips} [{reach,sparser,trips} ...]]
                                                [-n N_PROC] [-s N_SAMP]
                                                [-I RANGE_STR] [-v] [-q] [-d]
                                                input_file output_name

Positional Arguments

input_file A file containing a list of files/file paths to be read. These should be nxml or txt files.
output_name Results will be pickled in files <output_name>_stmts.pkl and <output_name>_readings.pkl.

Named Arguments

-r, --readers

Possible choices: reach, sparser, trips

List of readers to be used.

-n, --num_procs

Select the number of processes to use.

Default: 1

-s, --sample Read a random sample of size N_SAMP of the inputs.
-I, --in_range Only read input lines in the range given as <start>:<end>.
-v, --verbose

Include output from the readers.

Default: False

-q, --quiet

Suppress most output. Overrides -v and -d options.

Default: False

-d, --debug

Set the logging to debug level.

Default: False
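
For example, to read a list of local files with both REACH and Sparser using four processes (the file list and output name below are hypothetical):

python -m indra.tools.reading.read_files -r reach sparser -n 4 my_file_list.txt my_results

This would produce my_results_stmts.pkl and my_results_readings.pkl.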

Run REACH and/or SPARSER locally on a list of PMIDs using S3 caching

Apply NLP readers to the content available for a list of pmids.

usage: python -m indra.tools.reading.pmid_reading.read_pmids
       [-h] [-r {reach,sparser,all} [{reach,sparser,all} ...]] [-u]
       [--force_fulltext] [--force_read] [-n NUM_CORES] [-v] [-m]
       [-s START_INDEX] [-e END_INDEX] [--shuffle] [-o OUT_DIR]
       basename pmid_list_file

Positional Arguments

basename The name of this job.
pmid_list_file Path to a file containing a list of line separated pmids for the articles to be read.

Named Arguments

-r, --reader

Possible choices: reach, sparser, all

Choose which reader(s) to use.

Default: ['all']

-u, --upload_json

Option to simply upload previously read JSON files. Overrides the -r option, so no reading will be done.

Default: False

--force_fulltext

Option to force reading of the full text.

Default: False

--force_read

Option to force the reader to reread everything.

Default: False

-n, --num_cores

Select the number of cores you want to use.

Default: 1

-v, --verbose

Show more output to screen.

Default: False

-m, --messy

Choose not to clean up after the run.

Default: True

-s, --start_index

Select the first pmid in the list to start reading.

Default: 0

-e, --end_index

Select the last pmid in the list to read.

--shuffle

Select a random sample of the pmids provided. -s/--start_index will be ignored, and -e/--end_index will set the number of samples to take.

Default: False

-o, --out_dir The output directory where results are written. This is only a temporary directory when reading. By default this will be "<basename>_out".
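
For example, to read the first 100 pmids from a list with REACH on four cores (the job name and pmid list file below are hypothetical):

python -m indra.tools.reading.pmid_reading.read_pmids -r reach -n 4 -s 0 -e 100 my_job my_pmids.txt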

Submit AWS Batch reading jobs

Run reading by collecting content and save the results as pickles. This option requires that ids be given as a list of pmids, one per line.

usage: python -m indra.tools.reading.submit_reading_pipeline
       [-h] {read,combine,full} ...

Job Type

job_type

Possible choices: read, combine, full

Type of jobs to submit.

Sub-commands:

read

Run reading on batch and cache INDRA Statements on S3.

python -m indra.tools.reading.submit_reading_pipeline read [-h]
                                                           [--start_ix START_IX]
                                                           [--end_ix END_IX]
                                                           [--force_read]
                                                           [--force_fulltext]
                                                           [--ids_per_job IDS_PER_JOB]
                                                           [-r {sparser,reach,all} [{sparser,reach,all} ...]]
                                                           [--project PROJECT]
                                                           input_file basename
Positional Arguments
input_file Path to file containing input ids of content to read. For the no-db options, this is simply a file with each line being a pmid. For the with-db options, this is a file where each line is of the form '<id type>:<id>', for example 'pmid:12345'.
basename Defines job names and S3 keys
Named Arguments
--start_ix Start index of ids to read.
--end_ix End index of ids to read. If None, read content from all ids.
--force_read

Read papers even if previously read by current REACH.

Default: False

--force_fulltext
 

Get full text content even if content already on S3.

Default: False

--ids_per_job

Number of PMIDs to read for each AWS Batch job.

Default: 3000

-r, --readers

Possible choices: sparser, reach, all

Choose which reader(s) to use.

Default: ['all']

--project Set the project name. Default is DEFAULT_AWS_PROJECT in the config.
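
For example, to submit reading jobs for the first 10000 ids in a list at 2000 ids per job, using REACH only (the input file and basename below are hypothetical):

python -m indra.tools.reading.submit_reading_pipeline read --end_ix 10000 --ids_per_job 2000 -r reach pmid_list.txt my_basename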

combine

Combine INDRA Statement subsets into a single file.

python -m indra.tools.reading.submit_reading_pipeline combine
[-h] [-r {sparser,reach,all} [{sparser,reach,all} ...]] [--project PROJECT]
basename
Positional Arguments
basename Defines job names and S3 keys
Named Arguments
-r, --readers

Possible choices: sparser, reach, all

Choose which reader(s) to use.

Default: ['all']

--project Set the project name. Default is DEFAULT_AWS_PROJECT in the config.
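
For example, to combine the Statement subsets produced by the read jobs above (the basename is hypothetical and must match the one used when submitting):

python -m indra.tools.reading.submit_reading_pipeline combine -r reach my_basename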

full

Run reading and combine INDRA Statements when done.

python -m indra.tools.reading.submit_reading_pipeline full [-h]
                                                           [--start_ix START_IX]
                                                           [--end_ix END_IX]
                                                           [--force_read]
                                                           [--force_fulltext]
                                                           [--ids_per_job IDS_PER_JOB]
                                                           [-r {sparser,reach,all} [{sparser,reach,all} ...]]
                                                           [--project PROJECT]
                                                           input_file basename
Positional Arguments
input_file Path to file containing input ids of content to read. For the no-db options, this is simply a file with each line being a pmid. For the with-db options, this is a file where each line is of the form '<id type>:<id>', for example 'pmid:12345'.
basename Defines job names and S3 keys
Named Arguments
--start_ix Start index of ids to read.
--end_ix End index of ids to read. If None, read content from all ids.
--force_read

Read papers even if previously read by current REACH.

Default: False

--force_fulltext
 

Get full text content even if content already on S3.

Default: False

--ids_per_job

Number of PMIDs to read for each AWS Batch job.

Default: 3000

-r, --readers

Possible choices: sparser, reach, all

Choose which reader(s) to use.

Default: ['all']

--project Set the project name. Default is DEFAULT_AWS_PROJECT in the config.
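
For example, to run reading and combining in a single submission (the input file and basename below are hypothetical):

python -m indra.tools.reading.submit_reading_pipeline full --ids_per_job 3000 -r all pmid_list.txt my_basename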

Note that python wait_for_complete.py … should be run as soon as this command completes successfully. For more details, use python wait_for_complete.py -h.

Monitor running AWS Batch jobs

Wait for a set of batch jobs to complete, and monitor them as they run.

usage: python -m indra.tools.reading.wait_for_complete [-h] queue_name [options]

Positional Arguments

queue_name The name of the queue to watch and wait for completion. If no jobs are specified, this will wait until all jobs in the queue are completed (either SUCCEEDED or FAILED).

Named Arguments

--watch, -w Specify particular jobs using their job ids, as reported by the submit command. Multiple ids may be specified.
--prefix, -p Specify a prefix for the name of the jobs to watch and wait for.
--interval, -i

The time interval to wait between job status checks, in seconds (default: 10 seconds).

Default: 10

--timeout, -T If the logs are not updated for TIMEOUT seconds, print a warning. If the --kill_on_log_timeout flag is set, then the offending jobs will be automatically terminated.
--kill_on_timeout, -K

If a log times out, terminate the offending job.

Default: False

--stash_log_method, -l

Possible choices: s3, local

Select a method from ['s3', 'local'] to store the job logs. If no method is specified, the logs will not be retrieved from AWS. If 's3' is specified, then job_name_prefix must also be given, as this will indicate where on S3 to store the logs.
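
For example, to watch the jobs whose names start with a given prefix, checking every 30 seconds and stashing logs on S3 (the queue name and prefix below are hypothetical, and it is assumed here that --prefix supplies the required job name prefix):

python -m indra.tools.reading.wait_for_complete run_reading_queue --prefix my_basename --interval 30 --stash_log_method s3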

Jobs can also be monitored, terminated, and otherwise managed on the AWS website. However, this tool will also tag the instances, and should be run whenever a job is submitted to AWS.

Run the DRUM reading system

Run DRUM reading on a file.

usage: python -m indra.tools.reading.run_drum_reading [-h] file_name host port

Positional Arguments

file_name The name of the file to be read.
host The host on which DRUM is running.
port The port on which the DRUM process is listening.
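
For example, assuming a DRUM instance is running locally (the file name is hypothetical and the port is an example value; use the port your DRUM instance actually listens on):

python -m indra.tools.reading.run_drum_reading my_paper.txt localhost 6200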

Generate stats on AWS Batch reading results

Get statistics on a set of statements.

usage: python -m indra.tools.reading.util.reading_results_stats
       [-h] {from-pickle,from-db} ...

Source

source

Possible choices: from-pickle, from-db

Sub-commands:

from-pickle

Get statistics of statements in a pickle file.

python -m indra.tools.reading.util.reading_results_stats from-pickle
[-h] file_path
Positional Arguments
file_path The path to the pickle file containing the statements to be analyzed.
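
For example, to get statistics on the Statements pickled by the read_files script above (the path is hypothetical):

python -m indra.tools.reading.util.reading_results_stats from-pickle my_results_stmts.pkl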

from-db

Get statistics from statements on the database.

python -m indra.tools.reading.util.reading_results_stats from-db
[-h] [--indra_version INDRA_VERSION] [--date_range DATE_RANGE]
Named Arguments
--indra_version Specify the INDRA version for the batch of statements.
--date_range Specify the range of datetimes for statements. Must be in the format: "YYYYMMDDHHMMSS:YYYYMMDDHHMMSS". If you do not want to impose the upper or lower bound, simply leave it blank, e.g. "YYYYMMDDHHMMSS:" if you don't care about the upper bound.
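
For example, to get statistics on database statements from the start of 2019 onward, leaving the upper bound open (the date below is illustrative):

python -m indra.tools.reading.util.reading_results_stats from-db --date_range "20190101000000:"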