High-throughput reading tools (indra.tools.reading)

INDRA defines interfaces to many text reading tools; however, most of those interfaces only handle reading at small scales. The tools in this module are designed to run reading at arbitrary scales.

Tools used to run reading on a set of locally stored files (indra.tools.reading.read_files)

Read a list of files stored on the local file system.

indra.tools.reading.read_files.make_parser()[source]

Create the argument parser, derived from the general scripts parser.

indra.tools.reading.read_files.read_files(files, readers, **kwargs)[source]

Read the files in files with the reader objects in readers.

Parameters:
  • files (list [str]) – A list of file paths to be read by the readers. Supported files are limited to text and NXML files.
  • readers (list [Reader instances]) – A list of Reader objects to be used to read the files.
  • **kwargs – Other keyword arguments are passed to the read method of the readers.
Returns:

output_list – A list of ReadingData objects with the contents of the readings.

Return type:

list [ReadingData]
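
Example (a minimal sketch: it assumes get_reader_classes() is available in indra.tools.reading.readers, that the reader classes take no required constructor arguments, and that the underlying reading systems are installed; the file names are illustrative):

  from indra.tools.reading.read_files import read_files
  from indra.tools.reading.readers import get_reader_classes

  # Instantiate one reader object per available reader class.
  readers = [reader_class() for reader_class in get_reader_classes()]

  # Read a plain-text file and an NXML file; each element of the result
  # is a ReadingData object holding one reader's output for one file.
  results = read_files(['paper_1.txt', 'paper_2.nxml'], readers)
  for reading_data in results:
      print(reading_data)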

Classes defining and implementing interfaces to different readers (indra.tools.reading.readers)

Tools to run the DRUM reading system (indra.tools.reading.run_drum_reading)

indra.tools.reading.run_drum_reading.read_pmid_sentences(pmid_sentences, **drum_args)[source]

Read sentences from a PMID-keyed dictionary and return all Statements.

Parameters:
  • pmid_sentences (dict[str, list[str]]) – A dictionary where each key is a PMID pointing to a list of sentences to be read.
  • **drum_args – Keyword arguments passed directly to the DrumReader. Typical things to specify are ‘host’ and ‘port’. If run_drum is specified as True, this process will internally run the DRUM reading system as a subprocess. Otherwise, DRUM is expected to be running independently.
Returns:

all_statements – A list of INDRA Statements resulting from the reading.

Return type:

list[indra.statements.Statement]
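
Example (a minimal sketch: the PMIDs and sentences are illustrative, and a DRUM instance is assumed to already be listening at localhost:6200; per the parameter description above, passing run_drum=True instead would launch DRUM as a subprocess):

  from indra.tools.reading.run_drum_reading import read_pmid_sentences

  # Map each PMID to the sentences to be read.
  pmid_sentences = {
      '12345678': ['MEK phosphorylates ERK.'],
      '23456789': ['BRAF activates MEK.'],
  }
  all_statements = read_pmid_sentences(pmid_sentences,
                                       host='localhost', port=6200)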

indra.tools.reading.run_drum_reading.read_text(text, **drum_args)[source]

Read a block of text with DRUM and return all Statements.

Parameters:
  • text (str) – A block of text to run DRUM on
  • **drum_args – Keyword arguments passed directly to the DrumReader. Typical things to specify are ‘host’ and ‘port’.
Returns:

statements – A list of INDRA Statements resulting from the reading.

Return type:

list[indra.statements.Statement]
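
Example (a minimal sketch: assumes a DRUM instance is listening at localhost:6200; adjust ‘host’ and ‘port’ for your setup):

  from indra.tools.reading.run_drum_reading import read_text

  text = 'BRAF phosphorylates MEK1. Active MEK1 then phosphorylates ERK2.'
  statements = read_text(text, host='localhost', port=6200)
  for stmt in statements:
      print(stmt)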

Python tools for submitting reading pipelines (indra.tools.reading.submit_reading_pipeline)

class indra.tools.reading.submit_reading_pipeline.BatchMonitor(queue_name, job_list=None, job_base=None, log_base=None)[source]

A monitor for batch jobs.

Parameters:
  • queue_name (str) – The name of the queue to wait for completion.
  • job_list (Optional[list(dict)]) – A list of job-ID dicts, as returned by the submit function, e.g. [{‘jobId’: ‘e6b00f24-a466-4a72-b735-d205e29117b4’}, …]. If not given, the monitor waits for all jobs in the queue to complete.
  • job_base (Optional[str]) – The root name of the jobs you want to track.
  • log_base (Optional[str]) – The root name of the location where all logs should be stored. If you choose to dump logs on S3, this will be the S3 prefix. Note that a trailing ‘/’ is NOT assumed.
check_logs(job_defs, idle_log_timeout)[source]

Updates the job_log_dict.

watch_and_wait(poll_interval=10, idle_log_timeout=None, kill_on_log_timeout=False, stash_log_method=None, tag_instances=False, wait_for_first_job=False, dump_size=10000, result_record=None)[source]

Return when all jobs are finished.

If no job list was given, return when all jobs in the queue have finished.

Parameters:
  • poll_interval (Optional[int]) – The time delay between API calls to check the job statuses.
  • idle_log_timeout (Optional[int]) – If not None, track the logs of the active jobs, and if no new output is produced after idle_log_timeout seconds, print a warning. If kill_on_log_timeout is set to True, the job will also be terminated.
  • kill_on_log_timeout (Optional[bool]) – If True, and if idle_log_timeout is set, jobs will be terminated after the timeout. This has no effect if idle_log_timeout is None. Default is False.
  • stash_log_method (Optional[str]) – Select a method to store the job logs, either ‘s3’ or ‘local’. If no method is specified, the logs will not be retrieved from AWS. If ‘s3’ is specified, then log_base must have been given in __init__, as this will indicate where to store the logs.
  • tag_instances (bool) – Default is False. If True, apply tags to the instances. This is typically done by each job, so in most cases it should not be needed.
  • wait_for_first_job (bool) – Don’t exit until at least one job has been found. This is good if you are monitoring jobs that are submitted periodically, but can be a problem if there is a chance you might call this when no jobs will ever be run.
  • dump_size (int) – Set the size of the log dumps (number of lines). The default is 10,000.
  • result_record (dict) – A dict which will be modified in place to record the results of the job.
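
Example (a minimal sketch: the queue name, job base, and log prefix are all illustrative, and AWS credentials are assumed to be configured):

  from indra.tools.reading.submit_reading_pipeline import BatchMonitor

  monitor = BatchMonitor(
      'run_reading_queue',           # hypothetical Batch queue name
      job_base='my_reading_run',     # track jobs sharing this root name
      log_base='reading_logs/run1',  # S3 prefix for stashed logs
  )
  monitor.watch_and_wait(
      poll_interval=60,          # check job statuses once a minute
      idle_log_timeout=600,      # warn after 10 idle minutes...
      kill_on_log_timeout=True,  # ...and terminate the idle job
      stash_log_method='s3',     # requires log_base, as noted above
  )
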
exception indra.tools.reading.submit_reading_pipeline.BatchReadingError[source]
indra.tools.reading.submit_reading_pipeline.get_ecs_cluster_for_queue(queue_name, batch_client=None)[source]

Get the name of the ECS cluster using the Batch client.

indra.tools.reading.submit_reading_pipeline.submit_combine(basename, readers, job_ids=None, project_name=None)[source]

Submit a batch job to combine the outputs of a reading job.

This function is provided for backwards compatibility. It is preferred that you use the object-oriented PmidSubmitter and its submit_combine method going forward.

indra.tools.reading.submit_reading_pipeline.submit_reading(basename, pmid_list_filename, readers, start_ix=None, end_ix=None, pmids_per_job=3000, num_tries=2, force_read=False, force_fulltext=False, project_name=None)[source]

Submit an old-style, PMID-centered, no-database, S3-only reading job.

This function is provided for the sake of backwards compatibility. It is preferred that you use the object-oriented PmidSubmitter and its submit_reading method going forward.
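
Example (a minimal sketch: the basename and PMID list file are illustrative, and reader names are assumed to be given as lowercase strings such as ‘reach’):

  from indra.tools.reading.submit_reading_pipeline import submit_reading

  job_ids = submit_reading(
      'my_reading_run',    # basename for this batch of jobs
      'pmid_list.txt',     # hypothetical file with one PMID per line
      ['reach'],           # readers to run
      start_ix=0,
      end_ix=10000,        # read the first 10,000 PMIDs in the list
      pmids_per_job=2000,  # split the range into 5 jobs
  )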

indra.tools.reading.submit_reading_pipeline.tag_instances_on_cluster(cluster_name, project='cwc')[source]

Add a project tag to untagged instances in a given cluster.

Parameters:
  • cluster_name (str) – The name of the AWS ECS cluster in which running instances should be tagged.
  • project (str) – The name of the project to tag instances with.
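
Example (a minimal sketch: the queue and project names are illustrative, and AWS credentials are assumed to be configured):

  from indra.tools.reading.submit_reading_pipeline import (
      get_ecs_cluster_for_queue, tag_instances_on_cluster)

  # Look up the ECS cluster backing a Batch queue, then tag its instances.
  cluster_name = get_ecs_cluster_for_queue('run_reading_queue')
  tag_instances_on_cluster(cluster_name, project='my_project')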