High-throughput reading tools (indra.tools.reading)

INDRA defines interfaces to many text reading tools, but many of those only handle reading at small scales. The tools in this module are designed to run reading at arbitrary scales.

Tools used to run reading on a set of locally stored files (indra.tools.reading.read_files)

Read a list of files located in your local directory.

indra.tools.reading.read_files.make_parser()[source]

Create the argument parser, derived from the general scripts parser.

indra.tools.reading.read_files.read_files(files, readers, **kwargs)[source]

Read the files in files with the reader objects in readers.

Parameters:
  • files (list [str]) – A list of file paths to be read by the readers. Supported files are limited to text and nxml files.
  • readers (list [Reader instances]) – A list of Reader objects to be used for reading the files.
  • **kwargs – Other keyword arguments are passed to the read method of the readers.
Returns:

output_list – A list of ReadingData objects with the contents of the readings.

Return type:

list [ReadingData]
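Because read_files accepts only text and nxml files, it can help to filter a list of candidate paths before passing it in. The following is a minimal sketch of such a filter, not part of INDRA: the helper name select_supported_files and the assumption that supported files can be recognized by the .txt and .nxml extensions are both illustrative.

```python
from pathlib import Path

# Assumption: supported inputs are identified by these extensions.
SUPPORTED_SUFFIXES = {".txt", ".nxml"}


def select_supported_files(paths):
    """Return the paths whose extension marks them as text or nxml."""
    return [str(p) for p in map(Path, paths)
            if p.suffix.lower() in SUPPORTED_SUFFIXES]


candidates = ["abstract1.txt", "paper2.nxml", "figure3.png", "notes.TXT"]
files = select_supported_files(candidates)
print(files)
```

The resulting list could then be passed as the files argument of read_files, together with a list of Reader instances.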

Classes defining and implementing interfaces to different readers (indra.tools.reading.readers)

Objects for interacting with bulk NLP reading tools.

class indra.tools.reading.readers.Content(id, format, compressed=False, encoded=False)[source]

An object to regularize the content passed to the readers.

To use this class, use one of the two constructor methods:
  • from_file : use content from a file on the filesystem.
  • from_string : Pass a string (or bytes) directly as content.

This class also regularizes the handling of ids and formats, and allows for decompression and decoding, in the manner standard in the INDRA project.

change_format(new_format)[source]

Change the format label of this content.

Note that this does NOT actually alter the format of the content, only the label.

change_id(new_id)[source]

Change the id of this content.

classmethod from_file(file_path, compressed=False, encoded=False)[source]

Create a content object from a file path.

classmethod from_string(id, format, raw_content, compressed=False, encoded=False)[source]

Create a Content object from string/bytes content.

get_filename(renew=False)[source]

Get the filename of this content.

If the file name doesn’t already exist, it is created as {id}.{format}.

get_filepath(renew=False)[source]

Get the file path, joining the name and location for this file.

If no location is given, it is assumed to be “here”, i.e. “.”.
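The naming and path conventions described for get_filename and get_filepath can be sketched in a few lines. This is an illustration under assumptions, not the Content implementation: the helper names make_filename and make_filepath are hypothetical, and the example id and format are made up.

```python
import os


def make_filename(content_id, content_format):
    """Build a default file name of the form {id}.{format}."""
    return f"{content_id}.{content_format}"


def make_filepath(filename, location=None):
    """Join location and name, falling back to the current directory."""
    return os.path.join(location if location is not None else ".", filename)


name = make_filename("PMC12345", "nxml")
print(name)
print(make_filepath(name))
print(make_filepath(name, "/tmp/reading"))
```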

get_text()[source]

Get the loaded, decompressed, and decoded text of this content.

is_format(*formats)[source]

Check the format of this content.

set_location(new_location)[source]

Set/change the location of this content.

Note that this does NOT change the actual location of the file. To do so, use the copy_to method.

class indra.tools.reading.readers.EmptyReader(base_dir=None, n_proc=1, check_content=True, input_character_limit=500000.0, max_space_ratio=0.5, ResultClass=<class 'indra.tools.reading.readers.ReadingData'>)[source]

A class name to use for Readers that are not implemented yet.

exception indra.tools.reading.readers.ReachError[source]

class indra.tools.reading.readers.ReachReader(*args, **kwargs)[source]

This object encodes an interface to the REACH reading script.

clear_input()[source]

Remove all the input files (at the end of a reading).

get_output()[source]

Get the output of a reading job as a list of filenames.

prep_input(read_list)[source]

Apply the readers to the content.

read(read_list, verbose=False, log=False)[source]

Read the content, returning a list of ReadingData objects.

class indra.tools.reading.readers.Reader(base_dir=None, n_proc=1, check_content=True, input_character_limit=500000.0, max_space_ratio=0.5, ResultClass=<class 'indra.tools.reading.readers.ReadingData'>)[source]

This abstract object defines some general methods for readers.

add_result(content_id, content, **kwargs)[source]

Add a result to the list of results.

read(read_list, verbose=False, log=False)[source]

Read a list of items and return a dict of output files.

class indra.tools.reading.readers.ReadingData(content_id, reader, reader_version, content_format, content)[source]

Object to contain the data produced by a reading.

Parameters:
  • content_id (int or str) – A unique identifier of the text content that produced the reading, which can be mapped back to that content.
  • reader (str) – The name of the reader, consistent with its name attribute, for example: ‘REACH’
  • reader_version (str) – A string identifying the version of the underlying NLP reader.
  • content_format (str) – The format of the content. Options are in indra.db.formats.
  • content (str or dict) – The content of the reading result. A string in the format given by content_format.

get_statements(reprocess=False)[source]

General method to create statements.

exception indra.tools.reading.readers.ReadingError[source]

exception indra.tools.reading.readers.SparserError[source]

class indra.tools.reading.readers.SparserReader(*args, **kwargs)[source]

This object provides methods to interface with the command-line tool.

get_output(output_files, clear=True)[source]

Get the output files as an id indexed dict.

prep_input(read_list)[source]

Prepare the list of files or text content objects to be read.

read(read_list, verbose=False, log=False, n_per_proc=None)[source]

Perform the actual reading.

read_some(fpath_list, outbuf=None, verbose=False)[source]

Perform a few readings.

class indra.tools.reading.readers.TripsReader(*args, **kwargs)[source]

A stand-in for TRIPS reading.

Currently, we do not run TRIPS (more specifically DRUM) regularly at large scales, however on occasion we have outputs from TRIPS that were generated a while ago.

read(*args, **kwargs)[source]

Read a list of items and return a dict of output files.

indra.tools.reading.readers.get_reader(reader_name, *args, **kwargs)[source]

Get an instantiated reader by name.

indra.tools.reading.readers.get_reader_class(reader_name)[source]

Get a particular reader class by name.

indra.tools.reading.readers.get_reader_classes(parent=<class 'indra.tools.reading.readers.Reader'>)[source]

Get all the childless descendants of a parent class, recursively.
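To make "childless descendants" concrete, the sketch below collects the leaf classes of a small class hierarchy and then looks one up by name. It is an illustration only, not INDRA's implementation: the classes, the helper names get_leaf_classes and get_class_by_name, and the case-insensitive match on a name attribute are all assumptions.

```python
class BaseReader:
    """Hypothetical stand-in for a reader base class."""
    name = None


class ReaderA(BaseReader):
    name = "A"


class ReaderB(BaseReader):
    name = "B"


class ReaderB2(ReaderB):
    name = "B2"


def get_leaf_classes(parent):
    """Recursively collect descendants of parent that have no subclasses."""
    leaves = []
    for child in parent.__subclasses__():
        grandchildren = get_leaf_classes(child)
        # A child with no descendants of its own is a leaf.
        leaves.extend(grandchildren if grandchildren else [child])
    return leaves


def get_class_by_name(parent, reader_name):
    """Look up a leaf class by its name attribute, ignoring case (assumed)."""
    for cls in get_leaf_classes(parent):
        if cls.name.lower() == reader_name.lower():
            return cls
    raise ValueError(f"No reader class named {reader_name!r}")


leaf_names = sorted(cls.name for cls in get_leaf_classes(BaseReader))
print(leaf_names)
```

In this sketch ReaderB is not returned because it has a subclass of its own; only ReaderA and ReaderB2 are leaves.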

Tools to run the DRUM reading system (indra.tools.reading.run_drum_reading)

indra.tools.reading.run_drum_reading.read_pmid_sentences(pmid_sentences, **drum_args)[source]

Read sentences from a PMID-keyed dictionary and return all Statements.

Parameters:
  • pmid_sentences (dict[str, list[str]]) – A dictionary where each key is a PMID pointing to a list of sentences to be read.
  • **drum_args – Keyword arguments passed directly to the DrumReader. Typical things to specify are host and port. If run_drum is specified as True, this process will internally run the DRUM reading system as a subprocess. Otherwise, DRUM is expected to be running independently.
Returns:

all_statements – A list of INDRA Statements resulting from the reading

Return type:

list[indra.statements.Statement]
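The expected shape of pmid_sentences can be sketched as follows. The PMIDs, sentences, and connection settings are made-up values for illustration, and the call itself is left commented out because it needs INDRA installed and a reachable DRUM instance.

```python
from collections import defaultdict

# Hypothetical (PMID, sentence) pairs; in practice these would come from
# whatever sentence-splitting step precedes reading.
pairs = [
    ("12345", "MEK phosphorylates ERK."),
    ("12345", "ERK activates ELK1."),
    ("67890", "BRAF binds RAF1."),
]

# Group the sentences by PMID into a dict[str, list[str]].
pmid_sentences = defaultdict(list)
for pmid, sentence in pairs:
    pmid_sentences[pmid].append(sentence)
pmid_sentences = dict(pmid_sentences)

# Assumed connection settings for a DRUM instance that is already running.
drum_args = {"host": "localhost", "port": 6200}

# With INDRA installed and DRUM reachable, the call would look like:
# from indra.tools.reading.run_drum_reading import read_pmid_sentences
# statements = read_pmid_sentences(pmid_sentences, **drum_args)
print(pmid_sentences)
```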

indra.tools.reading.run_drum_reading.read_text(text, **drum_args)[source]

Read a block of text with DRUM and return all Statements.

Parameters:
  • text (str) – A block of text to run DRUM on
  • **drum_args – Keyword arguments passed directly to the DrumReader. Typical things to specify are ‘host’ and ‘port’.
Returns:

statements – A list of INDRA Statements resulting from the reading

Return type:

list[indra.statements.Statement]

Python tools for submitting reading pipelines (indra.tools.reading.submit_reading_pipeline)

exception indra.tools.reading.submit_reading_pipeline.BatchReadingError[source]

indra.tools.reading.submit_reading_pipeline.get_ecs_cluster_for_queue(queue_name, batch_client=None)[source]

Get the name of the ecs cluster using the batch client.

indra.tools.reading.submit_reading_pipeline.submit_combine(basename, readers, job_ids=None, project_name=None)[source]

Submit a batch job to combine the outputs of a reading job.

This function is provided for backwards compatibility. You should use the PmidSubmitter class and its submit_combine method instead.

indra.tools.reading.submit_reading_pipeline.submit_reading(basename, pmid_list_filename, readers, start_ix=None, end_ix=None, pmids_per_job=3000, num_tries=2, force_read=False, force_fulltext=False, project_name=None)[source]

Submit an old-style pmid-centered no-database s3 only reading job.

This function is provided for the sake of backward compatibility. Going forward, it is preferred that you use the object-oriented PmidSubmitter and its submit_reading method.
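The start_ix, end_ix, and pmids_per_job parameters suggest that the PMID list is divided into index ranges, one per job. The sketch below shows one way such a split could work; it is an assumption about the batching logic rather than a copy of it, and the helper name plan_jobs is hypothetical.

```python
def plan_jobs(num_pmids, pmids_per_job=3000, start_ix=None, end_ix=None):
    """Split the index range [start_ix, end_ix) into per-job sub-ranges."""
    start = 0 if start_ix is None else start_ix
    end = num_pmids if end_ix is None else min(end_ix, num_pmids)
    return [(job_start, min(job_start + pmids_per_job, end))
            for job_start in range(start, end, pmids_per_job)]


jobs = plan_jobs(num_pmids=7000)
print(jobs)  # [(0, 3000), (3000, 6000), (6000, 7000)]
```

With the default of 3000 PMIDs per job, a list of 7000 PMIDs would yield three jobs, the last one covering the remaining 1000 entries.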

indra.tools.reading.submit_reading_pipeline.tag_instances_on_cluster(cluster_name, project='cwc')[source]

Adds project tag to untagged instances in a given cluster.

Parameters:
  • cluster_name (str) – The name of the AWS ECS cluster in which running instances should be tagged.
  • project (str) – The name of the project to tag instances with.
indra.tools.reading.submit_reading_pipeline.wait_for_complete(queue_name, job_list=None, job_name_prefix=None, poll_interval=10, idle_log_timeout=None, kill_on_log_timeout=False, stash_log_method=None, tag_instances=False, result_record=None)[source]

Return when all jobs in the given list have finished.

If no job list is given, return when all jobs in the queue have finished.

Parameters:
  • queue_name (str) – The name of the queue to wait for completion.
  • job_list (Optional[list(dict)]) – A list of job IDs, each in a dict, as returned by the submit function. Example: [{‘jobId’: ‘e6b00f24-a466-4a72-b735-d205e29117b4’}, …] If not given, this function will return when all jobs in the queue have completed.
  • job_name_prefix (Optional[str]) – A prefix for the name of the jobs to wait for. This is useful if the explicit job list is not available but filtering is needed.
  • poll_interval (Optional[int]) – The time delay between API calls to check the job statuses.
  • idle_log_timeout (Optional[int] or None) – If not None, then track the logs of the active jobs, and if new output is not produced after idle_log_timeout seconds, a warning is printed. If kill_on_log_timeout is set to True, the job will also be terminated.
  • kill_on_log_timeout (Optional[bool]) – If True, and if idle_log_timeout is set, jobs will be terminated after timeout. This has no effect if idle_log_timeout is None. Default is False.
  • stash_log_method (Optional[str]) – Select a method to store the job logs, either ‘s3’ or ‘local’. If no method is specified, the logs will not be loaded off of AWS. If ‘s3’ is specified, then job_name_prefix must also be given, as this will indicate where on s3 to store the logs.
  • tag_instances (bool) – Default is False. If True, apply tags to the instances. Today this is typically done by each job, so in most cases this should not be needed.
  • result_record (dict) – A dict which will be modified in place to record the results of the job.
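The polling behavior described above can be sketched without any AWS dependency. This is a simplified illustration, not the wait_for_complete implementation: get_status is a hypothetical callable standing in for the batch API, log tracking and instance tagging are omitted, and the record is returned rather than modified in place.

```python
import time

# AWS Batch reports these two statuses for jobs that will not run further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_until_done(job_list, get_status, poll_interval=10, sleep=time.sleep):
    """Poll until every job in job_list has reached a terminal state.

    get_status is a hypothetical callable mapping a job ID to a status
    string; a real version would query the batch service instead.
    """
    result_record = {}
    pending = [job["jobId"] for job in job_list]
    while pending:
        still_running = []
        for job_id in pending:
            status = get_status(job_id)
            if status in TERMINAL_STATES:
                result_record[job_id] = status
            else:
                still_running.append(job_id)
        pending = still_running
        if pending:
            # Wait before the next round of status checks.
            sleep(poll_interval)
    return result_record


# Simulated statuses: each call advances the job one step.
_fake = {"job-a": iter(["RUNNING", "SUCCEEDED"]),
         "job-b": iter(["RUNNABLE", "RUNNING", "FAILED"])}
record = wait_until_done(
    [{"jobId": "job-a"}, {"jobId": "job-b"}],
    get_status=lambda job_id: next(_fake[job_id]),
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(record)
```

Passing the sleep function as a parameter keeps the loop easy to exercise in tests, since the delay between polls can be replaced with a no-op.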