REACH (indra.sources.reach)

REACH is a biology-oriented machine reading system which uses a cascade of grammars to extract biological mechanisms from free text.

To cover a wide range of use cases and scenarios, INDRA can currently use REACH in four different ways.

1. INDRA communicating with a locally running REACH Server (indra.sources.reach.api)

Setup and usage: Follow standard instructions to install SBT. Then clone REACH and run the REACH web server.

git clone https://github.com/clulab/reach.git
cd reach
sbt 'run-main org.clulab.reach.export.server.ApiServer'

Then read text by specifying the url parameter when using indra.sources.reach.process_text.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK', url=reach.local_text_url)

It is also possible to read NXML (string or file) and to process the text of a paper given its PMC ID or PubMed ID using other functions in indra.sources.reach.api. Note that when NXML content is being read, reach.local_nxml_url must be passed as url.
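The choice between the two local URLs can be sketched as a small helper. This is an illustration only, not part of INDRA: read_with_local_server is a hypothetical name, and the reach module is passed in as an argument so that the selection logic can be tried without a running server.

```python
def read_with_local_server(content, is_nxml, reach_module):
    """Read content through a locally running REACH server.

    reach_module is expected to be indra.sources.reach. It is passed in
    explicitly (an assumption made for this sketch) so that the URL
    selection can be exercised without INDRA or a server.
    """
    if is_nxml:
        # NXML content has to be sent to the NXML URL of the local server.
        return reach_module.process_nxml_str(
            content, url=reach_module.local_nxml_url)
    # Plain text is sent to the text URL of the local server.
    return reach_module.process_text(
        content, url=reach_module.local_text_url)
```

With INDRA installed and the local server running, this would be called as read_with_local_server('MEK binds ERK', False, reach) after from indra.sources import reach.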

Advantages:

  • Does not require setting up the pyjnius Python-Java bridge.
  • Does not require assembling a REACH JAR file.
  • Allows local control of the REACH version and configuration used to run the service.
  • REACH is running in a separate process and therefore does not need to be initialized if a new Python session is started.

Disadvantages:

  • First request might be time-consuming as REACH is loading additional resources.
  • Only endpoints exposed by the REACH web server are available, i.e., no full object-level access to REACH components.

2. INDRA communicating with the UA REACH Server (indra.sources.reach.api)

Setup and usage: Does not require any additional setup after installing INDRA.

Read text using the default values of the offline and url parameters.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK')

It is also possible to read NXML (string or file) and process the content of a paper given its PMC ID or PubMed ID using other functions in indra.sources.reach.api.

Advantages:

  • Does not require setting up the pyjnius Python-Java bridge.
  • Does not require assembling a REACH JAR file or installing REACH at all locally.
  • Suitable for initial prototyping or integration testing.

Disadvantages:

  • Cannot handle high-throughput reading workflows due to limited server resources.
  • No control over which REACH version is used to run the service.
  • Difficulties processing NXML-formatted text (request times out) have been observed in the past.

3. INDRA using a REACH JAR through a Python-Java bridge (indra.sources.reach.reader)

Setup and usage:

Follow standard instructions for installing SBT. First, the REACH system and its dependencies need to be packaged as a fat JAR:

git clone https://github.com/clulab/reach.git
cd reach
sbt assembly

This creates a JAR file at reach/target/scala[version]/reach-[version].jar. Set the REACHPATH environment variable to the absolute path of this file, then append REACHPATH to the CLASSPATH environment variable (entries are separated by colons).
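If the variables are set from Python rather than from the shell, the step could look like the following sketch. The helper name add_reach_to_classpath is hypothetical, and the JAR path must be supplied by the caller. The sketch uses os.pathsep, which is a colon on Linux and macOS, and it presumably has to run before the JVM is started through pyjnius.

```python
import os


def add_reach_to_classpath(jar_path, env=None):
    """Set REACHPATH and append the JAR to CLASSPATH.

    env defaults to os.environ; a plain dict can be passed for testing.
    Returns the resulting CLASSPATH string.
    """
    env = os.environ if env is None else env
    env['REACHPATH'] = jar_path
    # Split the existing CLASSPATH into entries, dropping empty ones.
    entries = [e for e in env.get('CLASSPATH', '').split(os.pathsep) if e]
    # Append the JAR only once, so repeated calls do not duplicate it.
    if jar_path not in entries:
        entries.append(jar_path)
    env['CLASSPATH'] = os.pathsep.join(entries)
    return env['CLASSPATH']
```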

The pyjnius package needs to be set up and be operational. For more details, see Pyjnius setup instructions in the documentation.

Then, reading can be done using the indra.sources.reach.process_text function with the offline option.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK', offline=True)

Other functions in indra.sources.reach.api can also be used with the offline option to invoke local, JAR-based reading.

Advantages:

  • Does not require running REACH in a separate process from INDRA.
  • Having a single REACH JAR file makes this solution easily portable.
  • Through jnius, all classes in REACH become available for programmatic access.

Disadvantages:

  • Requires configuring pyjnius, which is often difficult (e.g., on Windows). This usage mode is therefore generally not recommended.
  • A ReachReader instance needs to be instantiated every time a new INDRA session is started, which is time-consuming.

4. Use REACH separately to produce output files and then process those with INDRA

In this usage mode REACH is not directly invoked by INDRA. Rather, REACH is set up and run independently of INDRA to produce output files for a set of text content. For more information on running REACH on a set of text or NXML files, see the REACH documentation at: https://github.com/clulab/reach. Note that INDRA uses the fries output format produced by REACH.

Once REACH output has been obtained in the fries JSON format, one can use indra.sources.reach.api.process_json_file in INDRA to process each JSON file.
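Processing a whole directory of REACH output could be sketched as follows. The helper process_reach_output_dir is hypothetical, and it assumes that every .json file in the directory is one complete REACH output of the kind process_json_file accepts. The processing function is passed in as an argument so the loop can be tested without INDRA.

```python
import glob
import os


def process_reach_output_dir(directory, process_json_file):
    """Apply process_json_file to every .json file in directory.

    Returns a dict mapping each file path to the object returned for it,
    in sorted path order.
    """
    results = {}
    pattern = os.path.join(directory, '*.json')
    for path in sorted(glob.glob(pattern)):
        # With INDRA, each value would be a ReachProcessor.
        results[path] = process_json_file(path)
    return results
```

With INDRA installed, the second argument would be indra.sources.reach.api.process_json_file, and the statements for each file would then be available in the statements attribute of the corresponding processor.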

REACH API (indra.sources.reach.api)

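Functions for obtaining a ReachProcessor containing INDRA Statements.

Several input formats are supported. The functions that take text, NXML content, or a paper ID run REACH, while the JSON functions process existing REACH output.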

indra.sources.reach.api.process_json_file(file_name, citation=None)[source]

Return a ReachProcessor by processing the given REACH json file.

The output from the REACH parser is in this json format. This function is useful if the output is saved as a file and needs to be processed. For more information on the format, see: https://github.com/clulab/reach

Parameters:
  • file_name (str) – The name of the json file to be processed.
  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor

indra.sources.reach.api.process_json_str(json_str, citation=None)[source]

Return a ReachProcessor by processing the given REACH json string.

The output from the REACH parser is in this json format. For more information on the format, see: https://github.com/clulab/reach

Parameters:
  • json_str (str) – The json string to be processed.
  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor

indra.sources.reach.api.process_nxml_file(file_name, citation=None, offline=False, url=None, output_fname='reach_output.json')[source]

Return a ReachProcessor by processing the given NXML file.

NXML is the format used by PubMed Central for papers in the open access subset.

Parameters:
  • file_name (str) – The name of the NXML file to be processed.
  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor

indra.sources.reach.api.process_nxml_str(nxml_str, citation=None, offline=False, url=None, output_fname='reach_output.json')[source]

Return a ReachProcessor by processing the given NXML string.

NXML is the format used by PubMed Central for papers in the open access subset.

Parameters:
  • nxml_str (str) – The NXML string to be processed.
  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor

indra.sources.reach.api.process_pmc(pmc_id, offline=False, url=None, output_fname='reach_output.json')[source]

Return a ReachProcessor by processing a paper with a given PMC id.

Uses the PMC client to obtain the full text. If it’s not available, None is returned.

Parameters:
  • pmc_id (str) – The ID of a PubMed Central article. The string may start with PMC, but passing just the numeric ID also works. Examples: 3717945, PMC3717945. See https://www.ncbi.nlm.nih.gov/pmc/
  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor

indra.sources.reach.api.process_pubmed_abstract(pubmed_id, offline=False, url=None, output_fname='reach_output.json', **kwargs)[source]

Return a ReachProcessor by processing an abstract with a given PubMed ID.

Uses the Pubmed client to get the abstract. If that fails, None is returned.

Parameters:
  • pubmed_id (str) – The ID of a PubMed article. The string may start with PMID, but passing just the numeric ID also works. Examples: 27168024, PMID27168024. See https://www.ncbi.nlm.nih.gov/pubmed/
  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
  • **kwargs (keyword arguments) – All other keyword arguments are passed directly to process_text.
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor
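Because process_pmc and process_pubmed_abstract return None when the content cannot be obtained, code that reads many papers needs to check the result before accessing statements. A minimal sketch, assuming a hypothetical helper collect_statements that receives the reading function as an argument:

```python
def collect_statements(paper_ids, read_fn):
    """Read each paper with read_fn and gather the extracted statements.

    read_fn stands for a function such as process_pmc or
    process_pubmed_abstract. IDs for which it returns None are skipped
    and reported separately.
    """
    statements = []
    failed = []
    for paper_id in paper_ids:
        rp = read_fn(paper_id)
        if rp is None:
            # No content was available for this ID, so there is no processor.
            failed.append(paper_id)
            continue
        statements.extend(rp.statements)
    return statements, failed
```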

indra.sources.reach.api.process_text(text, citation=None, offline=False, url=None, output_fname='reach_output.json', timeout=None)[source]

Return a ReachProcessor by processing the given text.

Parameters:
  • text (str) – The text to be processed.
  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. This is used when the text to be processed comes from a publication that is not otherwise identified. Default: None
  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
  • timeout (Optional[float]) – This only applies when reading online (offline=False). Only wait for timeout seconds for the api to respond.
Returns:

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type:

ReachProcessor
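The way the offline and url parameters select a reading mode, as described in the parameter lists above, can be summarized in a short sketch. This is an illustration of the documented behavior rather than INDRA's own code. The function name is hypothetical, and the sketch assumes that offline takes precedence when both options are given, which the documentation does not state explicitly.

```python
def describe_reading_mode(offline=False, url=None):
    """Describe which REACH backend the documented parameters select."""
    if offline:
        # offline=True: REACH is run locally via the JAR file.
        # Assumption: this takes precedence over any url that is given.
        return 'offline JAR'
    if url is not None:
        # A url was provided: that REACH web service instance is used.
        return 'web service at %s' % url
    # Neither option is set: the Arizona REACH web service is called.
    return 'default Arizona web service'
```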

REACH Processor (indra.sources.reach.processor)

class indra.sources.reach.processor.ReachProcessor(json_dict, pmid=None)[source]

The ReachProcessor extracts INDRA Statements from REACH parser output.

Parameters:
  • json_dict (dict) – A JSON dictionary containing the REACH extractions.
  • pmid (Optional[str]) – The PubMed ID associated with the extractions. This can be passed in case the PMID cannot be determined from the extractions alone.
tree

The objectpath Tree object representing the extractions.

Type:objectpath.Tree
statements

A list of INDRA Statements that were extracted by the processor.

Type:list[indra.statements.Statement]
citation

The PubMed ID associated with the extractions.

Type:str
all_events

The frame IDs of all events by type in the REACH extraction.

Type:dict[str, str]
get_activation()[source]

Extract INDRA Activation Statements.

get_all_events()[source]

Gather all event IDs in the REACH output by type.

These IDs are stored in the self.all_events dict.

get_complexes()[source]

Extract INDRA Complex Statements.

get_modifications()[source]

Extract Modification INDRA Statements.

get_regulate_amounts()[source]

Extract RegulateAmount INDRA Statements.

get_translocation()[source]

Extract INDRA Translocation Statements.

print_event_statistics()[source]

Print the number of events in the REACH output by type.
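The bookkeeping described for get_all_events and print_event_statistics amounts to grouping event IDs by event type and then counting them. The following sketch shows the idea on plain dictionaries. The keys 'type' and 'frame_id' and both function names are hypothetical and are not taken from the REACH output format or from INDRA.

```python
from collections import defaultdict


def group_event_ids_by_type(event_frames):
    """Map each event type to the list of IDs of events with that type."""
    events_by_type = defaultdict(list)
    for frame in event_frames:
        # 'type' and 'frame_id' are assumed key names for this sketch.
        events_by_type[frame['type']].append(frame['frame_id'])
    return dict(events_by_type)


def format_event_statistics(events_by_type):
    """Return one 'type: count' line per event type, sorted by type."""
    return ['%s: %d' % (etype, len(ids))
            for etype, ids in sorted(events_by_type.items())]
```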

class indra.sources.reach.processor.Site(residue, position)
position

Alias for field number 1

residue

Alias for field number 0
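The "Alias for field number" descriptions suggest that Site is a named tuple with the fields residue and position. A stand-in defined with collections.namedtuple behaves as follows. The definition and the values are for illustration only and are not copied from INDRA.

```python
from collections import namedtuple

# Stand-in with the documented field order: residue first, position second.
Site = namedtuple('Site', ['residue', 'position'])

# Example values chosen for illustration.
site = Site(residue='S', position='222')

# Fields can be read by name, by index, or by unpacking.
residue, position = site
```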

indra.sources.reach.processor.determine_reach_subtype(event_name)[source]

Returns the category of reach rule from the reach rule instance.

Looks at a list of regular expressions corresponding to reach rule types, and returns the longest regexp that matches, or None if none of them match.

Parameters:event_name (str) – The name of the REACH rule instance to subtype
Returns:best_match – A regular expression corresponding to the reach rule that was used to extract this evidence
Return type:str
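The selection rule described above, returning the longest regular expression that matches, can be sketched as follows. The pattern list is a made-up example and not the list of REACH rule types that INDRA uses, and longest_matching_regexp is a hypothetical name.

```python
import re

# Hypothetical rule-type patterns, for illustration only.
EXAMPLE_RULE_REGEXPS = [
    r'^Positive_activation',
    r'^Positive_activation_syntax',
    r'^Binding',
]


def longest_matching_regexp(event_name, regexps=EXAMPLE_RULE_REGEXPS):
    """Return the longest pattern that matches event_name, or None."""
    best_match = None
    for pattern in regexps:
        if re.search(pattern, event_name):
            # Length is measured on the pattern string itself.
            if best_match is None or len(pattern) > len(best_match):
                best_match = pattern
    return best_match
```

Preferring the longest pattern means that a more specific rule type wins over a more general one when both match the same rule name.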

REACH reader (indra.sources.reach.reader)

exception indra.sources.reach.reader.ReachOfflineReadingError[source]
class indra.sources.reach.reader.ReachReader[source]

The ReachReader wraps a singleton instance of the REACH reader.

This allows calling the reader many times without having to wait for it to start up each time.

api_ruler

An instance of the REACH ApiRuler class (java object).

Type:org.clulab.reach.apis.ApiRuler
get_api_ruler()[source]

Return the existing reader if it exists or launch a new one.

Returns:api_ruler – An instance of the REACH ApiRuler class (java object).
Return type:org.clulab.reach.apis.ApiRuler
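The "return the existing reader or launch a new one" behavior is a form of lazy initialization with caching. A minimal sketch of the pattern, in which CachedReader and its factory argument are hypothetical stand-ins for ReachReader and for whatever starts the Java reader:

```python
class CachedReader:
    """Create an expensive object on first use and reuse it afterwards."""

    def __init__(self, factory):
        # factory is a hypothetical callable that starts the reader.
        self._factory = factory
        self.api_ruler = None

    def get_api_ruler(self):
        """Return the existing reader if there is one, else create it."""
        if self.api_ruler is None:
            # The slow start-up happens only on the first call.
            self.api_ruler = self._factory()
        return self.api_ruler
```

Keeping one such wrapper alive for the length of a session is what avoids paying the start-up cost on every read.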