REACH (indra.sources.reach)

REACH is a biology-oriented machine reading system which uses a cascade of grammars to extract biological mechanisms from free text.

To cover a wide range of use cases and scenarios, there are currently four ways in which INDRA can use REACH.

1. INDRA communicating with a locally running REACH Server (indra.sources.reach.api)

Setup and usage: Follow standard instructions to install SBT. Then clone REACH and run the REACH web server.

git clone https://github.com/clulab/reach.git
cd reach
sbt "runMain org.clulab.reach.export.server.ApiServer"

Alternatively, REACH can be run via Docker as follows.

git clone https://github.com/clulab/reach.git
cd reach/docker
docker build --tag reach:latest .
docker run -d -it -p 8080:8080 reach:latest

Here, -d stands for ‘detach’ and runs the service in the background.

Then read text by specifying the url parameter when using indra.sources.reach.process_text.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK', url=reach.local_text_url)

One limitation here is that the REACH server is configured by default to limit the input to 2048 characters. To change this, edit the file export/src/main/resources/reference.conf in your local reach clone folder and add

http {
  server {
    // ...
    parsing {
      max-uri-length = 256k
    }
    // ...
  }
}

to increase the character limit.
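
When editing the server configuration is not practical, one possible workaround is to split long input into pieces that stay under the limit and read each piece separately. The following is a minimal sketch under stated assumptions: the helper names chunk_text and read_long_text are hypothetical, sentences are split naively on punctuation followed by whitespace, and any mechanism described across a chunk boundary would be missed.

```python
import re


def chunk_text(text, max_chars=2048):
    """Split text into chunks of at most max_chars characters.

    Sentences are kept whole where possible. A single sentence longer
    than max_chars is cut at the limit.
    """
    # Naive sentence split: whitespace that follows '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        # Hard-cut any sentence that exceeds the limit on its own
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = sentence if not current else current + ' ' + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def read_long_text(text, url, max_chars=2048):
    """Read each chunk with REACH and pool the extracted Statements."""
    from indra.sources import reach  # imported lazily; requires INDRA
    statements = []
    for chunk in chunk_text(text, max_chars):
        rp = reach.process_text(chunk, url=url)
        # Skip chunks for which no processor was returned
        if rp is not None:
            statements.extend(rp.statements)
    return statements
```

With a locally running server, this could be called as read_long_text(long_text, url=reach.local_text_url).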

It is also possible to read NXML (string or file) and process the text of a paper given its PMC ID or PubMed ID using other API methods in indra.sources.reach.api. Note that reach.local_nxml_url needs to be used as url in case NXML content is being read.

Advantages:

  • Does not require setting up the pyjnius Python-Java bridge.

  • Does not require assembling a REACH JAR file.

  • Allows local control of the REACH version and configuration used to run the service.

  • REACH is running in a separate process and therefore does not need to be initialized if a new Python session is started.

Disadvantages:

  • First request might be time-consuming as REACH is loading additional resources.

  • Only endpoints exposed by the REACH web server are available, i.e., no full object-level access to REACH components.

2. INDRA communicating with the UA REACH Server (indra.sources.reach.api)

Setup and usage: Does not require any additional setup after installing INDRA.

Read text using the default values for offline and url parameters.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK')

It is also possible to read NXML (string or file) and process the content of a paper given its PMC ID or PubMed ID using other functions in indra.sources.reach.api.

Advantages:

  • Does not require setting up the pyjnius Python-Java bridge.

  • Does not require assembling a REACH JAR file or installing REACH at all locally.

  • Suitable for initial prototyping or integration testing.

Disadvantages:

  • Cannot handle high-throughput reading workflows due to limited server resources.

  • No control over which REACH version is used to run the service.

  • Difficulties processing NXML-formatted text (request times out) have been observed in the past.

3. INDRA using a REACH JAR through a Python-Java bridge (indra.sources.reach.reader)

Setup and usage:

Follow standard instructions for installing SBT. First, the REACH system and its dependencies need to be packaged as a fat JAR:

git clone https://github.com/clulab/reach.git
cd reach
sbt assembly

This creates a JAR file in reach/target/scala[version]/reach-[version].jar. Set the REACHPATH environment variable to the absolute path of this file, and then append REACHPATH to the CLASSPATH environment variable (entries are separated by colons).
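
The same environment setup can be sketched from Python, for instance at the top of a script. This is an illustration only: the helper names are hypothetical, the JAR location is a placeholder, and it assumes the variables are set before the Java bridge is first used in the session.

```python
import os


def append_to_classpath(jar_path, classpath='', sep=os.pathsep):
    """Return classpath with jar_path appended, avoiding duplicates."""
    entries = [entry for entry in classpath.split(sep) if entry]
    if jar_path not in entries:
        entries.append(jar_path)
    return sep.join(entries)


def configure_reach_env(jar_path, env=os.environ):
    """Set REACHPATH and extend CLASSPATH in the given environment mapping."""
    jar_path = os.path.abspath(jar_path)
    env['REACHPATH'] = jar_path
    env['CLASSPATH'] = append_to_classpath(jar_path, env.get('CLASSPATH', ''))
    return env['CLASSPATH']


# Example (placeholder path, to be replaced by the actual JAR location):
# configure_reach_env('/path/to/reach.jar')
```

os.pathsep is a colon on Linux and macOS, matching the separator described above.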

The pyjnius package needs to be set up and be operational. For more details, see Pyjnius setup instructions in the documentation.

Then, reading can be done using the indra.sources.reach.process_text function with the offline option.

from indra.sources import reach
rp = reach.process_text('MEK binds ERK', offline=True)

Other functions in indra.sources.reach.api can also be used with the offline option to invoke local, JAR-based reading.

Advantages:

  • Doesn’t require running a separate process for REACH and INDRA.

  • Having a single REACH JAR file makes this solution easily portable.

  • Through jnius, all classes in REACH become available for programmatic access.

Disadvantages:

  • Requires configuring pyjnius, which is often difficult (e.g., on Windows). Therefore, this usage mode is generally not recommended.

  • The ReachReader needs to be instantiated every time a new INDRA session is started, which is time-consuming.

4. Use REACH separately to produce output files and then process those with INDRA

In this usage mode REACH is not directly invoked by INDRA. Rather, REACH is set up and run independently of INDRA to produce output files for a set of text content. For more information on running REACH on a set of text or NXML files, see the REACH documentation at: https://github.com/clulab/reach. Note that INDRA uses the fries output format produced by REACH.

Once REACH output has been obtained in the fries JSON format, one can use indra.sources.reach.api.process_json_file in INDRA to process each JSON file.

REACH API (indra.sources.reach.api)

Functions for obtaining a ReachProcessor containing INDRA Statements.

Several input formats are supported, and many of these functions run REACH to read the input.

indra.sources.reach.api.process_agents_from_entities(file_name, organism_priority=None, with_coordinates=False)[source]

Return INDRA Agents extracted from all entities, even ones not appearing in Statements.

Parameters
  • file_name (str) – The name of the json file to be processed.

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

  • with_coordinates (Optional[bool]) – If True, the Agents will be returned in a tuple with their coordinates. Default: False

Returns

A list of INDRA Agents processed from all extracted entities.

Return type

list[Agent]

indra.sources.reach.api.process_fries_json_group(group_prefix, citation=None, organism_priority=None)[source]

Return a ReachProcessor by processing a REACH fries output file group.

When running REACH through its CLI with the fries output format, it produces three output JSON files for each input file. These three files jointly constitute the output, so they have to be combined to be processed. For instance, one might have PMC9582577.uaz.entities.json, PMC9582577.uaz.events.json, PMC9582577.uaz.sentences.json.

Parameters
  • group_prefix (str) – The prefix for the group of output files, e.g., PMC9582577.uaz

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor
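
When a directory holds output for many input files, the group prefixes can be recovered from the file names. The sketch below is an assumption-laden illustration: find_fries_group_prefixes and process_output_dir are hypothetical helpers, file names are assumed to follow the [prefix].[type].json pattern shown above, and the prefix passed to process_fries_json_group is assumed to be allowed to include the directory path.

```python
from collections import defaultdict
from pathlib import Path


def find_fries_group_prefixes(output_dir, files_per_group=3):
    """Return sorted prefixes that have a complete group of JSON files."""
    groups = defaultdict(set)
    for path in Path(output_dir).glob('*.json'):
        # 'PMC9582577.uaz.entities.json' -> ('PMC9582577.uaz', 'entities')
        stem = path.name[:-len('.json')]
        prefix, _, kind = stem.rpartition('.')
        if prefix:
            groups[str(path.parent / prefix)].add(kind)
    return sorted(prefix for prefix, kinds in groups.items()
                  if len(kinds) == files_per_group)


def process_output_dir(output_dir):
    """Map each complete group prefix to its ReachProcessor."""
    from indra.sources.reach import api  # imported lazily; requires INDRA
    return {prefix: api.process_fries_json_group(prefix)
            for prefix in find_fries_group_prefixes(output_dir)}
```

Incomplete groups (for example, when REACH was interrupted while writing) are skipped rather than passed on.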

indra.sources.reach.api.process_json_file(file_name, citation=None, organism_priority=None)[source]

Return a ReachProcessor by processing the given REACH json file.

The output from the REACH parser is in this json format. This function is useful if the output is saved as a file and needs to be processed. For more information on the format, see: https://github.com/clulab/reach

Parameters
  • file_name (str) – The name of the json file to be processed.

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor

indra.sources.reach.api.process_json_str(json_str, citation=None, organism_priority=None)[source]

Return a ReachProcessor by processing the given REACH json string.

The output from the REACH parser is in this json format. For more information on the format, see: https://github.com/clulab/reach

Parameters
  • json_str (str) – The json string to be processed.

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor

indra.sources.reach.api.process_nxml_file(file_name, citation=None, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)[source]

Return a ReachProcessor by processing the given NXML file.

NXML is the format used by PubMed Central for papers in the open access subset.

Parameters
  • file_name (str) – The name of the NXML file to be processed.

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None

  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False

  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None

  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor

indra.sources.reach.api.process_nxml_str(nxml_str, citation=None, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)[source]

Return a ReachProcessor by processing the given NXML string.

NXML is the format used by PubMed Central for papers in the open access subset.

Parameters
  • nxml_str (str) – The NXML string to be processed.

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None

  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False

  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None

  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor

indra.sources.reach.api.process_pmc(pmc_id, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)[source]

Return a ReachProcessor by processing a paper with a given PMC ID.

Uses the PMC client to obtain the full text. If it’s not available, None is returned.

Parameters
  • pmc_id (str) – The ID of a PubMed Central article. The string may start with PMC but passing just the ID also works. Examples: 8511698, PMC8511698 https://www.ncbi.nlm.nih.gov/pmc/

  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False

  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None

  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor
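
Since both ID forms are accepted and None is returned when the full text is unavailable, calling code typically needs a guard around the result. The following sketch shows one way to do this. normalize_pmc_id and statements_for_pmc are hypothetical helpers, and the normalization is not needed by process_pmc itself; it is only useful for keeping IDs consistent in one's own bookkeeping.

```python
def normalize_pmc_id(pmc_id):
    """Return the ID with a 'PMC' prefix, e.g. '8511698' -> 'PMC8511698'."""
    pmc_id = str(pmc_id).strip()
    if pmc_id.upper().startswith('PMC'):
        pmc_id = pmc_id[3:]
    if not pmc_id.isdigit():
        raise ValueError('Not a valid PMC ID: %r' % pmc_id)
    return 'PMC' + pmc_id


def statements_for_pmc(pmc_id, url=None):
    """Return extracted Statements, or an empty list if no text was found."""
    from indra.sources.reach import api  # imported lazily; requires INDRA
    rp = api.process_pmc(normalize_pmc_id(pmc_id), url=url)
    # process_pmc returns None when the full text is not available
    return [] if rp is None else rp.statements
```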

indra.sources.reach.api.process_pubmed_abstract(pubmed_id, offline=False, url=None, output_fname='reach_output.json', **kwargs)[source]

Return a ReachProcessor by processing an abstract with a given PubMed ID.

Uses the PubMed client to get the abstract. If that fails, None is returned.

Parameters
  • pubmed_id (str) – The ID of a PubMed article. The string may start with PMID but passing just the ID also works. Examples: 27168024, PMID27168024 https://www.ncbi.nlm.nih.gov/pubmed/

  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False

  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None

  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

  • **kwargs (keyword arguments) – All other keyword arguments are passed directly to process_text.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor
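
Because None is returned when an abstract cannot be obtained, a batch run benefits from tracking which IDs failed. A minimal sketch follows, assuming a hypothetical helper read_abstracts. The process argument exists only so that the loop can be exercised without a REACH service, and it defaults to process_pubmed_abstract.

```python
def read_abstracts(pubmed_ids, process=None, **kwargs):
    """Read several abstracts; return (statements_by_pmid, failed_pmids)."""
    if process is None:
        from indra.sources.reach import api  # imported lazily; requires INDRA
        process = api.process_pubmed_abstract
    results, failed = {}, []
    for pmid in pubmed_ids:
        rp = process(pmid, **kwargs)
        if rp is None:
            # No processor was returned for this ID
            failed.append(pmid)
        else:
            results[pmid] = rp.statements
    return results, failed
```

Keyword arguments such as url or offline are passed through unchanged, so the same loop works with any of the usage modes described above.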

indra.sources.reach.api.process_text(text, citation=None, offline=False, url=None, output_fname='reach_output.json', timeout=None, organism_priority=None)[source]

Return a ReachProcessor by processing the given text.

Parameters
  • text (str) – The text to be processed.

  • citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. This is used when the text to be processed comes from a publication that is not otherwise identified. Default: None

  • offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False

  • url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None

  • output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.

  • timeout (Optional[float]) – The number of seconds to wait for the API to respond. This only applies when reading online (offline=False).

  • organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Returns

rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.

Return type

ReachProcessor
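
As an illustration of the timeout and organism_priority parameters, the sketch below prefers mouse over human groundings and waits at most 60 seconds for the service. The wrapper read_mouse_text is hypothetical, and its process_text argument exists only so the call can be checked without a running service. The Taxonomy IDs are the NCBI identifiers for mouse (10090) and human (9606).

```python
MOUSE, HUMAN = '10090', '9606'  # NCBI Taxonomy IDs


def read_mouse_text(text, timeout=60.0, process_text=None):
    """Read text, preferring mouse protein groundings over human ones."""
    if process_text is None:
        from indra.sources.reach import api  # imported lazily; requires INDRA
        process_text = api.process_text
    return process_text(
        text,
        timeout=timeout,  # only relevant when reading online
        organism_priority=[MOUSE, HUMAN],
    )
```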

REACH Processor (indra.sources.reach.processor)

class indra.sources.reach.processor.ReachProcessor(json_dict, pmid=None, organism_priority=None)[source]

The ReachProcessor extracts INDRA Statements from REACH parser output.

Parameters
  • json_dict (dict) – A JSON dictionary containing the REACH extractions.

  • pmid (Optional[str]) – The PubMed ID associated with the extractions. This can be passed in case the PMID cannot be determined from the extractions alone.

tree

The objectpath Tree object representing the extractions.

Type

objectpath.Tree

statements

A list of INDRA Statements that were extracted by the processor.

Type

list[indra.statements.Statement]

citation

The PubMed ID associated with the extractions.

Type

str

all_events

The frame IDs of all events by type in the REACH extraction.

Type

dict[str, str]

organism_priority

A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.

Type

list[str]

get_activation()[source]

Extract INDRA Activation Statements.

get_agents_from_entities()[source]

Return INDRA Agents extracted from all entities, even ones not part of events.

get_agents_from_entities_with_coords(sentence_coords=False)[source]

Return INDRA Agents extracted from all entities, even ones not part of events, along with their global document-level coordinates.

get_all_entities()[source]

Return all entities extracted, even ones not part of events.

get_all_events()[source]

Gather all event IDs in the REACH output by type.

These IDs are stored in the self.all_events dict.

get_complexes()[source]

Extract INDRA Complex Statements.

get_modifications()[source]

Extract Modification INDRA Statements.

get_regulate_amounts()[source]

Extract RegulateAmount INDRA Statements.

get_translocation()[source]

Extract INDRA Translocation Statements.

print_event_statistics()[source]

Print the number of events in the REACH output by type.
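
Since the statements attribute is a plain list, a quick summary of what a processor extracted can be built with standard tools. The helper count_statement_types below is a hypothetical convenience, not part of INDRA. It groups Statements by class name.

```python
from collections import Counter


def count_statement_types(statements):
    """Count Statements by the name of their class."""
    return Counter(type(stmt).__name__ for stmt in statements)


# Example usage, assuming rp is a ReachProcessor returned by one of the
# functions in indra.sources.reach.api:
# for stmt_type, count in count_statement_types(rp.statements).most_common():
#     print(stmt_type, count)
```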

class indra.sources.reach.processor.Site(residue, position)

property position

Alias for field number 1

property residue

Alias for field number 0
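
The "Alias for field number" descriptions suggest that Site behaves like a named tuple with two fields. The stand-in below is defined locally for illustration rather than imported from INDRA, and the example values (including the use of strings for both fields) are assumptions.

```python
from collections import namedtuple

# Local stand-in with the same field order as the documented Site class
Site = namedtuple('Site', ['residue', 'position'])

site = Site(residue='S', position='222')

# Fields can be read by name, by index, or by unpacking
residue, position = site
```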

indra.sources.reach.processor.determine_reach_subtype(event_name)[source]

Returns the category of reach rule from the reach rule instance.

Looks at a list of regular expressions corresponding to reach rule types, and returns the longest regexp that matches, or None if none of them match.

Parameters

event_name (str) – The name of the REACH rule instance to subtype

Returns

best_match – A regular expression corresponding to the reach rule that was used to extract this evidence

Return type

str
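
The selection rule described above (among all patterns that match, keep the longest one) can be sketched as follows. The pattern list here is hypothetical and far shorter than whatever INDRA uses internally, and longest_matching_pattern is an illustrative stand-in, not the library function.

```python
import re

# Hypothetical rule-type patterns, for illustration only
EXAMPLE_PATTERNS = [r'binding', r'binding_syntax', r'activation']


def longest_matching_pattern(event_name, patterns=EXAMPLE_PATTERNS):
    """Return the longest pattern that matches event_name, or None."""
    best_match = None
    for pattern in patterns:
        if re.search(pattern, event_name):
            # Prefer the longer (more specific) pattern string
            if best_match is None or len(pattern) > len(best_match):
                best_match = pattern
    return best_match
```

Choosing the longest pattern favors the most specific rule type when a general and a specific pattern both match the same rule name.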

indra.sources.reach.processor.prioritize_organism_grounding(first_id, xrefs, organism_priority)[source]

Pick a prioritized organism-specific UniProt ID for a protein.

REACH reader (indra.sources.reach.reader)

exception indra.sources.reach.reader.ReachOfflineReadingError[source]

class indra.sources.reach.reader.ReachReader[source]

The ReachReader wraps a singleton instance of the REACH reader.

This allows calling the reader many times without having to wait for it to start up each time.

api_ruler

An instance of the REACH ApiRuler class (Java object).

Type

org.clulab.reach.apis.ApiRuler

get_api_ruler()[source]

Return the existing reader if it exists or launch a new one.

Returns

api_ruler – An instance of the REACH ApiRuler class (Java object).

Return type

org.clulab.reach.apis.ApiRuler
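
The "return the existing reader or launch a new one" behavior is a lazy-initialization pattern. The sketch below illustrates the idea with a generic stand-in. LazyReader and its factory argument are hypothetical and do not reproduce the ReachReader implementation.

```python
class LazyReader:
    """Create an expensive object on first use and reuse it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds the expensive object
        self._instance = None

    def get(self):
        # Build the object only once; later calls return the cached one
        if self._instance is None:
            self._instance = self._factory()
        return self._instance
```

With the real class, the analogous calls would be reader = ReachReader() followed by reader.get_api_ruler(), where only the first call pays the startup cost.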