REACH (indra.sources.reach)
REACH is a biology-oriented machine reading system which uses a cascade of grammars to extract biological mechanisms from free text.
To cover a wide range of use cases and scenarios, there are currently 4 different ways in which INDRA can use REACH.
1. INDRA communicating with a locally running REACH Server (indra.sources.reach.api)
Setup and usage: Follow standard instructions to install SBT. Then clone REACH and run the REACH web server.
git clone https://github.com/clulab/reach.git
cd reach
sbt "runMain org.clulab.reach.export.server.ApiServer"
Alternatively, REACH can be run via Docker as follows.
git clone https://github.com/clulab/reach.git
cd reach/docker
docker build --tag reach:latest .
docker run -d -it -p 8080:8080 reach:latest
Here, -d stands for ‘detach’ and runs the service in the background.
Then read text by specifying the url parameter when using indra.sources.reach.process_text.
from indra.sources import reach
rp = reach.process_text('MEK binds ERK', url=reach.local_text_url)
One limitation here is that the REACH server is configured by default to limit the input to 2048 characters. To change this, edit the file export/src/main/resources/reference.conf in your local reach clone folder and add
http {
  server {
    // ...
    parsing {
      max-uri-length = 256k
    }
    // ...
  }
}
to increase the character limit.
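Where editing the server configuration is not practical, one possible workaround is to split long input into pieces below the limit and read each piece separately. The sketch below is only an illustration: split_text is a hypothetical helper (not part of INDRA or REACH), the 2048 default is taken from the limit mentioned above, the sentence splitting is deliberately naive, and splitting may lose context that REACH would otherwise use across sentences.

```python
import re


def split_text(text, max_chars=2048):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    Hypothetical helper for illustration; not part of INDRA or REACH.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        # A single over-long sentence is cut hard into max_chars pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full; start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to reach.process_text(chunk, url=reach.local_text_url) in turn.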
It is also possible to read NXML (string or file) and process the text of a paper given its PMC ID or PubMed ID using other API methods in indra.sources.reach.api. Note that reach.local_nxml_url needs to be used as url in case NXML content is being read.
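As a minimal sketch of the NXML case, the function below passes reach.local_nxml_url as the url argument of process_nxml_file. The wrapper read_nxml_locally and its reach_module argument are hypothetical conveniences (the latter only allows a stand-in object to be passed in for testing); a local REACH server is assumed to be running.

```python
def read_nxml_locally(nxml_path, reach_module=None):
    """Read an NXML file using a locally running REACH server.

    reach_module is a hypothetical hook for passing in a stand-in object;
    by default indra.sources.reach is imported and used.
    """
    if reach_module is None:
        from indra.sources import reach as reach_module
    # NXML content needs the NXML endpoint rather than the text endpoint.
    return reach_module.process_nxml_file(
        nxml_path, url=reach_module.local_nxml_url)
```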
Advantages:
Does not require setting up the pyjnius Python-Java bridge.
Does not require assembling a REACH JAR file.
Allows local control of the REACH version and configuration used to run the service.
REACH is running in a separate process and therefore does not need to be initialized if a new Python session is started.
Disadvantages:
First request might be time-consuming as REACH is loading additional resources.
Only endpoints exposed by the REACH web server are available, i.e., no full object-level access to REACH components.
2. INDRA communicating with the UA REACH Server (indra.sources.reach.api)
Setup and usage: Does not require any additional setup after installing INDRA.
Read text using the default values for offline and url parameters.
from indra.sources import reach
rp = reach.process_text('MEK binds ERK')
It is also possible to read NXML (string or file) and process the content of a paper given its PMC ID or PubMed ID using other functions in indra.sources.reach.api.
Advantages:
Does not require setting up the pyjnius Python-Java bridge.
Does not require assembling a REACH JAR file or installing REACH at all locally.
Suitable for initial prototyping or integration testing.
Disadvantages:
Cannot handle high-throughput reading workflows due to limited server resources.
No control over which REACH version is used to run the service.
Difficulties processing NXML-formatted text (request times out) have been observed in the past.
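Given the limited server resources and the timeouts noted above, one option is to bound the wait with the timeout parameter of process_text and retry against a locally running server. This is a sketch under assumptions: read_with_fallback and reach_module are hypothetical names, the 30-second value is arbitrary, and the code treats both a None result and a raised exception as failure because the text above does not state how a failed request is reported.

```python
def read_with_fallback(text, reach_module=None, timeout=30):
    """Try the remote REACH service first, then a local server.

    Hypothetical helper; reach_module allows a stand-in to be injected.
    """
    if reach_module is None:
        from indra.sources import reach as reach_module
    try:
        # Default url and offline values target the remote UA server.
        rp = reach_module.process_text(text, timeout=timeout)
    except Exception:
        # Assumption: a failed request may raise instead of returning None.
        rp = None
    if rp is None:
        # Retry against a locally running REACH server (usage mode 1).
        rp = reach_module.process_text(text, url=reach_module.local_text_url)
    return rp
```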
3. INDRA using a REACH JAR through a Python-Java bridge (indra.sources.reach.reader)
Setup and usage:
Follow standard instructions for installing SBT. First, the REACH system and its dependencies need to be packaged as a fat JAR:
git clone https://github.com/clulab/reach.git
cd reach
sbt assembly
This creates a JAR file in reach/target/scala[version]/reach-[version].jar. Set the REACHPATH environment variable to the absolute path of this file, then append REACHPATH to the CLASSPATH environment variable (entries are separated by colons).
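The colon-separated append described above can be sketched as a small helper. classpath_with_reach is a hypothetical function and the JAR path in the usage note is a placeholder; in practice these variables are typically exported in the shell before Python starts, since the JVM reads the class path only once at start-up.

```python
def classpath_with_reach(reach_jar, existing=None, sep=':'):
    """Return a CLASSPATH value with reach_jar appended to existing entries.

    Hypothetical helper; sep defaults to the colon separator described above.
    """
    # Drop empty entries that result from an unset or empty CLASSPATH.
    entries = [entry for entry in (existing or '').split(sep) if entry]
    # Avoid adding the JAR twice if it is already listed.
    if reach_jar not in entries:
        entries.append(reach_jar)
    return sep.join(entries)
```

For example, classpath_with_reach('/opt/reach.jar', '/a.jar:/b.jar') yields '/a.jar:/b.jar:/opt/reach.jar'.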
The pyjnius package needs to be set up and be operational. For more details, see Pyjnius setup instructions in the documentation.
Then, reading can be done using the indra.sources.reach.process_text function with the offline option.
from indra.sources import reach
rp = reach.process_text('MEK binds ERK', offline=True)
Other functions in indra.sources.reach.api can also be used with the offline option to invoke local, JAR-based reading.
Advantages:
Doesn’t require running a separate process for REACH and INDRA.
Having a single REACH JAR file makes this solution easily portable.
Through jnius, all classes in REACH become available for programmatic access.
Disadvantages:
Requires configuring pyjnius, which is often difficult (e.g., on Windows). This usage mode is therefore generally not recommended.
The ReachReader instance needs to be instantiated every time a new INDRA session is started, which is time-consuming.
4. Use REACH separately to produce output files and then process those with INDRA
In this usage mode REACH is not directly invoked by INDRA. Rather, REACH is set up and run independently of INDRA to produce output files for a set of text content. For more information on running REACH on a set of text or NXML files, see the REACH documentation at: https://github.com/clulab/reach. Note that INDRA uses the fries output format produced by REACH.
Once REACH output has been obtained in the fries JSON format, one can use indra.sources.reach.api.process_json_file in INDRA to process each JSON file.
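Processing a batch of output files might look like the sketch below, which calls process_json_file once per file and pools the resulting Statements. collect_statements is a hypothetical helper, each file is assumed to be a self-contained REACH JSON output, and the None check is a defensive assumption in case a file cannot be processed.

```python
def collect_statements(file_names, process_json_file=None):
    """Process each REACH JSON file and return all extracted Statements.

    Hypothetical helper; process_json_file can be injected for testing.
    """
    if process_json_file is None:
        from indra.sources.reach.api import process_json_file
    statements = []
    for file_name in file_names:
        rp = process_json_file(file_name)
        # Skip files for which no processor was produced.
        if rp is not None:
            statements.extend(rp.statements)
    return statements
```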
REACH API (indra.sources.reach.api)
Functions for obtaining a ReachProcessor containing INDRA Statements.
Several input formats are supported, and many of these functions run REACH.
- indra.sources.reach.api.process_agents_from_entities(file_name, organism_priority=None)
Return INDRA Agents extracted from all entities, even ones not appearing in Statements.
- Parameters
file_name (str) – The name of the json file to be processed.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
A list of INDRA Agents processed from all extracted entities.
- Return type
list[indra.statements.Agent]
- indra.sources.reach.api.process_fries_json_group(group_prefix, citation=None, organism_priority=None)
Return a ReachProcessor by processing a REACH fries output file group.
When running REACH through its CLI, for each input file, it produces three output JSON files when using the fries output format. These three files jointly constitute the output, so they have to be combined to be processed. For instance, one might have PMC9582577.uaz.entities.json, PMC9582577.uaz.events.json, PMC9582577.uaz.sentence.json.
- Parameters
group_prefix (str) – The prefix for the group of output files, e.g., PMC9582577.uaz
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
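To find the group prefixes in a directory listing, one could strip the last two dot-separated parts of each file name and keep only the prefixes for which all three files are present. This is a sketch: find_fries_groups is a hypothetical helper, and the naming scheme is inferred from the example file names above.

```python
from collections import defaultdict


def find_fries_groups(file_names, parts_per_group=3):
    """Map each group prefix to its files, keeping only complete groups.

    Hypothetical helper; assumes names of the form <prefix>.<part>.json.
    """
    groups = defaultdict(list)
    for name in file_names:
        # 'PMC9582577.uaz.events.json' -> prefix 'PMC9582577.uaz'
        pieces = name.rsplit('.', 2)
        if len(pieces) == 3 and pieces[2] == 'json':
            groups[pieces[0]].append(name)
    # Keep only prefixes for which the expected number of files exists.
    return {prefix: sorted(files) for prefix, files in groups.items()
            if len(files) == parts_per_group}
```

Each resulting prefix could then be passed to process_fries_json_group as group_prefix.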
- indra.sources.reach.api.process_json_file(file_name, citation=None, organism_priority=None)
Return a ReachProcessor by processing the given REACH json file.
The output from the REACH parser is in this json format. This function is useful if the output is saved as a file and needs to be processed. For more information on the format, see: https://github.com/clulab/reach
- Parameters
file_name (str) – The name of the json file to be processed.
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
- indra.sources.reach.api.process_json_str(json_str, citation=None, organism_priority=None)
Return a ReachProcessor by processing the given REACH json string.
The output from the REACH parser is in this json format. For more information on the format, see: https://github.com/clulab/reach
- Parameters
json_str (str) – The json string to be processed.
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
- indra.sources.reach.api.process_nxml_file(file_name, citation=None, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)
Return a ReachProcessor by processing the given NXML file.
NXML is the format used by PubMed Central for papers in the open access subset.
- Parameters
file_name (str) – The name of the NXML file to be processed.
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
- indra.sources.reach.api.process_nxml_str(nxml_str, citation=None, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)
Return a ReachProcessor by processing the given NXML string.
NXML is the format used by PubMed Central for papers in the open access subset.
- Parameters
nxml_str (str) – The NXML string to be processed.
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. Default: None
offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
- indra.sources.reach.api.process_pmc(pmc_id, offline=False, url=None, output_fname='reach_output.json', organism_priority=None)
Return a ReachProcessor by processing a paper with a given PMC ID.
Uses the PMC client to obtain the full text. If it’s not available, None is returned.
- Parameters
pmc_id (str) – The ID of a PubMed Central article. The string may start with PMC but passing just the ID also works. Examples: 8511698, PMC8511698 https://www.ncbi.nlm.nih.gov/pmc/
offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
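Because process_pmc returns None when the full text is not available, callers reading several papers may want to track which IDs produced no output. A minimal sketch, with read_pmc_papers as a hypothetical helper:

```python
def read_pmc_papers(pmc_ids, process_pmc=None):
    """Return (processors, missing) for a list of PMC IDs.

    Hypothetical helper; process_pmc can be injected for testing.
    """
    if process_pmc is None:
        from indra.sources.reach.api import process_pmc
    processors, missing = {}, []
    for pmc_id in pmc_ids:
        rp = process_pmc(pmc_id)
        if rp is None:
            # Full text was not available for this ID.
            missing.append(pmc_id)
        else:
            processors[pmc_id] = rp
    return processors, missing
```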
- indra.sources.reach.api.process_pubmed_abstract(pubmed_id, offline=False, url=None, output_fname='reach_output.json', **kwargs)
Return a ReachProcessor by processing an abstract with a given PubMed ID.
Uses the PubMed client to get the abstract. If that fails, None is returned.
- Parameters
pubmed_id (str) – The ID of a PubMed article. The string may start with PMID but passing just the ID also works. Examples: 27168024, PMID27168024 https://www.ncbi.nlm.nih.gov/pubmed/
offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
**kwargs (keyword arguments) – All other keyword arguments are passed directly to process_text.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
- indra.sources.reach.api.process_text(text, citation=None, offline=False, url=None, output_fname='reach_output.json', timeout=None, organism_priority=None)
Return a ReachProcessor by processing the given text.
- Parameters
text (str) – The text to be processed.
citation (Optional[str]) – A PubMed ID passed to be used in the evidence for the extracted INDRA Statements. This is used when the text to be processed comes from a publication that is not otherwise identified. Default: None
offline (Optional[bool]) – If set to True, the REACH system is run offline via a JAR file. Otherwise (by default) the web service is called. Default: False
url (Optional[str]) – URL for a REACH web service instance, which is used for reading if provided. If not provided but offline is set to False (its default value), the Arizona REACH web service is called (http://agathon.sista.arizona.edu:8080/odinweb/api/help). Default: None
output_fname (Optional[str]) – The file to output the REACH JSON output to. Defaults to reach_output.json in current working directory.
timeout (Optional[float]) – This only applies when reading online (offline=False). Only wait for timeout seconds for the api to respond.
organism_priority (Optional[list of str]) – A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- Returns
rp – A ReachProcessor containing the extracted INDRA Statements in rp.statements.
- Return type
ReachProcessor
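The optional parameters can be combined in a single call, as in the sketch below. read_mouse_first is a hypothetical wrapper; '10090' and '9606' are the NCBI Taxonomy IDs for mouse and human, chosen here only to illustrate a non-default priority order, and the 60-second timeout is an arbitrary value.

```python
def read_mouse_first(text, pmid=None, process_text=None):
    """Read text while preferring mouse over human protein groundings.

    Hypothetical wrapper; process_text can be injected for testing.
    """
    if process_text is None:
        from indra.sources.reach.api import process_text
    return process_text(
        text,
        citation=pmid,                        # PubMed ID recorded in the evidence
        organism_priority=['10090', '9606'],  # mouse first, then human
        timeout=60,                           # seconds; online reading only
    )
```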
REACH Processor (indra.sources.reach.processor)
- class indra.sources.reach.processor.ReachProcessor(json_dict, pmid=None, organism_priority=None)
The ReachProcessor extracts INDRA Statements from REACH parser output.
- Parameters
- tree
The objectpath Tree object representing the extractions.
- Type
objectpath.Tree
- statements
A list of INDRA Statements that were extracted by the processor.
- Type
list[indra.statements.Statement]
- organism_priority
A list of Taxonomy IDs providing prioritization among organisms when choosing protein grounding. If not given, the default behavior takes the first match produced by Reach, which is prioritized to be a human protein if such a match exists.
- get_agents_from_entities()
Return INDRA Agents extracted from all entities, even ones not part of events.
- class indra.sources.reach.processor.Site(residue, position)
- property position
Alias for field number 1
- property residue
Alias for field number 0
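The "alias for field number" wording is what Python's namedtuple generates for its fields, so Site can be read as a two-field tuple. The stand-in below mirrors the documented field order without importing INDRA; the residue and position values are made up for illustration.

```python
from collections import namedtuple

# Stand-in mirroring the documented field order; the real class is
# indra.sources.reach.processor.Site.
Site = namedtuple('Site', ['residue', 'position'])

site = Site(residue='S', position='222')  # illustrative values only
residue, position = site                  # unpacks in field order
```

Here site[0] and site.residue refer to the same value, as do site[1] and site.position.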
- indra.sources.reach.processor.determine_reach_subtype(event_name)
Returns the category of reach rule from the reach rule instance.
Looks at a list of regular expressions corresponding to reach rule types, and returns the longest regexp that matches, or None if none of them match.
- Parameters
event_name (str) – The name of the REACH rule (event) to subtype.
- Returns
best_match – A regular expression corresponding to the reach rule that was used to extract this evidence
- Return type
str
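The "longest matching regular expression" selection described above can be sketched as follows. The patterns in EXAMPLE_RULE_REGEXPS and the function name are hypothetical; the actual list of REACH rule patterns lives in INDRA and is not reproduced here, and tie-breaking between equally long patterns is an arbitrary choice in this sketch.

```python
import re

# Hypothetical patterns for illustration only.
EXAMPLE_RULE_REGEXPS = [
    r'^Phosphorylation_syntax',
    r'^Phosphorylation',
    r'^Binding',
]


def longest_matching_regexp(event_name, regexps=EXAMPLE_RULE_REGEXPS):
    """Return the longest pattern that matches event_name, or None."""
    matches = [rx for rx in regexps if re.search(rx, event_name)]
    # With ties, max() returns the first pattern of maximal length.
    return max(matches, key=len) if matches else None
```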
REACH reader (indra.sources.reach.reader)
- class indra.sources.reach.reader.ReachReader
The ReachReader wraps a singleton instance of the REACH reader.
This allows calling the reader many times without having to wait for it to start up each time.
- api_ruler
An instance of the REACH ApiRuler class (java object).
- Type
org.clulab.reach.apis.ApiRuler
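The idea of wrapping a singleton so that slow start-up is paid only once can be illustrated in plain Python. This is a generic sketch of the pattern, not the actual ReachReader implementation: LazyReader and the factory argument are hypothetical, and no Java bridge is involved.

```python
class LazyReader:
    """Create the wrapped reader once and reuse it on later calls."""

    _instance = None  # shared across all LazyReader objects

    def __init__(self, factory):
        # factory is any zero-argument callable that builds the reader.
        self._factory = factory

    def get(self):
        if LazyReader._instance is None:
            # The slow start-up happens only on first access.
            LazyReader._instance = self._factory()
        return LazyReader._instance
```

Repeated calls to get(), even from different LazyReader objects, return the same underlying instance.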