Source code for indra.sources.indra_db_rest.api

"""
INDRA has been used to generate and maintain a database of causal
relations as INDRA Statements. The contents of the INDRA Database can be
accessed programmatically through this API.

The API includes three high-level query functions that cover many common use
cases:

  :func:`get_statements`:
      Get statements by agent information and Statement type,  e.g. "Statements
      with object MEK and type Inhibition" (This query function has a generic
      name to maintain backward compatibility.)

  :func:`get_statements_for_paper`:
      Get Statements based on the papers they are drawn from, for instance
      "Statements from the paper with PMID 12345".

  :func:`get_statements_by_hash`:
      Distinct INDRA Statements are associated with a unique numeric hash.
      This endpoint can be used to query the database for provenance

Queries with more complex constraints can be made using the query language
API in :py:module:`indra.sources.indra_db_rest.query` along with this function:

  :func:`get_statements_from_query`:
      This function works alongside the Query "language" to execute arbitrary
      requests for Statements based on statement metadata indexed in
      the Database.

There are also two functions relating to the submission and retrieval of
curations.  It is possible to enter feedback the correctness of text-mined
Statements, which we call "curations". :func:`submit_curations`
allows you to submit your curations, and :func:`get_curations` allows you to
retrieve existing curations (an API key is required).

Limits, timeouts and threading
------------------------------

Some queries may return a large number of statements, requiring the client to
assemble results from multiple successive requests to the REST API. The
behavior of the client can be controlled by several parameters to the query
functions.

For example, consider the query for Statements whose subject is TNF:

>>>
>> from indra.sources.indra_db_rest.api import get_statements
>> p = get_statements("TNF")
>> stmts = p.statements

Because there are many Statements associated with TNF, the client will make
multiple paged requests to get all the results. The maximum number of
Statements returned can be limited using the `limit` argument:

>>>
>> p = get_statements("TNF", limit=1000)
>> stmts = p.statements

For longer requests the client can work in a background thread after a timeout
is reached. This can be done by specifying a timeout (in seconds) using the
`timeout` argument. While the client continues retrieval, the first page
of the statement results is available in the `statements_sample` attribute:

>>>
>> p = get_statements("TNF", timeout=5)
>> some_stmts = p.statements_sample
>>
>> # ...Do some other work...
>>
>> # Wait for the requests to finish before getting the final result.
>> p.wait_until_done()
>> stmts = p.statements

Note that the timeout specifies how long the client should block for the
result, but that the result will continue to be retrieved until it is completed
on a background thread. If desired one can supply a timeout of 0 and get the
processor immediately, leaving the entire query to happen in the background.

You can check if the process is still running using the `is_working` method:

>>>
>> p = get_statements("TNF", timeout=0)
>> p.is_working()
True

If you don't want the client to make multiple paged requests and instead want
to get only the results from the first request, you can set "persist" to False
(the request job can still be put in the background with `timeout=0`).

>>>
>> p = get_statements("TNF", persist=False)
>> stmts = p.statements

For additional details on these and other parameters controlling statement
retrieval see the function documentation.


Using the Query Language
------------------------

There are several metadata and data values indexed in the INDRA Database
allowing for complex queries. Using the Query language these attributes can be
combined in arbitrary ways using logical operators. For example, you may want
to find Statements that MEK is inhibited found in papers related to breast
cancer and that also have more than 10 evidence:

>>>
>> from indra.sources.indra_db_rest.api import get_statements_from_query
>> from indra.sources.indra_db_rest.query import HasAgent, HasType, \\
>>     FromMeshIds, HasEvidenceBound
>>
>> query = (HasAgent("MEK", namespace="FPLX") & HasType(["Inhibition"])
>>          & FromMeshIds(["D001943"]) & HasEvidenceBound(["> 10"]))
>>
>> p = get_statements_from_query(query)
>> stmts = p.statements

In addition to joining constraints with "&" (an intersection, an "and") as shown
above, you can also form unions (a.k.a. "or"s) using "|":

>>>
>> query = (
>>     (
>>         HasAgent("MEK", namespace="FPLX")
>>         | HasAgent("MAP2K1", namespace="HGNC-SYMBOL")
>>     )
>>     & HasType(['Inhibition'])
>> )
>>
>> p = get_statements_from_query(query, limit=10)

For more details and examples of the Query architecture, see
:py:mod:`query <indra.sources.indra_db_rest.query>`.


Evidence Filtering
------------------

Queries can constrain results based on a property of the original evidence
text, so anything from the text references (like pmid) to the readers included
and whether the evidence is from a reading or a database, can all have an
effect on the evidences included in the result. By default, such queries filter
not only the statements but also their associated evidence, so that, for
example, if you query for Statements from a given paper, the evidences
returned with the Statements you queried are only from that paper.

>>>
>> p = get_statements_for_papers([('pmid', '20471474'),
>>                                ('pmcid', 'PMC3640704')])
>> all(ev.text_refs['PMID'] == '20471474'
>>     or ev.text_refs['PMCID'] == 'PMC3640704'
>>     for s in p.statements for ev in s.evidence)
True

You can deactivate this feature by setting `filter_ev` to False:

>>>
>> p = get_statements_for_papers([('pmid', '20471474'),
>>                                ('pmcid', 'PMC3640704')], filter_ev=False)
>> all(ev.text_refs['PMID'] == '20471474'
>>     or ev.text_refs['PMCID'] == 'PMC3640704'
>>     for s in p.statements for ev in s.evidence)
False


Curation Submission
-------------------

Suppose you run a query and get some Statements with some evidence; you look
through the results and find an evidence that does not really support the
Statement. Using the API it is possible to provide feedback by submitting a
curation.

>>>
>> from indra.statements import pretty_print_stmts
>> p = get_statements(agents=["TNF"], ev_limit=3, limit=1)
>> pretty_print_stmts(p.statements)
[LIST INDEX: 0] Activation(TNF(), apoptotic process())
================================================================================
EV INDEX: 0       These published reports in their aggregate support that TNFR2
SOURCE: reach     can lower the threshold of bioavailable TNFalpha needed to
PMID: 19774075    cause apoptosis through TNFR1 thus amplifying extrinsic cell
                  death pathways.
--------------------------------------------------------------------------------
EV INDEX: 1       Our results indicate that IE86 inhibits tumor necrosis factor
SOURCE: reach     (TNF)-alpha induced apoptosis and that the anti-apoptotic
PMID: 19502735    activity of this viral protein correlates with its expression
                  levels.
--------------------------------------------------------------------------------
EV INDEX: 2       This relationship between PUFAs and their anti-inflammatory
SOURCE: reach     metabolites and type 1 DM is supported by the observation that
PMID: 28824543    in a mfat-1 transgenic mouse model whose islets contained
                  increased levels of n-3 PUFAs and significantly lower amounts
                  of n-6 PUFAs compared to the wild type, were resistant to
                  apoptosis induced by TNF-alpha, IL-1beta, and gamma-IFN.
--------------------------------------------------------------------------------
>>
>> submit_curation(p.statements[0].get_hash(), "correct", "usr@bogusemail.com",
>>                 pa_json=p.statements[0].to_json(),
>>                 ev_json=p.statements[0].evidence[1].to_json())
{'ref': {'id': 11919}, 'result': 'success'}
"""


__all__ = ['get_statements', 'get_statements_for_papers',
           'get_statements_for_paper', 'get_statements_by_hash',
           'get_statements_from_query', 'submit_curation', 'get_curations']

from indra.util import clockit
from indra.statements import Complex, SelfModification,  ActiveForm, \
    Translocation, Conversion

from indra.sources.indra_db_rest.query import *
from indra.sources.indra_db_rest.processor import DBQueryStatementProcessor
from indra.sources.indra_db_rest.util import make_db_rest_request, get_url_base


[docs]@clockit def get_statements(subject=None, object=None, agents=None, stmt_type=None, use_exact_type=False, limit=None, persist=True, timeout=None, strict_stop=False, ev_limit=10, sort_by='ev_count', tries=3, use_obtained_counts=False, api_key=None): """Get Statements from the INDRA DB web API matching given agents and type. You get a :py:class:`DBQueryStatementProcessor <indra.sources.indra_db_rest.processor.DBQueryStatementProcessor>` object, which allow Statements to be loaded in a background thread, providing a sample of the "best" content available promptly in the ``sample_statements`` attribute, and populates the statements attribute when the paged load is complete. The "best" is determined by the ``sort_by`` attribute, which may be either 'belief' or 'ev_count' or None. Parameters ---------- subject/object : str Optionally specify the subject and/or object of the statements you wish to get from the database. By default, the namespace is assumed to be HGNC gene names, however you may specify another namespace by including "@<namespace>" at the end of the name string. For example, if you want to specify an agent by chebi, you could use "CHEBI:6801@CHEBI", or if you wanted to use the HGNC id, you could use "6871@HGNC". agents : list[str] A list of agents, specified in the same manner as subject and object, but without specifying their grammatical position. stmt_type : str Specify the types of interactions you are interested in, as indicated by the sub-classes of INDRA's Statements. This argument is *not* case sensitive. If the statement class given has sub-classes (e.g. RegulateAmount has IncreaseAmount and DecreaseAmount), then both the class itself, and its subclasses, will be queried, by default. If you do not want this behavior, set use_exact_type=True. Note that if max_stmts is set, it is possible only the exact statement type will be returned, as this is the first searched. The processor then cycles through the types, getting a page of results for each type and adding it to the quota, until the max number of statements is reached. use_exact_type : bool If stmt_type is given, and you only want to search for that specific statement type, set this to True. Default is False. limit : Optional[int] Select the maximum number of statements to return. When set less than 500 the effect is much the same as setting persist to false, and will guarantee a faster response. Default is None. persist : bool Default is True. When False, if a query comes back limited (not all results returned), just give up and pass along what was returned. Otherwise, make further queries to get the rest of the data (which may take some time). timeout : positive int or None If an int, block until the work is done and statements are retrieved, or until the timeout has expired, in which case the results so far will be returned in the response object, and further results will be added in a separate thread as they become available. Block indefinitely until all statements are retrieved. Default is None. strict_stop : bool If True, the query will only be given `timeout` time to complete before being abandoned entirely. Otherwise the timeout will simply wait for the thread to join for `timeout` seconds before returning, allowing other work to continue while the query runs in the background. The default is False. ev_limit : Optional[int] Limit the amount of evidence returned per Statement. Default is 10. sort_by : Optional[str] Str options are currently 'ev_count' or 'belief'. Results will return in order of the given parameter. If None, results will be turned in an arbitrary order. tries : Optional[int] Set the number of times to try the query. The database often caches results, so if a query times out the first time, trying again after a timeout will often succeed fast enough to avoid a timeout. This can also help gracefully handle an unreliable connection, if you're willing to wait. Default is 3. use_obtained_counts : Optional[bool] If True, evidence counts and source counts are reported based on the actual evidences returned for each statement in this query (as opposed to all existing evidences, even if not all were returned). Default: False api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. Returns ------- processor : :py:class:`DBQueryStatementProcessor` An instance of the DBQueryStatementProcessor, which has an attribute ``statements`` which will be populated when the query/queries are done. """ query = EmptyQuery() def add_agent(ag_str, role): if ag_str is None: return nonlocal query if '@' in ag_str: ag_id, ag_ns = ag_str.split('@') else: ag_id = ag_str ag_ns = 'NAME' query &= HasAgent(ag_id, ag_ns, role=role) add_agent(subject, 'subject') add_agent(object, 'object') if agents is not None: for ag in agents: add_agent(ag, None) if stmt_type is not None: query &= HasType([stmt_type], include_subclasses=not use_exact_type) if isinstance(query, EmptyQuery): raise ValueError("No constraints provided.") return DBQueryStatementProcessor(query, limit=limit, persist=persist, ev_limit=ev_limit, timeout=timeout, sort_by=sort_by, tries=tries, strict_stop=strict_stop, use_obtained_counts=use_obtained_counts, api_key=api_key)
[docs]@clockit def get_statements_by_hash(hash_list, limit=None, ev_limit=10, sort_by='ev_count', persist=True, timeout=None, strict_stop=False, tries=3, api_key=None): """Get Statements from a list of hashes. Parameters ---------- hash_list : list[int or str] A list of statement hashes. limit : Optional[int] Select the maximum number of statements to return. When set less than 500 the effect is much the same as setting persist to false, and will guarantee a faster response. Default is None. ev_limit : Optional[int] Limit the amount of evidence returned per Statement. Default is 10. sort_by : Optional[str] Options are currently 'ev_count' or 'belief'. Results will return in order of the given parameter. If None, results will be turned in an arbitrary order. persist : bool Default is True. When False, if a query comes back limited (not all results returned), just give up and pass along what was returned. Otherwise, make further queries to get the rest of the data (which may take some time). timeout : positive int or None If an int, return after `timeout` seconds, even if query is not done. Default is None. strict_stop : bool If True, the query will only be given `timeout` time to complete before being abandoned entirely. Otherwise the timeout will simply wait for the thread to join for `timeout` seconds before returning, allowing other work to continue while the query runs in the background. The default is False. tries : int > 0 Set the number of times to try the query. The database often caches results, so if a query times out the first time, trying again after a timeout will often succeed fast enough to avoid a timeout. This can also help gracefully handle an unreliable connection, if you're willing to wait. Default is 3. api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. Returns ------- processor : :py:class:`DBQueryStatementProcessor` An instance of the DBQueryStatementProcessor, which has an attribute `statements` which will be populated when the query/queries are done. """ return DBQueryStatementProcessor(HasHash(hash_list), limit=limit, ev_limit=ev_limit, sort_by=sort_by, persist=persist, timeout=timeout, tries=tries, strict_stop=strict_stop, api_key=api_key)
def get_statements_for_paper(*args, **kwargs): from warnings import warn warn("`get_statements_for_paper` has been replaced with " "`get_statements_for_papers`.", DeprecationWarning) return get_statements_for_papers(*args, **kwargs)
[docs]@clockit def get_statements_for_papers(ids, limit=None, ev_limit=10, sort_by='ev_count', persist=True, timeout=None, strict_stop=False, tries=3, filter_ev=True, api_key=None): """Get Statements extracted from the papers with the given ref ids. Parameters ---------- ids : list[str, str] A list of tuples with ids and their type. For example: ``[('pmid', '12345'), ('pmcid', 'PMC12345')]`` The type can be any one of 'pmid', 'pmcid', 'doi', 'pii', 'manuscript_id', or 'trid', which is the primary key id of the text references in the database. limit : Optional[int] Select the maximum number of statements to return. When set less than 500 the effect is much the same as setting persist to false, and will guarantee a faster response. Default is None. ev_limit : Optional[int] Limit the amount of evidence returned per Statement. Default is 10. filter_ev : bool Indicate whether evidence should have the same filters applied as the statements themselves, where appropriate (e.g. in the case of a filter by paper). sort_by : Optional[str] Options are currently 'ev_count' or 'belief'. Results will return in order of the given parameter. If None, results will be turned in an arbitrary order. persist : bool Default is True. When False, if a query comes back limited (not all results returned), just give up and pass along what was returned. Otherwise, make further queries to get the rest of the data (which may take some time). timeout : positive int or None If an int, return after `timeout` seconds, even if query is not done. Default is None. strict_stop : bool If True, the query will only be given `timeout` time to complete before being abandoned entirely. Otherwise the timeout will simply wait for the thread to join for `timeout` seconds before returning, allowing other work to continue while the query runs in the background. The default is False. tries : int > 0 Set the number of times to try the query. The database often caches results, so if a query times out the first time, trying again after a timeout will often succeed fast enough to avoid a timeout. This can also help gracefully handle an unreliable connection, if you're willing to wait. Default is 3. api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. Returns ------- processor : :py:class:`DBQueryStatementProcessor` An instance of the DBQueryStatementProcessor, which has an attribute `statements` which will be populated when the query/queries are done. """ return DBQueryStatementProcessor(FromPapers(ids), limit=limit, ev_limit=ev_limit, sort_by=sort_by, persist=persist, timeout=timeout, tries=tries, filter_ev=filter_ev, strict_stop=strict_stop, api_key=api_key)
[docs]@clockit def get_statements_from_query(query, limit=None, ev_limit=10, sort_by='ev_count', persist=True, timeout=None, strict_stop=False, tries=3, filter_ev=True, use_obtained_counts=False, api_key=None): """Get Statements using a Query. Example ------- >>> >> from indra.sources.indra_db_rest.query import HasAgent, FromMeshIds >> query = HasAgent("MEK", "FPLX") & FromMeshIds(["D001943"]) >> p = get_statements_from_query(query, limit=100) >> stmts = p.statements Parameters ---------- query : :py:class:`Query` The query to be evaluated in return for statements. limit : Optional[int] Select the maximum number of statements to return. When set less than 500 the effect is much the same as setting persist to false, and will guarantee a faster response. Default is None. ev_limit : Optional[int] Limit the amount of evidence returned per Statement. Default is 10. filter_ev : bool Indicate whether evidence should have the same filters applied as the statements themselves, where appropriate (e.g. in the case of a filter by paper). sort_by : Optional[str] Options are currently 'ev_count' or 'belief'. Results will return in order of the given parameter. If None, results will be turned in an arbitrary order. persist : bool Default is True. When False, if a query comes back limited (not all results returned), just give up and pass along what was returned. Otherwise, make further queries to get the rest of the data (which may take some time). timeout : positive int or None If an int, return after ``timeout`` seconds, even if query is not done. Default is None. strict_stop : bool If True, the query will only be given `timeout` time to complete before being abandoned entirely. Otherwise the timeout will simply wait for the thread to join for `timeout` seconds before returning, allowing other work to continue while the query runs in the background. The default is False. use_obtained_counts : Optional[bool] If True, evidence counts and source counts are reported based on the actual evidences returned for each statement in this query (as opposed to all existing evidences, even if not all were returned). Default: False tries : Optional[int] Set the number of times to try the query. The database often caches results, so if a query times out the first time, trying again after a timeout will often succeed fast enough to avoid a timeout. This can also help gracefully handle an unreliable connection, if you're willing to wait. Default is 3. api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. Returns ------- processor : :py:class:`DBQueryStatementProcessor` An instance of the DBQueryStatementProcessor, which has an attribute `statements` which will be populated when the query/queries are done. """ return DBQueryStatementProcessor(query, limit=limit, ev_limit=ev_limit, sort_by=sort_by, persist=persist, timeout=timeout, tries=tries, filter_ev=filter_ev, strict_stop=strict_stop, use_obtained_counts=use_obtained_counts, api_key=api_key)
[docs]def submit_curation(hash_val, tag, curator_email, text=None, source='indra_rest_client', ev_hash=None, pa_json=None, ev_json=None, api_key=None, is_test=False): """Submit a curation for the given statement at the relevant level. Parameters ---------- hash_val : int The hash corresponding to the statement. tag : str A very short phrase categorizing the error or type of curation, e.g. "grounding" for a grounding error, or "correct" if you are marking a statement as correct. curator_email : str The email of the curator. text : str A brief description of the problem. source : str The name of the access point through which the curation was performed. The default is 'direct_client', meaning this function was used directly. Any higher-level application should identify itself here. ev_hash : int A hash of the sentence and other evidence information. Elsewhere referred to as `source_hash`. pa_json : None or dict The JSON of a statement you wish to curate. If not given, it may be inferred (best effort) from the given hash. ev_json : None or dict The JSON of an evidence you wish to curate. If not given, it cannot be inferred. api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. is_test : bool Used in testing. If True, no curation will actually be added to the database. """ data = {'tag': tag, 'text': text, 'email': curator_email, 'source': source, 'ev_hash': ev_hash, 'pa_json': pa_json, 'ev_json': ev_json} url = 'curation/submit/%s' % hash_val if is_test: qstr = '?test' else: qstr = '' resp = make_db_rest_request('post', url, qstr, data=data, api_key=api_key) return resp.json()
[docs]def get_curations(hash_val=None, source_hash=None, api_key=None): """Get the curations for a specific statement and evidence. If neither hash_val nor source_hash are given, all curations will be retrieved. This will require the user to have extra permissions, as determined by their API key. Parameters ---------- hash_val : Optional[int] The hash of a statement whose curations you want to retrieve. source_hash : Optional[int] The hash generated for a piece of evidence for which you want curations. The `hash_val` must be provided to use the `source_hash`. api_key : Optional[str] Override or use in place of the API key given in the INDRA config file. Returns ------- curations : list A list of dictionaries containing the curation data. """ url = 'curation/list' if hash_val is not None: url += f'/{hash_val}' if source_hash is not None: url += f'/{source_hash}' elif source_hash is not None: raise ValueError("A hash_val must be given if a source_hash is given.") resp = make_db_rest_request('get', url, api_key=api_key) return resp.json()
def get_statement_queries(stmts, fallback_ns='NAME', pick_ns_fun=None, **params): """Get queries used to search based on a statement. In addition to the stmts, you can enter any parameters standard to the query. See https://github.com/indralab/indra_db/rest_api for a full list. Parameters ---------- stmts : list[Statement] A list of INDRA statements. fallback_ns : Optional[str] The name space to search by when an Agent in a Statement is not grounded to one of the standardized name spaces. Typically, searching by 'NAME' (i.e., the Agent's name) is a good option if (1) An Agent's grounding is missing but its name is known to be standard in one of the name spaces. In this case the name-based lookup will yield the same result as looking up by grounding. Example: MAP2K1(db_refs={}) (2) Any Agent that is encountered with the same name as this Agent is never standardized, so looking up by name yields the same result as looking up by TEXT. Example: xyz(db_refs={'TEXT': 'xyz'}) Searching by TEXT is better in other cases e.g., when the given specific Agent is not grounded but we have other Agents with the same TEXT that are grounded and then standardized to a different name. Example: Erk(db_refs={'TEXT': 'Erk'}). Default: 'NAME' pick_ns_fun : Optional[function] An optional user-supplied function which takes an Agent as input and returns a string of the form value@ns where 'value' will be looked up in namespace 'ns' to search for the given Agent. **params : kwargs A set of keyword arguments that are added as parameters to the query URLs. """ def pick_ns(ag): # If the Agent has grounding, in order of preference, in any of these # name spaces then we look it up based on grounding. for ns in ['FPLX', 'HGNC', 'UP', 'CHEBI', 'GO', 'MESH']: if ns in ag.db_refs: dbid = ag.db_refs[ns] return '%s@%s' % (dbid, ns) # Otherwise we fall back on searching by NAME or TEXT # (or any other given name space as long as the Agent name can be # usefully looked up in that name space). return '%s@%s' % (ag.name, fallback_ns) pick_ns_fun = pick_ns if not pick_ns_fun else pick_ns_fun queries = [] url_base = get_url_base('statements/from_agents') non_binary_statements = (Complex, SelfModification, ActiveForm, Translocation, Conversion) for stmt in stmts: kwargs = {} if not isinstance(stmt, non_binary_statements): for pos, ag in zip(['subject', 'object'], stmt.agent_list()): if ag is not None: kwargs[pos] = pick_ns_fun(ag) else: for i, ag in enumerate(stmt.agent_list()): if ag is not None: kwargs['agent%d' % i] = pick_ns_fun(ag) kwargs['type'] = stmt.__class__.__name__ kwargs.update(params) query_str = '?' + '&'.join(['%s=%s' % (k, v) for k, v in kwargs.items() if v is not None]) queries.append(url_base + query_str) return queries