Entity grounding mapping and standardization (indra.preassembler.grounding_mapper)

Grounding mapping

class indra.preassembler.grounding_mapper.mapper.GroundingMapper(grounding_map=None, agent_map=None, ignores=None, misgrounding_map=None, use_adeft=True, gilda_mode=None)[source]

Maps grounding of INDRA Agents based on a given grounding map.

Each parameter, if not provided, results in loading the corresponding built-in grounding resource. To explicitly avoid loading the default, pass in an empty data structure for that parameter, e.g., ignores=[].

Parameters
  • grounding_map (Optional[dict]) – The grounding map, a dictionary mapping strings (entity names) to a dictionary of database identifiers.

  • agent_map (Optional[dict]) – A dictionary mapping strings to grounded INDRA Agents with given state.

  • ignores (Optional[list]) – A list of entity strings that, if encountered, will result in the corresponding Statement being discarded.

  • misgrounding_map (Optional[dict]) – A mapping dict similar to the grounding map which maps entity strings to a given grounding which is known to be incorrect and should be removed if encountered (making the remaining Agent ungrounded).

  • use_adeft (Optional[bool]) – If True, an attempt is made to use Adeft for disambiguation of acronyms. Default: True

  • gilda_mode (Optional[str]) – If None, Gilda will not be used at all. If ‘web’, the GILDA_URL setting from the config file or an environment variable is assumed to be the web service endpoint through which Gilda is used. If ‘local’, the gilda Python package is assumed to be installed and is used.
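
The shapes of these resources can be illustrated with plain dictionaries. The following is a minimal sketch that does not use INDRA itself: the function apply_maps, the placeholder identifiers, and the assumed shape of the misgrounding map (entity text mapped to a dict of name space and ID) are all hypothetical, and the real GroundingMapper does considerably more (Agent state, Adeft, Gilda).

```python
# Hypothetical resources shaped as described in the parameter list above.
grounding_map = {'ERK': {'FPLX': 'ERK'}}          # entity text -> db identifiers
ignores = ['figure']                               # texts whose Statements are dropped
misgrounding_map = {'XYZ': {'NS1': 'BAD_ID'}}      # known-incorrect groundings


def apply_maps(text, db_refs, grounding_map, ignores, misgrounding_map):
    """Return new db_refs for an entity text, or None if the text is ignored."""
    if text in ignores:
        # The corresponding Statement would be discarded
        return None
    if text in grounding_map:
        # Assumption: a mapped entry replaces the grounding but keeps TEXT
        new_refs = {'TEXT': text}
        new_refs.update(grounding_map[text])
        return new_refs
    # Remove any grounding listed as a known misgrounding for this text
    bad = misgrounding_map.get(text, {})
    return {ns: id_ for ns, id_ in db_refs.items() if bad.get(ns) != id_}


mapped = apply_maps('ERK', {'TEXT': 'ERK', 'NS1': 'ID_1'},
                    grounding_map, ignores, misgrounding_map)
```

In this sketch an entity whose only grounding is a known misgrounding ends up with just its TEXT entry, which corresponds to the "ungrounded" outcome described for misgrounding_map.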

static check_grounding_map(gm)[source]

Run sanity checks on the grounding map, raise error if needed.

map_agent(agent, do_rename)[source]

Return the given Agent with its grounding mapped.

This function grounds a single agent. It returns the new Agent object (which might be a different object if we load a new agent state from json) or the same object otherwise.

Parameters
  • agent (indra.statements.Agent) – The Agent to map.

  • do_rename (bool) – If True, the Agent name is updated based on the mapped grounding. If do_rename is True the priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot.

Returns

grounded_agent – The grounded Agent.

Return type

indra.statements.Agent

map_agents_for_stmt(stmt, do_rename=True)[source]

Return a new Statement whose agents have been grounding mapped.

Parameters
  • stmt (indra.statements.Statement) – The Statement whose agents need mapping.

  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. If do_rename is True the priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True

Returns

mapped_stmt – The mapped Statement.

Return type

indra.statements.Statement

map_stmts(stmts, do_rename=True)[source]

Return a new list of statements whose agents have been mapped.

Parameters
  • stmts (list of indra.statements.Statement) – The statements whose agents need mapping

  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. If do_rename is True the priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True

Returns

mapped_stmts – A list of statements given by mapping the agents from each statement in the input list

Return type

list of indra.statements.Statement

static rename_agents(stmts)[source]

Return a list of mapped statements with updated agent names.

Creates a new list of statements without modifying the original list.

Parameters

stmts (list of indra.statements.Statement) – List of statements whose Agents need their names updated.

Returns

mapped_stmts – A new list of Statements with updated Agent names

Return type

list of indra.statements.Statement

static standardize_agent_name(agent, standardize_refs=True)[source]

Standardize the name of an Agent based on grounding information.

If an agent contains a FamPlex grounding, the FamPlex ID is used as a name. Otherwise if it contains a Uniprot ID, an attempt is made to find the associated HGNC gene name. If one can be found it is used as the agent name and the associated HGNC ID is added as an entry to the db_refs. Similarly, CHEBI, MESH and GO IDs are used in this order of priority to assign a standardized name to the Agent. If no relevant IDs are found, the name is not changed.

Parameters
  • agent (indra.statements.Agent) – An INDRA Agent whose name attribute should be standardized based on grounding information.

  • standardize_refs (Optional[bool]) – If True, this function assumes that the Agent’s db_refs need to be standardized, e.g., HGNC mapped to UP. Default: True
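
The priority order described above can be sketched as follows. This is an illustrative, self-contained approximation rather than INDRA's implementation: pick_standard_name and the two lookup tables are hypothetical stand-ins for INDRA's database clients, the identifiers are placeholders, and falling through to CHEBI, MESH and GO when no HGNC name is found for a Uniprot ID is an assumption. The sketch also omits adding the HGNC ID to db_refs.

```python
def pick_standard_name(db_refs, up_to_hgnc_name, names_by_ns):
    """Return a standardized name, or None if no relevant ID is found.

    up_to_hgnc_name and names_by_ns are hypothetical lookup tables.
    """
    # 1. A FamPlex ID is used directly as the name
    if 'FPLX' in db_refs:
        return db_refs['FPLX']
    # 2. A Uniprot ID is used to look up the HGNC gene name, if available
    if 'UP' in db_refs:
        hgnc_name = up_to_hgnc_name.get(db_refs['UP'])
        if hgnc_name is not None:
            return hgnc_name
    # 3. CHEBI, MESH and GO are tried in this order
    for ns in ('CHEBI', 'MESH', 'GO'):
        if ns in db_refs:
            name = names_by_ns.get((ns, db_refs[ns]))
            if name is not None:
                return name
    # 4. No relevant ID: the caller leaves the name unchanged
    return None


# Placeholder lookup tables for illustration only
up_names = {'UP_ID_1': 'GENE1'}
other_names = {('CHEBI', 'CHEBI_ID_1'): 'chemical one',
               ('GO', 'GO_ID_1'): 'process one'}
```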

static standardize_db_refs(db_refs)[source]

Return a standardized db refs dict for a given db refs dict.

Parameters

db_refs (dict) – A dict of db refs that may not be standardized, i.e., may be missing an available UP ID corresponding to an existing HGNC ID.

Returns

The db_refs dict with standardized entries.

Return type

dict

update_agent_db_refs(agent, db_refs, do_rename=True)[source]

Update db_refs of agent using the grounding map.

If the grounding map is missing one of the HGNC symbol or Uniprot ID, attempts to reconstruct one from the other.

Parameters
  • agent (indra.statements.Agent) – The agent whose db_refs will be updated

  • db_refs (dict) – The db_refs to set for the agent.

  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. If do_rename is True the priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True

indra.preassembler.grounding_mapper.mapper.load_grounding_map(grounding_map_path, lineterminator='\r\n', hgnc_symbols=True)[source]

Return a grounding map dictionary loaded from a csv file.

In the file pointed to by grounding_map_path, the number of name_space ID pairs can vary per row and commas are used to pad out entries containing fewer than the maximum number of name spaces appearing in the file. By default, lines should be terminated with both a carriage return and a new line.

Optionally, one can specify another csv file (pointed to by ignore_path) containing agent texts that are degenerate and should be filtered out.

It is important to note that this function assumes that the mapping file entries for the HGNC key are symbols not IDs. These symbols are converted to IDs upon loading here.

Parameters
  • grounding_map_path (str) – Path to csv file containing grounding map information. Rows of the file should be of the form <agent_text>,<name_space_1>,<ID_1>,… <name_space_n>,<ID_n>

  • lineterminator (Optional[str]) – Line terminator used in input csv file. Default: '\r\n'

  • hgnc_symbols (Optional[bool]) – Set to True if the grounding map file contains HGNC symbols rather than IDs. In this case, the entries are replaced by IDs. Default: True

Returns

g_map – The grounding map constructed from the given files.

Return type

dict
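
The row format described above can be parsed along these lines. This is a minimal sketch using only the standard library, not the actual load_grounding_map code: parse_grounding_rows is a hypothetical helper, the example rows use placeholder identifiers, and the conversion of HGNC symbols to IDs is left out.

```python
import csv
import io


def parse_grounding_rows(csv_text):
    """Build a grounding map dict from csv text in the documented row format."""
    g_map = {}
    # newline='' leaves line endings untouched so the csv module handles '\r\n'
    for row in csv.reader(io.StringIO(csv_text, newline='')):
        if not row:
            continue
        agent_text, rest = row[0], row[1:]
        # Pair up (name_space, ID) columns and skip empty padding cells
        db_refs = {ns: id_ for ns, id_ in zip(rest[0::2], rest[1::2])
                   if ns and id_}
        g_map[agent_text] = db_refs
    return g_map


# The first row is padded with commas to match the longer second row
example = 'ERK,FPLX,ERK,,\r\nXYZ,NS1,ID_1,NS2,ID_2\r\n'
g_map = parse_grounding_rows(example)
```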

Disambiguation with machine-learned models

class indra.preassembler.grounding_mapper.disambiguate.DisambManager[source]

Manages running of disambiguation models

Has methods to run disambiguation with either Adeft or Gilda. Each instance of this class uses a single database connection.

run_adeft_disambiguation(stmt, agent, idx, agent_txt)[source]

Run Adeft disambiguation on an Agent in a given Statement.

This function looks at the evidence of the given Statement and attempts to look up the full paper or the abstract for the evidence. If both of those fail, the evidence sentence itself is used for disambiguation. The disambiguation model corresponding to the Agent text is then called, and the highest scoring returned grounding is set as the Agent’s new grounding.

The Statement’s annotations as well as the Agent are modified in place.

Parameters
  • stmt (indra.statements.Statement) – An INDRA Statement in which the Agent to be disambiguated appears.

  • agent (indra.statements.Agent) – The Agent (potentially grounding mapped) which we want to disambiguate in the context of the evidence of the given Statement.

  • idx (int) – The index of the new Agent’s position in the Statement’s agent list (needed to set annotations correctly).

Returns

True if disambiguation was successfully applied, and False otherwise. Reasons for a False response can be the lack of evidence as well as failure to obtain text for grounding disambiguation.

Return type

bool
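
The order in which text is selected for disambiguation (full paper, then abstract, then the evidence sentence) can be sketched as below. The function and its callable arguments are hypothetical and stand in for whatever content lookup the class performs; this is not the DisambManager implementation.

```python
def choose_disambiguation_text(evidence_sentence, get_full_text, get_abstract):
    """Return the first available text, or None if nothing can be obtained.

    get_full_text and get_abstract are hypothetical zero-argument callables
    returning a string, or None when the lookup fails.
    """
    # Prefer the full paper, then fall back to the abstract
    for fetch in (get_full_text, get_abstract):
        text = fetch()
        if text:
            return text
    # Last resort: the evidence sentence itself (None if it is missing or empty)
    return evidence_sentence or None


text = choose_disambiguation_text('An evidence sentence.',
                                  lambda: None,
                                  lambda: 'An abstract.')
```

A None result here corresponds to the "failure to obtain text" case for which the method returns False.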

run_gilda_disambiguation(stmt, agent, idx, agent_txt, mode='web')[source]

Run Gilda disambiguation on an Agent in a given Statement.

This function looks at the evidence of the given Statement and attempts to look up the full paper or the abstract for the evidence. If both of those fail, the evidence sentence itself is used for disambiguation. The disambiguation model corresponding to the Agent text is then called, and the highest scoring returned grounding is set as the Agent’s new grounding.

The Statement’s annotations as well as the Agent are modified in place.

Parameters
  • stmt (indra.statements.Statement) – An INDRA Statement in which the Agent to be disambiguated appears.

  • agent (indra.statements.Agent) – The Agent (potentially grounding mapped) which we want to disambiguate in the context of the evidence of the given Statement.

  • idx (int) – The index of the new Agent’s position in the Statement’s agent list (needed to set annotations correctly).

  • mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

Returns

True if disambiguation was successfully applied, and False otherwise. Reasons for a False response can be the lack of evidence as well as failure to obtain text for grounding disambiguation.

Return type

bool

Gilda grounding functions

This module implements a client to the Gilda grounding web service, and contains functions to help apply it during the course of INDRA assembly.

indra.preassembler.grounding_mapper.gilda.get_gilda_models(mode='web')[source]

Return a list of strings for which Gilda has a disambiguation model.

Parameters

mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

Returns

A list of entity strings.

Return type

list[str]

indra.preassembler.grounding_mapper.gilda.get_grounding(txt, context=None, mode='web')[source]

Return the top Gilda grounding for a given text.

Parameters
  • txt (str) – The text to ground.

  • context (Optional[str]) – Any context for the grounding.

  • mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

Return type

Tuple[Mapping[str, Any], List[Any]]

Returns

  • dict – If no grounding was found, it is an empty dict. Otherwise, it’s a dict with the top grounding returned from Gilda.

  • list – The list of ScoredMatches

indra.preassembler.grounding_mapper.gilda.ground_agent(agent, txt, context=None, mode='web')[source]

Set the grounding of a given agent, by re-grounding with Gilda.

This function changes the agent in place without returning a value.

Parameters
  • agent (indra.statements.Agent) – The Agent whose db_refs should be changed.

  • txt (str) – The text by which the Agent should be grounded.

  • context (Optional[str]) – Any additional text context to help disambiguate the sense associated with txt.

  • mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

indra.preassembler.grounding_mapper.gilda.ground_statement(stmt, mode='web', ungrounded_only=False)[source]

Set grounding for Agents in a given Statement using Gilda.

This function modifies the original Statement/Agents in place.

Parameters
  • stmt (indra.statements.Statement) – A Statement to ground

  • mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

  • ungrounded_only (Optional[bool]) – If True, only ungrounded Agents will be grounded, and ones that are already grounded will not be modified. Default: False

indra.preassembler.grounding_mapper.gilda.ground_statements(stmts, mode='web', sources=None, ungrounded_only=False)[source]

Set grounding for Agents in a list of Statements using Gilda.

This function modifies the original Statements/Agents in place.

Parameters
  • stmts (list[indra.statements.Statement]) – A list of Statements to ground

  • mode (Optional[str]) – If ‘web’, the web service given in the GILDA_URL config setting or environment variable is used. Otherwise, an attempt is made to import and use the gilda package. Default: web

  • sources (Optional[list]) – If given, only statements from the given sources are grounded. The sources have to correspond to valid source_api entries, e.g., ‘reach’, ‘sparser’, etc. If not given, statements from all sources are grounded.

  • ungrounded_only (Optional[bool]) – If True, only ungrounded Agents will be grounded, and ones that are already grounded will not be modified. Default: False

Returns

The list of Statements that were changed in place by reference.

Return type

list[indra.statements.Statement]

Analysis scripts for grounding

indra.preassembler.grounding_mapper.analysis.agent_texts(agents)[source]

Return a list of all agent texts from a list of agents.

Agents without an agent text are represented by None values.

Parameters

agents (list of indra.statements.Agent) –

Returns

agent texts from input list of agents

Return type

list of str/None

indra.preassembler.grounding_mapper.analysis.agent_texts_with_grounding(stmts)[source]

Return agent text groundings in a list of statements with their counts.

Parameters

stmts (list of indra.statements.Statement) –

Returns

List of tuples of the form (text: str, ((name_space: str, ID: str, count: int)…), total_count: int)

Where the counts within the tuple of groundings give the number of times an agent with the given agent_text appears grounded with the particular name space and ID. The total_count gives the total number of times an agent with text appears in the list of statements.

Return type

list of tuple
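
The tuple structure described above can be reproduced on toy data as follows. This is a simplified sketch operating on plain db_refs dicts instead of INDRA Statements: texts_with_grounding is a hypothetical helper, the identifiers are placeholders, and both the sorting by count and the counting of every non-TEXT entry as a grounding are assumptions.

```python
from collections import Counter, defaultdict


def texts_with_grounding(db_refs_list):
    """Return (text, ((name_space, ID, count), ...), total_count) tuples."""
    per_text = defaultdict(Counter)   # text -> Counter of (name_space, ID)
    totals = Counter()                # text -> total number of appearances
    for db_refs in db_refs_list:
        text = db_refs.get('TEXT')
        if text is None:
            continue
        totals[text] += 1
        for ns, id_ in db_refs.items():
            if ns != 'TEXT':
                per_text[text][(ns, id_)] += 1
    result = []
    for text, total in totals.most_common():
        groundings = tuple((ns, id_, n)
                           for (ns, id_), n in per_text[text].most_common())
        result.append((text, groundings, total))
    return result


refs = [{'TEXT': 'ERK', 'FPLX': 'ERK'},
        {'TEXT': 'ERK', 'FPLX': 'ERK'},
        {'TEXT': 'ERK'},
        {'TEXT': 'XYZ', 'NS1': 'ID_1'}]
twg = texts_with_grounding(refs)
```

Here 'ERK' appears three times in total but is grounded in only two of them, which is the distinction between the per-grounding counts and total_count.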

indra.preassembler.grounding_mapper.analysis.all_agents(stmts)[source]

Return a list of all of the agents from a list of statements.

Only agents that are not None and have a TEXT entry are returned.

Parameters

stmts (list of indra.statements.Statement) –

Returns

agents – List of agents that appear in the input list of indra statements.

Return type

list of indra.statements.Agent

indra.preassembler.grounding_mapper.analysis.get_agents_with_name(name, stmts)[source]

Return all agents within a list of statements with a particular name.

indra.preassembler.grounding_mapper.analysis.get_sentences_for_agent(text, stmts, max_sentences=None)[source]

Returns evidence sentences with a given agent text from a list of statements.

Parameters
  • text (str) – An agent text

  • stmts (list of indra.statements.Statement) – INDRA Statements to search in for evidence statements.

  • max_sentences (Optional[int/None]) – Cap on the number of evidence sentences to return. Default: None

Returns

sentences – Evidence sentences from the list of statements containing the given agent text.

Return type

list of str

indra.preassembler.grounding_mapper.analysis.protein_map_from_twg(twg)[source]

Build map of entity texts to validate protein grounding.

Looks at the grounding of the entity texts extracted from the statements and finds proteins where there is grounding to a human protein that maps to an HGNC name that is an exact match to the entity text. Returns a dict that can be used to update/expand the grounding map.

Parameters

twg (list of tuple) – list of tuples of the form output by agent_texts_with_grounding

Returns

protein_map – dict keyed on agent text with associated values {‘TEXT’: agent_text, ‘UP’: uniprot_id}. Entries are for agent texts where the grounding map was able to find human protein grounded to this agent_text in Uniprot.

Return type

dict

indra.preassembler.grounding_mapper.analysis.save_base_map(filename, grouped_by_text)[source]

Dump a list of agents along with groundings and counts into a csv file.

Parameters
  • filename (str) – Filepath for output file

  • grouped_by_text (list of tuple) – List of tuples of the form output by agent_texts_with_grounding

indra.preassembler.grounding_mapper.analysis.save_sentences(twg, stmts, filename, agent_limit=300)[source]

Write evidence sentences for stmts with ungrounded agents to csv file.

Parameters
  • twg (list of tuple) – list of tuples of ungrounded agent_texts with counts of the number of times they are mentioned in the list of statements. Should be sorted in descending order by the counts. This is of the form output by the function ungrounded_texts.

  • stmts (list of indra.statements.Statement) –

  • filename (str) – Path to output file

  • agent_limit (Optional[int]) – Number of agents to include in output file. Takes the top agents by count.

indra.preassembler.grounding_mapper.analysis.ungrounded_texts(stmts)[source]

Return a list of all ungrounded entities ordered by number of mentions.

Parameters

stmts (list of indra.statements.Statement) –

Returns

ungroundc – list of tuples of the form (text: str, count: int) sorted in descending order by count.

Return type

list of tuple
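
The (text, count) output sorted by descending count can be sketched with collections.Counter. The helper below is hypothetical and works on plain db_refs dicts; treating an agent as ungrounded when TEXT is its only db_refs key is an assumption made for this illustration.

```python
from collections import Counter


def count_ungrounded(db_refs_list):
    """Return (text, count) tuples for ungrounded entries, most frequent first."""
    # Assumption: an entry is ungrounded if TEXT is its only db_refs key
    ungrounded = [refs['TEXT'] for refs in db_refs_list
                  if list(refs) == ['TEXT']]
    return Counter(ungrounded).most_common()


counts = count_ungrounded([{'TEXT': 'a'},
                           {'TEXT': 'b', 'NS1': 'ID_1'},
                           {'TEXT': 'a'},
                           {'TEXT': 'c'}])
```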