Entity grounding curation and mapping (indra.preassembler.grounding_mapper)

Grounding mapping and standardization

class indra.preassembler.grounding_mapper.mapper.GroundingMapper(grounding_map=None, agent_map=None, ignores=None, misgrounding_map=None, use_adeft=True)[source]

Maps grounding of INDRA Agents based on a given grounding map.

Each parameter, if not provided, results in loading the corresponding built-in grounding resource. To explicitly avoid loading the default, pass in an empty data structure as the given parameter, e.g., ignores=[].

Parameters:
  • grounding_map (Optional[dict]) – The grounding map, a dictionary mapping strings (entity names) to a dictionary of database identifiers.
  • agent_map (Optional[dict]) – A dictionary mapping strings to grounded INDRA Agents with given state.
  • ignores (Optional[list]) – A list of entity strings that, if encountered, will result in the corresponding Statement being discarded.
  • misgrounding_map (Optional[dict]) – A mapping dict similar to the grounding map which maps entity strings to a given grounding which is known to be incorrect and should be removed if encountered (making the remaining Agent ungrounded).
  • use_adeft (Optional[bool]) – If True, an attempt is made to use Adeft for disambiguation of acronyms. Default: True
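
The shapes of these inputs can be illustrated with plain Python data structures. The following is a minimal sketch, not INDRA's implementation: the entity strings and identifiers are illustrative, the shape of misgrounding_map (entity text mapped to a list of name space and ID pairs) is an assumption, and the helper apply_maps, including the order of its checks, is hypothetical.

```python
# Hypothetical inputs; the entity strings and identifiers are illustrative only.
grounding_map = {
    "ERK": {"FPLX": "ERK"},
    "p53": {"HGNC": "11998", "UP": "P04637"},
}
ignores = ["fig", "et al"]
# Assumed shape: entity text -> list of (name space, ID) pairs known to be wrong.
misgrounding_map = {"XYZ": [("XNS", "X1")]}


def apply_maps(text, db_refs):
    """Return new db_refs for an entity text, or None if the text is ignored."""
    if text in ignores:
        return None  # the enclosing Statement would be discarded
    if text in grounding_map:
        return {"TEXT": text, **grounding_map[text]}
    bad = misgrounding_map.get(text, [])
    # Drop any grounding listed as incorrect; what is left may be TEXT only.
    return {ns: id_ for ns, id_ in db_refs.items() if (ns, id_) not in bad}


print(apply_maps("ERK", {"TEXT": "ERK"}))
print(apply_maps("XYZ", {"TEXT": "XYZ", "XNS": "X1"}))
```

In this sketch an ignored text yields None, a mapped text receives the grounding from the map, and a known misgrounding is stripped so that only the TEXT entry remains.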
static check_grounding_map(gm)[source]

Run sanity checks on the grounding map, raise error if needed.

map_agent(agent, do_rename)[source]

Return the given Agent with its grounding mapped.

This function grounds a single agent. It returns the new Agent object (which might be a different object if we load a new agent state from json) or the same object otherwise.

Parameters:
  • agent (indra.statements.Agent) – The Agent to map.
  • do_rename (bool) – If True, the Agent name is updated based on the mapped grounding. The priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot.
Returns:

grounded_agent – The grounded Agent.

Return type:

indra.statements.Agent

map_agents_for_stmt(stmt, do_rename=True)[source]

Return a new Statement whose agents have been grounding mapped.

Parameters:
  • stmt (indra.statements.Statement) – The Statement whose agents need mapping.
  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. The priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True
Returns:

mapped_stmt – The mapped Statement.

Return type:

indra.statements.Statement

map_stmts(stmts, do_rename=True)[source]

Return a new list of statements whose agents have been mapped.

Parameters:
  • stmts (list of indra.statements.Statement) – The statements whose agents need mapping
  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. The priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True
Returns:

mapped_stmts – A list of statements given by mapping the agents from each statement in the input list

Return type:

list of indra.statements.Statement

static rename_agents(stmts)[source]

Return a list of mapped statements with updated agent names.

Creates a new list of statements without modifying the original list.

Parameters: stmts (list of indra.statements.Statement) – List of statements whose Agents need their names updated.
Returns: mapped_stmts – A new list of Statements with updated Agent names.
Return type: list of indra.statements.Statement
static standardize_agent_name(agent, standardize_refs=True)[source]

Standardize the name of an Agent based on grounding information.

If an agent contains a FamPlex grounding, the FamPlex ID is used as a name. Otherwise if it contains a Uniprot ID, an attempt is made to find the associated HGNC gene name. If one can be found it is used as the agent name and the associated HGNC ID is added as an entry to the db_refs. Similarly, CHEBI, MESH and GO IDs are used in this order of priority to assign a standardized name to the Agent. If no relevant IDs are found, the name is not changed.

Parameters:
  • agent (indra.statements.Agent) – An INDRA Agent whose name attribute should be standardized based on grounding information.
  • standardize_refs (Optional[bool]) – If True, this function assumes that the Agent’s db_refs need to be standardized, e.g., HGNC mapped to UP. Default: True
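
The naming priority described above can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than INDRA's code: the lookup tables UP_TO_HGNC and NAME_LOOKUP and the function standardize_name are hypothetical stand-ins for the database resources, and the sketch returns new values where the documented method works on an Agent object.

```python
# Hypothetical stand-ins for database lookups; not INDRA resources.
UP_TO_HGNC = {"P04637": ("11998", "TP53")}  # UniProt ID -> (HGNC ID, symbol)
NAME_LOOKUP = {
    ("CHEBI", "CHEBI:15377"): "water",
    ("MESH", "D009369"): "Neoplasms",
    ("GO", "GO:0006915"): "apoptotic process",
}


def standardize_name(name, db_refs):
    """Return (name, db_refs) following the priority described in the text."""
    refs = dict(db_refs)
    # 1. A FamPlex grounding wins: its ID becomes the name.
    if "FPLX" in refs:
        return refs["FPLX"], refs
    # 2. A UniProt ID is mapped to an HGNC symbol, and the HGNC ID is recorded.
    if refs.get("UP") in UP_TO_HGNC:
        hgnc_id, symbol = UP_TO_HGNC[refs["UP"]]
        refs["HGNC"] = hgnc_id
        return symbol, refs
    # 3. CHEBI, MESH and GO are tried in this order.
    for ns in ("CHEBI", "MESH", "GO"):
        key = (ns, refs.get(ns))
        if key in NAME_LOOKUP:
            return NAME_LOOKUP[key], refs
    # 4. No relevant ID: the name is left unchanged.
    return name, refs


print(standardize_name("p53", {"UP": "P04637"}))
```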
static standardize_db_refs(db_refs)[source]

Return a standardized db refs dict for a given db refs dict.

Parameters: db_refs (dict) – A dict of db refs that may not be standardized, i.e., may be missing an available UP ID corresponding to an existing HGNC ID.
Returns: The db_refs dict with standardized entries.
Return type: dict
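
As a rough illustration of what standardizing db refs can mean for the HGNC and UP case mentioned above, the sketch below fills in whichever of the two entries is missing. The tables HGNC_TO_UP and UP_TO_HGNC and the function fill_missing_refs are hypothetical; the actual method may cover more name spaces than shown here.

```python
# Hypothetical lookup tables standing in for the HGNC and UniProt resources.
HGNC_TO_UP = {"11998": "P04637"}
UP_TO_HGNC = {up: hgnc for hgnc, up in HGNC_TO_UP.items()}


def fill_missing_refs(db_refs):
    """Return a copy of db_refs with a missing UP or HGNC entry filled in."""
    refs = dict(db_refs)
    if "HGNC" in refs and "UP" not in refs:
        up_id = HGNC_TO_UP.get(refs["HGNC"])
        if up_id is not None:
            refs["UP"] = up_id
    elif "UP" in refs and "HGNC" not in refs:
        hgnc_id = UP_TO_HGNC.get(refs["UP"])
        if hgnc_id is not None:
            refs["HGNC"] = hgnc_id
    return refs


print(fill_missing_refs({"TEXT": "p53", "HGNC": "11998"}))
```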
update_agent_db_refs(agent, db_refs, do_rename=True)[source]

Update db_refs of agent using the grounding map.

If the grounding map is missing either the HGNC symbol or the Uniprot ID, an attempt is made to reconstruct one from the other.

Parameters:
  • agent (indra.statements.Agent) – The agent whose db_refs will be updated
  • db_refs (dict) – The db_refs to set for the agent.
  • do_rename (Optional[bool]) – If True, the Agent name is updated based on the mapped grounding. The priority for setting the name is FamPlex ID, HGNC symbol, then the gene name from Uniprot. Default: True
indra.preassembler.grounding_mapper.mapper.load_grounding_map(grounding_map_path, lineterminator='\r\n', hgnc_symbols=True)[source]

Return a grounding map dictionary loaded from a csv file.

In the file pointed to by grounding_map_path, the number of name_space ID pairs can vary per row and commas are used to pad out entries containing fewer than the maximum number of name spaces appearing in the file. Lines should be terminated with both a carriage return and a new line by default.

Optionally, one can specify another csv file (pointed to by ignore_path) containing agent texts that are degenerate and should be filtered out.

It is important to note that this function assumes that the mapping file entries for the HGNC key are symbols not IDs. These symbols are converted to IDs upon loading here.

Parameters:
  • grounding_map_path (str) – Path to csv file containing grounding map information. Rows of the file should be of the form <agent_text>,<name_space_1>,<ID_1>,… <name_space_n>,<ID_n>
  • lineterminator (Optional[str]) – Line terminator used in input csv file. Default: '\r\n'
  • hgnc_symbols (Optional[bool]) – Set to True if the grounding map file contains HGNC symbols rather than IDs. In this case, the entries are replaced by IDs. Default: True
Returns:

g_map – The grounding map constructed from the given files.

Return type:

dict
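
The row format described above, including the comma padding, can be sketched with the standard csv module. This is not the library's implementation: parse_grounding_rows and its hgnc_ids argument are hypothetical, the sketch parses a string rather than a file path, and the symbol-to-ID table is a stand-in for the HGNC resource.

```python
import csv
import io


def parse_grounding_rows(csv_text, hgnc_ids=None):
    """Build {agent_text: {name_space: ID}} from padded CSV rows."""
    hgnc_ids = hgnc_ids or {}  # hypothetical HGNC symbol -> ID table
    g_map = {}
    # newline="" leaves "\r\n" untouched so the csv reader handles line ends.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    for row in reader:
        if not row:
            continue
        text, rest = row[0], row[1:]
        # Pair up name space and ID cells; empty padding cells are skipped.
        refs = {ns: id_ for ns, id_ in zip(rest[0::2], rest[1::2]) if ns and id_}
        # Convert an HGNC symbol to an ID when the table knows it.
        if "HGNC" in refs:
            refs["HGNC"] = hgnc_ids.get(refs["HGNC"], refs["HGNC"])
        g_map[text] = refs
    return g_map


sample = "ERK,FPLX,ERK,,\r\np53,HGNC,TP53,UP,P04637\r\n"
print(parse_grounding_rows(sample, hgnc_ids={"TP53": "11998"}))
```

The first sample row carries one name space and is padded with two empty cells; the second carries two name spaces and needs no padding.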

Analysis scripts for grounding

indra.preassembler.grounding_mapper.analysis.agent_texts(agents)[source]

Return a list of all agent texts from a list of agents.

None values are returned for agents without agent texts.

Parameters: agents (list of indra.statements.Agent)
Returns: Agent texts from the input list of agents.
Return type: list of str/None
indra.preassembler.grounding_mapper.analysis.agent_texts_with_grounding(stmts)[source]

Return agent text groundings in a list of statements with their counts

Parameters: stmts (list of indra.statements.Statement)
Returns: List of tuples of the form (text: str, ((name_space: str, ID: str, count: int)…), total_count: int)

The counts within the tuple of groundings give the number of times an agent with the given agent_text appears grounded with the particular name space and ID. The total_count gives the total number of times an agent with that text appears in the list of statements.

Return type: list of tuple
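
To make the nested tuple structure concrete, the sketch below builds it from plain db_refs dicts. It is a hedged illustration: texts_with_grounding is a hypothetical function, agents are represented only by their db_refs, and sorting by total count is an assumption not stated in the description above.

```python
from collections import Counter, defaultdict


def texts_with_grounding(agent_refs):
    """Return [(text, ((name_space, ID, count), ...), total_count), ...]."""
    per_text = defaultdict(Counter)  # text -> Counter of (name_space, ID)
    totals = Counter()               # text -> number of occurrences
    for refs in agent_refs:
        text = refs.get("TEXT")
        if text is None:
            continue
        totals[text] += 1
        for ns, id_ in refs.items():
            if ns != "TEXT":
                per_text[text][(ns, id_)] += 1
    result = []
    for text, total in totals.most_common():  # assumed ordering: most frequent first
        groundings = tuple(
            (ns, id_, n) for (ns, id_), n in per_text[text].most_common()
        )
        result.append((text, groundings, total))
    return result


agents = [
    {"TEXT": "ERK", "FPLX": "ERK"},
    {"TEXT": "ERK", "FPLX": "ERK"},
    {"TEXT": "ERK"},
    {"TEXT": "p53", "UP": "P04637"},
]
print(texts_with_grounding(agents))
```

Here "ERK" occurs three times in total but is grounded to FPLX only twice, which is why the per-grounding count and total_count can differ.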
indra.preassembler.grounding_mapper.analysis.all_agents(stmts)[source]

Return a list of all of the agents from a list of statements.

Only agents that are not None and have a TEXT entry are returned.

Parameters: stmts (list of indra.statements.Statement)
Returns: agents – List of agents that appear in the input list of indra statements.
Return type: list of indra.statements.Agent
indra.preassembler.grounding_mapper.analysis.get_agents_with_name(name, stmts)[source]

Return all agents within a list of statements with a particular name.

indra.preassembler.grounding_mapper.analysis.get_sentences_for_agent(text, stmts, max_sentences=None)[source]

Return evidence sentences with a given agent text from a list of statements.

Parameters:
  • text (str) – An agent text
  • stmts (list of indra.statements.Statement) – INDRA Statements to search in for evidence statements.
  • max_sentences (Optional[int/None]) – Cap on the number of evidence sentences to return. Default: None
Returns:

sentences – Evidence sentences from the list of statements containing the given agent text.

Return type:

list of str
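
The effect of max_sentences can be sketched without INDRA objects. In the example below each statement is reduced to a pair of (agent texts, evidence sentences); this representation and the function sentences_for_text are hypothetical and serve only to show the cap.

```python
def sentences_for_text(text, records, max_sentences=None):
    """Collect evidence sentences for records mentioning the agent text."""
    sentences = []
    for agent_texts, evidence_sentences in records:
        if text not in agent_texts:
            continue
        for sentence in evidence_sentences:
            # Stop as soon as the cap is reached; None means no cap.
            if max_sentences is not None and len(sentences) >= max_sentences:
                return sentences
            sentences.append(sentence)
    return sentences


records = [
    (["ERK", "MEK"], ["MEK phosphorylates ERK.", "ERK is activated by MEK."]),
    (["p53"], ["p53 binds MDM2."]),
    (["ERK"], ["ERK translocates to the nucleus."]),
]
print(sentences_for_text("ERK", records, max_sentences=2))
```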

indra.preassembler.grounding_mapper.analysis.protein_map_from_twg(twg)[source]

Build map of entity texts to validate protein grounding.

Looks at the grounding of the entity texts extracted from the statements and finds proteins where there is grounding to a human protein that maps to an HGNC name that is an exact match to the entity text. Returns a dict that can be used to update/expand the grounding map.

Parameters: twg (list of tuple) – List of tuples of the form output by agent_texts_with_grounding.
Returns: protein_map – dict keyed on agent text with associated values {‘TEXT’: agent_text, ‘UP’: uniprot_id}. Entries are for agent texts where the grounding map was able to find a human protein grounded to this agent_text in Uniprot.
Return type: dict
indra.preassembler.grounding_mapper.analysis.save_base_map(filename, grouped_by_text)[source]

Dump a list of agents along with groundings and counts into a csv file

Parameters:
  • filename (str) – Filepath for output file
  • grouped_by_text (list of tuple) – List of tuples of the form output by agent_texts_with_grounding
indra.preassembler.grounding_mapper.analysis.save_sentences(twg, stmts, filename, agent_limit=300)[source]

Write evidence sentences for stmts with ungrounded agents to csv file.

Parameters:
  • twg (list of tuple) – List of tuples of ungrounded agent_texts with counts of the number of times they are mentioned in the list of statements. Should be sorted in descending order by the counts. This is of the form output by the function ungrounded_texts.
  • stmts (list of indra.statements.Statement) –
  • filename (str) – Path to output file
  • agent_limit (Optional[int]) – Number of agents to include in output file. Takes the top agents by count.
indra.preassembler.grounding_mapper.analysis.ungrounded_texts(stmts)[source]

Return a list of all ungrounded entities ordered by number of mentions

Parameters: stmts (list of indra.statements.Statement)
Returns: ungroundc – List of tuples of the form (text: str, count: int) sorted in descending order by count.
Return type: list of tuple
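
The output shape can be sketched with collections.Counter. The example assumes that an agent counts as ungrounded when its db_refs hold nothing but a TEXT entry; that criterion and the function count_ungrounded are assumptions made for illustration, with agents again represented only by their db_refs.

```python
from collections import Counter


def count_ungrounded(agent_refs):
    """Return [(text, count), ...] for TEXT-only db_refs, most frequent first."""
    texts = [refs["TEXT"] for refs in agent_refs if list(refs) == ["TEXT"]]
    return Counter(texts).most_common()


agent_refs = [
    {"TEXT": "foo"},
    {"TEXT": "ERK", "FPLX": "ERK"},
    {"TEXT": "bar"},
    {"TEXT": "bar"},
]
print(count_ungrounded(agent_refs))
```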

Adeft disambiguation functions

indra.preassembler.grounding_mapper.adeft.run_adeft_disambiguation(stmt, agent, idx)[source]

Run Adeft disambiguation on an Agent in a given Statement.

This function looks at the evidence of the given Statement and attempts to look up the full paper or the abstract for the evidence. If both of those fail, the evidence sentence itself is used for disambiguation. The disambiguation model corresponding to the Agent text is then called, and the highest scoring returned grounding is set as the Agent’s new grounding.

The Statement’s annotations as well as the Agent are modified in place and no value is returned.

Parameters:
  • stmt (indra.statements.Statement) – An INDRA Statement in which the Agent to be disambiguated appears.
  • agent (indra.statements.Agent) – The Agent (potentially grounding mapped) which we want to disambiguate in the context of the evidence of the given Statement.
  • idx (int) – The index of the new Agent’s position in the Statement’s agent list (needed to set annotations correctly).
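
The fallback order for choosing the text to disambiguate on, followed by picking the highest scoring grounding, can be sketched as follows. Both helpers are hypothetical and do not call Adeft: the fetch functions stand in for full-text and abstract lookups, and the scores dict stands in for a disambiguation model's output.

```python
def pick_disambiguation_text(fetch_full_text, fetch_abstract, sentence):
    """Try the full paper, then the abstract, then fall back to the sentence."""
    for fetch in (fetch_full_text, fetch_abstract):
        text = fetch()
        if text:
            return text
    return sentence


def best_grounding(scores):
    """Return the grounding with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical lookups: the full text is unavailable, the abstract is found.
text = pick_disambiguation_text(
    lambda: None,
    lambda: "Abstract text mentioning the acronym.",
    "Evidence sentence.",
)
print(text)
print(best_grounding({"GROUNDING_A": 0.2, "GROUNDING_B": 0.7}))
```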