Preassembly (indra.preassembler)

Preassembler (indra.preassembler)

class indra.preassembler.Preassembler(hierarchies, stmts=None)[source]

De-duplicates statements and arranges them in a specificity hierarchy.

Parameters:
  • hierarchies (dict[indra.preassembler.hierarchy_manager]) – A dictionary of hierarchies with keys such as ‘entity’ (hierarchy of entities, primarily specifying relationships between genes and their families) and ‘modification’ pointing to HierarchyManagers
  • stmts (list of indra.statements.Statement or None) – A set of statements to perform pre-assembly on. If None, statements should be added using the add_statements() method.
stmts

list of indra.statements.Statement – Starting set of statements for preassembly.

unique_stmts

list of indra.statements.Statement – Statements resulting from combining duplicates.

related_stmts

list of indra.statements.Statement – Top-level statements after building the refinement hierarchy.

hierarchies

dict[indra.preassembler.hierarchy_manager] – A dictionary of hierarchies with keys such as ‘entity’ and ‘modification’ pointing to HierarchyManagers

add_statements(stmts)[source]

Add to the current list of statements.

Parameters: stmts (list of indra.statements.Statement) – Statements to add to the current list.
static combine_duplicate_stmts(stmts)[source]

Combine evidence from duplicate Statements.

Statements are deemed to be duplicates if they have the same key returned by the matches_key() method of the Statement class. This generally means that statements must be identical in terms of their arguments and can differ only in their associated Evidence objects.

This function keeps the first instance of each set of duplicate statements and merges the lists of Evidence from all of the other statements.

Parameters: stmts (list of indra.statements.Statement) – Set of statements to de-duplicate.
Returns: Unique statements with accumulated evidence across duplicates.
Return type: list of indra.statements.Statement

Examples

De-duplicate and combine evidence for two statements differing only in their evidence lists:

>>> from indra.statements import Agent, Evidence, Phosphorylation
>>> from indra.preassembler import Preassembler
>>> map2k1 = Agent('MAP2K1')
>>> mapk1 = Agent('MAPK1')
>>> stmt1 = Phosphorylation(map2k1, mapk1, 'T', '185',
...                         evidence=[Evidence(text='evidence 1')])
>>> stmt2 = Phosphorylation(map2k1, mapk1, 'T', '185',
...                         evidence=[Evidence(text='evidence 2')])
>>> uniq_stmts = Preassembler.combine_duplicate_stmts([stmt1, stmt2])
>>> uniq_stmts
[Phosphorylation(MAP2K1(), MAPK1(), T, 185)]
>>> sorted([e.text for e in uniq_stmts[0].evidence])
['evidence 1', 'evidence 2']
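
The de-duplication rule described above (keep the first statement for each key and merge the evidence of the rest) can be sketched without INDRA installed. This is only an illustration under stated assumptions, not the library's implementation: ToyStatement and combine_by_key are hypothetical names, and the key tuple merely plays the role that matches_key() plays for real Statements.

```python
from collections import OrderedDict


class ToyStatement:
    """Hypothetical stand-in for a Statement: a key plus evidence texts."""

    def __init__(self, key, evidence):
        self.key = key
        self.evidence = list(evidence)

    def matches_key(self):
        # Assumption: the key captures everything except the evidence.
        return self.key


def combine_by_key(stmts):
    """Keep one statement per key and merge evidence from its duplicates."""
    unique = OrderedDict()
    for stmt in stmts:
        key = stmt.matches_key()
        if key not in unique:
            # Copy the first instance so that the inputs are not mutated.
            unique[key] = ToyStatement(key, stmt.evidence)
        else:
            unique[key].evidence.extend(stmt.evidence)
    return list(unique.values())


s1 = ToyStatement(('Phosphorylation', 'MAP2K1', 'MAPK1', 'T', '185'),
                  ['evidence 1'])
s2 = ToyStatement(('Phosphorylation', 'MAP2K1', 'MAPK1', 'T', '185'),
                  ['evidence 2'])
s3 = ToyStatement(('Phosphorylation', 'BRAF', 'MAP2K1', None, None),
                  ['evidence 3'])
uniq = combine_by_key([s1, s2, s3])
```

Here uniq holds two entries, and the first carries the evidence of both s1 and s2. Keying on a dictionary keeps the pass linear in the number of statements.
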
combine_duplicates()[source]

Combine duplicates among stmts and save result in unique_stmts.

A wrapper around the static method combine_duplicate_stmts().

combine_related(return_toplevel=True, poolsize=None, size_cutoff=100)

Connect related statements based on their refinement relationships.

This function takes as a starting point the unique statements (with duplicates removed) and returns a modified flat list of statements containing only those statements which do not represent a refinement of other existing statements. In other words, the more general versions of a given statement do not appear at the top level, but instead are listed in the supported_by field of the top-level statements.

If unique_stmts has not been initialized with the de-duplicated statements, combine_duplicates() is called internally.

After this function is called the attribute related_stmts is set as a side-effect.

The procedure for combining statements in this way involves a series of steps:

  1. The statements are grouped by type (e.g., Phosphorylation) and each type is iterated over independently.
  2. Statements of the same type are then grouped according to their Agents’ entity hierarchy component identifiers. For instance, ERK, MAPK1 and MAPK3 are all in the same connected component in the entity hierarchy and therefore all Statements of the same type referencing these entities will be grouped. This grouping assures that relations are only possible within Statement groups and not among groups. For two Statements to be in the same group at this step, the Statements must be the same type and the Agents at each position in the Agent lists must either be in the same hierarchy component, or if they are not in the hierarchy, must have identical entity_matches_keys. Statements with None in one of the Agent list positions are collected separately at this stage.
  3. Statements with None at either the first or second position are iterated over. For a statement with a None as the first Agent, the second Agent is examined; then the Statement with None is added to all Statement groups with a corresponding component or entity_matches_key in the second position. The same procedure is performed for Statements with None at the second Agent position.
  4. The statements within each group are then compared; if one statement represents a refinement of the other (as defined by the refinement_of() method implemented for the Statement), then the more refined statement is added to the supports field of the more general statement, and the more general statement is added to the supported_by field of the more refined statement.
  5. A new flat list of statements is created that contains only those statements that have no supports entries (statements containing such entries are not eliminated, because they will be retrievable from the supported_by fields of other statements). This list is returned to the caller.
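
Steps 1, 4 and 5 above can be sketched in plain Python. The hierarchy-based grouping of steps 2 and 3 is left out, and the refinement rule is a toy one (a statement naming a residue refines an otherwise identical statement that names none). ToyStmt and link_refinements are hypothetical names, not INDRA classes or functions.

```python
from collections import defaultdict


class ToyStmt:
    """Hypothetical stand-in for a Statement; not an INDRA class."""

    def __init__(self, stmt_type, enz, sub, residue=None):
        self.stmt_type = stmt_type
        self.enz = enz
        self.sub = sub
        self.residue = residue
        self.supports = []
        self.supported_by = []

    def refinement_of(self, other):
        # Toy rule: same agents, and only self specifies a residue.
        same_agents = (self.enz, self.sub) == (other.enz, other.sub)
        return (same_agents and self.residue is not None
                and other.residue is None)


def link_refinements(stmts):
    # Step 1: group statements by type.
    by_type = defaultdict(list)
    for stmt in stmts:
        by_type[stmt.stmt_type].append(stmt)
    # Step 4: compare statements pairwise within each group.
    for group in by_type.values():
        for refined in group:
            for general in group:
                if refined is not general and refined.refinement_of(general):
                    general.supports.append(refined)
                    refined.supported_by.append(general)
    # Step 5: keep only statements with no supports entries.
    return [s for s in stmts if not s.supports]


st1 = ToyStmt('Phosphorylation', 'BRAF', 'MAP2K1')
st2 = ToyStmt('Phosphorylation', 'BRAF', 'MAP2K1', residue='S')
st3 = ToyStmt('Dephosphorylation', 'BRAF', 'MAP2K1')
top_level = link_refinements([st1, st2, st3])
```

In this toy run st1 is dropped from the top level because st2 refines it, while st3 stays because nothing of its type refines it.
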

On multi-core machines, the algorithm can be parallelized by setting the poolsize argument to the desired number of worker processes. This feature is only available in Python 3.4 and above.
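
When parallelization is used, the size_cutoff parameter decides which statement groups go to worker processes. The partitioning rule alone can be sketched as follows; split_groups is a hypothetical helper written for illustration, and no worker pool is actually started here.

```python
def split_groups(groups, size_cutoff=100):
    """Partition groups by size; a hypothetical helper, not INDRA code."""
    # Groups with size_cutoff or more statements would go to workers.
    for_workers = [g for g in groups if len(g) >= size_cutoff]
    # Smaller groups would be compared in the parent process.
    for_parent = [g for g in groups if len(g) < size_cutoff]
    return for_workers, for_parent


groups = [['stmt'] * 3, ['stmt'] * 100, ['stmt'] * 250]
for_workers, for_parent = split_groups(groups)
```
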

Note

Subfamily relationships must be consistent across arguments

For now, we require that merges can only occur if the isa relationships are all in the same direction for all the agents in a Statement. For example, the two statement groups: RAF_family -> MEK1 and BRAF -> MEK_family would not be merged, since BRAF isa RAF_family, but MEK_family is not a MEK1. In the future this restriction could be revisited.
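
The direction requirement can be illustrated with a toy check. The ISA table, agent_refines and stmt_refines below are hypothetical and only mirror the RAF/MEK example in this note; they are not part of INDRA.

```python
# Hypothetical toy hierarchy mapping each child to its set of parents.
ISA = {'BRAF': {'RAF_family'}, 'MEK1': {'MEK_family'}}


def agent_refines(a, b):
    """True if agent a is the same as b or has an isa relation to b."""
    return a == b or b in ISA.get(a, set())


def stmt_refines(agents_a, agents_b):
    """True only if every agent in a refines the agent at that position in b."""
    return all(agent_refines(a, b) for a, b in zip(agents_a, agents_b))


general_raf = ('RAF_family', 'MEK1')
general_mek = ('BRAF', 'MEK_family')
specific = ('BRAF', 'MEK1')
family_level = ('RAF_family', 'MEK_family')
```

With these definitions neither general_raf nor general_mek refines the other, because the isa relations point in opposite directions at the two positions, whereas specific refines family_level at both positions.
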

Parameters:
  • return_toplevel (Optional[bool]) – If True only the top level statements are returned. If False, all statements are returned. Default: True
  • poolsize (Optional[int]) – The number of worker processes to use to parallelize the comparisons performed by the function. If None (default), no parallelization is performed. NOTE: Parallelization is only available on Python 3.4 and above.
  • size_cutoff (Optional[int]) – Groups with size_cutoff or more statements are sent to worker processes, while smaller groups are compared in the parent process. Default value is 100. Not relevant when parallelization is not used.
Returns:

The returned list contains Statements representing the more concrete/refined versions of the Statements involving particular entities. The attribute related_stmts is also set to this list. However, if return_toplevel is False then all statements are returned, irrespective of level of specificity. In this case the relationships between statements can be accessed via the supports/supported_by attributes.

Return type:

list of indra.statements.Statement

Examples

A more general statement with no information about a Phosphorylation site is identified as supporting a more specific statement:

>>> from indra.statements import Agent, Phosphorylation
>>> from indra.preassembler import Preassembler
>>> from indra.preassembler.hierarchy_manager import hierarchies
>>> braf = Agent('BRAF')
>>> map2k1 = Agent('MAP2K1')
>>> st1 = Phosphorylation(braf, map2k1)
>>> st2 = Phosphorylation(braf, map2k1, residue='S')
>>> pa = Preassembler(hierarchies, [st1, st2])
>>> combined_stmts = pa.combine_related()
>>> combined_stmts
[Phosphorylation(BRAF(), MAP2K1(), S)]
>>> combined_stmts[0].supported_by
[Phosphorylation(BRAF(), MAP2K1())]
>>> combined_stmts[0].supported_by[0].supports
[Phosphorylation(BRAF(), MAP2K1(), S)]
indra.preassembler.flatten_evidence(stmts)[source]

Add evidence from supporting stmts to evidence for supported stmts.

Parameters: stmts (list of indra.statements.Statement) – A list of top-level statements with associated supporting statements resulting from building a statement hierarchy with combine_related().
Returns: stmts – Statement hierarchy identical to the one passed, but with the evidence lists for each statement now containing all of the evidence associated with the statements they are supported by.
Return type: list of indra.statements.Statement

Examples

Flattening evidence adds the two pieces of evidence from the supporting statement to the evidence list of the top-level statement:

>>> from indra.statements import Agent, Evidence, Phosphorylation
>>> from indra.preassembler import Preassembler, flatten_evidence
>>> from indra.preassembler.hierarchy_manager import hierarchies
>>> braf = Agent('BRAF')
>>> map2k1 = Agent('MAP2K1')
>>> st1 = Phosphorylation(braf, map2k1,
...                       evidence=[Evidence(text='foo'), Evidence(text='bar')])
>>> st2 = Phosphorylation(braf, map2k1, residue='S',
...                       evidence=[Evidence(text='baz'), Evidence(text='bak')])
>>> pa = Preassembler(hierarchies, [st1, st2])
>>> pa.combine_related()
[Phosphorylation(BRAF(), MAP2K1(), S)]
>>> [e.text for e in pa.related_stmts[0].evidence]
['baz', 'bak']
>>> flattened = flatten_evidence(pa.related_stmts)
>>> sorted([e.text for e in flattened[0].evidence])
['bak', 'bar', 'baz', 'foo']
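
The idea behind flattening can be sketched as a recursive walk over supported_by links. This is an assumption about the general approach rather than the library's code; ToyStmt, gather_evidence and flatten_evidence_sketch are hypothetical names, and evidence is represented by plain strings.

```python
class ToyStmt:
    """Hypothetical stand-in for a Statement with evidence texts."""

    def __init__(self, evidence):
        self.evidence = list(evidence)
        self.supported_by = []


def gather_evidence(stmt):
    """Extend stmt.evidence with evidence from all of its supporters."""
    total = list(stmt.evidence)
    for supporter in stmt.supported_by:
        for ev in gather_evidence(supporter):
            # Skip evidence that has already been collected.
            if ev not in total:
                total.append(ev)
    stmt.evidence = total
    return total


def flatten_evidence_sketch(stmts):
    for stmt in stmts:
        gather_evidence(stmt)
    return stmts


general = ToyStmt(['foo', 'bar'])
refined = ToyStmt(['baz', 'bak'])
refined.supported_by.append(general)
flattened = flatten_evidence_sketch([refined])
```

After the call, the refined statement carries all four pieces of evidence, while the general statement keeps only its own.
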
indra.preassembler.flatten_stmts(stmts)[source]

Return the full set of unique statements in a pre-assembled statement graph.

The flattened list of statements returned by this function can be compared to the original set of unique statements to make sure no statements have been lost during the preassembly process.

Parameters: stmts (list of indra.statements.Statement) – A list of top-level statements with associated supporting statements resulting from building a statement hierarchy with combine_related().
Returns: stmts – List of all statements contained in the hierarchical statement graph.
Return type: list of indra.statements.Statement

Examples

Calling combine_related() on two statements results in one top-level statement; calling flatten_stmts() recovers both:

>>> from indra.statements import Agent, Phosphorylation
>>> from indra.preassembler import Preassembler, flatten_stmts
>>> from indra.preassembler.hierarchy_manager import hierarchies
>>> braf = Agent('BRAF')
>>> map2k1 = Agent('MAP2K1')
>>> st1 = Phosphorylation(braf, map2k1)
>>> st2 = Phosphorylation(braf, map2k1, residue='S')
>>> pa = Preassembler(hierarchies, [st1, st2])
>>> pa.combine_related()
[Phosphorylation(BRAF(), MAP2K1(), S)]
>>> flattened = flatten_stmts(pa.related_stmts)
>>> flattened.sort(key=lambda x: x.matches_key())
>>> flattened
[Phosphorylation(BRAF(), MAP2K1()), Phosphorylation(BRAF(), MAP2K1(), S)]
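
Recovering every statement from the top-level list amounts to a graph traversal over supported_by links that visits each statement once. The sketch below assumes that structure; Node and flatten_stmts_sketch are hypothetical names, not INDRA code.

```python
class Node:
    """Hypothetical stand-in for a Statement in the support graph."""

    def __init__(self, name):
        self.name = name
        self.supported_by = []


def flatten_stmts_sketch(stmts):
    """Collect each statement reachable via supported_by exactly once."""
    seen = set()
    result = []
    stack = list(stmts)
    while stack:
        stmt = stack.pop()
        if id(stmt) in seen:
            continue
        seen.add(id(stmt))
        result.append(stmt)
        stack.extend(stmt.supported_by)
    return result


# A diamond: two intermediate statements share one general supporter.
top, mid_a, mid_b, base = Node('top'), Node('mid_a'), Node('mid_b'), Node('base')
top.supported_by = [mid_a, mid_b]
mid_a.supported_by = [base]
mid_b.supported_by = [base]
names = sorted(n.name for n in flatten_stmts_sketch([top]))
```

Tracking visited objects matters because a general statement can support several refined ones and would otherwise be listed more than once.
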
indra.preassembler.render_stmt_graph(statements, agent_style=None)[source]

Render the statement hierarchy as a pygraphviz graph.

Parameters:
  • statements (list of indra.statements.Statement) – A list of top-level statements with associated supporting statements resulting from building a statement hierarchy with combine_related().
  • agent_style (dict or None) –

    Dict of attributes specifying the visual properties of nodes. If None, the following default attributes are used:

    agent_style = {'color': 'lightgray', 'style': 'filled',
                   'fontname': 'arial'}
    
Returns:

Pygraphviz graph with nodes representing statements and edges pointing from supported statements to supported_by statements.

Return type:

pygraphviz.AGraph

Examples

Pattern for getting statements and rendering as a Graphviz graph:

>>> from indra.statements import Agent, Phosphorylation
>>> from indra.preassembler import Preassembler, render_stmt_graph
>>> from indra.preassembler.hierarchy_manager import hierarchies
>>> braf = Agent('BRAF')
>>> map2k1 = Agent('MAP2K1')
>>> st1 = Phosphorylation(braf, map2k1)
>>> st2 = Phosphorylation(braf, map2k1, residue='S')
>>> pa = Preassembler(hierarchies, [st1, st2])
>>> pa.combine_related()
[Phosphorylation(BRAF(), MAP2K1(), S)]
>>> graph = render_stmt_graph(pa.related_stmts)
>>> graph.write('example_graph.dot') # To make the DOT file
>>> graph.draw('example_graph.png', prog='dot') # To make an image

Resulting graph:

Example statement graph rendered by Graphviz
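
To show what such a graph contains without requiring pygraphviz, the sketch below writes DOT text by hand, with one node per statement and one edge from each statement to each of its supported_by entries. ToyStmt and to_dot are hypothetical names; the real function returns a pygraphviz.AGraph rather than a string.

```python
class ToyStmt:
    """Hypothetical stand-in for a Statement, identified by its name."""

    def __init__(self, name):
        self.name = name
        self.supported_by = []


def to_dot(stmts, agent_style=None):
    """Return DOT source for the support graph rooted at stmts."""
    if agent_style is None:
        agent_style = {'color': 'lightgray', 'style': 'filled',
                       'fontname': 'arial'}
    attrs = ', '.join('%s="%s"' % (k, v)
                      for k, v in sorted(agent_style.items()))
    lines = ['digraph {', '  node [%s];' % attrs]
    seen = set()
    stack = list(stmts)
    while stack:
        stmt = stack.pop()
        if stmt.name in seen:
            continue
        seen.add(stmt.name)
        lines.append('  "%s";' % stmt.name)
        for supporter in stmt.supported_by:
            # Edge from the supported statement to its supporter.
            lines.append('  "%s" -> "%s";' % (stmt.name, supporter.name))
            stack.append(supporter)
    lines.append('}')
    return '\n'.join(lines)


general = ToyStmt('Phosphorylation(BRAF(), MAP2K1())')
refined = ToyStmt('Phosphorylation(BRAF(), MAP2K1(), S)')
refined.supported_by.append(general)
dot = to_dot([refined])
```
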

Entity grounding curation and mapping (indra.preassembler.grounding_mapper)

indra.preassembler.grounding_mapper.protein_map_from_twg(twg)[source]

Build map of entity texts to validated protein grounding.

Looks at the grounding of the entity texts extracted from the statements and finds proteins where there is grounding to a human protein that maps to an HGNC name that is an exact match to the entity text. Returns a dict that can be used to update/expand the grounding map.

Site curation and mapping (indra.preassembler.sitemapper)

class indra.preassembler.sitemapper.MappedStatement(original_stmt, mapped_mods, mapped_stmt)[source]

Information about a Statement found to have invalid sites.

Parameters:
  • original_stmt (indra.statements.Statement) – The statement prior to mapping.
  • mapped_mods (list of tuples) – A list of invalid sites, where each entry in the list has two elements: ((gene_name, residue, position), mapped_site). If the invalid position was not found in the site map, mapped_site is None; otherwise it is a tuple consisting of (residue, position, comment).
  • mapped_stmt (indra.statements.Statement) – The statement after mapping. Note that if no information was found in the site map, it will be identical to the original statement.
class indra.preassembler.sitemapper.SiteMapper(site_map)[source]

Use curated site information to standardize modification sites in stmts.

Parameters: site_map (dict (as returned by load_site_map())) – A dict mapping tuples of the form (gene, orig_res, orig_pos) to a tuple of the form (correct_res, correct_pos, comment), where gene is the string name of the gene (canonicalized to HGNC); orig_res and orig_pos are the residue and position to be mapped; correct_res and correct_pos are the corrected residue and position, and comment is a string describing the reason for the mapping (species error, isoform error, wrong residue name, etc.).

Examples

Fixing site errors on both the modification state of an agent (MAP2K1) and the target of a Phosphorylation statement (MAPK1):

>>> from indra.statements import Agent, ModCondition, Phosphorylation
>>> from indra.preassembler.sitemapper import default_mapper
>>> map2k1_phos = Agent('MAP2K1', db_refs={'UP':'Q02750'}, mods=[
...     ModCondition('phosphorylation', 'S', '217'),
...     ModCondition('phosphorylation', 'S', '221')])
>>> mapk1 = Agent('MAPK1', db_refs={'UP':'P28482'})
>>> stmt = Phosphorylation(map2k1_phos, mapk1, 'T','183')
>>> (valid, mapped) = default_mapper.map_sites([stmt])
>>> valid
[]
>>> mapped
[
MappedStatement:
    original_stmt: Phosphorylation(MAP2K1(mods: (phosphorylation, S, 217), (phosphorylation, S, 221)), MAPK1(), T, 183)
    mapped_mods: (('MAP2K1', 'S', '217'), ('S', '218', 'off by one'))
                 (('MAP2K1', 'S', '221'), ('S', '222', 'off by one'))
                 (('MAPK1', 'T', '183'), ('T', '185', 'off by two; mouse sequence'))
    mapped_stmt: Phosphorylation(MAP2K1(mods: (phosphorylation, S, 218), (phosphorylation, S, 222)), MAPK1(), T, 185)
]
>>> ms = mapped[0]
>>> ms.original_stmt
Phosphorylation(MAP2K1(mods: (phosphorylation, S, 217), (phosphorylation, S, 221)), MAPK1(), T, 183)
>>> ms.mapped_mods 
[(('MAP2K1', 'S', '217'), ('S', '218', 'off by one')), (('MAP2K1', 'S', '221'), ('S', '222', 'off by one')), (('MAPK1', 'T', '183'), ('T', '185', 'off by two; mouse sequence'))]
>>> ms.mapped_stmt
Phosphorylation(MAP2K1(mods: (phosphorylation, S, 218), (phosphorylation, S, 222)), MAPK1(), T, 185)
map_sites(stmts, do_methionine_offset=True, do_orthology_mapping=True, do_isoform_mapping=True)[source]

Check a set of statements for invalid modification sites.

Statements are checked against UniProt reference sequences to determine if residues referred to by post-translational modifications exist at the given positions.

If there is nothing amiss with a statement (modifications on any of the agents, modifications made in the statement, etc.), then the statement goes into the list of valid statements. If there is a problem with the statement, the offending modifications are looked up in the site map (site_map), and an instance of MappedStatement is added to the list of mapped statements.

Parameters:
  • stmts (list of indra.statements.Statement) – The statements to check for site errors.
  • do_methionine_offset (boolean) – Whether to check for off-by-one errors in site position (possibly) attributable to site numbering from mature proteins after cleavage of the initial methionine. If True, checks the reference sequence for a known modification at 1 site position greater than the given one; if there exists such a site, creates the mapping. Default is True.
  • do_orthology_mapping (boolean) – Whether to check sequence positions for known modification sites in mouse or rat sequences (based on PhosphoSitePlus data). If a mouse/rat site is found that is linked to a site in the human reference sequence, a mapping is created. Default is True.
  • do_isoform_mapping (boolean) – Whether to check sequence positions for known modifications in other human isoforms of the protein (based on PhosphoSitePlus data). If a site is found that is linked to a site in the human reference sequence, a mapping is created. Default is True.
Returns:

2-tuple containing (valid_statements, mapped_statements). The first element of the tuple is a list of valid statements (indra.statements.Statement) that were not found to contain any site errors. The second element of the tuple is a list of mapped statements (MappedStatement) with information on the incorrect sites and corresponding statements with correctly mapped sites.

Return type:

tuple
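
The valid/mapped split can be sketched at the level of individual sites. In this simplified illustration a set of known-good sites stands in for the reference sequence check, and the site map is a plain dict of the form described for SiteMapper. map_sites_sketch and valid_sites are hypothetical; the real method works on Statements and returns MappedStatement objects.

```python
def map_sites_sketch(sites, valid_sites, site_map):
    """Split sites into valid ones and (site, mapping) pairs for the rest."""
    valid, mapped = [], []
    for site in sites:
        if site in valid_sites:
            valid.append(site)
        else:
            # None signals that the site map has no entry for this site.
            mapped.append((site, site_map.get(site)))
    return valid, mapped


# Hypothetical stand-in for the reference sequence check.
valid_sites = {('MAPK1', 'T', '185')}
site_map = {('MAPK1', 'T', '183'): ('T', '185', 'off by two; mouse sequence')}
sites = [('MAPK1', 'T', '185'), ('MAPK1', 'T', '183'), ('MAPK1', 'Y', '999')]
valid, mapped = map_sites_sketch(sites, valid_sites, site_map)
```

The third site is invalid and has no curated correction, so its mapping is None, matching the convention described for mapped_mods.
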

indra.preassembler.sitemapper.default_mapper = <indra.preassembler.sitemapper.SiteMapper object>

A default instance of SiteMapper that contains the site information found in resources/curated_site_map.csv.

indra.preassembler.sitemapper.load_site_map(path)[source]

Load the modification site map from a file.

The site map file should be a comma-separated file with six columns:

Gene: HGNC gene name
OrigRes: Original (incorrect) residue
OrigPos: Original (incorrect) residue position
CorrectRes: The correct residue for the modification
CorrectPos: The correct residue position
Comment: Description of the reason for the error.
Parameters: path (string) – Path to the comma-separated site map file.
Returns: A dict mapping tuples of the form (gene, orig_res, orig_pos) to a tuple of the form (correct_res, correct_pos, comment), where gene is the string name of the gene (canonicalized to HGNC); orig_res and orig_pos are the residue and position to be mapped; correct_res and correct_pos are the corrected residue and position, and comment is a string describing the reason for the mapping (species error, isoform error, wrong residue name, etc.).
Return type: dict
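
Reading a six-column file into a dict of this shape can be sketched with the standard csv module. load_site_map_sketch is a hypothetical function that reads from an open file handle rather than a path, and it assumes the first row is a header naming the columns; neither detail is confirmed by this documentation.

```python
import csv
import io


def load_site_map_sketch(handle):
    """Build {(gene, orig_res, orig_pos): (correct_res, correct_pos, comment)}."""
    site_map = {}
    reader = csv.reader(handle)
    next(reader)  # Assumption: the first row is a header row.
    for gene, orig_res, orig_pos, correct_res, correct_pos, comment in reader:
        site_map[(gene, orig_res, orig_pos)] = (correct_res, correct_pos,
                                                comment)
    return site_map


example = io.StringIO(
    'Gene,OrigRes,OrigPos,CorrectRes,CorrectPos,Comment\n'
    'MAPK1,T,183,T,185,off by two; mouse sequence\n'
    'MAP2K1,S,217,S,218,off by one\n'
)
site_map = load_site_map_sketch(example)
```

Positions are kept as strings here, as in the SiteMapper example, where positions such as '183' and '185' are strings.
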

Hierarchy manager (indra.preassembler.hierarchy_manager)

class indra.preassembler.hierarchy_manager.HierarchyManager(rdf_file, build_closure=True, uri_as_name=True)[source]

Store hierarchical relationships between different types of entities.

Used to store, e.g., entity hierarchies (proteins and protein families) and modification hierarchies (serine phosphorylation vs. phosphorylation).

Parameters:
  • rdf_file (string) – Path to the RDF file containing the hierarchy.
  • build_closure (Optional[bool]) – If True, the transitive closure of the hierarchy is generated up front to speed up processing. Default: True
  • uri_as_name (Optional[bool]) – If True, entries are accessed directly by their URIs. If False entries are accessed by finding their name through the hasName relationship. Default: True
graph

instance of rdflib.Graph – The RDF graph containing the hierarchy.

build_transitive_closures()[source]

Build the transitive closures of the hierarchy.

This method constructs dictionaries which contain terms in the hierarchy as keys and either all the “isa+” or “partof+” related terms as values.
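
A transitive closure of this kind can be sketched over a plain dict of direct relations. The toy hierarchy, transitive_closure and isa_lookup are hypothetical and work on simple names instead of URIs in an RDF graph.

```python
def transitive_closure(direct):
    """Map each term to every term reachable through one or more edges."""
    closure = {}
    for term in direct:
        reached = set()
        stack = list(direct[term])
        while stack:
            parent = stack.pop()
            if parent not in reached:
                reached.add(parent)
                stack.extend(direct.get(parent, ()))
        closure[term] = reached
    return closure


def isa_lookup(closure, child, parent):
    """Answer an isa query with a single set membership test."""
    return parent in closure.get(child, set())


# Toy isa edges; 'toy_kinase_family' is made up for illustration.
isa_direct = {'MAPK1': ['ERK'], 'MAPK3': ['ERK'], 'ERK': ['toy_kinase_family']}
closure = transitive_closure(isa_direct)
```

Precomputing the closure trades memory for speed: each later query is a set lookup rather than a graph walk.
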

find_entity(x)[source]

Get the entity that has the specified name (or synonym).

Parameters: x (string) – Name or synonym for the target entity.
get_children(uri)[source]

Return all (not just immediate) children of a given entry.

Parameters: uri (str) – The URI of the entry whose children are to be returned. See the get_uri method to construct this URI from a name space and id.
get_parents(uri, type='all')[source]

Return parents of a given entry.

Parameters:
  • uri (str) – The URI of the entry whose parents are to be returned. See the get_uri method to construct this URI from a name space and id.
  • type (str) – ‘all’: return all parents irrespective of level; ‘immediate’: return only the immediate parents; ‘top’: return only the highest level parents
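
The three values of type can be illustrated on a toy hierarchy. get_parents_sketch is a hypothetical function operating on a plain dict of direct parents rather than on URIs; its type argument mirrors the parameter name used above and therefore shadows the Python builtin inside the function.

```python
def get_parents_sketch(direct, term, type='all'):
    """Return parents of term from a dict of direct parent lists."""
    immediate = set(direct.get(term, ()))
    if type == 'immediate':
        return immediate
    # Walk upward to collect parents at every level.
    all_parents = set()
    stack = list(immediate)
    while stack:
        parent = stack.pop()
        if parent not in all_parents:
            all_parents.add(parent)
            stack.extend(direct.get(parent, ()))
    if type == 'all':
        return all_parents
    if type == 'top':
        # Top-level parents are those that have no parents themselves.
        return {p for p in all_parents if not direct.get(p)}
    raise ValueError('unknown type: %r' % (type,))


# Toy hierarchy: A has parents B and D, and B has parent C.
direct = {'A': ['B', 'D'], 'B': ['C']}
```

For term 'A' this gives B and D as immediate parents, adds C for 'all', and returns C and D for 'top'.
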
isa(ns1, id1, ns2, id2)[source]

Indicate whether one entity has an “isa” relationship to another.

Parameters:
  • ns1 (string) – Namespace code for an entity.
  • id1 (string) – URI for an entity.
  • ns2 (string) – Namespace code for an entity.
  • id2 (string) – URI for an entity.
Returns:

True if the entity identified by (ns1, id1) has an “isa” relationship with the entity identified by (ns2, id2), either directly or through a series of intermediates; False otherwise.

Return type:

bool

partof(ns1, id1, ns2, id2)[source]

Indicate whether one entity is physically part of another.

Parameters:
  • ns1 (string) – Namespace code for an entity.
  • id1 (string) – URI for an entity.
  • ns2 (string) – Namespace code for an entity.
  • id2 (string) – URI for an entity.
Returns:

True if the entity identified by (ns1, id1) has a “partof” relationship with the entity identified by (ns2, id2), either directly or through a series of intermediates; False otherwise.

Return type:

bool