Util (indra.util)
Statement presentation (indra.util.statement_presentation)
This module groups and sorts Statements for presentation in downstream tools while aggregating the statements’ statistics/metrics into the groupings. While most usage of this module will be via the top-level function group_and_sort_statements, alternative usages (including custom statement data, multiple statement grouping levels, and multiple strategies for aggregating statement-level metrics for higher-level groupings) are supported through the various classes (see Class Overview below).
Vocabulary
An “agent-pair” is, as the name suggests, a pair of agents from a statement, usually defined by their canonical names.
A “relation” is the basic information of a statement, with all details (such as sites, residues, mutations, and bound conditions) stripped away. Usually this means it is just the statement type (or verb), subject name, and object name, though in some corner cases it is different.
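To make the two terms concrete, here is a minimal sketch that uses plain tuples as stand-ins for statements. The tuple layout, the key shapes, and the helper names agent_pair_key and relation_key are assumptions made for illustration; they are not the key format INDRA uses internally.

```python
from collections import defaultdict

# Hypothetical stand-ins for statements: (type, subject, object, detail).
stmts = [
    ("Phosphorylation", "MEK", "ERK", "T185"),
    ("Phosphorylation", "MEK", "ERK", "Y187"),
    ("Activation", "MEK", "ERK", None),
    ("Phosphorylation", "RAF", "MEK", "S218"),
]


def agent_pair_key(stmt):
    """Agent-pair: only the names of the two agents."""
    _, subj, obj, _ = stmt
    return (subj, obj)


def relation_key(stmt):
    """Relation: statement type plus agent names, with details stripped."""
    stmt_type, subj, obj, _ = stmt
    return (stmt_type, subj, obj)


# Group first by agent pair, then by relation within each pair.
grouped = defaultdict(lambda: defaultdict(list))
for stmt in stmts:
    grouped[agent_pair_key(stmt)][relation_key(stmt)].append(stmt)

for pair, relations in grouped.items():
    print(pair)
    for rel, members in relations.items():
        print("  ", rel, len(members))
```

The two phosphorylation statements differ only in their site, so they fall under the same relation, while the activation statement forms a second relation under the same agent pair.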
Simple Example
The principal function in the module is group_and_sort_statements, and if you want statements grouped into agent-pairs, then by relations, sorted by evidence count, simply use the function with its defaults, e.g.:
from indra.util.statement_presentation import group_and_sort_statements

for _, ag_key, rels, ag_metrics in group_and_sort_statements(stmts):
    print(ag_key)
    for _, rel_key, stmt_data, rel_metrics in rels:
        print(' ', rel_key)
        for _, stmt_hash, stmt_obj, stmt_metrics in stmt_data:
            print(' ', stmt_obj)
Advanced Example
Custom data and aggregation methods are supported, respectively, by using instances of the StmtStat class and subclassing the BasicAggregator (or more generally, the AggregatorMeta) API. Custom sorting is implemented by defining and passing a sort_by function to group_and_sort_statements.
For example, suppose you have custom statement metrics (e.g., a value obtained by experiment such as differential expression of subject or object genes), want the statements grouped only to the level of relations, and want to sort the statements and relations independently. Suppose also that your measurement applies equally at the statement and relation level and hence you don’t want any changes applied during aggregation (e.g. averaging). This is illustrated in the example below:
from indra.util.statement_presentation import (
    BasicAggregator, StmtStat, group_and_sort_statements)

# Define a new aggregator that doesn't apply any aggregation function to
# the data, simply taking the last metric (effectively a noop):
class NoopAggregator(BasicAggregator):
    def _merge(self, metric_array):
        self.values = metric_array

# Create your StmtStat using custom data dict `my_data`, a dict of values
# keyed by statement hash:
my_stat = StmtStat('my_stat', my_data, int, NoopAggregator)

# Define a custom sort function using my stat and the default available
# ev_count. In effect this will sort relations by the custom stat, and then
# secondarily sort the statements within that relation (for which my_stat
# is by design the same) using their evidence counts.
def my_sort(metrics):
    return metrics['my_stat'], metrics['ev_count']

# Iterate over the results.
groups = group_and_sort_statements(stmts, sort_by=my_sort,
                                   custom_stats=[my_stat],
                                   grouping_level='relation')
for _, rel_key, rel_stmts, rel_metrics in groups:
    print(rel_key, rel_metrics['my_stat'])
    for _, stmt_hash, stmt, metrics in rel_stmts:
        print(' ', stmt, metrics['ev_count'])
Class Overview
Statements can have multiple metrics associated with them, most commonly belief, evidence counts, and source counts, although other metrics may also be applied. Such metrics imply an order on the set of Statements, and a user should be able to apply that order to them for sorting or filtering. These types of metric, or “stats”, are represented by instances of the StmtStat class.
Statements can be grouped based on the information they represent: by their agents (e.g. subject is MEK and object is ERK), and by their type (e.g. Phosphorylation). These groups are represented by StmtGroup objects, which on their surface behave much like defaultdict(list) would, though more is going on behind the scenes. The StmtGroup class is used internally by group_and_sort_statements and would only need to be used directly if defining an alternative statement-level grouping approach (e.g., grouping statements by subject).
Like Statements, higher-level statement groups are subject to sorting and filtering. That requires that the StmtStats be aggregated over the statements in a group. The Aggregator classes serve this purpose, using numpy to do sums over arrays of metrics as Statements are “included” in the StmtGroup. Each StmtStat must declare how its data should be aggregated, as different kinds of data aggregate differently. Custom aggregation methods can be implemented by subclassing the BasicAggregator class and using an instance of the custom class to define a StmtStat.
- class indra.util.statement_presentation.AggregatorMeta[source]
Define the API for an aggregator of statement metrics.
In general, an aggregator defines the ways that different kinds of statement metrics are merged into groups. For example, evidence counts are aggregated by summing, as are counts for various sources. Beliefs are aggregated over a group of statements by maximum (usually).
- class indra.util.statement_presentation.AveAggregator(keys, stmt_metrics, original_types)[source]
A stats aggregator that averages the included statement metrics.
- class indra.util.statement_presentation.BasicAggregator(keys, stmt_metrics, original_types)[source]
Gathers measurements for a statement or similar entity.
By defining a child of BasicAggregator, specifically defining the operations that gather new data and finalize that data once all the statements are collected, one can use arbitrary statistical methods to aggregate metrics for high-level groupings of Statements for subsequent sorting or filtering purposes.
- Parameters
keys (list[str]) – A dict keyed by aggregation method of lists of the names for the elements of data.
stmt_metrics (dict{int: np.ndarray}) – A dictionary keyed by hash with each element a dict of arrays keyed by aggregation type.
original_types (tuple(type)) – The type classes of each numerical value stored in the base_group dict, e.g. (int, float, int).
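The idea of gathering metrics one statement at a time and merging them with a metric-specific rule can be sketched as follows. This is a simplified stand-in and not INDRA's BasicAggregator: the class names TinyAggregator, TinySum, and TinyMax and the constructor signature are hypothetical, and plain lists are used in place of numpy arrays.

```python
class TinyAggregator:
    """Hypothetical stand-in for an aggregator; not INDRA's class."""

    def __init__(self, keys):
        self.keys = list(keys)
        # Starting from 0 is only appropriate for non-negative metrics.
        self.values = [0] * len(self.keys)
        self.count = 0

    def include(self, metrics):
        """Fold one statement's metrics (ordered like keys) into the group."""
        self._merge(metrics)
        self.count += 1

    def _merge(self, metrics):
        raise NotImplementedError("Subclasses define the merge rule.")

    def get_dict(self):
        return dict(zip(self.keys, self.values))


class TinySum(TinyAggregator):
    def _merge(self, metrics):
        self.values = [a + b for a, b in zip(self.values, metrics)]


class TinyMax(TinyAggregator):
    def _merge(self, metrics):
        self.values = [max(a, b) for a, b in zip(self.values, metrics)]


# Evidence counts are summed over the group; beliefs take the maximum.
ev_counts = TinySum(["ev_count"])
beliefs = TinyMax(["belief"])
for ev, belief in [(3, 0.65), (1, 0.9), (5, 0.8)]:
    ev_counts.include([ev])
    beliefs.include([belief])

print(ev_counts.get_dict(), beliefs.get_dict())
```

Keeping the merge rule in a single overridable method is what lets a new aggregation strategy be added without touching the grouping logic.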
- class indra.util.statement_presentation.MaxAggregator(keys, stmt_metrics, original_types)[source]
A stats aggregator that takes the max of statement metrics.
- class indra.util.statement_presentation.MultiAggregator(basic_aggs)[source]
Implement the AggregatorMeta API for multiple BasicAggregator children.
Takes an iterable of BasicAggregator children.
- class indra.util.statement_presentation.StmtGroup(stat_groups)[source]
Creates higher-level stmt groupings and aggregates metrics accordingly.
Used internally by group_and_sort_statements.
This class manages the accumulation of statistics for statement groupings, such as by relation or agent pair. It calculates metrics for these higher-level groupings using metric-specific aggregators implementing the AggregatorMeta API (e.g., MultiAggregator and any children of BasicAggregator).
For example, evidence counts for a relation can be calculated as the sum of the statement-level evidence counts, while the belief for the relation can be calculated as the average or maximum of the statement-level beliefs.
The primary methods for instantiating this class are the two factory class methods:
- from_stmt_stats
- from_dicts
See the methods for more details on their purpose and usage.
Once instantiated, the StmtGroup behaves like a defaultdict of lists, where the keys are group-level keys, and the lists contain statements. Statements can be iteratively added to the group via the dict-like syntax stmt_group[group_key].include(stmt). This allows the caller to generate keys and trigger metric aggregation in a single iteration over statements.
Example usage:
# Get ev_count, belief, and ag_count from a list of statements.
stmt_stats = StmtStat.from_stmts(stmt_list)

# Add another stat for a measure of relevance
stmt_stats.append(
    StmtStat('relevance', relevance_dict, float, AveAggregator)
)

# Create the Group
sg = StmtGroup.from_stmt_stats(*stmt_stats)

# Load it full of Statements, grouped by agents.
sg.fill_from_stmt_stats()
sg.start()
for s in stmt_list:
    # Use a tuple (not a generator) so the key is hashable by value.
    key = tuple(ag.get_grounding() for ag in s.agent_list())
    sg[key].include(s)
sg.finish()

# Now the stats for each group are aggregated and available for use.
metrics = sg[(('FPLX', 'MEK'), ('FPLX', 'ERK'))].get_dict()
- add_stats(*stmt_stats)[source]
Add more stats to the object.
If you have started accumulating data from statements and doing aggregation (i.e., if you have “started”), or if you are “finished”, this request will lead to an error.
- fill_from_stmt_stats()[source]
Use the statements stats as stats and hashes as keys.
This is used if you decide you just want to represent statements.
- classmethod from_dicts(ev_counts=None, beliefs=None, source_counts=None)[source]
Init a stmt group from dicts keyed by hash.
Return a StmtGroup constructed from the given keyword arguments. The dict keys of source_counts will be broken out into their own StmtStat objects, so that the resulting data model is in effect a flat list of measurement parameters. There is some risk of name collision, so take care not to name any sources “ev_counts” or “belief”.
- class indra.util.statement_presentation.StmtStat(name, data, data_type, agg_class)[source]
Abstraction of a metric applied to a set of statements.
Can be instantiated either via the constructor or via one of two factory class methods:
- s = StmtStat(name, {hash: value, …}, data_type, AggClass)
- [s1, …] = StmtStat.from_dicts({hash: {label: value, …}, …}, data_type, AggClass)
- [s_ev_count, s_belief] = StmtStat.from_stmts([Statement(), …], ('ev_count', 'belief'))
Note that each stat will have only one metric associated with it, so dicts ingested by from_dicts will have their values broken up into separate StmtStat instances.
- Parameters
name (str) – The label for this data (e.g. “ev_count” or “belief”)
data (dict{int: Number}) – The relevant statistics as a dict keyed by hash.
data_type (type) – The type of the data (e.g. int or float).
agg_class (type) – A subclass of BasicAggregator which defines how these statistics will be merged.
- classmethod from_dicts(dict_data, data_type, agg_class)[source]
Generate a list of StmtStat objects from a dict of dicts.
Example usage:
>> source_counts = {9623812756876: {'reach': 1, 'sparser': 2},
>>                  -39877587165298: {'reach': 3, 'sparser': 0}}
>> stmt_stats = StmtStat.from_dicts(source_counts, int, SumAggregator)
- Parameters
dict_data (dict{int: dict{str: Number}}) – A dictionary keyed by hash with dictionary elements, where each element gives a set of measurements for the statement labels as keys. A common example is source_counts.
data_type (type) – The type of the data being given (e.g. int or float).
agg_class (type) – A subclass of BasicAggregator which defines how these statistics will be merged (e.g. SumAggregator).
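The way a dict of dicts is broken up into one single-metric mapping per label can be sketched with plain dictionaries. The helper split_by_label is hypothetical and only illustrates the reshaping; it does not construct StmtStat objects.

```python
from collections import defaultdict

source_counts = {
    9623812756876: {"reach": 1, "sparser": 2},
    -39877587165298: {"reach": 3, "sparser": 0},
}


def split_by_label(dict_data):
    """Turn {hash: {label: value}} into {label: {hash: value}}."""
    per_label = defaultdict(dict)
    for stmt_hash, measurements in dict_data.items():
        for label, value in measurements.items():
            per_label[label][stmt_hash] = value
    return dict(per_label)


per_source = split_by_label(source_counts)
for label, data in per_source.items():
    # Each (label, data) pair corresponds to one single-metric stat.
    print(label, data)
```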
- classmethod from_stmts(stmt_list, values=None)[source]
Generate a list of StmtStat objects from a list of stmts.
The stats will include “ev_count”, “belief”, and “ag_count” by default, but a more limited selection may be specified using values.
Example usage:
>> stmt_stats = StmtStat.from_stmts(stmt_list, ('ag_count', 'belief'))
- Parameters
stmt_list (list[Statement]) – A list of INDRA statements, from which basic stats will be derived.
values (Optional[tuple(str)]) – A tuple of the names of the values to gather from the list of statements. For example, if you already have evidence counts, you might only want to gather belief and agent counts.
- class indra.util.statement_presentation.SumAggregator(keys, stmt_metrics, original_types)[source]
A stats aggregator that executes a sum.
- indra.util.statement_presentation.all_sources = ['psp', 'cbn', 'pc', 'bel_lc', 'signor', 'biogrid', 'tas', 'hprd', 'trrust', 'ctd', 'vhn', 'pe', 'drugbank', 'omnipath', 'conib', 'crog', 'dgi', 'minerva', 'creeds', 'ubibrowser', 'acsn', 'geneways', 'tees', 'gnbr', 'semrep', 'isi', 'trips', 'rlimsp', 'medscan', 'eidos', 'sparser', 'reach']
Source names as they appear in the DB
- indra.util.statement_presentation.available_sources_src_counts(source_counts, custom_sources=None)[source]
Returns the set of sources available from a source counts dict
- indra.util.statement_presentation.available_sources_stmts(stmts, custom_sources=None)[source]
Returns the set of sources available in a list of statements
- indra.util.statement_presentation.db_sources = ['psp', 'cbn', 'pc', 'bel_lc', 'signor', 'biogrid', 'tas', 'hprd', 'trrust', 'ctd', 'vhn', 'pe', 'drugbank', 'omnipath', 'conib', 'crog', 'dgi', 'minerva', 'creeds', 'ubibrowser', 'acsn']
Database source names as they appear in the DB
- indra.util.statement_presentation.group_and_sort_statements(stmt_list, sort_by='default', custom_stats=None, grouping_level='agent-pair')[source]
Group statements by type and arguments, and sort by prevalence.
- Parameters
sort_by (str or function or None) – If str, it indicates which parameter to sort by, such as ‘belief’, ‘ev_count’, or ‘ag_count’. Those are the default options because they can be derived from a list of statements; however, if you give custom_stats, you may also use any of the stats named there. The default, ‘default’, is mostly a sort by ev_count but also favors statements with fewer agents. Alternatively, you may give a function that takes a single argument, a dictionary of metrics. These metrics include any stats passed via custom_stats along with the default metrics that can be derived from the statements themselves, namely ev_count, belief, and ag_count (see StmtGroup for details). The value may also be None, in which case the sort function will return the same value for all elements, and thus the original order of elements will be preserved. This could have strange effects when statements are grouped (i.e. when grouping_level is not ‘statement’); such functionality is untested and we make no guarantee that it will work.
custom_stats (list[StmtStat]) – A list of custom statement statistics to be used in addition to, or upon name conflict in place of, the default statement statistics derived from the list of statements.
grouping_level (str) – The options are ‘agent-pair’, ‘relation’, and ‘statement’. These correspond to grouping by agent pairs, agent and type relationships, and a flat list of statements. The default is ‘agent-pair’.
- Returns
sorted_groups – A list of tuples of the form (sort_param, key, contents, metrics), where the sort param is whatever value was calculated to sort the results, the key is the unique key for the agent pair, relation, or statement, and the contents are either relations, statements, or statement JSON, depending on the level. This structure is recursive, so each list of relations will also follow this structure, all the way down to the lowest level (statement JSON). The metrics entry is a dict of the aggregated metrics for the entry (e.g. source counts, evidence counts, etc).
- Return type
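How a sort_by function ranks entries from their metrics dict, and why a constant sort value preserves the original order, can be sketched with Python's built-in stable sort. The sample groups and the descending order used here are assumptions for illustration; this is not the function's internal sorting code.

```python
# Hypothetical aggregated metrics for three groups.
groups = [
    {"key": "A", "ev_count": 4, "belief": 0.7},
    {"key": "B", "ev_count": 10, "belief": 0.7},
    {"key": "C", "ev_count": 1, "belief": 0.95},
]


def by_belief_then_evidence(metrics):
    """Sort primarily by belief, breaking ties with evidence count."""
    return metrics["belief"], metrics["ev_count"]


ranked = sorted(groups, key=by_belief_then_evidence, reverse=True)
print([g["key"] for g in ranked])

# A constant sort value ties every element, and because Python's sort is
# stable (even with reverse=True), the input order is left unchanged.
unchanged = sorted(groups, key=lambda metrics: 0, reverse=True)
print([g["key"] for g in unchanged])
```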
- indra.util.statement_presentation.internal_source_mappings = {'bel': 'bel_lc', 'biopax': 'pc', 'phosphoelm': 'pe', 'phosphosite': 'psp', 'virhostnet': 'vhn'}
Maps from source_info.json names to DB names
- indra.util.statement_presentation.make_standard_stats(ev_counts=None, beliefs=None, source_counts=None)[source]
Generate the standard ev_counts, beliefs, and source count stats.
- indra.util.statement_presentation.make_stmt_from_relation_key(relation_key, agents=None)[source]
Make a Statement from the relation key.
Specifically, make a Statement object from the sort key used by group_and_sort_statements.
- indra.util.statement_presentation.make_string_from_relation_key(rel_key)[source]
Make a Statement string via EnglishAssembler from the relation key.
Specifically, make a string from the key used by group_and_sort_statements for contents grouped at the relation level.
- indra.util.statement_presentation.make_top_level_label_from_names_key(names)[source]
Make an English string from the tuple names.
- indra.util.statement_presentation.reader_sources = ['geneways', 'tees', 'gnbr', 'semrep', 'isi', 'trips', 'rlimsp', 'medscan', 'eidos', 'sparser', 'reach']
Reader source names as they appear in the DB
- indra.util.statement_presentation.reverse_source_mappings = {'bel_lc': 'bel', 'pc': 'biopax', 'pe': 'phosphoelm', 'psp': 'phosphosite', 'vhn': 'virhostnet'}
Maps from db names to source_info.json names
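The relationship between the source lists and mappings documented above can be checked with the literal values shown on this page. The helper to_db_name, and its choice to return unmapped names unchanged, is a hypothetical convenience and not part of the module.

```python
db_sources = ['psp', 'cbn', 'pc', 'bel_lc', 'signor', 'biogrid', 'tas',
              'hprd', 'trrust', 'ctd', 'vhn', 'pe', 'drugbank', 'omnipath',
              'conib', 'crog', 'dgi', 'minerva', 'creeds', 'ubibrowser',
              'acsn']
reader_sources = ['geneways', 'tees', 'gnbr', 'semrep', 'isi', 'trips',
                  'rlimsp', 'medscan', 'eidos', 'sparser', 'reach']
# As listed above, all_sources is the database sources followed by readers.
all_sources = db_sources + reader_sources

internal_source_mappings = {'bel': 'bel_lc', 'biopax': 'pc',
                            'phosphoelm': 'pe', 'phosphosite': 'psp',
                            'virhostnet': 'vhn'}

# The reverse mapping is the inversion of the forward one.
reverse_source_mappings = {db_name: info_name
                           for info_name, db_name
                           in internal_source_mappings.items()}


def to_db_name(source_name):
    """Hypothetical helper: map a source_info.json name to its DB name."""
    return internal_source_mappings.get(source_name, source_name)


print(to_db_name('biopax'), to_db_name('reach'))
```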
Utilities for using AWS (indra.util.aws)
- class indra.util.aws.JobLog(job_info, log_group_name='/aws/batch/job', verbose=False, append_dumps=True)[source]
Gets the Cloudwatch log associated with the given job.
- indra.util.aws.dump_logs(job_queue='run_reach_queue', job_status='RUNNING')[source]
Write logs for all jobs with the given status to files.
- indra.util.aws.get_batch_command(command_list, project=None, purpose=None)[source]
Get the command appropriate for running something on batch.
- indra.util.aws.get_date_from_str(date_str)[source]
Get a utc datetime object from a string of format %Y-%m-%d-%H-%M-%S
- Parameters
date_str (str) – A string of the format %Y(-%m-%d-%H-%M-%S). The string is assumed to represent a UTC time.
- Return type
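One way to parse a date string of this shape is sketched below using only the standard library. It assumes that the parenthesized part of %Y(-%m-%d-%H-%M-%S) means any leading subset of the components may be given, and it returns a timezone-aware datetime; both are assumptions, and the function name parse_utc_date is hypothetical.

```python
from datetime import datetime, timezone

FORMAT_PARTS = ["%Y", "%m", "%d", "%H", "%M", "%S"]


def parse_utc_date(date_str):
    """Parse '%Y' optionally followed by '-%m-%d-%H-%M-%S' components."""
    n_parts = len(date_str.split("-"))
    if not 1 <= n_parts <= len(FORMAT_PARTS):
        raise ValueError(f"Unrecognized date string: {date_str!r}")
    # Build a format with exactly as many components as were supplied.
    fmt = "-".join(FORMAT_PARTS[:n_parts])
    # The string is assumed to represent UTC, so attach that timezone.
    return datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)


print(parse_utc_date("2020-03-05-13-45-10"))
print(parse_utc_date("2020"))
```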
- indra.util.aws.get_jobs(job_queue='run_reach_queue', job_status='RUNNING')[source]
Returns a list of dicts with jobName and jobId for each job with the given status.
- indra.util.aws.get_s3_client(unsigned=True)[source]
Return a boto3 S3 client with optional unsigned config.
- Parameters
unsigned (Optional[bool]) – If True, the client will be using unsigned mode in which public resources can be accessed without credentials. Default: True
- Returns
A client object to AWS S3.
- Return type
botocore.client.S3
- indra.util.aws.get_s3_file_tree(s3, bucket, prefix, date_cutoff=None, after=True, with_dt=False)[source]
Overcome s3 response limit and return NestedDict tree of paths.
The NestedDict object also allows the user to search by the ends of a path.
The tree mimics a file directory structure, with the leaf nodes being the full unbroken key. For example, 'path/to/file.txt' would be retrieved by
ret['path']['to']['file.txt']['key']
The NestedDict object returned also has the capability to get paths that lead to a certain value. So if you wanted all paths that lead to something called ‘file.txt’, you could use
ret.get_paths('file.txt')
For more details, see the NestedDict docs.
- Parameters
s3 (boto3.client.S3) – A boto3.client.S3 instance
bucket (str) – The name of the bucket to list objects in
prefix (str) – The prefix filtering of the objects for list
date_cutoff (str|datetime.datetime) – A datestring of format %Y(-%m-%d-%H-%M-%S) or a datetime.datetime object. The date is assumed to be in UTC. By default no filtering is done. Default: None.
after (bool) – If True, only return objects after the given date cutoff. Otherwise, return objects before. Default: True
with_dt (bool) – If True, yield a tuple (key, datetime.datetime(LastModified)) of the s3 Key and the object’s LastModified date as a datetime.datetime object, only yield s3 key otherwise. Default: False.
- Returns
A file tree represented as a NestedDict
- Return type
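The tree layout described above, with each path segment as a nested level and the full key stored at the leaf, can be sketched with plain dictionaries. The helper build_tree is hypothetical, uses dict in place of NestedDict, and takes an in-memory list of keys rather than querying S3.

```python
def build_tree(keys):
    """Build a nested dict of path segments; leaves hold the full key."""
    tree = {}
    for key in keys:
        node = tree
        for part in key.split("/"):
            node = node.setdefault(part, {})
        # Store the unbroken key at the leaf node.
        node["key"] = key
    return tree


ret = build_tree(["path/to/file.txt", "path/to/other.txt", "path/readme.md"])
print(ret["path"]["to"]["file.txt"]["key"])
```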
- indra.util.aws.iter_s3_keys(s3, bucket, prefix, date_cutoff=None, after=True, with_dt=False, do_retry=True)[source]
Iterate over the keys in an s3 bucket given a prefix
- Parameters
s3 (boto3.client.S3) – A boto3.client.S3 instance
bucket (str) – The name of the bucket to list objects in
prefix (str) – The prefix filtering of the objects for list
date_cutoff (str|datetime.datetime) – A datestring of format %Y(-%m-%d-%H-%M-%S) or a datetime.datetime object. The date is assumed to be in UTC. By default no filtering is done. Default: None.
after (bool) – If True, only return objects after the given date cutoff. Otherwise, return objects before. Default: True
with_dt (bool) – If True, yield a tuple (key, datetime.datetime(LastModified)) of the s3 Key and the object’s LastModified date as a datetime.datetime object, only yield s3 key otherwise. Default: False.
do_retry (bool) – If True, and no contents appear, try again in case there was simply a brief lag. If False, do not retry, and just accept the “directory” is empty.
- Returns
An iterator over s3 keys or (key, LastModified) tuples.
- Return type
iterator[key]|iterator[(key, datetime.datetime)]
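The effect of date_cutoff, after, and with_dt can be sketched on an in-memory listing of (key, LastModified) pairs. The function filter_listing is hypothetical and does not call S3; treating objects modified exactly at the cutoff as included is an assumption.

```python
from datetime import datetime, timezone


def filter_listing(listing, date_cutoff=None, after=True, with_dt=False):
    """Yield keys (or (key, datetime) tuples) on one side of a cutoff."""
    for key, last_modified in listing:
        if date_cutoff is not None:
            if after and last_modified < date_cutoff:
                continue
            if not after and last_modified > date_cutoff:
                continue
        yield (key, last_modified) if with_dt else key


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


listing = [
    ("a.txt", utc(2021, 1, 1)),
    ("b.txt", utc(2021, 6, 1)),
    ("c.txt", utc(2022, 1, 1)),
]
cutoff = utc(2021, 3, 1)

print(list(filter_listing(listing, cutoff)))
print(list(filter_listing(listing, cutoff, after=False)))
```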
- indra.util.aws.kill_all(job_queue, reason='None given', states=None, kill_list=None)[source]
Terminates/cancels all jobs on the specified queue.
- Parameters
job_queue (str) – The name of the Batch job queue on which you wish to terminate/cancel jobs.
reason (str) – Provide a reason for the kill that will be recorded with the job’s record on AWS.
states (None or list[str]) – A list of job states to remove. Possible states are ‘STARTING’, ‘RUNNABLE’, and ‘RUNNING’. If None, all jobs in all states will be ended (modulo the kill_list below).
kill_list (None or list[dict]) – A list of job dictionaries (as returned by the submit function) that you specifically wish to kill. All other jobs on the queue will be ignored. If None, all jobs on the queue will be ended (modulo the above).
- Returns
killed_ids – A list of the job ids for jobs that were killed.
- Return type
A utility to get the INDRA version (indra.util.get_version)
This tool provides a uniform method for creating a robust indra version string, both from within python and from the command line. If possible, the version will include the git commit hash. Otherwise, the version will be marked with ‘UNHASHED’.
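The general approach of appending a git commit hash when one is available, and a marker otherwise, might look like the sketch below. The function name describe_version, the "base-hash" layout of the string, and the use of git rev-parse are assumptions; this is not the module's actual implementation or output format.

```python
import subprocess


def describe_version(base_version, repo_dir="."):
    """Return base_version suffixed with a commit hash or 'UNHASHED'."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not inside a usable repository.
        commit = "UNHASHED"
    return f"{base_version}-{commit}"


print(describe_version("1.0.0"))
```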
Define NestedDict (indra.util.nested_dict)
- class indra.util.nested_dict.NestedDict[source]
A dict-like object that recursively populates elements of a dict.
More specifically, this acts like a recursive defaultdict, allowing, for example:
>>> nd = NestedDict()
>>> nd['a']['b']['c'] = 'foo'
In addition, useful methods have been defined that allow the user to search the data structure. Note that these are not particularly optimized methods at this time. However, for convenience, you can for example simply call get_path to get the path to a particular key:
>>> nd.get_path('c')
(('a', 'b', 'c'), 'foo')
and the value at that key. Similarly:
>>> nd.get_path('b')
(('a', 'b'), NestedDict( 'c': 'foo' ))
get, gets, and get_paths operate on similar principles, and are documented below.
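The recursive-defaultdict behavior and a path search like the one shown above can be sketched with a small dict subclass. TreeDict is a hypothetical stand-in and not INDRA's NestedDict; the depth-first search order and the (None, None) result for a missing key are assumptions.

```python
class TreeDict(dict):
    """Hypothetical recursive dict; a stand-in, not INDRA's NestedDict."""

    def __missing__(self, key):
        # Create and store a child on first access, like a defaultdict.
        child = self[key] = TreeDict()
        return child

    def get_path(self, target, _prefix=()):
        """Return (path, value) for the first match, else (None, None)."""
        for key, value in self.items():
            path = _prefix + (key,)
            if key == target:
                return path, value
            if isinstance(value, TreeDict):
                found = value.get_path(target, path)
                if found[0] is not None:
                    return found
        return None, None


nd = TreeDict()
nd["a"]["b"]["c"] = "foo"
print(nd.get_path("c"))
print(nd.get_path("b"))
```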