Belief prediction with sklearn models (indra.belief.skl)

class indra.belief.skl.CountsScorer(model, source_list, include_more_specific=False, use_stmt_type=False, use_num_members=False, use_num_pmids=False, use_promoter=False, use_avg_evidence_len=False, use_residue_position=False)[source]

Belief model learned from evidence counts and other stmt properties.

If using a DataFrame for Statement data, it should have the following columns:

  • stmt_type

  • source_counts

Alternatively, if the DataFrame doesn’t have a source_counts column, it should have columns with names matching the sources in self.source_list.

Parameters
  • model (BaseEstimator) – Any instance of a classifier object supporting the methods fit, predict_proba, predict, and predict_log_proba.

  • source_list (List[str]) – List of strings denoting the evidence sources (evidence.source_api values) to be used for prediction.

  • include_more_specific (bool) – If True, will add extra columns to the statement data matrix for the source counts drawn from more specific evidences; if use_num_pmids is True, will also add an additional column for the number of PMIDs from more specific evidences. If False, these columns will not be included even if the extra_evidence argument is passed to the stmts_to_matrix method. This is to ensure that the featurization of statements is consistent between training and prediction.

  • use_stmt_type (bool) – Whether to include statement type as a feature.

  • use_num_members (bool) – Whether to include a feature denoting the number of members of the statement. Primarily for stratifying belief predictions about Complex statements with more than two members. Cannot be used for statement data passed in as a DataFrame.

  • use_num_pmids (bool) – Whether to include a feature for the total number of unique PMIDs supporting each statement. Cannot be used for statement data passed in as a DataFrame.

  • use_promoter (bool) – Whether to include a feature giving the fraction of evidence (0 to 1) containing the (case-insensitive) word “promoter”. Tends to reduce misclassification of Complex statements that actually refer to protein-DNA binding.

  • use_avg_evidence_len (bool) – Whether to include a feature giving the average evidence sentence length (in space-separated tokens).

  • use_residue_position (bool) – Whether to include a feature indicating that a Statement has a (not-None) residue and position (i.e., for Modification Statements). When used to train and predict on site-mapped Statements, allows the correspondence between the residue/position and the target substrate to be exploited in predicting overall correctness.

Example

from sklearn.linear_model import LogisticRegression
from indra.belief import BeliefEngine
from indra.belief.skl import CountsScorer

# stmts: a list of pre-assembled INDRA Statements
# y_arr: class labels for the statements (e.g., 1 = correct, 0 = incorrect)
clf = LogisticRegression()
all_stmt_sources = CountsScorer.get_all_sources(stmts)
scorer = CountsScorer(clf, all_stmt_sources, use_stmt_type=True,
                      use_num_pmids=True)
scorer.fit(stmts, y_arr)
be = BeliefEngine(scorer)
be.set_hierarchy_probs(stmts)

df_to_matrix(df)[source]

Convert a DataFrame of statement data to a feature matrix.

Based on information available in a DataFrame of statement data, this implementation uses only source counts and statement type in building a feature matrix, and will raise a ValueError if either self.use_num_members or self.use_num_pmids is set.

Features are encoded as follows:

  • One column for every source listed in self.source_list, containing the number of statement evidences from that source (taken from the source_counts column, or from per-source columns if source_counts is not present).

  • If self.use_stmt_type is set, statement type is included via one-hot encoding, with one column for each statement type.

Parameters

df (DataFrame) – A pandas DataFrame with statement metadata. It should have columns stmt_type and source_counts; alternatively, if it doesn’t have a source_counts column, it should have columns with names matching the sources in self.source_list.

Return type

ndarray

Returns

Feature matrix for the statement data.
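
Example

A minimal sketch of building a feature matrix from DataFrame metadata (not taken from the library docs); the classifier, sources, and counts below are placeholders, and the source_counts column is assumed to hold a dict of per-source evidence counts for each row:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from indra.belief.skl import CountsScorer

# Hypothetical statement metadata
df = pd.DataFrame({
    'stmt_type': ['Phosphorylation', 'Complex'],
    'source_counts': [{'reach': 2, 'sparser': 1}, {'reach': 1}],
})
scorer = CountsScorer(LogisticRegression(), ['reach', 'sparser'],
                      use_stmt_type=True)
x = scorer.df_to_matrix(df)  # one column per source, plus one-hot stmt_type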

static get_all_sources(stmts, include_more_specific=True, include_less_specific=True)[source]

Get a list of all the source_apis supporting the given statements.

Useful for determining the set of sources to be used for fitting and prediction.

Parameters
  • stmts (Sequence[Statement]) – A list of INDRA Statements to collect source APIs for.

  • include_more_specific (bool) – If True (default), then includes the source APIs for the more specific statements in the supports attribute of each statement.

  • include_less_specific (bool) – If True (default), then includes the source APIs for the less specific statements in the supported_by attribute of each statement.

Return type

List[str]

Returns

A list of (unique) source_apis found in the set of statements.
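
Example

A brief usage sketch; stmts is assumed to be a list of pre-assembled INDRA Statements:

from indra.belief.skl import CountsScorer

# All sources, including those of more/less specific supporting statements
source_list = CountsScorer.get_all_sources(stmts)
# Only sources directly attached to each statement's own evidence
direct_sources = CountsScorer.get_all_sources(
    stmts, include_more_specific=False, include_less_specific=False)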

stmts_to_matrix(stmts, extra_evidence=None)[source]

Convert a list of Statements to a feature matrix.

Features are encoded as follows:

  • One column for every source listed in self.source_list, containing the number of statement evidences from that source. If self.include_more_specific is True and extra_evidence is provided, these are used in combination with the Statement’s own evidence in determining source counts.

  • If self.use_stmt_type is set, statement type is included via one-hot encoding, with one column for each statement type.

  • If self.use_num_members is set, a column is added for the number of agents in the Statement.

  • If self.use_num_pmids is set, a column is added with the total number of unique PMIDs supporting the Statement. If extra_evidence is provided, these are used in combination with the Statement’s own evidence in determining the number of PMIDs.

Parameters
  • stmts (Sequence[Statement]) – A list or tuple of INDRA Statements to be used to generate a feature matrix.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

ndarray

Returns

Feature matrix for the statement data.
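
Example

For instance, assuming scorer was constructed as in the class-level example above:

x = scorer.stmts_to_matrix(stmts)
# One row per statement; columns are per-source evidence counts,
# followed by one-hot statement type and PMID count if those features
# are enabled
print(x.shape)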

class indra.belief.skl.HybridScorer(counts_scorer, simple_scorer)[source]

Use CountsScorer for known sources, SimpleScorer priors for any others.

Allows the use of a CountsScorer to make belief predictions based on sources seen in training data, while falling back to SimpleScorer priors for any sources not accounted for by the CountsScorer. Like the SimpleScorer, uses an independence assumption to combine beliefs from the two scorers (i.e., hybrid_bel = 1 - (1 - cs_bel) * (1 - ss_bel)).

Parameters
  • counts_scorer (CountsScorer) – Instance of CountsScorer used for belief predictions about sources it was trained on.

  • simple_scorer (SimpleScorer) – Instance of SimpleScorer whose prior probabilities are used for any sources not covered by the CountsScorer.
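
Example

A sketch of combining a trained CountsScorer with SimpleScorer priors; counts_scorer is assumed to be a fitted CountsScorer (see the CountsScorer example above), and SimpleScorer's default prior probabilities are used:

from indra.belief import BeliefEngine, SimpleScorer
from indra.belief.skl import HybridScorer

hybrid = HybridScorer(counts_scorer, SimpleScorer())
be = BeliefEngine(hybrid)
be.set_hierarchy_probs(stmts)
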
check_prior_probs(statements)[source]

Check that sources in the set of statements are accounted for.

Return type

None

score_statements(statements, extra_evidence=None)[source]

Compute belief scores for a list of INDRA Statements, combining the CountsScorer and SimpleScorer predictions as described above.

Parameters
  • statements (Sequence[Statement]) – INDRA Statements whose belief scores are to be calculated.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

Sequence[float]

Returns

The computed probabilities for each statement.

class indra.belief.skl.SklearnScorer(model)[source]

Use a pre-trained Sklearn classifier to predict belief scores.

An instance of an implementing subclass plays two roles: as a subclass of BeliefScorer, it implements the functions required by the BeliefEngine, score_statements and check_prior_probs; by composition, it also behaves like an sklearn model, implementing the methods fit, predict, predict_proba, and predict_log_proba, which are passed through to an internal sklearn model.

A key role of this wrapper class is to implement the preprocessing of statement properties into a feature matrix in a standard way, so that a classifier trained on one corpus of statement data will still work when used on another corpus.

Implementing subclasses must implement at least one of the methods for building the feature matrix, stmts_to_matrix or df_to_matrix.

Parameters

model (BaseEstimator) – Any instance of a classifier object supporting the methods fit, predict_proba, predict, and predict_log_proba.
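
Example

To illustrate the subclassing requirement described above, a minimal hypothetical subclass (not part of the library) that featurizes each Statement by its evidence count alone:

import numpy as np
from indra.belief.skl import SklearnScorer

class EvidenceCountScorer(SklearnScorer):
    """Toy scorer whose only feature is the number of evidences."""
    def stmts_to_matrix(self, stmts, extra_evidence=None):
        # extra_evidence is ignored in this toy featurization
        counts = [len(stmt.evidence) for stmt in stmts]
        return np.array(counts, dtype=float).reshape(-1, 1)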

check_prior_probs(statements)[source]

Empty implementation for now.

Return type

None

df_to_matrix(df)[source]

Convert a statement DataFrame to a feature matrix.

Return type

ndarray

fit(stmt_data, y_arr, extra_evidence=None, *args, **kwargs)[source]

Preprocess stmt data and run sklearn model fit method.

Additional args and kwargs are passed to the fit method of the wrapped sklearn model.

Parameters
  • stmt_data (Union[ndarray, Sequence[Statement], DataFrame]) – Statement content to be used to generate a feature matrix.

  • y_arr (Sequence[float]) – Class values for the statements (e.g., a vector of 0s and 1s indicating correct or incorrect).

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

predict(stmt_data, extra_evidence=None, *args, **kwargs)[source]

Preprocess stmt data and run sklearn model predict method.

Additional args and kwargs are passed to the predict method of the wrapped sklearn model.

Parameters
  • stmt_data (Union[ndarray, Sequence[Statement], DataFrame]) – Statement content to be used to generate a feature matrix.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

ndarray

predict_log_proba(stmt_data, extra_evidence=None, *args, **kwargs)[source]

Preprocess stmt data and run sklearn model predict_log_proba.

Additional args and kwargs are passed to the predict_log_proba method of the wrapped sklearn model.

Parameters
  • stmt_data (Union[ndarray, Sequence[Statement], DataFrame]) – Statement content to be used to generate a feature matrix.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

ndarray

predict_proba(stmt_data, extra_evidence=None, *args, **kwargs)[source]

Preprocess stmt data and run sklearn model predict_proba method.

Additional args and kwargs are passed to the predict_proba method of the wrapped sklearn model.

Parameters
  • stmt_data (Union[ndarray, Sequence[Statement], DataFrame]) – Statement content to be used to generate a feature matrix.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

ndarray
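
Example

A sketch of obtaining per-statement probabilities from a fitted scorer, assuming a binary classifier whose positive class (column 1) denotes correctness:

probs = scorer.predict_proba(stmts)
belief_scores = probs[:, 1]  # probability of the positive (correct) class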

score_statements(statements, extra_evidence=None)[source]

Computes belief probabilities for a list of INDRA Statements.

The Statements are assumed to be de-duplicated. In other words, each Statement is assumed to have a list of Evidence objects that supports it. The probability of correctness of the Statement is generally calculated based on the number of Evidences it has, their sources, and other features depending on the subclass implementation.

Parameters
  • statements (Sequence[Statement]) – INDRA Statements whose belief scores are to be calculated.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

Sequence[float]

Returns

The computed prior probabilities for each statement.

stmts_to_matrix(stmts, extra_evidence=None)[source]

Convert a list of Statements to a feature matrix.

Return type

ndarray

to_matrix(stmt_data, extra_evidence=None)[source]

Get stmt feature matrix by calling appropriate method.

If stmt_data is already a matrix (e.g., obtained after performing a train/test split on a matrix generated for a full statement corpus), it is returned directly; if a DataFrame of Statement metadata, self.df_to_matrix is called; if a list of Statements, self.stmts_to_matrix is called.

Parameters
  • stmt_data (Union[ndarray, Sequence[Statement], DataFrame]) – Statement content to be used to generate a feature matrix.

  • extra_evidence (Optional[List[List[Evidence]]]) – A list corresponding to the given list of statements, where each entry is a list of Evidence objects providing additional support for the corresponding statement (i.e., Evidences that aren’t already included in the Statement’s own evidence list).

Return type

ndarray

Returns

Feature matrix for the statement data.
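
Example

For instance, a sketch of precomputing the matrix once and re-using it for a train/test split; stmts and y_arr follow the CountsScorer example above:

from sklearn.model_selection import train_test_split

x = scorer.to_matrix(stmts)  # list of Statements: dispatches to stmts_to_matrix
x_train, x_test, y_train, y_test = train_test_split(x, y_arr, test_size=0.2)
scorer.fit(x_train, y_train)             # ndarray input is used directly
test_probs = scorer.predict_proba(x_test)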