Human Protein Reference Database (`indra.sources.hprd`)

This module extracts content from the Human Protein Reference Database (HPRD), a curated protein data resource, as INDRA Statements. Specifically, it supports extracting post-translational modifications, protein complexes, and (binary) protein-protein interactions from HPRD.

More information about HPRD can be obtained at http://www.hprd.org and in these publications:

Peri, S. et al. (2003). Development of Human Protein Reference Database as an initial platform for approaching systems biology in humans. Genome Research. 13, 2363-2371.
Prasad, T. S. K. et al. (2009). Human Protein Reference Database - 2009 Update. Nucleic Acids Research. 37, D767-72.

Data from the final release of HPRD (version 9) can be obtained at the following URLs:

http://www.hprd.org/RELEASE9/HPRD_FLAT_FILES_041310.tar.gz (text files)
http://www.hprd.org/RELEASE9/HPRD_XML_041310.tar.gz (XML)

This module is designed to process the text files obtained from the first link listed above.

HPRD API (`indra.sources.hprd.api`)

indra.sources.hprd.api.process_archive(fname)

Get INDRA Statements from HPRD data in a single tar.gz file.

The latest release, HPRD_FLAT_FILES_041310.tar.gz, can be downloaded from http://hprd.org/download after registration.

Parameters: fname (str) – Path to HPRD tar.gz file.
Returns: An HprdProcessor object which contains a list of extracted INDRA Statements in its statements attribute.
Return type: HprdProcessor
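
A minimal usage sketch, assuming INDRA is installed and the archive has already been downloaded. The helper name `load_hprd_statements` is hypothetical and not part of INDRA; it only wraps the `process_archive` call documented above and checks that the file exists first. The INDRA import is deferred so that the path check itself does not require INDRA.

```python
import os


def load_hprd_statements(archive_path):
    """Return the INDRA Statements extracted from an HPRD flat-file archive.

    Hypothetical convenience wrapper, not part of INDRA.
    """
    if not os.path.isfile(archive_path):
        raise FileNotFoundError(f"HPRD archive not found: {archive_path}")
    # Imported here so the path check above works without INDRA installed.
    from indra.sources.hprd import api as hprd_api
    processor = hprd_api.process_archive(archive_path)
    # The extracted Statements live in the processor's statements attribute.
    return processor.statements
```

A call would then look like `load_hprd_statements('HPRD_FLAT_FILES_041310.tar.gz')`, with the path adjusted to wherever the archive was saved.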

indra.sources.hprd.api.process_flat_files(id_mappings_file, complexes_file=None, ptm_file=None, ppi_file=None, seq_file=None, motif_window=7)

Get INDRA Statements from HPRD data in individual files.

Of the arguments, id_mappings_file is required, and at least one of complexes_file, ptm_file, and ppi_file must also be given. If ptm_file is given, seq_file must also be given.
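
The argument rules above can be sketched as a small check. This is an illustration of the stated constraints only, not INDRA's own validation code; the function name `check_flat_file_args` and the choice of `ValueError` are assumptions.

```python
def check_flat_file_args(id_mappings_file, complexes_file=None, ptm_file=None,
                         ppi_file=None, seq_file=None):
    """Raise ValueError if the combination of file arguments is not allowed.

    Hypothetical helper mirroring the documented rules.
    """
    if id_mappings_file is None:
        raise ValueError("id_mappings_file is required.")
    if complexes_file is None and ptm_file is None and ppi_file is None:
        raise ValueError("Give at least one of complexes_file, ptm_file, "
                         "or ppi_file.")
    if ptm_file is not None and seq_file is None:
        raise ValueError("seq_file is required when ptm_file is given.")
    return True
```

Under these rules, passing only the ID mappings and a PPI file is allowed, while passing a PTM file without a sequence file is not.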

Note that many proteins (> 1,600) in the HPRD content are associated with outdated RefSeq IDs that cannot be mapped to Uniprot IDs. For these, the Uniprot ID obtained from the HGNC ID (itself obtained from the Entrez ID) is used. Because the sequence referenced by the Uniprot ID obtained this way may be different from the (outdated) RefSeq sequence included with the HPRD content, it is possible that this will lead to invalid site positions with respect to the Uniprot IDs.

To allow these site positions to be mapped during assembly, the Modification Statements produced by the HprdProcessor include an additional key in the annotations field of their Evidence object. The key is called ‘site_motif’ and it maps to a dictionary with three elements: ‘motif’, ‘respos’, and ‘off_by_one’. ‘motif’ gives the peptide sequence obtained from the RefSeq sequence included with HPRD. ‘respos’ indicates the position of the target residue within the peptide sequence. Note that these positions are ONE-INDEXED (not zero-indexed). Finally, ‘off_by_one’ contains a boolean value indicating whether the correct position was inferred as being an off-by-one (methionine cleavage) error. If True, the given residue could not be found in the HPRD RefSeq sequence at the given position, but a matching residue was found at position+1, suggesting a sequence numbering based on the methionine-cleaved sequence. The peptide included in the ‘site_motif’ dictionary is based on this updated position.
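
The following sketch illustrates how a ‘site_motif’-style dictionary could be derived from a sequence and a one-indexed site position, including the position+1 check. It is not the HprdProcessor implementation: the function name `site_motif` is hypothetical, and both the truncation of the motif at the sequence ends and the one-indexed ‘respos’ within the peptide are assumptions based on the description above.

```python
def site_motif(sequence, residue, position, window=7):
    """Build a 'site_motif'-style dict for a one-indexed site position.

    Returns None if the residue is found neither at `position` nor at
    `position + 1`. Hypothetical helper, not part of INDRA.
    """
    def residue_at(pos):
        # Convert the one-indexed position to a zero-indexed string offset.
        return sequence[pos - 1] if 1 <= pos <= len(sequence) else None

    off_by_one = False
    if residue_at(position) != residue:
        if residue_at(position + 1) != residue:
            return None
        # Residue found one position later: treat the original numbering
        # as based on the methionine-cleaved sequence.
        position += 1
        off_by_one = True

    # Take up to `window` flanking residues on each side (assumed to be
    # truncated at the sequence ends).
    start = max(position - 1 - window, 0)
    end = min(position + window, len(sequence))
    return {
        'motif': sequence[start:end],
        'respos': position - start,  # one-indexed within the motif
        'off_by_one': off_by_one,
    }
```

With the toy sequence `'MASKTPEQR'` and `window=2`, a serine reported at position 3 yields the motif `'MASKT'` with `respos` 3, and the same serine reported at position 2 yields the same motif with `off_by_one` set to True.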

Parameters:

id_mappings_file (str) – Path to HPRD_ID_MAPPINGS.txt file.
complexes_file (Optional[str]) – Path to PROTEIN_COMPLEXES.txt file.
ptm_file (Optional[str]) – Path to POST_TRANSLATIONAL_MODIFICATIONS.txt file.
ppi_file (Optional[str]) – Path to BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt file.
seq_file (Optional[str]) – Path to PROTEIN_SEQUENCES.txt file.
motif_window (int) – Number of flanking amino acids to include on each side of the PTM target residue in the ‘site_motif’ annotations field of the Evidence for Modification Statements. Default is 7.

Returns:

An HprdProcessor object which contains a list of extracted INDRA Statements in its statements attribute.

Return type:

HprdProcessor

HPRD Processor (`indra.sources.hprd.processor`)

class indra.sources.hprd.processor.HprdProcessor(id_df, cplx_df=None, ptm_df=None, ppi_df=None, seq_dict=None, motif_window=7)

Get INDRA Statements from HPRD data.

See documentation for indra.sources.hprd.api.process_flat_files.

Parameters:

id_df (pandas.DataFrame) – DataFrame loaded from the HPRD_ID_MAPPINGS.txt file.
cplx_df (pandas.DataFrame) – DataFrame loaded from the PROTEIN_COMPLEXES.txt file.
ptm_df (pandas.DataFrame) – DataFrame loaded from the POST_TRANSLATIONAL_MODIFICATIONS.txt file.
ppi_df (pandas.DataFrame) – DataFrame loaded from the BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt file.
seq_dict (dict) – Dictionary mapping RefSeq IDs to protein sequences, loaded from the PROTEIN_SEQUENCES.txt file.
motif_window (int) – Number of flanking amino acids to include on each side of the PTM target residue in the ‘site_motif’ annotations field of the Evidence for Modification Statements. Default is 7.

statements

INDRA Statements (Modifications and Complexes) produced from the HPRD content.

Type: list of INDRA Statements

id_df

DataFrame loaded from HPRD_ID_MAPPINGS.txt file.

Type: pandas.DataFrame

seq_dict

Dictionary mapping RefSeq IDs to protein sequences, loaded from the PROTEIN_SEQUENCES.txt file.

Type: dict

no_hgnc_for_egid

Counter listing Entrez gene IDs referenced in the HPRD content that could not be mapped to a current HGNC ID, along with their frequency.

Type: collections.Counter

no_up_for_hgnc

Counter with tuples of form (entrez_id, hgnc_symbol, hgnc_id) where the HGNC ID could not be mapped to a Uniprot ID, along with their frequency.

Type: collections.Counter

no_up_for_refseq

Counter of RefSeq protein IDs that could not be mapped to any Uniprot ID, along with frequency.

Type: collections.Counter

many_ups_for_refseq

Counter of RefSeq protein IDs that yielded more than one matching Uniprot ID. Note that in these cases, the Uniprot ID obtained from HGNC is used.

Type: collections.Counter

invalid_site_pos

List of tuples of form (refseq_id, residue, position) indicating sites of post-translational modifications where the protein sequences provided by HPRD did not contain the given residue at the given position.

Type: list of tuples

off_by_one

The subset of sites contained in invalid_site_pos where the given residue can be found at position+1 in the HPRD protein sequence, suggesting an off-by-one error due to numbering based on the protein with initial methionine cleaved. Note that no mapping is performed by the processor.

Type: list of tuples

motif_window

Number of flanking amino acids to include on each side of the PTM target residue in the ‘site_motif’ annotations field of the Evidence for Modification Statements. Default is 7.

Type: int
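
The Counter attributes listed above (no_hgnc_for_egid, no_up_for_hgnc, no_up_for_refseq, many_ups_for_refseq) can be inspected with the standard `collections.Counter` interface. The sketch below assumes nothing about INDRA beyond the attribute type; the helper name `summarize_unmapped` is hypothetical, and the example keys are placeholders for illustration, not real RefSeq IDs.

```python
from collections import Counter


def summarize_unmapped(counter, top_n=3):
    """Return (total occurrences, number of distinct IDs, most frequent IDs)."""
    total = sum(counter.values())
    return total, len(counter), counter.most_common(top_n)


# Placeholder keys for illustration only; in practice this would be, for
# example, the no_up_for_refseq attribute of an HprdProcessor.
example = Counter({'NP_EXAMPLE_A': 5, 'NP_EXAMPLE_B': 2, 'NP_EXAMPLE_C': 1})
total, distinct, top = summarize_unmapped(example, top_n=2)
```

A summary of this kind gives a quick view of how much of the HPRD content was affected by each type of mapping failure.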

get_complexes(cplx_df)

Generate Complex Statements from the HPRD protein complexes data.

Parameters: cplx_df (pandas.DataFrame) – DataFrame loaded from the PROTEIN_COMPLEXES.txt file.

get_ppis(ppi_df)

Generate Complex Statements from the HPRD PPI data.

Parameters: ppi_df (pandas.DataFrame) – DataFrame loaded from the BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt file.

get_ptms(ptm_df)

Generate Modification statements from the HPRD PTM data.

Parameters: ptm_df (pandas.DataFrame) – DataFrame loaded from the POST_TRANSLATIONAL_MODIFICATIONS.txt file.
