MedScan (indra.sources.medscan)
MedScan is Elsevier’s proprietary text-mining system for reading the biological literature. This INDRA module enables processing output files (in CSXML format) from the MedScan system into INDRA Statements.
MedScan API (indra.sources.medscan.api)
- indra.sources.medscan.api.process_directory(directory_name, lazy=False)[source]¶
Processes a directory filled with CSXML files, first normalizing the character encodings to utf-8, and then processing into a list of INDRA statements.
- Parameters
directory_name (str) – The name of a directory filled with csxml files to process
lazy (bool) – If True, the statements are not generated immediately; a generator is created instead, and statements can be retrieved by using iter_statements. If False, the statements attribute is populated immediately. Default is False.
- Returns
mp – A MedscanProcessor populated with INDRA statements extracted from the csxml files
- Return type
indra.sources.medscan.processor.MedscanProcessor
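The difference between lazy and eager processing described for the lazy parameter can be illustrated with a small stand-alone sketch. The class below is hypothetical and is not INDRA's MedscanProcessor; the names LazyProcessorSketch and _generate, and the upper-casing used as a stand-in for statement extraction, are assumptions made for illustration. Only the statements attribute and the iter_statements name come from the documentation above.

```python
class LazyProcessorSketch:
    """Hypothetical stand-in illustrating lazy vs. eager population."""

    def __init__(self, items, lazy=False):
        self.statements = []
        self._gen = self._generate(items)
        if not lazy:
            # Eager mode: drain the generator immediately.
            self.statements = list(self._gen)
            self._gen = None

    @staticmethod
    def _generate(items):
        # Placeholder for real extraction work (assumption).
        for item in items:
            yield item.upper()

    def iter_statements(self):
        if self._gen is not None:
            # Lazy mode: produce results on demand and keep them.
            for stmt in self._gen:
                self.statements.append(stmt)
                yield stmt
            self._gen = None
        else:
            yield from self.statements


eager = LazyProcessorSketch(["a", "b"])
lazy_proc = LazyProcessorSketch(["a", "b"], lazy=True)
```

In this sketch, eager.statements is filled as soon as the object is created, whereas lazy_proc.statements stays empty until iter_statements is consumed.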
- indra.sources.medscan.api.process_directory_statements_sorted_by_pmid(directory_name)[source]¶
Processes a directory filled with CSXML files, first normalizing the character encoding to utf-8, and then processing into INDRA statements sorted by pmid.
- indra.sources.medscan.api.process_file(filename, interval=None, lazy=False)[source]¶
Process a CSXML file for its relevant information.
Consider running the fix_csxml_character_encoding.py script in indra/sources/medscan to fix any encoding issues in the input file before processing.
- indra.sources.medscan.api.interval¶
Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
- Type
(start, end) or None
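The interval semantics described above resemble Python slicing. The following is a minimal sketch under the assumption that documents are counted from zero; the helper name select_interval is hypothetical and is not part of the INDRA API.

```python
from itertools import islice


def select_interval(docs, interval=None):
    """Return the documents selected by a (start, end) interval.

    Hypothetical helper: assumes zero-based counting, with `end` exclusive.
    """
    if interval is None:
        return list(docs)
    start, end = interval
    # islice treats None as "unbounded" and tolerates an `end`
    # that is larger than the number of available documents.
    return list(islice(docs, start, end))


docs = ["doc0", "doc1", "doc2", "doc3", "doc4"]
subset = select_interval(docs, (1, 3))
```

Note that in this sketch a start value past the last document yields an empty selection, which may or may not match how MedScan input is handled in practice.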
- indra.sources.medscan.api.lazy¶
If True, the statements are not generated immediately; a generator is created instead, and statements can be retrieved by using iter_statements. If False, the statements attribute is populated immediately. Default is False.
- Type
bool
- Returns
mp – A MedscanProcessor object containing extracted statements
- Return type
indra.sources.medscan.processor.MedscanProcessor
MedScan Processor (indra.sources.medscan.processor)
- class indra.sources.medscan.processor.MedscanEntity(name, urn, type, properties, ch_start, ch_end)¶
- property ch_end¶
Alias for field number 5
- property ch_start¶
Alias for field number 4
- property name¶
Alias for field number 0
- property properties¶
Alias for field number 3
- property type¶
Alias for field number 2
- property urn¶
Alias for field number 1
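The "Alias for field number N" entries indicate a named tuple, so each field can be read either by name or by position. The sketch below re-declares a stand-in tuple with the same field order rather than importing it from INDRA; all field values shown are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class.
MedscanEntity = namedtuple(
    "MedscanEntity",
    ["name", "urn", "type", "properties", "ch_start", "ch_end"],
)

# All values below are made up for illustration.
entity = MedscanEntity(
    name="ERK",
    urn="urn:hypothetical:example",
    type="Protein",
    properties={},
    ch_start=10,
    ch_end=13,
)

# Positional and named access refer to the same data.
same_urn = entity[1] == entity.urn
span_length = entity.ch_end - entity.ch_start
```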
- class indra.sources.medscan.processor.MedscanProcessor[source]¶
Processes Medscan data into INDRA statements.
The special StateEffect event conveys information about the binding site of a protein modification. Sometimes this is paired with additional event information in a separate SVO. When we encounter a StateEffect, we don’t process it into an INDRA statement right away, but instead store the site information and use it if we encounter a ProtModification event within the same sentence.
- statements¶
A list of extracted INDRA statements
- Type
list<indra.statements.Statement>
- sentence_statements¶
A list of statements for the sentence we are currently processing. Deduplicated and added to the main statement list when we finish processing a sentence.
- Type
list<indra.statements.Statement>
- num_entities¶
The total number of subject or object entities the processor attempted to resolve
- Type
int
- num_entities_not_found¶
The number of subject or object IDs which could not be resolved by looking in the list of entities or tagged phrases.
- Type
int
- last_site_info_in_sentence¶
Stored protein site info from the last StateEffect event within the sentence, allowing us to combine information from StateEffect and ProtModification events within a single sentence into a single INDRA statement. This is reset at the end of each sentence.
- Type
ProteinSiteInfo
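The bookkeeping behind last_site_info_in_sentence can be sketched as follows. This is a simplified illustration, not the processor's actual code: events are represented as hypothetical (kind, payload) pairs, and the function name combine_site_info is an assumption.

```python
def combine_site_info(sentences):
    """Pair each ProtModification with the latest StateEffect site, if any.

    `sentences` is a list of sentences; each sentence is a list of
    (kind, payload) tuples. This event representation is hypothetical.
    """
    combined = []
    for events in sentences:
        last_site_info = None  # reset at the start of every sentence
        for kind, payload in events:
            if kind == "StateEffect":
                # Store the site; do not emit anything yet.
                last_site_info = payload
            elif kind == "ProtModification":
                combined.append((payload, last_site_info))
    return combined


sentences = [
    [("StateEffect", "site A"), ("ProtModification", "modification 1")],
    [("ProtModification", "modification 2")],
]
result = combine_site_info(sentences)
```

Because the stored site is cleared per sentence, the modification in the second sentence is not paired with the site seen in the first.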
- agent_from_entity(relation, entity_id)[source]¶
Create a (potentially grounded) INDRA Agent object from a given Medscan entity describing the subject or object.
Uses helper functions to convert a Medscan URN to an INDRA db_refs grounding dictionary.
If the entity has properties indicating that it is a protein with a mutation or modification, then constructs the needed ModCondition or MutCondition.
- Parameters
relation (MedscanRelation) – The current relation being processed
entity_id (str) – The ID of the entity to process
- Returns
agent – A potentially grounded INDRA agent representing this entity
- Return type
indra.statements.Agent
- process_csxml_file(filename, interval=None, lazy=False)[source]¶
Processes a MedScan CSXML file into INDRA statements.
The CSXML format consists of a top-level <batch> root element containing a series of <doc> (document) elements, in turn containing <sec> (section) elements, and in turn containing <sent> (sentence) elements.
Within the <sent> element, a series of additional elements appear in the following order:
<toks>, which contains a tokenized form of the sentence in its text attribute
<textmods>, which describes any preprocessing/normalization done to the underlying text
<match> elements, each of which contains one or more <entity> elements, describing entities in the text with their identifiers. The local ID of each entity is given in the msid attribute of this element; these IDs are then referenced in any subsequent SVO elements.
<svo> elements, representing subject-verb-object triples. SVO elements with a type attribute of CONTROL represent normalized regulation relationships; they often represent the normalized extraction of the immediately preceding (but unnormalized) SVO element. However, in some cases there can be a “CONTROL” SVO element without its parent immediately preceding it.
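The element nesting described above can be walked with the standard library's xml.etree.ElementTree. The sample below is a hand-written, heavily reduced document that follows only the structure listed here; real CSXML files carry many more attributes, and the HYPOTHETICAL_RAW type value and the msid values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written sample; attribute values are hypothetical
# except for the documented CONTROL svo type.
SAMPLE = """
<batch>
  <doc>
    <sec>
      <sent>
        <toks/>
        <textmods/>
        <match><entity msid="1"/></match>
        <match><entity msid="2"/></match>
        <svo type="HYPOTHETICAL_RAW"/>
        <svo type="CONTROL"/>
      </sent>
    </sec>
  </doc>
</batch>
"""


def summarize(xml_text):
    """Collect entity IDs and SVO types for each sentence."""
    root = ET.fromstring(xml_text.strip())
    summary = []
    for sent in root.iter("sent"):
        msids = [entity.get("msid") for entity in sent.iter("entity")]
        svo_types = [svo.get("type") for svo in sent.iter("svo")]
        summary.append({"msids": msids, "svo_types": svo_types})
    return summary


summary = summarize(SAMPLE)
```

For large input files an incremental parser such as ET.iterparse would likely be a better fit than loading the whole tree at once.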
- Parameters
filename (string) – The path to a Medscan csxml file.
interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
lazy (bool) – If True, only create a generator which can be used by the get_statements method. If False, populate the statements list now.
- process_relation(relation, last_relation)[source]¶
Process a relation into an INDRA statement.
- Parameters
relation (MedscanRelation) – The relation to process (a CONTROL svo with normalized verb)
last_relation (MedscanRelation) – The relation immediately preceding the relation to process within the same sentence, or None if there are no preceding relations within the same sentence. This preceding relation, if available, will refer to the same interaction but with an unnormalized (potentially more specific) verb, and is used when processing protein modification events.
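One way to picture how relation and last_relation might be supplied is to walk the SVO elements of a sentence while remembering the previous one. The sketch below works on a plain list of SVO type strings and returns index pairs; the function name and the HYPOTHETICAL_RAW type value are assumptions, and the real processor may pair relations differently.

```python
def pair_with_preceding(svo_types):
    """Return (index, preceding_index) for every CONTROL entry.

    `preceding_index` is None when the CONTROL entry is the first
    SVO in the sentence.
    """
    pairs = []
    last_index = None
    for index, svo_type in enumerate(svo_types):
        if svo_type == "CONTROL":
            pairs.append((index, last_index))
        # Remember this SVO as the one preceding the next.
        last_index = index
    return pairs


pairs = pair_with_preceding(["HYPOTHETICAL_RAW", "CONTROL"])
```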
- class indra.sources.medscan.processor.MedscanProperty(type, name, urn)¶
- property name¶
Alias for field number 1
- property type¶
Alias for field number 0
- property urn¶
Alias for field number 2
- class indra.sources.medscan.processor.MedscanRelation(pmid, uri, sec, entities, tagged_sentence, subj, verb, obj, svo_type)[source]¶
A structure representing the information contained in a Medscan SVO xml element as well as associated entities and properties.
- entities¶
A dictionary mapping entity IDs from the same sentence to MedscanEntity objects.
- Type
dict
- tagged_sentence¶
The sentence from which the relation was extracted, with some tagged phrases and annotations.
- Type
str
- class indra.sources.medscan.processor.ProteinSiteInfo(site_text, object_text)[source]¶
Represent a site on a protein, extracted from a StateEffect event.
- Parameters