MedScan (indra.sources.medscan)

MedScan is Elsevier’s proprietary text-mining system for reading the biological literature. This INDRA module enables processing output files (in CSXML format) from the MedScan system into INDRA Statements.

MedScan API (indra.sources.medscan.api)

indra.sources.medscan.api.process_directory(directory_name, lazy=False)[source]

Processes a directory filled with CSXML files, first normalizing the character encodings to utf-8, and then processing into a list of INDRA statements.

Parameters:
  • directory_name (str) – The name of a directory filled with csxml files to process
  • lazy (bool) – If True, the statements will not be generated immediately, but rather a generator will be formulated, and statements can be retrieved by using iter_statements. If False, the statements attribute will be populated immediately. Default is False.
Returns:

mp – A MedscanProcessor populated with INDRA statements extracted from the csxml files

Return type:

indra.sources.medscan.processor.MedscanProcessor

indra.sources.medscan.api.process_directory_statements_sorted_by_pmid(directory_name)[source]

Processes a directory filled with CSXML files, first normalizing the character encoding to utf-8, and then processing into INDRA statements sorted by pmid.

Parameters:directory_name (str) – The name of a directory filled with csxml files to process
Returns:pmid_dict – A dictionary mapping pmids to a list of statements corresponding to that pmid
Return type:dict
indra.sources.medscan.api.process_file(filename, interval=None, lazy=False)[source]

Process a CSXML file for its relevant information.

Consider running the fix_csxml_character_encoding.py script in indra/sources/medscan to fix any encoding issues in the input file before processing.

Parameters:
  • filename (str) – The csxml file, containing Medscan XML, to process
  • interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
  • lazy (bool) – If True, the statements will not be generated immediately, but rather a generator will be formulated, and statements can be retrieved by using iter_statements. If False, the statements attribute will be populated immediately. Default is False.
Returns:mp – A MedscanProcessor object containing extracted statements
Return type:MedscanProcessor
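
The interval semantics described for process_file resemble Python's half-open ranges. The following is a minimal sketch of that behavior, assuming zero-based document positions; select_documents is a hypothetical helper written for illustration and is not part of INDRA.

```python
from itertools import islice


def select_documents(docs, interval=None):
    """Return an iterator over documents in the half-open range [start, end).

    Hypothetical helper for illustration only; not INDRA's implementation.
    """
    if interval is None:
        start, end = None, None
    else:
        start, end = interval
    # islice treats None as "no bound" and simply stops early when a bound
    # exceeds the number of available items, so no explicit checks are needed.
    return islice(docs, start, end)


docs = ["doc0", "doc1", "doc2", "doc3"]
print(list(select_documents(docs, (1, 3))))      # documents 1 and 2
print(list(select_documents(docs, (2, None))))   # open-ended on the right
print(list(select_documents(docs, (1, 100))))    # out-of-range end is ignored
```

Returning an iterator rather than a list keeps the sketch compatible with lazily produced documents.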
indra.sources.medscan.api.process_file_sorted_by_pmid(file_name)[source]

Processes a file and returns a dictionary mapping pmids to a list of statements corresponding to that pmid.

Parameters:file_name (str) – A csxml file to process
Returns:s_dict – Dictionary mapping pmids to a list of statements corresponding to that pmid
Return type:dict
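
The shape of the returned dictionary can be illustrated with a small grouping sketch. It assumes each statement exposes its PMID through some accessor; the group_by_pmid helper, the get_pmid callback, and the tuple stand-ins for statements are all hypothetical and are not taken from INDRA.

```python
from collections import defaultdict


def group_by_pmid(statements, get_pmid):
    """Group statements into a dict keyed by PMID.

    Hypothetical helper: get_pmid is an assumed accessor that returns the
    PMID for a statement.
    """
    grouped = defaultdict(list)
    for stmt in statements:
        grouped[get_pmid(stmt)].append(stmt)
    # Convert to a plain dict so missing keys raise KeyError as usual.
    return dict(grouped)


# Placeholder (pmid, description) tuples standing in for INDRA statements.
stmts = [
    ("123", "A activates B"),
    ("456", "C inhibits D"),
    ("123", "B binds E"),
]
s_dict = group_by_pmid(stmts, get_pmid=lambda s: s[0])
print(sorted(s_dict))
print(len(s_dict["123"]))
```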

MedScan Processor (indra.sources.medscan.processor)

class indra.sources.medscan.processor.MedscanEntity(name, urn, type, properties, ch_start, ch_end)
ch_end

Alias for field number 5

ch_start

Alias for field number 4

name

Alias for field number 0

properties

Alias for field number 3

type

Alias for field number 2

urn

Alias for field number 1
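
The "Alias for field number N" entries are the docstrings that collections.namedtuple generates for its fields. The sketch below mirrors the documented field order to show how positional and named access line up; the field values are invented placeholders, and the definition is a stand-in rather than INDRA's own source.

```python
from collections import namedtuple

# Stand-in definition with the same field order as the documented signature.
MedscanEntity = namedtuple(
    "MedscanEntity",
    ["name", "urn", "type", "properties", "ch_start", "ch_end"],
)

# All values here are hypothetical placeholders.
entity = MedscanEntity(
    name="example protein",
    urn="urn:example:0000",
    type="Protein",
    properties={},
    ch_start=0,
    ch_end=15,
)

# Field number 0 is name, field number 5 is ch_end, and so on.
print(entity[0] == entity.name)
print(entity[5] == entity.ch_end)
```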

class indra.sources.medscan.processor.MedscanProcessor[source]

Processes Medscan data into INDRA statements.

The special StateEffect event conveys information about the binding site of a protein modification. Sometimes this is paired with additional event information in a separate SVO. When we encounter a StateEffect, we don’t process it into an INDRA statement right away, but instead store the site information and use it if we encounter a ProtModification event within the same sentence.

statements

A list of extracted INDRA statements

Type:list<str>
sentence_statements

A list of statements for the sentence we are currently processing. Deduplicated and added to the main statement list when we finish processing a sentence.

Type:list<str>
num_entities

The total number of subject or object entities the processor attempted to resolve

Type:int
num_entities_not_found

The number of subject or object IDs which could not be resolved by looking in the list of entities or tagged phrases.

Type:int
last_site_info_in_sentence

Stored protein site info from the last StateEffect event within the sentence, allowing us to combine information from StateEffect and ProtModification events within a single sentence in a single INDRA statement. This is reset at the end of each sentence.

Type:SiteInfo
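
The deferred handling of StateEffect events can be pictured as a small state machine that remembers the most recent site within a sentence. This is a simplified sketch under stated assumptions: events are modeled as (event_type, payload) tuples and combine_events is a hypothetical function, neither of which reflects the processor's real data structures.

```python
def combine_events(sentences):
    """Pair each ProtModification with the last StateEffect site seen
    earlier in the same sentence (or None if there was none).

    sentences: list of sentences, each a list of (event_type, payload).
    Hypothetical illustration; not the MedscanProcessor implementation.
    """
    results = []
    for events in sentences:
        last_site = None  # reset at the start of every sentence
        for event_type, payload in events:
            if event_type == "StateEffect":
                # Do not emit anything yet; just remember the site.
                last_site = payload
            elif event_type == "ProtModification":
                results.append((payload, last_site))
    return results


sentences = [
    [("StateEffect", "S22"), ("ProtModification", "modification of X")],
    [("ProtModification", "modification of Y")],
]
combined = combine_events(sentences)
print(combined)
```

Because the stored site is cleared for each sentence, the second modification is not paired with the site from the first sentence.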
agent_from_entity(relation, entity_id)[source]

Create a (potentially grounded) INDRA Agent object from a given Medscan entity describing the subject or object.

Uses helper functions to convert a Medscan URN to an INDRA db_refs grounding dictionary.

If the entity has properties indicating that it is a protein with a mutation or modification, then constructs the needed ModCondition or MutCondition.

Parameters:
  • relation (MedscanRelation) – The current relation being processed
  • entity_id (str) – The ID of the entity to process
Returns:

agent – A potentially grounded INDRA agent representing this entity

Return type:

indra.statements.Agent

process_csxml_file(filename, interval=None, lazy=False)[source]

Processes a MedScan CSXML file into INDRA statements.

The CSXML format consists of a top-level <batch> root element containing a series of <doc> (document) elements, in turn containing <sec> (section) elements, and in turn containing <sent> (sentence) elements.

Within the <sent> element, a series of additional elements appear in the following order:

  • <toks>, which contains a tokenized form of the sentence in its text attribute
  • <textmods>, which describes any preprocessing/normalization done to the underlying text
  • <match> elements, each of which contains one or more <entity> elements, describing entities in the text with their identifiers. The local ID of each entity is given in the msid attribute of this element; these IDs are then referenced in any subsequent SVO elements.
  • <svo> elements, representing subject-verb-object triples. SVO elements with a type attribute of CONTROL represent normalized regulation relationships; they often represent the normalized extraction of the immediately preceding (but unnormalized) SVO element. However, in some cases there can be a “CONTROL” SVO element without its parent immediately preceding it.
Parameters:
  • filename (string) – The path to a Medscan csxml file.
  • interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
  • lazy (bool) – If True, only create a generator which can be used by the get_statements method. If False, populate the statements list now.
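
The element nesting described for the CSXML format can be walked with the standard library's ElementTree. The sketch below uses a tiny invented sample: only the element names and the text, msid, and type attributes come from the description of the format, while the attribute values, the second svo type, and the summarize_csxml helper are assumptions made for illustration. Real CSXML files carry considerably more detail.

```python
import xml.etree.ElementTree as ET

# Invented sample following the documented nesting:
# batch > doc > sec > sent > (toks, textmods, match..., svo...)
SAMPLE = """<batch>
  <doc>
    <sec>
      <sent>
        <toks text="A activates B ."/>
        <textmods/>
        <match><entity msid="1"/></match>
        <match><entity msid="2"/></match>
        <svo type="PLACEHOLDER_UNNORMALIZED"/>
        <svo type="CONTROL"/>
      </sent>
    </sec>
  </doc>
</batch>"""


def summarize_csxml(xml_text):
    """Collect sentence text, entity IDs, and CONTROL SVO counts.

    Hypothetical helper; not how MedscanProcessor parses its input.
    """
    root = ET.fromstring(xml_text)
    summary = []
    for doc in root.iter("doc"):
        for sec in doc.iter("sec"):
            for sent in sec.iter("sent"):
                svos = sent.findall("svo")
                summary.append({
                    "text": sent.find("toks").get("text"),
                    "msids": [e.get("msid") for e in sent.iter("entity")],
                    "n_control": sum(
                        1 for s in svos if s.get("type") == "CONTROL"),
                })
    return summary


result = summarize_csxml(SAMPLE)
print(result)
```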
process_relation(relation, last_relation)[source]

Process a relation into an INDRA statement.

Parameters:
  • relation (MedscanRelation) – The relation to process (a CONTROL svo with normalized verb)
  • last_relation (MedscanRelation) – The relation immediately preceding the relation to process within the same sentence, or None if there are no preceding relations within the same sentence. This preceding relation, if available, will refer to the same interaction but with an unnormalized (potentially more specific) verb, and is used when processing protein modification events.
class indra.sources.medscan.processor.MedscanProperty(type, name, urn)
name

Alias for field number 1

type

Alias for field number 0

urn

Alias for field number 2

class indra.sources.medscan.processor.MedscanRelation(pmid, uri, sec, entities, tagged_sentence, subj, verb, obj, svo_type)[source]

A structure representing the information contained in a Medscan SVO xml element as well as associated entities and properties.

pmid

The URI of the current document (such as a PMID)

Type:str
sec

The section of the document the relation occurs in

Type:str
entities

A dictionary mapping entity IDs from the same sentence to MedscanEntity objects.

Type:dict
tagged_sentence

The sentence from which the relation was extracted, with some tagged phrases and annotations.

Type:str
subj

The entity ID of the subject

Type:str
verb

The verb in the relationship between the subject and the object

Type:str
obj

The entity ID of the object

Type:str
svo_type

The type of SVO relationship (for example, CONTROL indicates that the verb is normalized)

Type:str
class indra.sources.medscan.processor.ProteinSiteInfo(site_text, object_text)[source]

Represent a site on a protein, extracted from a StateEffect event.

Parameters:
  • site_text (str) – The site as a string (ex. S22)
  • object_text (str) – The protein being modified, as the string that appeared in the original sentence
get_sites()[source]

Parse the site-text string and return a list of sites.

Returns:sites – A list of position-residue pairs corresponding to the site-text
Return type:list[Site]
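
To make the idea of parsing a site string concrete, here is a rough sketch for the simplest case given in the parameter description (a one-letter residue followed by a position, as in S22). The Site tuple, the parse_sites function, and the regular expression are assumptions for illustration; the formats that get_sites actually accepts are not specified here and may be broader.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the Site type mentioned in the return type.
Site = namedtuple("Site", ["residue", "position"])

# Assumed pattern: one uppercase letter followed by digits, e.g. "S22".
SITE_PATTERN = re.compile(r"\b([A-Z])(\d+)\b")


def parse_sites(site_text):
    """Return a list of Site tuples found in site_text.

    Positions are kept as strings exactly as they appear in the text.
    """
    return [Site(res, pos) for res, pos in SITE_PATTERN.findall(site_text)]


print(parse_sites("S22"))
print(parse_sites("S22 and T45"))
```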
indra.sources.medscan.processor.normalize_medscan_name(name)[source]

Removes the “complex” and “complex complex” suffixes from a medscan agent name so that it better corresponds with the grounding map.

Parameters:name (str) – The Medscan agent name
Returns:norm_name – The Medscan agent name with the “complex” and “complex complex” suffixes removed.
Return type:str
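
The suffix removal described above amounts to a short string operation. The sketch below assumes the suffix is separated from the name by a single space; strip_complex_suffixes is a hypothetical name, and the actual normalize_medscan_name may handle cases this sketch does not.

```python
def strip_complex_suffixes(name):
    """Remove a trailing " complex complex" or " complex" from a name.

    Hypothetical re-implementation for illustration only.
    """
    # Check the longer suffix first so both words are removed together.
    for suffix in (" complex complex", " complex"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


print(strip_complex_suffixes("AP-1 complex"))
print(strip_complex_suffixes("AP-1 complex complex"))
print(strip_complex_suffixes("AKT1"))
```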