MedScan (indra.sources.medscan)

MedScan is Elsevier’s proprietary text-mining system for reading the biological literature. This INDRA module enables processing output files (in CSXML format) from the MedScan system into INDRA Statements.

MedScan API (indra.sources.medscan.api)

indra.sources.medscan.api.process_directory(directory_name, lazy=False)

Processes a directory filled with CSXML files, first normalizing the character encodings to utf-8, and then processing into a list of INDRA statements.

Parameters
  • directory_name (str) – The name of a directory filled with csxml files to process

  • lazy (bool) – If True, the statements are not generated immediately; instead, a generator is created, and statements can be retrieved by using iter_statements. If False, the statements attribute is populated immediately. Default is False.

Returns

mp – A MedscanProcessor populated with INDRA statements extracted from the csxml files

Return type

indra.sources.medscan.processor.MedscanProcessor
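
A minimal usage sketch, assuming INDRA is installed and that csxml_dir points to a directory of CSXML files. The helper name load_statements is hypothetical, and calling iter_statements() without arguments is an assumption based on the description of lazy above.

```python
def load_statements(csxml_dir, lazy=False):
    """Return a list of statements extracted from a directory of CSXML files.

    Hypothetical helper. The import is deferred so that this sketch can be
    loaded even where INDRA is not installed.
    """
    from indra.sources.medscan.api import process_directory

    mp = process_directory(csxml_dir, lazy=lazy)
    if lazy:
        # Assumption: iter_statements() can be called without arguments.
        return list(mp.iter_statements())
    # With lazy=False the statements attribute is already populated.
    return mp.statements
```

With lazy=True, the work of extracting statements is deferred until the generator is consumed, which may help when a directory is large.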

indra.sources.medscan.api.process_directory_statements_sorted_by_pmid(directory_name)

Processes a directory filled with CSXML files, first normalizing the character encoding to utf-8, and then processing into INDRA statements sorted by pmid.

Parameters

directory_name (str) – The name of a directory filled with csxml files to process

Returns

pmid_dict – A dictionary mapping pmids to a list of statements corresponding to that pmid

Return type

dict
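
The shape of the returned dictionary can be illustrated with a small sketch. This is not the library's implementation: group_by_pmid and fake_stmt are hypothetical names, the objects are stand-ins rather than real INDRA Statements, and the sketch assumes each statement carries at least one evidence entry with a pmid attribute.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_pmid(statements):
    """Group statement-like objects by the PMID of their first evidence."""
    pmid_dict = defaultdict(list)
    for stmt in statements:
        # Assumption: the first evidence entry identifies the source document.
        pmid = stmt.evidence[0].pmid
        pmid_dict[pmid].append(stmt)
    return dict(pmid_dict)


def fake_stmt(label, pmid):
    """Build a stand-in object for illustration (not a real Statement)."""
    return SimpleNamespace(label=label, evidence=[SimpleNamespace(pmid=pmid)])


stmts = [fake_stmt('a', '111'), fake_stmt('b', '222'), fake_stmt('c', '111')]
grouped = group_by_pmid(stmts)
```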

indra.sources.medscan.api.process_file(filename, interval=None, lazy=False)

Process a CSXML file for its relevant information.

Consider running the fix_csxml_character_encoding.py script in indra/sources/medscan to fix any encoding issues in the input file before processing.

Parameters
  • filename (str) – The csxml file, containing Medscan XML, to process

  • interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.

  • lazy (bool) – If True, the statements are not generated immediately; instead, a generator is created, and statements can be retrieved by using iter_statements. If False, the statements attribute is populated immediately. Default is False.

Returns

mp – A MedscanProcessor object containing extracted statements

Return type

MedscanProcessor
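
The interval semantics described for process_file resemble Python slicing. The sketch below illustrates them on a plain list; select_interval is a hypothetical helper, not part of INDRA, and zero-based counting of documents is an assumption.

```python
from itertools import islice


def select_interval(documents, interval=None):
    """Return the documents in [start, end), treating None as unbounded."""
    start, end = interval if interval is not None else (None, None)
    # islice ignores bounds that fall outside the available documents.
    return list(islice(documents, start, end))


docs = ['doc0', 'doc1', 'doc2', 'doc3']
middle = select_interval(docs, (1, 3))
first_two = select_interval(docs, (None, 2))
past_the_end = select_interval(docs, (2, 100))
```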

indra.sources.medscan.api.process_file_sorted_by_pmid(file_name)

Processes a file and returns a dictionary mapping pmids to a list of statements corresponding to that pmid.

Parameters

file_name (str) – A csxml file to process

Returns

s_dict – Dictionary mapping pmids to a list of statements corresponding to that pmid

Return type

dict

MedScan Processor (indra.sources.medscan.processor)

class indra.sources.medscan.processor.MedscanEntity(name, urn, type, properties, ch_start, ch_end)
property ch_end

Alias for field number 5

property ch_start

Alias for field number 4

property name

Alias for field number 0

property properties

Alias for field number 3

property type

Alias for field number 2

property urn

Alias for field number 1
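
The "Alias for field number N" entries suggest that MedscanEntity behaves like a named tuple, so each field can be read by name or by position. The sketch below uses a stand-in tuple with the same field order; all field values are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class.
MedscanEntity = namedtuple(
    'MedscanEntity',
    ['name', 'urn', 'type', 'properties', 'ch_start', 'ch_end'])

# Made-up values; the URN format here is only a placeholder.
entity = MedscanEntity(name='MAPK1', urn='urn:example:0000', type='Protein',
                       properties={}, ch_start=10, ch_end=15)

# Access by name and by position refer to the same field.
same_name = entity.name == entity[0]
same_end = entity.ch_end == entity[5]
```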

class indra.sources.medscan.processor.MedscanProcessor

Processes Medscan data into INDRA statements.

The special StateEffect event conveys information about the binding site of a protein modification. Sometimes this is paired with additional event information in a separate SVO. When we encounter a StateEffect, we don’t process it into an INDRA statement right away, but instead store the site information and use it if we encounter a ProtModification event within the same sentence.

statements

A list of extracted INDRA statements

Type

list<str>

sentence_statements

A list of statements for the sentence we are currently processing. Deduplicated and added to the main statement list when we finish processing a sentence.

Type

list<str>

num_entities

The total number of subject or object entities the processor attempted to resolve

Type

int

num_entities_not_found

The number of subject or object IDs which could not be resolved by looking in the list of entities or tagged phrases.

Type

int

last_site_info_in_sentence

Stored protein site info from the last StateEffect event within the sentence, allowing us to combine information from StateEffect and ProtModification events within a single sentence in a single INDRA statement. This is reset at the end of each sentence.

Type

SiteInfo

agent_from_entity(relation, entity_id)

Create a (potentially grounded) INDRA Agent object from a given Medscan entity describing the subject or object.

Uses helper functions to convert a Medscan URN to an INDRA db_refs grounding dictionary.

If the entity has properties indicating that it is a protein with a mutation or modification, then constructs the needed ModCondition or MutCondition.

Parameters
  • relation (MedscanRelation) – The current relation being processed

  • entity_id (str) – The ID of the entity to process

Returns

agent – A potentially grounded INDRA agent representing this entity

Return type

indra.statements.Agent

process_csxml_file(filename, interval=None, lazy=False)

Processes a MedScan csxml input file into INDRA statements.

The CSXML format consists of a top-level <batch> root element containing a series of <doc> (document) elements, in turn containing <sec> (section) elements, and in turn containing <sent> (sentence) elements.

Within the <sent> element, a series of additional elements appear in the following order:

  • <toks>, which contains a tokenized form of the sentence in its text attribute

  • <textmods>, which describes any preprocessing/normalization done to the underlying text

  • <match> elements, each of which contains one or more <entity> elements, describing entities in the text with their identifiers. The local ID of each entity is given in the msid attribute of this element; these IDs are then referenced in any subsequent SVO elements.

  • <svo> elements, representing subject-verb-object triples. SVO elements with a type attribute of CONTROL represent normalized regulation relationships; they often represent the normalized extraction of the immediately preceding (but unnormalized) SVO element. However, in some cases there can be a “CONTROL” SVO element without its parent immediately preceding it.
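
The nesting described above can be walked with the standard library XML parser. The snippet below is a sketch on synthetic input, not real MedScan output: summarize_sentences is a hypothetical helper, all attribute values are made up, and the subj, verb and obj attribute names on <svo> are assumptions. Only the msid and type attributes come from the description above.

```python
import xml.etree.ElementTree as ET

# Synthetic CSXML-like snippet for illustration only.
SAMPLE = """
<batch>
  <doc>
    <sec>
      <sent>
        <toks/>
        <textmods/>
        <match>
          <entity msid="1"/>
          <entity msid="2"/>
        </match>
        <svo subj="1" verb="phosphorylate" obj="2"/>
        <svo type="CONTROL" subj="1" verb="ProtModification" obj="2"/>
      </sent>
    </sec>
  </doc>
</batch>
"""


def summarize_sentences(xml_text):
    """Collect entity IDs and SVO counts for each <sent> element."""
    root = ET.fromstring(xml_text.strip())
    summaries = []
    for sent in root.iter('sent'):
        entity_ids = [e.get('msid') for e in sent.iter('entity')]
        svos = sent.findall('svo')
        control = [s for s in svos if s.get('type') == 'CONTROL']
        summaries.append({'entity_ids': entity_ids,
                          'num_svos': len(svos),
                          'num_control': len(control)})
    return summaries


summary = summarize_sentences(SAMPLE)
```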

Parameters
  • filename (string) – The path to a Medscan csxml file.

  • interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.

  • lazy (bool) – If True, only create a generator which can be used by the get_statements method. If False, populate the statements list now.

process_relation(relation, last_relation)

Process a relation into an INDRA statement.

Parameters
  • relation (MedscanRelation) – The relation to process (a CONTROL svo with normalized verb)

  • last_relation (MedscanRelation) – The relation immediately preceding the relation to process within the same sentence, or None if there are no preceding relations within the same sentence. This preceding relation, if available, will refer to the same interaction but with an unnormalized (potentially more specific) verb, and is used when processing protein modification events.
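
The relationship between relation and last_relation can be sketched as a loop over the relations of one sentence that remembers the previous item. This is an illustration, not the processor's code: Relation is a reduced, hypothetical stand-in, pair_with_preceding is a hypothetical helper, and the verb values are made up.

```python
from collections import namedtuple

# Reduced stand-in for a relation; the documented class has more fields.
Relation = namedtuple('Relation', ['verb', 'svo_type'])


def pair_with_preceding(sentence_relations):
    """Yield (relation, last_relation) pairs for a single sentence."""
    last_relation = None  # No preceding relation at the start of a sentence.
    for relation in sentence_relations:
        yield relation, last_relation
        last_relation = relation


relations = [Relation('phosphorylates', None),
             Relation('ProtModification', 'CONTROL')]
pairs = list(pair_with_preceding(relations))
```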

class indra.sources.medscan.processor.MedscanProperty(type, name, urn)
property name

Alias for field number 1

property type

Alias for field number 0

property urn

Alias for field number 2

class indra.sources.medscan.processor.MedscanRelation(pmid, uri, sec, entities, tagged_sentence, subj, verb, obj, svo_type)

A structure representing the information contained in a Medscan SVO xml element as well as associated entities and properties.

pmid

The URI of the current document (such as a PMID)

Type

str

sec

The section of the document the relation occurs in

Type

str

entities

A dictionary mapping entity IDs from the same sentence to MedscanEntity objects.

Type

dict

tagged_sentence

The sentence from which the relation was extracted, with some tagged phrases and annotations.

Type

str

subj

The entity ID of the subject

Type

str

verb

The verb in the relationship between the subject and the object

Type

str

obj

The entity ID of the object

Type

str

svo_type

The type of SVO relationship (for example, CONTROL indicates that the verb is normalized)

Type

str

class indra.sources.medscan.processor.ProteinSiteInfo(site_text, object_text)

Represent a site on a protein, extracted from a StateEffect event.

Parameters
  • site_text (str) – The site as a string (e.g., S22)

  • object_text (str) – The protein being modified, as the string that appeared in the original sentence

get_sites()

Parse the site-text string and return a list of sites.

Returns

sites – A list of position-residue pairs corresponding to the site-text

Return type

list[Site]
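
As a rough illustration of parsing a site string such as S22, the sketch below handles only the simplest form: a one-letter residue followed by a position. It is not the library's parser. The Site tuple, its field names, and parse_site are hypothetical, and real site text may take other forms that this pattern does not cover.

```python
import re
from collections import namedtuple

# Hypothetical Site type; the field names are assumptions.
Site = namedtuple('Site', ['residue', 'position'])

# One uppercase residue letter followed by digits, e.g. "S22".
SITE_PATTERN = re.compile(r'^([A-Z])(\d+)$')


def parse_site(site_text):
    """Return a one-element list of Site, or [] if the text is unrecognized."""
    match = SITE_PATTERN.match(site_text.strip())
    if match is None:
        return []
    return [Site(residue=match.group(1), position=match.group(2))]


sites = parse_site('S22')
```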

indra.sources.medscan.processor.normalize_medscan_name(name)

Removes the “complex” and “complex complex” suffixes from a Medscan agent name so that it better corresponds with the grounding map.

Parameters

name (str) – The Medscan agent name

Returns

norm_name – The Medscan agent name with the “complex” and “complex complex” suffixes removed.

Return type

str
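
The suffix removal described above can be sketched as follows. This is an assumption about the behavior based on the description, not the library's code; strip_complex_suffix is a hypothetical name and the agent names are made up.

```python
def strip_complex_suffix(name):
    """Remove up to two trailing ' complex' suffixes from a name."""
    suffix = ' complex'
    # Two passes cover both "complex" and "complex complex".
    for _ in range(2):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name


single = strip_complex_suffix('NFkB complex')
double = strip_complex_suffix('NFkB complex complex')
unchanged = strip_complex_suffix('NFkB')
```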