MedScan (indra.sources.medscan)

MedScan is Elsevier’s proprietary text-mining system for reading the biological literature. This INDRA module enables processing output files (in CSXML format) from the MedScan system into INDRA Statements.

MedScan API (indra.sources.medscan.api)

indra.sources.medscan.api.process_directory(directory_name)[source]

Processes a directory filled with CSXML files, first normalizing the character encodings to utf-8, and then processing into a list of INDRA statements.

Parameters:directory_name (str) – The name of a directory filled with csxml files to process
Returns:mp – A MedscanProcessor populated with INDRA statements extracted from the csxml files
Return type:indra.sources.medscan.processor.MedscanProcessor
indra.sources.medscan.api.process_directory_statements_sorted_by_pmid(directory_name)[source]

Processes a directory filled with CSXML files, first normalizing the character encoding to utf-8, and then processing into INDRA statements sorted by pmid.

Parameters:directory_name (str) – The name of a directory filled with csxml files to process
Returns:pmid_dict – A dictionary mapping pmids to a list of statements corresponding to that pmid
Return type:dict
indra.sources.medscan.api.process_file(filename, num_documents=None)[source]

Process a CSXML file for its relevant information.

Consider running the fix_csxml_character_encoding.py script in indra/sources/medscan to fix any encoding issues in the input file before processing.

Parameters:
  • filename (str) – The csxml file, containing Medscan XML, to process
  • num_documents (int) – The number of documents to process, or None to process all of the documents within the csxml file

Returns:mp – A MedscanProcessor object containing extracted statements
Return type:MedscanProcessor
indra.sources.medscan.api.process_file_sorted_by_pmid(file_name)[source]

Processes a file and returns a dictionary mapping pmids to a list of statements corresponding to that pmid.

Parameters:file_name (str) – A csxml file to process
Returns:s_dict – Dictionary mapping pmids to a list of statements corresponding to that pmid
Return type:dict
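
The shape of the returned dictionary can be illustrated with a minimal sketch. The (pmid, text) tuples below are hypothetical stand-ins for INDRA Statements and their source PMIDs, and group_by_pmid is an illustrative helper, not part of this module.

```python
from collections import defaultdict


def group_by_pmid(statements):
    """Group (pmid, statement) pairs into a dict of pmid -> list of statements."""
    pmid_dict = defaultdict(list)
    for pmid, stmt in statements:
        pmid_dict[pmid].append(stmt)
    return dict(pmid_dict)


# Hypothetical stand-ins for extracted statements and their PMIDs
extracted = [
    ("12345", "Phosphorylation(MAP2K1, MAPK1)"),
    ("67890", "Activation(TP53, CDKN1A)"),
    ("12345", "Inhibition(DUSP6, MAPK1)"),
]
s_dict = group_by_pmid(extracted)
```

Statements sharing a PMID end up in one list, in the order they were encountered.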

MedScan Processor (indra.sources.medscan.processor)

class indra.sources.medscan.processor.MedscanEntity(name, urn, type, properties, ch_start, ch_end)
ch_end

Alias for field number 5

ch_start

Alias for field number 4

name

Alias for field number 0

properties

Alias for field number 3

type

Alias for field number 2

urn

Alias for field number 1
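
The "Alias for field number N" entries suggest a namedtuple-style record, in which each field can be read by name or by position. The sketch below builds a stand-in with the same field order as the signature above; all field values are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented signature
MedscanEntity = namedtuple(
    "MedscanEntity",
    ["name", "urn", "type", "properties", "ch_start", "ch_end"],
)

# Every value here is invented for illustration only
entity = MedscanEntity(
    name="MAPK1",
    urn="urn:example:0000",
    type="Protein",
    properties=[],
    ch_start=10,
    ch_end=15,
)

# Positional and named access refer to the same field
same_name = entity[0] == entity.name
same_end = entity[5] == entity.ch_end
```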

class indra.sources.medscan.processor.MedscanProcessor[source]

Processes Medscan data into INDRA statements.

The special StateEffect event conveys information about the binding site of a protein modification. Sometimes this is paired with additional event information in a separate SVO. When we encounter a StateEffect, we don’t process it into an INDRA statement right away, but instead store the site information and use it if we encounter a ProtModification event within the same sentence.

statements

list<str> – A list of extracted INDRA statements

sentence_statements

list<str> – A list of statements for the sentence we are currently processing. Deduplicated and added to the main statement list when we finish processing a sentence.

num_entities

int – The total number of subject or object entities the processor attempted to resolve

num_entities_not_found

int – The number of subject or object IDs which could not be resolved by looking in the list of entities or tagged phrases.

last_site_info_in_sentence

SiteInfo – Stored protein site info from the last StateEffect event within the sentence, allowing us to combine information from StateEffect and ProtModification events within a single sentence into a single INDRA statement. This is reset at the end of each sentence.
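
The deferred handling of StateEffect events can be sketched as a small loop over the events of one sentence. This is an illustrative sketch only: the dictionary keys ("type", "site", "subj", "obj") and the function name are assumptions, not the processor's actual data model.

```python
def pair_site_info(sentence_events):
    """Attach the site from a StateEffect to a later ProtModification.

    `sentence_events` is a list of dicts for ONE sentence (hypothetical shape).
    """
    # Plays the role of last_site_info_in_sentence; being local to the call,
    # it starts empty for every sentence.
    last_site_info = None
    results = []
    for event in sentence_events:
        if event["type"] == "StateEffect":
            # Emit nothing yet; just remember the site
            last_site_info = event["site"]
        elif event["type"] == "ProtModification":
            # Use the stored site if one was seen earlier in the sentence
            results.append((event["subj"], event["obj"], last_site_info))
    return results


events = [
    {"type": "StateEffect", "site": "S22"},
    {"type": "ProtModification", "subj": "A", "obj": "B"},
]
paired = pair_site_info(events)
```

A ProtModification with no earlier StateEffect in the same sentence simply gets no site information in this sketch.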

agent_from_entity(relation, entity_id)[source]

Create a (potentially grounded) INDRA Agent object from a given Medscan entity describing the subject or object.

Uses helper functions to convert a Medscan URN to an INDRA db_refs grounding dictionary.

If the entity has properties indicating that it is a protein with a mutation or modification, then constructs the needed ModCondition or MutCondition.

Parameters:
  • relation (MedscanRelation) – The current relation being processed
  • entity_id (str) – The ID of the entity to process
Returns:

agent – A potentially grounded INDRA agent representing this entity

Return type:

indra.statements.Agent

process_csxml_from_file_handle(f, num_documents)[source]

Processes a filehandle to MedScan csxml input into INDRA statements.

The CSXML format consists of a top-level <batch> root element containing a series of <doc> (document) elements, in turn containing <sec> (section) elements, and in turn containing <sent> (sentence) elements.

Within the <sent> element, a series of additional elements appear in the following order:

  • <toks>, which contains a tokenized form of the sentence in its text attribute
  • <textmods>, which describes any preprocessing/normalization done to the underlying text
  • <match> elements, each of which contains one or more <entity> elements, describing entities in the text with their identifiers. The local ID of each entity is given in the msid attribute of this element; these IDs are then referenced in any subsequent SVO elements.
  • <svo> elements, representing subject-verb-object triples. SVO elements with a type attribute of CONTROL represent normalized regulation relationships; they often represent the normalized extraction of the immediately preceding (but unnormalized) SVO element. However, in some cases there can be a “CONTROL” SVO element without its parent immediately preceding it.
Parameters:
  • f (file object) – A filehandle to a source of MedScan csxml data
  • num_documents (int) – The number of documents to process, or None to process all documents in the input stream
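
The nesting described above can be walked with the standard library's ElementTree, as in the rough sketch below. Only the element names and the msid and type attributes come from the description; every other attribute name and all values in the sample (uri, name, verb, subj, obj, the "RAW" type) are invented for illustration and may not match real CSXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified CSXML-like input
SAMPLE = """
<batch>
  <doc uri="info:pmid/0000000">
    <sec type="abstract">
      <sent>
        <toks>A phosphorylates B .</toks>
        <match><entity msid="1" name="A"/></match>
        <match><entity msid="2" name="B"/></match>
        <svo type="RAW" verb="phosphorylate" subj="1" obj="2"/>
        <svo type="CONTROL" verb="ProtModification" subj="1" obj="2"/>
      </sent>
    </sec>
  </doc>
</batch>
"""


def summarize(xml_text, num_documents=None):
    """Return (entity msids, svo types) for each sentence in the input."""
    root = ET.fromstring(xml_text.strip())
    summary = []
    for i, doc in enumerate(root.iter("doc")):
        # Stop early when a document limit is given
        if num_documents is not None and i >= num_documents:
            break
        for sent in doc.iter("sent"):
            entity_ids = [e.get("msid") for e in sent.iter("entity")]
            svo_types = [s.get("type") for s in sent.iter("svo")]
            summary.append((entity_ids, svo_types))
    return summary


result = summarize(SAMPLE)
```

For large inputs a streaming parser such as ET.iterparse would avoid loading the whole file; the in-memory parse here is only for brevity.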
process_relation(relation, last_relation)[source]

Process a relation into an INDRA statement.

Parameters:
  • relation (MedscanRelation) – The relation to process (a CONTROL svo with normalized verb)
  • last_relation (MedscanRelation) – The relation immediately preceding the relation to process within the same sentence, or None if there are no preceding relations within the same sentence. This preceding relation, if available, will refer to the same interaction but with an unnormalized (potentially more specific) verb, and is used when processing protein modification events.
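
How a relation and its predecessor might be paired up within one sentence is sketched below. The dictionaries are hypothetical stand-ins for MedscanRelation objects, and pair_with_preceding is an illustrative helper rather than part of the processor.

```python
def pair_with_preceding(svos):
    """Return (relation, last_relation) for each CONTROL SVO in a sentence."""
    pairs = []
    last = None  # None until at least one SVO has been seen in this sentence
    for svo in svos:
        if svo["svo_type"] == "CONTROL":
            pairs.append((svo, last))
        last = svo
    return pairs


# Hypothetical SVOs from a single sentence
raw = {"svo_type": "RAW", "verb": "phosphorylate"}
control = {"svo_type": "CONTROL", "verb": "ProtModification"}
pairs = pair_with_preceding([raw, control])
```

Here the CONTROL relation is paired with the unnormalized SVO just before it; a CONTROL relation that opens a sentence is paired with None.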
class indra.sources.medscan.processor.MedscanProperty(type, name, urn)
name

Alias for field number 1

type

Alias for field number 0

urn

Alias for field number 2

class indra.sources.medscan.processor.MedscanRelation(uri, sec, entities, tagged_sentence, subj, verb, obj, svo_type)[source]

A structure representing the information contained in a Medscan SVO xml element as well as associated entities and properties.

uri

str – The URI of the current document (such as a PMID)

sec

str – The section of the document the relation occurs in

entities

dict – A dictionary mapping entity IDs from the same sentence to MedscanEntity objects.

tagged_sentence

str – The sentence from which the relation was extracted, with some tagged phrases and annotations.

subj

str – The entity ID of the subject

verb

str – The verb in the relationship between the subject and the object

obj

str – The entity ID of the object

svo_type

str – The type of SVO relationship (for example, CONTROL indicates that the verb is normalized)

class indra.sources.medscan.processor.ProteinSiteInfo(site_text, object_text)[source]

Represent a site on a protein, extracted from a StateEffect event.

Parameters:
  • site_text (str) – The site as a string (e.g., S22)
  • object_text (str) – The protein being modified, as the string that appeared in the original sentence
get_sites()[source]

Parse the site-text string and return a list of sites.

Returns:sites – A list of position-residue pairs corresponding to the site-text
Return type:list[Site]
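
A simplified sketch of this kind of parsing follows. It assumes a site is written as a one-letter residue followed by a position, as in the S22 example, and that multiple sites are separated by "/" or ","; both the separators and the Site field layout are assumptions, and the actual method may accept other forms.

```python
import re
from collections import namedtuple

# Hypothetical shape for a parsed site
Site = namedtuple("Site", ["residue", "position"])


def parse_sites(site_text):
    """Parse strings such as 'S22' or 'S22/T25' into a list of Site tuples."""
    sites = []
    for part in re.split(r"[/,]\s*", site_text.strip()):
        match = re.fullmatch(r"([A-Z])(\d+)", part)
        if match:
            sites.append(Site(residue=match.group(1), position=match.group(2)))
    return sites


single = parse_sites("S22")
double = parse_sites("S22/T25")
```

Parts that do not match the residue-position pattern are skipped rather than raising an error.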
indra.sources.medscan.processor.is_statement_in_list(statement, statement_list)[source]

Return True if the given statement is equivalent to one in a list

Determines whether the statement is equivalent to any statement in the given list of statements, with equivalency determined by Statement’s equals method.

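Parameters:
  • statement (Statement) – The statement to look for
  • statement_list (list) – The list of statements to compare the statement against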
Returns:

in_list – True if statement is equivalent to any statements in the list

Return type:

bool
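
The membership check described above amounts to a linear scan using an equals method instead of ==. The sketch below uses a hypothetical FakeStatement class in place of a real INDRA Statement.

```python
class FakeStatement:
    """Hypothetical stand-in exposing an equals() method."""

    def __init__(self, key):
        self.key = key

    def equals(self, other):
        return self.key == other.key


def statement_in_list(statement, statement_list):
    """Return True if any list entry is equivalent to `statement`."""
    return any(statement.equals(s) for s in statement_list)


known = [FakeStatement("a"), FakeStatement("b")]
found = statement_in_list(FakeStatement("b"), known)
missing = statement_in_list(FakeStatement("c"), known)
```

Because each lookup scans the whole list, deduplicating n statements this way costs on the order of n squared comparisons, which is acceptable for the handful of statements in one sentence.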

indra.sources.medscan.processor.normalize_medscan_name(name)[source]

Removes the “complex” and “complex complex” suffixes from a Medscan agent name so that it better corresponds with the grounding map.

Parameters:name (str) – The Medscan agent name
Returns:norm_name – The Medscan agent name with the “complex” and “complex complex” suffixes removed.
Return type:str
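
One way to implement the suffix removal is sketched below; strip_complex_suffix is an illustrative name, and the example agent names are made up.

```python
def strip_complex_suffix(name):
    """Remove a trailing ' complex' or ' complex complex' from a name."""
    suffix = " complex"
    # Two passes cover both "X complex" and "X complex complex"
    for _ in range(2):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


once = strip_complex_suffix("RAF complex")
twice = strip_complex_suffix("RAF complex complex")
unchanged = strip_complex_suffix("RAF")
```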