Literature clients (indra.literature)

indra.literature.get_full_text(paper_id, idtype, preferred_content_type='text/xml')[source]

Return the content and the content type of an article.

This function retrieves the content of an article by its PubMed ID, PubMed Central ID, or DOI. It prioritizes full text content when available and returns an abstract from PubMed as a fallback.

  • paper_id (string) – ID of the article.
  • idtype ('pmid', 'pmcid', or 'doi') – Type of the ID.
  • preferred_content_type (Optional[str]) – Preference for full-text format, if available. Can be one of ‘text/xml’, ‘text/plain’, ‘application/pdf’. Default: ‘text/xml’

  • content (str) – The content of the article.
  • content_type (str) – The content type of the article.
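
The priority order described above (full text first, abstract as a fallback) can be sketched as a simple fallback chain. This is an illustration only, not INDRA's implementation: `fetch_with_fallback` and the two fetcher callables are hypothetical stand-ins, and both the `'abstract'` label and the reuse of the preferred content type as the returned content type are assumptions.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetcher type: takes a paper ID, returns text or None if unavailable.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    paper_id: str,
    full_text_fetcher: Fetcher,
    abstract_fetcher: Fetcher,
    preferred_content_type: str = "text/xml",
) -> Tuple[Optional[str], Optional[str]]:
    """Return (content, content_type), preferring full text over the abstract."""
    content = full_text_fetcher(paper_id)
    if content is not None:
        # Assumption: report the preferred type when full text was found.
        return content, preferred_content_type
    # Fall back to the abstract; 'abstract' is an assumed label.
    content = abstract_fetcher(paper_id)
    if content is not None:
        return content, "abstract"
    return None, None


# Stand-in fetchers so the sketch runs without network access.
def no_full_text(paper_id):
    return None


def fake_abstract(paper_id):
    return "Abstract for %s" % paper_id


result = fetch_with_fallback("12345", no_full_text, fake_abstract)
```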

indra.literature.id_lookup(paper_id, idtype)[source]

Take an ID of type PMID, PMCID, or DOI and look up the other IDs.

If the DOI is not found in Pubmed, try to obtain it by doing a reverse lookup in CrossRef using article metadata.

  • paper_id (str) – ID of the article.
  • idtype (str) – Type of the ID: ‘pmid’, ‘pmcid’, or ‘doi’

ids – A dictionary with the following keys: pmid, pmcid and doi.

Return type:

dict

Pubmed client (indra.literature.pubmed_client)

Search and get metadata for articles in Pubmed.


Convert a page number to long form, e.g., from 456-7 to 456-457.
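
The expansion described above can be sketched in a few lines. The function name `expand_page_range` is hypothetical and this is not the library's code; it only illustrates the idea of borrowing the missing leading digits from the start page.

```python
def expand_page_range(pages: str) -> str:
    """Expand an abbreviated page range such as '456-7' to '456-457'."""
    parts = pages.split("-")
    if len(parts) != 2:
        # Not a simple "start-end" range; return it unchanged.
        return pages
    start, end = parts[0].strip(), parts[1].strip()
    if not start or not end:
        return pages
    # If only trailing digits were written, borrow the leading ones from start.
    if len(end) < len(start):
        end = start[: len(start) - len(end)] + end
    return "%s-%s" % (start, end)
```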

indra.literature.pubmed_client.get_abstract(pubmed_id, prepend_title=True)[source]

Get the abstract of an article in the Pubmed database.


Get the XML metadata for a single article from the Pubmed database.


Get the number of citations in Pubmed for a search query.

Parameters: search_term (str) – A term for which the PubMed search should be performed.
Returns: The number of citations for the query, or None if the query fails.
Return type: int or None

Search Pubmed for paper IDs given a search term.

Search options can be passed as keyword arguments. Some of these are custom keywords identified by this function, while others are passed on as parameters for the request to the PubMed web service. For details on parameters that can be used in PubMed searches, see the PubMed web service documentation. Some useful parameters to pass are:

  • db=‘pmc’ to search PMC instead of PubMed
  • reldate=2 to search for papers within the last 2 days
  • mindate=‘2016/03/01’, maxdate=‘2016/03/31’ to search for papers in March 2016

PubMed, by default, limits returned PMIDs to a small number, and this number can be controlled by the “retmax” parameter. This function uses a retmax value of 100,000 by default that can be changed via the corresponding keyword argument.

  • search_term (str) – A term for which the PubMed search should be performed.
  • use_text_word (Optional[bool]) – If True, the “[tw]” string is appended to the search term to constrain the search to “text words”, that is, words that appear as a whole in relevant parts of the PubMed entry, such as the title and abstract (excluding, for instance, the journal name or publication date). Using this option can eliminate spurious search results such as all articles published in June for a search for the “JUN” gene, or journal names that contain Acad for a search for the “ACAD” gene. Default: True
  • kwargs (kwargs) – Additional keyword arguments to pass to the PubMed search as parameters.
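
As a rough illustration of how the options above combine, the sketch below assembles a parameter dictionary for a search request. `build_search_params` is a hypothetical helper, not part of the client; the `term` key and the `db='pubmed'` default are assumptions based on the PubMed web service conventions, while the `[tw]` suffix and the retmax default of 100,000 come from the description above.

```python
def build_search_params(search_term, use_text_word=True, **kwargs):
    """Assemble request parameters for a PubMed search (illustrative only)."""
    if use_text_word:
        # Constrain the search to "text words".
        search_term = search_term + "[tw]"
    params = {
        "term": search_term,  # assumed parameter name for the query
        "retmax": 100000,     # default stated in the text above
        "db": "pubmed",       # assumed default database
    }
    # Caller-supplied keyword arguments override the defaults.
    params.update(kwargs)
    return params


# Papers mentioning JUN as a text word, published in March 2016.
march_2016 = build_search_params("JUN", mindate="2016/03/01", maxdate="2016/03/31")

# Search PMC instead of PubMed, restricted to the last 2 days.
pmc_recent = build_search_params("BRAF", use_text_word=False, db="pmc", reldate=2)
```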

Get the curated set of articles for a gene in the Entrez database.

Search parameters for the Gene database query can be passed in as keyword arguments.

Parameters: hgnc_name (string) – The HGNC name of the gene. This is used to obtain the HGNC ID (using the hgnc_client module) and in turn used to obtain the Entrez ID associated with the gene. Entrez is then queried for that ID.

Get a list of the ISSN numbers for a journal given its NLM ID.

Information on NLM XML DTDs is available at

indra.literature.pubmed_client.get_metadata_for_ids(pmid_list, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False)[source]

Get article metadata for up to 200 PMIDs from the Pubmed database.

  • pmid_list (list of PMIDs as strings) – Can contain 1-200 PMIDs.
  • get_issns_from_nlm (boolean) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
  • get_abstracts (boolean) – Indicates whether to include the Pubmed abstract in the results.
  • prepend_title (boolean) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text.

Dictionary indexed by PMID. Each value is a dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’.

Return type:

dict of dicts
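
Because the function accepts at most 200 PMIDs per call, a caller holding a longer list has to split it first. The helper below is a minimal sketch of that batching step; `batch_pmids` is a hypothetical name and is not part of the client.

```python
def batch_pmids(pmid_list, batch_size=200):
    """Yield consecutive slices of pmid_list with at most batch_size items each."""
    for start in range(0, len(pmid_list), batch_size):
        yield pmid_list[start:start + batch_size]


# 450 made-up PMIDs split into batches of up to 200.
pmids = [str(i) for i in range(1, 451)]
batches = list(batch_pmids(pmids))
```

Each batch could then be passed to get_metadata_for_ids and the returned dictionaries merged, since they are indexed by PMID.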

indra.literature.pubmed_client.get_metadata_from_xml_tree(tree, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False)[source]

Get metadata for an XML tree containing PubmedArticle elements.

Documentation on the XML structure can be found at:

  • tree (xml.etree.ElementTree) – ElementTree containing one or more PubmedArticle elements.
  • get_issns_from_nlm (boolean) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
  • get_abstracts (boolean) – Indicates whether to include the Pubmed abstract in the results.
  • prepend_title (boolean) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text.

Dictionary indexed by PMID. Each value is a dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’.

Return type:

dict of dicts
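
To make the shape of the result concrete, the sketch below pulls two of the listed fields (‘doi’ and ‘title’) out of a small, hand-written PubmedArticle document using the standard library. It is not the client's implementation: `basic_metadata` and `SAMPLE_XML` are hypothetical, the sample contains only the elements needed here, and the remaining fields are omitted.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with made-up values, not a real PubMed record.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">12345</ArticleId>
        <ArticleId IdType="doi">10.1000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""


def basic_metadata(tree):
    """Return {pmid: {'doi': ..., 'title': ...}} for each PubmedArticle element."""
    results = {}
    for article in tree.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        # The DOI is stored as an ArticleId element with IdType="doi".
        doi = article.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
        results[pmid] = {"doi": doi, "title": title}
    return results


metadata = basic_metadata(ET.fromstring(SAMPLE_XML))
```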


Get the title of an article in the Pubmed database.

Pubmed Central client (indra.literature.pmc_client)

indra.literature.pmc_client.extract_text(xml_string, contains=None)[source]

Get text from the body of the given NLM XML string.

Parameters: xml_string (str) – String containing valid NLM XML.
Returns: Extracted plaintext.
Return type: str

indra.literature.pmc_client.filter_pmids(pmid_list, source_type)[source]

Filter a list of PMIDs for ones with full text from PMC.

  • pmid_list (list of str) – List of PMIDs to filter.
  • source_type (string) – One of ‘fulltext’, ‘oa_xml’, ‘oa_txt’, or ‘auth_xml’.

PMIDs available in the specified source/format type.

Return type:

list of str


Returns XML for the article corresponding to a PMC ID.

indra.literature.pmc_client.id_lookup(paper_id, idtype=None)[source]

Take a Pubmed ID, Pubmed Central ID, or DOI and use the Pubmed ID mapping service to look up all the other IDs. The IDs are returned in a dictionary.

CrossRef client (indra.literature.crossref_client)

indra.literature.crossref_client.doi_query(pmid, search_limit=10)[source]

Get the DOI for a PMID by matching CrossRef and Pubmed metadata.

Searches CrossRef using the article title and accepts a search hit only if its journal ISSN and page number match those obtained from the Pubmed database.
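
The acceptance test for a search hit can be sketched as follows. This is an assumption-laden illustration rather than the client's code: `matches_pubmed` is a hypothetical helper, the CrossRef hit is assumed to be a dict with an `ISSN` list and a `page` string, and the PubMed side uses the ‘issn_list’ and ‘page’ fields returned by the metadata functions documented earlier on this page.

```python
def matches_pubmed(crossref_item, pubmed_meta):
    """Return True if a CrossRef hit shares an ISSN and the page with PubMed metadata."""
    crossref_issns = set(crossref_item.get("ISSN", []))
    pubmed_issns = set(pubmed_meta.get("issn_list", []))
    # At least one ISSN must be shared between the two records.
    if not crossref_issns & pubmed_issns:
        return False
    crossref_page = crossref_item.get("page")
    pubmed_page = pubmed_meta.get("page")
    return crossref_page is not None and crossref_page == pubmed_page


# Made-up records for illustration.
pubmed_meta = {"issn_list": ["1234-5678", "8765-4321"], "page": "456-457"}
good_hit = {"ISSN": ["1234-5678"], "page": "456-457"}
wrong_page = {"ISSN": ["1234-5678"], "page": "12-20"}
wrong_journal = {"ISSN": ["0000-0000"], "page": "456-457"}
```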

Return a list of links to the full text of an article given its DOI. Each list entry is a dictionary with keys:

  • URL: the URL to the full text
  • content-type: e.g. text/xml or text/plain
  • content-version
  • intended-application: e.g. text-mining
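
Given a list of link dictionaries with the keys described above, choosing a link by content type is a short loop. `pick_link` is a hypothetical helper and the sample links are made up; only the key names come from the description.

```python
def pick_link(links, content_type="text/xml"):
    """Return the URL of the first link with the given content type, or None."""
    for link in links:
        if link.get("content-type") == content_type:
            return link["URL"]
    return None


# Made-up link entries using the documented keys.
links = [
    {"URL": "https://example.org/article.pdf", "content-type": "application/pdf",
     "content-version": "vor", "intended-application": "similarity-checking"},
    {"URL": "https://example.org/article.xml", "content-type": "text/xml",
     "content-version": "vor", "intended-application": "text-mining"},
]
xml_url = pick_link(links)
```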


Returns the CrossRef metadata of an article, given its DOI, as a JSON dict.

Elsevier client (indra.literature.elsevier_client)

For information on the Elsevier API, see:

Check whether IP and credentials enable access to content for a doi.

This function uses the entitlement endpoint of the Elsevier API to check whether an article is available to a given institution. Note that this feature of the API is itself not available for all institution keys.

indra.literature.elsevier_client.download_article(id_val, id_type='doi', on_retry=False)[source]

Low level function to get an XML article for a particular id.

  • id_val (str) – The value of the id.
  • id_type (str) – The type of id, such as pmid (a.k.a. pubmed_id), doi, or eid.
  • on_retry (bool) – Set internally by the function’s recursive retry feature; this is the only situation in which this parameter should be used.

content – If found, the content string is returned, otherwise, None is returned.

Return type:

str or None


Download an article in XML format from Elsevier matching the set of ids.

Parameters: <id_type> (str) – You can enter any combination of eid, doi, pmid, and/or pii. Ids will be checked in that order, until either content has been found or all ids have been checked.
Returns: content – If found, the content is returned as a string, otherwise None is returned.
Return type: str or None

Save raw text files based on a search for papers on ScienceDirect.

This performs a search to get PIIs, downloads the XML corresponding to the PII, extracts the raw text and then saves the text into a file in the designated folder.

  • query_str (str) – The query string to search with
  • folder (str) – The local path to an existing folder in which the text files will be dumped
  • do_extract_text (bool) – Choose whether to extract text from the xml, or simply save the raw xml files. Default is True, so text is extracted.
  • max_results (int or None) – Default is None. If specified, limit the number of results to the given maximum.

indra.literature.elsevier_client.extract_text(xml_string, contains=None)[source]

Get text from the body of the given Elsevier xml.


Get the abstract text of an article from Elsevier given a doi.

indra.literature.elsevier_client.get_article(doi, output_format='txt')[source]

Get the full body of an article from Elsevier.

  • doi (str) – The doi for the desired article.
  • output_format ('txt' or 'xml') – The desired format for the output. Selecting ‘txt’ (default) strips all xml tags and joins the pieces of text in the main text, while ‘xml’ simply takes the tag containing the body of the article and returns it as is. In the latter case, downstream code needs to be able to interpret Elsevier’s XML format.

content – Either text content or xml, as described above, for the given doi.

Return type:

str


Search ScienceDirect through the API for articles.

See the ScienceDirect search documentation for how to construct a query string to pass here. Example: ‘abstract(BRAF) AND all(“colorectal cancer”)’


Search ScienceDirect through the API for articles and return PIIs.

Note that ScienceDirect has a limitation in which a maximum of 6,000 PIIs can be retrieved for a given search and therefore this call is internally broken up into multiple queries by a range of years and the results are combined.

Parameters: query_str (str) – The query string to search with
Returns: piis – The list of PIIs identifying the papers returned by the search
Return type: list[str]

Search ScienceDirect with a query string constrained to a given year.

  • query_str (str) – The query string to search with
  • date (str) – The year to constrain the search to

piis – The list of PIIs identifying the papers returned by the search

Return type:

list[str]

NewsAPI client (indra.literature.newsapi_client)

This module provides a client for the NewsAPI web service. The web service requires an API key, which is available after registering with NewsAPI. This key can be set as NEWSAPI_API_KEY in the INDRA config file or as an environment variable with the same name.

NewsAPI also requires attribution, e.g., “powered by”, for derived uses.

indra.literature.newsapi_client.send_request(endpoint, **kwargs)[source]

Return the response to a query as JSON from the NewsAPI web service.

The basic API is limited to 100 results, and this limit is used unless a different value is explicitly given as an argument. Beyond that, paging is supported through the “page” argument, if needed.

  • endpoint (str) – Endpoint to query, e.g. “everything” or “top-headlines”
  • kwargs (dict) – Keyword arguments passed as parameters with the query. The basic ones are “q”, which is the search query; “from”, a start date formatted as, for instance, 2018-06-10; and “to”, an end date with the same format.
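
A query built from the basic parameters above might be assembled as in the sketch below. Both helpers are hypothetical and do not call the web service; the `pageSize` parameter name is an assumption about the NewsAPI request format, while “q”, “from”, “to”, “page” and the NEWSAPI_API_KEY variable name come from the text.

```python
import os


def get_api_key(environ=os.environ):
    """Read the API key from the environment; returns None if it is not set."""
    return environ.get("NEWSAPI_API_KEY")


def build_query_params(q, date_from=None, date_to=None, page=None, page_size=100):
    """Assemble query parameters; optional ones are included only when given."""
    # "pageSize" is an assumed parameter name for the result limit.
    params = {"q": q, "pageSize": page_size}
    if date_from is not None:
        params["from"] = date_from
    if date_to is not None:
        params["to"] = date_to
    if page is not None:
        params["page"] = page
    return params


# Second page of results for a one-week window.
params = build_query_params("BRAF", date_from="2018-06-10", date_to="2018-06-17", page=2)
```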

res_json – The response from the web service as a JSON dict.

Return type:

dict

Deft Tools (indra.literature.deft_tools)

This file provides several functions helpful for acquiring texts for deft disambiguation.

It offers the ability to get text content for articles containing a particular gene. This is useful for acquiring training texts for genes that do not appear in a defining pattern with a problematic shortform.

General XML processing is also provided that allows for extracting text from a source that may be either of Elsevier XML, NLM XML or raw text. This is helpful because it avoids having to know in advance the source of text content from the database.

indra.literature.deft_tools.get_plaintexts(text_content, contains=None)[source]

Returns a corpus of plaintexts given text content from different sources.

Converts XML files into plaintext and leaves abstracts as they are.

Parameters: text_content (list of str) – List of text content. Each item should be either a plaintext, an NLM XML or an Elsevier XML
Returns: plaintexts – List of plaintexts for the input list of XML strings
Return type: list of str

Get articles that have been annotated in Entrez as containing the given gene.

Parameters: hgnc_name (str) – HGNC name for gene
Returns: text_content – XMLs of full text if available, otherwise abstracts, for all articles that have been annotated in Entrez to contain the given gene
Return type: list of str

Get text content for articles given a list of their PMIDs.

Parameters: pmids (list of str) –
Return type: list of str

indra.literature.deft_tools.universal_extract_text(xml, contains=None)[source]

Extract plaintext from xml

First try to parse the XML as if it came from Elsevier. If it is not valid Elsevier XML, this will throw an exception. The Elsevier parser is tried first because the text extraction function in the PMC client may not throw an exception when parsing Elsevier XML, silently processing the XML incorrectly.

Parameters: xml (str) – Either an NLM XML, Elsevier XML or plaintext
Returns: plaintext – For NLM or Elsevier XML as input, this is the extracted plaintext; otherwise the input is returned unchanged
Return type: str
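
The ordering argument above (strict parser first, lenient parser second, input returned unchanged otherwise) can be sketched generically. Nothing here is the module's code: `extract_with_fallback` and both extractors are hypothetical stand-ins, with a made-up `strict-doc` root tag playing the role of the format whose parser fails loudly.

```python
import xml.etree.ElementTree as ET


def strict_extractor(xml_string):
    """Hypothetical strict parser: raises unless the root tag is 'strict-doc'."""
    root = ET.fromstring(xml_string)
    if root.tag != "strict-doc":
        raise ValueError("not the expected format")
    return "".join(root.itertext()).strip()


def lenient_extractor(xml_string):
    """Hypothetical lenient parser: accepts any well-formed XML with <p> elements."""
    root = ET.fromstring(xml_string)
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return "\n".join(paragraphs) if paragraphs else None


def extract_with_fallback(text, extractors):
    """Try each extractor in order; return the input unchanged if none succeeds."""
    for extract in extractors:
        try:
            result = extract(text)
        except Exception:
            # Parsing failed for this format; move on to the next extractor.
            continue
        if result is not None:
            return result
    return text


EXTRACTORS = [strict_extractor, lenient_extractor]
from_strict = extract_with_fallback("<strict-doc>Hello</strict-doc>", EXTRACTORS)
from_lenient = extract_with_fallback(
    "<article><body><p>One</p><p>Two</p></body></article>", EXTRACTORS
)
from_plain = extract_with_fallback("just an abstract", EXTRACTORS)
```

Putting the strict extractor first means that input in its format is never handed to the lenient one, which would otherwise accept it without raising an error.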