Literature clients (`indra.literature`)

indra.literature.get_full_text(paper_id, idtype, preferred_content_type='text/xml')[source]

Return the content and the content type of an article.

This function retrieves the content of an article by its PubMed ID, PubMed Central ID, or DOI. It prioritizes full text content when available and returns an abstract from PubMed as a fallback.

Parameters:

paper_id (string) – ID of the article.
idtype ('pmid', 'pmcid', or 'doi') – Type of the ID.
preferred_content_type (Optional[str]) – Preference for full-text format, if available. Can be one of ‘text/xml’, ‘text/plain’, ‘application/pdf’. Default: ‘text/xml’

Returns:

content (str) – The content of the article.
content_type (str) – The content type of the article

indra.literature.id_lookup(paper_id, idtype)[source]

Take an ID of type PMID, PMCID, or DOI and look up the other IDs.

If the DOI is not found in Pubmed, try to obtain the DOI by doing a reverse-lookup of the DOI in CrossRef using article metadata.

Parameters:

paper_id (str) – ID of the article.
idtype (str) – Type of the ID: ‘pmid’, ‘pmcid’, or ‘doi’.

Returns:

ids – A dictionary with the following keys: pmid, pmcid and doi.

Return type:

dict

Pubmed client (`indra.literature.pubmed_client`)

Search and get metadata for articles in Pubmed.

indra.literature.pubmed_client.download_package_for_pmid(pmid, out_dir, mapping=None)[source]

Return path to the PMC package downloaded for a given PMID.

Parameters:

pmid (str) – The PubMed ID for which the package should be downloaded.
out_dir (str) – The directory where the package should be downloaded.
mapping (Optional[Dict[str, str]]) – A mapping from PMIDs to PMC package URLs. If None, the mapping is fetched from the NCBI FTP server (slow). The mapping can be obtained from https://ftp.ncbi.nlm.nih.gov/pub/pmc/deprecated/oa_file_list.csv and loaded using get_pmid_to_package_url_mapping.

Returns:

The path to the downloaded package file.

Return type:

str

indra.literature.pubmed_client.download_package_for_pmids(pmid_list, out_dir, mapping=None)[source]

Return paths of PMC packages downloaded for a given list of PMIDs.

Parameters:

pmid_list (List[str]) – A list of PubMed IDs for which the packages should be downloaded.
out_dir (str) – The directory where the packages should be downloaded.
mapping (Optional[Dict[str, str]]) – A mapping from PMIDs to PMC package URLs. If None, the mapping is fetched from the NCBI FTP server (slow). The mapping can be obtained from https://ftp.ncbi.nlm.nih.gov/pub/pmc/deprecated/oa_file_list.csv and loaded using get_pmid_to_package_url_mapping.

Returns:

A dictionary mapping PMIDs to the paths of the downloaded package files. If a package could not be downloaded, the PMID key will not be present.

Return type:

dict

indra.literature.pubmed_client.ensure_xml_files(xml_path, retries=3, raise_http_error=True, raise_checksum_error=False, force=False, max_workers=1)[source]

Ensure that the XML files are downloaded and up to date.

This function downloads the full archive published by PubMed at https://ftp.ncbi.nlm.nih.gov/pubmed/baseline and https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles, which contains citation records holding metadata and abstracts in XML format. The baseline archive is updated yearly, while the update files archive is updated daily and includes new, revised, and deleted citations. Once downloaded, the archive can be used to extract, for example, MeSH annotations of articles, publication years, retractions, and author information. The archive consists of a set of gzipped XML files, each containing multiple records for a set of publications. See https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/250101/index.html for more information about this archive.

Use this function to create a complete data set from all available citation records. If only a subset of records is needed, use e.g. get_metadata_for_all_ids in this module to get metadata for a list of PMIDs.

Parameters:

xml_path (str) – Path to the directory holding the PubMed XML files. The files will be globbed from this directory using the pattern ‘pubmed*.xml.gz’.
retries (int) – Number of times to retry downloading an individual XML file if there is an HTTP error. Default: 3.
raise_http_error (bool) – If True, raise an HTTPError if an XML file cannot be downloaded after the specified number of retries. If False, log a warning and skip the file.
raise_checksum_error (bool) – If True, raise a ValueError if the checksum of a downloaded XML file does not match the expected checksum. If False, log a warning and skip the file. Default: False.
force (bool) – If True, force re-download of all XML files, even if they already exist.
max_workers (int) – Number of parallel download threads. Default: 1 (serial). Maximum: 4.

Return type:

None
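
Once the archive is in place, the files can be read with the standard library. The following is a minimal sketch and not part of INDRA: it globs the documented pattern pubmed*.xml.gz and yields PMIDs, assuming the usual PubMed layout in which each PubmedArticle element holds a MedlineCitation/PMID element. The helper name iter_pmids and the demo file are made up for illustration.

```python
import glob
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_pmids(xml_path):
    """Yield (file name, PMID) pairs from gzipped PubMed XML files.

    Hypothetical helper for illustration; not an INDRA function.
    """
    pattern = os.path.join(xml_path, "pubmed*.xml.gz")
    for fname in sorted(glob.glob(pattern)):
        # gzip.open returns a file object that ElementTree can parse directly
        with gzip.open(fname, "rb") as fh:
            root = ET.parse(fh).getroot()
        for article in root.iter("PubmedArticle"):
            yield os.path.basename(fname), article.findtext("MedlineCitation/PMID")


# Demo on a tiny synthetic file (placeholder content, not real records)
demo_dir = tempfile.mkdtemp()
demo_xml = (
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
with gzip.open(os.path.join(demo_dir, "pubmed_demo.xml.gz"), "wb") as fh:
    fh.write(demo_xml)

demo_pmids = list(iter_pmids(demo_dir))
print(demo_pmids)
```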

indra.literature.pubmed_client.expand_pagination(pages)[source]: Convert a page number to long form, e.g., from 456-7 to 456-457.
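
The conversion can be illustrated with a small stand-alone function. This is a sketch of the idea only: the name expand_page_range is hypothetical, and the INDRA implementation may handle more input forms than the simple numeric start-end case assumed here.

```python
def expand_page_range(pages):
    """Expand an abbreviated page range such as '456-7' to '456-457'.

    Illustrative sketch only; not the INDRA implementation.
    """
    parts = pages.split("-")
    if len(parts) != 2:
        return pages  # not a simple 'start-end' range; leave unchanged
    start, end = parts
    if not (start.isdigit() and end.isdigit()) or len(end) >= len(start):
        return pages  # already long form, or not purely numeric
    # Borrow the missing leading digits of the end page from the start page
    return f"{start}-{start[:len(start) - len(end)]}{end}"


print(expand_page_range("456-7"))
```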

indra.literature.pubmed_client.generate_retractions_file(xml_path, download_missing=False, max_workers=1)[source]

Generate a CSV file of retracted papers from the PubMed XML.

Parameters:

xml_path (str) – Path to the directory holding the PubMed XML files. The files will be globbed from this directory using the pattern ‘pubmed*.xml.gz’.
download_missing (bool) – If True, download any missing XML files from the PubMed FTP server. Default: False. Note: A full download of the PubMed XML files takes up to 5 hours.
max_workers (int) – Number of parallel download threads. Default: 1 (serial). Maximum: 4.

indra.literature.pubmed_client.get_abstract(pubmed_id, prepend_title=True)[source]: Get the abstract of an article in the Pubmed database.

indra.literature.pubmed_client.get_all_ids(search_term)[source]

Return all PMIDs for a search term using the edirect CLI.

This function complements the get_id function which uses the PubMed REST API but is limited to 10k results and is very difficult to generalize to systematically fetch all IDs if there are more than 10k results. This function uses the edirect CLI which implements logic for paging over results.

This function only works if edirect is installed and is on your PATH. See https://www.ncbi.nlm.nih.gov/books/NBK179288/ for instructions.

Parameters:: search_term (str) – A term for which the PubMed search should be performed.
Returns:: A list of PMIDs for the given search term.
Return type:: list[str]

indra.literature.pubmed_client.get_article_xml(pubmed_id)[source]

Get the Article subtree of a single article from the Pubmed database.

Parameters:: pubmed_id (str) – A PubMed ID.
Returns:: The XML ElementTree Element that represents the Article portion of the PubMed entry.
Return type:: xml.etree.ElementTree.Element

indra.literature.pubmed_client.get_full_xml(pubmed_id, fname=None)[source]

Get the full XML tree of a single article from the Pubmed database.

Parameters:

pubmed_id (str) – A PubMed ID.
fname (Optional[str]) – If given, the XML is saved to the given file name.

Returns:

The root element of the XML tree representing the PubMed entry. The root is a PubmedArticleSet with a single PubmedArticle element that contains the article metadata.

Return type:

xml.etree.ElementTree.Element

indra.literature.pubmed_client.get_full_xml_by_pmids(pubmed_ids, fname=None)[source]

Get the full XML tree for multiple articles from PubMed using edirect CLI.

Parameters:

pubmed_ids (List[str]) – A list of PubMed IDs.
fname (Optional[str]) – If given, the XML is saved to the given file name.

Returns:

The root element of the XML tree representing the PubMed entries. The root is a PubmedArticleSet containing multiple PubmedArticle elements.

Return type:

Element

Raises:

RuntimeError – If the edirect CLI utilities are not installed or not found on PATH.

Notes

This function requires the edirect command line utilities to be installed and visible on your PATH. See https://www.ncbi.nlm.nih.gov/books/NBK179288/ for instructions.
Note that the output is sorted by PMID numerically (e.g., 10, 11, 20, 22, 1000) and not lexicographically (e.g., 10, 1000, 11, 20, 22), regardless of the order in which the PMIDs are passed in.
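
The difference between the two orderings is easy to reproduce with plain Python. This small sketch uses made-up PMID strings and only demonstrates the sorting behavior described in the note; it does not call edirect.

```python
# Made-up PMIDs, passed in arbitrary order
pmids = ["1000", "22", "10", "20", "11"]

# Numeric ordering, as described for the function's output
numeric_order = sorted(pmids, key=int)

# Lexicographic ordering, which is what a plain string sort would give
lexicographic_order = sorted(pmids)

print(numeric_order)
print(lexicographic_order)
```

If the original input order matters, one option is to index the returned articles by PMID and then iterate over the input list.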

indra.literature.pubmed_client.get_id_count(search_term)[source]

Get the number of citations in Pubmed for a search query.

Parameters:: search_term (str) – A term for which the PubMed search should be performed.
Returns:: The number of citations for the query, or None if the query fails.
Return type:: int or None

indra.literature.pubmed_client.get_ids(search_term, **kwargs)[source]

Search Pubmed for paper IDs given a search term.

Search options can be passed as keyword arguments, some of which are custom keywords identified by this function, while others are passed on as parameters for the request to the PubMed web service. See https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch for details on parameters that can be used in PubMed searches. Some useful parameters to pass are:

db=‘pmc’ to search PMC instead of PubMed
reldate=2 to search for papers within the last 2 days
mindate=‘2016/03/01’, maxdate=‘2016/03/31’ to search for papers in March 2016

PubMed, by default, limits returned PMIDs to a small number, and this number can be controlled by the “retmax” parameter. This function uses a retmax value of 10,000 by default (the maximum supported by PubMed) that can be changed via the corresponding keyword argument. Note also the retstart argument along with retmax to page across batches of IDs.

PubMed’s REST API makes it difficult to retrieve more than 10k PMIDs systematically. See the get_all_ids function in this module for a way to retrieve more than 10k IDs using the PubMed edirect CLI.
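
As a rough sketch of paging with retstart and retmax, the helper below computes the keyword arguments for each batch. The helper name paging_params is hypothetical and the code makes no network requests; it only shows how the two parameters combine.

```python
def paging_params(total_count, retmax=1000):
    """Return retstart/retmax pairs that together cover total_count IDs.

    Hypothetical helper for illustration; not an INDRA function.
    """
    return [
        {"retstart": start, "retmax": retmax}
        for start in range(0, total_count, retmax)
    ]


pages = paging_params(2500)
print(pages)
```

Each dict could then be passed along as keyword arguments, for example get_ids(search_term, **page), with the total taken from get_id_count.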

Parameters:

search_term (str) – A term for which the PubMed search should be performed.
use_text_word (Optional[bool]) – If True, the “[tw]” string is appended to the search term to constrain the search to “text words”, that is, words that appear as a whole in relevant parts of the PubMed entry such as the title and abstract (excluding, for instance, the journal name or publication date). Using this option can eliminate spurious search results such as all articles published in June for a search for the “JUN” gene, or journal names that contain Acad for a search for the “ACAD” gene. See also: https://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_760.html Default: True
kwargs (kwargs) – Additional keyword arguments to pass to the PubMed search as parameters.
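
The effect of use_text_word can be pictured as a simple string transformation. This is a sketch under the assumption that the tag is appended directly after the term; the exact spacing used by INDRA is not specified here, and with_text_word is a hypothetical name.

```python
def with_text_word(search_term, use_text_word=True):
    """Append the PubMed '[tw]' field tag to a search term if requested.

    Illustrative only; the exact string INDRA builds may differ.
    """
    return f"{search_term}[tw]" if use_text_word else search_term


print(with_text_word("JUN"))
print(with_text_word("JUN", use_text_word=False))
```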

indra.literature.pubmed_client.get_ids_for_gene(hgnc_name, **kwargs)[source]

Get the curated set of articles for a gene in the Entrez database.

Search parameters for the Gene database query can be passed in as keyword arguments.

Parameters:: hgnc_name (str) – The HGNC name of the gene. This is used to obtain the HGNC ID (using the hgnc_client module) and in turn used to obtain the Entrez ID associated with the gene. Entrez is then queried for that ID.

indra.literature.pubmed_client.get_ids_for_mesh(mesh_id, major_topic=False, **kwargs)[source]

Return PMIDs that are annotated with a given MeSH ID.

Parameters:

mesh_id (str) – The MeSH ID of a term to search for, e.g., D009101.
major_topic (bool) – If True, only papers for which the given MeSH ID is annotated as a major topic are returned. Otherwise all annotations are considered. Default: False
**kwargs – Any further PubMed search arguments that are passed to get_ids.

indra.literature.pubmed_client.get_ids_for_mesh_terms(mesh_terms, major_topics=None, **kwargs)[source]

Return PMIDs that are annotated with a given list of MeSH terms.

Parameters:

mesh_terms (list of str) – A list of MeSH IDs of terms to search for, e.g., [‘D009101’, ‘D009102’].
major_topics (Optional[list of bool]) – A list of booleans indicating whether the corresponding MeSH term should be considered as a major topic. If None, all terms are considered as major topics.
**kwargs – Any further PubMed search arguments that are passed to get_ids.

indra.literature.pubmed_client.get_issn_info(medline_citation, get_issns_from_nlm='never')[source]

Given a MEDLINE citation, get the ISSN info from the article.

Parameters:

medline_citation (Element) – The MedlineCitation element of the PubMed XML tree.
get_issns_from_nlm (str) – Whether to recover ISSN values from the NLM catalog. Options are ‘never’, ‘missing’, and ‘always’. If ‘missing’, then the ISSN values will be recovered from the NLM catalog if they are not found in the XML. If ‘always’, then the ISSN values will be recovered from the NLM catalog regardless of whether they are found in the XML. Default is ‘never’ (i.e., never recover from NLM catalog regardless of whether they are found in the XML).

Returns:

A dictionary of journal, issue, and ISSN info. The structure is as follows:

{
    "journal_title": str,
    "journal_abbrev": str,
    "journal_nlm_id": str,
    "issn_dict": {
        "issn": str,
        "issn_l": str,
        "type": "print"|"electronic"|"other",
        "alternate_issns": List[Tuple[str, str]]  # Optional
    },
    "issue_dict": {
        "volume": str,
        "issue": str,
        "year": int
    }
}

Return type:

dict
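
The three get_issns_from_nlm options reduce to a small decision rule. The sketch below restates the documented behavior as stand-alone code; the function name should_query_nlm and the error raised for unknown values are assumptions, not part of INDRA.

```python
def should_query_nlm(mode, found_in_xml):
    """Decide whether ISSNs should be recovered from the NLM catalog.

    mode is one of 'never', 'missing', 'always'; found_in_xml says whether
    ISSN values were already present in the article XML.
    """
    if mode not in ("never", "missing", "always"):
        # Assumed behavior for invalid input; INDRA may handle this differently
        raise ValueError(f"Unknown option: {mode}")
    if mode == "always":
        return True
    if mode == "missing":
        return not found_in_xml
    return False


print(should_query_nlm("missing", found_in_xml=False))
```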

indra.literature.pubmed_client.get_issns_for_journal(nlm_id)[source]

Get a dict of the ISSN numbers for a journal given its NLM ID.

Information on NLM XML DTDs is available at https://www.nlm.nih.gov/databases/dtd/

indra.literature.pubmed_client.get_mesh_annotations(pmid)[source]

Return a list of MeSH annotations for a given PubMed ID.

Parameters:: pmid (str) – A PubMed ID.
Returns:: A list of dicts that represent MeSH annotations with the following keys: “mesh” representing the MeSH ID, “text” the standard name associated with the MeSH ID, “major_topic” a boolean flag set depending on whether the given MeSH ID is assigned as a major topic to the article, and “qualifier” which is a MeSH qualifier ID associated with the annotation, if available, otherwise None.
Return type:: list of dict

indra.literature.pubmed_client.get_mesh_term_search_str(mesh_id, major_topic=False)[source]

Return a search string for a given MeSH ID.

Parameters:

mesh_id (str) – The MeSH ID of a term to search for, e.g., D009101.
major_topic (bool) – If True, the given MeSH ID is considered as a major topic. Default: False
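
PubMed search syntax offers the field tags [mh] for MeSH terms and [majr] for MeSH major topics, so a search string of this kind could plausibly be assembled as sketched below. The exact format that INDRA returns, and how it resolves a MeSH ID to a term name, are not documented in this reference; the function name mesh_search_str, the quoting, and the example term are all assumptions.

```python
def mesh_search_str(mesh_name, major_topic=False):
    """Build a PubMed search string for a MeSH term name.

    Hypothetical sketch; takes the term name directly rather than a MeSH ID.
    """
    tag = "majr" if major_topic else "mh"
    return f'"{mesh_name}"[{tag}]'


print(mesh_search_str("Neoplasms"))
print(mesh_search_str("Neoplasms", major_topic=True))
```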

indra.literature.pubmed_client.get_metadata_for_all_ids(pmid_list, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False, detailed_authors=False, references_included=None)[source]

Get article metadata for any number of PMIDs from the Pubmed database.

This differs from get_metadata_for_ids in that it can handle any number of PMIDs, and implements batch iteration to avoid the 200 PMID limit of the Pubmed API.

Parameters:

pmid_list (list of str) – Can contain any number of PMIDs.
get_issns_from_nlm (bool) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
get_abstracts (bool) – Indicates whether to include the Pubmed abstract in the results.
prepend_title (bool) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text.
detailed_authors (bool) – If True, extract as many of the author details as possible, such as first name, identifiers, and institutions. If false, only last names are returned. Default: False
references_included (Optional[str]) – If ‘detailed’, include detailed references in the results. If ‘pmid’, only include the PMID of the reference. If None, don’t include references. Default: None

Returns:

Dictionary indexed by PMID. Each value is a dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’.

Return type:

dict of dicts
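
The batch iteration mentioned above can be sketched as follows. This is an illustration of chunking a PMID list into groups of at most 200, not the INDRA implementation; the helper name batches and the synthetic PMIDs are made up.

```python
def batches(items, size=200):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Synthetic PMIDs for demonstration
pmids = [str(i) for i in range(1, 451)]
batch_sizes = [len(batch) for batch in batches(pmids)]
print(batch_sizes)
```

The per-batch result dicts can then be merged into a single dict keyed by PMID, for example with dict.update.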

indra.literature.pubmed_client.get_metadata_for_ids(pmid_list, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False, detailed_authors=False, references_included=None)[source]

Get article metadata for up to 200 PMIDs from the Pubmed database.

Parameters:

pmid_list (list of str) – Can contain 1-200 PMIDs.
get_issns_from_nlm (bool) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
get_abstracts (bool) – Indicates whether to include the Pubmed abstract in the results.
prepend_title (bool) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text.
detailed_authors (bool) – If True, extract as many of the author details as possible, such as first name, identifiers, and institutions. If false, only last names are returned. Default: False
references_included (Optional[str]) – If ‘detailed’, include detailed references in the results. If ‘pmid’, only include the PMID of the reference. If None, don’t include references. Default: None

Returns:

Dictionary indexed by PMID. Each value is a dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’.

Return type:

dict[str, dict]

indra.literature.pubmed_client.get_metadata_from_pubmed_article(pubmed_article, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False, mesh_annotations=True, detailed_authors=False, references_included=None)[source]

Get metadata for a single PubmedArticle element.

Parameters:

pubmed_article (xml.etree.ElementTree.Element) – A PubmedArticle element from a Pubmed XML tree.
get_issns_from_nlm (bool) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
get_abstracts (bool) – Indicates whether to include the Pubmed abstract in the results. Default: False
prepend_title (bool) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text. Default: False
mesh_annotations (bool) – If True, extract mesh annotations from the pubmed entries and include in the returned data. If false, don’t. Default: True
detailed_authors (bool) – If True, extract as many of the author details as possible, such as first name, identifiers, and institutions. If false, only last names are returned. Default: False
references_included (str) – If ‘detailed’, include detailed references in the results. If ‘pmid’, only include the PMID of the reference. If None, don’t include references. Default: None

Returns:

A dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’, ‘volume’, ‘issue’, ‘issue_pub_date’, ‘mesh_annotations’, ‘publication_date’, ‘detailed_publication_dates’, ‘abstract’, ‘publication_types’ and ‘references’.

Return type:

Dict

indra.literature.pubmed_client.get_metadata_from_xml_tree(tree, get_issns_from_nlm=False, get_abstracts=False, prepend_title=False, mesh_annotations=True, detailed_authors=False, references_included=None)[source]

Get metadata for an XML tree containing PubmedArticle elements.

Documentation on the XML structure can be found at:

Parameters:

tree (xml.etree.ElementTree) – ElementTree containing one or more PubmedArticle elements.
get_issns_from_nlm (Optional[bool]) – Look up the full list of ISSN numbers for the journal associated with the article, which helps to match articles to CrossRef search results. Defaults to False, since it slows down performance.
get_abstracts (Optional[bool]) – Indicates whether to include the Pubmed abstract in the results. Default: False
prepend_title (Optional[bool]) – If get_abstracts is True, specifies whether the article title should be prepended to the abstract text. Default: False
mesh_annotations (Optional[bool]) – If True, extract mesh annotations from the pubmed entries and include in the returned data. If false, don’t. Default: True
detailed_authors (Optional[bool]) – If True, extract as many of the author details as possible, such as first name, identifiers, and institutions. If false, only last names are returned. Default: False
references_included (Optional[str]) – If ‘detailed’, include detailed references in the results. If ‘pmid’, only include the PMID of the reference. If None, don’t include references. Default: None

Returns:

Dictionary indexed by PMID. Each value is a dict containing the following fields: ‘doi’, ‘title’, ‘authors’, ‘journal_title’, ‘journal_abbrev’, ‘journal_nlm_id’, ‘issn_list’, ‘page’, ‘volume’, ‘issue’, ‘issue_pub_date’, ‘mesh_annotations’, ‘publication_date’, ‘abstract’, ‘publication_types’ and ‘references’.

Return type:

dict[str, dict]

indra.literature.pubmed_client.get_nct_ids_for_pmid(pmid)[source]

Get the NCT IDs for a given PubMed ID.

Parameters:: pmid (str) – A PubMed ID.
Returns:: A list of NCT IDs associated with the given PubMed ID.
Return type:: List[str]

indra.literature.pubmed_client.get_nct_ids_for_pmids(pmid_list, rest_api_fallback=True)[source]

Get the NCT IDs for a list of PubMed IDs.

Parameters:

pmid_list (List[str]) – A list of PubMed IDs.
rest_api_fallback (bool) – If True, fall back to the REST API if the full XML fetch using the edirect CLI fails.

Returns:

A dictionary mapping each PubMed ID to a list of NCT IDs associated with it.

Return type:

Dict[str, List[str]]

indra.literature.pubmed_client.get_nct_ids_from_article_xml(article)[source]

Extract NCT IDs from a PubMed article XML

Parameters:: article – An XML Element representing a PubMed article.
Return type:: List[str]
Returns:: The NCT IDs associated with the given PubMed article.

indra.literature.pubmed_client.get_nct_ids_from_full_xml(tree)[source]

Get the NCT IDs for a given PubMed ID from the full XML.

Parameters:: tree – An XML Element representing the full PubMed XML tree.
Returns:: A list of NCT IDs associated with the given PubMed ID.
Return type:: Dict[str, List[str]]

indra.literature.pubmed_client.get_pmid_to_package_url_mapping(fname=None)[source]

Return a mapping from PMID to a PMC .tar.gz package URL.

The assignment of PMIDs to specific PMC downloadable files in which extended article elements are available does not follow a specific pattern and therefore explicit mappings from PMIDs to PMC package URLs are required.

Parameters:: fname (Optional[str]) – Optional path to a CSV file containing the mappings data file serving as a cache. It can be obtained from https://ftp.ncbi.nlm.nih.gov/pub/pmc/deprecated/oa_file_list.csv. If not provided, it is downloaded from this URL.
Return type:: Dict[str, str]
Returns:: A dictionary mapping PMIDs to PMC package URLs.

indra.literature.pubmed_client.get_publication_types(article)[source]

Return the set of PublicationType for the article

Parameters:: article (Element) – The XML element for the article. Typically, this is a PubmedArticle node.
Returns:: A set of publication types.
Return type:: set[str]

indra.literature.pubmed_client.get_substance_annotations(pubmed_id)[source]

Return substance MeSH IDs for a given PubMed ID.

Note that substance annotations often overlap with MeSH annotations, however, there are cases where a substance annotation is not available under MeSH annotations.

Parameters:: pubmed_id (str) – PubMed ID whose substance MeSH IDs will be returned.
Return type:: List[str]
Returns:: Substance MeSH IDs corresponding to the given PubMed paper. If none are present or the query fails, an empty list is returned.

indra.literature.pubmed_client.get_title(pubmed_id)[source]: Get the title of an article in the Pubmed database.

indra.literature.pubmed_client.is_retracted(pubmed_id)[source]

Return True if the article with the given PMID has been retracted.

Parameters:: pubmed_id (str) – The PMID of the paper to check.
Return type:: bool
Returns:: True if the paper has been retracted, False otherwise.

Pubmed Central client (`indra.literature.pmc_client`)

indra.literature.pmc_client.download_article_files_s3(pmcid, out_dir, version=None, include=None)[source]

Download a PMC article’s files from the PMC Cloud S3 bucket.

Files are saved under <out_dir>/PMC<id>.<version>/<filename>, mirroring the bucket’s prefix layout.

Parameters:

pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
out_dir (str) – Local directory where files will be written. Created if missing.
version (Optional[int]) – The article version to fetch. If None, the latest available version is used.
include (Optional[Iterable[str]]) – If given, only files whose lowercase extension matches one of these strings are downloaded (e.g. ['xml', 'txt']). Extensions should be given without the leading dot. If None, all files in the article’s prefix are downloaded.

Returns:

Paths to the downloaded files. Empty if the article (or requested version) is not present on the bucket.

Return type:

list of str
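
The include filter and the local path layout described above can be sketched with the standard library. This is not the INDRA implementation: select_files and local_path are hypothetical helpers, the file names are placeholders, and lowercasing the entries of include is an assumption.

```python
import os


def select_files(filenames, include=None):
    """Keep files whose lowercase extension (without dot) is in include."""
    if include is None:
        return list(filenames)
    wanted = {ext.lower() for ext in include}
    selected = []
    for name in filenames:
        # splitext keeps the leading dot, so strip it before comparing
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if ext in wanted:
            selected.append(name)
    return selected


def local_path(out_dir, pmcid, version, filename):
    """Build <out_dir>/PMC<id>.<version>/<filename>."""
    return os.path.join(out_dir, f"{pmcid}.{version}", filename)


files = ["article.xml", "article.TXT", "article.pdf"]
print(select_files(files, include=["xml", "txt"]))
print(local_path("out", "PMC123", 1, "article.xml"))
```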

indra.literature.pmc_client.extract_paragraphs(xml_string)[source]

Returns list of paragraphs in an NLM XML.

This returns a list of the plaintexts for each paragraph and title in the input XML, excluding some paragraphs with text that should not be relevant to biomedical text processing.

Relevant text includes titles, abstracts, and the contents of many body paragraphs. Within figures, tables, and floating elements, only captions are retained. One exception is that all paragraphs within floating boxed-text elements are retained, since these elements often contain short summaries enriched with useful information. Because of captions, nested paragraphs can appear in an NLM XML document, occasionally with multiple levels of nesting. If nested paragraphs appear in the input document, their texts are returned in a pre-order traversal: the text within child paragraphs is not included in the output associated with the parent, each parent appears in the output before its children, and all children of an element appear before the element's following sibling.

All tags are removed from each paragraph in the list that is returned. LaTeX surrounded by <tex-math> tags is removed entirely.

Note: Some articles contain subarticles which are processed slightly differently from the article body. Only text from the body element of a subarticle is included, and all unwanted elements are excluded along with their captions. Boxed-text elements are excluded as well.

Parameters:: xml_string (str) – String containing valid NLM XML.
Returns:: List of extracted paragraphs from the input NLM XML
Return type:: list of str
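
The pre-order handling of nested paragraphs can be illustrated on a toy document. The sketch below is a simplified stand-in, not the INDRA implementation: it treats every p element as a paragraph, applies none of the filtering rules described above, and uses hypothetical function names.

```python
import xml.etree.ElementTree as ET


def own_text(elem):
    """Return the text of elem, skipping the content of nested <p> elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "p":
            parts.append(own_text(child))
        # Text following a child belongs to the enclosing element
        parts.append(child.tail or "")
    return "".join(parts)


def paragraphs_preorder(root):
    """Collect paragraph texts so that each parent precedes its children."""
    out = []

    def visit(elem):
        if elem.tag == "p":
            out.append(" ".join(own_text(elem).split()))
        for child in elem:
            visit(child)

    visit(root)
    return out


toy_xml = (
    "<body><p>Outer start <fig><caption><p>Caption text</p></caption></fig>"
    " outer end</p><p>Next</p></body>"
)
toy_paragraphs = paragraphs_preorder(ET.fromstring(toy_xml))
print(toy_paragraphs)
```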

indra.literature.pmc_client.extract_text(xml_string)[source]

Get plaintext from the body of the given NLM XML string.

This plaintext consists of all paragraphs returned by indra.literature.pmc_client.extract_paragraphs, separated by newlines and terminated by a final newline. See the docstring of extract_paragraphs for more information.

Parameters:: xml_string (str) – String containing valid NLM XML.
Returns:: Extracted plaintext.
Return type:: str
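
The relationship between the two functions amounts to a join. A minimal sketch, using a hypothetical helper name and placeholder paragraphs rather than real extraction output:

```python
def join_paragraphs(paragraphs):
    """Join paragraphs with newlines and terminate with a final newline."""
    return "\n".join(paragraphs) + "\n"


plaintext = join_paragraphs(["Title", "First paragraph.", "Second paragraph."])
print(repr(plaintext))
```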

indra.literature.pmc_client.filter_pmids(pmid_list, source_type)[source]

Filter a list of PMIDs for ones with full text from PMC.

Parameters:

pmid_list (list of str) – List of PMIDs to filter.
source_type (string) – One of ‘fulltext’, ‘oa_xml’, ‘oa_txt’, or ‘auth_xml’.

Returns:

PMIDs available in the specified source/format type.

Return type:

list of str

indra.literature.pmc_client.get_latest_s3_version(pmcid)[source]

Return the latest available version of a PMC article on S3.

Parameters:: pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
Returns:: The highest available version number, or None if the article is not present on the bucket.
Return type:: Optional[int]

indra.literature.pmc_client.get_metadata_s3(pmcid, version=None)[source]

Return the JSON metadata for a PMC article from the PMC Cloud bucket.

Parameters:

pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
version (Optional[int]) – The article version to fetch. If None, the latest available version is used.

Returns:

The parsed JSON metadata dict, containing keys such as ‘pmid’, ‘doi’, ‘title’, ‘citation’, ‘license_code’, ‘is_retracted’, and s3:// URLs for the text/xml/pdf/media files. None if the article is not present on the bucket.

Return type:

Optional[dict]

indra.literature.pmc_client.get_pdf_s3(pmcid, version=None)[source]

Return the PDF for a PMC article from the PMC Cloud S3 bucket.

Parameters:

pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
version (Optional[int]) – The article version to fetch. If None, the latest available version is used.

Returns:

The PDF content or None if the article is not present on the bucket.

Return type:

Optional[str]

indra.literature.pmc_client.get_s3_versions(pmcid)[source]

Return available versions of a PMC article on the PMC Cloud S3 bucket.

Parameters:: pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
Returns:: Sorted tuple of available version numbers, or an empty tuple if the article is not present on the bucket.
Return type:: tuple of int
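
Given the documented return values, the link between get_s3_versions and get_latest_s3_version can be sketched as below. The helper latest_version is hypothetical and works on a plain tuple; it does not access the bucket.

```python
def latest_version(versions):
    """Return the highest version number, or None for an empty tuple."""
    return max(versions) if versions else None


print(latest_version((1, 2, 3)))
print(latest_version(()))
```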

indra.literature.pmc_client.get_text_s3(pmcid, version=None)[source]

Return the plain text for a PMC article from the PMC Cloud S3 bucket.

Parameters:

pmcid (str) – A PubMed Central ID in ‘PMC<digits>’ form.
version (Optional[int]) – The article version to fetch. If None, the latest available version is used.

Returns:

The plain-text content as a unicode string, or None if the article is not present on the bucket.

Return type:

Optional[str]

indra.literature.pmc_client.get_xml(pmc_id, raise_for_status=False, max_retries=4)[source]

Returns XML for the article corresponding to a PMC ID

Parameters:

pmc_id (str) – A PubMed Central ID in ‘PMC<digits>’ form.
raise_for_status (bool) – If True, raise an HTTPError if the request fails. If False, return None on failure.
max_retries (int) – Maximum number of retries to make if the request fails with a 429 error.

Returns:

The XML content as a unicode string, or None if the request fails and raise_for_status is False.

Return type:

str | None

Notes

The endpoint this function relies on is aggressively rate limited and should only be used for single requests. For bulk requests, consider using the PMC Cloud S3 endpoints instead, which are not rate limited and offer a more robust API. See https://pmc.ncbi.nlm.nih.gov/tools/oai/ for more information.
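
Retrying on a 429 response can be sketched in a generic way. The code below is an assumption-laden illustration, not the INDRA implementation: the exponential backoff, the base_delay parameter, and the fetch callable returning a (status, body) pair are all invented for the example, and the demo uses canned responses instead of real requests.

```python
import time


def fetch_with_retries(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() again after each 429 response, up to max_retries times.

    fetch is any callable returning a (status_code, body) pair.
    Returns the body on status 200 and None otherwise.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return body if status == 200 else None
        if attempt < max_retries:
            # Wait longer after each rate-limited attempt (assumed strategy)
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with canned responses and a recorder in place of real sleeping
responses = iter([(429, None), (429, None), (200, "<article/>")])
delays = []
result = fetch_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```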

bioRxiv client (`indra.literature.biorxiv_client`)

A client to obtain metadata and text content from bioRxiv (and to some extent medRxiv) preprints.

indra.literature.biorxiv_client.get_collection_dois(collection_id, min_date=None)[source]

Get list of DOIs from a biorxiv/medrxiv collection.

Parameters:

collection_id (str) – The identifier of the collection to fetch.
min_date (Optional[datetime.datetime]) – A datetime object representing a cutoff. If given, only publications that were released on or after the given date are returned. By default, no date constraint is applied.

Returns:

The list of DOIs in the collection.

Return type:

list of dict

indra.literature.biorxiv_client.get_collection_pubs(collection_id, min_date=None)[source]

Get list of publications from a biorxiv/medrxiv collection.

Parameters:

collection_id (str) – The identifier of the collection to fetch.
min_date (Optional[datetime.datetime]) – A datetime object representing a cutoff. If given, only publications that were released on or after the given date are returned. By default, no date constraint is applied.

Returns:

A list of the publication entries which include the abstract and other metadata.

Return type:

list of dict
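
The min_date cutoff can be pictured as a filter over publication entries. In this sketch the entry key "date", its YYYY-MM-DD format, and the DOI values are all assumptions chosen for illustration; the actual JSON fields returned by the service are not documented here.

```python
import datetime


def filter_by_min_date(pubs, min_date=None, date_key="date"):
    """Keep publications released on or after min_date.

    Hypothetical helper; assumes each entry stores its date as YYYY-MM-DD.
    """
    if min_date is None:
        return list(pubs)
    kept = []
    for pub in pubs:
        released = datetime.datetime.strptime(pub[date_key], "%Y-%m-%d")
        if released >= min_date:
            kept.append(pub)
    return kept


# Placeholder entries, not real preprints
pubs = [
    {"doi": "10.1101/example-1", "date": "2020-03-01"},
    {"doi": "10.1101/example-2", "date": "2020-04-15"},
]
recent = filter_by_min_date(pubs, datetime.datetime(2020, 4, 1))
print([pub["doi"] for pub in recent])
```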

indra.literature.biorxiv_client.get_content_from_pub_json(pub, format)[source]

Get text content based on a given format from a publication JSON.

In the case of abstract, the content is returned from the JSON directly. For pdf, the content is returned as bytes that can be dumped into a file. For txt and xml, the text is processed out of either the raw XML or text content that rxiv provides.

Parameters:

pub (dict) – The JSON dict describing a publication.
format (str) – The format, if available, via which the content should be obtained.

indra.literature.biorxiv_client.get_formats(pub)[source]

Return formats available for a publication JSON.

Parameters: pub (dict) – The JSON dict describing a publication.
Returns: A dict with available formats as its keys (abstract, pdf, xml, txt) and either the content (in case of abstract) or the URL (in case of pdf, xml, txt) as the value.
Return type: dict
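Since not every publication offers every format, callers typically need to choose one from the dict returned by get_formats. A minimal sketch of such a choice follows; pick_format and its default preference order are illustrative assumptions, not part of the client.

```python
def pick_format(formats, preference=("xml", "txt", "pdf", "abstract")):
    """Return the first entry of `preference` that is a key of `formats`.

    `formats` is a dict keyed by format name, as described for get_formats.
    Returns None if none of the preferred formats is available.
    """
    for fmt in preference:
        if fmt in formats:
            return fmt
    return None


# Example with a hand-written formats dict (the URL is a placeholder).
example = {"abstract": "Some abstract text.",
           "pdf": "https://example.org/paper.full.pdf"}
print(pick_format(example))  # pdf
```

The selected name can then be passed as the format argument of get_content_from_pub_json.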

indra.literature.biorxiv_client.get_pdf_xml_url_base(content)[source]

Return base URL to PDF/XML based on the content of the landing page.

Parameters: content (str) – The content of the landing page for an rxiv paper.
Returns: The base URL if available, otherwise None.
Return type: str or None

indra.literature.biorxiv_client.get_text_from_rxiv_text(rxiv_text)[source]

Return clean text from the raw rxiv text content.

This function parses out the title, headings and subheadings, and the content of sections under headings/subheadings. It filters out some irrelevant content, e.g., references and footnotes.

Parameters: rxiv_text (str) – The content of the rxiv full text as obtained from the web.
Returns: The text content stripped out from the raw full text.
Return type: str

indra.literature.biorxiv_client.get_text_from_rxiv_xml(rxiv_xml)[source]

Return clean text from the raw rxiv xml content.

Parameters: rxiv_xml (str) – The content of the rxiv full xml as obtained from the web.
Returns: The text content stripped out from the raw full xml.
Return type: str

indra.literature.biorxiv_client.get_text_url_base(content)[source]

Return base URL to full text based on the content of the landing page.

Parameters: content (str) – The content of the landing page for an rxiv paper.
Returns: The base URL if available, otherwise None.
Return type: str or None

CrossRef client (`indra.literature.crossref_client`)

indra.literature.crossref_client.doi_query(pmid, search_limit=10)[source]

Get the DOI for a PMID by matching CrossRef and Pubmed metadata.

Searches CrossRef using the article title and then accepts search hits only if they have a matching journal ISSN and page number with what is obtained from the Pubmed database.

indra.literature.crossref_client.get_fulltext_links(doi)[source]

Return a list of links to the full text of an article given its DOI.

Each list entry is a dictionary with keys:

URL: the URL to the full text
content-type: e.g. text/xml or text/plain
content-version
intended-application: e.g. text-mining
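Given link entries with the keys listed for get_fulltext_links, a caller may want only the links suited to a particular use. The sketch below is illustrative only: select_links is a hypothetical helper, and the default filter values are simply the examples mentioned in the key descriptions.

```python
def select_links(links, content_type="text/xml", application="text-mining"):
    """Return the URLs of entries matching a content type and application.

    `links` is a list of dicts with the keys URL, content-type,
    content-version and intended-application. Entries missing a key
    are skipped rather than raising an error.
    """
    return [
        link["URL"]
        for link in links
        if link.get("content-type") == content_type
        and link.get("intended-application") == application
        and "URL" in link
    ]
```

Using dict.get keeps the filter tolerant of entries that lack one of the optional keys.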

indra.literature.crossref_client.get_metadata(doi)[source]

Returns the metadata of an article given its DOI from CrossRef as a JSON dict.

COCI client (`indra.literature.coci_client`)

Client to COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations.

For more information on the COCI, see: https://opencitations.net/index/coci with API documentation at https://opencitations.net/index/coci/api/v1/.

indra.literature.coci_client.get_citation_count_for_doi(doi)[source]

Return the citation count for a given DOI.

Note that the COCI API returns a count of 0 for DOIs that are not indexed.

Parameters: doi (str) – The DOI to get the citation count for.
Return type: int
Returns: The citation count for the DOI.

indra.literature.coci_client.get_citation_count_for_pmid(pmid)[source]

Return the citation count for a given PMID.

This uses the CrossRef API to get the DOI for the PMID, and then calls the COCI API to get the citation count for the DOI.

If the DOI lookup failed, this returns None. Note that the COCI API returns a count of 0 for DOIs that are not indexed.

Parameters: pmid (str) – The PMID to get the citation count for.
Return type: Optional[int]
Returns: The citation count for the PMID.
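Because None (DOI lookup failed) and 0 (DOI not indexed, or genuinely uncited) mean different things, code that aggregates counts should keep them apart. A minimal sketch, assuming a caller-supplied function with the same contract as get_citation_count_for_pmid; rank_by_citations is a hypothetical name.

```python
def rank_by_citations(pmids, count_for_pmid):
    """Split PMIDs into ranked (pmid, count) pairs and unresolved PMIDs.

    `count_for_pmid` is any callable returning an int, or None when the
    DOI lookup failed. Resolved PMIDs are sorted by count, highest first.
    """
    resolved, unresolved = [], []
    for pmid in pmids:
        count = count_for_pmid(pmid)
        if count is None:
            unresolved.append(pmid)  # lookup failed: not the same as 0
        else:
            resolved.append((pmid, count))
    resolved.sort(key=lambda pair: pair[1], reverse=True)
    return resolved, unresolved
```

With INDRA installed, indra.literature.coci_client.get_citation_count_for_pmid could be passed as count_for_pmid; each call then triggers web requests.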

Elsevier client (`indra.literature.elsevier_client`)

For information on the Elsevier API, see:

API Specification: http://dev.elsevier.com/api_docs.html
Authentication: https://dev.elsevier.com/tecdoc_api_authentication.html

indra.literature.elsevier_client.check_entitlement(doi)[source]

Check whether IP and credentials enable access to content for a doi.

This function uses the entitlement endpoint of the Elsevier API to check whether an article is available to a given institution. Note that this feature of the API is itself not available for all institution keys.

indra.literature.elsevier_client.download_article(id_val, id_type='doi', max_retries=2, on_retry=False)[source]

Low level function to get an XML article for a particular id.

Parameters:

id_val (str) – The value of the id.
id_type (str) – The type of id, such as pmid (a.k.a. pubmed_id), doi, or eid.
max_retries (int) – The maximum number of retries for connection errors.
on_retry (bool) – This function has a recursive retry feature, and this is the only time this parameter should be used.

Returns:

content – If found, the content string is returned, otherwise, None is returned.

Return type:

str or None

indra.literature.elsevier_client.download_article_from_ids(**id_dict)[source]

Download an article in XML format from Elsevier matching the set of ids.

Parameters: <id_type> (str) – You can enter any combination of eid, doi, pmid, and/or pii. Ids will be checked in that order, until either content has been found or all ids have been checked.
Returns: content – If found, the content is returned as a string, otherwise None is returned.
Return type: str or None
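The lookup order described above (eid, doi, pmid, pii) can be sketched as a simple fallback loop. This is an illustration of the documented behavior, not the library's actual code; first_content and the injected download callable are hypothetical.

```python
ID_ORDER = ("eid", "doi", "pmid", "pii")


def first_content(download, **id_dict):
    """Try each supplied id in ID_ORDER and return the first content found.

    `download` is a callable taking (id_val, id_type) and returning a
    string, or None when nothing was found for that id.
    """
    for id_type in ID_ORDER:
        id_val = id_dict.get(id_type)
        if id_val is None:
            continue  # this id type was not supplied
        content = download(id_val, id_type)
        if content is not None:
            return content
    return None
```

The order in which keyword arguments are passed does not matter; only ID_ORDER determines which id is tried first.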

indra.literature.elsevier_client.download_from_search(query_str, folder, do_extract_text=True, max_results=None)[source]

Save raw text files based on a search for papers on ScienceDirect.

This performs a search to get PIIs, downloads the XML corresponding to the PII, extracts the raw text and then saves the text into a file in the designated folder.

Parameters:

query_str (str) – The query string to search with
folder (str) – The local path to an existing folder in which the text files will be dumped
do_extract_text (bool) – Choose whether to extract text from the xml, or simply save the raw xml files. Default is True, so text is extracted.
max_results (int or None) – Default is None. If specified, limit the number of results to the given maximum.

indra.literature.elsevier_client.extract_paragraphs(xml_string)[source]

Get paragraphs from the body of the given Elsevier xml.

indra.literature.elsevier_client.extract_text(xml_string)[source]

Get text from the body of the given Elsevier xml.

indra.literature.elsevier_client.get_abstract(doi)[source]

Get the abstract text of an article from Elsevier given a doi.

indra.literature.elsevier_client.get_article(doi, output_format='txt')[source]

Get the full body of an article from Elsevier.

Parameters:

doi (str) – The doi for the desired article.
output_format ('txt' or 'xml') – The desired format for the output. Selecting ‘txt’ (default) strips all xml tags and joins the pieces of text in the main text, while ‘xml’ simply takes the tag containing the body of the article and returns it as is. In the latter case, downstream code needs to be able to interpret Elsevier’s XML format.

Returns:

content – Either text content or xml, as described above, for the given doi.

Return type:

str

indra.literature.elsevier_client.get_dois(query_str, year=None, loaded_after=None)[source]

Search ScienceDirect through the API for articles and return DOIs.

Parameters:

query_str (str) – The query string to search with.
year (Optional[str]) – The year to constrain the search to.
loaded_after (Optional[str]) – Date formatted as ‘yyyy-MM-dd’T’HH:mm:ssX’ to constrain the search to articles loaded after this date. Example: 2019-06-01T00:00:00Z

Returns:

dois – The list of DOIs identifying the papers returned by the search.

Return type:

list[str]
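The loaded_after value must be a string such as 2019-06-01T00:00:00Z. A minimal sketch for producing it from a Python datetime follows; format_loaded_after is a hypothetical helper, and treating naive datetimes as UTC is an assumption made here for illustration.

```python
from datetime import datetime, timezone


def format_loaded_after(dt):
    """Format a datetime as a UTC timestamp like 2019-06-01T00:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(format_loaded_after(datetime(2019, 6, 1)))  # 2019-06-01T00:00:00Z
```

Converting to UTC before formatting keeps the trailing Z accurate for timezone-aware inputs.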

indra.literature.elsevier_client.get_piis(query_str)[source]

Search ScienceDirect through the API for articles and return PIIs.

Note that ScienceDirect has a limitation in which a maximum of 6,000 PIIs can be retrieved for a given search and therefore this call is internally broken up into multiple queries by a range of years and the results are combined.

Parameters: query_str (str) – The query string to search with
Returns: piis – The list of PIIs identifying the papers returned by the search
Return type: list[str]
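The idea of working around a per-search result cap by querying one year at a time and merging the results can be sketched as follows. This is a simplified illustration under assumptions: search_by_year is hypothetical, the year range is chosen by the caller, and the library's own chunking may differ.

```python
def search_by_year(search, query_str, first_year, last_year):
    """Run `search(query_str, year)` for each year and merge the results.

    `search` is a callable returning a list of identifiers for one year,
    with the year passed as a string. Duplicates across years are dropped
    while the order of first appearance is preserved.
    """
    seen = set()
    merged = []
    for year in range(first_year, last_year + 1):
        for item in search(query_str, str(year)):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

A function such as get_piis_for_date, which accepts a query string and a year, could be wrapped to serve as the search callable.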

indra.literature.elsevier_client.get_piis_for_date(query_str, year=None, loaded_after=None)[source]

Search ScienceDirect through the API for articles and return PIIs.

Parameters:

query_str (str) – The query string to search with.
year (Optional[str]) – The year to constrain the search to.
loaded_after (Optional[str]) – Date formatted as ‘yyyy-MM-dd’T’HH:mm:ssX’ to constrain the search to articles loaded after this date. Example: 2019-06-01T00:00:00Z

Returns:

piis – The list of PIIs identifying the papers returned by the search.

Return type:

list[str]

indra.literature.elsevier_client.has_full_text(xml_content)[source]

Determines if the given Elsevier XML contains full text.

indra.literature.elsevier_client.search_science_direct(query_str, field_name, year=None, loaded_after=None)[source]

Search ScienceDirect for a given field with a query string.

Users can specify which field they are interested in and only values from that field will be returned. It is also possible to restrict the search either to a specific year of publication or to papers published after a specific date.

Parameters:

query_str (str) – The query string to search with.
field_name (str) – A name of the field of interest to be returned. Accepted values are: authors, doi, loadDate, openAccess, pages, pii, publicationDate, sourceTitle, title, uri, volumeIssue.
year (Optional[str]) – The year to constrain the search to.
loaded_after (Optional[str]) – Date formatted as ‘yyyy-MM-dd’T’HH:mm:ssX’ to constrain the search to articles loaded after this date.

Returns:

all_parts – The list of values from the field of interest identifying the papers returned by the search.

Return type:

list[str]

NewsAPI client (`indra.literature.newsapi_client`)

This module provides a client for the NewsAPI web service (https://newsapi.org/). The web service requires an API key which is available after registering at https://newsapi.org/account. This key can be set as NEWSAPI_API_KEY in the INDRA config file or as an environment variable with the same name.

NewsAPI also requires attribution e.g. “powered by NewsAPI.org” for derived uses.

indra.literature.newsapi_client.send_request(endpoint, **kwargs)[source]

Return the response to a query as JSON from the NewsAPI web service.

The basic API is limited to 100 results per request, and this limit is used as the page size unless one is explicitly given as an argument. Beyond that, paging is supported through the “page” argument, if needed.

Parameters:

endpoint (str) – Endpoint to query, e.g. “everything” or “top-headlines”
kwargs (dict) – Keyword arguments passed as parameters with the query. The basic ones are “q”, which is the search query, “from”, which is a start date formatted as, for instance, 2018-06-10, and “to”, which is an end date with the same format.

Returns:

res_json – The response from the web service as a JSON dict.

Return type:

dict
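Since from is a reserved word in Python, it cannot be written directly as a keyword argument and has to be passed by unpacking a dict. A minimal sketch of building such a dict; build_query is a hypothetical helper, not part of the client.

```python
def build_query(q, start=None, end=None, page=None):
    """Build the keyword-argument dict for a NewsAPI query.

    Dates are strings formatted like 2018-06-10. Keys with a value of
    None are left out so that the service defaults apply.
    """
    params = {"q": q}
    if start is not None:
        params["from"] = start  # "from" is a Python keyword, hence a dict
    if end is not None:
        params["to"] = end
    if page is not None:
        params["page"] = page
    return params
```

With a valid API key configured, the dict could be used as send_request("everything", **build_query("kinase", start="2018-06-10", end="2018-06-17")).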

Adeft Tools (`indra.literature.adeft_tools`)

This file provides several functions helpful for acquiring texts for Adeft disambiguation.

It offers the ability to get text content for articles containing a particular gene. This is useful for acquiring training texts for genes that do not appear in a defining pattern with a problematic shortform.

General XML processing is also provided that allows for extracting text from a source that may be any of Elsevier XML, NLM XML or raw text. This is helpful because it avoids having to know in advance the source of text content from the database.

indra.literature.adeft_tools.filter_paragraphs(paragraphs, contains=None)[source]

Filter paragraphs to only those containing one of a list of strings

Parameters:

paragraphs (list of str) – List of plaintext paragraphs from an article
contains (str or list of str) – Exclude paragraphs not containing this string as a token, or at least one of the strings in contains if it is a list

Returns:

Plaintext consisting of all input paragraphs containing at least one of the supplied tokens.

Return type:

str
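Token-based filtering of this kind can be illustrated with a regular expression that requires a non-word character (or a string boundary) on both sides of the match. The sketch below is a simplified stand-in, not the library's implementation; filter_paragraphs_sketch is a hypothetical name, and its exact boundary rules and output formatting are assumptions.

```python
import re


def filter_paragraphs_sketch(paragraphs, contains=None):
    """Join the paragraphs that contain at least one token from `contains`.

    `contains` may be a single string or a list of strings. A token only
    matches when it is not embedded in a longer word. With contains=None,
    all paragraphs are kept.
    """
    if contains is None:
        return "\n".join(paragraphs)
    if isinstance(contains, str):
        contains = [contains]
    pattern = re.compile(
        "|".join(r"(?<!\w)%s(?!\w)" % re.escape(tok) for tok in contains))
    return "\n".join(p for p in paragraphs if pattern.search(p))
```

For example, filtering on "ER" keeps "ER stress was induced." but drops "PERK signaling was active.", because there ER is part of a longer word.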

indra.literature.adeft_tools.get_text_content_for_gene(hgnc_name)[source]

Get articles that have been annotated in Entrez as containing the given gene.

Parameters: hgnc_name (str) – HGNC name for gene
Returns: text_content – xmls of fulltext if available, otherwise abstracts, for all articles that have been annotated in Entrez to contain the given gene
Return type: list of str

indra.literature.adeft_tools.get_text_content_for_pmids(pmids)[source]

Get text content for articles given a list of their pmids

Parameters: pmids (list of str)
Returns: text_content
Return type: list of str

indra.literature.adeft_tools.universal_extract_paragraphs(xml)[source]

Extract paragraphs from xml that could be from different sources

First try to parse the xml as if it came from Elsevier. If we do not have valid Elsevier xml, this will throw an exception. The Elsevier parser is tried first because the text extraction function in the pmc client may not throw an exception when parsing Elsevier xml, silently processing the xml incorrectly.

Parameters: xml (str) – Either an NLM xml, Elsevier xml or plaintext
Returns: paragraphs – Extracted plaintext paragraphs from NLM or Elsevier XML
Return type: str
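The ordering rationale, with the stricter parser tried first so that its exception signals a mismatch, can be sketched with injected parser functions. This is an illustration under assumptions rather than the library's code: extract_with_fallback is hypothetical, and treating unparsed input as a single plaintext paragraph is a simplification.

```python
def extract_with_fallback(text, parsers):
    """Try each parser in order and return the first non-empty result.

    `parsers` is a sequence of callables that take a string and return a
    list of paragraphs. Strict parsers (those that raise on input they do
    not understand) should come first, lenient ones later. If no parser
    succeeds, the input is treated as plaintext.
    """
    for parse in parsers:
        try:
            paragraphs = parse(text)
        except Exception:
            continue  # this parser rejected the input; try the next one
        if paragraphs:
            return paragraphs
    return [text]
```

Placing a lenient parser first would defeat the purpose, since it could return incorrectly processed content before the strict parser gets a chance to run.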

indra.literature.adeft_tools.universal_extract_text(xml, contains=None)[source]

Extract plaintext from xml that could be from different sources

Parameters:

xml (str) – Either an NLM xml, Elsevier xml, or plaintext
contains (str or list of str) – Exclude paragraphs not containing this string, or at least one of the strings in contains if it is a list

Returns:

The concatenation of all paragraphs in the input xml, excluding paragraphs not containing one of the tokens in the list contains. Paragraphs are separated by new lines.

Return type:

str
