Saturday, June 5, 2021

Using Biomedical Text Mining to Uncover Potential Use of Existing Drugs for Inflammatory Breast Cancer (IBC)

Introduction

Inflammatory Breast Cancer (IBC) is the most aggressive and lethal form of breast cancer [1-3]. IBC represents ~2-4% of breast cancers in the USA [4] and accounts for ~7-10% of all breast cancer deaths [5]. The 5-year survival rate for IBC is less than 50%, significantly lower than that of patients with non-IBC breast cancer (85%) [6,7]. Further, IBC affects young, African-American and American Indian women at a higher rate than other groups [8-10]. The diagnostic criteria for IBC are based on clinical characteristics. In IBC, the tumor cells are typically distributed diffusely in the breast instead of forming a solid tumor. These cells cluster to form emboli that block the lymphatic vessels in the skin covering the breast, causing a characteristic red, warm and thickened appearance of the breast termed peau d’orange. Hence, IBC tumors are not easily detected by mammogram, and detection typically occurs after the cancer has spread [1,3].
Despite polychemotherapy regimens, women with IBC continue to have worse survival outcomes than non-IBC breast cancer patients [11,12]. Most IBC tumors are negative for Estrogen Receptor (ER), and nearly 40% of IBC patients exhibit the triple negative phenotype (ER-/PR-/HER2-), rendering hormonal and HER2-targeted therapies ineffective. The major clinical challenges facing patients with IBC are the development of resistance to initially effective therapeutics and metastatic spread of the disease [7,13-15]. To date, no therapeutics have been developed that target IBC specifically. Thus, novel research is required to identify effective therapies for IBC, and repurposing existing drugs via word embedding analysis is an attractive strategy.
Word embedding technologies have emerged from the text mining field in recent years as a novel drug repurposing strategy in various therapeutic areas. For example, Ngo et al. [16] employed Word2Vec [17] to analyze a corpus of biomedical text consisting of the subset of PubMed abstracts matching the keyword “cancer”. From over 3 million abstracts, 14 million sentences were extracted. Using Word2Vec, over 1.7 million words were embedded as word vectors, including those for 2303 drugs and 3069 diseases. Combining the word vectors with known drug-disease relations from DrugBank [17], machine learning was employed to build models that uncover new drug-disease relations. Their trained model achieved > 87% accuracy in predicting known drug-disease relations and succeeded in discovering novel drug-disease relations that were reported in the literature. Here, we report a perspective that describes a related but distinct approach that combines two text mining technologies to derive word vectors for drugs, cancers and their potential targets.
Existing relations among drugs, diseases and targets will be extracted from Chem2Bio2RDF, a systems chemical biology database [18]; newer relations will be extracted directly from the current version of DrugBank [17]. Unknown relations that can potentially link various drugs to IBC will be predicted based on a simple similarity principle.

Literature-Wide Association Study (LWAS) For IBC Drug Repurposing

Since the proposed approach conducts a literature-wide survey of PubMed to identify associations between biological concepts, we term it LWAS (Literature-Wide Association Study). The overall workflow for LWAS-IBC drug repurposing is shown in Figure 1. The individual components are detailed as follows.

Collecting and Preprocessing of PubMed Abstracts

A search of the PubMed database with “cancer” as the search term returned over 3 million abstracts. This body of literature will be periodically updated and further processed into sentences, which will be subjected to word embedding analyses by the two complementary technologies below to derive word vectors for all the terms (vocabulary) in it.

Performing Word Embedding Analyses

First, the Word2Vec technology [17], originally developed by Google for semantic relationship analysis, will be used to establish relationships among biological terms.
The Gensim implementation of this algorithm will be used in this analysis. With this technology, each biological term will be represented by a high-dimensional vector (e.g. 200 dimensions), and the similarity (relatedness) between any two terms will be captured by the cosine similarity of their vectors. This analysis will generate potential relatedness between IBC and other biological concepts, such as gene/protein names, clinical parameters, diseases and drugs. This “relatedness” information will be used to generate new hypotheses for drug repurposing. Secondly, the word association tool GloVe [19], originally developed by the Stanford natural language processing group, will be employed to analyze the same textual corpus. As with Word2Vec, GloVe will generate word associations based on how frequently pairs of biological concepts of interest co-occur. The co-occurrence information will then be used to derive word embedding vectors.
Both techniques are based on the intuitive notion that biomedical terms (drug names, disease names and target names) that frequently co-occur in the same context (in proximity within sentences) are often related. Both combine this notion with neural network learning to discover highly related drugs and targets.
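To make the co-occurrence idea concrete, the following is a minimal sketch, using only the Python standard library, of how abstracts might be split into sentences and how within-sentence co-occurrence counts could be collected. It is not the Gensim Word2Vec or GloVe implementation; the sentence splitter, the tokenizer, the window size of 5 and the toy abstracts (with the made-up drug names DrugA and DrugB) are all illustrative assumptions.

```python
import re
from collections import Counter


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric tokens (internal hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def cooccurrence_counts(abstracts, window=5):
    """Count pairs of distinct tokens that occur within `window` positions
    of each other inside the same sentence. Pairs are stored in sorted order."""
    counts = Counter()
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            tokens = tokenize(sentence)
            for i, left in enumerate(tokens):
                for right in tokens[i + 1 : i + 1 + window]:
                    if left != right:
                        counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical toy abstracts, for illustration only (not real PubMed text).
toy_abstracts = [
    "DrugA reduced tumor emboli in a mouse model. DrugA was well tolerated.",
    "Tumor emboli block lymphatic vessels. DrugB had no effect on tumor emboli.",
]

counts = cooccurrence_counts(toy_abstracts, window=5)
for pair, n in counts.most_common(3):
    print(pair, n)
```

In a real pipeline, counts of this kind (or the sliding context windows themselves) would be handed to an embedding method; the choice of window size controls how close two terms must be to count as sharing a context.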

Similarity Analysis of Drug Vectors, Disease Vectors and Target Vectors

Our approach will be to first generate critical terms that will allow us to explore the above embedding results related to IBC. For example, a clinical hallmark of IBC is dermal lymphatic invasion by clusters of tumor cells which migrate collectively, termed tumor emboli. These emboli block the lymphatic vessels in the skin covering the breast, causing a characteristic red, warm and thickened appearance termed peau d’orange. These terms (e.g. tumor emboli, dermal lymphatic invasion, peau d’orange) can be explored by similarity search in the embedded (both Word2Vec and GloVe) space to identify related concepts such as drugs, targets and other diseases, which in turn provide additional hypotheses for mechanism studies and drug repurposing. Other molecular markers/genes, clinical features and pathological features/terms directly related to IBC [12,20-26] can be explored as well by similarity analysis based on the embedded word vectors.
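The similarity search described above can be sketched as a nearest-neighbor lookup by cosine similarity. This is a minimal illustration under stated assumptions: the three-dimensional vectors and the term names (drug_a, drug_b, gene_x) are invented for the example, whereas real embeddings would have on the order of 200 dimensions and would come from the trained Word2Vec or GloVe models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, top_n=3):
    """Rank every other term in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [
        (term, cosine_similarity(q, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "tumor_emboli": [0.9, 0.1, 0.0],
    "drug_a": [0.8, 0.2, 0.1],
    "drug_b": [0.0, 0.1, 0.9],
    "gene_x": [0.5, 0.5, 0.0],
}

for term, score in most_similar("tumor_emboli", toy_vectors):
    print(f"{term}\t{score:.3f}")
```

Running the same query against both embedding spaces and comparing the ranked lists is one simple way to see which neighbors are supported by both methods.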

Database Search for Known Drug-Disease and Drug-Target Relations

Systems biology databases such as Chem2Bio2RDF [18] integrate data from different sources. Web-based resources such as DrugBank [17] also provide useful information for drug repurposing efforts. For example, we have identified > 10,000 known drug-target pairs from Chem2Bio2RDF, and over 1000 drug-disease pairs from DrugBank and KEGG combined. Similarity analysis from the previous step can reveal which drugs are similar to each other based on drug vectors and which diseases are similar based on disease vectors. We can then search the known drug-target pairs and drug-disease pairs to identify which diseases and drugs may be relevant to IBC. This combination of word embedding analysis and database search can yield interesting hypotheses for follow-up experiments. For example, if we find that IBC is related to “ovarian cancer”, then drugs known for “ovarian cancer” can be tested for their potential use in treating IBC, which serves as a new starting point for drug discovery programs [27-39].
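The lookup step can be sketched as follows: take the diseases whose vectors are similar to IBC, then collect the drugs already linked to those diseases in the known drug-disease pairs. This is only an illustration under stated assumptions. The function name, the similarity threshold of 0.5, the similarity scores and the drug names are hypothetical, and in practice the pairs would come from the databases named above.

```python
def repurposing_candidates(query_disease, disease_similarity, known_drug_disease,
                           min_similarity=0.5):
    """Return (drug, supporting_disease, similarity) tuples for drugs linked to
    diseases similar to `query_disease`, most similar first.

    Drugs already linked to `query_disease` itself are skipped, and each drug
    is reported once, with its most similar supporting disease.
    """
    already_linked = {
        drug for drug, disease in known_drug_disease if disease == query_disease
    }
    best = {}
    for disease, sim in disease_similarity.items():
        if disease == query_disease or sim < min_similarity:
            continue
        for drug, linked_disease in known_drug_disease:
            if linked_disease != disease or drug in already_linked:
                continue
            if drug not in best or sim > best[drug][1]:
                best[drug] = (disease, sim)
    rows = [(drug, disease, sim) for drug, (disease, sim) in best.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Hypothetical similarity scores between IBC and other diseases (illustration only).
toy_disease_similarity = {"ovarian cancer": 0.8, "lung cancer": 0.6, "melanoma": 0.3}

# Hypothetical known drug-disease pairs (illustration only).
toy_pairs = [
    ("drug_a", "ovarian cancer"),
    ("drug_a", "lung cancer"),
    ("drug_b", "melanoma"),
    ("drug_c", "lung cancer"),
]

for row in repurposing_candidates("IBC", toy_disease_similarity, toy_pairs):
    print(row)
```

The output is a ranked list of hypotheses, not predictions of efficacy; each candidate would still need the follow-up experiments described in the text.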

Concluding Remarks

Currently there is a lack of therapeutics for IBC. We have described a perspective in which a text mining approach may help discover novel relations that link IBC to other diseases, targets and existing drugs. The outcomes of the proposed analyses include (1) potential targets for IBC mechanism research and (2) a list of drugs that can be tested for their efficacy in in vitro and in vivo IBC models [40-44]. To the best of our knowledge, this work would be the first to employ Natural Language Processing (NLP) technologies to generate word embedding models that allow us to mine for IBC-specific topics. The software used in this proposed study has been previously published and verified, and is broadly applicable; thus, it also provides a generic protocol that can be applied to other cancers (and other diseases) as well.
With the new experimental data, we can then validate and refine the word embedding models. In addition, various machine learning techniques available in scikit-learn will be employed to analyze the new data. We will also perform drug similarity analysis based on chemical descriptors (rather than embeddings), which extends the approach beyond drug repurposing to the discovery of New Chemical Entities (NCE) for IBC. This combined text mining and cheminformatics approach should afford a greater opportunity for IBC therapeutic discovery.

Acknowledgement

This study was supported in part by NIH awards P20CA202924 and U54CA137844, and Komen Graduate Training in Disparities Research award GTDR16377604 (KPW). It was also supported in part by the subaward from UNC-CH to WZ (via NIH U01CA207160).

More BJSTR Articles: https://biomedres01.blogspot.com/
