CN116110509A

CN116110509A - Method and device for predicting drug sensitivity based on omics consistency pre-training

Info

Publication number: CN116110509A
Application number: CN202211422775.9A
Authority: CN
Inventors: 曹戟; 陈文博; 欧阳振球; 杨波; 何俏军; 吴健
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-05-12
Anticipated expiration: 2042-11-15
Also published as: CN116110509B

Abstract

The invention discloses a method and a device for predicting drug sensitivity based on omics consistency pre-training. The method comprises the following steps: constructing a drug graph, a gene expression graph and a gene mutation graph, and acquiring sensitivity data of tumor cell lines to drugs; constructing a tumor cell line encoding module, and pre-training the tumor cell line encoding module based on omics consistency according to the gene expression graph and the gene mutation graph, wherein the omics consistency is any one, or a combination of at least two, of predictive omics consistency, contrastive omics consistency and generative omics consistency; constructing a drug sensitivity prediction model based on the pre-trained tumor cell line encoding module; optimizing the parameters of the drug sensitivity prediction model according to the gene graph, the drug graph and the sensitivity data; and predicting drug sensitivity with the parameter-optimized drug sensitivity prediction model. The method and the device can improve the accuracy of drug sensitivity prediction.

Description

Method and device for predicting drug sensitivity based on omics consistency pre-training

Technical Field

The invention belongs to the technical field of drug sensitivity detection and evaluation, and in particular relates to a method and a device for predicting drug sensitivity based on omics consistency pre-training.

Background

With the rise of personalized medicine, researchers and clinicians have turned their attention to precision therapy. Because tumors are heterogeneous in both time and space, cancer patients may respond differently to the same drug or therapy, which may in turn cause toxic side effects or even accelerate tumor progression. There is therefore a pressing clinical need for a method that rapidly and accurately predicts the drug sensitivity of an individual patient so as to guide clinical medication. With the development of high-throughput technology, sequencing methods have brought explosive growth in omics data. Because patient omics data differ strongly between individuals, predicting drug sensitivity from multi-omics information has become an extremely important task in personalized medicine.

Many public datasets now exist, such as the anticancer drug sensitivity, genomic and transcriptomic datasets CCLE (Cancer Cell Line Encyclopedia) and GDSC (Genomics of Drug Sensitivity in Cancer), the cancer patient dataset TCGA (The Cancer Genome Atlas), and the protein-protein interaction database STRING. They provide a rich basis of molecular-level and clinical sample data for studying the occurrence, development and prognosis of disease. Based on these large datasets, researchers have proposed several machine learning methods that explore the relationship between omics information and drug response and use the genetic information of tumor cell lines to predict the half-maximal inhibitory concentration. However, because the amount of tumor cell line data is insufficient, these methods cannot extract cell line features well enough to characterize the cell lines; the tumor cell line encoding module is insufficiently trained, and drug sensitivity prediction performance suffers. To address this problem, some methods use more than one type of omics data, for example gene expression levels, gene mutations and gene copy numbers together, to characterize a cell line and achieve better prediction accuracy. Examples are the drug sensitivity prediction method and device guided by multi-omics similarity disclosed in patent document CN114255886A, and the drug sensitivity prediction method and device based on transfer learning and graph neural networks disclosed in patent document CN112863696A. However, these methods may overfit the tumor cell line encoding module and still do not adequately solve the above problem. At present, therefore, there is no model that extracts the features of tumor cell lines accurately enough to achieve more accurate drug sensitivity prediction.

With the development of big data and high-performance hardware, pre-trained models that exploit massive unlabeled data have achieved great success in many areas of deep learning, but few pre-trained models for tumor cell lines exist in the field of drug sensitivity prediction.

Disclosure of Invention

In view of the above, the object of the invention is to provide a method and a device for predicting drug sensitivity based on omics consistency pre-training, so as to solve the problem that a drug sensitivity prediction model performs poorly because the tumor cell line encoding module is insufficiently trained.

To achieve the above object, an embodiment provides a method for predicting drug sensitivity based on omics consistency pre-training, comprising the following steps:

obtaining drug small molecule data and constructing a drug graph; obtaining omics information of a tumor cell line, including gene expression levels and gene mutation information, together with proteomics data, and constructing a gene graph, a gene expression graph and a gene mutation graph; and obtaining sensitivity data of the tumor cell line to the drug as label data;

constructing a tumor cell line encoding module, and pre-training the tumor cell line encoding module based on omics consistency according to the gene expression graph and the gene mutation graph, wherein the omics consistency is any one, or a combination of at least two, of predictive omics consistency, contrastive omics consistency and generative omics consistency;

constructing a drug sensitivity prediction model comprising the pre-trained tumor cell line encoding module, a drug small molecule encoding module and a drug sensitivity prediction module, wherein the pre-trained tumor cell line encoding module extracts a cell line representation from the gene graph, the drug small molecule encoding module extracts a drug representation from the drug graph, and the drug sensitivity prediction module calculates, from the cell line representation and the drug representation, a sensitivity prediction result for the drug acting on the tumor cell line;

and taking the gene graph and the drug graph as input, optimizing the parameters of the drug sensitivity prediction model under the supervision of the label data, and predicting drug sensitivity with the parameter-optimized drug sensitivity prediction model.

In one embodiment, when the tumor cell line encoding module is pre-trained based on predictive omics consistency according to the gene expression graph, a predictive training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first mapping head with a first regularization operation and a second mapping head with a second regularization operation;

pre-training the tumor cell line encoding module with the predictive training system comprises:

acquiring inherent features of the tumor cell line as a first supervision label and taking the gene mutation information as a second supervision label, wherein the inherent features comprise cancer type, tissue source, tissue type, sex or age;

inputting the gene expression graph into the predictive training system, where the tumor cell line encoding module extracts a gene expression representation; the first mapping head transforms the gene expression representation and the first regularization operation then predicts the inherent features, while the second mapping head transforms the gene expression representation and the second regularization operation then predicts the gene mutation information;

calculating a first cross-entropy loss from the predicted inherent features and the first supervision label, calculating a second cross-entropy loss from the predicted gene mutation information and the second supervision label, and pre-training the tumor cell line encoding module with the weighted sum of the first and second cross-entropy losses as the predictive omics consistency loss.
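As a non-limiting illustration, the weighted sum of the two cross-entropy losses may be sketched as follows. The sketch assumes that the first head outputs logits over the classes of one inherent feature (for example, cancer type), that the second head outputs one logit per gene for a binary mutated/not-mutated label, and that each regularization operation acts as a softmax or sigmoid normalization. The function names and the weights alpha and beta are hypothetical; the embodiment does not specify them.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target class."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target_index])

def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy over per-gene mutation labels (0 or 1)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)

def predictive_consistency_loss(feature_logits, feature_label,
                                mutation_logits, mutation_labels,
                                alpha=1.0, beta=1.0):
    """Weighted sum of the two losses; alpha and beta are assumed weights."""
    first = cross_entropy(feature_logits, feature_label)
    second = binary_cross_entropy(mutation_logits, mutation_labels)
    return alpha * first + beta * second

# Three hypothetical cancer-type classes and two genes, all logits zero.
loss = predictive_consistency_loss([0.0, 0.0, 0.0], 1, [0.0, 0.0], [1, 0])
```

With uninformative (all-zero) logits the first term equals log 3 and the second log 2, which gives a convenient sanity check before any training is attempted.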

In one embodiment, when the tumor cell line encoding module is pre-trained based on contrastive omics consistency according to the gene expression graph and the gene mutation graph, the gene expression graph and the gene mutation graph are each input to the tumor cell line encoding module to extract a gene expression representation and a gene mutation representation respectively, a contrastive loss is calculated from the two representations, and the tumor cell line encoding module is pre-trained by minimizing the contrastive loss, which serves as the contrastive omics consistency loss.
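The embodiment does not state the form of the contrastive loss. As one possible illustration, the sketch below uses an InfoNCE-style loss, which is an assumption: within a batch, the expression and mutation representations of the same cell line form a positive pair and all other pairings are negatives. The function names and the temperature value are hypothetical, and the representations are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_consistency_loss(expr_reprs, mut_reprs, temperature=0.1):
    """InfoNCE-style loss over a batch of cell lines.

    expr_reprs[i] and mut_reprs[i] come from the same cell line (positive pair);
    mut_reprs[j] with j != i serve as negatives for expr_reprs[i].
    """
    n = len(expr_reprs)
    total = 0.0
    for i in range(n):
        sims = [cosine(expr_reprs[i], mut_reprs[j]) / temperature for j in range(n)]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - sims[i]  # equals -log softmax at the positive
    return total / n

# Two hypothetical cell lines with 2-dimensional representations.
aligned = contrastive_consistency_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_consistency_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when the two omics views of each cell line agree (`aligned`) and large when each view is closer to a different cell line (`swapped`), which is the behavior that minimizing the loss is meant to encourage.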

In one embodiment, when the tumor cell line encoding module is pre-trained based on generative omics consistency according to the gene expression graph and the gene mutation graph, a generative training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first variational autoencoder and a second variational autoencoder;

pre-training the tumor cell line encoding module with the generative training system comprises:

inputting the gene expression graph into the generative training system, where the tumor cell line encoding module extracts a gene expression representation from the gene expression graph, and the first variational autoencoder encodes and decodes the gene expression representation to predict gene mutation data;

inputting the gene mutation graph into the generative training system, where the tumor cell line encoding module extracts a gene mutation representation from the gene mutation graph, and the second variational autoencoder encodes and decodes the gene mutation representation to predict gene expression data;

calculating a first mean squared error loss from the predicted gene mutation data and the gene mutation information serving as a supervision label, calculating a second mean squared error loss from the predicted gene expression data and the gene expression levels serving as a supervision label, and pre-training the tumor cell line encoding module with the weighted sum of the first and second mean squared error losses as the generative omics consistency loss.
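As a non-limiting illustration, the cross-omics reconstruction loss may be sketched as follows. The sketch shows only the sampling step commonly used inside a variational autoencoder (the reparameterization trick) and the weighted sum of the two mean squared errors; the encoder and decoder networks are omitted. The weights w_mut and w_expr and all function names are hypothetical. The source mentions only the mean squared error terms, so no KL-divergence term is included here.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), as in a standard VAE."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def mse(pred, target):
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def generative_consistency_loss(pred_mutation, true_mutation,
                                pred_expression, true_expression,
                                w_mut=1.0, w_expr=1.0):
    """Weighted sum of the two cross-omics reconstruction errors.

    pred_mutation is decoded from the gene expression representation;
    pred_expression is decoded from the gene mutation representation.
    """
    first = mse(pred_mutation, true_mutation)
    second = mse(pred_expression, true_expression)
    return w_mut * first + w_expr * second

# Hypothetical decoded values for one cell line.
loss = generative_consistency_loss([1.0, 2.0], [1.0, 4.0], [0.5], [0.0])
```

Because each decoder reconstructs the other omics type rather than its own input, the encoder is pushed to retain information shared between gene expression and gene mutation.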

In one embodiment, constructing the gene graph, the gene expression graph and the gene mutation graph from the proteomics data and the tumor cell line omics information, which includes gene expression levels and gene mutation information, comprises:

taking genes as the nodes of the gene graph, the gene expression graph and the gene mutation graph, determining from the proteomics data the interactions between the proteins encoded by the genes, determining the connection relations between genes from those protein-protein interactions, and constructing edges between nodes according to the connection relations;

taking the tumor cell line omics information as the node features of the gene graph, the gene expression levels as the node features of the gene expression graph, and the gene mutation information as the node features of the gene mutation graph.

In one embodiment, the tumor cell line encoding module employs a graph attention network and the drug small molecule encoding module employs a graph attention network.

In one embodiment, when the parameters of the drug sensitivity prediction model are optimized under the supervision of the label data with the gene graph and the drug graph as input, the cross entropy between the sensitivity prediction result output by the drug sensitivity prediction model and the label data is taken as the total loss function to optimize the parameters of the pre-trained tumor cell line encoding module, the drug small molecule encoding module and the drug sensitivity prediction module.

In one embodiment, obtaining drug small molecule data and constructing the drug graph comprises: constructing the drug graph with the atoms of the drug small molecule as nodes and the chemical bonds between atoms as edges.

To achieve the above object, an embodiment of the invention further provides a device for predicting drug sensitivity based on omics consistency pre-training, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method for predicting drug sensitivity based on omics consistency pre-training.

Compared with the prior art, the beneficial effects of the invention include at least the following:

The tumor cell line encoding module is pre-trained through the omics consistency among the omics information of a tumor cell line, so that the potential links between the different omics data of the cell line are fully mined and the tumor cell line encoding module can extract cell line representations more accurately. At the same time, the gene expression data and the gene mutation data are fully used, and predictive, contrastive and generative consistency pre-training modes are provided, which take full account of the robustness and generalization of omics data and of the latent hierarchical semantic information in the omics information of tumor cell lines.

The provided drug sensitivity prediction model based on omics consistency pre-training uses the correlations between omics so that the tumor cell line encoding module extracts cell line features efficiently while capturing richer biological information, which yields a more accurate drug sensitivity prediction model and improves its prediction accuracy.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the invention, and a person skilled in the art may derive other drawings from them without inventive effort.

FIG. 1 is a flow chart of the method for predicting drug sensitivity based on omics consistency pre-training provided in an embodiment;

FIG. 2 is a flow chart of pre-training based on predictive omics consistency provided in an embodiment;

FIG. 3 is a flow chart of pre-training based on contrastive omics consistency provided in an embodiment;

FIG. 4 is a flow chart of pre-training based on generative omics consistency provided in an embodiment;

FIG. 5 is a schematic structural diagram of the drug sensitivity prediction model provided in an embodiment.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.

To solve the problem that a drug sensitivity prediction model performs poorly because the tumor cell line encoding module is insufficiently trained, extensive research was carried out. It was found that tumor cell line omics information, which includes gene expression levels from transcriptomics and gene mutation information from genomics, reflects the molecular-level characteristics of a cell line well, and that potential links and similarities exist between the different omics information of tumor cell lines. On this basis, the embodiments of the invention provide a method and a device for predicting drug sensitivity based on omics consistency pre-training, which use the omics consistency of tumor cell lines to train the tumor cell line encoding module fully, so as to extract more accurate cell line representations and improve the prediction accuracy of the drug sensitivity prediction model.

FIG. 1 is a flow chart of the method for predicting drug sensitivity based on omics consistency pre-training provided in an embodiment. As shown in FIG. 1, the method provided in the embodiment comprises the following steps:

Step 1, obtaining omics information and proteomics data of tumor cell lines, and constructing a gene graph, a gene expression graph and a gene mutation graph.

In an embodiment, the omics information of the tumor cell lines is obtained from various datasets, for example the TCGA dataset, which records tumor cell line omics information including gene expression levels and gene mutation information. The proteomics data is obtained from various datasets, for example the STRING database, which records protein-protein interactions.

In an embodiment, when the gene graph, the gene expression graph and the gene mutation graph are constructed from the tumor cell line omics information and the proteomics data, the nodes of all three graphs are genes, but the node features differ: the tumor cell line omics information serves as the node features of the gene graph, the gene expression levels as the node features of the gene expression graph, and the gene mutation information as the node features of the gene mutation graph. The edges are constructed in the same way in all three graphs: the interactions between the proteins encoded by the genes are determined from the proteomics data, and the connection relations between genes are determined from those protein-protein interactions. When the correlation coefficient determined from the protein-protein interaction exceeds a threshold, the two genes are considered connected, and an edge is constructed between the corresponding nodes.
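As a non-limiting illustration, the thresholded edge construction may be sketched as follows. The function name `build_gene_graph`, the score dictionary format and the threshold of 0.7 are hypothetical, and the interaction scores shown are made-up values rather than real database entries. The three graphs share one edge list and differ only in the node features passed in.

```python
def build_gene_graph(genes, ppi_scores, node_features, threshold=0.7):
    """Build (features, edges) for one graph whose nodes are genes.

    genes:         ordered list of gene symbols (one node per gene)
    ppi_scores:    dict mapping (gene_a, gene_b) to an interaction score
    node_features: dict mapping gene symbol to its feature vector
    threshold:     assumed cut-off; an edge is kept only if score > threshold
    """
    index = {gene: i for i, gene in enumerate(genes)}
    edges = set()
    for (a, b), score in ppi_scores.items():
        if a in index and b in index and a != b and score > threshold:
            i, j = index[a], index[b]
            edges.add((min(i, j), max(i, j)))  # undirected, stored once
    features = [node_features[gene] for gene in genes]
    return features, sorted(edges)

genes = ["TP53", "MDM2", "KRAS"]
ppi = {("TP53", "MDM2"): 0.99, ("MDM2", "TP53"): 0.99, ("TP53", "KRAS"): 0.4}
expression = {"TP53": [1.0], "MDM2": [2.0], "KRAS": [3.0]}  # illustrative values
mutation = {"TP53": [1.0], "MDM2": [0.0], "KRAS": [1.0]}    # illustrative values

expr_features, expr_edges = build_gene_graph(genes, ppi, expression)
mut_features, mut_edges = build_gene_graph(genes, ppi, mutation)
```

Keeping the topology identical across the graphs means the same encoder can process any of them, which the pre-training step relies on.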

Step 2, obtaining drug small molecule data and constructing a drug graph.

In an embodiment, the obtained drug small molecule data is usually given as a name or a drug ID. To facilitate construction of the drug graph, the SMILES string of the drug is obtained from a database (for example, the PubChem database). When the drug graph is constructed, the drug small molecule is represented as a 2D graph: nodes and edges are constructed from the atoms and chemical bonds of the drug small molecule respectively, the atom information being encoded as node features and the chemical bond information as edge features.
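As a non-limiting illustration, the encoding of atoms and bonds into a drug graph may be sketched as follows. Parsing a SMILES string would normally be delegated to a cheminformatics toolkit and is not shown; here the heavy atoms and bonds of ethanol (SMILES `CCO`) are listed by hand. The function name `build_drug_graph`, the atom vocabulary and the choice of one-hot atom features with bond order as the edge feature are assumptions.

```python
def build_drug_graph(atoms, bonds, atom_vocab=("C", "N", "O", "S", "F", "Cl")):
    """Turn atoms and bonds into node features, edge index and edge features.

    atoms: list of element symbols, one per atom
    bonds: list of (atom_i, atom_j, bond_order) tuples
    """
    node_features = [[1.0 if symbol == v else 0.0 for v in atom_vocab]
                     for symbol in atoms]
    edge_index = []
    edge_features = []
    for i, j, order in bonds:
        # A chemical bond is undirected, so it is stored in both directions.
        edge_index.append((i, j))
        edge_features.append([float(order)])
        edge_index.append((j, i))
        edge_features.append([float(order)])
    return node_features, edge_index, edge_features

# Ethanol, heavy atoms only: C-C-O with two single bonds.
nodes, edge_index, edge_features = build_drug_graph(
    ["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
```

Richer atom features (for example degree, charge or aromaticity) could replace the one-hot element encoding without changing the graph structure.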

Step 3, obtaining sensitivity data of the tumor cell lines to the drugs as label data.

In an embodiment, the sensitivity data of the tumor cell lines to the drugs is obtained from various datasets, for example the TCGA dataset, which records whether a tumor cell line is sensitive or insensitive to a given drug. The sensitivity data serves as label data for training the drug sensitivity prediction model.

Step 4, pre-training the tumor cell line encoding module based on omics consistency according to the gene expression graph and the gene mutation graph.

In an embodiment, a tumor cell line encoding module is constructed, which may employ a graph attention network (GAT). Once its structure is built, the tumor cell line encoding module is pre-trained based on omics consistency according to the gene expression graph and the gene mutation graph, wherein the omics consistency is any one, or a combination of at least two, of predictive omics consistency, contrastive omics consistency and generative omics consistency. That is, pre-training may use predictive, contrastive or generative omics consistency alone; a weighted sum of predictive and contrastive omics consistency; a weighted sum of predictive and generative omics consistency; or a weighted sum of predictive, contrastive and generative omics consistency. The three individual pre-training modes are described in detail below.
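As a non-limiting illustration, selecting one consistency loss or combining several by weighted sum may be sketched as follows. The function name, the loss names used as dictionary keys and the default weight of 1.0 are hypothetical.

```python
ALLOWED = ("predictive", "contrastive", "generative")

def combined_pretraining_loss(losses, weights=None):
    """Combine the enabled consistency losses into one pre-training loss.

    losses:  dict mapping a loss name in ALLOWED to its current value;
             only the objectives in use need to be present
    weights: optional dict of per-loss weights (assumed default 1.0)
    """
    if not losses:
        raise ValueError("at least one consistency loss is required")
    unknown = [name for name in losses if name not in ALLOWED]
    if unknown:
        raise ValueError(f"unknown consistency loss: {unknown}")
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

single = combined_pretraining_loss({"predictive": 2.0})
mixed = combined_pretraining_loss({"predictive": 2.0, "generative": 4.0},
                                  {"predictive": 0.5, "generative": 0.25})
```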

In an embodiment, as shown in FIG. 2, when the tumor cell line encoding module is pre-trained based on predictive omics consistency according to the gene expression graph, a predictive training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first mapping head with a first regularization operation and a second mapping head with a second regularization operation;

In an embodiment, as shown in FIG. 3, when the tumor cell line encoding module is pre-trained based on contrastive omics consistency according to the gene expression graph and the gene mutation graph, the gene expression graph and the gene mutation graph are each input to the tumor cell line encoding module to extract a gene expression representation and a gene mutation representation respectively, a contrastive loss is calculated from the two representations, and the tumor cell line encoding module is pre-trained by minimizing the contrastive loss, which serves as the contrastive omics consistency loss.

In an embodiment, as shown in FIG. 4, when the tumor cell line encoding module is pre-trained based on generative omics consistency according to the gene expression graph and the gene mutation graph, a generative training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first variational autoencoder and a second variational autoencoder;

In an embodiment, the tumor cell line encoding module is pre-trained through the omics consistency among the omics information of a tumor cell line, so that the potential links between the different omics data of the cell line are fully mined and the module can extract cell line representations more accurately. The three pre-training modes take full account of the robustness and generalization of omics data and of the latent hierarchical semantic information in the omics information of tumor cell lines, and can therefore improve the accuracy with which the tumor cell line encoding module extracts cell line representations.

Step 5, constructing a drug sensitivity prediction model based on the pre-trained tumor cell line encoding module.

In an embodiment, after the pre-trained tumor cell line encoding module is obtained, a drug sensitivity prediction model is constructed from it. As shown in FIG. 5, the drug sensitivity prediction model comprises the pre-trained tumor cell line encoding module, a drug small molecule encoding module and a drug sensitivity prediction module. The pre-trained tumor cell line encoding module extracts a cell line representation from the gene graph, the drug small molecule encoding module extracts a drug representation from the drug graph, and the drug sensitivity prediction module calculates, from the cell line representation and the drug representation, a sensitivity prediction result for the drug acting on the tumor cell line. The sensitivity prediction result is either sensitive or insensitive.

In an embodiment, the drug small molecule encoding module employs a graph attention network, and the drug sensitivity prediction module employs a fully connected layer.
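As a non-limiting illustration of how a graph attention network aggregates neighbor information, a single-head graph attention layer may be sketched as follows. The embodiment does not specify the number of layers, heads or dimensions, so the formulation below (a shared linear projection W, an attention vector a, LeakyReLU scoring and softmax over each node's neighborhood including itself) is the commonly used one and is an assumption here; the output nonlinearity is omitted for brevity.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(features, edges, W, a):
    """One single-head graph attention layer on an undirected graph.

    features: list of node feature vectors
    edges:    list of (i, j) node index pairs
    W:        projection matrix (out_dim rows, in_dim columns)
    a:        attention vector of length 2 * out_dim
    """
    n = len(features)
    neighbors = {i: {i} for i in range(n)}  # every node attends to itself
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    projected = [matvec(W, h) for h in features]
    outputs = []
    for i in range(n):
        nbrs = sorted(neighbors[i])
        # Unnormalized attention score for each neighbor j of node i.
        scores = [leaky_relu(sum(c * v for c, v in
                                 zip(a, projected[i] + projected[j])))
                  for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        # Attention-weighted sum of the projected neighbor features.
        outputs.append([sum(al * projected[j][k] for al, j in zip(alphas, nbrs))
                        for k in range(len(projected[i]))])
    return outputs

# With a zero attention vector the weights are uniform, so each output is the
# mean of the projected features over the node's neighborhood.
out = gat_layer([[1.0], [3.0], [5.0]], [(0, 1)], W=[[1.0]], a=[0.0, 0.0])
```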

Step 6, optimizing the parameters of the drug sensitivity prediction model according to the gene graph, the drug graph and the label data.

In an embodiment, when the parameters of the drug sensitivity prediction model are optimized, the gene graph and the drug graph are taken as input. The pre-trained tumor cell line encoding module extracts the cell line representation from the gene graph, and the drug small molecule encoding module extracts the drug representation from the drug graph. The cell line representation and the drug representation are concatenated and input to the drug sensitivity prediction module, which calculates and outputs the sensitivity prediction result of the tumor cell line to the drug small molecule. The cross entropy between the sensitivity prediction result and the label data is taken as the total loss function to optimize the parameters of the pre-trained tumor cell line encoding module, the drug small molecule encoding module and the drug sensitivity prediction module.
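As a non-limiting illustration, the forward pass of the drug sensitivity prediction module and its cross-entropy loss may be sketched as follows. The two representations are assumed to have already been produced by the encoders. The function names, the single fully connected layer with two output classes, and the class order (index 0 for insensitive, index 1 for sensitive) are assumptions; gradient computation and parameter updates are omitted.

```python
import math

def predict_sensitivity(cell_repr, drug_repr, weights, bias):
    """Concatenate the two representations and apply one fully connected layer.

    Returns softmax probabilities over [insensitive, sensitive] (assumed order).
    """
    joint = cell_repr + drug_repr  # list concatenation of the two vectors
    logits = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy between predicted probabilities and the true class index."""
    return -math.log(probs[label])

# Hypothetical 1-dimensional cell line and 2-dimensional drug representations.
untrained = predict_sensitivity([0.3], [0.1, -0.2],
                                [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
loss = cross_entropy(untrained, 1)

trained = predict_sensitivity([2.0], [0.0, 0.0],
                              [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [0.0, 0.0])
```

In a full implementation the loss would be back-propagated through the prediction module and both encoders, so that the pre-trained tumor cell line encoding module is fine-tuned together with the other modules.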

Step 7, predicting drug sensitivity with the parameter-optimized drug sensitivity prediction model.

In an embodiment, when the parameter-optimized drug sensitivity prediction model is used for prediction, the omics information of a tumor cell line is first converted into a gene graph and the drug small molecule data into a drug graph. The tumor cell line encoding module extracts features from the input gene graph to obtain the cell line representation, and the drug small molecule encoding module extracts features from the drug graph to obtain the drug representation. The drug sensitivity prediction module then calculates and outputs, from the concatenation of the cell line representation and the drug representation, the sensitivity prediction result for the drug acting on the tumor cell line, thereby realizing drug sensitivity prediction.

Based on the same inventive concept, an embodiment further provides a device for predicting drug sensitivity based on omics consistency pre-training, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above method for predicting drug sensitivity based on omics consistency pre-training, the method comprising the following steps:

In a specific application, the memory may be local volatile memory such as RAM, non-volatile memory such as ROM, flash memory, a floppy disk or a mechanical hard disk, or remote cloud storage. The processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP) or a field-programmable gate array (FPGA); that is, the steps of the method for predicting drug sensitivity based on omics consistency pre-training may be implemented by these processors.

The method and the device provided by the embodiments use a drug sensitivity prediction model based on omics consistency pre-training. While the omics information of a cell line is extracted efficiently, the correlations between omics give the tumor cell line encoding module richer biological information, which yields a more accurate drug sensitivity prediction model and improves its prediction accuracy.

The foregoing describes the technical solutions and beneficial effects of the invention in detail. It should be understood that the above are merely preferred embodiments of the invention and are not intended to limit it; any modifications, additions and equivalent substitutions made within the principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for predicting drug sensitivity based on omics consistency pre-training, comprising the steps of:

2. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein, when the tumor cell line encoding module is pre-trained based on predictive omics consistency according to the gene expression graph, a predictive training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first mapping head with a first regularization operation and a second mapping head with a second regularization operation;

3. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein, when the tumor cell line encoding module is pre-trained based on contrastive omics consistency according to the gene expression graph and the gene mutation graph, the gene expression graph and the gene mutation graph are each input to the tumor cell line encoding module to extract a gene expression representation and a gene mutation representation respectively, a contrastive loss is calculated from the two representations, and the tumor cell line encoding module is pre-trained by minimizing the contrastive loss, which serves as the contrastive omics consistency loss.

4. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein, when the tumor cell line encoding module is pre-trained based on generative omics consistency according to the gene expression graph and the gene mutation graph, a generative training system is constructed, comprising the tumor cell line encoding module and, connected to its output, a first variational autoencoder and a second variational autoencoder;

5. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein constructing the gene graph, the gene expression graph and the gene mutation graph from the proteomics data and the tumor cell line omics information, which includes gene expression levels and gene mutation information, comprises:

6. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein the tumor cell line encoding module employs a graph attention network and the drug small molecule encoding module employs a graph attention network.

7. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein, when the parameters of the drug sensitivity prediction model are optimized under the supervision of the label data with the gene graph and the drug graph as input, the cross entropy between the sensitivity prediction result output by the drug sensitivity prediction model and the label data is taken as the total loss function to optimize the parameters of the pre-trained tumor cell line encoding module, the drug small molecule encoding module and the drug sensitivity prediction module.

8. The method for predicting drug sensitivity based on omics consistency pre-training according to claim 1, wherein obtaining drug small molecule data and constructing the drug graph comprises: constructing the drug graph with the atoms of the drug small molecule as nodes and the chemical bonds between atoms as edges.

9. A device for predicting drug sensitivity based on omics consistency pre-training, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for predicting drug sensitivity based on omics consistency pre-training according to any one of claims 1-8.