CN113782089A

CN113782089A - Drug sensitivity prediction method and device based on multi-omics data fusion

Info

Publication number: CN113782089A
Application number: CN202111349387.8A
Authority: CN
Inventors: 吴健; 冯芮苇; 谢雨峰; 赖泯汕; 郭越; 曹戟; 何俏军; 杨波
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2021-12-10
Anticipated expiration: 2041-11-15
Also published as: CN113782089B

Abstract

The invention discloses a drug sensitivity prediction method and device based on multi-omics data fusion, belonging to the field of drug sensitivity detection. A cell line graph characterization module integrates three kinds of omics information for an individual cell line, namely genomics data, proteomics data and metabolomics data, into a cell line multi-edge graph, so that both the multi-omics information of the cell line and the potential relationships among gene expression products at each omics level are taken into account. A cell line graph feature extraction module then extracts the node features and edge features of the cell line multi-edge graph as cell line features. Finally, a drug sensitivity prediction module predicts the half-maximal inhibitory concentration (IC50) of a drug from the cell line features and the drug features extracted by a drug feature extraction module. The prediction accuracy of drug sensitivity is thereby improved by jointly considering the genomics, proteomics and metabolomics data.

Description

Drug sensitivity prediction method and device based on multi-omics data fusion

Technical Field

The invention belongs to the technical field of drug sensitivity detection and evaluation, and particularly relates to a drug sensitivity prediction method and device based on multi-omics data fusion.

Background

Treating cancer is a major problem that researchers around the world are working to solve, and the development of high-throughput sequencing and artificial intelligence has opened new possibilities for precision cancer treatment. An important question for researchers and industry worldwide is how to use an individual's abundant biological information, together with efficient analysis methods such as deep learning, to automatically learn the individual's specific characteristics and formulate an individualized diagnosis and treatment plan, thereby achieving accurate diagnosis and accurate treatment. Many researchers have contributed to this problem by trying to apply individual genomic data to personalized diagnosis and medication recommendation. However, existing research still faces an important unsolved problem: how to make full use of each individual's complex and diverse omics data to predict drug efficacy and recommend drugs more accurately.

With the progress of genomics research, public datasets are increasingly applied in bioinformatics research, such as the Cancer Cell Line Encyclopedia (CCLE), Genomics of Drug Sensitivity in Cancer (GDSC) and The Cancer Genome Atlas (TCGA), as well as the proteomics dataset STRING, used to study interactions between human genes/proteins, and the GSEA dataset, a metabolomics dataset used to study human pathway information. For example, patent application publication No. CN105005693A discloses a tumor cell drug sensitivity assessment method based on genetic material specificity, which uses only a tumor cell sample set to predict the half-maximal inhibitory concentration (IC50 value), and patent application publication No. CN112164474A discloses a drug sensitivity prediction method based on a self-expression model, which uses the GDSC dataset and the Cancer Cell Line Encyclopedia to predict the IC50 value.

These datasets are still being expanded and developed, and they provide a rich sample base for studying the occurrence, development, prognosis and outcome of diseases. However, the existing data are rarely fully utilized to solve the problems of drug sensitivity prediction and drug recommendation. For example, existing methods use only the individual genomics data provided in the CCLE and GDSC databases to predict the half-maximal inhibitory concentration (IC50) through genomic analysis, and such methods often ignore possible associations between an individual's genes at other omics levels. Consequently, although such methods have made some progress, their IC50 prediction accuracy is still insufficient. At present there is no good model that sufficiently fuses an individual's multi-omics information to predict drug sensitivity (IC50) more accurately.

Disclosure of Invention

In view of the above, the present invention aims to provide a drug sensitivity prediction method and device based on multi-omics data fusion, so as to solve the problem of poor drug sensitivity prediction accuracy caused by neglecting potential connections between genes.

In order to achieve the purpose, the invention provides the following technical scheme:

In a first aspect, an embodiment provides a drug sensitivity prediction method based on multi-omics data fusion, including the following steps:

acquiring multi-omics data of a cell line, drug data, and half-maximal inhibitory concentration (IC50) data of the drug on the cell line, wherein the multi-omics data of the cell line comprise genomics data, proteomics data and metabolomics data;

constructing a drug sensitivity prediction model comprising a cell line graph characterization module, a cell line graph feature extraction module, a drug feature extraction module and a drug sensitivity prediction module, wherein the cell line graph characterization module is used for encoding the multi-omics data of the cell line into a cell line multi-edge graph, that is, the genes of each sample are used as nodes of the cell line multi-edge graph, the gene expression level, gene mutation status and copy number variation status corresponding to each gene are used as node features, and connecting edges between nodes are constructed according to the correlation between genes determined from the genomics data, the protein interactions between genes determined from the proteomics data, and the metabolic pathway information between genes determined from the metabolomics data; the cell line graph feature extraction module is used for extracting cell line features from the cell line multi-edge graph; the drug feature extraction module is used for extracting drug features from the drug data; and the drug sensitivity prediction module is used for predicting the IC50 of the drug according to the cell line features and the drug features;

performing parameter optimization on the drug sensitivity prediction model by taking the multi-omics data of the cell line and the drug data as sample data and taking the IC50 data of the drug on the cell line as the ground-truth label;

and performing drug sensitivity prediction by using the parameter-optimized drug sensitivity prediction model.

In one embodiment, in the cell line graph characterization module, the Pearson correlation coefficient between the gene expression data of two genes is calculated from the genomics data to determine the correlation between the genes, and when the Pearson correlation coefficient is greater than a set threshold, a connecting edge is constructed between the nodes corresponding to the two genes;

the interaction between two genes is acquired from the proteomics data as the protein interaction, a connecting edge is constructed between the nodes corresponding to two genes that have a protein interaction, and the interaction score is used as the weight of the connecting edge;

metabolic pathway information between genes is acquired from the metabolomics data, and when multiple genes appear together in a metabolic pathway, a hyperedge is constructed between the nodes corresponding to these genes to serve as a connecting edge.

In one embodiment, the cell line graph feature extraction module comprises a first graph neural network unit and gated recurrent units, wherein the first graph neural network unit is composed of a plurality of graph convolutional layers, and every two adjacent graph convolutional layers are connected through a gated recurrent unit; the first graph neural network unit is used for extracting cell line features from the cell line multi-edge graph, and the gated recurrent unit is used for applying feature attention to the extracted cell line features.

In one embodiment, in each graph convolutional layer, three-step feature aggregation is performed on the node features, including:

first-step feature aggregation: all first-type first-order neighbor nodes of the current node are determined according to the first connecting edges, which are constructed from the correlation between genes, and feature aggregation is performed by formula (1);

(1) [equation not reproduced in this text]

wherein the terms of formula (1) denote, respectively: the current node feature of the i-th current node; the node feature of the j-th first-type first-order neighbor node; the weight of the first connecting edge between the i-th current node and the j-th first-type first-order neighbor node; the number of first-type first-order neighbor nodes; and the new node feature after the first-step feature aggregation;

second-step feature aggregation: all second-type first-order neighbor nodes of the current node are determined according to the second connecting edges, which are constructed from the protein interactions between genes, and feature aggregation is performed by formula (2);

(2) [equation not reproduced in this text]

wherein the terms of formula (2) denote, respectively: the new node feature obtained after the first-step new node feature of the i-th current node receives attention from the node gating unit; the node feature of the k-th second-type first-order neighbor node; the weight of the second connecting edge between the i-th current node and the k-th second-type first-order neighbor node; the number of second-type first-order neighbor nodes; and the new node feature after the second-step feature aggregation;

third-step feature aggregation: all third-type first-order neighbor nodes of the current node are determined according to the third connecting edges, which are constructed from the metabolic pathway information between genes, and feature aggregation is performed by formula (3);

(3) [equation not reproduced in this text]

wherein the terms of formula (3) denote, respectively: the new node feature obtained after the second-step new node feature of the i-th current node receives attention from the node gating unit; the node feature of the t-th third-type first-order neighbor node; the weight of the third connecting edge between the i-th current node and the t-th third-type first-order neighbor node; the number of third-type first-order neighbor nodes; and the new node feature after the third-step feature aggregation;

the new node feature of the current node after the third-step feature aggregation, once it has received attention from the node gating unit, serves as the current node feature for the next graph convolutional layer.

In one embodiment, the drug feature extraction module comprises a conversion unit and a second graph neural network unit, wherein the conversion unit is used for converting the drug data into a drug molecular graph, and the second graph neural network unit is used for extracting the drug features from the input drug molecular graph.

In one embodiment, the conversion unit encodes the drug data into a drug molecular graph using the open-source library RDKit; the second graph neural network unit is constructed based on the graph isomorphism principle.

In one embodiment, the drug sensitivity prediction module comprises a plurality of fully connected layers, which perform feature fusion and regression on the concatenation of the input cell line features and drug features to predict the half-maximal inhibitory concentration (IC50) of the drug.

In one embodiment, after the multi-omics data of the cell line, the drug data and the IC50 data of the drug on the cell line are acquired, outliers and missing values are removed from the data, and the processed data are used for constructing training samples.

In one embodiment, when the parameters of the drug sensitivity prediction model are optimized, the model parameters are updated by taking the mean square error between the predicted IC50 value and the corresponding ground-truth label as the loss function.

In a second aspect, an embodiment provides a drug sensitivity prediction device based on multi-omics data fusion, including:

a data acquisition unit, used for acquiring multi-omics data of a cell line, drug data and half-maximal inhibitory concentration (IC50) data of the drug on the cell line, wherein the multi-omics data of the cell line comprise genomics data, proteomics data and metabolomics data;

a model construction unit, used for constructing a drug sensitivity prediction model comprising a cell line graph characterization module, a cell line graph feature extraction module, a drug feature extraction module and a drug sensitivity prediction module, wherein the cell line graph characterization module is used for encoding the multi-omics data of the cell line into a cell line multi-edge graph, that is, the genes of each sample are used as nodes, the gene expression level, gene mutation status and copy number variation status corresponding to each gene are used as node features, and connecting edges between nodes are constructed according to the correlation between genes determined from the genomics data, the protein interactions between genes determined from the proteomics data, and the metabolic pathway information between genes determined from the metabolomics data; the cell line graph feature extraction module is used for extracting cell line features from the cell line multi-edge graph; the drug feature extraction module is used for extracting drug features from the drug data; and the drug sensitivity prediction module is used for predicting the IC50 of the drug according to the cell line features and the drug features;

an optimization learning unit, used for performing parameter optimization on the drug sensitivity prediction model by taking the multi-omics data of the cell line and the drug data as sample data and taking the IC50 data of the drug on the cell line as the ground-truth label;

and a prediction unit, used for performing drug sensitivity prediction by using the parameter-optimized drug sensitivity prediction model.

In a third aspect, an embodiment provides a drug sensitivity prediction device based on multi-omics data fusion, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the drug sensitivity prediction method based on multi-omics data fusion of the first aspect.

Compared with the prior art, the invention has at least the following beneficial effects:

A cell line graph characterization module integrates three kinds of omics information for an individual cell line, namely genomics data, proteomics data and metabolomics data, into a cell line multi-edge graph, so that both the multi-omics information of the cell line and the potential relationships among gene expression products at each omics level are taken into account. A cell line graph feature extraction module then extracts the node features and edge features of the cell line multi-edge graph as cell line features. Finally, a drug sensitivity prediction module predicts the half-maximal inhibitory concentration (IC50) of the drug from the cell line features and the drug features extracted by the drug feature extraction module. The prediction accuracy of drug sensitivity is thereby improved by jointly considering the genomics, proteomics and metabolomics data.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the drug sensitivity prediction method based on multi-omics data fusion provided by an embodiment;

FIG. 2 is a schematic structural diagram of the drug sensitivity prediction model provided by an embodiment;

FIG. 3 is a schematic diagram of the construction of the cell line multi-edge graph in the cell line graph characterization module provided by an embodiment;

FIG. 4 is a schematic diagram of feature extraction in the cell line graph feature extraction module provided by an embodiment;

FIG. 5 is a schematic structural diagram of the drug sensitivity prediction device based on multi-omics data fusion provided by an embodiment.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to illustrate the invention and not to limit its scope.

Existing drug sensitivity prediction models do not consider the complex characteristics of an individual's multi-omics information or the various potential relationships between genes that may exist at the multi-omics level, so their accuracy is low and drug sensitivity is difficult to predict accurately. To solve these problems, the embodiments provide a drug sensitivity prediction method and device based on multi-omics data fusion. Multi-omics information such as genomics, proteomics and metabolomics and the potential relationships that may exist between genes are taken into account, and data features are extracted by combining a graph neural network unit with a gated recurrent unit, thereby improving the prediction accuracy of the drug sensitivity prediction model.

FIG. 1 is a flowchart of the drug sensitivity prediction method based on multi-omics data fusion provided in the embodiment. As shown in FIG. 1, the method comprises the following steps:

S110, acquiring multi-omics data of a cell line, drug data and half-maximal inhibitory concentration (IC50) data of the drug on the cell line, and constructing training samples.

For each cell line sample, the corresponding multi-omics data, drug data and half-maximal inhibitory concentration (IC50) data of the drug on the cell line can be obtained. The multi-omics data comprise genomics data, proteomics data and metabolomics data. Drug data refers to the name of the drug acting on the cell line, from which the molecular formula of the drug can be obtained. The IC50 data characterize the resistance of the cell line to the drug: the smaller the IC50, the more sensitive the cell line is to the drug. For all genes contained in the cell line, the genomics data comprise the gene expression level, gene mutation status and copy number variation status; the proteomics data reflect the protein-protein interactions (PPIs) between genes at the protein level; and the metabolomics data reflect the correspondence between genes at the metabolic pathway level, that is, whether multiple genes lie on the same metabolic pathway.

In an embodiment, the data may be acquired from multiple datasets, for example: the CCLE dataset records genomics data of cell lines, including gene expression level, copy number variation status and gene mutation status; the STRING dataset records interactions between human genes/proteins; the GSEA dataset records human metabolic pathway information; and the GDSC dataset records the IC50 value of a cell line for a given drug. Drug data in these datasets are generally given by name, so to facilitate extraction of the drug molecular graph, the molecular formula of each drug also needs to be obtained from a database (such as the PubChem database) as the object of study.

To optimize the parameters of the drug sensitivity prediction model, the acquired data must be divided into samples. Specifically, cell line information, drug data and IC50 data are extracted from each record of the GDSC dataset, and the multi-omics data corresponding to each cell line are then obtained from the CCLE, STRING and GSEA datasets. The data corresponding to each record form one training sample, in which the multi-omics data of the cell line and the drug data are the sample data and the IC50 data of the drug on the cell line is the ground-truth label.
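The sample construction described above can be sketched as a join between response records and per-cell-line omics data. This is only an illustrative sketch: the record layout, the field names (`cell_line`, `drug`, `ic50`) and the helper `build_samples` are hypothetical and are not the actual GDSC or CCLE file formats.

```python
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    cell_line: str
    drug: str
    omics: dict   # multi-omics data of the cell line (sample data)
    ic50: float   # ground-truth label


def build_samples(response_records: List[dict],
                  omics_by_cell_line: Dict[str, dict]) -> List[Sample]:
    """Turn each response record into one training sample.

    Records whose cell line has no omics data are skipped (an assumption;
    the source does not say how such records are handled).
    """
    samples = []
    for rec in response_records:
        omics = omics_by_cell_line.get(rec["cell_line"])
        if omics is None:
            continue
        samples.append(Sample(rec["cell_line"], rec["drug"], omics, rec["ic50"]))
    return samples


# Hypothetical toy data.
response_records = [
    {"cell_line": "CL_A", "drug": "drug_X", "ic50": 1.25},
    {"cell_line": "CL_B", "drug": "drug_X", "ic50": 0.40},
    {"cell_line": "CL_C", "drug": "drug_Y", "ic50": 2.10},  # no omics data
]
omics_by_cell_line = {
    "CL_A": {"expression": {"G1": 5.2}},
    "CL_B": {"expression": {"G1": 3.1}},
}
samples = build_samples(response_records, omics_by_cell_line)
```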

In one possible embodiment, to improve the quality of the training samples and thus the training effect of the model, after the multi-omics data of the cell line, the drug data and the IC50 data of the drug on the cell line are acquired, outliers and missing values are removed from the data, and the processed data are used for constructing the training samples.
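A minimal sketch of such a cleaning step follows. The source does not state which outlier criterion is used; the z-score rule, the `z_max` parameter and the `clean_records` helper below are assumptions chosen purely for illustration.

```python
import math
from statistics import mean, pstdev


def clean_records(records, z_max=3.0):
    """Drop records with a missing IC50, then drop z-score outliers.

    The z-score criterion is an assumed stand-in for the unspecified
    outlier rule in the source.
    """
    present = [r for r in records
               if r.get("ic50") is not None and not math.isnan(r["ic50"])]
    if len(present) < 2:
        return present
    values = [r["ic50"] for r in present]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return present
    return [r for r in present if abs(r["ic50"] - mu) / sigma <= z_max]


# Hypothetical records: two missing values and one extreme value.
raw_values = [1.0, 1.1, 0.9, 1.05, 0.95, None, float("nan"), 50.0]
records = [{"cell_line": f"CL_{i}", "drug": "drug_X", "ic50": v}
           for i, v in enumerate(raw_values)]
cleaned = clean_records(records, z_max=2.0)
```

With only six valid values, a single extreme point can reach a z-score of at most about 2.24, which is why the toy call uses `z_max=2.0` rather than the default.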

S120, constructing a drug sensitivity prediction model.

FIG. 2 is a schematic structural diagram of the drug sensitivity prediction model provided in the embodiment. As shown in FIG. 2, the drug sensitivity prediction model includes a cell line graph characterization module, a cell line graph feature extraction module, a drug feature extraction module and a drug sensitivity prediction module. The cell line graph characterization module is used for encoding the multi-omics data of the cell line into a cell line multi-edge graph, thereby representing the cell line as a multi-edge graph that fuses the multi-omics data; the cell line graph feature extraction module is used for extracting cell line features from the cell line multi-edge graph; the drug feature extraction module is used for extracting drug features from the drug data; and the drug sensitivity prediction module is used for predicting the half-maximal inhibitory concentration (IC50) of the drug according to the cell line features and the drug features.

FIG. 3 is a schematic diagram of the construction of the cell line multi-edge graph in the cell line graph characterization module provided in the embodiment. As shown in FIG. 3, the nodes and node features are constructed first. Specifically, the genes of each sample are used as the nodes of the cell line multi-edge graph, and three node features are constructed from the genomics data of each gene: the gene expression level, the gene mutation status and the copy number variation status. The gene mutation status indicates whether a gene mutation has occurred, and the copy number variation status indicates whether a copy number variation is present.
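As a sketch of the node features just described, each gene can be mapped to a three-element vector. The function name `build_node_features`, the input layout and the 0/1 encoding of the two binary features are assumptions for illustration.

```python
def build_node_features(genes, expression, mutated, cnv_altered):
    """Return {gene: [expression level, mutation flag, CNV flag]}.

    `mutated` and `cnv_altered` are sets of gene names; the flags are
    encoded as 1.0 (present) or 0.0 (absent), which is an assumed encoding.
    """
    return {
        g: [float(expression[g]),
            1.0 if g in mutated else 0.0,
            1.0 if g in cnv_altered else 0.0]
        for g in genes
    }


# Hypothetical toy cell line with three genes.
genes = ["G1", "G2", "G3"]
expression = {"G1": 5.2, "G2": 0.7, "G3": 3.3}
node_features = build_node_features(genes, expression,
                                    mutated={"G2"}, cnv_altered={"G2", "G3"})
```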

The connecting edges between nodes are then constructed. Specifically, connecting edges are constructed according to the correlation between genes determined from the genomics data, the protein interactions between genes determined from the proteomics data, and the metabolic pathway information between genes determined from the metabolomics data.

When connecting edges are constructed according to the correlation between genes, the Pearson correlation coefficient between the gene expression data of two genes is calculated to determine their correlation. When the Pearson correlation coefficient is greater than a set threshold, a connecting edge is constructed between the nodes corresponding to the two genes and its weight is set to 1; where no connecting edge exists, the weight is 0.
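The correlation-based edges can be sketched as follows. The threshold value 0.8 and the toy expression profiles are hypothetical; the source only says that a set threshold is used. Absent edges are simply not stored, which corresponds to a weight of 0.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / math.sqrt(vx * vy)


def correlation_edges(expr_by_gene, threshold):
    """Return {(gene_a, gene_b): 1.0} for pairs whose correlation exceeds threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expr_by_gene), 2):
        if pearson(expr_by_gene[g1], expr_by_gene[g2]) > threshold:
            edges[(g1, g2)] = 1.0
    return edges


# Hypothetical expression profiles of three genes across four cell lines.
expr_by_gene = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 1.0, 3.0, 2.0],
}
corr_edges = correlation_edges(expr_by_gene, threshold=0.8)
```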

When connecting edges are constructed according to the protein interactions between genes, the interaction between two genes is obtained from the proteomics data as the protein interaction, a connecting edge is constructed between the nodes corresponding to two genes that have a protein interaction, and the interaction score is used as the weight of the connecting edge.
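A sketch of the protein-interaction edges, assuming the interactions arrive as `(gene_a, gene_b, score)` tuples. This layout is hypothetical and is not the STRING file format, and keeping the highest score when a pair is listed twice is likewise an assumption.

```python
def ppi_edges(interactions, genes):
    """Return {(gene_a, gene_b): score} for interactions between graph nodes."""
    gene_set = set(genes)
    edges = {}
    for a, b, score in interactions:
        if a == b or a not in gene_set or b not in gene_set:
            continue  # skip self-loops and genes that are not nodes
        key = tuple(sorted((a, b)))  # undirected edge
        edges[key] = max(score, edges.get(key, 0.0))
    return edges


# Hypothetical interaction list.
interactions = [
    ("G1", "G2", 0.9),
    ("G2", "G1", 0.7),   # duplicate pair in reverse order
    ("G2", "G9", 0.8),   # G9 is not a node of this graph
    ("G3", "G3", 0.5),   # self-interaction
]
ppi = ppi_edges(interactions, genes=["G1", "G2", "G3"])
```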

When connecting edges are constructed according to the metabolic pathway information between genes, the metabolic pathway information is obtained from the metabolomics data, and when multiple genes appear together in a metabolic pathway, a hyperedge is constructed between the nodes corresponding to these genes to serve as a connecting edge. In the embodiment, the connection and weight between any two genes are obtained in two steps: hyperedge expansion (clique expansion) and edge merging. Specifically, a hyperedge formed by multiple genes is first expanded into a fully connected graph in which the nodes corresponding to these genes are pairwise interconnected, with an ordinary connecting edge between every two nodes. After all hyperedges have been expanded, several connecting edges may exist between the nodes corresponding to two genes; these are merged, and the number of connecting edges between the two nodes is used as the weight of the merged connecting edge.
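Clique expansion followed by edge merging can be sketched in a few lines. The pathway names and memberships are hypothetical; the merged weight is the number of pathways shared by the two genes, as described above.

```python
from collections import Counter
from itertools import combinations


def expand_and_merge(pathways, genes):
    """Clique-expand each pathway hyperedge, then merge parallel edges.

    Returns {(gene_a, gene_b): number of pathways containing both genes}.
    """
    gene_set = set(genes)
    weights = Counter()
    for members in pathways.values():
        present = sorted(set(members) & gene_set)
        # Clique expansion: one ordinary edge per pair within the hyperedge.
        for pair in combinations(present, 2):
            weights[pair] += 1  # edge merging: count parallel edges
    return dict(weights)


# Hypothetical pathway memberships.
pathways = {
    "P1": ["G1", "G2", "G3"],
    "P2": ["G2", "G3"],
    "P3": ["G4"],            # a single gene yields no edge
}
pathway_edges = expand_and_merge(pathways, genes=["G1", "G2", "G3", "G4"])
```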

FIG. 4 is a schematic diagram of feature extraction in the cell line graph feature extraction module provided in the embodiment. As shown in FIG. 4, in view of the specific structure of the constructed cell line multi-edge graph, the cell line graph feature extraction module includes a first graph neural network unit and gated recurrent units. The first graph neural network unit includes a plurality of graph convolutional layers, for example 8 graph convolutional layers, for extracting node features from the cell line multi-edge graph as cell line features. A gated recurrent unit connects every two adjacent graph convolutional layers and applies feature attention by giving different degrees of attention to the extracted node features; that is, the node features extracted by one graph convolutional layer serve as the basis for feature extraction in the next graph convolutional layer after they have received feature attention from the gated recurrent unit, so that effective features receive high attention during feature extraction.

As shown in FIG. 4, in each graph convolutional layer, three-step feature aggregation is performed on the node features, with each step aggregating node feature information over a different type of edge in the cell line multi-edge graph. Specifically, this includes:

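first-step feature aggregation, over the first connecting edges constructed from the correlation between genes, by formula (1):

(1) [equation not reproduced in this text]

wherein the terms of formula (1) denote, respectively: the current node feature of the i-th current node; the new node feature after the first-step feature aggregation; the node feature of the j-th first-type first-order neighbor node; and the number of first-type first-order neighbor nodes;

second-step feature aggregation, over the second connecting edges constructed from the protein interactions between genes, by formula (2):

(2) [equation not reproduced in this text]

wherein the terms of formula (2) denote, respectively: the new node feature obtained after the first-step new node feature of the i-th current node receives attention from the node gating unit; the node feature of the k-th second-type first-order neighbor node; the weight of the second connecting edge between the i-th current node and the k-th second-type first-order neighbor node; and the number of second-type first-order neighbor nodes;

third-step feature aggregation, over the third connecting edges constructed from the metabolic pathway information between genes, by formula (3):

(3) [equation not reproduced in this text]

wherein the terms of formula (3) denote, respectively: the new node feature obtained after the second-step new node feature of the i-th current node receives attention from the node gating unit; the node feature of the t-th third-type first-order neighbor node; and the number of third-type first-order neighbor nodes;

the new node feature of the current node after the third-step feature aggregation, once it has received attention from the node gating unit, serves as the current node feature for the next graph convolutional layer.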

The three aggregation steps respectively aggregate the node features of the first-order neighbor nodes reached through each of the three edge types, so that each graph convolutional layer aggregates, once, the node features of all first-order neighbor nodes across the three edge types. After each step, feature attention is applied at the node level by the node gating unit, so that the node features aggregated over the different edge types are given appropriately different weights.
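The flow of one such layer can be sketched as below. Because the exact formulas are not available in this text, everything numerical here is an assumption: the aggregation is taken to be a weighted mean over first-order neighbors, and the node gating unit is replaced by a simple sigmoid-weighted blend of old and aggregated features with a fixed `gate_logit` per edge type (in a trained model this would be a learned gated recurrent unit). Only the order of operations (aggregate over one edge type, gate, move to the next edge type) follows the description.

```python
import math


def aggregate(features, edges):
    """Weighted mean of first-order neighbor features over one edge type (assumed form)."""
    neighbors = {node: [] for node in features}
    for (a, b), w in edges.items():
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))
    out = {}
    for node, h in features.items():
        nbrs = neighbors[node]
        if not nbrs:
            out[node] = list(h)  # isolated for this edge type: keep feature
            continue
        out[node] = [sum(w * features[n][d] for n, w in nbrs) / len(nbrs)
                     for d in range(len(h))]
    return out


def gate(old, new, gate_logit):
    """Simplified stand-in for the node gating unit: blend old and new features."""
    z = 1.0 / (1.0 + math.exp(-gate_logit))
    return {n: [z * a + (1.0 - z) * b for a, b in zip(new[n], old[n])]
            for n in old}


def multi_edge_layer(features, edge_sets, gate_logits):
    """One layer: aggregate and gate over each edge type in turn."""
    h = features
    for edges, logit in zip(edge_sets, gate_logits):
        h = gate(h, aggregate(h, edges), logit)
    return h


# Hypothetical three-gene graph with one edge of each type.
features = {"G1": [1.0, 0.0], "G2": [3.0, 1.0], "G3": [5.0, 0.0]}
corr = {("G1", "G2"): 1.0}      # correlation edge, weight 1
ppi = {("G2", "G3"): 0.5}       # interaction score as weight
pathway = {("G1", "G3"): 2.0}   # shared-pathway count as weight
out = multi_edge_layer(features, [corr, ppi, pathway], gate_logits=[0.0, 0.0, 0.0])
```

Stacking several such layers, with the output of one used as the input of the next, corresponds to the plurality of graph convolutional layers described above.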

As shown in FIG. 2, the drug feature extraction module includes a conversion unit and a second graph neural network unit, wherein the conversion unit is used for converting the drug data into a drug molecular graph, and the second graph neural network unit is used for extracting the drug features from the input drug molecular graph. In one possible embodiment, the conversion unit encodes the drug data into a drug molecular graph using the open-source library RDKit, and the second graph neural network unit, constructed based on the graph isomorphism principle, then performs feature extraction on the drug molecular graph to obtain the drug features.

The second graph neural network unit constructed based on the graph isomorphism principle comprises a plurality of Graph Isomorphism Network (GIN) structures. Each GIN structure comprises a GIN convolutional layer (GINConv), a batch normalization layer (BN) and a ReLU activation layer (ReLU), and each GAT module comprises a GAT convolutional layer (GATConv), a batch normalization layer (BN) and a ReLU activation layer (ReLU).
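For orientation, the standard GIN node update sums a node's own feature, scaled by (1 + eps), with its neighbors' features and passes the result through an MLP. The sketch below reduces the MLP to a single linear map with ReLU and omits batch normalization; the toy molecule, weights and function name `gin_layer` are hypothetical, and this is not the model's actual GINConv configuration.

```python
def gin_layer(features, adjacency, weight, bias, eps=0.0):
    """One simplified GIN update: ReLU(W @ ((1 + eps) * h_v + sum of neighbor h_u) + b)."""
    out = {}
    for v, h in features.items():
        summed = [(1.0 + eps) * x for x in h]
        for u in adjacency[v]:
            summed = [s + x for s, x in zip(summed, features[u])]
        out[v] = [max(0.0, sum(w * s for w, s in zip(row, summed)) + b)
                  for row, b in zip(weight, bias)]
    return out


# Hypothetical molecular graph: heavy atoms of ethanol, C-C-O,
# with one-hot atom-type features [is_carbon, is_oxygen].
atom_features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical weights
atom_out = gin_layer(atom_features, bonds, identity, bias=[0.0, 0.0])

# Sum pooling over atoms gives a graph-level drug feature (an assumed readout).
drug_feature = [sum(h[d] for h in atom_out.values()) for d in range(2)]
```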

As shown in FIG. 2, the drug sensitivity prediction module includes a plurality of fully connected layers, for example 3 fully connected layers. The cell line features and the drug features are concatenated and input to the drug sensitivity prediction module, and the fully connected layers perform feature fusion and regression prediction on the concatenated features to output the predicted half-maximal inhibitory concentration (IC50) of the drug-cell line pair.
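The concatenate-then-regress structure can be sketched with three small fully connected layers. The feature sizes, weights and the choice of ReLU between layers are hypothetical placeholders; a trained model would learn the weights.

```python
def dense(x, weight, bias, relu=True):
    """Fully connected layer: y = W @ x + b, optionally followed by ReLU."""
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
    return [max(0.0, v) for v in y] if relu else y


def predict_ic50(cell_feat, drug_feat, layers):
    """Concatenate cell line and drug features, then apply the layer stack.

    The last layer is linear (no ReLU) so that it can output any real value.
    """
    x = list(cell_feat) + list(drug_feat)
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b, relu=(i < len(layers) - 1))
    return x[0]


# Hypothetical weights for a 3 -> 2 -> 2 -> 1 network.
layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[2.0, 1.0]], [0.1]),
]
pred = predict_ic50(cell_feat=[1.0, 2.0], drug_feat=[0.5], layers=layers)
```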

S130, performing parameter optimization on the drug sensitivity prediction model by using the training samples.

In the embodiment, parameter optimization is performed on the drug sensitivity prediction model by using the multi-omics data of a cell line and the drug data as sample data and using the half-inhibitory concentration data of the drug on the cell line as the truth label. Specifically, the multi-omics data of the cell line are input into the cell line graph characterization module, the resulting cell line polygonal graph is input into the cell line graph feature extraction module, and the cell line features are obtained through information characterization and feature extraction. The drug data are input into the drug feature extraction module, and the drug features are obtained through information characterization and feature extraction. The cell line features and the drug features are then input into the drug sensitivity prediction module, which outputs a predicted value of the half-inhibitory concentration, and the model parameters of the drug sensitivity prediction model are updated by taking the mean square error between the predicted value of the half-inhibitory concentration and the corresponding truth label as the loss function.
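The loss used for parameter optimization is the mean square error between predicted and true half-inhibitory concentrations. The sketch below computes this loss and performs one gradient-descent update in plain Python. A linear model is used purely as a stand-in for the full prediction model, and the optimizer, learning rate and function names are assumptions; the text specifies only the loss function.

```python
def mse_loss(predictions, targets):
    """Mean square error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def linear_predict(weights, bias, inputs):
    """Stand-in model: one linear output per input row."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in inputs]

def sgd_step(weights, bias, inputs, targets, lr=0.1):
    """One gradient-descent step on the MSE of the linear stand-in model."""
    n = len(targets)
    errors = [p - t for p, t in zip(linear_predict(weights, bias, inputs), targets)]
    grad_w = [
        2.0 / n * sum(e * row[d] for e, row in zip(errors, inputs))
        for d in range(len(weights))
    ]
    grad_b = 2.0 / n * sum(errors)
    return [w - lr * g for w, g in zip(weights, grad_w)], bias - lr * grad_b

# Toy data: one fused feature per sample and made-up truth labels.
toy_inputs, toy_targets = [[1.0], [2.0]], [2.0, 4.0]
loss_before = mse_loss(linear_predict([0.0], 0.0, toy_inputs), toy_targets)
new_weights, new_bias = sgd_step([0.0], 0.0, toy_inputs, toy_targets)
loss_after = mse_loss(linear_predict(new_weights, new_bias, toy_inputs), toy_targets)
```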

S140, performing drug sensitivity prediction by using the drug sensitivity prediction model after parameter optimization.

In application, the multi-omics data of a cell line are input into the drug sensitivity prediction model, encoded into a cell line polygonal graph by the cell line graph characterization module, and input into the cell line graph feature extraction module, where the cell line features are obtained through information characterization and feature extraction. The drug data are input into the drug feature extraction module, and the drug features are obtained through information characterization and feature extraction. The cell line features and the drug features are then input into the drug sensitivity prediction module, which outputs the predicted value of the half-inhibitory concentration. For example, when the drug sensitivity prediction model was trained and tested on 564 pan-cancer cell lines and 170 drugs, the RMSE on the test set was 0.7943, far better than that of existing models.

Fig. 5 is a schematic structural diagram of a drug sensitivity prediction device based on multigroup chemical data fusion provided by an embodiment. As shown in fig. 5, an embodiment provides a drug sensitivity prediction apparatus 500, including:

the data acquisition unit 510 is configured to acquire multi-omics data of a cell line, drug data, and half-inhibitory concentration data of a drug on the cell line, where the multi-omics data include genomics data, proteomics data, and metabonomics data;

the model construction unit 520 is used for constructing a drug sensitivity prediction model comprising a cell line graph characterization module, a cell line graph feature extraction module, a drug feature extraction module and a drug sensitivity prediction module, wherein the cell line graph characterization module is used for encoding the multi-omics data of a cell line into a cell line polygonal graph, that is, the genes of each sample are used as nodes, the gene expression level, gene mutation status and copy number variation status corresponding to each gene are used as node features, and the connecting edges between nodes are constructed according to the correlation between genes determined from the genomics data, the protein interactions between genes determined from the proteomics data, and the metabolic pathway information between genes determined from the metabonomics data; the cell line graph feature extraction module is used for extracting cell line features from the cell line polygonal graph; the drug feature extraction module is used for extracting drug features from the drug data; and the drug sensitivity prediction module is used for predicting the half-inhibitory concentration of the drug according to the cell line features and the drug features;

the optimization learning unit 530 is configured to perform parameter optimization on the drug sensitivity prediction model by using the multi-omics data of the cell line and the drug data as sample data and using the half-inhibitory concentration data of the drug on the cell line as the truth label;

and the predicting unit 540 is used for predicting the drug sensitivity by using the drug sensitivity prediction model after parameter optimization.
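The graph encoding handled by the model construction unit 520 (genes as nodes, three node features per gene, three kinds of connecting edges) can be sketched as a small data structure in plain Python. This is an illustrative sketch only: the class and function names, the gene identifiers, the use of per-gene expression profiles across samples, and the default threshold of 0.5 are assumptions. The correlation edges follow the Pearson-threshold rule, while the protein interaction and metabolic pathway edges are taken as given gene pairs.

```python
import math
from dataclasses import dataclass, field

@dataclass
class CellLineGraph:
    """Genes as nodes with three node features and three kinds of edges."""
    node_features: dict  # gene -> (expression, mutation, copy number variation)
    edges: dict = field(default_factory=dict)  # kind of edge -> set of (gene_a, gene_b)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_edges(expression_profiles, threshold):
    """Connect two genes when their Pearson correlation exceeds the threshold."""
    genes = sorted(expression_profiles)
    return {
        (a, b)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
        if pearson(expression_profiles[a], expression_profiles[b]) > threshold
    }

def build_graph(node_features, expression_profiles, ppi_pairs, pathway_pairs, threshold=0.5):
    """Assemble the three kinds of edges around a shared set of gene nodes."""
    return CellLineGraph(
        node_features=node_features,
        edges={
            "correlation": correlation_edges(expression_profiles, threshold),
            "protein_interaction": set(ppi_pairs),
            "metabolic_pathway": set(pathway_pairs),
        },
    )

# Hypothetical genes and made-up values, for illustration only.
toy_nodes = {"geneA": (2.1, 0, 1), "geneB": (4.0, 1, 0), "geneC": (0.5, 0, -1)}
toy_profiles = {"geneA": [1.0, 2.0, 3.0], "geneB": [2.0, 4.0, 6.0], "geneC": [3.0, 2.0, 1.0]}
toy_graph = build_graph(toy_nodes, toy_profiles, [("geneA", "geneC")], [("geneB", "geneC")])
```

Keeping one edge set per kind of edge, instead of merging them into a single adjacency structure, preserves the information needed to aggregate over each kind of edge separately.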

It should be noted that, when the drug sensitivity prediction device based on multigroup chemical data fusion provided in the above embodiment performs drug sensitivity prediction, the division into the above functional units is given only as an example. In practice, the functions may be assigned to different functional units as needed, that is, the internal structure of the terminal or the server may be divided into different functional units to perform all or part of the functions described above. In addition, the drug sensitivity prediction device based on multigroup chemical data fusion and the drug sensitivity prediction method based on multigroup chemical data fusion provided in the above embodiments belong to the same concept; the specific implementation process of the device is described in detail in the method embodiment and is not repeated here.

The embodiment also provides a drug sensitivity prediction device based on multigroup chemical data fusion, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to realize the drug sensitivity prediction method based on multigroup chemical data fusion.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. In the embodiments provided in the present application, the memory may be a local volatile memory such as a RAM, a non-volatile memory such as a ROM, a FLASH memory, a floppy disk or a mechanical hard disk, or a remote cloud storage. The processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), or a Field Programmable Gate Array (FPGA).

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. A drug sensitivity prediction method based on multigroup chemical data fusion is characterized by comprising the following steps:

2. The method for predicting drug sensitivity based on multigroup chemical data fusion according to claim 1, wherein in the cell line graph characterization module, a Pearson correlation coefficient between the gene expression data of two genes is calculated according to the genomics data to determine the correlation between the genes, and when the Pearson correlation coefficient is greater than a set threshold, a connecting edge between the nodes corresponding to the two genes is constructed;

3. The method for predicting drug sensitivity based on multigroup chemical data fusion according to claim 1, wherein the cell line graph feature extraction module comprises a first graph neural network unit and a gated recurrent unit, the first graph neural network unit is composed of a plurality of graph convolution layers, and two adjacent graph convolution layers are connected through the gated recurrent unit, wherein the first graph neural network unit is used for extracting cell line features from the cell line polygonal graph, and the gated recurrent unit is used for performing feature attention on the extracted cell line features.

4. The method of claim 3, wherein in each graph convolution layer, the node features are subjected to three-step feature aggregation, comprising:

(1) [formula (1) is not reproduced in the text]

wherein the symbols of formula (1) denote, respectively, the current node feature of the i-th current node, the node feature of the j-th first first-order neighbor node, and the number of first first-order neighbor nodes;

(2) [formula (2) is not reproduced in the text]

wherein the symbols of formula (2) denote, respectively, the new node feature obtained after the new node feature of the i-th current node is attended to by the node gating unit, the node feature of the k-th second first-order neighbor node, and the number of second first-order neighbor nodes;

(3) [formula (3) is not reproduced in the text]

wherein the symbols of formula (3) denote, respectively, the new node feature obtained after the new node feature of the i-th current node is attended to by the node gating unit, the node feature of the t-th third first-order neighbor node, and the number of third first-order neighbor nodes;

the new node feature obtained after the new node feature of the current node is attended to by the node gating unit is used as the current node feature of the next graph convolution layer.

5. The method for predicting drug sensitivity based on multigroup chemical data fusion of claim 1, wherein the drug feature extraction module comprises a conversion unit and a second graph neural network unit, wherein the conversion unit is used for converting the drug data into a drug molecular graph, and the second graph neural network unit is used for extracting the drug features from the input drug molecular graph.

6. The method for predicting drug sensitivity based on multigroup chemical data fusion of claim 5, wherein the conversion unit encodes the drug data into a drug molecular graph by using the open-source library RDKit; the second graph neural network unit is constructed based on the graph isomorphism principle.

7. The method for predicting drug sensitivity based on multigroup chemical data fusion of claim 1, wherein the drug sensitivity prediction module comprises a plurality of fully connected layers for performing feature fusion and regression prediction on the input concatenated features of the cell line features and the drug features to obtain the half-inhibitory concentration of the drug.

8. The method for predicting drug sensitivity based on multigroup chemical data fusion according to claim 1, characterized in that after the multi-omics data of a cell line, the drug data and the half-inhibitory concentration data of a drug on the cell line are obtained, outliers and missing values are further removed from the data, and the processed data are used for constructing the training samples;

and when the parameters of the drug sensitivity prediction model are optimized, updating the model parameters of the drug sensitivity prediction model by taking the predicted value of the half inhibitory concentration and the mean square error of the corresponding truth label as a loss function.

9. A drug sensitivity prediction device based on multigroup chemical data fusion, comprising:

10. A drug sensitivity prediction device based on multigroup chemical data fusion, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the drug sensitivity prediction method based on multigroup chemical data fusion of any one of claims 1 to 8 when executing the computer program.