CN115985413A

CN115985413A - Method, device and equipment for constructing drug sensitivity prediction model sample

Info

Publication number: CN115985413A
Application number: CN202211605339.5A
Authority: CN
Inventors: 姜山; 汤忞; 汤德平
Original assignee: Saiye Shanghai Intelligent Technology Co ltd
Current assignee: Saiye Shanghai Intelligent Technology Co ltd
Priority date: 2022-12-14
Filing date: 2022-12-14
Publication date: 2023-04-18

Abstract

The application provides a method, a device and equipment for constructing drug sensitivity prediction model samples. The method comprises the following steps: acquiring gene expression data and drug sensitivity data of a cell; obtaining a score for the cell on each gene set of a gene set database based on the gene expression data of the cell; normalizing the score on each gene set to obtain first characteristic data of the cell, and obtaining second characteristic data of the cell based on the first characteristic data; and aggregating the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell, the third characteristic data corresponding to the cell and the drug sensitivity data as sample data. In this way, the gene expression data of the cell is converted into biologically meaningful gene set scores, which are then mathematically processed to obtain diversified (non-single) characteristic data, so that a sample comprising the characteristic data can reflect the relevance of the different mechanisms of action of drugs.

Description

Method, device and equipment for constructing drug sensitivity prediction model sample

Technical Field

The application relates to the technical field of machine learning, in particular to a technology for constructing drug sensitivity prediction model samples.

Background

With the continuous development of artificial intelligence, machine learning models are applied to various fields. In the field of medical basic clinical, there are various machine learning model-based prediction models, such as a drug sensitivity prediction model for predicting drug sensitivity.

In existing drug sensitivity prediction based on machine learning models, Principal Component Analysis (PCA) is generally used, and gene expression data or manually selected related data are directly adopted as the prediction features of the samples. The samples are input into a single, pre-selected machine learning model for training to obtain a drug sensitivity prediction model, and a large amount of computing power is concentrated on selecting the model hyper-parameters in order to obtain the drug sensitivity prediction data. Because a single mathematical logic and the same features are used to predict sensitivity for all drugs, the prediction accuracy is not high.

Moreover, existing drug sensitivity prediction models are usually based on a single machine learning model that simulates the human brain, or on a subjectively selected machine learning model, without considering how well the model fits the practical problem. A large amount of computing power is spent on selecting the model hyper-parameters, which easily leads to overfitting. Because selecting the hyper-parameters requires a large amount of repeated training, the final result is a black-box model with poor interpretability, and many models with poor training results are discarded along the way, which wastes training resources.

Disclosure of Invention

The application aims to provide a method, a device and equipment for constructing drug sensitivity prediction model samples, so that a drug sensitivity prediction model trained on the samples can be used for drug sensitivity prediction, thereby at least partially solving the technical problem in the prior art that a drug sensitivity prediction model based on a single machine learning model has low prediction accuracy.

According to one aspect of the present application, there is provided a method for drug sensitivity prediction model sample construction, wherein the method comprises:

acquiring gene expression data and drug sensitivity data of cells;

obtaining a score for the cell on each gene set of a database of gene sets based on gene expression data for the cell;

normalizing the score of each gene set to obtain first characteristic data of the cell, and obtaining second characteristic data of the cell based on the first characteristic data;

and aggregating the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell, the third characteristic data corresponding to the cell and the drug sensitivity data as sample data to construct a sample of a drug sensitivity prediction model.

Optionally, before obtaining the score of the cell on each gene set of the gene set database, if the cell has no corresponding gene expression data for one or more genes in a gene set of the gene set database, the gene expression data of the missing gene or genes is complemented.

Optionally, wherein the obtaining a score for the cell on each gene set of a database of gene sets based on the gene expression data of the cell comprises:

for the gene expression data of the cell, obtaining the score of the cell on each gene set of the gene set database by a single-sample gene set enrichment analysis (ssGSEA) method.

Optionally, wherein the obtaining second characteristic data of the cell based on the first characteristic data comprises:

performing mathematical processing on the first characteristic data to obtain second characteristic data of the cell, wherein the mathematical processing comprises at least one of:

a square operation;

a cross operation;

a cube operation;

a natural logarithm operation;

a Box-Cox transformation.

Optionally, the method for constructing the drug sensitivity prediction model sample further includes:

obtaining sample data of a plurality of different cells to construct a first sample data set, and dividing the first sample data set into a first training sample data set and a first test sample data set;

training a plurality of preset machine learning models based on the first training sample data set; when the mean squared error (MSE) of a preset machine learning model meets a preset threshold, testing the model on the first test sample data set, and completing the training of the model if the MSE in the test meets the preset threshold;

and fusing each preset machine learning model after training to obtain a first drug sensitivity prediction model.

Optionally, wherein said dividing the first sample data set into a first training sample data set and a first test sample data set comprises:

using the Kolmogorov-Smirnov (KS) test to divide the first sample data set into the first training sample data set and the first test sample data set.
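The text does not say how the KS test drives the split. One plausible reading, sketched below purely as an assumption, is to draw several random splits and keep the one whose training and test drug sensitivity values have the most similar distributions, i.e. the smallest two-sample KS statistic. The function names, the 20% test fraction and the number of trials are all hypothetical.

```python
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_guided_split(labels, test_fraction=0.2, n_trials=200, seed=0):
    """Return (ks, train_idx, test_idx) for the best of n_trials random splits."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_test = max(1, round(len(labels) * test_fraction))
    best = None
    for _ in range(n_trials):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        d = ks_statistic([labels[k] for k in train_idx],
                         [labels[k] for k in test_idx])
        if best is None or d < best[0]:
            best = (d, sorted(train_idx), sorted(test_idx))
    return best


# Hypothetical drug sensitivity values for 50 cells.
labels = [k / 10 for k in range(50)]
ks, train_idx, test_idx = ks_guided_split(labels)
```

A small KS statistic only indicates that the label distributions of the two subsets are similar; it says nothing about the feature distributions, which would need a per-feature check.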

Optionally, wherein said training a number of preset machine learning models based on said first set of training sample data comprises:

acquiring training sample data from the first training sample data set by cross-validation and bootstrap sampling, and training the plurality of preset machine learning models;

the training is repeated for a preset number of times.

Optionally, the fusing each preset machine learning model after the training to obtain the first drug sensitivity prediction model includes:

fusing the trained preset machine learning models by a greedy forward selection method to obtain the first drug sensitivity prediction model.
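The greedy forward selection mentioned above is not detailed in the text. A minimal sketch of one common variant follows, assuming that fusion means averaging model predictions and that models may be selected more than once; `model_preds`, the model names and the stopping rule are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def greedy_forward_selection(model_preds, target, max_rounds=10):
    """Repeatedly add the model that most reduces the MSE of the averaged
    ensemble prediction; stop when no candidate improves the error.

    model_preds maps a model name to its predictions on held-out data.
    """
    chosen = []
    ensemble_sum = [0.0] * len(target)
    best_err = float("inf")
    for _ in range(max_rounds):
        best_name, best_candidate_err = None, best_err
        for name, preds in model_preds.items():
            k = len(chosen) + 1
            avg = [(s + p) / k for s, p in zip(ensemble_sum, preds)]
            err = mse(avg, target)
            if err < best_candidate_err:
                best_name, best_candidate_err = name, err
        if best_name is None:
            break  # no model improves the ensemble any further
        chosen.append(best_name)
        ensemble_sum = [s + p
                        for s, p in zip(ensemble_sum, model_preds[best_name])]
        best_err = best_candidate_err
    return chosen, best_err


# Hypothetical held-out predictions from three trained models.
target = [1.0, 2.0, 3.0]
model_preds = {
    "high": [2.0, 3.0, 4.0],
    "low": [0.0, 1.0, 2.0],
    "noisy": [3.0, 0.0, 5.0],
}
chosen, err = greedy_forward_selection(model_preds, target)
```

In this toy case the two biased models cancel each other out when averaged, so the fused prediction is better than any single model, which is the motivation for fusing several models instead of picking one.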

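Optionally, the method for constructing the drug sensitivity prediction model sample further includes:

replacing the characteristic values of one characteristic of the third characteristic data among different samples in the first sample data set to obtain a second sample data set;

predicting the first sample data set and the second sample data set by using the first drug sensitivity prediction model, and obtaining the importance score of the characteristic according to the change in the accuracy of the prediction results;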

traversing each feature of third feature data of the samples in the first sample data set, and repeating the steps to obtain an importance score of each feature;

based on the first sample data set, selecting a preset number of features with the highest importance scores, and constructing a third sample data set, wherein each sample data in the third sample data set comprises a cell and the feature data, within the third feature data corresponding to the cell, of the preset number of features with the highest importance scores.
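The feature replacement steps above resemble what is commonly called permutation importance. The sketch below assumes that "replacing" means shuffling one feature column across samples and that accuracy is measured by MSE; `predict` stands in for the first drug sensitivity prediction model and is hypothetical.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def permutation_importance(predict, X, y, seed=0):
    """Importance of feature j = increase in MSE after shuffling column j."""
    rng = random.Random(seed)
    baseline = mse([predict(row) for row in X], y)
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # exchange the values of feature j among samples
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        scores.append(mse([predict(row) for row in X_perm], y) - baseline)
    return scores


def top_k_features(scores, k):
    """Indices of the k features with the highest importance scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def predict(row):
    """Toy stand-in for a trained model: only feature 0 matters."""
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]
scores = permutation_importance(predict, X, y)
selected = top_k_features(scores, 1)
```

Shuffling the unused feature leaves the error unchanged, so its score is zero, while shuffling the informative feature raises the error; keeping only the top-scoring features yields the reduced third sample data set.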

Optionally, the method for constructing a drug sensitivity prediction model sample further includes:

dividing the third sample data set into a third training sample data set and a third test sample data set;

training the plurality of preset machine learning models based on the third training sample data set; when the mean squared error (MSE) of a preset machine learning model meets a preset threshold, testing the model on the third test sample data set, and completing the training of the model if the MSE in the test meets the preset threshold;

and fusing each preset machine learning model after training to obtain a second drug sensitivity prediction model.

Optionally, the method for constructing a drug sensitivity prediction model sample further includes:

acquiring gene expression data of a cell to be detected, and obtaining a score of the cell to be detected on each gene set of the gene set database based on the gene expression data of the cell to be detected;

normalizing the score of each gene set to obtain first characteristic data of the cell to be detected, and obtaining second characteristic data of the cell to be detected based on the first characteristic data;

aggregating the first characteristic data and the second characteristic data to obtain third characteristic data of the cell to be detected;

inputting the third characteristic data of the cell to be detected into the second drug sensitivity prediction model to predict the drug sensitivity data of the cell to be detected.

According to another aspect of the present application, there is provided an apparatus for drug sensitivity prediction model sample construction, wherein the apparatus comprises:

the first module is used for acquiring gene expression data and drug sensitivity data of cells;

a second module for obtaining a score for the cell on each gene set of a database of gene sets based on gene expression data for the cell;

a third module, configured to perform normalization processing on the score of each gene set to obtain first feature data of the cell, and obtain second feature data of the cell based on the first feature data;

a fourth module, configured to aggregate the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and take the cell, the third characteristic data corresponding to the cell and the drug sensitivity data as sample data to construct a sample of the drug sensitivity prediction model.

Optionally, the apparatus for drug sensitivity prediction model sample construction further includes:

a fifth module, configured to obtain sample data of a plurality of different cells to construct a first sample dataset, and divide the first sample dataset into a first training sample dataset and a first test sample dataset;

a sixth module, configured to train a plurality of preset machine learning models based on the first training sample data set, test a preset machine learning model on the first test sample data set when its mean squared error (MSE) meets a preset threshold, and complete the training of the model if the MSE in the test meets the preset threshold;

and the seventh module is used for fusing each preset machine learning model which completes training to obtain a first drug sensitivity prediction model.

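Optionally, the apparatus for drug sensitivity prediction model sample construction further includes:

an eighth module, configured to replace the feature values of one feature of the third feature data among different samples in the first sample data set, to obtain a second sample data set;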

a ninth module, configured to predict the first sample data set and the second sample data set by using the first drug sensitivity prediction model, and obtain an importance score of the feature according to a change in accuracy of a prediction result;

a tenth module, configured to traverse each feature of third feature data of the samples in the first sample dataset, and repeat the foregoing steps to obtain an importance score of each feature;

an eleventh module, configured to select, based on the first sample data set, a preset number of features with the highest importance scores, and construct a third sample data set, where each sample data in the third sample data set includes a cell and the feature data, within the third feature data corresponding to the cell, of the preset number of features with the highest importance scores.

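Optionally, the apparatus for drug sensitivity prediction model sample construction further includes:

a twelfth module, configured to divide the third sample data set into a third training sample data set and a third test sample data set;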

a thirteenth module, configured to train the plurality of preset machine learning models based on the third training sample data set, test a preset machine learning model on the third test sample data set when its mean squared error (MSE) meets a preset threshold, and complete the training of the model if the MSE in the test meets the preset threshold;

and the fourteenth module is used for fusing each preset machine learning model which completes training to obtain a second drug sensitivity prediction model.

Optionally, the apparatus for drug sensitivity prediction model sample construction further includes:

a fifteenth module, configured to obtain gene expression data of a cell to be detected, and obtain, based on the gene expression data of the cell to be detected, a score of the cell to be detected on each gene set in a gene set database;

a sixteenth module, configured to perform normalization processing on the score in each gene set to obtain first feature data of the cell to be detected, and obtain second feature data of the cell to be detected based on the first feature data;

a seventeenth module, configured to aggregate the first characteristic data and the second characteristic data to obtain third characteristic data of the cell to be detected;

and the eighteenth module is used for inputting the third characteristic data of the cell to be detected into the second drug sensitivity prediction model so as to predict the drug sensitivity data of the cell to be detected.

Compared with the prior art, the present application provides a method, a device and equipment for constructing drug sensitivity prediction model samples. The method comprises the following steps: acquiring gene expression data and drug sensitivity data of a cell; obtaining a score for the cell on each gene set of a gene set database based on the gene expression data of the cell; normalizing the score on each gene set to obtain first characteristic data of the cell, and obtaining second characteristic data of the cell based on the first characteristic data; and aggregating the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell, the third characteristic data corresponding to the cell and the drug sensitivity data as sample data. Optionally, the method further comprises: obtaining sample data of a plurality of different cells to construct a first sample data set, and dividing the first sample data set into a first training sample data set and a first test sample data set; training a plurality of preset machine learning models based on the first training sample data set; when the mean squared error (MSE) of a preset machine learning model meets a preset threshold, testing the model on the first test sample data set, and completing the training of the model if the MSE in the test meets the preset threshold; and fusing the trained preset machine learning models to obtain a first drug sensitivity prediction model.
Optionally, the method further comprises: replacing the characteristic values of one characteristic of the third characteristic data among different samples in the first sample data set to obtain a second sample data set; predicting the first sample data set and the second sample data set by using the first drug sensitivity prediction model, and obtaining an importance score of the characteristic according to the change in the accuracy of the prediction results; traversing each feature of the third feature data of the samples in the first sample data set, and repeating the above steps to obtain an importance score of each feature; and selecting, based on the first sample data set, a preset number of features with the highest importance scores, and constructing a third sample data set, wherein each sample data in the third sample data set comprises a cell and the feature data, within the third feature data corresponding to the cell, of the preset number of features with the highest importance scores. Optionally, the method further comprises: dividing the third sample data set into a third training sample data set and a third test sample data set; training the plurality of preset machine learning models based on the third training sample data set; when the MSE of a preset machine learning model meets a preset threshold, testing the model on the third test sample data set, and completing the training of the model if the MSE in the test meets the preset threshold; and fusing the trained preset machine learning models to obtain a second drug sensitivity prediction model.

The method, the device and the equipment for constructing the drug sensitivity prediction model sample have the following technical effects:

The gene expression data of the cell is converted into biologically meaningful gene set scores, which are then mathematically processed to obtain diversified (non-single) characteristic data, so that a sample comprising the characteristic data can reflect the relevance of the different mechanisms of action of drugs. Optionally, the drug sensitivity prediction model is obtained by fusing a plurality of machine learning models trained on the sample data set constructed from the samples, which improves training efficiency and prediction accuracy; the number of machine learning models participating in the fusion can be adjusted as required, which provides extensibility and applicability. Optionally, feature replacement is used to determine the importance of different features, a number of highly important features are selected to construct a new sample data set, a plurality of new machine learning models are trained on it, and a new drug sensitivity prediction model is obtained by fusion, which reduces the amount of training data and improves training efficiency without reducing prediction accuracy.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 illustrates a flow diagram of a method for drug sensitivity prediction model sample construction according to one aspect of the present application;

FIG. 2 illustrates a flow diagram of a method for drug sensitivity prediction model construction according to an aspect of the present application;

FIG. 3 illustrates a flow diagram of a method for drug sensitivity prediction model sample dataset construction according to another aspect of the present application;

FIG. 4 illustrates a flow diagram of a method for drug sensitivity prediction model construction according to another aspect of the present application;

FIG. 5 illustrates a flow diagram of a method of drug sensitivity prediction according to yet another aspect of the present application;

FIG. 6 illustrates a schematic diagram of an apparatus for drug sensitivity prediction model sample construction according to yet another aspect of the present application;

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the embodiments of the present application, the entity executing the method, each trusted party of the system, and/or each module of the apparatus may include one or more processors (CPUs), input/output interfaces, network interfaces, and memories.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transient media), such as modulated data signals and carrier waves.

In order to further illustrate the technical means and effects adopted by the present application, the following description is made in combination with the accompanying drawings and various embodiments for clearly and completely describing the technical solutions of the present application.

FIG. 1 illustrates a method for drug sensitivity prediction model sample construction according to an aspect of the present application, wherein the method of an embodiment comprises:

S101, acquiring gene expression data and drug sensitivity data of a cell;

S102, obtaining a score of the cell on each gene set of a gene set database based on the gene expression data of the cell;

S103, normalizing the score on each gene set to obtain first characteristic data of the cell, and obtaining second characteristic data of the cell based on the first characteristic data;

S104, aggregating the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell, the third characteristic data corresponding to the cell and the drug sensitivity data as sample data.

The method embodiment of the present application may be implemented by the device 100, where the device 100 is a computer device and/or a cloud that has a certain computing power and on which the relevant hardware and software are installed. The computer device includes, but is not limited to, a personal computer, a notebook computer, an industrial computer, a network host, a single network server or a set of multiple network servers; the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing: a virtual supercomputer consisting of a collection of loosely coupled computers.

The computers and/or clouds are merely examples, and other existing or future devices and/or resource sharing platforms, as applicable, are intended to be included within the scope of the present application and are hereby incorporated by reference.

In this embodiment, in step S101, the apparatus 100 acquires gene expression data of a certain cell and drug sensitivity data thereof.

The gene expression data of a cell may be gene expression data in a public data set, for example, the CCLE (Cancer Cell Line Encyclopedia) database or the CellMiner database. The CCLE database is a tumor genomics research project led by the Broad Institute of MIT and Harvard, and covers gene expression, mutation, copy number, methylation and other data for thousands of cell lines from more than thirty tissue sources. The CellMiner database is built mainly on the 60 cancer cell lines (NCI-60) of the U.S. National Cancer Institute, and includes gene expression data obtained by transcriptome sequencing.

The drug sensitivity data of the cell can be drug sensitivity data of the cell in the CTRP (Cancer Therapeutics Response Portal) database, the GDSC (Genomics of Drug Sensitivity in Cancer) database, or other public drug sensitivity databases. The CTRP database links the genetic, lineage, and other cellular features of cancer cell lines to small-molecule sensitivity, with the goal of accelerating the discovery of patient-matched cancer therapeutics. The GDSC database was developed by the Wellcome Sanger Institute in Cambridge, England, and collects data on the sensitivity and response of tumor cells to drugs. These public data sets contain experimental drug sensitivity data of a variety of tumor cells for compounds and targeted drugs.

Alternatively, a cell may be obtained experimentally: transcriptome sequencing is performed on the cell to obtain its gene expression data, and a drug sensitivity test is performed to obtain its drug sensitivity data.

If the gene expression data of a cell and the drug sensitivity data corresponding to the cell cannot both be obtained, the cell is not selected.
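The selection rule above amounts to keeping only the cells that appear in both data sources. A minimal sketch follows; the dictionary layout and every value are hypothetical, and only the ACHN cell name comes from the text.

```python
def select_paired_cells(expression, sensitivity):
    """Keep only cells that have both expression and drug sensitivity data."""
    shared = sorted(set(expression) & set(sensitivity))
    return {cell: (expression[cell], sensitivity[cell]) for cell in shared}


# Hypothetical inputs keyed by cell name.
expression = {
    "ACHN": {"TP53": 5.2, "EGFR": 3.1},
    "CELL_B": {"TP53": 4.0, "EGFR": 6.3},
    "CELL_C": {"TP53": 2.2, "EGFR": 1.9},  # no drug sensitivity data
}
sensitivity = {
    "ACHN": 0.42,
    "CELL_B": 0.87,
    "CELL_D": 0.10,  # no gene expression data
}
paired = select_paired_cells(expression, sensitivity)
```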

Continuing in this embodiment, in step S102, the apparatus 100 obtains a score for the cell on each gene set of the database of gene sets based on the gene expression data for the cell.

Wherein the apparatus 100 performs gene data comparison with each gene set in the gene set database based on the gene expression data of the cell to obtain a score of the cell on each gene set in the gene set database.

Illustratively, a score for the cell on each gene set in the MSigDB (Molecular Signatures Database) database can be obtained by comparing the gene expression data of the cell with the gene data of each gene set in the MSigDB database. The MSigDB database is currently one of the most popular public gene set databases; it aggregates data from other common gene set databases, is continuously updated and improved, and is free and open to the public. The 2022 version of the MSigDB database includes nine major collections, C1 to C8 plus the Hallmark collection, with more than 33,000 gene sets in total; the gene sets are well annotated and biologically meaningful.

The score of the cell on each gene set in other gene set databases can also be obtained, and the other gene set databases can be other well-known gene set databases, such as the KEGG database, the Reactome database, the WikiPathways database, the GO database, and the like. Part of the data in these gene set databases is also stored in the MSigDB database; for example, the KEGG, Reactome and WikiPathways databases are each part of the C2 collection of the MSigDB database, and the GO database is stored in the C5 collection of the MSigDB database. The data in the KEGG, Reactome and WikiPathways databases focuses on signaling pathways, and the GO database contains three subclasses: molecular functions, biological processes, and cellular components.

The MSigDB database and other gene set databases are merely examples, and other existing or future gene set databases, as applicable to the present application, are also included in the scope of the present application and are herein incorporated by reference.

Since the genes that make up a cell may not include all known genes, the genes that a cell includes may not include one or more genes on a gene set of the gene set database. In order to make the cell able to obtain a score on each gene set of the gene set database, optionally, in step S102, before obtaining the score of the cell on each gene set of the gene set database, if a certain gene or genes of the cell on a certain gene set of the gene set database have no corresponding gene expression data, the gene expression data of the missing gene is complemented in the gene expression data of the cell.

When the gene expression data of the cell is compared with the gene data of each gene set in the gene set database, if the gene expression data of the cell does not contain data for one or more genes in a gene set, the data for those genes is complemented in the gene expression data of the cell, so that structurally complete gene expression data of the cell is obtained. This also helps the subsequent steps construct diversified sample data for training machine learning models. Considering that missing data is often caused by a very low expression level, the data of the gene or genes to be complemented may be set to 0.
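The completion step can be sketched as follows, assuming the expression data is a mapping from gene name to expression value and the gene set database is a mapping from gene set name to a list of genes. The set names and expression values are hypothetical.

```python
def complete_expression(expression, gene_sets, fill_value=0.0):
    """Add every gene that occurs in some gene set but is missing from the
    cell's expression data, using fill_value (0 by default)."""
    completed = dict(expression)  # leave the caller's data untouched
    for genes in gene_sets.values():
        for gene in genes:
            completed.setdefault(gene, fill_value)
    return completed


# Hypothetical data: MYC and KRAS are missing from the cell's profile.
expression = {"TP53": 5.2, "EGFR": 3.1}
gene_sets = {"SET_A": ["TP53", "MYC"], "SET_B": ["EGFR", "KRAS"]}
completed = complete_expression(expression, gene_sets)
```

After completion every gene set can be scored for every cell, so all samples end up with the same feature layout.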

Optionally, the step S102 includes:

for the gene expression data of the cell, obtaining the score of the cell on each gene set of the gene set database by a single-sample gene set enrichment analysis (ssGSEA) method.

The ssGSEA (single-sample Gene Set Enrichment Analysis) method is a gene analysis method that takes the gene expression data of a single cell as the sample. Differential gene analysis usually yields a large number of genes, which are difficult to analyze systematically for common patterns; they need to be annotated and related to disease-associated pathways in order to mine useful information. General differential gene analysis (for example, at the GO or pathway level) compares gene expression between two groups and focuses on a small number of significantly up-regulated or down-regulated genes. It requires the user to define an explicit threshold for differential genes, for example |logFC| ≥ 2.0 and FDR ≤ 0.05, which easily omits genes whose differential expression is not significant but which are biologically important, and ignores valuable information such as the biological characteristics of genes, the relationships within gene regulatory networks, and gene function and significance. With the ssGSEA method, no explicit differential gene threshold needs to be specified; the analysis is performed on the whole expression profile according to the overall trend of the actual data, so that the gene expression data and its biological significance are linked statistically and the results are easier to interpret. Specifically, based on the gene expression data of the cell, the genes belonging to a known pathway in the corresponding gene set database are identified, and these genes of interest are scored to obtain the score on the gene set of that pathway.

In step S102, the gene expression data of a cell is analyzed by the ssGSEA method, for example, to obtain a score for the cell on each gene set in the MSigDB database.
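To make the scoring idea concrete, the following is a simplified, rank-based enrichment score in the spirit of ssGSEA. It is an illustrative sketch, not the implementation used in the application: the weighting exponent `alpha`, the function name and the toy data are assumptions, and the cross-sample normalization of full ssGSEA implementations is omitted.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one gene set.

    Genes are ordered by decreasing expression. Walking down the list, the
    score accumulates the gap between the rank-weighted fraction of gene set
    members seen so far and the fraction of non-members seen so far.
    """
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    members = set(gene_set) & set(ordered)
    if not members or len(members) == n:
        raise ValueError("gene set must contain some, but not all, genes")
    # Rank n for the most highly expressed gene, rank 1 for the lowest.
    weights = {g: (n - pos) ** alpha for pos, g in enumerate(ordered)}
    total_in = sum(weights[g] for g in members)
    miss_step = 1.0 / (n - len(members))
    p_in = p_out = score = 0.0
    for g in ordered:
        if g in members:
            p_in += weights[g] / total_in
        else:
            p_out += miss_step
        score += p_in - p_out
    return score


# Hypothetical expression profile of one cell.
expression = {"G1": 9.0, "G2": 7.0, "G3": 5.0, "G4": 3.0, "G5": 1.0, "G6": 0.5}
top_score = ssgsea_score(expression, ["G1", "G2"])
bottom_score = ssgsea_score(expression, ["G5", "G6"])
```

A gene set whose members sit at the top of the expression ranking receives a positive score, and one whose members sit at the bottom receives a negative score, without any per-gene significance threshold being set.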

Continuing with this embodiment, in step S103, the apparatus 100 normalizes the score on each gene set of the gene set database to obtain first feature data of the cell, and obtains second feature data of the cell based on the first feature data.

To facilitate subsequent data processing, the scores on each gene set of the gene set database can be normalized to obtain the first characteristic data of the cell, which makes the data dimensionless and reduces the influence of the specific cell and/or the acquisition process on the scores. If the gene set database includes n gene sets, or n gene sets of the gene set database are selected, and the scores of the gene expression data of the cell on the n gene sets are obtained, the first characteristic data of the cell may include n features, namely the normalized scores of the gene expression data of the cell on the n gene sets, where the numerical value of each normalized score is the feature value of the corresponding feature.

For example, scores of ACHN cells (human renal cell adenocarcinoma cells) on a partial gene set in the MsigDB database are obtained by the ssGSEA method, and the scores are normalized to obtain corresponding feature values of each feature in the first feature data, as shown in table 1 below.

TABLE 1

Note: each normalized score is the score on the gene set after normalization plus 1, to facilitate subsequent mathematical processing.
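As an illustration of the normalization described in the note, the sketch below applies min-max scaling to each gene set's scores across cells and then adds 1. The application does not state which normalization is used, so min-max scaling is an assumption here, as are the function name `normalize_scores` and the input layout. With this choice every feature value lies in [1, 2], which keeps the later logarithm and box-cox operations well defined.

```python
def normalize_scores(scores_by_cell):
    """Min-max normalize each gene set's scores across cells, then add 1.

    scores_by_cell: dict cell name -> dict gene set name -> raw score.
    Min-max scaling is an assumed choice; the source only says "normalized".
    """
    gene_sets = sorted(next(iter(scores_by_cell.values())))
    normalized = {cell: {} for cell in scores_by_cell}
    for gs in gene_sets:
        values = [scores_by_cell[cell][gs] for cell in scores_by_cell]
        lo, hi = min(values), max(values)
        span = hi - lo
        for cell in scores_by_cell:
            # A constant column is mapped to 0 before the +1 shift.
            scaled = (scores_by_cell[cell][gs] - lo) / span if span else 0.0
            normalized[cell][gs] = scaled + 1.0
    return normalized
```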

Optionally, in step S103, obtaining the second feature data of the cell based on the first feature data includes performing at least one of the following on the first feature data:

square operation;

cross operation;

cubic operation;

natural logarithm operation;

box-cox transformation.

To facilitate training of the subsequent machine learning model, the apparatus 100 further performs mathematical processing on the first characteristic data to obtain second characteristic data of the cell, wherein the mathematical processing may include at least one of:

carrying out square operation on the first characteristic data, and/or carrying out cross operation on the first characteristic data, and/or carrying out cubic operation on the first characteristic data, and/or carrying out natural logarithm operation on the first characteristic data, and/or carrying out box-cox transformation on the first characteristic data.

For example, the following table 2 is obtained by performing square operation, cubic operation, natural logarithm operation, and box-cox conversion processing on each feature value of the first feature data of ACHN cells in table 1.

TABLE 2

For example, table 3 below shows the result of the cross operation on the feature values of the first feature data of ACHN cells in table 1.

TABLE 3

For convenience of description, only 6 gene sets are listed in tables 1 to 3 above.
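The derivation of the second feature data can be sketched as follows. The sketch assumes that the cross operation is the product of each pair of first features and that the box-cox transformation uses a fixed parameter `lam=0.5`; neither detail is given in this application, and the feature names `g1`, `g2`, `g3` in the usage note are hypothetical.

```python
import math
from itertools import combinations

def box_cox(x, lam=0.5):
    """Box-Cox transform with a fixed lambda (assumed; the source gives none)."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def second_features(first):
    """first: dict feature name -> normalized score (positive after adding 1)."""
    out = {}
    for name, x in first.items():
        out[f"{name}^2"] = x ** 2
        out[f"{name}^3"] = x ** 3
        out[f"ln({name})"] = math.log(x)
        out[f"boxcox({name})"] = box_cox(x)
    # Cross operation, assumed here to be the product of each feature pair.
    for a, b in combinations(first, 2):
        out[f"{a}*{b}"] = first[a] * first[b]
    return out

def third_features(first):
    """Union of the first and second feature data."""
    return {**first, **second_features(first)}
```

For n first features, `second_features` returns 4n + n(n-1)/2 entries and `third_features` returns 5n + n(n-1)/2 entries; for example, `third_features({"g1": 1.2, "g2": 1.5, "g3": 2.0})` has 18 entries.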

Continuing in this embodiment, in step S104, the apparatus 100 performs aggregate processing on the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and uses the cell and its corresponding third characteristic data and drug sensitive data as sample data to construct a sample of the drug sensitive prediction model.

The device 100 aggregates the first characteristic data and the second characteristic data of the cell into one data set to obtain the third characteristic data of the cell. The cell, its corresponding third characteristic data and its drug sensitivity data are then taken as sample data to construct a sample of the drug sensitivity prediction model, and the sample is used to train a machine learning model to obtain the drug sensitivity prediction model.

According to the method, sample data of different cells can be obtained and a sample data set constructed, which can be used to train a machine learning model to obtain a drug sensitivity prediction model. The sample data comprises: a cell identifier, such as the name of the cell; the third characteristic data; and the drug sensitivity data corresponding to the cell. Illustratively, the first feature data of ACHN cells (human renal cell adenocarcinoma cells) is obtained from the scores of the ACHN cells on each gene set of the MsigDB database computed by the ssGSEA method; the square operation, cubic operation, natural logarithm operation, box-cox transformation and cross operation are performed on it to obtain the second feature data of the ACHN cells; and the first feature data and the second feature data are then aggregated to obtain the third feature data of the ACHN cells. If the scores on n gene sets of the MsigDB database are selected, the first feature data comprises n features, each feature being the normalized score on one gene set and its feature value being the corresponding score value.

The second feature data comprises (4n + n(n-1)/2) features: n features from the square operation, n from the cubic operation, n from the natural logarithm operation and n from the box-cox transformation, whose feature values are obtained by applying the respective operation to the feature value of each feature in the first feature data; and n(n-1)/2 features from the cross operation, whose feature values are obtained by operating on the feature values of the corresponding pair of features in the first feature data. The third feature data is the union of the first feature data and the second feature data and comprises (5n + n(n-1)/2) features and their feature values. If the AUC (Area Under the Curve) data of the ACHN cells in the CTRP database or another public database is used as the drug sensitivity data (the AUC drug sensitivity data of some drugs for the ACHN cells in the CTRP database is shown in table 4 below), the sample data corresponding to the ACHN cells includes: the ACHN cell name, the third characteristic data of the ACHN cells, and the AUC values of the ACHN cells.

TABLE 4

Drug	AUC value
Zebularine (zebularine)	14.89
Azacitidine (azacitidine)	14.92
Nelarabine (nelarabine)	14.83
Myricetin (myricetin)	14.80
Brivanib (brivanib)	14.41
Lei Te Ge Wei (ersmodetib)	14.86
……	……

Optionally, fig. 2 shows a flowchart of a method for drug sensitivity prediction model construction according to an aspect of the present application, wherein the method comprises:

S201, obtaining sample data of a plurality of different cells to construct a first sample data set, and dividing the first sample data set into a first training sample data set and a first test sample data set;

S202, training a plurality of preset machine learning models based on the first training sample data set; when the MSE of a preset machine learning model meets a preset threshold, testing the model based on the first test sample data set; and if the MSE on the test also meets the preset threshold, completing the training of that preset machine learning model;

S203, fusing each preset machine learning model after training is completed to obtain a first drug sensitivity prediction model.

Wherein the method embodiment may also be implemented by the apparatus 100.

In this embodiment, in step S201, the apparatus 100 may obtain sample data of different cells obtained by using the method embodiment shown in fig. 1, including each cell and its corresponding third feature data and drug sensitive data, to construct a first sample data set, and partition the sample data in the first sample data set to obtain a first training sample data set and a first test sample data set. For example, sample data in a first sample data set is divided according to the proportion of 8:2, wherein 80% of samples are divided into a first training sample data set for training of a machine learning model, and 20% of samples are divided into a first test sample data set for testing the training effect of the machine learning model.

The first sample data set is partitioned into a first training sample data set and a first test sample data set using a KS (Kolmogorov-Smirnov) test.

The first sample data set is divided into a first training sample data set and a first test sample data set by random division, and a KS test is performed after the division; if the division does not pass the KS test, the random division is repeated and the KS test is performed again, until the KS test is passed. Using the KS test in the division ensures that the sample data in the first training sample data set and the first test sample data set follow the same distribution, which avoids the problem of data shift, preserves the randomness of the sample data division, and increases the reliability of the model evaluation.
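The repeated random division with a KS check can be sketched as below. The application does not state which quantity is compared or which significance level is used, so this sketch assumes the check is applied to one numeric value per sample (selected by the hypothetical `key` argument, for example a drug sensitivity value) and uses the usual large-sample critical value at roughly the 5% level (`c_alpha=1.358`). All function and parameter names are illustrative.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_split(samples, key, train_frac=0.8, c_alpha=1.358, max_tries=1000, seed=0):
    """Randomly split until train and test pass the KS check on key(sample)."""
    rng = random.Random(seed)
    n_train = int(len(samples) * train_frac)
    for _ in range(max_tries):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        d = ks_statistic([key(s) for s in train], [key(s) for s in test])
        # Large-sample critical value (assumed significance level of about 5%).
        critical = c_alpha * math.sqrt(
            (len(train) + len(test)) / (len(train) * len(test)))
        if d <= critical:
            return train, test
    raise RuntimeError("no split passed the KS check")
```

A call such as `ks_split(samples, key=lambda s: s["auc"])` returns an 8:2 split whose two parts are not distinguishable by the KS check on the chosen value.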

In this embodiment, in step S202, the apparatus 100 trains a plurality of preset machine learning models based on the first training sample data set. When the MSE of a preset machine learning model after training meets a preset threshold, the apparatus 100 tests the trained model using the first test sample data set, and if the MSE on the test also meets the preset threshold, the training of that preset machine learning model is completed. Each preset machine learning model is traversed in this way until the training of every preset machine learning model is completed, yielding a plurality of trained machine learning models.

Existing drug sensitivity prediction models are usually built on a single machine learning model, so feature information that the model does not handle well is ignored and the prediction accuracy is low. In this embodiment, a plurality of machine learning models with preset hyper-parameter sets are selected; each machine learning model is trained with the first training sample data set, and its training effect is then tested with the first test sample data set.

For example, the plurality of preset machine learning models may be existing classical base machine learning models that have been verified to be effective in one or more application ranges, and may include the following 13 specific machine learning models in 7 broad classes, where different names within the same broad class denote different selected classical hyper-parameter sets:

1. KNeighborsUnif and KNeighborsDist in the KNN (K-nearest neighbor) model

2. LightGBM, LightGBMXT, and LightGBMLarge in the LightGBM (Light Gradient Boosting Machine) model

3. RandomForestGini and RandomForestEntr in the random forest model

4. ExtraTreesGini and ExtraTreesEntr in the ExtraTrees (extremely randomized trees) model

5. CatBoost

6. XGBoost

7. NeuralNetFastAI and NeuralNetTorch in artificial neural networks

Optionally, in step S202, the training a plurality of preset machine learning models based on the first training sample data set includes:

the training is repeated for a preset number of times.

When training each preset machine learning model, a cross-validation method can be adopted. Illustratively, the first training sample data set is divided into k subsets; each time, (k-1) subsets are used for training and the remaining subset is used for evaluation.

A bootstrap sampling method is adopted for the (k-1) subsets. Illustratively, training samples are drawn from the subsets by bootstrap sampling to construct training sample sets for the machine learning model. If m rounds are to be drawn, several training sample data are drawn in each round as a training sample subset (each drawn sample is put back, so some sample data may be drawn multiple times within a subset and some may not be drawn at all), and m rounds are performed in total to obtain m mutually independent training sample subsets. The preset machine learning model is trained with one training sample subset at a time to obtain one training model, so that the m training sample subsets yield m training models of the machine learning model. The mean of the m training models is taken as the trained model of the machine learning model.
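The combination of cross-validation and bootstrap sampling described above can be sketched as follows. "The mean of the m training models" is interpreted here as averaging the predictions of the m models, which is an assumption; `fit_mean_model` is a deliberately trivial placeholder learner, and all names are hypothetical rather than taken from this application.

```python
import random

def fit_mean_model(rows):
    """Placeholder learner: predicts the mean target of its training rows."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean

def bagged_fit(rows, fit, m=5, rng=None):
    """Train m models on bootstrap resamples and average their predictions."""
    rng = rng or random.Random(0)
    # Sampling with replacement: a row may appear several times or not at all.
    models = [fit([rng.choice(rows) for _ in rows]) for _ in range(m)]
    return lambda x: sum(model(x) for model in models) / m

def cross_validated_mse(rows, fit, k=5, m=5, seed=0):
    """k-fold cross-validation; each fold trains on bootstrap samples
    of the other k-1 subsets and is evaluated on the held-out subset."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    fold_mse = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = bagged_fit(train, fit, m=m, rng=rng)
        errors = [(model(x) - y) ** 2 for x, y in held_out]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k
```

Here `rows` is a list of `(features, target)` pairs; any learner with the same `fit(rows) -> predict` shape could replace the placeholder.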

In the training process of a preset machine learning model, training can be repeated a preset number of times, using the cross-validation and bootstrap sampling methods, in the direction of decreasing MSE (Mean Square Error), or training can be repeated until the MSE no longer decreases.

Illustratively, using cross-validation and bootstrap sampling, training sample data is obtained from the first training sample data set and the 13 preset machine learning models in the 7 classes are trained, with training repeated a preset number of times. When the MSE of each trained preset machine learning model meets the preset threshold and each trained model passes the test on the first test sample data set, the training of the 13 preset machine learning models in the 7 classes is completed, and 13 trained machine learning models are obtained.

Continuing in this embodiment, in step S203, the apparatus 100 fuses each preset machine learning model that has completed training to obtain a first drug susceptibility prediction model.

The trained machine learning models may differ in prediction performance across drugs, and the accuracy of drug sensitivity prediction for different drugs based on any single trained model is not high. The plurality of trained machine learning models can therefore be fused into one model, in which the weights (contribution degrees) of the individual models may differ for the prediction of different drugs, so that the accuracy of drug sensitivity prediction of the fused model is improved. In the fusion process, if a machine learning model does not contribute to the fused model, its weight is 0.

Optionally, in step S203, the fusing each preset machine learning model that completes training to obtain the first drug sensitivity prediction model includes:

Linear fusion is performed on the trained preset machine learning models by a greedy forward selection method, the weight of each trained machine learning model is determined, and the fused first drug sensitivity prediction model is obtained.
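One common form of greedy forward selection for linear fusion is sketched below: starting from an empty ensemble, the model whose addition (with replacement) most reduces the MSE of the averaged prediction is added in each round, and each model's weight is its share of the selections. The application does not detail its selection procedure, so this particular variant, the `rounds` parameter, and the function names are assumptions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def greedy_forward_weights(predictions, target, rounds=20):
    """predictions: dict model name -> list of held-out predictions.

    Returns dict model name -> weight; weights sum to 1, and a model
    that is never selected keeps weight 0.
    """
    names = list(predictions)
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(target)
    for step in range(1, rounds + 1):
        best_name, best_err = None, float("inf")
        for name in names:
            # Error of the ensemble average if this model were added now.
            blended = [(s + p) / step
                       for s, p in zip(running_sum, predictions[name])]
            err = mse(blended, target)
            if err < best_err:
                best_name, best_err = name, err
        counts[best_name] += 1
        running_sum = [s + p
                       for s, p in zip(running_sum, predictions[best_name])]
    return {name: counts[name] / rounds for name in names}
```

The fused prediction for a new cell is then the weighted sum of the individual model predictions using these weights.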

If too much feature data is used in the sample data, it often occupies too many resources of the equipment 100, such as hard disk and memory; model training may take too long and overfitting may occur; the accuracy of drug sensitivity prediction of the trained fusion model is not remarkably improved and may even be reduced by low-quality feature data, so the practical application value of the model may be low. In drug sensitivity prediction, a small number of gene-related features often contribute the most, so the feature data in the sample data can be screened to select the most important gene-related feature data, and the sample data can be constructed from the most important feature data.

Optionally, fig. 3 shows a flowchart of a method for drug sensitive prediction model sample dataset construction according to another aspect of the present application, wherein the method comprises:

S301, replacing a characteristic value of one characteristic of third characteristic data of different samples in the first sample data set to obtain a second sample data set;

S302, predicting the first sample data set and the second sample data set by adopting the first drug sensitivity prediction model, and obtaining an importance score of the characteristic according to the change of the accuracy of the prediction result;

S303, traversing each feature of the third feature data of the samples in the first sample data set, and repeating the above steps to obtain an importance score of each feature;

S304, based on the first sample data set, selecting a preset number of features with the highest importance scores, and constructing a third sample data set, wherein each sample data in the third sample data set comprises the cell and the feature data corresponding to the preset number of features with the highest importance scores in the third feature data.

Wherein the method embodiment may also be implemented by the apparatus 100. By randomly shuffling the values of one feature across the samples, the feature data corresponding to that feature changes in each sample while the other feature data remains unchanged. Comparing the accuracy of the first drug sensitivity prediction model on the shuffled sample data set with its accuracy on the original sample data set then shows how much the prediction accuracy changes, which determines the importance of the feature.

In this embodiment, in step S301, the apparatus 100 may replace the feature value of one feature of the third feature data of different samples in the first sample data set obtained by the method embodiment shown in fig. 2, to obtain the second sample data set.

The feature value of a feature of the third feature data of each sample in the first sample data set may be randomly replaced to obtain the second sample data set.

Continuing with this embodiment, in step S302, the apparatus 100 predicts the first sample data set and the second sample data set by using the first drug sensitivity prediction model obtained in the foregoing embodiment of the method shown in fig. 2, and obtains the importance score of the feature according to the variation of the accuracy of the prediction result.

The apparatus 100 uses the first drug sensitivity prediction model to make predictions from the third feature data of each cell in the first sample data set obtained by the method embodiment shown in fig. 2, compares the prediction result with the corresponding drug sensitivity data to obtain the drug sensitivity prediction deviation of each cell, and calculates the MSE of the drug sensitivity prediction deviations of all cells as the accuracy of the prediction result for the first sample data set. The apparatus 100 likewise uses the first drug sensitivity prediction model to make predictions from the third feature data of each cell in the second sample data set, compares the prediction result with the corresponding drug sensitivity data to obtain the drug sensitivity prediction deviation of each cell, and calculates the MSE of the drug sensitivity prediction deviations of all cells as the accuracy of the prediction result for the second sample data set. The accuracy for the first sample data set is compared with the accuracy for the second sample data set, and the difference between the two is taken as the importance score of the feature.

Continuing in this embodiment, in step S303, the device 100 traverses each feature of the third feature data of the samples in the first sample dataset, and repeats steps S301 and S302, and may obtain an importance score for each feature of the third feature data in the samples in the first sample dataset.

Continuing in this embodiment, in step S304, according to the importance score of each feature of the third feature data obtained in step S303, a preset number of features with the highest importance scores are selected. For each sample, these features and their data in the third feature data are retained and the other features and their data are deleted, giving new feature data for each cell. The cell, its new feature data and its corresponding drug sensitive data are used as new sample data, and a third sample data set is constructed with the same number of samples as the first sample data set. Each sample data in the third sample data set thus includes the cell, the feature data of the preset number of features with the highest importance scores in the third feature data, and the corresponding drug sensitive data.
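Steps S301 to S304 can be sketched as a permutation-importance loop. In this sketch, `model` is any callable that maps a feature dictionary to a predicted drug sensitivity value and stands in for the first drug sensitivity prediction model; the importance score is taken as the MSE on the shuffled data minus the MSE on the original data. The function names and the data layout are hypothetical.

```python
import random

def mse(model, rows):
    """Mean squared prediction error over (features, target) pairs."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

def permutation_importance(model, rows, seed=0):
    """rows: list of (feature dict, drug sensitivity value) pairs."""
    rng = random.Random(seed)
    baseline = mse(model, rows)
    scores = {}
    for name in rows[0][0]:
        # S301: shuffle one feature's values across samples, keep the rest.
        column = [x[name] for x, _ in rows]
        rng.shuffle(column)
        shuffled_rows = [({**x, name: v}, y)
                         for (x, y), v in zip(rows, column)]
        # S302: importance = increase in MSE once this feature is scrambled.
        scores[name] = mse(model, shuffled_rows) - baseline
    return scores  # S303: one score per feature

def top_features(scores, k):
    """S304: names of the k features with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The third sample data set is then built by keeping, for each sample, only the features returned by `top_features`.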

Training the plurality of preset machine learning models of the method embodiment shown in fig. 2 with the third sample data set can yield a drug sensitivity prediction model that has the expected prediction accuracy at an acceptable resource cost, and therefore has practical application value.

Optionally, fig. 4 shows a flowchart of a method for drug sensitivity prediction model construction according to another aspect of the present application, wherein the method comprises:

S401, dividing the third sample data set into a third training sample data set and a third test sample data set;

S402, training the plurality of preset machine learning models based on the third training sample data set; when the MSE of a preset machine learning model meets a preset threshold, testing the model based on the third test sample data set; and if the MSE on the test also meets the preset threshold, completing the training of that preset machine learning model;

S403, fusing each preset machine learning model after training to obtain a second drug sensitivity prediction model.

Wherein the method embodiment may also be implemented by the apparatus 100. In this embodiment of the method, the operations in steps S401 to S403 are similar to or the same as the corresponding operations in steps S201 to S203 in the embodiment of the method and/or the optional embodiment shown in fig. 2, and are not repeated herein. By the method, a second drug sensitivity prediction model can be obtained.

The obtained second drug susceptibility prediction model can be used for drug susceptibility prediction of cells to be detected.

Optionally, fig. 5 shows a flowchart of a method of drug susceptibility prediction according to yet another aspect of the application, wherein the method comprises:

S501, obtaining gene expression data of a cell to be detected, and obtaining, based on the gene expression data of the cell to be detected, a score of the cell to be detected on each gene set of a gene set database;

S502, normalizing the scores of each gene set to obtain first characteristic data of the cell to be detected, and obtaining second characteristic data of the cell to be detected based on the first characteristic data;

S503, performing aggregate processing on the first characteristic data and the second characteristic data to obtain third characteristic data of the cell to be detected;

S504, inputting the third characteristic data of the cell to be detected into the second drug sensitivity prediction model so as to predict the drug sensitivity data of the cell to be detected.

Wherein the method embodiment may also be implemented by the apparatus 100.

In this embodiment, in step S501, the apparatus 100 acquires gene expression data of a cell to be tested, and obtains a score of the cell to be tested on each gene set of the gene set database based on the gene expression data of the cell to be tested.

The cell to be detected may be a cell line in a particular culture state in a biological experiment or a cell derived from a patient; patient cells can be obtained by clinical means such as surgery or biopsy. After collection, the cell to be detected is subjected to transcriptome sequencing to obtain its gene expression data. The apparatus 100 compares the gene expression data of the cell to be detected with each gene set in the gene set database to obtain a score of the cell to be detected on each gene set in the gene set database. Illustratively, the score of the cell to be detected on each gene set in the MsigDB database can be obtained by comparing its gene expression data with each gene set in the MsigDB database using the ssGSEA method. When the gene expression data of the cell to be detected is compared with the gene data of each gene set in the MsigDB database, if it does not contain data for one or more genes of a gene set, data for those genes is supplemented in the gene expression data of the cell to be detected, so as to obtain structurally complete gene expression data of the cell to be detected.

In this embodiment, in step S502, the apparatus 100 normalizes the score of each gene set in the gene set database corresponding to the cell to be tested to obtain the first feature data of the cell to be tested, and obtains the second feature data of the cell to be tested based on the first feature data.

In order to facilitate subsequent data processing, the score on each gene set of the gene set database corresponding to the cell to be detected is normalized to obtain first characteristic data of the cell to be detected. For example, if the gene set database has n gene sets, or n gene sets of the gene set database are selected, and scores of the gene expression data of the test cell on the n gene sets are obtained, the first characteristic data of the test cell may include: n characteristics, namely the normalized scores of the gene expression data of the cells to be detected on n gene sets, wherein the specific numerical value of each normalized score is the characteristic value of each characteristic. The first characteristic data of the cell to be detected can be mathematically processed to obtain second characteristic data of the cell to be detected. Illustratively, the mathematical process may include at least one of: square arithmetic, cross arithmetic, cubic arithmetic, natural logarithm arithmetic and box-cox transform.

Continuing in this embodiment, in step S503, the apparatus 100 processes the first characteristic data and the second characteristic data of the cell to be tested to obtain the third characteristic data of the cell to be tested.

The device 100 aggregates the first characteristic data and the second characteristic data of the cell to be detected into one data set, so as to obtain the third characteristic data of the cell to be detected.

Continuing in this embodiment, in step S504, the apparatus 100 inputs the third characteristic data of the test cell into the second drug sensitivity prediction model of the method embodiment shown in fig. 4 to predict the drug sensitivity data of the test cell.

The obtained third characteristic data of the test cell can be input into the second drug sensitivity prediction model of the method embodiment shown in fig. 4 to predict the drug sensitivity data of the test cell. It is also possible to first screen a predetermined number of feature data with the highest importance score from the third feature data of the test cell, and then input the predetermined number of feature data with the highest importance score into the second drug sensitivity prediction model of the embodiment of the method shown in fig. 4 to predict the drug sensitivity data of the test cell.

FIG. 6 shows a schematic diagram of an apparatus for drug sensitivity prediction model sample construction according to yet another aspect of the present application, wherein the apparatus of an embodiment comprises:

a first module 601, configured to obtain gene expression data of cells and drug sensitivity data thereof;

a second module 602 for obtaining a score for the cell on each gene set of a database of gene sets based on gene expression data for the cell;

a third module 603, configured to perform normalization processing on the score of each gene set to obtain first feature data of the cell, and obtain second feature data of the cell based on the first feature data;

a fourth module 604, configured to aggregate the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and use the cell, the third characteristic data corresponding to the cell, and the drug sensitive data as sample data to construct a sample of a drug sensitive prediction model.

Wherein the apparatus is disposed in the device 100 for implementing the above method embodiments and/or optional embodiments, and the apparatus may be a software apparatus or a software and hardware combination apparatus.

In this embodiment, the first module 601 of the device obtains gene expression data and drug sensitivity data of a certain cell. The gene expression data of a cell can be obtained from a public data set, for example, the gene expression data of the cell in the CCLE database or the CellMiner database. The drug sensitive data of the cell can be the drug sensitive data of the cell in a drug sensitivity database such as the CTRP database or the GDSC database. Alternatively, a cell can be obtained experimentally: the cell is subjected to transcriptome sequencing through clinical experiments to obtain its gene expression data, and a drug sensitivity test is performed to obtain its drug sensitivity data. If the gene expression data of a cell and its corresponding drug sensitive data cannot both be obtained, the cell is not selected.

Continuing in this example, the second module 602 of the apparatus performs a gene data comparison with each gene set in the gene set database based on the gene expression data of the cell to obtain a score for the cell on each gene set in the gene set database. Illustratively, a score for the cell on each gene set in the MsigDB database can be obtained using the ssGSEA method against each gene set in the MsigDB database based on the gene expression data for the cell.

Continuing in this example, the third module 603 of the apparatus normalizes the score of the cell on each gene set of the database of gene sets to obtain first feature data of the cell, and obtains second feature data of the cell based on the first feature data.

To facilitate subsequent data processing, the scores on each gene set of the gene set database can be normalized to obtain the first characteristic data of the cell, which makes the data dimensionless and reduces the influence of the specific cell and/or the acquisition process on the scores. For example, if the gene set database comprises n gene sets, or n gene sets of the gene set database are selected, and scores of the gene expression data of the cell on the n gene sets are obtained, the first characteristic data of the cell may include n features, namely the normalized scores of the gene expression data of the cell on the n gene sets, where the numerical value of each normalized score is the feature value of the corresponding feature. Mathematical processing is then performed on the first characteristic data of the cell to obtain the second characteristic data of the cell. Illustratively, the mathematical processing may include at least one of: square operation, cross operation, cubic operation, natural logarithm operation, and box-cox transformation.

In this embodiment, the fourth module 604 of the apparatus processes the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and uses the cell and the third characteristic data and the drug sensitive data corresponding to the cell as sample data to construct a sample of the drug sensitive prediction model. The fourth module 604 aggregates the first characteristic data and the second characteristic data of the cell into one data set, so as to obtain the third characteristic data of the cell. The cell, its corresponding third characteristic data and its drug sensitivity data are then taken as sample data to construct a sample of the drug sensitivity prediction model, and the sample is used to train a machine learning model to obtain the drug sensitivity prediction model.

With this apparatus, sample data of different cells can be obtained and a sample data set constructed; the sample data set can be used to train a machine learning model to obtain a drug sensitivity prediction model. The sample data comprises: a cell identification, such as the name of the cell; the third characteristic data; and the drug sensitivity data corresponding to the cell.

Optionally, the apparatus further comprises:

A fifth module of the apparatus may obtain the sample data of different cells produced by the method embodiment shown in fig. 1, including each cell and its corresponding third characteristic data and drug sensitivity data, to construct a first sample data set, and may divide the sample data in the first sample data set into a first training sample data set and a first test sample data set.

A sixth module of the apparatus trains a plurality of preset machine learning models, each with a preset hyperparameter set, using training samples in the first training sample data set. When the MSE (mean squared error) of a preset machine learning model after training meets a preset threshold, the trained model is tested using the first test sample data set; if the MSE on the test set also meets the preset threshold, training of that preset machine learning model is complete. Each preset machine learning model is traversed in this way until training of every preset machine learning model is complete, yielding a plurality of trained machine learning models. For example, the preset machine learning models may be chosen from existing classical base machine learning models that have been verified to be effective in one or more application ranges, and may include the following 7 major classes and 13 specific machine learning models, where different names within the same major class denote different preset classical hyperparameter sets:

1. KNeighborsUnif and KNeighborsDist in the KNN (K-nearest neighbors) model

2. LightGBM, LightGBMXT, and LightGBMLarge in the LightGBM (Light Gradient Boosting Machine) model

3. RandomForestGini and RandomForestEntr in the random forest model

4. ExtraTreesGini and ExtraTreesEntr in the ExtraTrees model

5. CatBoost

6. XGBoost

7. NeuralNetFastAI and NeuralNetTorch in artificial neural networks

Illustratively, the sixth module uses cross-validation and bootstrap sampling to draw training sample data from the first training sample data set and trains the 13 preset machine learning models of the 7 classes, repeating the training a preset number of times. When the MSE of each trained preset machine learning model meets the preset threshold and each trained model passes the test on the first test sample data set, training of the 13 preset machine learning models is complete, and 13 trained machine learning models are obtained.
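The combination of cross-validation, bootstrap sampling, and an MSE threshold can be sketched as follows. This is only an illustration under assumptions: a tiny hand-written KNN regressor stands in for the preset models, the mapping of the names KNeighborsUnif and KNeighborsDist to uniform and distance weighting is assumed, the threshold value and the toy data are hypothetical, and only the cross-validation gate is shown (the subsequent test on the first test sample data set would apply the same MSE check to held-out data).

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def knn_predict(train_X, train_y, x, k=3, weighted=False):
    """Tiny KNN regressor; weighted=True gives closer neighbors more weight."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)) ** 0.5, label)
        for row, label in zip(train_X, train_y)
    )[:k]
    weights = [1.0 / (d + 1e-9) if weighted else 1.0 for d, _ in nearest]
    return sum(w * label for w, (_, label) in zip(weights, nearest)) / sum(weights)

def cross_val_mse(X, y, k_folds=4, seed=0, **preset):
    """k-fold cross-validation; each fold trains on a bootstrap resample."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    fold_errors = []
    for f in range(k_folds):
        held_out = idx[f::k_folds]
        train = [i for i in idx if i not in held_out]
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        boot_X, boot_y = [X[i] for i in boot], [y[i] for i in boot]
        preds = [knn_predict(boot_X, boot_y, X[j], **preset) for j in held_out]
        fold_errors.append(mse([y[j] for j in held_out], preds))
    return sum(fold_errors) / k_folds

# Hypothetical toy data: one feature, label proportional to the feature.
X = [[i / 20] for i in range(20)]
y = [2.0 * row[0] for row in X]
MSE_THRESHOLD = 0.05  # assumed preset threshold
presets = {"KNeighborsUnif": {"weighted": False}, "KNeighborsDist": {"weighted": True}}
cv_mse = {name: cross_val_mse(X, y, **params) for name, params in presets.items()}
passed = [name for name, score in cv_mse.items() if score <= MSE_THRESHOLD]
```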

A seventh module of the apparatus fuses the trained preset machine learning models to obtain a first drug sensitivity prediction model. Illustratively, a greedy forward selection method is used to linearly fuse the trained preset machine learning models and to determine the weight of each trained machine learning model in the fused first drug sensitivity prediction model.
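One common form of greedy forward selection for linear fusion is sketched below. It is an assumed variant, not necessarily the one used in the present application: at each round the model whose addition (with replacement) gives the lowest MSE of the averaged prediction is selected, and the final weight of each model is its selection frequency. The function names, the number of rounds, and the example predictions are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def greedy_forward_fusion(preds_by_model, y_true, rounds=10):
    """Greedy forward selection with replacement; weights are selection frequencies."""
    names = list(preds_by_model)
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(y_true)
    for step in range(1, rounds + 1):
        def blended_mse(name):
            # MSE of the ensemble average if this model were added now.
            blend = [(s + p) / step for s, p in zip(running_sum, preds_by_model[name])]
            return mse(y_true, blend)
        best = min(names, key=blended_mse)
        counts[best] += 1
        running_sum = [s + p for s, p in zip(running_sum, preds_by_model[best])]
    return {name: count / rounds for name, count in counts.items()}

def fused_predict(weights, preds_by_model):
    """Weighted linear combination of the individual model predictions."""
    n = len(next(iter(preds_by_model.values())))
    return [sum(weights[m] * preds_by_model[m][i] for m in weights) for i in range(n)]

# Hypothetical validation labels and predictions from three trained models.
y_val = [1.0, 2.0, 3.0]
preds = {
    "model_a": [1.5, 2.5, 3.5],  # biased high
    "model_b": [0.5, 1.5, 2.5],  # biased low
    "model_c": [3.0, 3.0, 3.0],  # uninformative
}
weights = greedy_forward_fusion(preds, y_val)
fused = fused_predict(weights, preds)
```

In this example the two oppositely biased models receive equal weight and the uninformative model receives none, so the fused prediction is closer to the labels than any single model.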

If the sample data contains too much feature data, it tends to occupy excessive resources of the device 100, such as hard disk and memory; model training may take too long, overfitting may occur, and the accuracy of drug sensitivity prediction by the trained fusion model may not improve significantly, so the practical value of the model may be low. In drug sensitivity prediction, attention is often focused on the contribution of a small number of gene-related features. The feature data in the sample data can therefore be screened to retain the most important gene-related feature data, and the sample data can be constructed from that feature data.

Optionally, the apparatus further comprises:

An eighth module of the apparatus may replace the feature value of one feature of the third feature data in the samples of the first sample data set obtained by the method embodiment shown in fig. 2, to obtain a second sample data set. For example, the feature value of that feature in each sample of the first sample data set may be replaced at random to obtain the second sample data set.

A ninth module of the apparatus may use the first drug sensitivity prediction model obtained in the embodiment shown in fig. 2 to predict on the first sample data set and the second sample data set, and obtain the importance score of the feature from the change in the accuracy of the prediction results. Specifically, the first drug sensitivity prediction model predicts from the third feature data of each cell in the first sample data set; each prediction is compared with the corresponding drug sensitivity data to obtain the drug sensitivity prediction deviation of that cell, and the MSE of the deviations over all cells is taken as the accuracy of the prediction result of the first sample data set. The apparatus 100 likewise predicts from the third feature data of each cell in the second sample data set, compares each prediction with the corresponding drug sensitivity data to obtain the drug sensitivity prediction deviation of that cell, and takes the MSE of the deviations over all cells as the accuracy of the prediction result of the second sample data set. The difference between the two accuracies is taken as the importance score of the feature.

A tenth module of the apparatus traverses each feature of the third feature data of the samples in the first sample data set and repeats the operations of the eighth module and the ninth module, to obtain the importance score of each feature of the third feature data in the samples of the first sample data set.
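The replace-predict-compare loop of the eighth to tenth modules corresponds to what is commonly called permutation importance, and can be sketched as follows. The sketch assumes that "random replacement" means shuffling the values of one feature among the samples; the `predict` callable stands in for the first drug sensitivity prediction model, and the data are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, seed=0):
    """Importance of feature j = MSE after shuffling column j minus baseline MSE."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])  # first sample data set
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # assumed form of random replacement
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]  # second set
        permuted = mse(y, [predict(row) for row in X_perm])
        scores.append(permuted - baseline)
    return scores

# Hypothetical model that depends only on feature 0.
def toy_model(row):
    return 3.0 * row[0]

X = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [3.0 * row[0] for row in X]
importance = permutation_importance(toy_model, X, y)
```

Here the second feature receives a score of exactly zero because the toy model ignores it, while shuffling the first feature increases the MSE.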

An eleventh module of the apparatus selects a preset number of features with the highest importance scores, based on the importance score of each feature of the third feature data obtained by the tenth module. In each sample, it retains those features and their data in the third feature data and deletes the other features and their data, yielding new feature data for each cell. The cell, its new feature data, and its drug sensitivity data are taken as new sample data, and a third sample data set is constructed with the same number of samples as the first sample data set. Each sample in the third sample data set thus comprises the cell, the feature data of the preset number of highest-scoring features in its third feature data, and the corresponding drug sensitivity data.
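The screening step can be illustrated with a short sketch. The dictionary layout of a sample, the helper names, and the example scores are assumptions for illustration; the preset number of features is passed as `k`.

```python
def select_top_features(importance, k):
    """Return the indices of the k features with the highest importance scores."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:k])  # keep the original feature order

def reduce_samples(samples, keep):
    """Build the third sample data set: same samples, only the kept features."""
    return [
        {
            "cell": s["cell"],
            "features": [s["features"][j] for j in keep],
            "label": s["label"],
        }
        for s in samples
    ]

# Hypothetical importance scores and first sample data set.
importance = [0.02, 0.40, 0.00, 0.15]
first_set = [
    {"cell": "cell_A", "features": [0.1, 0.2, 0.3, 0.4], "label": 2.7},
    {"cell": "cell_B", "features": [0.5, 0.6, 0.7, 0.8], "label": 1.9},
]
keep = select_top_features(importance, k=2)
third_set = reduce_samples(first_set, keep)
```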

Optionally, the apparatus further comprises:

A twelfth module of the apparatus divides the sample data in the third sample data set into a third training sample data set and a third test sample data set. For example, the third sample data set may be divided at random into the third training sample data set and the third test sample data set, followed by a KS (Kolmogorov-Smirnov) test. If the test does not pass, the random division and the KS test are repeated until the KS test passes.
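The repeat-until-pass division can be sketched as below. Several details are assumptions, since the text does not specify them: the KS test is applied to the drug sensitivity values of the training and test subsets, the two-sample statistic is compared with the standard large-sample critical value at a significance level of 0.05, and the test fraction and retry limit are hypothetical parameters.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    d = 0.0
    for v in list(a) + list(b):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_passes(a, b, c_alpha=1.358):
    """Large-sample check; c_alpha = 1.358 corresponds to alpha = 0.05."""
    critical = c_alpha * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return ks_statistic(a, b) <= critical

def split_until_ks_passes(labels, test_fraction=0.2, seed=0, max_tries=100):
    """Randomly split sample indices, retrying until the KS test passes."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    n_test = max(1, int(len(idx) * test_fraction))
    for _ in range(max_tries):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        if ks_passes([labels[i] for i in train], [labels[i] for i in test]):
            return sorted(train), sorted(test)
    raise RuntimeError("no split passed the KS test")

# Hypothetical drug sensitivity values for 50 samples.
labels = [i / 50 for i in range(50)]
train_idx, test_idx = split_until_ks_passes(labels)
```

Passing the test indicates that the label distributions of the two subsets are not detectably different, so the test set is representative of the training set.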

A thirteenth module of the apparatus trains a plurality of preset machine learning models using the third training sample data set. When the MSE of a preset machine learning model after a round of training meets a preset threshold, the trained model is tested using the third test sample data set; if the MSE on the test set also meets the preset threshold, training of that preset machine learning model is complete. Each preset machine learning model is traversed in this way until training of every preset machine learning model is complete, yielding a plurality of trained machine learning models. For example, the preset machine learning models may be chosen from existing classical base machine learning models that have been verified to be effective in one or more application ranges, and may include the following 7 major classes and 13 specific machine learning models, where different names within the same major class denote different preset classical hyperparameter sets:

1. KNeighborsUnif and KNeighborsDist in the KNN (K-nearest neighbors) model

2. LightGBM, LightGBMXT, and LightGBMLarge in the LightGBM (Light Gradient Boosting Machine) model

3. RandomForestGini and RandomForestEntr in the random forest model

4. ExtraTreesGini and ExtraTreesEntr in the ExtraTrees model

5. CatBoost

6. XGBoost

7. NeuralNetFastAI and NeuralNetTorch in artificial neural networks

Illustratively, the thirteenth module uses cross-validation and bootstrap sampling to draw training sample data from the third training sample data set and trains the 13 preset machine learning models of the 7 classes, repeating the training a preset number of times. When the MSE of each trained preset machine learning model meets the preset threshold and each trained model passes the test on the third test sample data set, training of the 13 preset machine learning models is complete, and 13 trained machine learning models are obtained.

A fourteenth module of the apparatus fuses the trained preset machine learning models to obtain a second drug sensitivity prediction model. Illustratively, a greedy forward selection method may be used to linearly fuse the trained preset machine learning models, determine the weight of each trained machine learning model, and obtain the fused second drug sensitivity prediction model.

Optionally, the apparatus further comprises:

A fifteenth module of the apparatus acquires gene expression data of a cell to be detected and compares it with each gene set in the gene set database to obtain a score of the cell to be detected on each gene set in the gene set database. Illustratively, the ssGSEA method may be used to compare the gene expression data of the cell to be detected with each gene set in the MSigDB database, yielding the score of the cell to be detected on each gene set in MSigDB. During this comparison, if the gene expression data of the cell to be detected lacks data for one or more genes of a gene set, data for those genes is supplemented so that the gene expression data of the cell to be detected is structurally complete.
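The supplementing step can be sketched as follows. The fill value used for a missing gene is an assumption (0.0 here; the text above does not state which value is supplemented), and the gene set names and expression values are hypothetical. The ssGSEA scoring itself is not shown.

```python
def complete_expression(expression, gene_sets, fill_value=0.0):
    """Add every gene that appears in a gene set but is missing from the data."""
    completed = dict(expression)  # leave the input unchanged
    for genes in gene_sets.values():
        for gene in genes:
            completed.setdefault(gene, fill_value)  # assumed fill value
    return completed

# Hypothetical expression data and gene sets.
expression = {"TP53": 5.1, "EGFR": 2.3}
gene_sets = {
    "SET_A": ["TP53", "MYC"],
    "SET_B": ["EGFR", "KRAS", "TP53"],
}
completed = complete_expression(expression, gene_sets)
```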

A sixteenth module of the apparatus normalizes the score of the cell to be detected on each gene set in the gene set database to obtain the first characteristic data of the cell to be detected. For example, if the gene set database has n gene sets, or n gene sets are selected from the gene set database, and scores of the gene expression data of the cell to be detected on the n gene sets are obtained, the first characteristic data of the cell to be detected may include n features, namely the normalized scores of the gene expression data of the cell to be detected on the n gene sets, where the numerical value of each normalized score is the feature value of the corresponding feature. The first characteristic data of the cell to be detected may then be mathematically processed to obtain the second characteristic data of the cell to be detected. Illustratively, the mathematical processing may include at least one of: square operation, cross operation, cubic operation, natural logarithm operation, and Box-Cox transform.

A seventeenth module of the apparatus collects the first characteristic data and the second characteristic data of the cell to be detected into one data set to obtain the third characteristic data of the cell to be detected.

An eighteenth module of the apparatus inputs the third characteristic data of the cell to be detected into the second drug sensitivity prediction model of the method embodiment shown in fig. 4 to predict the drug sensitivity data of the cell to be detected. In addition, the same preset number of features with the highest importance scores as in the method embodiment shown in fig. 3 may first be selected from the third characteristic data of the cell to be detected, and the feature data of those features then input into the second drug sensitivity prediction model of the method embodiment shown in fig. 4 to predict the drug sensitivity data of the cell to be detected.
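The prediction step for a cell to be detected can be sketched as follows, assuming the second drug sensitivity prediction model is a weighted linear fusion of trained models and that the retained feature indices are known from the screening step. The two toy models, their weights, and the feature values are hypothetical stand-ins.

```python
def predict_drug_sensitivity(third_features, keep, models, weights):
    """Select the retained features, then take the weighted sum of model outputs."""
    reduced = [third_features[j] for j in keep]
    return sum(weights[name] * model(reduced) for name, model in models.items())

# Hypothetical trained models and fusion weights.
models = {
    "m1": lambda f: sum(f),
    "m2": lambda f: 2.0 * f[0],
}
weights = {"m1": 0.75, "m2": 0.25}

third = [0.5, 0.1, 1.5, 0.9]  # third characteristic data of the cell to be detected
keep = [0, 2]                 # indices of the retained highest-scoring features
prediction = predict_drug_sensitivity(third, keep, models, weights)
```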

In an alternative embodiment of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to implement the foregoing method embodiments or alternative embodiments.

It is noted that the method embodiments or alternative embodiments in this application may be implemented in software and/or a combination of software and hardware. The software programs referred to in this application may be executed by a processor to implement the steps or functions of the various embodiments described above. Also, the software programs (including associated data structures) of the present application may be stored in a computer-readable recording medium.

In addition, portions of the present application may be implemented as a computer program product, such as computer program instructions, which, when executed by a computer, may invoke or provide methods and/or technical solutions in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium, and/or transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within a working memory of a computer device operating in accordance with the program instructions.

In yet another alternative embodiment of the present application, there is also provided an apparatus for drug sensitive prediction model sample construction, the apparatus comprising: a memory storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform operations as described in the foregoing method embodiments and/or alternative embodiments, and/or technical solutions.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software and/or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method for drug sensitivity prediction model sample construction, the method comprising:

acquiring gene expression data and drug sensitivity data of cells;

and processing the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell and the corresponding third characteristic data and drug sensitivity data as sample data to construct a sample of a drug sensitivity prediction model.

2. The method of claim 1, wherein the gene expression data of a cell is complemented if the or each gene on a gene set of the database of gene sets has no corresponding gene expression data before the score of the cell on each gene set of the database of gene sets is obtained.

3. The method of claim 1, wherein obtaining a score for the cell on each gene set of a database of gene sets based on the gene expression data of the cell comprises:

4. The method of claim 1, wherein obtaining second characteristic data of the cell based on the first characteristic data comprises:

square operation;

cross operation;

cubic operation;

natural logarithm operation;

box-cox transform.

5. The method of claim 1, further comprising:

6. The method of claim 5, wherein the dividing the first sample data set into a first training sample data set and a first test sample data set comprises:

7. The method of claim 5, wherein training a number of preset machine learning models based on the first set of training sample data comprises:

the training is repeated for a preset number of times.

8. The method of claim 5, wherein fusing each pre-set machine learning model after training to obtain a first drug sensitivity prediction model comprises:

9. The method of claim 5, further comprising:

predicting the first sample data set and the second sample data set by adopting the first drug sensitivity prediction model, and obtaining an importance score of the characteristics according to the change of the accuracy of a prediction result;

and selecting a preset number of features with highest importance scores based on the first sample data set, and constructing a third sample data set, wherein each sample data in the third sample data set comprises cells and feature data corresponding to the preset number of features with highest importance scores in the third feature data.

10. The method of claim 9, further comprising:

11. The method of claim 10, further comprising:

and inputting the third characteristic data of the cell to be detected into the second drug sensitivity prediction model to predict the drug sensitivity data of the cell to be detected.

12. An apparatus for drug sensitive predictive model sample construction, the apparatus comprising:

and the fourth module is used for collecting and processing the first characteristic data and the second characteristic data to obtain third characteristic data of the cell, and taking the cell, the third characteristic data corresponding to the cell and the drug sensitive data as sample data to construct a sample of the drug sensitive prediction model.

13. The apparatus of claim 12, further comprising:

a sixth module, configured to train a plurality of preset machine learning models based on the first training sample data set, perform a test based on the first test sample data set when an MSE error of the preset machine model meets a preset threshold, and complete training of each preset machine learning model if the MSE error meets the preset threshold;

14. The apparatus of claim 13, further comprising:

15. The apparatus of claim 14, further comprising:

16. The apparatus of claim 15, further comprising:

17. A computer-readable medium,

having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 11.

18. An apparatus for drug sensitive predictive model sample construction, the apparatus comprising:

one or more processors; and

a memory storing computer program instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 11.