CN112116957A - Disease subtype prediction method, system, device and medium based on small sample - Google Patents
Disease subtype prediction method, system, device and medium based on small sample
- Publication number
- CN112116957A (Application CN202010843441.3A)
- Authority
- CN
- China
- Prior art keywords
- sample
- data set
- prediction
- meta
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Abstract
The invention provides a disease subtype prediction method, system, device and medium based on small samples. The method comprises the steps of acquiring gene expression data of a first data set and predicting the gene expression data through a prediction model to obtain the disease subtype of the sample to be predicted, wherein the prediction model comprises a sample selection net, a feature selection layer and a meta learner. The method obtains a prediction model for disease subtype prediction by training the meta learner, and uses meta-learning techniques to learn from related clinical tasks and extract valuable information, so that the model generalizes well to the disease subtype prediction task. During training of the prediction model, the feature selection and sample re-weighting steps adaptively remove noisy data and alleviate the curse of dimensionality. The method can be widely applied in the technical field of machine learning.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a small-sample-based disease subtype prediction method, system, device and medium.
Background
Disease subtype prediction is the identification of subsets of similar patients that can guide treatment decisions for a particular individual. For example, over the last 15 years, 5 subtypes of breast cancer have been identified and studied intensively. At the molecular level, predicting disease subtypes from gene expression data is of great significance for improving the accuracy of disease diagnosis and for identifying potential disease-related genes. One challenging problem, however, is that gene expression data are a well-known small-sample data type, i.e. relatively few samples are available for each disease subtype. In recent years, more and more machine learning research has therefore turned toward small-sample (few-shot) learning.
However, unlike images, gene expression data are difficult to analyze because of their high dimensionality and high noise: the curse of dimensionality makes prediction more challenging because a large number of redundant features take part in the decision, and the noise that inevitably exists in gene expression data makes models prone to overfitting and degrades their generalization performance.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, an embodiment of the present invention aims to provide a disease subtype prediction method based on small samples which, by adding the two steps of feature selection and sample selection, can filter out disease-related genes and remove noisy data, thereby achieving accurate prediction of disease subtypes; the embodiments of the present invention also provide a system, a device and a medium that correspondingly implement the small-sample-based disease subtype prediction method.
In a first aspect, the embodiments of the present invention provide a disease subtype prediction method based on a small sample, which includes the following steps: obtaining gene expression data for a first dataset;
predicting the gene expression data through a prediction model to obtain a disease subtype of a first data set; the prediction model comprises a sample selection net, a feature selection layer and a meta learner;
the prediction model is obtained by training the following steps:
constructing a first sample data set; the sample data set comprises gene expression data of the second data set;
obtaining a feature weighting vector of a training data set through a feature selection layer; constructing a second sample data set according to the feature weighting vector;
inputting the second sample data set into a sample selection network to obtain a sample weight;
and training the meta-learner according to the second sample data set and the sample weight to obtain a trained prediction model.
In some embodiments of the present invention, the step of constructing the sample data set specifically includes:
acquiring a plurality of gene expression data, and obtaining a first sample data set through meta-analysis;
or acquiring a plurality of gene expression data, and integrating to obtain a first sample data set through batch correction and machine learning.
In some embodiments of the invention, the step of constructing a sample data set further comprises:
extracting a plurality of support samples and a plurality of query samples from the second data set according to the type of the gene expression data;
constructing a support set according to the support samples; and constructing a query set according to the query sample.
In some embodiments of the present invention, the step of inputting the second sample data set into the sample selection net to obtain the sample weight specifically includes:
and determining the confidence of the sample data in the second sample data set, and distributing the sample weight according to the confidence.
In some embodiments of the invention, the step of determining a confidence level for sample data in the second set of sample data further comprises:
determining a loss function for sample data in the second set of sample data, embedding the loss function in the sample data,
and fitting a weighting function through a neural network to obtain the confidence coefficient of the sample data.
In some embodiments of the present invention, the step of training the meta-learner according to the second sample data set and the sample weights to obtain a trained prediction model includes:
removing noise samples in the support set according to the sample weight; determining a sample class in a support set after denoising;
and obtaining the Euclidean distance between the samples in the denoised support set and the sample classes, and normalizing the Euclidean distance to obtain the output of the meta-learner.
In some embodiments of the present invention, the step of training the meta-learner according to the second sample data set and the sample weights to obtain a trained prediction model further includes:
and verifying the output of the meta learner according to the query set, and outputting a trained prediction model according to a test result.
In a second aspect, the present invention further provides a system for predicting disease subtypes based on small samples, including a data obtaining unit, a model building unit, and a prediction output unit, wherein:
a data acquisition unit for acquiring gene expression data of the first data set;
the model building unit is used for training to obtain a prediction model, and the prediction model comprises: the system comprises a sample selection network, a feature selection layer and a meta learner; the training step of the prediction model comprises the following steps:
constructing a first sample data set; the sample data set comprises gene expression data of the second data set; obtaining a feature weighting vector of a training data set through a feature selection layer; constructing a second sample data set according to the feature weighting vector; inputting the second sample data set into a sample selection network to obtain a sample weight; training the meta-learner according to the second sample data set and the sample weight to obtain a trained prediction model;
and the prediction output unit is used for predicting the gene expression data through the prediction model to obtain the disease subtype of the first data set sample.
In a third aspect, the present invention provides a device for disease subtype prediction based on a small sample, including:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method for small sample-based disease subtype prediction in the first aspect.
In a fourth aspect, the present invention also provides a storage medium in which a processor-executable program is stored, the processor-executable program being configured to implement the method as in the first aspect when executed by a processor.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
According to the disease subtype prediction method based on small samples, a prediction model for disease subtype prediction is obtained by training the meta learner, and meta-learning techniques are used to learn from related clinical tasks and extract valuable information, so that the model generalizes well to the disease subtype prediction task; during training of the prediction model, the feature selection and sample re-weighting steps adaptively remove noisy data and alleviate the curse of dimensionality, giving the method a clear advantage in predicting disease subtypes and identifying potential disease-related genes.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart illustrating the steps of a method for predicting disease subtype based on small samples according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the steps of a method for training a disease subtype prediction model based on small samples according to an embodiment of the present invention;
FIG. 3 is a graph of training loss of Select-ProtoNet in unbiased data versus training loss of ProtoNet in unbiased data according to an embodiment of the present invention;
FIG. 4 is a graph of the accuracy of Select-ProtoNet under unbiased data and the training accuracy of ProtoNet under unbiased data in accordance with an embodiment of the present invention;
FIG. 5 is a sample weight distribution of training data of Select-ProtoNet at 30% noise rate according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
The technical scheme provided by the embodiments of the invention addresses the problem of small-sample disease subtype prediction. A common approach for genomic data is to expand the sample size by aggregating data from multiple studies performed under comparable conditions or treatments. Owing to the complex nature of gene expression data, however, such methods often hit bottlenecks: fusing data from different platforms or experiments inevitably introduces batch effects, heterogeneity and other biases. Furthermore, existing methods typically consider only the subtype prediction task of one particular disease, and do not simultaneously consider the several clinical variables that physicians and clinicians are often interested in. To simulate the way physicians and clinicians study disease subtype prediction, the embodiments of the present invention introduce meta-learning techniques to develop new data-efficient models that can extract shared experience or knowledge from a series of related tasks and quickly transfer it to new tasks. The basic idea of the prediction model of the embodiments is therefore to learn from related clinical tasks by meta-learning techniques and extract valuable information that helps the model generalize well to the disease subtype prediction task.
In a first aspect, as shown in fig. 1, the embodiment of the present invention provides a method for predicting disease subtype based on a small sample, which mainly includes steps S01-S03:
s01, acquiring gene expression data of the first data set; in an embodiment, the first data set is a data set formed by gene expression data to be predicted.
S02, predicting the gene expression data through a prediction model to obtain the disease subtype of the sample to be predicted. In the present embodiment, the constructed model includes a sample selection net, a feature selection layer, and a meta learner. The meta learner has a prominent place in meta learning within machine learning, and many meta-learning methods have been applied to small-sample learning. Meta-learning methods can be broadly divided into two main types: the first learns a meta learner in an outer loop to initialize a base learner in an inner loop, which is then trained on a new task with few samples; the second, metric-based meta learning, learns a metric space over sample features in which effective classification can be performed with only a small number of samples. Most current meta-learning methods, however, focus only on image classification tasks, and there is a lack of systematic research on using meta learning to solve small-sample disease subtype prediction. In this embodiment, the prototypical network is selected and extended as the meta learner. In addition, the model of the embodiment adds two modules that make it robust to high-dimensional and high-noise gene expression data, namely the sample selection net and the feature selection layer.
In an embodiment, as shown in fig. 2, the training process of the prediction model includes steps S021-S024:
S021, constructing a first sample data set; the sample data set comprises gene expression data of the second data set. The constructed first sample data set is the training set of the model; the second data set consists of clinical gene expression data whose historical samples all carry labels of the corresponding disease subtypes, and the training set of the model is obtained by organizing this labelled clinical data. In the construction of the sample data set in step S021, in order to address small-sample learning, a data fusion strategy can generally be used to increase the sample size, i.e. the generalization ability of the model is improved from the data perspective so as to avoid overfitting. Specifically, step S0211 or step S0212 may be performed:
S0211, obtaining a plurality of gene expression data sets and deriving a first sample data set through meta-analysis, i.e. each data set is analyzed independently and the statistical results are finally combined to identify genes associated with the disease.
S0212, acquiring a plurality of gene expression data sets and integrating them into a first sample data set through batch correction and machine learning. This route adopts integrated analysis via data aggregation: experimental data from different platforms are treated as a single data set from the same experiment, batch effects are eliminated through a batch correction method, and the integrated data set is then analyzed with machine learning methods, as sketched below.
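As a non-authoritative illustration of the integration route of step S0212, the following Python sketch pools several expression matrices after a simple per-batch z-score standardization. The standardization is only a stand-in for a dedicated batch-correction method such as ComBat, and the function and variable names are assumptions rather than part of the embodiment.

```python
import pandas as pd

def integrate_expression_datasets(datasets):
    """Merge several gene expression matrices (samples x genes) into one
    first sample data set, using per-batch z-score standardization as a
    crude stand-in for a dedicated batch-correction method such as ComBat.

    `datasets` maps a batch name to a DataFrame whose columns are gene
    identifiers and whose rows are samples.
    """
    corrected = []
    for batch, df in datasets.items():
        # Standardize each gene within its batch so that batch-specific
        # location and scale differences are removed before pooling.
        z = (df - df.mean(axis=0)) / (df.std(axis=0) + 1e-8)
        z["batch"] = batch
        corrected.append(z)
    # Pool the corrected batches; only genes present in every batch are kept.
    merged = pd.concat(corrected, join="inner", axis=0)
    return merged.drop(columns=["batch"]), merged["batch"]
```

A caller would pass, for example, {'platform_A': df_a, 'platform_B': df_b} and receive the pooled first sample data set together with the batch label of each sample.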
In addition, the step of constructing the first sample data set in step S021 further includes steps S0213 and S0214:
S0213, extracting a plurality of support samples and a plurality of query samples from the second data set according to the type of the gene expression data;
S0214, constructing a support set from the support samples and a query set from the query samples.
Specifically, few-shot learning (FSL) builds a model from training data of known classes so that it can classify new, unseen classes from only a small number of samples. Under the FSL setting, the embodiments obtain a large sample set $D_s$ of labelled gene expression data from a set of source classes $C_s$, and a small sample set $D_t$ from a set of target classes $C_t$, from which the test task $T$ is derived; the goal of few-shot learning is to train a classification model on $C_s$ so that it generalizes well to task $T$. Further, the embodiments employ the episodic training strategy that is universally used by meta-learning-based few-shot models: a series of n-way k-shot tasks is randomly sampled from $D_s$, each defined as an episode $D=(S,Q)$, where n-way k-shot means there are $n$ classes with $k$ samples per class. $S$ is referred to as the support set, containing $n$ classes with $k$ samples each, and $Q$ is referred to as the query set, covering the same $n$ classes. In the examples, a new small-sample dataset was constructed for disease subtype prediction in the biological field. An episode $D$ can be constructed as follows: first, a small set of source classes $C=\{c_i \mid i=1,2,\dots,n\}$ containing $n$ classes is drawn from $C_s$. Then, $k$ support samples and $q$ query samples are randomly drawn from each class in $C$ to generate the support set $S$ and the query set $Q$, respectively. Thus the support set is $S=\{(x_i,y_i)\mid y_i\in C,\ i=1,2,\dots,m_s\}$ and the query set is $Q=\{(x_i,y_i)\mid y_i\in C,\ i=1,2,\dots,m_q\}$, where $m_s=n\times k$ and $m_q=n\times q$ is the number of query samples in each episode. A minimal sketch of this episode construction follows.
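The Python sketch below illustrates this episodic sampling; it assumes the labelled source data are simply a list of (expression vector, subtype label) pairs, and the function name and default values are illustrative only.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=3, k_shot=5, q_query=10):
    """Sample one n-way k-shot episode D = (S, Q) from labelled data.

    `dataset` is a list of (x, y) pairs, where x is a gene expression
    vector and y a disease-subtype label from the source classes Cs.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    # Draw n source classes, then k support and q query samples per class.
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for label in classes:
        chosen = random.sample(by_class[label], k_shot + q_query)
        support += [(x, label) for x in chosen[:k_shot]]
        query += [(x, label) for x in chosen[k_shot:]]
    return support, query  # |S| = n*k support pairs, |Q| = n*q query pairs
```

Each call yields one episode D = (S, Q) that is then fed to the meta learner.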
the meta learner to be trained is then denoted asWith parameters ofIt can be understood as the probability that the output sample of the meta-learner belongs to the y class given the data sample x. On a query set Q of different episodes, the classifier is trained between its predicted labels and true labels so that it can be generalized to other data sets, with the objective function as follows:
S022, obtaining a feature weighting vector of a training data set through a feature selection layer; constructing a second sample data set according to the characteristic weighting vector; since the prototype web-learning supports the prototypes of each class in the set S, each sample in the query set is classified according to the distance between the query set Q and the different prototypes. Thus, in prototype networksThe following were used:
in formula (2), d represents the Euclidean distance in the feature space, CnRepresents the prototype under each epsode. Equation (2) represents the prototype network output according to softmaxAnd type prototype CnThe distance between them, generates a class distribution of query samples x, where,is an embedding function that maps a sample x to a feature vectorEach prototype CnAll by supporting samples S for all embeddings belonging to class nnIs calculated by averaging the vectors of (a) in the form:
in the formula (3), SnE.g., S, represents the set of supported samples belonging to the category n. And in the gene expression data, each sample x is equal to RpAll have features with a high value of the feature dimension p. In general, feature selection methods attempt to find a selection vector β ═ β (β)1,β2,…,βp) I.e. multiplying the corresponding element by the data x to filter out unwanted features and obtain the data xnewTo be more advantageous for performing the subsequent steps, namely:
xnew=β⊙x,βj∈[0,1] (4)
regularization techniques are effective methods for solving the problem of dimensional disasters, and usually a specific form of regularization is manually set under certain assumptions on training data, and corresponding improvements are needed when little is known about the basic knowledge of gene expression data as in the examples. To overcome the problem caused by this situation, embodiments select vector modeling to obtain a softmax network layer, and may learn to obtain an adaptive feature weighting vector from data:
in the formula (5), θ ∈ RpIs the parameter vector of softmax layer, exp is the exponential operator computed by element. Equation (5) can be easily embedded in equation (2) to yield:
in the formula (6), CnIt can be further stated that:
equation (6) makes θ look likeAs well as efficient learning by auto-differentiation techniques, can be easily extended to other problems.
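The following PyTorch sketch illustrates equations (4)-(7): a feature selection layer that learns β = softmax(θ), and a prototype classifier that averages the embedded, feature-weighted support samples and scores query samples by their Euclidean distance to the prototypes. The embedding network f_φ is shown as a small MLP purely for illustration; the layer sizes and class names are assumptions, not the exact architecture of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSelectionLayer(nn.Module):
    """FS layer of equation (5): beta = softmax(theta), applied element-wise."""
    def __init__(self, num_genes):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(num_genes))

    def forward(self, x):                       # x: (batch, p)
        beta = torch.softmax(self.theta, dim=0)
        return beta * x                         # x_new = beta ⊙ x, equation (4)

class ProtoClassifier(nn.Module):
    """Prototype classification of equations (6)-(7) with plain averaging."""
    def __init__(self, num_genes, embed_dim=64):
        super().__init__()
        self.fs = FeatureSelectionLayer(num_genes)
        self.embed = nn.Sequential(             # f_phi, illustrative architecture
            nn.Linear(num_genes, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, support_x, support_y, query_x, n_way):
        zs = self.embed(self.fs(support_x))     # embedded support samples
        zq = self.embed(self.fs(query_x))       # embedded query samples
        # Class prototypes c_n: mean embedded support sample per class, eq. (7)
        protos = torch.stack([zs[support_y == n].mean(0) for n in range(n_way)])
        # Class distribution from negative Euclidean distances, equation (6)
        return F.log_softmax(-torch.cdist(zq, protos), dim=1)
```

Labels in support_y are assumed to be re-indexed to 0 … n_way−1 within each episode.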
S023, inputting the second sample data set into the sample selection net to obtain sample weights; here the second sample data set is the training set, i.e. the support set and query set, obtained through step S022. Because of the high noise in gene expression data, the generalization performance of the model is easily degraded. To overcome this, the embodiments do not use the simple averaging strategy of the prototypical network, but instead assign a weight to each support sample to characterize its confidence, so as to suppress the effect of extreme noise samples; the final prototype $c_{n}$ is then computed as a weighted combination of all embedded support samples of class $n$. In the implementation of some embodiments, the step of inputting the second sample data set into the sample selection net to obtain sample weights specifically includes: determining the confidence of the sample data in the second sample data set and assigning sample weights according to the confidence.
Specifically, the prototype $c_{n}$ as a weighted combination of all embedded support samples of class $n$ is:
$$c_{n}=\frac{\sum_{(x_{i},y_{i})\in S_{n}}v_{i}\,f_{\varphi}(\beta\odot x_{i})}{\sum_{(x_{i},y_{i})\in S_{n}}v_{i}}\qquad(8)$$
In formula (8), $v_{i}$ reflects the confidence that the support sample $x_{i}$ is a clean sample; a larger weight $v_{i}$ indicates a higher confidence that the sample is clean.
In addition, in some embodiments the step of determining the confidence of the sample data in the second sample data set can be expanded as follows: determining a loss function of the sample data in the second sample data set, embedding the loss function into the sample data, and fitting a weighting function through a neural network to obtain the confidence of the sample data.
Specifically, to determine $v$, and inspired by the adaptive sample weighting strategy Meta-Weight-Net (MW-Net), the embodiments learn a weighting function that distinguishes clean samples from noisy ones. More specifically, MW-Net models the sample weight as an MLP network $V(\ell;\Theta)$ with only one hidden layer, which is a universal approximator of continuous functions and can therefore fit a wide variety of weighting functions; its input is the loss of a sample and its output is the weight of that sample. Since the prototypical network does not compute a loss for the support samples, the embodiments instead use the embedded representation $f_{\varphi}(\beta\odot x_{i})$ as the input of the weighting network. Equation (8) can thus be further written as:
$$c_{n}=\frac{\sum_{(x_{i},y_{i})\in S_{n}}V\!\big(f_{\varphi}(\beta\odot x_{i});\Theta\big)\,f_{\varphi}(\beta\odot x_{i})}{\sum_{(x_{i},y_{i})\in S_{n}}V\!\big(f_{\varphi}(\beta\odot x_{i});\Theta\big)}\qquad(9)$$
In equation (9), $\Theta$ denotes the parameters of MW-NetV2 (named to distinguish it from the original MW-Net). MW-NetV2 has the same architecture as MW-Net; the only difference is that its input is the embedded representation of each sample rather than the sample's loss. A minimal sketch follows.
By adding the feature selection layer (FS-Net) and the sample selection net (MW-NetV2) to the prototypical network, the method of the embodiments can simultaneously pick out the disease-related genes that contribute to classification and suppress the negative effect of noisy data. The learning process is similar to that of the prototypical network, with the parameters of FS-Net and MW-NetV2 updated together with the parameters of the prototypical network. The final loss function is:
$$\mathcal{L}(\varphi,\theta,\Theta)=\mathbb{E}_{D=(S,Q)}\Big[\sum_{(x,y)\in Q}-\log p(y\mid x)\Big]\qquad(10)$$
where $p(y\mid x)$ is given by equations (6), (7) and (9).
S024, training the meta learner according to the second sample data set and the sample weights to obtain a trained prediction model; step S024 further includes:
S0241, removing noise samples from the support set according to the sample weights, and determining the sample classes in the denoised support set;
S0242, obtaining the Euclidean distances between the samples in the denoised support set and the sample classes, and normalizing the Euclidean distances to obtain the output of the meta learner.
Specifically, the prototypical network is trained through steps S022 and S023. In a conventional feature extraction process, the original features are converted into a new set of features through subspace learning; however, it is difficult to derive a reasonable biological interpretation from such a learned feature subspace. Feature selection instead removes irrelevant and redundant features and selects a set of important features that are closely related to the target. A sample weighting strategy is then applied that weights each sample according to its training reliability, finally yielding the trained prediction model.
Where necessary, step S024 further includes step S0243: verifying the output of the meta learner on the query set and outputting the trained prediction model according to the test result. An end-to-end sketch of one episodic training step is given below.
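Tying the pieces together, the sketch below shows one episodic training step in the sense of equation (10). It reuses the illustrative ProtoClassifier, MWNetV2 and weighted_prototypes defined in the earlier sketches; updating φ, θ and Θ with a single joint optimizer is a simplification of the bi-level update used by the original MW-Net, and all names are assumptions rather than the embodiment's exact procedure.

```python
import torch
import torch.nn.functional as F

def train_episode(model, mw_net, optimizer,
                  support_x, support_y, query_x, query_y, n_way):
    """One training step of the prediction model on a single episode.

    `model` is the ProtoClassifier sketched above (FS layer + embedding f_phi)
    and `mw_net` the MWNetV2 sample selection net; both are updated jointly.
    """
    optimizer.zero_grad()
    zs = model.embed(model.fs(support_x))      # embedded support samples (S022)
    zq = model.embed(model.fs(query_x))        # embedded query samples
    v = mw_net(zs)                             # sample confidences (S023)
    protos = weighted_prototypes(zs, support_y, v, n_way)  # denoised prototypes (S0241)
    logits = -torch.cdist(zq, protos)          # distances, normalized by softmax (S0242)
    loss = F.cross_entropy(logits, query_y)    # query-set loss, equation (10)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A joint optimizer can be built as, for example, torch.optim.Adam(list(model.parameters()) + list(mw_net.parameters()), lr=1e-3); after training over many episodes, the model is checked on the query sets of held-out episodes as in step S0243.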
In a second aspect, the present invention further provides a system for predicting disease subtypes based on small samples, including a data obtaining unit, a model building unit, and a prediction output unit, wherein:
a data acquisition unit for acquiring gene expression data of the first data set;
the model building unit is used for training to obtain a prediction model, and the prediction model comprises: the system comprises a sample selection network, a feature selection layer and a meta learner; the training step of the prediction model comprises the following steps:
constructing a first sample data set; the sample data set comprises gene expression data of the second data set; obtaining a feature weighting vector of a training data set through a feature selection layer; constructing a second sample data set according to the feature weighting vector; inputting the second sample data set into a sample selection network to obtain a sample weight; training the meta-learner according to the second sample data set and the sample weights to obtain a trained prediction model;
and the prediction output unit is used for predicting the gene expression data through the prediction model to obtain the disease subtype of the sample to be predicted.
In a third aspect, embodiments of the present invention also provide an apparatus for disease subtype prediction based on a small sample, comprising at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the small-sample based disease subtype prediction method as in the first aspect.
An embodiment of the present invention further provides a storage medium storing a program, where the program is executed by a processor as the method in the first aspect.
In addition to the above description of the method, system and apparatus embodiments, exemplary embodiments of the invention are described in detail below. In this process, predictions are made by the embodiment on a simulated data set and on real gene expression data sets, and compared with the prediction results of other approaches, as shown in Table 1:
TABLE 1
In Table 1, this embodiment (Select-ProtoNet) is compared with the ordinary metric-based meta-learning method ProtoNet and with ProtoNet equipped with one of the two additional modules: SelectF-ProtoNet and SelectS-ProtoNet have only the feature selection layer and only the sample selection net, respectively. Four settings that corrupt the sample class labels on the support set are used: the class label of each support sample is independently flipped to a random class with probability P, where P = 0%, 10%, 30% or 50%. Table 1 shows the accuracy (%) ± variance of Select-ProtoNet and its baselines on the simulated dataset over 30 random runs under the different experimental settings. It can be seen that both additional modules contribute to improved performance and that the proposed method achieves the best results; the advantage of Select-ProtoNet over ProtoNet becomes more pronounced as the noise rate increases. The training loss curves and accuracies of ProtoNet and Select-ProtoNet are compared in FIGS. 3 and 4: ProtoNet takes nearly twice as long as the Select-ProtoNet of this embodiment to reach its best classification accuracy. FIG. 5 plots the weight distribution of clean and noisy training support samples; almost all of the larger weights belong to clean samples, and the weights of noisy samples are smaller than those of clean samples, indicating that MW-NetV2 can distinguish clean samples from noisy ones.
Besides the comparison with the metric-based meta-learning method ProtoNet, the embodiment is also compared with existing machine learning methods, including the prototypical network (ProtoNet), majority prediction (Majority), logistic regression (Logistic Regression) and a neural network (Neural Network). The real dataset used in the comparison is the TCGA Meta-Dataset, a published benchmark dataset in the field of gene expression analysis that contains 174 clinical tasks derived from The Cancer Genome Atlas (TCGA) and can be used in a multi-task learning framework. It cannot, however, be used directly as a benchmark dataset for small-sample learning, because some tasks have very unbalanced classes; a new dataset, the miniTCGA Meta-Dataset, consisting of 68 TCGA benchmark clinical tasks with two classes per task, each with at least 60 samples, was therefore used in the comparison, as shown in Table 2:
TABLE 2
Table 2 shows the comparison of model test accuracy (%) under different noise rate settings on the miniTCGA Meta-Dataset; the average accuracy (± variance) over 30 repetitions is reported, and the best classification accuracy under each noise rate setting is highlighted in bold. It can be seen that the prediction model of the embodiment improves on its baseline method, with a large improvement at all noise rate settings, and is superior to the three conventional supervised methods.
Table 3 shows the lung cancer subtype prediction accuracy of all compared methods on the TCGA Meta-Dataset task with task id Expression_Subtype, Lung. In particular, ProtoNet and Select-ProtoNet take the interrelationship of clinical tasks into account and use all clinical task samples of the miniTCGA Meta-Dataset to build shared experience or knowledge and transfer it to help predict lung cancer subtypes, whereas the traditional approaches consider only the lung cancer subtype task.
TABLE 3
As shown in Table 3, the prediction model of the embodiment achieves the best results, with accuracy improved by more than 20% over the baseline method and by more than 29% over the best result of the conventional methods. This means that the model significantly improves disease subtype prediction performance by merging various related clinical tasks.
From the above implementation process it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages or benefits:
1. According to the embodiments provided by the invention, two modules, a feature selection layer and a sample selection net, are added to the prototypical network model, so that the model can adaptively select important features and clean samples and can learn and generalize well from small-sample data.
2. According to the embodiments provided by the invention, the two additional modules and the prototypical network are updated within a unified framework, so the method can easily be implemented on top of the original prototypical network.
3. The superiority of the method in predicting disease subtypes and identifying potential disease-related genes is demonstrated through simulations and verification on real gene expression data.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
Wherein the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. The disease subtype prediction method based on the small sample is characterized by comprising the following steps of:
obtaining gene expression data for a first dataset;
predicting the gene expression data through a prediction model to obtain a disease subtype of the first data set; the prediction model comprises a sample selection net, a feature selection layer and a meta learner;
the prediction model is obtained by training the following steps:
constructing a first sample data set; the sample data set comprises gene expression data of a second data set;
obtaining a feature weighting vector of the training data set through the feature selection layer; constructing a second sample data set according to the feature weighting vector;
inputting the second sample data set into the sample selection net to obtain a sample weight;
and training the meta learner according to the second sample data set and the sample weight to obtain a trained prediction model.
2. The method for disease subtype prediction based on small samples according to claim 1, characterized in that the step of constructing the first sample data set specifically comprises:
acquiring a plurality of gene expression data, and obtaining the first sample data set through meta-analysis;
or acquiring a plurality of gene expression data, and integrating to obtain the first sample data set through batch correction and machine learning.
3. The method of claim 1, wherein the step of constructing the first sample data set further comprises:
extracting a plurality of support samples and a plurality of query samples from the second data set according to the type of the gene expression data;
constructing a support set according to the support sample; and constructing a query set according to the query sample.
4. The method according to claim 1, wherein the step of inputting the second sample data set into the sample selection net to obtain sample weights comprises:
and determining the confidence of the sample data in the second sample data set, and distributing the sample weight according to the confidence.
5. The method of claim 4, wherein the step of determining the confidence level of the sample data in the second sample data set further comprises:
and determining a loss function of the sample data in the second sample data set, embedding the loss function into the sample data, and fitting a weighting function through a neural network to obtain the confidence coefficient of the sample data.
6. The method of claim 3, wherein the step of training the meta-learner based on the second set of sample data and the sample weights to obtain a trained predictive model comprises:
removing noise samples in the support set according to the sample weight; determining a sample class in a support set after denoising;
and obtaining the Euclidean distance between the samples in the denoised support set and the sample classes, and normalizing the Euclidean distance to obtain the output of the meta-learner.
7. The method of claim 6, wherein the step of training the meta-learner to obtain a trained predictive model based on the second set of sample data and the sample weights further comprises:
and checking the output of the meta learner according to the query set, and outputting a trained prediction model according to a test result.
8. The disease subtype prediction system based on the small sample is characterized by comprising a data acquisition unit, a model construction unit and a prediction output unit, wherein:
the data acquisition unit is used for acquiring gene expression data of the first data set;
the model building unit is used for training to obtain a prediction model, and the prediction model comprises: the system comprises a sample selection network, a feature selection layer and a meta learner; the training step of the prediction model comprises the following steps:
constructing a first sample data set; the sample data set comprises gene expression data of a second data set; obtaining a feature weighting vector of the training data set through the feature selection layer; constructing a second sample data set according to the feature weighting vector; inputting the second sample data set into the sample selection net to obtain a sample weight; training the meta learner according to the second sample data set and the sample weights to obtain a trained prediction model;
and the prediction output unit is used for predicting the gene expression data through a prediction model to obtain the disease subtype of the first data set sample.
9. An apparatus for disease subtype prediction based on small samples, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the small sample-based disease subtype prediction method of any one of claims 1-7.
10. A storage medium having stored therein a program executable by a processor, characterized in that: the processor executable program when executed by a processor is for implementing a small sample based disease subtype prediction method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843441.3A CN112116957A (en) | 2020-08-20 | 2020-08-20 | Disease subtype prediction method, system, device and medium based on small sample |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843441.3A CN112116957A (en) | 2020-08-20 | 2020-08-20 | Disease subtype prediction method, system, device and medium based on small sample |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112116957A true CN112116957A (en) | 2020-12-22 |
Family
ID=73804344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010843441.3A Pending CN112116957A (en) | 2020-08-20 | 2020-08-20 | Disease subtype prediction method, system, device and medium based on small sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112116957A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112669929A (en) * | 2020-12-30 | 2021-04-16 | 深圳大学 | Crohn's disease infliximab drug effect prediction method and terminal equipment |
CN112685561A (en) * | 2020-12-26 | 2021-04-20 | 广州知汇云科技有限公司 | Small sample clinical medical text post-structuring processing method across disease categories |
CN113057589A (en) * | 2021-03-17 | 2021-07-02 | 上海电气集团股份有限公司 | Method and system for predicting organ failure infection diseases and training prediction model |
CN113555118A (en) * | 2021-07-26 | 2021-10-26 | 内蒙古自治区人民医院 | Method and device for predicting disease degree, electronic equipment and storage medium |
CN114067914A (en) * | 2021-10-27 | 2022-02-18 | 山东大学 | Meta-learning-based bioactive peptide prediction method and system |
WO2023071406A1 (en) * | 2021-10-29 | 2023-05-04 | 复旦大学附属华山医院 | Classification method and system for classifier used for immune-related disease molecular typing and subtyping |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015018517A1 (en) * | 2013-08-05 | 2015-02-12 | Mr. PD Dr. NIKOLAOS KOUTSOULERIS | Adaptive pattern recognition for psychosis risk modelling |
CN109919299A (en) * | 2019-02-19 | 2019-06-21 | 西安交通大学 | A kind of meta learning algorithm based on meta learning device gradually gradient calibration |
CN111476292A (en) * | 2020-04-03 | 2020-07-31 | 北京全景德康医学影像诊断中心有限公司 | Small sample element learning training method for medical image classification processing artificial intelligence |
- 2020: 2020-08-20 CN CN202010843441.3A patent/CN112116957A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015018517A1 (en) * | 2013-08-05 | 2015-02-12 | Mr. PD Dr. NIKOLAOS KOUTSOULERIS | Adaptive pattern recognition for psychosis risk modelling |
CN109919299A (en) * | 2019-02-19 | 2019-06-21 | 西安交通大学 | A kind of meta learning algorithm based on meta learning device gradually gradient calibration |
CN111476292A (en) * | 2020-04-03 | 2020-07-31 | 北京全景德康医学影像诊断中心有限公司 | Small sample element learning training method for medical image classification processing artificial intelligence |
Non-Patent Citations (5)
Title |
---|
JAKE SNELL et al.: "Prototypical Networks for Few-shot Learning", 31st Conference on Neural Information Processing Systems (NIPS 2017), 31 December 2017, pages 1-11 *
JUN SHU et al.: "Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting", arXiv, 27 September 2019, pages 1-23 *
JUN SHU et al.: "Small Sample Learning in Big Data Era", arXiv, 22 August 2018, pages 1-76 *
ZIYI YANG et al.: "Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction", arXiv, 3 September 2020, pages 1-11 *
JUN SHU et al.: "Meta self-paced learning" (元自步学习), SCIENTIA SINICA Informationis (中国科学: 信息科学), vol. 50, no. 6, 10 June 2020, page 781 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112685561A (en) * | 2020-12-26 | 2021-04-20 | 广州知汇云科技有限公司 | Small sample clinical medical text post-structuring processing method across disease categories |
CN112669929A (en) * | 2020-12-30 | 2021-04-16 | 深圳大学 | Crohn's disease infliximab drug effect prediction method and terminal equipment |
CN113057589A (en) * | 2021-03-17 | 2021-07-02 | 上海电气集团股份有限公司 | Method and system for predicting organ failure infection diseases and training prediction model |
CN113555118A (en) * | 2021-07-26 | 2021-10-26 | 内蒙古自治区人民医院 | Method and device for predicting disease degree, electronic equipment and storage medium |
CN113555118B (en) * | 2021-07-26 | 2023-03-31 | 内蒙古自治区人民医院 | Method and device for predicting disease degree, electronic equipment and storage medium |
CN114067914A (en) * | 2021-10-27 | 2022-02-18 | 山东大学 | Meta-learning-based bioactive peptide prediction method and system |
CN114067914B (en) * | 2021-10-27 | 2024-08-20 | 山东大学 | Method and system for predicting bioactive peptide based on meta learning |
WO2023071406A1 (en) * | 2021-10-29 | 2023-05-04 | 复旦大学附属华山医院 | Classification method and system for classifier used for immune-related disease molecular typing and subtyping |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112116957A (en) | Disease subtype prediction method, system, device and medium based on small sample | |
Angermueller et al. | Deep learning for computational biology | |
CN111832627B (en) | Image classification model training method, classification method and system for suppressing label noise | |
JP6814981B2 (en) | Learning device, identification device, learning identification system, and program | |
JP7522936B2 (en) | Gene phenotype prediction based on graph neural networks | |
US20210256699A1 (en) | Systems and methods for mesothelioma feature detection and enhanced prognosis or response to treatment | |
JP2015087903A (en) | Apparatus and method for information processing | |
CN113674864B (en) | Malignant tumor combined venous thromboembolism risk prediction method | |
Akilandasowmya et al. | Skin cancer diagnosis: Leveraging deep hidden features and ensemble classifiers for early detection and classification | |
CN108877947B (en) | Depth sample learning method based on iterative mean clustering | |
Jiang et al. | MHAttnSurv: Multi-head attention for survival prediction using whole-slide pathology images | |
Sekaran et al. | Predicting autism spectrum disorder from associative genetic markers of phenotypic groups using machine learning | |
WO2023114519A1 (en) | Applications of deep neuroevolution on models for evaluating biomedical images and data | |
Patra et al. | Deep learning methods for scientific and industrial research | |
Mahmoud et al. | Early diagnosis and personalised treatment focusing on synthetic data modelling: novel visual learning approach in healthcare | |
Rajasree et al. | Ensemble-of-classifiers-based approach for early Alzheimer’s Disease detection | |
Aljuhani et al. | Uncertainty aware sampling framework of weak-label learning for histology image classification | |
Zhang et al. | Semi‐supervised graph convolutional networks for the domain adaptive recognition of thyroid nodules in cross‐device ultrasound images | |
CN113838519B (en) | Gene selection method and system based on adaptive gene interaction regularization elastic network model | |
Linmans et al. | The Latent Doctor Model for Modeling Inter-Observer Variability | |
Plazas et al. | Towards reduction of expert bias on Gleason score classification via a semi-supervised deep learning strategy | |
KR102713565B1 (en) | Method for detecting white matter lesions based on medical image | |
CN112086174B (en) | Three-dimensional knowledge diagnosis model construction method and system | |
Wang et al. | Semisupervised Bacterial Heuristic Feature Selection Algorithm for High‐Dimensional Classification with Missing Labels | |
JP2024500470A (en) | Lesion analysis methods in medical images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |