CN116401555A - Method, system and storage medium for constructing double-cell recognition model - Google Patents
- Publication number
- CN116401555A (application number CN202310665802.3A)
- Authority
- CN
- China
- Prior art keywords
- training
- data
- double
- base learner
- recognition model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application provides a method, a system and a storage medium for constructing a double-cell recognition model. The construction method comprises the following steps: preparing a gene expression matrix obtained by mixing two kinds of single-cell gene data as the data set of the double-cell recognition model; randomly splitting the data set into a training set and a test set according to a predetermined ratio; equally dividing the training set into Q training subsets, and adjusting the hyperparameters of Q heterogeneous base learners respectively based on part of the data in each training subset; performing reinforcement training on each hyperparameter-tuned base learner based on a Boosting method; and, based on the prediction accuracy of the reinforced base learners on the test set, combining them according to corresponding weights to form the double-cell recognition model.
Description
Technical Field
The application relates to the field of bioinformatics, in particular to a method for constructing a double-cell recognition model.
Background
Compared with traditional bulk transcriptome techniques, single-cell transcriptome (scRNA-seq) technology allows transcripts to be analyzed at the level of individual cells, making it possible to classify cell types and dissect functional states. Single-cell transcriptomics has therefore become an important means of constructing cell atlases and precisely studying organismal development and the pathogenesis and progression of disease. Capturing and separating individual cells is a key step in implementing the technology. At present, water-in-oil droplets are constructed based on microfluidic technology, and individual cells can be packaged into single droplets by controlling the rates of cell loading and droplet generation in the channel. Owing to its ultra-high throughput and relatively simple operation, this is currently the most mainstream solution. In this process, if two (or more) cells enter the same droplet, double-cell contamination results. With current microfluidic technology, the double-cell problem cannot be solved at the experimental level, and the proportion of double cells increases with the loading quantity. Double-cell droplets are not cells in the strict sense and constitute a key confounding factor in single-cell transcriptomics. A sufficiently effective filtering operation is therefore necessary to remove such samples so that accurate results can be obtained in downstream analysis.
Double cells can be classified into homologous double cells (the two cells contained in a droplet are of the same type) and heterologous double cells (the two cells are of different types). A droplet containing a homologous double cell carries roughly twice the nucleic acid abundance of a normal cell. Such double cells have a smaller influence on single-cell transcriptome analysis: on the one hand, homologous double cells affect downstream cluster analysis and cell annotation relatively little; on the other hand, they can be filtered simply and effectively by setting a reasonable nucleic acid abundance threshold. For heterologous double cells, however, the gene expression characteristics of different cell types differ, and the hybrid signatures of different cell-type combinations (such as a T cell mixed with a B cell, or a T cell mixed with a fibroblast) also differ, so it is difficult to filter them effectively with a hard threshold. Moreover, heterologous double cells strongly affect subsequent analysis: for example, when cell annotation yields a group carrying the characteristics of two cell types, it is difficult to tell whether these cells have special biological significance or whether the data are contaminated by double cells. There is therefore a need for an efficient, accurate and stable method and system for identifying heterologous double cells.
Disclosure of Invention
The application provides a construction method of a double-cell recognition model, which comprises the following steps: preparing a gene expression matrix obtained by mixing two kinds of single-cell gene data as the data set of the double-cell recognition model; randomly splitting the data set into a training set and a test set according to a predetermined ratio; equally dividing the training set into Q training subsets, and adjusting the hyperparameters of Q heterogeneous base learners respectively based on part of the data in each training subset; performing reinforcement training on each hyperparameter-tuned base learner based on a Boosting method; and, based on the prediction accuracy of the reinforced base learners on the test set, combining them according to corresponding weights to form the double-cell recognition model.
According to an embodiment of the present application, preparing a gene expression matrix obtained by mixing the gene data of two species as the data set of the double-cell recognition model includes: mixing the gene data of the two species to obtain an M×N gene expression matrix as the data set of the double-cell recognition model, where M is the number of genes and N is the number of samples. To further reduce the data size, the L most variably expressed genes may also be selected from the M genes, reducing the dimension of the data set from M×N to L×N. Then, by performing principal component analysis on the expression levels of the L genes across the N samples, the dimension of the data set is reduced from L×N to P×N, where P < L.
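The two-stage reduction described here (selecting the L most variable genes, then PCA down to P components) can be sketched in numpy; the matrix sizes and random counts below are toy stand-ins for the patent's 20000×20000 example, and PCA is computed directly via SVD rather than any particular library routine:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 500, 200                                   # toy genes x samples
X = rng.poisson(2.0, size=(M, N)).astype(float)   # mock expression matrix

# First compression: keep the L genes with the largest variance.
L = 100
top = np.argsort(X.var(axis=1))[::-1][:L]
X_L = X[top, :]                                   # L x N

# Second compression: PCA via SVD, treating samples as observations.
P = 30
Xc = X_L - X_L.mean(axis=1, keepdims=True)        # center each gene
U, _, _ = np.linalg.svd(Xc, full_matrices=False)
X_P = U[:, :P].T @ Xc                             # P x N principal-component scores
print(X_L.shape, X_P.shape)
```

The shapes trace the reduction M×N to L×N to P×N exactly as described.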
According to an embodiment of the present application, adjusting the hyperparameters of the Q heterogeneous base learners based on part of the data in each training subset includes: partitioning each training subset into K equal folds, and determining the hyperparameters with the highest prediction accuracy on the validation folds through K-fold cross-validation combined with grid search (Grid Search).
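As a sketch of this step under the assumption of a scikit-learn-style workflow (the candidate grid, fold count and data below are illustrative choices, not taken from the patent), a K-fold cross-validated grid search for one base learner might look like:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for one training subset; features mimic principal components.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# K-fold cross-validation (K = 5) combined with grid search over a
# manually specified set of candidate hyperparameter values.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`best_params_` holds the hyperparameter value with the highest mean validation-fold accuracy.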
According to an embodiment of the present application, performing reinforcement training on each hyperparameter-tuned base learner based on the Boosting method includes: training one of the Q heterogeneous base learners on part of the training samples of one of the Q training subsets; based on the prediction errors of the base learner in the previous round, increasing the weights of the mispredicted training samples for the next round; iterating this training I times; and, for each base learner, combining the I learners obtained over the I rounds into the reinforced base learner according to their prediction error rates.
According to the embodiment of the application, based on the prediction error rates of the I base learners obtained over the I training rounds, the I learners are combined into the reinforced base learner according to the following formula:

G(x) = \sum_{m=1}^{I} \alpha_m G_m(x)

where G_m(x) is the base learner obtained in the m-th round of training and \alpha_m is the combination coefficient assigned to it, given by:

\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

where e_m is the classification error rate of the base learner obtained in the m-th round:

e_m = \sum_{i=1}^{N} w_{mi} \, I\big(G_m(x_i) \neq y_i\big)

where w_{mi} is the weight of the i-th sample in the training subset at the m-th round; G_m(x_i) is the prediction of the m-th round base learner on sample x_i; y_i is the true value of the sample; I is the indicator function, equal to 0 when the prediction is correct and 1 when it is incorrect; and N is the number of samples in the training subset of the m-th round.
According to the embodiment of the application, increasing the weight of a training sample that was mispredicted in the previous round, or decreasing the weight of a sample that was predicted correctly, is performed according to the following formula:

w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m y_i G_m(x_i)\big)

where w_{m+1,i} is the weight of the i-th sample in the training subset at the (m+1)-th round, and Z_m is a normalization factor that keeps the weights summing to one:

Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m y_i G_m(x_i)\big)
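One such Boosting round, with the error rate, the coefficient alpha and the weight update in the standard AdaBoost form consistent with the formulas above, can be sketched in numpy; labels are coded as ±1, and the toy predictions are illustrative:

```python
import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One Boosting round: compute the weighted error rate e_m, the
    coefficient alpha_m, and the re-normalized weights for the next round."""
    miss = (y_pred != y_true).astype(float)          # indicator I
    e_m = np.sum(w * miss)                           # weighted error rate
    alpha_m = 0.5 * np.log((1.0 - e_m) / e_m)        # alpha_m = 1/2 ln((1-e)/e)
    w_next = w * np.exp(-alpha_m * y_true * y_pred)  # labels in {-1, +1}
    w_next /= w_next.sum()                           # divide by Z_m
    return e_m, alpha_m, w_next

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # one mistake (the second sample)
w = np.full(5, 0.2)                     # uniform initial weights
e_m, alpha_m, w_next = adaboost_round(w, y_true, y_pred)
print(e_m, alpha_m, w_next)
```

The mispredicted sample's weight rises from 0.2 to 0.5 while the correctly predicted samples shrink, which is exactly the re-weighting behaviour the text describes.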
According to an embodiment of the present application, combining the reinforced base learners according to corresponding weights includes: linearly superposing the outputs of the reinforced base learners in a soft-voting manner, using their prediction accuracies as weights.
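A minimal sketch of this accuracy-weighted soft voting (all numbers are toy values): each reinforced learner outputs a double-cell probability per sample, the outputs are linearly superposed with test-set accuracies as weights, and the weighted average is thresholded:

```python
import numpy as np

# Double-cell probabilities from three reinforced base learners for
# four samples (toy numbers), and each learner's test-set accuracy.
probs = np.array([
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.4, 0.2],
    [0.7, 0.1, 0.7, 0.3],
])
acc = np.array([0.95, 0.90, 0.85])

# Soft voting: weight each learner's output by its test-set accuracy.
weights = acc / acc.sum()
combined = weights @ probs              # weighted average probability
pred = (combined >= 0.5).astype(int)    # 1 = double cell
print(pred)
```

More accurate learners thus contribute proportionally more to the final call on each sample.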
According to an embodiment of the application, the Q heterogeneous base learners are respectively a decision tree model, a K-nearest-neighbor model, a support vector machine model, a logistic regression model and a naive Bayes model.
The application also provides a system for constructing a double cell recognition model, which comprises: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions to: preparing a gene expression matrix obtained by mixing gene data of two species as a data set of a double-cell recognition model; randomly splitting the data set into a training set and a testing set according to a preset proportion; equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets; respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method; and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
The present application also provides a computer-readable storage medium for training a two-cell recognition model, the computer-readable storage medium storing executable instructions executable by one or more processors to perform operations comprising: preparing a gene expression matrix obtained by mixing gene data of two species as a data set of a double-cell recognition model; randomly splitting the data set into a training set and a testing set according to a preset proportion; equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets; respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method; and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
The construction method creatively applies ensemble learning to the building of the double-cell recognition model and integrates the recognition results of a plurality of learners, thereby providing a double-cell recognition model with good stability and high robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings.
Fig. 1 is a schematic diagram of a method of constructing a two-cell recognition model according to an embodiment of the present application.
FIG. 2 is a schematic representation of a gene expression matrix based on a mixture of two single cell gene data according to an embodiment of the present application.
FIG. 3 is a schematic block diagram of a system for building a two-cell recognition model according to an embodiment of the present application.
Detailed Description
For a better understanding of the present application, a more detailed description of the technical solution of the present application will be made with reference to the accompanying drawings. It should be understood that the detailed description is merely illustrative of exemplary embodiments of the application and is not intended to limit the scope of the application in any way. Like reference numerals refer to like elements throughout the specification. The expression "and/or" includes any or all combinations of one or more of the associated listed items.
It should be noted that in this specification, expressions of "first", "second", "third", etc. are used only to distinguish one feature from another feature, and do not represent any limitation on the feature. Thus, a first single cell discussed below may also be referred to as a second single cell without departing from the teachings of the present application. And vice versa.
In the drawings, the size, proportion and shape of the figures have been slightly adjusted for convenience of explanation. The figures are merely examples and are not drawn to scale. As used herein, the terms "about", "approximately" and the like are used as terms of approximation, not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
It will be further understood that terms such as "comprises," "comprising," "includes," "including," "having," "contains," and/or "containing" are open-ended rather than closed-ended terms that specify the presence of the stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components, and/or groups thereof. Furthermore, when a statement such as "at least one of" appears after a list of features, it modifies the entire list rather than a single feature in the list. Furthermore, when describing embodiments of the present application, the use of "may" means "one or more embodiments of the present application". Also, the term "exemplary" is intended to refer to an example or illustration.
Unless otherwise defined, all terms (including engineering and technical terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In addition, features in the embodiments and examples of the present application may be combined with each other without conflict. In addition, unless explicitly defined or contradicted by context, the particular steps included in the methods described herein are not necessarily limited to the order described, but may be performed in any order or in parallel. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Deep learning has developed rapidly in recent years; because it can autonomously learn classification features from data, it is widely applied to complex feature recognition and classification tasks, and its recent breakthroughs offer new possibilities for double-cell recognition. In order to cope with complex and varied heterologous double-cell types and to ensure the robustness and stability of the double-cell recognition model, the application innovatively provides a method for constructing the double-cell recognition model using ensemble learning. The model construction method is described below with reference to fig. 1.
Fig. 1 is a flowchart of a method of constructing a two-cell recognition model according to an embodiment of the present application.
In step S1010, a gene expression matrix obtained by mixing two kinds of single-cell gene data is prepared as the data set of the double-cell recognition model. Specifically, the two kinds of single cells may be a human tissue cell and a mouse tissue cell. In single-cell transcriptome experiments, double-cell samples are an abnormal contamination factor and are far less numerous than normal single-cell samples; to better match this reality, the mixed data set used to construct the double-cell recognition model generally contains far fewer double-cell samples than single-cell samples. For example, of every 10,000 samples, 9,900 may consist of human single-cell gene data, and 100 may be formed by superimposing human single-cell gene data and mouse single-cell gene data.
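The 9,900/100 mix can be sketched with a toy matrix; the Poisson counts are placeholders, and the essential point is that each double-cell column is the element-wise sum of the expression columns of the two cells sharing a droplet:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50                                                   # toy gene count
human = rng.poisson(3.0, size=(M, 9900)).astype(float)   # human singlets
h_extra = rng.poisson(3.0, size=(M, 100)).astype(float)  # human halves of doublets
mouse = rng.poisson(3.0, size=(M, 100)).astype(float)    # mouse halves of doublets

# A double-cell sample's expression is the sum of the two cells
# captured in the same droplet.
doublets = h_extra + mouse
dataset = np.hstack([human, doublets])                   # M x 10000 mixed matrix
labels = np.concatenate([np.zeros(9900), np.ones(100)])  # 1 = double cell
print(dataset.shape, int(labels.sum()))
```

The resulting matrix has the ~1% double-cell proportion described in the example.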
At step S1020, the data set is randomly split into a training set and a test set according to a predetermined ratio. In general, the data set in the present application is manually labeled, so each sample can be identified as single-cell or double-cell data according to the manual annotation. The training set is used to optimize the parameters of the model; since the model is trained on it, the model is likely to show high prediction accuracy on the training set. The test set is a separate data set, independent of the training set, used to check whether the model trained on the training set also achieves high prediction accuracy on unseen data, thereby guarding against overfitting.
In step S1030, the training set is equally divided into Q training subsets, and the hyperparameters of the Q heterogeneous base learners are adjusted respectively based on part of the data in each training subset. So that the final double-cell recognition model generalizes well and remains stable and robust across practical application scenarios, Q heterogeneous base learners are selected as the underlying learners. Because the different base learners are different models with different prediction logic, the Q heterogeneous learners may perform differently on different data sets; combining their predictions improves the stability and robustness of the double-cell recognition model across data sets. During training, the training set can be divided into Q equal parts according to the types of base learner, with each model optimized on its own training subset; this further increases the diversity of the models and hence the stability and robustness of the double-cell recognition model on different data sets. The hyperparameters described in the application may include the maximum depth (max_depth) of the decision tree model, the alpha value of the naive Bayes model, the k value of the KNN model, the algorithm used by the logistic regression model, the penalty parameter C of the linear kernel function in the support vector machine model, and the like.
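Assuming a scikit-learn-style implementation (the concrete classes and parameter values below are this sketch's choices, not mandated by the patent), the five heterogeneous base learners could be instantiated as:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One entry per heterogeneous base learner, each carrying the kind of
# hyperparameter named in the text (max_depth, k, C, solver).  GaussianNB
# stands in for naive Bayes because PCA features are continuous; the
# alpha hyperparameter mentioned in the text belongs to the multinomial
# variant.
base_learners = {
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "svm": SVC(kernel="linear", C=1.0, probability=True),
    "logistic_regression": LogisticRegression(solver="liblinear"),
    "naive_bayes": GaussianNB(),
}
print(sorted(base_learners))
```

Each model would then be tuned on its own training subset, as the step describes.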
In step S1040, reinforcement training is performed on each hyperparameter-tuned base learner based on the Boosting method. The application follows the logic of the Boosting method when reinforcing each base learner: during the reinforcement training, samples that were mispredicted in the previous round are given higher weights, so that such samples carry more influence in the next round of training. In this way, the final prediction accuracy of each reinforced base learner can be improved.
Finally, in step S1050, based on the prediction accuracy of the reinforced base learners on the test set, the reinforced base learners are combined according to corresponding weights to form the double-cell recognition model. After each base learner has been reinforced by the Boosting method, the learners can be combined in proportions determined by their prediction accuracy on the test set to form the final double-cell recognition model.
The method for constructing the double-cell recognition model creatively applies ensemble learning to the construction of the double-cell recognition model and integrates the recognition results of multiple kinds of learners, thereby providing a double-cell recognition model with high accuracy, good stability and high robustness.
According to the application, an M×N gene expression matrix obtained by mixing the two kinds of single-cell gene data is used as the data set of the double-cell recognition model, where M is the number of genes and N is the number of samples. As described above, to better match the actual situation, the mixed data set generally contains far fewer double-cell samples than single-cell samples. Referring to fig. 2, the M×N gene expression matrix (i.e., the double-cell recognition model data set) is derived from a superposition of the gene expression matrix of the first kind of single cell and that of the second kind of single cell. In the three gene expression matrices shown in fig. 2, each row represents a different gene and each column represents a different sample. Most of the samples are single-cell samples prepared from the first kind of single cell (for example, a human tissue cell), and a few are double-cell samples prepared by mixing the first and second kinds of single cells. For example, sample 1 is a double-cell sample formed by mixing two single cells: the values (i.e., gene expression levels) of the first column of the double-cell recognition model data set are obtained by adding the first column of the gene expression matrix of the first kind of single cell to the first column of the gene expression matrix of the second kind of single cell.
The 2nd, 3rd, …, Nth samples are single-cell samples containing only one single cell of the first kind; accordingly, the values of the 2nd, 3rd, …, Nth columns of the gene expression matrix of the second kind of single cell are 0, and the values of those columns in the double-cell recognition model data set equal the corresponding columns of the gene expression matrix of the first kind of single cell.
According to one embodiment of the present application, A columns of murine single-cell gene expression data are superimposed onto a 20000×20000 gene expression matrix of human cells, resulting in a double-cell recognition model data set that is still 20000×20000 in size.
However, the data size of the initially prepared data set is generally too large for training and would make model training too slow. To train the double-cell recognition model more efficiently, the data set may be compressed.
According to the application, the L genes whose expression levels vary most significantly can be selected from the M genes, reducing the dimension of the data set from M×N to L×N and completing the first round of data set compression.
According to the present application, the data set may first be processed using a variance-stabilizing transformation (VST).
Then, the average expression level and variance of each gene in the data set are calculated, a log10 transformation is applied to both, and a curve is fitted by local polynomial fitting, establishing a functional relationship between average expression and variance. This fit provides an estimate of the expected variance at a given gene's average expression level and effectively reduces the influence of a few outliers on the overall variance of the data. The normalized expression level of a gene can then be calculated using the following expression:
Z_{ij} = \frac{X_{ij} - \bar{X}_i}{\sigma_i}

where Z_{ij} is the normalized expression level of gene i in sample j, X_{ij} is the raw expression level of gene i in sample j, \bar{X}_i is the average expression level of gene i, and \sigma_i is the expected standard deviation of gene i derived from the fitted relationship between average expression and variance.
Finally, the variance of each gene is calculated based on the normalized expression level of each gene in all samples, and the variances are arranged in descending order, and the genes corresponding to the first L variances are selected, so that the scale of the data set is reduced to L multiplied by N. For example, l=2000, meaning that the size of the dataset is reduced from 20000 x 20000 to 2000 x 20000.
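This selection procedure can be sketched in numpy; a global quadratic fit in log10 space stands in for the local polynomial (loess-style) fit, and a toy Poisson matrix replaces real expression data:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, L = 200, 100, 20
rates = rng.uniform(0.5, 10.0, size=M)            # per-gene mean expression
X = rng.poisson(rates[:, None], size=(M, N)).astype(float)   # genes x samples

# Fit the mean-variance relationship in log10 space; a global quadratic
# fit stands in for the local polynomial fit described in the text.
mean = X.mean(axis=1)
var = X.var(axis=1)
log_mean = np.log10(mean + 1e-12)
coef = np.polyfit(log_mean, np.log10(var + 1e-12), deg=2)
sigma_fit = np.sqrt(10.0 ** np.polyval(coef, log_mean))      # expected std

# Z_ij = (X_ij - mean_i) / sigma_i, then rank genes by the variance of
# their normalized expression and keep the top L.
Z = (X - mean[:, None]) / sigma_fit[:, None]
std_var = Z.var(axis=1)
top_L = np.argsort(std_var)[::-1][:L]
X_L = X[top_L, :]
print(X_L.shape)
```

Genes whose observed variance exceeds the variance expected at their expression level rank highest, which is the intended effect of the normalization.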
According to the application, the data size can be further reduced by performing principal component analysis on the expression levels of the L genes in the N samples. For example, principal component analysis may be applied to the 2000×20000 data, converting the expression of the 2000 genes across 20000 samples into the expression of 30 principal components across 20000 samples, thereby reducing the size of the data set from 2000×20000 to 30×20000. The resulting 30×20000 gene expression matrix is used as the final data set for constructing the double-cell recognition model.
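The second compression step corresponds to a standard principal component analysis. The following scikit-learn sketch uses toy dimensions in place of the 2000×20000 matrix; note that PCA expects a (samples × features) layout, so the gene-by-sample matrix is transposed in and back out:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
genes_by_samples = rng.normal(size=(200, 500))  # stand-in for the 2000 x 20000 matrix

# transpose to (samples x features) for fitting, then back to (components x samples)
pca = PCA(n_components=30)
components_by_samples = pca.fit_transform(genes_by_samples.T).T
print(components_by_samples.shape)  # (30, 500): P components x N samples
```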
According to the application, the final data set of size P×N can be randomly split into a training set and a test set; for example, the training set accounts for 70% of the data set and the test set for 30%.
According to the application, each training subset can be divided into K equal parts, and the hyper-parameters with the highest prediction accuracy on the validation set can be determined through K-fold cross-validation and grid search. According to the application, the Q heterogeneous base learners are a decision tree model, a KNN model, a support vector machine model, a logistic regression model, and a naive Bayes model, respectively. For each base learner, a K-fold cross-validated grid search is established on the corresponding training subset, and, with accuracy as the metric, the optimal hyper-parameters corresponding to the highest accuracy of each base learner are selected from a manually set candidate set of hyper-parameter values. In this process, no other parameters of the base learners are adjusted. Here, accuracy = (true positive data + true negative data) / (true positive data + false positive data + true negative data + false negative data) × 100%. In the present application, true positive data, true negative data, false positive data, and false negative data are defined as follows: true positive data, the label is double cell and the predicted value is double cell; true negative data, the label is single cell and the predicted value is single cell; false positive data, the label is single cell and the predicted value is double cell; false negative data, the label is double cell and the predicted value is single cell.
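The K-fold cross-validated grid search can be sketched for one base learner, here the KNN model; the candidate k values and the synthetic data are illustrative, and all other parameters of the learner stay at their defaults, as described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for one training subset
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},  # manually set candidate values
    cv=5,                                       # K equal parts -> K-fold CV
    scoring="accuracy")                         # accuracy as the metric
grid.fit(X, y)
print(grid.best_params_)  # the k value with the highest validation accuracy
```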
According to one embodiment of the present application, one combination of tuned hyper-parameters is: the maximum depth (max_depth) in the decision tree model is 2, the alpha value in the naive Bayes model is 0.01, the k value in the KNN model is 9, the logistic regression model uses the coordinate descent solver (solver = liblinear), and the support vector machine model is set to a linear kernel function with penalty parameter C = 2.
According to the application, if the prediction accuracy of each of the Q base learners is higher than a preset threshold, the reinforcement training step is entered; otherwise, the method returns to the previous step, the numerical ranges of the grid-searched hyper-parameters are replaced, and K-fold cross-validation and grid search of the base learners' hyper-parameters are performed again.
According to the application, in order to further improve the performance of the base learners, a Boosting algorithm can be used to perform reinforcement training on each base learner so as to improve its prediction accuracy.
According to the application, the Boosting method for performing reinforcement training on each preliminarily trained base learner comprises the following steps: training one of the Q heterogeneous base learners based on part of the training data of one of the Q groups of training subsets; based on the prediction results of the base learner after a previous training, increasing the weights of training data that were mispredicted in the previous training, or decreasing the weights of training data that were correctly predicted in the previous training, in the training subset used for a subsequent training; iteratively training the base learner I times; and, for each base learner, combining the I stage learners obtained after the I rounds of training into the reinforced base learner based on their prediction accuracy.
When the base learner undergoes reinforcement training for the first time, the weight w of each training datum in the training subset can be set to the same value, and part of the data in the training subset is randomly selected as the first reinforcement training subset.
Based on the first reinforcement training subset, a first reinforcement training is performed on the base learner to obtain the first-stage base learner. From the training results of the first-stage base learner on the first reinforcement training subset, its classification error rate is calculated according to the following formula:
e_m = Σ_{i=1}^{N} w_{mi} · I(G_m(x_i) ≠ y_i)

where m is the index of the current iterative training (i.e., the stage number), e_m is the classification error rate of the m-th stage base learner, i denotes the i-th sample in the training subset, w_{mi} is the weight of the i-th sample in the m-th training, G_m(x_i) is the predicted value of the m-th stage base learner, and y_i is the true value of the data. I is an indicator function taking the value 1 or 0: when G_m(x_i) ≠ y_i holds, i.e., when the current base classifier misclassifies, the result of I is 1; when G_m(x_i) ≠ y_i does not hold, i.e., the current base classifier classifies correctly, the result of I is 0.
According to the classification error rate e_m, the weight coefficient α_m of the first-stage base learner in the final reinforced base learner is determined according to the following formula:

α_m = (1/2) · ln((1 − e_m) / e_m)
From the above formula, the weight coefficient α_m decreases as the classification error rate e_m increases, so that weak learners with larger classification error rates receive smaller weights in the learner formed by the final iteration.
Using the weight coefficient of the first-stage base learner, the weight distribution of the training data in the reinforcement training subset used in the next stage of training of the base learner is updated according to the following formula:

w_{m+1,i} = (w_{mi} / z_m) · exp(−α_m · y_i · G_m(x_i))
wherein z_m is a normalization factor whose purpose is to make the new sample weights a probability distribution that sums to 1. Specifically, z_m is calculated as:

z_m = Σ_{i=1}^{N} w_{mi} · exp(−α_m · y_i · G_m(x_i))
when the classification is wrong, for the characteristic dataPredicted value G of i m (x i ) And true value y i Inconsistencies, -alpha m y i G m (x i ) Greater than 0, exp (-alpha) m y i G m (x i ) Greater than 1), the data weight becomes greater in the m+1st stage training; when the classification is correct, the predicted value G for the feature i m (x i )y i Concordance, -alpha m y i G m (x i ) Less than 0, exp (-alpha) m y i G m (x i ) Less than 1), the data weight becomes smaller in the m+1 training. Through the processing, the training samples which are misclassified can be realized, and the weight becomes larger in the next iteration, so that more importance is obtained.
Then, a second reinforcement training is performed using the second reinforcement training subset with the updated training data weights to obtain the second-stage base learner. Similarly, the reinforcement training is repeated I times to obtain I stages of base learners.
The I stages of base learners are then linearly combined into the reinforced learner according to the weight coefficient α_m of each stage's base learner in the final reinforced learner.
According to one embodiment of the present application, the expression of the base learner G(x) after I rounds of reinforcement training is:

G(x) = Σ_{m=1}^{I} α_m · G_m(x)
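The reinforcement loop described above is the classical AdaBoost scheme and can be sketched end to end. This is an illustrative re-implementation, with decision stumps standing in for a base learner and labels encoded as +1 (double cell) / −1 (single cell); it follows the formulas for e_m, α_m, the exponential weight update, and the weighted linear combination G(x):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                                   # map {0, 1} -> {-1, +1}

I_rounds, stages, alphas = 10, [], []
w = np.full(len(y), 1.0 / len(y))                 # equal initial weights
for m in range(I_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    e_m = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # e_m = sum_i w_mi * I(...)
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)       # larger error -> smaller coefficient
    w = w * np.exp(-alpha_m * y * pred)           # misclassified samples gain weight
    w /= w.sum()                                  # divide by z_m: weights sum to 1
    stages.append(stump)
    alphas.append(alpha_m)

# G(x) = sum_m alpha_m * G_m(x); the sign gives the class prediction
F = sum(a * s.predict(X) for a, s in zip(alphas, stages))
acc = float((np.sign(F) == y).mean())
print(acc)  # training accuracy of the boosted combination
```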
finally, according to the application, based on the prediction accuracy of the base learner after the reinforcement training on the test set, the outputs of the various base learners after the reinforcement training are linearly overlapped according to the prediction accuracy as weights in a soft voting mode, so that the double-cell recognition model is formed.
The application also provides a system for constructing the double-cell recognition model, which can be realized in the forms of a mobile terminal, a Personal Computer (PC), a tablet personal computer, a server and the like. Referring now to fig. 3, a schematic diagram of a system suitable for use in implementing embodiments of the present application is shown.
As shown in fig. 3, the computer system includes one or more processors, a communication section, etc., such as: one or more central processing units (CPUs) 301 and/or one or more graphics processors (GPUs) 313, etc. The processors may perform various suitable actions and processes according to executable instructions stored in a read-only memory (ROM) 302 or loaded from a storage section 308 into a random access memory (RAM) 303. The communications portion 312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
The processor may communicate with the ROM 302 and/or the RAM 303 to execute the executable instructions, and is connected to the communication section 312 through the bus 304, and communicates with other target devices through the communication section 312, so as to perform operations corresponding to any of the methods set forth in the embodiments of the present application, for example: preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model; randomly splitting the data set into a training set and a testing set according to a preset proportion; equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets; respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method; and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
In addition, the RAM 303 can also store various programs and data required for device operation. The CPU 301, ROM 302, and RAM 303 are connected to each other through the bus 304. Where the RAM 303 is present, the ROM 302 is an optional module: the RAM 303 stores executable instructions that cause the CPU 301 to execute the operations corresponding to the above-described method, or the executable instructions are written into the ROM 302 at run time. An input/output (I/O) interface 305 is also connected to the bus 304. The communication unit 312 may be integrally provided, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output portion 307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 308 including a hard disk or the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. The drive 310 is also connected to the I/O interface 305 as needed. Removable media 311, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is mounted on drive 310 as needed.
It should be noted that the architecture shown in fig. 3 is only an alternative implementation, and in a specific practical process, the number and types of components in fig. 3 may be selected, deleted, added or replaced according to actual needs; in the setting of different functional components, implementation manners such as separate setting or integrated setting may be adopted, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication portion 312 may be separately set, may be integrally set on the CPU or the GPU, and the like. Such alternative embodiments fall within the scope of the present disclosure.
In particular, the process described above with reference to the flowchart of fig. 1 may be implemented as a computer program product according to the present application. For example, the present application proposes a computer program product comprising computer readable instructions which, when executed by a processor, implement the operations of: preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model; randomly splitting the data set into a training set and a testing set according to a preset proportion; equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets; respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method; and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
In such embodiments, the computer program product may be downloaded and installed from a network via the communication portion 309 and/or read and installed from the removable medium 311. The above-described functions defined in the method of the present application are performed when the computer program product is executed by a Central Processing Unit (CPU) 301.
The technical solutions of the present application may be implemented in many ways. For example, the techniques of this application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The order of steps used to describe the method is provided only for the purpose of more clearly describing the technical solution. The method steps of the present application are not limited to the order specifically described above unless specifically limited. Furthermore, in some embodiments, the present application may also be implemented as a storage medium storing a computer program product.
The above description is merely illustrative of the embodiments of the application and of the principles of the technology applied. It should be understood by those skilled in the art that the scope of protection of this application is not limited to the specific combination of the above technical features, and also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the technical concept, for example, technical solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the present application.
Claims (12)
1. A method for constructing a double-cell recognition model, characterized by comprising the following steps:
preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model;
randomly splitting the data set into a training set and a testing set according to a preset proportion;
equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets;
respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method;
and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
2. The method according to claim 1, wherein preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model comprises:
and mixing the two single-cell gene data to obtain an M multiplied by N gene expression matrix which is used as a data set of the double-cell recognition model, wherein M is the number of genes and N is the number of samples.
3. The method according to claim 2, wherein preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model further comprises:
selecting L genes with the most obvious difference in expression amount from M genes, thereby reducing the dimension of the data set from M multiplied by N to L multiplied by N;
the dimension of the dataset is reduced from L N to P N by principal component analysis of the expression levels of the L genes in the N samples, where P < L.
4. The method of claim 1, wherein adjusting the super parameters of the Q heterogeneous base learners based on the partial data in each training subset comprises:
and carrying out K equal division on each training subset, and determining the hyper-parameters with highest prediction accuracy on the verification set through K-fold cross verification and grid search.
5. The construction method according to claim 1, wherein the performing the reinforcement training on each base learner after the super parameter adjustment based on the Boosting method includes:
training one of the Q heterogeneous base learners based on a portion of training samples of a set of the Q sets of training subsets;
increasing weights of training samples that were mispredicted in a previous training or decreasing weights of training samples that were correctly predicted in a previous training in the set of training subsets for a subsequent training based on the prediction situation of the base learner after the previous training;
iteratively training the base learner I times;
for each base learner, based on the prediction error rate of the I base learners obtained after training I times, combining the I base learners into the base learner after strengthening training.
6. The construction method according to claim 5, wherein combining the I base learners into the reinforced base learner based on the prediction accuracy of the I base learners obtained after I rounds of training is performed according to the following formula:

G(x) = Σ_{m=1}^{I} α_m · G_m(x)

wherein G_m(x) is the base learner of the m-th training, and α_m is the distribution coefficient assigned to the base learner obtained in the m-th training; the expression of α_m is:

α_m = (1/2) · ln((1 − e_m) / e_m)

wherein e_m is the prediction error rate of the base learner obtained in the m-th training.
7. The construction method according to claim 6, wherein e_m is calculated according to the following formula:

e_m = Σ_{i=1}^{N} w_{mi} · I(G_m(x_i) ≠ y_i)

wherein w_{mi} is the weight of the i-th sample in the training subset in the m-th training; G_m(x_i) is the predicted value of the base learner obtained in the m-th training; y_i is the true value of the data; I is an indicator function that is 0 when the prediction is correct and 1 when the prediction is incorrect; and N is the number of samples in the group of training subsets.
8. The construction method according to claim 6, wherein increasing the weight of a training sample that was mispredicted in a previous training or decreasing the weight of a training sample that was correctly predicted in a previous training is performed according to the following formula:

w_{m+1,i} = (w_{mi} / z_m) · exp(−α_m · y_i · G_m(x_i))

wherein w_{m+1,i} is the weight of the i-th sample in the training subset in the (m+1)-th training, and z_m is a normalization factor.
9. The construction method according to claim 1, wherein combining the base learner after the reinforcement training according to the corresponding weights comprises:
and linearly superposing the output of the base learner after the reinforcement training according to the prediction accuracy as a weight by a soft voting mode.
10. The method of claim 1, wherein the Q heterogeneous learners are a decision tree model, a KNN model, a support vector machine model, a logistic regression model, and a naive bayes model, respectively.
11. A system for constructing a two-cell recognition model, comprising:
a memory storing executable instructions; and
one or more processors in communication with the memory to execute the executable instructions to:
preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model;
randomly splitting the data set into a training set and a testing set according to a preset proportion;
equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets;
respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method;
and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
12. A computer-readable storage medium for training a two-cell recognition model, the computer-readable storage medium storing executable instructions executable by one or more processors to perform operations comprising:
preparing a gene expression matrix obtained by mixing two single-cell gene data as a data set of a double-cell recognition model;
randomly splitting the data set into a training set and a testing set according to a preset proportion;
equally dividing the training set into Q groups of training subsets, and respectively adjusting the super parameters of Q heterogeneous base learners based on partial data in each group of training subsets;
respectively performing reinforcement training on each base learner after the super parameters are adjusted based on a Boosting method;
and combining the base learner after the strengthening training according to corresponding weights based on the prediction accuracy of the base learner after the strengthening training on the test set so as to form the double-cell recognition model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310665802.3A CN116401555A (en) | 2023-06-07 | 2023-06-07 | Method, system and storage medium for constructing double-cell recognition model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116401555A true CN116401555A (en) | 2023-07-07 |
Family
ID=87010829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310665802.3A Pending CN116401555A (en) | 2023-06-07 | 2023-06-07 | Method, system and storage medium for constructing double-cell recognition model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116401555A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116628601A (en) * | 2023-07-25 | 2023-08-22 | 中山大学中山眼科中心 | Analysis method for classifying non-human primate neurons by adopting multi-modal information |
CN116805157A (en) * | 2023-08-25 | 2023-09-26 | 中国人民解放军国防科技大学 | Unmanned cluster autonomous dynamic evaluation method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114600172A (en) * | 2019-08-30 | 2022-06-07 | 朱诺治疗学股份有限公司 | Machine learning method for classifying cells |
CN114882954A (en) * | 2022-05-24 | 2022-08-09 | 南京邮电大学 | Integrated learning-based automatic cell type classification method |
CN115359264A (en) * | 2022-08-11 | 2022-11-18 | 西安理工大学 | Intensive distribution adhesion cell deep learning identification method |
CN115394358A (en) * | 2022-08-31 | 2022-11-25 | 西安理工大学 | Single cell sequencing gene expression data interpolation method and system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||