CN113889192A - Single cell RNA-seq data clustering method based on deep noise reduction self-encoder - Google Patents
- Publication number
- CN113889192A (application CN202111152923.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- cell rna
- encoder
- seq data
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2134—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a single-cell RNA-seq data clustering method based on a deep denoising autoencoder (the "deep noise reduction self-encoder" of the title). First, the batch effect of the single-cell RNA-seq data is adjusted and the data are standardized, which reduces the adverse effect of technical noise. Second, a deep denoising autoencoder based on the zero-inflated negative binomial (ZINB) distribution extracts the feature information of the single-cell RNA-seq data. Third, fast independent component analysis (FastICA) reduces the dimensionality of the data, which improves the computational efficiency of the model. Finally, the cells are clustered more accurately by a Gaussian mixture model fitted with expectation maximization, and the final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE). The invention reduces the interference of the high dimensionality and heavy noise of single-cell RNA-seq data with clustering, accurately learns the gene expression information of the data in order to cluster cells, and supports gene network construction, cell type discovery, and the early discovery and treatment of cancer.
Description
Technical Field
The invention belongs to the technical field of single-cell RNA-seq data analysis in bioinformatics, and particularly relates to a single-cell RNA-seq (ribonucleic acid sequencing) data clustering method based on a deep denoising autoencoder.
Background
With the rapid development of sequencing technologies, researchers have acquired a large amount of single-cell RNA-seq data. Unsupervised clustering plays an important role in analyzing these data: a clustering method designed for single-cell RNA-seq data can both identify unknown cell types and reveal the heterogeneity of cells. By studying such clustering methods, researchers can identify cell states more accurately, build network structures between cells, and better understand processes such as the differentiation of cancer cells, which lays a foundation for the early discovery and treatment of cancer. At present, traditional clustering methods such as hierarchical clustering, spectral clustering and density-based spatial clustering of applications with noise (DBSCAN) are widely used, but single-cell RNA-seq data have unique characteristics (high dimensionality, heavy noise and many zero values) that prevent these traditional methods from clustering the data effectively.
Disclosure of Invention
To overcome these technical problems, the invention provides a single-cell RNA-seq data clustering method based on a deep denoising autoencoder. The method combines an autoencoder with fast independent component analysis (FastICA) to perform feature learning and dimension reduction during clustering, clusters the single-cell RNA-seq data with a Gaussian mixture model, and reduces the influence of data noise on the clustering result by reconstructing the data with a zero-inflated negative binomial (ZINB) distribution.
To achieve this purpose, the invention adopts the following technical scheme:
A single-cell RNA-seq data clustering method based on a deep denoising autoencoder comprises the following steps:
1) adjusting batch effect and data standardization preprocessing:
Five public, real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases are selected in order to cluster single cells and discover cell subtypes, which assists the early discovery and targeted treatment of the related cancers. The gene expression values in the five data sets (GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322) are obtained from a variety of tissue cells. The raw single-cell RNA-seq data are read, adjusted for batch effects and standardized, which removes systematic technical bias that is unrelated to biological state and is introduced when samples are processed and measured in different batches;
2) data reconstruction and noise reduction:
Single-cell RNA-seq data contain a large number of zero values. A zero value may indicate that a gene is truly not expressed in a cell, but it may also result from technical error, and this noise strongly interferes with the discovery of cell subtypes. The log-normalized single-cell RNA-seq data are therefore input into a deep denoising autoencoder, which reconstructs the data using a zero-inflated negative binomial (ZINB) distribution; the reconstructed data better preserve the original biological characteristics;
3) data dimension reduction:
The single-cell RNA-seq data reconstructed by the deep denoising autoencoder are still high-dimensional, which makes identifying cell subtypes very difficult. Fast independent component analysis (FastICA) is used to reduce the dimensionality of the sample data and eliminate redundant components, so that redundancy in the data does not interfere with the early discovery and treatment of cancer;
4) gaussian mixture clustering and data visualization:
After the low-dimensional, low-noise single-cell RNA-seq data are obtained, a Gaussian mixture model is used to cluster the cells and determine the cell types; the resulting clusters are the potential cell subtypes discovered. The final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE) and analyzed against existing cell and cancer databases, which helps doctors carry out early cancer discovery.
The batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data in step 1) comprises the following: first, a hierarchical Bayesian model is used to adjust the batch effect of the single-cell RNA-seq data and to address the uncertainty caused by measurement sensitivity; next, cells with normal gene expression levels are retained; finally, the data are normalized for sequencing depth and gene length using a logarithmic normalization method.
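As an illustration only, the cell filtering and logarithmic normalization described for step 1) might be sketched in plain Python as follows. The cutoff `min_genes`, the scaling constant `scale` and the function name `filter_and_normalize` are assumptions introduced for this sketch; batch effect adjustment with the hierarchical Bayesian model and gene length normalization are not covered.

```python
import math

def filter_and_normalize(counts, min_genes=2, scale=1e4):
    """counts: list of cells, each a list of raw gene counts.

    Returns (kept_indices, normalized), where each normalized value is
    log(1 + scale * count / library_size) for the retained cells.
    """
    kept, normalized = [], []
    for idx, cell in enumerate(counts):
        detected = sum(1 for c in cell if c > 0)
        library_size = sum(cell)
        # Drop cells with too few detected genes (hypothetical quality cutoff).
        if detected < min_genes or library_size == 0:
            continue
        kept.append(idx)
        # Divide by library size to correct for sequencing depth, then log-transform.
        normalized.append([math.log1p(scale * c / library_size) for c in cell])
    return kept, normalized

raw = [
    [10, 0, 5, 0],   # cell 0
    [0, 0, 0, 1],    # cell 1: only one detected gene, filtered out
    [20, 0, 10, 0],  # cell 2: same profile as cell 0 at twice the depth
]
kept, norm = filter_and_normalize(raw)
```

In this toy input, cells 0 and 2 have the same expression profile at different sequencing depths, so depth normalization maps them to the same values.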
The deep denoising autoencoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution. The autoencoder as a whole has three outputs, which respectively learn the zero-inflation factor $\pi$, the mean $\mu$ and the dispersion $\theta$ (which determines the variance) of the ZINB distribution;
The single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the autoencoder is $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term. The decoding stage mirrors the encoding stage and reconstructs the encoded data. The input dimension of the autoencoder matches the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have five layers. A zero-inflation factor is added to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero; that is, the single-cell RNA-seq data are modeled with the zero-inflated negative binomial distribution, $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$. If $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully connected layers are added after it, so that the whole autoencoder has three outputs that respectively learn the zero-inflation factor, the mean and the dispersion of the ZINB distribution. The loss function of the denoising part of the denoising autoencoder is $L_d = -\log\left(\mathrm{ZINB}(X \mid \pi, \mu, \theta)\right)$.
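The loss for a single count can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the common mean/dispersion parameterization of the negative binomial distribution, and the function names `nb_log_pmf` and `zinb_nll` are hypothetical.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log NB probability of count x with mean mu > 0 and dispersion theta > 0
    (assumed mean/dispersion parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, pi, mu, theta):
    """Negative log-likelihood -log ZINB(x | pi, mu, theta) for one count."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        # The point mass at zero only contributes when the count is zero.
        likelihood = pi + (1.0 - pi) * nb
    else:
        likelihood = (1.0 - pi) * nb
    return -math.log(likelihood)

# Loss of an observed zero under weak and strong zero inflation.
loss_low_pi = zinb_nll(0, pi=0.05, mu=2.0, theta=1.5)
loss_high_pi = zinb_nll(0, pi=0.60, mu=2.0, theta=1.5)
```

A larger zero-inflation factor makes an observed zero cheaper under this loss, which is how the model can attribute part of the zeros to technical dropout rather than to low expression.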
In step 3), the dimensionality of the single-cell RNA-seq data is reduced with fast independent component analysis (FastICA). Independent component analysis assumes that the components of the data are mutually independent and treats all components as equally important; it decomposes the original data into a linear combination of statistically independent non-Gaussian components;
Assume that the reconstructed single-cell RNA-seq data follow the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is an unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown. Independent component analysis estimates the mixing coefficients and the independent components from the observed signals in $X$ alone. The original data are first centered and whitened. After this preprocessing the sample data are processed with FastICA. First, a vector $w$ is initialized, where $W = A^{-1}$ is the separation matrix and $w$ is a row vector of $W$. Second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function, and let $w = w^{+}/\lVert w^{+}\rVert$. If the process has not converged, this step is repeated. Finally, FastICA estimates several independent components that carry the important information, thereby reducing the dimensionality of the single-cell RNA-seq data.
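The fixed-point update above can be sketched for a single component as follows. The choice $g = \tanh$, the toy data (two uniform sources mixed by a rotation, so the mixture is already approximately white and no whitening step is shown) and the function name `fastica_one_unit` are assumptions made for this illustration only.

```python
import math
import random

def fastica_one_unit(X, w, max_iter=200, tol=1e-8):
    """One-unit FastICA on whitened samples X (list of vectors), with g = tanh."""
    n, d = len(X), len(w)
    for _ in range(max_iter):
        proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
        g = [math.tanh(p) for p in proj]
        g_prime_mean = sum(1.0 - gv * gv for gv in g) / n
        # w+ = E{X g(w^T X)} - E{g'(w^T X)} w
        w_plus = [
            sum(x[k] * gv for x, gv in zip(X, g)) / n - g_prime_mean * w[k]
            for k in range(d)
        ]
        norm = math.sqrt(sum(v * v for v in w_plus))
        w_new = [v / norm for v in w_plus]
        # Converged when w_new is parallel to w (the sign may flip).
        converged = abs(abs(sum(a * b for a, b in zip(w_new, w))) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w

rng = random.Random(0)
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
S = [[rng.uniform(-half_width, half_width) for _ in range(2)] for _ in range(2000)]
angle = 0.5
c, s = math.cos(angle), math.sin(angle)
# Mix the sources with a rotation A, so that X = AS stays approximately white.
X = [[c * s1 - s * s2, s * s1 + c * s2] for s1, s2 in S]
w = fastica_one_unit(X, [1.0, 0.0])
```

Because the mixing matrix here is a rotation, the rows of $W = A^{-1}$ are `(c, s)` and `(-s, c)`, and the estimated `w` should be close to one of them up to sign.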
In step 4), a Gaussian mixture model is used to cluster the cells and determine the cell types, specifically as follows:
First, the model parameters of the Gaussian mixture distribution are initialized; the parameters are then optimized iteratively with the expectation-maximization (EM) algorithm. E-step: compute the posterior probability $\gamma_{ji}$ that the j-th sample was generated by the i-th Gaussian mixture component [formula omitted]. M-step: update the remaining model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ [formulas omitted]. In the experiments, iteration stops when the maximum number of iterations is reached; otherwise the parameters continue to be updated. Finally, the cluster label $\lambda_j$ of sample $x_j$ is computed as $\lambda_j = \arg\max_i \gamma_{ji}$, and the final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE), which displays it in two-dimensional coordinates.
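The E-step and M-step formulas do not survive in this text. For reference, the textbook EM updates for a Gaussian mixture with $k$ components and $m$ samples, written with the same symbols, are given below; this assumes the patent follows the standard form, and $k$, $m$ and the component density $p(\cdot)$ are notation introduced here.

```latex
\begin{align}
\gamma_{ji} &= \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}
                    {\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)}
  && \text{(E-step)} \\
\mu_i &= \frac{\sum_{j=1}^{m} \gamma_{ji}\, x_j}{\sum_{j=1}^{m} \gamma_{ji}},
\qquad
\Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji}\,(x_j - \mu_i)(x_j - \mu_i)^{T}}
                {\sum_{j=1}^{m} \gamma_{ji}},
\qquad
\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji}
  && \text{(M-step)} \\
\lambda_j &= \arg\max_{i \in \{1, \dots, k\}} \gamma_{ji}
  && \text{(cluster label)}
\end{align}
```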
When the Gaussian mixture model is initialized, k-means++ is used to choose the initial centroids. A point is selected at random from the input data set as the first cluster center. For each point in the data set, the distance to its nearest already-selected cluster center is computed. A new data point is then selected as the next cluster center, with points farther from the existing centers being selected with higher probability. These steps are repeated until k cluster centers have been selected, and the standard k-means algorithm is then run from these k initial centers.
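The seeding procedure can be sketched as follows, using the standard k-means++ rule in which the next center is drawn with probability proportional to the squared distance to the nearest center already chosen. The function name `kmeans_pp_init`, the fixed seed and the toy points are assumptions of this sketch; the subsequent k-means run is not shown.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers with k-means++ (squared-distance weighting)."""
    rng = random.Random(seed)
    # First center: a point chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Sample the next center with probability proportional to d2;
        # already-chosen points have weight 0 and cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = kmeans_pp_init(points, k=3)
```

The sketch assumes the data contain at least k distinct points; otherwise all weights become zero and `random.choices` raises an error.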
The invention has the beneficial effects that:
The invention combines an autoencoder with fast independent component analysis (FastICA) to learn a representation of the single-cell RNA-seq data and reduce its dimensionality, reconstructs the data with a zero-inflated negative binomial (ZINB) distribution to reduce the influence of data noise on the clustering result, clusters the low-dimensional single-cell RNA-seq data with a Gaussian mixture model, and finally visualizes the clustering result with t-distributed stochastic neighbor embedding (t-SNE). The method can identify cell subtypes and can support the early discovery, diagnosis and treatment of cancer.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
FIG. 1 shows the four major steps by which the invention improves single-cell RNA-seq data clustering with a deep denoising autoencoder: batch effect adjustment and data standardization preprocessing; data reconstruction and noise reduction; data dimension reduction; and Gaussian mixture clustering with data visualization.
The invention provides a single-cell RNA-seq data clustering method based on a deep denoising autoencoder, which comprises the following steps:
Step one: batch effect adjustment and data standardization preprocessing. To verify its effectiveness, the invention selects five public data sets downloaded from the ArrayExpress and GEO databases; the gene expression values in the five data sets (GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322) are obtained from a variety of tissue cells. With these data as the raw input, the invention uses a hierarchical Bayesian model to adjust the batch effect of the single-cell RNA-seq data and to address the uncertainty caused by measurement sensitivity. The Python-based single-cell gene expression analysis package Scanpy is used to screen and filter the raw single-cell RNA-seq data and remove data of poor sequencing quality; the data are then standardized to facilitate the subsequent network training.
Step two: data reconstruction and noise reduction. Analysis of single-cell RNA-seq data is hampered by problems such as amplification and data loss. The invention uses a denoising autoencoder to map the input single-cell RNA-seq data to an embedding space. In the experiments, random Gaussian noise is added to the input data and the autoencoder is built from fully connected layers. To better capture the characteristics of single-cell RNA-seq data, three independent fully connected layers are added after the last hidden layer of the decoder; the three outputs respectively learn the impulse function (zero-inflation) factor of the zero-inflated negative binomial distribution, and the mean and the dispersion of the negative binomial distribution. The loss function of the denoising part of the denoising autoencoder is defined as the negative logarithm of the zero-inflated negative binomial likelihood. The single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the autoencoder is $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term. The decoding stage mirrors the encoding stage and reconstructs the encoded data. The input dimension of the autoencoder matches the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have five layers. Recent research on single-cell RNA-seq data shows that the data are best approximated by the negative binomial (NB) distribution [formula omitted]. Single-cell RNA-seq data are typically overdispersed: the variance tends to exceed the mean, so a Poisson distribution, whose variance equals its mean, is not a suitable approximation, and the variance typically changes as the mean changes.
In addition, single-cell RNA-seq data contain a particularly large number of zero values. A zero value in the gene expression data may come from a gene that is not expressed in the biological process (true zero) or from a loss caused by technical factors during sequencing (dropout zero). To better capture single-cell RNA-seq data, the invention improves the traditional denoising autoencoder by adding a zero-inflation factor to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero; that is, the single-cell RNA-seq data are modeled with the zero-inflated negative binomial (ZINB) distribution, $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$. If $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, the invention adds three independent fully connected layers after it, so that the whole autoencoder has three outputs that respectively learn the zero-inflation factor $\pi$, the mean $\mu$ and the dispersion $\theta$ (which determines the variance) of the ZINB distribution. The loss function of the denoising part of the denoising autoencoder is $L_d = -\log\left(\mathrm{ZINB}(X \mid \pi, \mu, \theta)\right)$.
Step three: data dimension reduction. The high-dimensional single-cell RNA-seq data are first centered and whitened, and a separation matrix is initialized on the basis of the preprocessed data. The separation matrix is then optimized iteratively and checked for convergence after each update: if it has converged, the final low-dimensional single-cell RNA-seq data are computed; if not, the optimization continues.
Assume that the reconstructed single-cell RNA-seq data follow the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is an unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown. Independent component analysis estimates the mixing coefficients and the independent components from the observed signals in $X$ alone. The original data are first centered and whitened. After this preprocessing the invention processes the sample data with fast independent component analysis (FastICA). First, a vector $w$ is initialized, where $W = A^{-1}$ is the separation matrix and $w$ is a row vector of $W$. Second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function, and let $w = w^{+}/\lVert w^{+}\rVert$. If the process has not converged, this step is repeated. Finally, FastICA estimates several independent components that carry the important information, thereby reducing the dimensionality of the single-cell RNA-seq data.
Step four: Gaussian mixture clustering and data visualization. When the Gaussian mixture model is initialized, k-means++ is used to choose the initial centroids. First, the model parameters of the Gaussian mixture distribution are initialized; the parameters are then optimized iteratively with the expectation-maximization (EM) algorithm. E-step: compute the posterior probability $\gamma_{ji}$ that the j-th sample was generated by the i-th Gaussian mixture component [formula omitted]. M-step: update the remaining model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ [formulas omitted]. In the experiments, iteration stops when the maximum number of iterations is reached; otherwise the parameters continue to be updated. The cluster label $\lambda_j$ of sample $x_j$ is then computed as $\lambda_j = \arg\max_i \gamma_{ji}$. Finally, the clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE).
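The negative binomial formula is not reproduced in this text. A commonly used mean/dispersion parameterization, assumed here for illustration and not confirmed as the patent's exact form, is shown below together with the mean-variance relation that explains why a Poisson model (variance equal to mean) is unsuitable.

```latex
\begin{align}
\mathrm{NB}(X \mid \mu, \theta)
  &= \frac{\Gamma(X + \theta)}{X!\;\Gamma(\theta)}
     \left(\frac{\theta}{\theta + \mu}\right)^{\theta}
     \left(\frac{\mu}{\theta + \mu}\right)^{X}, \\
\mathbb{E}[X] &= \mu, \qquad
\operatorname{Var}[X] = \mu + \frac{\mu^{2}}{\theta} \;>\; \mu .
\end{align}
```

Under this parameterization the variance always exceeds the mean and grows with it, and the distribution approaches the Poisson distribution only as $\theta \to \infty$.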
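The three output layers can be sketched as follows. The patent states only that three independent fully connected layers learn the three ZINB parameters; the activations used here (sigmoid for $\pi$, exponential for $\mu$ and $\theta$) are an assumption borrowed from common practice for keeping the parameters in their valid ranges, and the function name `zinb_heads` and the toy weights are hypothetical.

```python
import math

def zinb_heads(y, params):
    """Map the last decoder hidden vector y to per-gene (pi, mu, theta)."""
    def dense(weights, bias):
        # One fully connected layer: weights has one row per gene.
        return [sum(w * v for w, v in zip(row, y)) + b for row, b in zip(weights, bias)]

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    pi = [sigmoid(z) for z in dense(*params["pi"])]          # zero-inflation in (0, 1)
    mu = [math.exp(z) for z in dense(*params["mu"])]         # mean > 0
    theta = [math.exp(z) for z in dense(*params["theta"])]   # dispersion > 0
    return pi, mu, theta

# Hypothetical toy weights: 2 hidden units, 3 genes, shared by all heads for brevity.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b = [0.0, 0.1, -0.1]
params = {"pi": (W, b), "mu": (W, b), "theta": (W, b)}
pi, mu, theta = zinb_heads([1.0, 2.0], params)
```

In a trained network each head would have its own weights; they are shared here only to keep the example short.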
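A one-dimensional version of the EM loop and the label assignment $\lambda_j = \arg\max_i \gamma_{ji}$ can be sketched as follows. This is a toy illustration under the standard EM updates, not the patent's implementation: the initial means are supplied by hand rather than by k-means++, and the function name `gmm_em_1d`, the variance floor and the data are assumptions.

```python
import math

def gmm_em_1d(xs, means, n_iter=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (alphas, means, variances, labels)."""
    k, m = len(means), len(xs)
    alphas, variances = [1.0 / k] * k, [1.0] * k
    means = list(means)
    gamma = []
    for _ in range(n_iter):
        # E-step: gamma[j][i] is the posterior of component i for sample j.
        gamma = []
        for x in xs:
            dens = [
                a * math.exp(-((x - mu) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for a, mu, v in zip(alphas, means, variances)
            ]
            total = sum(dens)
            gamma.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for i in range(k):
            n_i = sum(g[i] for g in gamma)
            means[i] = sum(g[i] * x for g, x in zip(gamma, xs)) / n_i
            variances[i] = max(
                sum(g[i] * (x - means[i]) ** 2 for g, x in zip(gamma, xs)) / n_i,
                var_floor,
            )
            alphas[i] = n_i / m
    # Cluster label: the component with the largest posterior from the last E-step.
    labels = [max(range(k), key=lambda i: g[i]) for g in gamma]
    return alphas, means, variances, labels

xs = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
alphas, means, variances, labels = gmm_em_1d(xs, means=[0.0, 5.0])
```

The fixed iteration count mirrors the stopping rule in the text, where iteration ends once the maximum number of iterations is reached.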
The k-means++ initialization proceeds as follows. A point is selected at random from the input data set as the first cluster center. For each point in the data set, the distance to its nearest already-selected cluster center is computed. A new data point is then selected as the next cluster center, with points farther from the existing centers being selected with higher probability. These steps are repeated until k cluster centers have been selected, and the standard k-means algorithm is then run from these k initial centers.
For the Gaussian mixture model, the invention uses the Gaussian mixture implementation in the mixture module of scikit-learn, with its parameters left at their default values. Clustering performance is evaluated mainly with Normalized Mutual Information (NMI), Clustering Accuracy (ACC) and the Adjusted Rand Index (ARI); the higher the values of the three indexes, the better the clustering performance of the method.
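Of the three indexes, the Adjusted Rand Index can be computed from the contingency table of the two labelings, as in the minimal sketch below. The function name `adjusted_rand_index` is introduced for this illustration; in practice a library implementation would normally be used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    # Pair counts within each contingency cell, each true class, each predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Identical partitions with permuted label names score 1.
ari_same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
# A partition that splits every true class scores below 0.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how the clusters are named, which matters here because the Gaussian mixture labels carry no inherent correspondence to the annotated cell types.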
The invention has the following characteristics:
1. It reduces the influence of the batch effect of single-cell RNA-seq data on the final clustering result;
2. It reduces the influence of the high dimensionality and heavy noise of single-cell RNA-seq data on the clustering result;
3. During clustering of the single-cell RNA-seq data, it learns the data representation effectively and has strong representation capability;
4. After clustering, it provides good data visualization.
The autoencoder is an unsupervised deep learning model. It can reduce the dimensionality of the input data, learn the latent features of the analyzed data by adjusting the number of neural network layers and optimizing the training process, and recover the data through reconstruction. The denoising autoencoder takes corrupted, noisy data as the network input, so that the reconstructed data gain a degree of robustness to noise in the input.
Single-cell RNA-seq data are high-dimensional and also contain heavy noise. The noise usually shows up as sparsity: the data contain a large number of zero values, which come partly from genes that are truly not expressed and partly from expressed genes that go undetected because of limitations of the sequencing technology.
For deep clustering of single-cell RNA-seq data, the invention designs a clustering method based on a deep denoising autoencoder. The method combines the autoencoder, fast independent component analysis (FastICA), Gaussian mixture clustering and t-distributed stochastic neighbor embedding (t-SNE) to address feature learning, dimension reduction, clustering and data visualization, and reduces the influence of data noise on the clustering result by reconstructing the data with a zero-inflated negative binomial (ZINB) distribution.
Claims (6)
1. A single-cell RNA-seq data clustering method based on a deep denoising autoencoder, characterized by comprising the following steps:
1) adjusting batch effect and data standardization preprocessing:
selecting five public, real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases to cluster single cells, wherein the gene expression values in the five data sets (GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322) are obtained from a variety of tissue cells, and reading the raw single-cell RNA-seq data and applying batch effect adjustment and standardization preprocessing to them;
2) data reconstruction and noise reduction:
inputting the log-normalized single-cell RNA-seq data into a deep denoising autoencoder, wherein the deep denoising autoencoder reconstructs the data using a zero-inflated negative binomial (ZINB) distribution, and the reconstructed data better preserve the original biological characteristics;
3) data dimension reduction:
the single-cell RNA-seq data reconstructed by the deep denoising autoencoder are still high-dimensional, which makes identifying cell subtypes very difficult; fast independent component analysis (FastICA) is used to reduce the dimensionality of the sample data and eliminate redundant components, so that redundancy in the data does not interfere with the early discovery and treatment of cancer;
4) Gaussian mixture clustering and data visualization:
after the low-dimensional, low-noise single-cell RNA-seq data is obtained, a Gaussian mixture model is used to cluster the cells and determine cell types, the resulting cell types being the potential cell subtypes discovered; the final clustering result is visualized using t-distributed stochastic neighbor embedding (t-SNE), and the clustering result is analyzed in combination with existing cell and cancer databases, thereby helping doctors with the early discovery of cancer.
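Steps 3) and 4) of claim 1 can be sketched with off-the-shelf components. The sketch below assumes scikit-learn is available and that the input is already the denoised output of step 2), which is not implemented here. The function name `cluster_denoised_cells`, the component and cluster counts, and the perplexity are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture


def cluster_denoised_cells(denoised, n_components=5, n_clusters=3, seed=0):
    """Steps 3) and 4): FastICA -> Gaussian mixture -> t-SNE.

    `denoised` is a (cells x genes) array, assumed to be the output of
    the self-encoder of step 2). All hyperparameters are assumptions.
    """
    # Step 3: reduce dimensionality with FastICA.
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    low_dim = ica.fit_transform(denoised)

    # Step 4a: cluster the cells with a Gaussian mixture model.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    labels = gmm.fit_predict(low_dim)

    # Step 4b: embed the low-dimensional data in 2-D for plotting.
    tsne = TSNE(n_components=2, perplexity=30.0, random_state=seed)
    embedding = tsne.fit_transform(low_dim)
    return labels, embedding
```

The returned `embedding` has one 2-D point per cell and can be scatter-plotted with the points colored by `labels`; the perplexity must stay below the number of cells.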
2. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data in step 1) comprises: first, using a hierarchical Bayesian model to adjust the batch effect of the single-cell RNA-seq data and to address the uncertainty caused by measurement sensitivity; next, selecting the cells with normal gene expression levels; then, normalizing the data for sequencing depth and gene length using a logarithmic normalization method.
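The cell filtering and logarithmic normalization of claim 2 might look like the following minimal NumPy sketch. The hierarchical Bayesian batch-effect adjustment is not shown. The function name `filter_and_log_normalize`, the thresholds `min_genes` and `max_genes`, the `scale` factor, and the choice of `log(1 + x)` are all assumptions made for illustration; the claim does not specify them.

```python
import numpy as np


def filter_and_log_normalize(counts, gene_lengths=None,
                             min_genes=1, max_genes=None, scale=1e4):
    """Filter cells, then log-normalize a (cells x genes) count matrix."""
    counts = np.asarray(counts, dtype=float)

    # Keep cells whose number of detected (non-zero) genes is in range.
    detected = (counts > 0).sum(axis=1)
    keep = detected >= min_genes
    if max_genes is not None:
        keep &= detected <= max_genes
    kept = counts[keep]

    # Optional gene-length correction: counts per unit of gene length.
    if gene_lengths is not None:
        kept = kept / np.asarray(gene_lengths, dtype=float)

    # Sequencing-depth correction: rescale each cell to `scale` in total.
    depth = kept.sum(axis=1, keepdims=True)
    depth = np.where(depth == 0, 1.0, depth)  # avoid dividing by zero
    normalized = kept / depth * scale

    # log(1 + x) keeps zeros at zero and compresses large values.
    return np.log1p(normalized), keep
```

The boolean mask `keep` is returned alongside the matrix so that cell identifiers can be filtered consistently.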
3. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the deep noise reduction self-encoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution, and the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the zero-inflated negative binomial distribution;
the single-cell RNA-seq data to be analyzed is denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term; the decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data; the input dimension of the self-encoder matches the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network; a zero-inflation factor is added to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero, that is, the single-cell RNA-seq data is modeled with a zero-inflated negative binomial distribution, expressed as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$; if $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully-connected layers are added after it, that is, the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the zero-inflated negative binomial distribution; the loss function of the noise reduction part of the self-encoder is expressed as $L_d = -\log(\mathrm{ZINB}(X \mid \pi, \mu, \theta))$.
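The loss $L_d$ of claim 3 can be evaluated for a single count as in the sketch below, which uses only the Python standard library. It assumes the common mean-dispersion parametrization of the negative binomial, in which $\theta$ acts as a dispersion parameter; the claim calls the third output the "variance", so this reading is an assumption. The function names are hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log pmf of a negative binomial with mean mu > 0, dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1.0)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, pi, mu, theta):
    """L_d = -log ZINB(x | pi, mu, theta) for one count x, with 0 <= pi < 1."""
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # The point mass delta_0 contributes only at zero counts.
        return -math.log(pi + (1.0 - pi) * math.exp(log_nb))
    # For positive counts only the NB branch is active.
    return -(math.log(1.0 - pi) + log_nb)
```

In training, this quantity would be summed or averaged over all entries of the expression matrix, with $\pi$, $\mu$ and $\theta$ produced by the three output layers.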
4. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the dimension of the single-cell RNA-seq data is reduced using fast independent component analysis (FastICA) in step 3); independent component analysis decomposes the original data into a linear combination of statistically independent non-Gaussian components, assuming that all components of the data are independent and equally important;
the reconstructed single-cell RNA-seq data is assumed to follow the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is an unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown, and independent component analysis estimates the mixing coefficients and the independent components solely from the observed signals in $X$; the method first centers and whitens the original data, and the preprocessed sample data is then processed with FastICA: first, a vector $w$ is initialized, where $W = A^{-1}$ and $w$ is a row vector of $W$; second, let $w^{+} = E\{X g(w^{T} X)\} - E\{g'(w^{T} X)\}\,w$, where $g$ is a nonlinear scalar function, and let $w = w^{+} / \lVert w^{+} \rVert$; if the process has not converged, this step is repeated; finally, FastICA estimates several independent components containing important information, thereby reducing the dimensionality of the single-cell RNA-seq data.
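The fixed-point update of claim 4 can be sketched for a single component as follows. This is a minimal NumPy illustration, assuming data laid out as (signals x samples), $g = \tanh$ as the nonlinearity, and a convergence test on the change in direction of $w$; none of these choices is stated in the claim, and the function names are hypothetical.

```python
import numpy as np


def whiten(X):
    """Center and whiten X (signals x samples) so its covariance is I."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T @ Xc


def fastica_one_unit(Z, max_iter=200, tol=1e-8, seed=0):
    """Estimate one unmixing vector w from whitened data Z."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ Z                  # projections w^T X, one per sample
        g = np.tanh(wx)             # nonlinearity g
        g_prime = 1.0 - g ** 2      # derivative g'
        # w+ = E{X g(w^T X)} - E{g'(w^T X)} w
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)  # w = w+ / ||w+||
        # Converged when the direction stops changing (sign may flip).
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w
```

Calling `w @ whiten(X)` then yields one estimated independent component; further components would need an additional decorrelation step, which is omitted here.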
5. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein step 4) uses a Gaussian mixture model to cluster cells and determine cell types, the specific steps comprising:
first, the model parameters of the Gaussian mixture distribution are initialized, and the parameters are then iteratively optimized based on the expectation-maximization (EM) algorithm; in the E-step of the EM algorithm, the posterior probability $\gamma_{ji}$ that the $j$-th sample was generated by the $i$-th Gaussian mixture component is calculated:
in the M-step of the EM algorithm, the model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ are calculated based on the following formula:
iteration stops when the maximum number of iterations is reached, and otherwise the parameters continue to be iteratively updated; finally, the cluster label $\lambda_j$ of sample $x_j$ is calculated as $\lambda_j = \arg\max_i \gamma_{ji}$, and the final clustering result is visualized using t-distributed stochastic neighbor embedding (t-SNE), which displays the clustering result in two-dimensional coordinates.
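The E-step and M-step formulas of claim 5 are not reproduced in this text. The sketch below therefore uses the standard textbook EM updates for a Gaussian mixture, which is an assumption about what the claim intends. The function names, the regularization term `reg`, and the requirement that the caller passes initial means are illustrative choices.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Multivariate normal density at each row of X."""
    d = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm


def fit_gmm(X, init_means, max_iter=100, reg=1e-6):
    """EM for a Gaussian mixture; returns (alpha, means, covs, labels)."""
    n, d = X.shape
    means = np.array(init_means, dtype=float)
    k = means.shape[0]
    alpha = np.full(k, 1.0 / k)
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    # As in the claim, stop once the maximum iteration count is reached.
    for _ in range(max_iter):
        # E-step: gamma[j, i] is the posterior that sample j came from
        # component i, i.e. alpha_i * p(x_j | mu_i, Sigma_i), normalized.
        weighted = np.column_stack(
            [alpha[i] * gaussian_pdf(X, means[i], covs[i]) for i in range(k)]
        )
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_i, Sigma_i and alpha_i from gamma.
        nk = gamma.sum(axis=0)
        means = (gamma.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (gamma[:, i, None] * diff).T @ diff / nk[i]
            covs[i] += reg * np.eye(d)
        alpha = nk / n
    # Cluster label: lambda_j = argmax_i gamma_ji.
    labels = gamma.argmax(axis=1)
    return alpha, means, covs, labels
```

This version works with raw densities, so it is only suitable for low-dimensional inputs such as the FastICA output; a production implementation would compute the responsibilities in log space.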
6. The method according to claim 5, wherein the Gaussian mixture model is initialized using k-means++, which addresses the centroid initialization problem as follows: a point is randomly selected from the set of input data points as the first cluster center; for each object in the data set, its similarity to the nearest cluster center is calculated; a new data point is selected as a new cluster center according to the principle that a point with larger similarity is selected as a cluster center with larger probability; these steps are repeated until k cluster centers have been selected, and the k initial cluster centers are then used to run the standard k-means algorithm.
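The seeding procedure of claim 6 can be sketched in NumPy as below. The claim speaks of "similarity" to the nearest center; standard k-means++ samples the next center with probability proportional to the squared Euclidean distance to the nearest already-chosen center, and that standard rule is assumed here. The function name `kmeans_pp_init` is hypothetical, and the subsequent k-means run is not shown.

```python
import numpy as np


def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from the rows of X by k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First center: a uniformly random data point.
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to d2,
        # so points far from all current centers are favored.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

The returned centers could serve as the starting point for k-means, whose result would in turn initialize the Gaussian mixture means.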
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152923.5A CN113889192B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152923.5A CN113889192B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113889192A (en) | 2022-01-04 |
CN113889192B (en) | 2024-02-27 |
Family
ID=79008210
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152923.5A (Active, CN113889192B (en)) | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder | 2021-09-29 | 2021-09-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113889192B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107075543A (en) * | 2014-04-21 | 2017-08-18 | President and Fellows of Harvard College | System and method for bar coded nucleic acid |
US20190071718A1 (en) * | 2016-04-15 | 2019-03-07 | Koninklijke Philips N.V. | Sub-population detection and quantization of receptor-ligand states for characterizing inter-cellular communication and intratumoral heterogeneity |
CN110147648A (en) * | 2019-06-20 | 2019-08-20 | Zhejiang University | Automobile sensor fault detection method based on independent component analysis and sparse denoising self-encoding encoder |
CN110890132A (en) * | 2019-11-19 | 2020-03-17 | Hunan University | Cancer mutation cluster identification method based on adaptive Gaussian mixture model |
CN111428768A (en) * | 2020-03-18 | 2020-07-17 | University of Electronic Science and Technology of China | Hellinger distance-Gaussian mixture model-based clustering method |
CN111785329A (en) * | 2020-07-24 | 2020-10-16 | National University of Defense Technology | Single-cell RNA sequencing clustering method based on confrontation automatic encoder |
CN112464004A (en) * | 2020-11-26 | 2021-03-09 | Dalian University of Technology | Multi-view depth generation image clustering method |
CN112735536A (en) * | 2020-12-23 | 2021-04-30 | Hunan University | Single cell integrated clustering method based on subspace randomization |
CN112967755A (en) * | 2021-03-04 | 2021-06-15 | Shenzhen University | Cell type identification method for single cell RNA sequencing data |
Non-Patent Citations (2)
Title |
---|
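Luan Zhiling: "Research status and development trends of deep feature selection strategies for DNA genes", Journal of Jiamusi Vocational Institute, no. 05, 15 May 2019 (2019-05-15) * |
Gao Meijia: "A single-cell RNA-seq data preprocessing algorithm based on loess regression weighting", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) * |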
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114462548B (en) * | 2022-02-23 | 2023-07-18 | Qufu Normal University | Method for improving accuracy of single-cell deep clustering algorithm |
CN115527610A (en) * | 2022-11-09 | 2022-12-27 | Shanghai Jiao Tong University | Cluster analysis method of unicellular omics data |
CN115527610B (en) * | 2022-11-09 | 2023-11-24 | Shanghai Jiao Tong University | Cluster analysis method for single-cell histology data |
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | Qufu Normal University | RNA layered embedding clustering method based on graph convolution neural network |
Also Published As
Publication number | Publication date |
---|---|
CN113889192B (en) | 2024-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113889192B (en) | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder | |
CN108805167B (en) | Sparse depth confidence network image classification method based on Laplace function constraint | |
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
CN111564183B (en) | Single cell sequencing data dimension reduction method fusing gene ontology and neural network | |
CN110826635B (en) | Sample clustering and feature identification method based on integration non-negative matrix factorization | |
Albergante et al. | Estimating the effective dimension of large biological datasets using Fisher separability analysis | |
Yan et al. | Unsupervised and semi‐supervised learning: The next frontier in machine learning for plant systems biology | |
CN112735536A (en) | Single cell integrated clustering method based on subspace randomization | |
CN116580848A (en) | Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers | |
Bellazzi et al. | The Gene Mover's Distance: Single-cell similarity via Optimal Transport | |
CN116109613A (en) | Defect detection method and system based on distribution characterization | |
Zhang et al. | SLRRSC: single-cell type recognition method based on similarity and graph regularization constraints | |
CN111178427A (en) | Depth self-coding embedded clustering method based on Sliced-Wasserstein distance | |
Wang et al. | scDSSC: deep sparse subspace clustering for scRNA-seq data | |
CN115661498A (en) | Self-optimization single cell clustering method | |
CN117497038A (en) | Method for rapidly optimizing culture medium formula based on nuclear method | |
CN114997303A (en) | Bladder cancer metabolic marker screening method and system based on deep learning | |
CN114783526A (en) | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder | |
CN112768001A (en) | Single cell trajectory inference method based on manifold learning and main curve | |
Peng et al. | A deep learning-based unsupervised learning method for spatially resolved transcriptomic data analysis | |
CN117727373B (en) | Sample and feature double weighting-based intelligent C-means clustering method for feature reduction | |
Abou El-Naga et al. | Consensus Nature Inspired Clustering of Single-Cell RNA-Sequencing Data | |
Hu et al. | WEDGE: recovery of gene expression values for sparse single-cell RNA-seq datasets using matrix decomposition | |
Murugesan et al. | Weighted Fuzzy Score Normalization and Bayesian Independent Principal Component Analysis Imputation for Breast Cancer Gene Expression Analysis. | |
CN113177604B (en) | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||