CN113889192A - Single cell RNA-seq data clustering method based on deep noise reduction self-encoder - Google Patents

Info

Publication number
CN113889192A
CN113889192A CN202111152923.5A CN202111152923A CN113889192A CN 113889192 A CN113889192 A CN 113889192A CN 202111152923 A CN202111152923 A CN 202111152923A CN 113889192 A CN113889192 A CN 113889192A
Authority
CN
China
Prior art keywords
data
cell rna
encoder
seq data
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111152923.5A
Other languages
Chinese (zh)
Other versions
CN113889192B (en)
Inventor
王艺杰
王文庆
杨东
胥冠军
崔逸群
毕玉冰
刘超飞
董夏昕
刘迪
肖力炀
刘骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Thermal Power Research Institute Co Ltd
Original Assignee
Xian Thermal Power Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Thermal Power Research Institute Co Ltd
Priority to CN202111152923.5A
Publication of CN113889192A
Application granted
Publication of CN113889192B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder. The method first adjusts for the batch effect in the single-cell RNA-seq data and standardizes the data to reduce the adverse effect of technical noise. Second, a deep noise reduction self-encoder based on the zero-inflated negative binomial (ZINB) distribution mines the feature information of the data. Third, fast independent component analysis (FastICA) reduces the dimensionality of the data, which improves the computational efficiency of the model. Finally, a Gaussian mixture model fitted by expectation maximization clusters the cells more accurately, and t-distributed stochastic neighbor embedding (t-SNE) visualizes the final clustering result. The invention reduces the interference that the high dimensionality and heavy noise of single-cell RNA-seq data cause in clustering, accurately learns the gene expression information needed to cluster cells, and supports gene network construction, cell type discovery, and the early detection and treatment of cancer.

Description

Single cell RNA-seq data clustering method based on deep noise reduction self-encoder
Technical Field
The invention belongs to the technical field of single-cell RNA-seq data analysis in bioinformatics, and particularly relates to a single-cell RNA-seq (ribonucleic acid sequencing) data clustering method based on a deep noise reduction self-encoder, that is, a deep denoising autoencoder.
Background
With the rapid development of sequencing technologies, researchers have acquired a large amount of single-cell RNA-seq data. Unsupervised clustering plays an important role in analyzing these data: a clustering method designed for single-cell RNA-seq data can both identify unknown cell types and reveal the heterogeneity of cells. By studying such clustering methods, researchers can identify cell states more accurately, build network structures between cells, and better understand processes such as the differentiation of cancer cells, which lays a foundation for the early detection and treatment of cancer. At present, traditional clustering methods such as hierarchical clustering, spectral clustering and density-based spatial clustering of applications with noise (DBSCAN) are widely used. However, single-cell RNA-seq data are high-dimensional, noisy and contain a large number of zero values, so these traditional methods cannot cluster them effectively.
Disclosure of Invention
To overcome the above technical problems, the invention provides a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder. The method combines a self-encoder with fast independent component analysis (FastICA) to perform feature learning and dimension reduction during clustering, clusters the resulting data with a Gaussian mixture model, and reduces the influence of data noise on the clustering result by reconstructing the data under a zero-inflated negative binomial (ZINB) distribution.
In order to achieve the purpose, the invention adopts the technical scheme that:
A single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder comprises the following steps:
1) batch effect adjustment and data standardization preprocessing:
Five public real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases are selected in order to cluster single cells and thereby discover cell subtypes, which assists the early detection and targeted treatment of related cancers. The gene expression values in the five data sets (GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322) were obtained from cells of various tissues. The original single-cell RNA-seq data are read and then subjected to batch effect adjustment and standardization preprocessing, which removes systematic technical deviations that are unrelated to biological state and arise because samples are processed and measured in different batches;
2) data reconstruction and noise reduction:
Single-cell RNA-seq data contain a large number of zero values. A zero value may indicate that a gene is truly not expressed in a cell, but it may also be the result of technical error, and this noise greatly interferes with the discovery of cell subtypes. The log-normalized single-cell RNA-seq data are therefore input into a deep noise reduction self-encoder, which reconstructs the data using a zero-inflated negative binomial (ZINB) distribution so that the reconstructed data better preserve the original biological characteristics;
3) data dimension reduction:
The single-cell RNA-seq data reconstructed by the deep noise reduction self-encoder are still high-dimensional, which makes the identification of cell subtypes very difficult. Fast independent component analysis (FastICA) is used to reduce the dimensionality of the sample data and eliminate its redundant parts, so that redundancy no longer interferes with the early detection and related treatment of cancer;
4) Gaussian mixture clustering and data visualization:
Once the low-dimensional, low-noise single-cell RNA-seq data have been obtained, a Gaussian mixture model is used to cluster the cells and determine the cell types; the cell types obtained are the potential cell subtypes discovered. The final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE) and analyzed together with existing cell and cancer databases, which helps doctors detect cancer early.
The batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data in step 1) comprises: first, adjusting the batch effect of the single-cell RNA-seq data with a hierarchical Bayesian model, which also addresses the uncertainty caused by measurement sensitivity; then retaining the cells whose gene expression levels are normal; and then normalizing the data for sequencing depth and gene length with a logarithmic normalization method.
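The filtering and logarithmic normalization described above can be illustrated with a minimal NumPy sketch. It is not the invention's implementation: the function name `log_normalize`, the `min_genes` threshold and the `target_sum` scale are assumptions chosen for illustration, and the sketch covers only cell filtering and sequencing-depth normalization, not the hierarchical Bayesian batch adjustment or gene-length normalization.

```python
import numpy as np

def log_normalize(counts, min_genes=1, target_sum=1e4):
    """Filter low-quality cells, scale each cell to a common depth, then log-transform.

    counts: (n_cells, n_genes) array of raw read counts.
    min_genes (>= 1) and target_sum are illustrative defaults, not values from the source.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep cells that express at least `min_genes` genes.
    keep = (counts > 0).sum(axis=1) >= min_genes
    kept = counts[keep]
    # Sequencing-depth normalization: every kept cell sums to `target_sum`.
    depth = kept.sum(axis=1, keepdims=True)
    scaled = kept / depth * target_sum
    # log(1 + x) keeps zeros at zero and compresses large counts.
    return np.log1p(scaled), keep

raw = np.array([[10, 0, 30],
                [0, 0, 0],      # empty cell, removed by the filter
                [1, 1, 2]])
normalized, keep = log_normalize(raw)
```

After this step every retained cell has the same total expression before the log transform, so differences between cells reflect relative expression rather than sequencing depth.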
The deep noise reduction self-encoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution; the self-encoder has three outputs, which learn the zero-inflation factor $\pi$, the mean $\mu$ and the dispersion $\theta$ (which governs the variance) of the ZINB distribution, respectively;
the single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process, $b$ is the bias term and $\sigma_h$ is the activation function; the decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data; the input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network; a zero-inflation factor is added to the negative binomial (NB) model, which can be understood as adding an impulse function at zero, so that the single-cell RNA-seq data are modeled with the zero-inflated negative binomial distribution, formulated as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$; if $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully connected layers are added after it, so that the whole self-encoder has three outputs that learn $\pi$, $\mu$ and $\theta$ respectively; the loss function of the noise reduction part of the self-encoder is expressed as $L_d = -\log \mathrm{ZINB}(X \mid \pi, \mu, \theta)$.
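To make the ZINB loss concrete, the following sketch evaluates it for a single count using only the Python standard library. It assumes the common mean/dispersion parameterization of the negative binomial distribution; the function names `nb_log_pmf` and `zinb_nll` are hypothetical, and the inputs must satisfy `mu > 0`, `theta > 0` and `0 <= pi < 1`.

```python
import math

def nb_log_pmf(x, mu, theta):
    """log NB(x | mu, theta) with mean mu and dispersion theta (assumed parameterization)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, pi, mu, theta):
    """Negative log-likelihood -log ZINB(x | pi, mu, theta) for a single count x."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    # The point mass delta_0 contributes only when the observed count is zero.
    likelihood = (pi if x == 0 else 0.0) + (1.0 - pi) * nb
    return -math.log(likelihood)

loss_zero = zinb_nll(0, pi=0.3, mu=2.0, theta=1.5)      # observed dropout or true zero
loss_nonzero = zinb_nll(3, pi=0.3, mu=2.0, theta=1.5)   # observed positive count
```

A larger `pi` lowers the loss on zero counts and raises it on positive counts, which is how the model separates excess zeros from the negative binomial component.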
In step 3), the dimensionality of the single-cell RNA-seq data is reduced by fast independent component analysis (FastICA); independent component analysis assumes that the underlying components of the data are statistically independent of one another and treats all components as equally important, and it decomposes the original data into a linear combination of statistically independent non-Gaussian components;
the reconstructed single-cell RNA-seq data are assumed to follow the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is the unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown, and independent component analysis estimates them from the observed signals in $X$ alone; the original data are first centered and whitened, and the preprocessed sample data are then processed by FastICA as follows: first, the unmixing matrix is defined as $W = A^{-1}$, $w$ denotes a row vector of $W$, and $w$ is initialized; second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function and $g'$ is its derivative, and let $w = w^{+}/\lVert w^{+}\rVert$; if $w$ has not converged, the second step is repeated; finally, FastICA estimates the several independent components that carry the important information, which reduces the dimensionality of the single-cell RNA-seq data.
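The fixed-point update above can be sketched in NumPy for a single component. This is an illustration under stated assumptions rather than the invention's code: the nonlinearity `g = tanh` is a common choice that the source does not specify, the function names `whiten` and `fastica_one_unit` are hypothetical, and the toy sources and mixing matrix are invented for the demonstration.

```python
import numpy as np

def whiten(X):
    """Center and whiten X of shape (n_features, n_samples)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # E D^{-1/2} E^T maps the data to unit covariance.
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ Xc

def fastica_one_unit(Z, n_iter=200, tol=1e-8, seed=0):
    """Estimate one row w of the unmixing matrix from whitened data Z."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wx = w @ Z                                        # projections w^T X
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w  # w+ = E{X g} - E{g'} w
        w_new /= np.linalg.norm(w_new)                    # w = w+ / ||w+||
        converged = abs(abs(w_new @ w) - 1.0) < tol       # same direction up to sign
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(1)
S = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])  # independent sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])                              # toy mixing matrix
Z = whiten(A @ S)
w = fastica_one_unit(Z)
recovered = w @ Z   # estimate of one source, up to sign and scale
```

Recovering several components requires repeating the update with a decorrelation step between the rows of $W$, which this sketch omits.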
In the step 4), a Gaussian mixture model is used for clustering the cells and determining the cell types, and the method specifically comprises the following steps:
first, the parameters of the Gaussian mixture distribution are initialized, and the parameters are then optimized iteratively with the expectation-maximization algorithm; in the E-step, the posterior probability $\gamma_{ji}$ that the $j$-th sample $x_j$ was generated by the $i$-th Gaussian mixture component is calculated as
$$\gamma_{ji} = \frac{\alpha_i\, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l\, p(x_j \mid \mu_l, \Sigma_l)};$$
in the M-step, the model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ are updated with the standard formulas
$$\mu_i = \frac{\sum_{j=1}^{m} \gamma_{ji}\, x_j}{\sum_{j=1}^{m} \gamma_{ji}}, \qquad \Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji}\,(x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{m} \gamma_{ji}},$$
$$\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji},$$
where $m$ is the number of samples and $k$ is the number of mixture components;
iteration stops when the maximum number of iterations is reached in the experiment; otherwise the parameters continue to be updated; finally, the cluster label $\lambda_j$ of sample $x_j$ is calculated as $\lambda_j = \arg\max_i \gamma_{ji}$, and the final clustering result is visualized on two-dimensional coordinates with t-distributed stochastic neighbor embedding (t-SNE).
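The E-step, the M-step and the final label assignment can be sketched as a short NumPy routine. This is a minimal illustration, not the invention's code: the function names, the small ridge term `reg` added to each covariance for numerical stability, and the toy two-cluster data are all assumptions.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    expo = -0.5 * np.einsum("nd,de,ne->n", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def fit_gmm(X, mu, n_iter=50, reg=1e-6):
    """EM for a Gaussian mixture; `mu` holds the k initial means, shape (k, d)."""
    m, d = X.shape
    k = mu.shape[0]
    mu = mu.astype(float).copy()
    sigma = np.stack([np.cov(X.T).reshape(d, d) + reg * np.eye(d)] * k)
    alpha = np.full(k, 1.0 / k)
    for _ in range(n_iter):                      # stop at the maximum iteration count
        # E-step: gamma[j, i] = posterior that sample j came from component i.
        dens = np.stack([alpha[i] * gaussian_pdf(X, mu[i], sigma[i])
                         for i in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances and mixing coefficients.
        weight = gamma.sum(axis=0)
        mu = (gamma.T @ X) / weight[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (gamma[:, i, None] * diff).T @ diff / weight[i] + reg * np.eye(d)
        alpha = weight / m
    # Cluster label: the component with the largest posterior.
    return gamma.argmax(axis=1), mu, sigma, alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
labels, mu, sigma, alpha = fit_gmm(X, mu=X[[0, 150]])
```

On well-separated data such as this toy example the two groups receive different labels and the mixing coefficients settle near one half each.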
When the Gaussian mixture model is initialized, k-means++ is used to choose the initial centroids, as follows: a point is selected at random from the input data set as the first cluster center; for each point in the data set, the distance to its nearest already-selected cluster center is calculated; a new data point is then selected as the next cluster center, on the principle that points farther from the existing centers are selected with higher probability; these steps are repeated until k cluster centers have been selected, and the standard k-means algorithm is then run from these k initial cluster centers.
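The seeding procedure can be sketched as follows. The sketch assumes the usual k-means++ weighting, in which the selection probability is proportional to the squared distance to the nearest chosen center; the function name `kmeans_pp_init` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from the rows of X with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Far-away points are proportionally more likely to be picked.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                   [5.1, 5.0], [10.0, 0.0], [10.0, 0.1]])
centers = kmeans_pp_init(points, k=3)
```

Because an already-chosen point has distance zero to itself, it can never be drawn again, so the k centers are always distinct data points.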
The invention has the beneficial effects that:
The invention combines a self-encoder with fast independent component analysis (FastICA) to learn a representation of the single-cell RNA-seq data and reduce its dimensionality, reconstructs the data under a zero-inflated negative binomial (ZINB) distribution to reduce the influence of data noise on the clustering result, clusters the low-dimensional single-cell RNA-seq data with a Gaussian mixture model, and finally visualizes the clustering result with t-distributed stochastic neighbor embedding (t-SNE). The method can identify cell subtypes and can support the early detection, diagnosis and treatment of cancer.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
FIG. 1 shows the four major steps by which the invention improves the clustering of single-cell RNA-seq data with a deep noise reduction self-encoder: batch effect adjustment and data standardization preprocessing; data reconstruction and noise reduction; data dimension reduction; and Gaussian mixture clustering with data visualization.
The invention provides a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder, which comprises the following steps:
Step one: batch effect adjustment and data standardization preprocessing. To verify its effectiveness, the invention uses five public data sets downloaded from the ArrayExpress and GEO databases. The gene expression values in the five data sets (GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322) were obtained from cells of various tissues. With these data as the raw input, the invention uses a hierarchical Bayesian model to adjust the batch effect of the single-cell RNA-seq data, which also addresses the uncertainty caused by measurement sensitivity. The Python-based single-cell gene expression analysis package SCANPY is used to screen and filter the original single-cell RNA-seq data and remove data of poor sequencing quality; the data are then standardized to facilitate subsequent network learning.
Step two: data reconstruction and noise reduction. The analysis of single-cell RNA-seq data is hampered by problems such as data amplification and data loss. The invention uses a noise reduction self-encoder to map the input single-cell RNA-seq data into an embedding space. In the experiments, random Gaussian noise was added to the data and the self-encoder was built from fully connected layers. To better capture single-cell RNA-seq data, the invention adds three independent fully connected layers after the last hidden layer of the decoder; the three outputs learn the impulse function adjustment factor of the zero-inflated negative binomial distribution, the mean of the negative binomial distribution, and its dispersion, respectively. The loss function of the noise reduction part of the self-encoder is defined as the negative logarithm of the zero-inflated negative binomial likelihood. The single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term; the decoding stage mirrors the encoding stage and reconstructs the encoded data. The input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network. Recent research on single-cell RNA-seq data shows that such data are best approximated by the negative binomial (NB) distribution, which in its standard mean and dispersion form is
$$\mathrm{NB}(X \mid \mu, \theta) = \frac{\Gamma(X + \theta)}{X!\,\Gamma(\theta)} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{X}.$$
Single-cell RNA-seq data are usually overdispersed: the variance tends to exceed the mean and changes as the mean changes, so a Poisson distribution, whose variance equals its mean, is not a suitable approximation. Single-cell RNA-seq data also contain an especially large number of zero values. A zero value in gene expression data may come from a gene that is truly not expressed in the biological process (true zero) or from a loss caused by technical factors during sequencing (dropout zero). To better capture single-cell RNA-seq data, the invention improves the traditional noise reduction self-encoder by adding a zero-inflation factor to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero; that is, the data are modeled with the zero-inflated negative binomial (ZINB) distribution, formulated as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$. If $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, the invention adds three independent fully connected layers after it, so that the whole self-encoder has three outputs that learn the zero-inflation factor $\pi$, the mean $\mu$ and the dispersion $\theta$ (which governs the variance) of the ZINB distribution, respectively. The invention expresses the loss function of the noise reduction part of the self-encoder as $L_d = -\log \mathrm{ZINB}(X \mid \pi, \mu, \theta)$.
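The three-output structure can be illustrated with a single forward pass in NumPy. This sketch makes several assumptions that the source does not state: one hidden layer per side instead of five, ReLU hidden activations, a sigmoid for $\pi$, an exponential for $\mu$ and a softplus for $\theta$, random untrained weights, and a row-major layout in which each row of `X` is a cell (so `X @ W` plays the role of $WX$). All names are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(X, params, noise_std=0.1, rng=None):
    """One forward pass: corrupt X, encode, decode, then emit (pi, mu, theta)."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = X + rng.normal(0.0, noise_std, X.shape)       # random Gaussian corruption
    h = relu(noisy @ params["W"] + params["b"])           # h(X) = sigma_h(WX + b)
    y = relu(h @ params["W_dec"] + params["b_dec"])       # Y = sigma_o(W'h(X) + b')
    pi = 1.0 / (1.0 + np.exp(-(y @ params["W_pi"])))      # sigmoid keeps pi in (0, 1)
    mu = np.exp(y @ params["W_mu"])                       # exp keeps the mean positive
    theta = np.log1p(np.exp(y @ params["W_theta"]))       # softplus keeps theta positive
    return pi, mu, theta

rng = np.random.default_rng(42)
n_genes, n_hidden, n_latent = 6, 4, 2
params = {
    "W": rng.normal(0, 0.1, (n_genes, n_latent)), "b": np.zeros(n_latent),
    "W_dec": rng.normal(0, 0.1, (n_latent, n_hidden)), "b_dec": np.zeros(n_hidden),
    "W_pi": rng.normal(0, 0.1, (n_hidden, n_genes)),
    "W_mu": rng.normal(0, 0.1, (n_hidden, n_genes)),
    "W_theta": rng.normal(0, 0.1, (n_hidden, n_genes)),
}
counts = rng.poisson(1.0, (5, n_genes)).astype(float)    # toy count matrix
pi, mu, theta = forward(np.log1p(counts), params)
```

Each head returns one value per cell and gene, so the ZINB loss can be evaluated entry by entry against the observed counts during training.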
Step three: data dimension reduction. The high-dimensional single-cell RNA-seq data are first centered and whitened, and a separation matrix is initialized and calculated from the preprocessed data. The separation matrix is then optimized repeatedly, with a convergence check after each update: if it has converged, the final low-dimensional single-cell RNA-seq data are computed; if not, optimization continues.
The reconstructed single-cell RNA-seq data are assumed to follow the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is the unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown. Independent component analysis estimates the mixing coefficients and the independent components from the observed signals in $X$ alone. The original data are first centered and whitened, and the invention then processes the preprocessed sample data with fast independent component analysis (FastICA). First, the unmixing matrix is defined as $W = A^{-1}$, $w$ denotes a row vector of $W$, and $w$ is initialized. Second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function and $g'$ is its derivative, and let $w = w^{+}/\lVert w^{+}\rVert$. If $w$ has not converged, the second step is repeated. Finally, FastICA estimates the several independent components that carry the important information, which reduces the dimensionality of the single-cell RNA-seq data.
Step four: Gaussian mixture clustering and data visualization. When the Gaussian mixture model is initialized, k-means++ is used to choose the initial centroids. The parameters of the Gaussian mixture distribution are first initialized and are then optimized iteratively with the expectation-maximization algorithm. In the E-step, the posterior probability $\gamma_{ji}$ that the $j$-th sample $x_j$ was generated by the $i$-th Gaussian mixture component is calculated as
$$\gamma_{ji} = \frac{\alpha_i\, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l\, p(x_j \mid \mu_l, \Sigma_l)}.$$
In the M-step, the model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ are updated with the standard formulas
$$\mu_i = \frac{\sum_{j=1}^{m} \gamma_{ji}\, x_j}{\sum_{j=1}^{m} \gamma_{ji}}, \qquad \Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji}\,(x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{m} \gamma_{ji}},$$
$$\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji},$$
where $m$ is the number of samples and $k$ is the number of mixture components.
Iteration stops when the maximum number of iterations is reached in the experiment; otherwise the parameters continue to be updated. Finally, the cluster label $\lambda_j$ of sample $x_j$ is calculated as $\lambda_j = \arg\max_i \gamma_{ji}$, and the clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE).
The k-means++ initialization proceeds as follows: a point is selected at random from the input data set as the first cluster center; for each point in the data set, the distance to its nearest already-selected cluster center is calculated; a new data point is then selected as the next cluster center, on the principle that points farther from the existing centers are selected with higher probability; these steps are repeated until k cluster centers have been selected, and the standard k-means algorithm is then run from these k initial cluster centers.
For the Gaussian mixture model, the invention uses the Gaussian mixture implementation in the mixture module of scikit-learn, with the parameters held at their default values. The clustering performance is evaluated mainly with Normalized Mutual Information, Clustering Accuracy and the Adjusted Rand Index; the higher the values of these three indexes, the better the clustering performance of the method.
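As a hedged illustration of this evaluation setup, the sketch below fits scikit-learn's `GaussianMixture` to toy data and scores the result with two of the three indexes. The synthetic data, the choice of `n_components=2` and the fixed `random_state` are assumptions made so the example is reproducible; they are not settings taken from the source.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy stand-in for low-dimensional, denoised expression data with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (60, 3)), rng.normal(4, 0.5, (60, 3))])
truth = np.repeat([0, 1], 60)

# Only the number of components and the seed are set; everything else is default.
gmm = GaussianMixture(n_components=2, random_state=0)
pred = gmm.fit_predict(X)

ari = adjusted_rand_score(truth, pred)            # Adjusted Rand Index
nmi = normalized_mutual_info_score(truth, pred)   # Normalized Mutual Information
```

Clustering Accuracy is not computed here because it additionally requires matching predicted cluster labels to the true classes before counting agreements.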
The invention has the following characteristics:
1. It reduces the influence of the batch effect in single-cell RNA-seq data on the final clustering result;
2. It reduces the influence of the high dimensionality and heavy noise of single-cell RNA-seq data on the clustering result;
3. It learns the data representation effectively during the clustering of single-cell RNA-seq data and has strong representation capability;
4. It visualizes the data well once clustering is complete.
The self-encoder model is an unsupervised deep learning method. It can effectively reduce the dimensionality of the input data, learn the implicit features of the data by adjusting the number of neural network layers and optimizing the training process, and recover the data through a reconstruction process. The noise reduction self-encoder takes corrupted data containing noise as the network input, so that the reconstructed data gain a degree of robustness to noise in the input.
Single-cell RNA-seq data are high-dimensional and also very noisy. The noise usually shows up as sparsity: the data contain a large number of zero values, some of which come from genes that are truly not expressed and others from expressed genes whose values go undetected because of limitations of the sequencing technology.
For the deep clustering of single-cell RNA-seq data, the invention designs a clustering method based on a deep noise reduction self-encoder. The method combines the self-encoder, fast independent component analysis (FastICA), Gaussian mixture clustering and t-distributed stochastic neighbor embedding (t-SNE) to address feature learning, dimension reduction, clustering and data visualization, and it reduces the influence of data noise on the clustering result by reconstructing the data under a zero-inflated negative binomial (ZINB) distribution.

Claims (6)

1. A single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder, characterized by comprising the following steps:
1) batch effect adjustment and data standardization preprocessing:
selecting five public real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases in order to cluster single cells, wherein the gene expression values in the five public data sets, namely GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322, are obtained from cells of various tissues; reading the original single-cell RNA-seq data and performing batch effect adjustment and standardization preprocessing on them;
2) data reconstruction and noise reduction:
inputting the log-normalized single-cell RNA-seq data into a deep noise reduction self-encoder, wherein the deep noise reduction self-encoder reconstructs the data using a zero-inflated negative binomial (ZINB) distribution, and the reconstructed data better preserve the original biological characteristics;
3) data dimension reduction:
the single-cell RNA-seq data reconstructed by the deep noise reduction self-encoder are still high-dimensional, which makes the identification of cell subtypes very difficult; the dimensionality of the sample data is reduced by fast independent component analysis (FastICA) and the redundant parts of the data are eliminated, so that redundancy no longer interferes with the early detection and related treatment of cancer;
4) Gaussian mixture clustering and data visualization:
after the low-dimensional, low-noise single-cell RNA-seq data are obtained, a Gaussian mixture model is used to cluster the cells and determine the cell types, the cell types obtained being the potential cell subtypes discovered; the final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE) and analyzed together with existing cell and cancer databases, which helps doctors detect cancer early.
2. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data in step 1) comprises: first, adjusting the batch effect of the single-cell RNA-seq data with a hierarchical Bayesian model, which also addresses the uncertainty caused by measurement sensitivity; then retaining the cells whose gene expression levels are normal; and then normalizing the data for sequencing depth and gene length with a logarithmic normalization method.
3. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the deep noise reduction self-encoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution, and the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the zero-inflated negative binomial distribution;
the single-cell RNA-seq data to be analyzed is denoted by X, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where W is the weight matrix of the encoding process and b is the bias term; the decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data; the input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network; a zero-inflation factor is added to the negative binomial (NB) distribution model, which can also be understood as adding an impulse function at zero, that is, the single-cell RNA-seq data is modeled with the zero-inflated negative binomial distribution, expressed as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$; if $Y = \sigma_o(W' h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully connected layers are added after it, that is, the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the zero-inflated negative binomial distribution; the loss function of the noise reduction part of the self-encoder is expressed as $L_d = -\log(\mathrm{ZINB}(X \mid \pi, \mu, \theta))$.
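The loss $L_d$ of claim 3 can be evaluated for a single count as in the sketch below. The claim does not spell out the NB parameterization, so the sketch assumes the mean/dispersion form commonly used in ZINB autoencoders, with θ treated as a dispersion parameter; the function names `nb_log_pmf` and `zinb_nll` are hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under NB with mean mu and dispersion theta.

    Assumed parameterization:
    NB(x | mu, theta) = Gamma(x + theta) / (Gamma(theta) * x!)
                        * (theta / (theta + mu))**theta * (mu / (theta + mu))**x
    """
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))


def zinb_nll(x, pi, mu, theta):
    """Negative log-likelihood L_d = -log ZINB(x | pi, mu, theta) for one count."""
    prob = (1.0 - pi) * math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        prob += pi  # point mass delta_0 contributes only at zero
    # Guard against log(0) when pi = 1 and x > 0.
    return -math.log(max(prob, 1e-300))
```

In a network, π, μ and θ would come from the three output layers, and the loss would be summed over all cells and genes; here they are plain numbers so that the formula can be checked by hand.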
4. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein in step 3) the dimensionality of the single-cell RNA-seq data is reduced using fast independent component analysis; independent component analysis decomposes the original data into a linear combination of statistically independent non-Gaussian components, assuming that all components of the data are independent and that all components are equally important;
the reconstructed single-cell RNA-seq data is assumed to obey the model $X = AS$, where S is the unknown source data with independent components and A is an unknown mixing matrix; every independent component in S and every mixing coefficient in A is unknown, and independent component analysis estimates the mixing coefficients and the independent components solely from the observed signal data in X; the method first centers and whitens the original data, and after this preprocessing the sample data is processed with fast independent component analysis: first, a vector w is initialized, where $W = A^{-1}$ is defined and w is a row vector of W; second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where g is a nonlinear scalar function, and let $w = w^{+}/\lVert w^{+}\rVert$; if the process has not converged, this step is repeated; finally, fast independent component analysis estimates several independent components that carry the important information, thereby reducing the dimensionality of the single-cell RNA-seq data.
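The fixed-point update of claim 4 can be sketched for a single component as follows. The claim only states that g is a nonlinear scalar function, so the choice $g(u) = u^3$ (with $g'(u) = 3u^2$) is an assumption of this sketch, as are the function name `fastica_one_unit` and the convergence tolerance. The input is assumed to be already centered and whitened.

```python
import math


def fastica_one_unit(X, w, max_iter=50, tol=1e-10):
    """Estimate one unmixing vector w from whitened samples X.

    X : list of samples, each a list of floats (centered and whitened)
    w : initial guess for the unmixing vector
    """
    n, dim = len(X), len(w)
    norm = math.sqrt(sum(c * c for c in w))
    w = [c / norm for c in w]
    for _ in range(max_iter):
        proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]  # w^T x
        # E{x g(w^T x)} with the assumed nonlinearity g(u) = u^3
        e_xg = [sum(x[d] * u ** 3 for x, u in zip(X, proj)) / n for d in range(dim)]
        # E{g'(w^T x)} with g'(u) = 3 u^2
        e_gp = sum(3.0 * u * u for u in proj) / n
        w_plus = [e_xg[d] - e_gp * w[d] for d in range(dim)]
        norm = math.sqrt(sum(c * c for c in w_plus))
        w_new = [c / norm for c in w_plus]  # w = w+ / ||w+||
        # Converged when the direction stops changing (sign flips are allowed).
        converged = abs(sum(a * b for a, b in zip(w_new, w))) > 1.0 - tol
        w = w_new
        if converged:
            break
    return w
```

Projecting each sample onto the returned w gives one estimated independent component. Extracting several components would additionally require decorrelating the vectors from one another, which this sketch leaves out.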
5. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein in step 4) a Gaussian mixture model is used to cluster the cells and determine the cell types, the specific steps comprising:
first, the model parameters of the Gaussian mixture distribution are initialized, and the parameters are then optimized iteratively with the expectation-maximization algorithm; E-step of the expectation-maximization algorithm: compute the posterior probability $\gamma_{ji}$ that the j-th sample was generated by the i-th Gaussian mixture component:
Figure FDA0003287704210000041
Figure FDA0003287704210000044
M-step of the expectation-maximization algorithm: the remaining model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ are updated iteratively, calculated from the following formulas:
Figure FDA0003287704210000042
Figure FDA0003287704210000043
iteration stops when the maximum number of iterations set for the experiment is reached; otherwise the parameters continue to be updated iteratively; finally, the cluster label $\lambda_j$ of sample $x_j$ is computed as $\lambda_j = \arg\max_i \gamma_{ji}$, and the final clustering result is visualized with the t-distributed stochastic neighbor embedding (t-SNE) method, which displays the clustering result in two-dimensional coordinates.
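The E-step, M-step and label assignment of claim 5 can be sketched in one dimension using the standard textbook EM updates for a Gaussian mixture. These are not taken from the formula images of the patent, which are not reproduced in this text. The function name `fit_gmm_1d`, the unit initial variances, the equal initial weights and the variance floor are assumptions of the sketch; the claimed method works on multi-dimensional data with covariance matrices $\Sigma_i$.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, init_mus, max_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM and return (mus, variances, alphas, labels)."""
    k = len(init_mus)
    mus = list(init_mus)
    variances = [1.0] * k        # assumed initial variances
    alphas = [1.0 / k] * k       # assumed equal initial mixing weights
    gamma = []
    for _ in range(max_iter):    # stop at the maximum number of iterations
        # E-step: gamma[j][i] = posterior of component i for sample j
        gamma = []
        for x in xs:
            weighted = [alphas[i] * gaussian_pdf(x, mus[i], variances[i])
                        for i in range(k)]
            total = sum(weighted)
            gamma.append([w / total for w in weighted])
        # M-step: update mean, variance and mixing weight of each component
        for i in range(k):
            n_i = sum(g[i] for g in gamma)
            mus[i] = sum(g[i] * x for g, x in zip(gamma, xs)) / n_i
            spread = sum(g[i] * (x - mus[i]) ** 2 for g, x in zip(gamma, xs)) / n_i
            variances[i] = max(spread, var_floor)  # avoid a collapsed component
            alphas[i] = n_i / len(xs)
    # Cluster label: lambda_j = argmax_i gamma_ji (from the last E-step)
    labels = [max(range(k), key=lambda i: g[i]) for g in gamma]
    return mus, variances, alphas, labels
```

For well-separated data such as `[0.0, 0.1, -0.1, 5.0, 5.1, 4.9]` with initial means `[1.0, 4.0]`, the first three samples receive one label and the last three the other.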
6. The method according to claim 5, wherein the Gaussian mixture model is initialized using k-means++, which solves the centroid initialization problem as follows: a point is selected at random from the set of input data points as the first cluster center; for each object in the data set, its similarity to the nearest cluster center is calculated; a new data point is selected as a new cluster center according to the principle that a point with a larger similarity value is selected with a larger probability; the above steps are repeated until k cluster centers have been selected, and the k initial cluster centers are then used to run the standard k-means algorithm.
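The seeding procedure of claim 6 can be sketched as below. In the standard k-means++ algorithm the selection weight of a point is its squared distance to the nearest center already chosen, so that points far from every existing center are favored; the sketch assumes that the "similarity" of the claim refers to this distance measure. The function names and the `seed` parameter are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial cluster centers with k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        # Sample the next center with probability proportional to its weight.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

The returned centers would then seed a standard k-means run, whose result in turn initializes the Gaussian mixture model. A point that is already a center has weight zero, so with distinct input points it is not expected to be drawn again.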

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111152923.5A CN113889192B (en) 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111152923.5A CN113889192B (en) 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder

Publications (2)

Publication Number Publication Date
CN113889192A (en) 2022-01-04
CN113889192B (en) 2024-02-27

Family

ID=79008210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111152923.5A Active CN113889192B (en) 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder

Country Status (1)

Country Link
CN (1) CN113889192B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107075543A (en) * 2014-04-21 2017-08-18 哈佛学院院长及董事 System and method for bar coded nucleic acid
US20190071718A1 (en) * 2016-04-15 2019-03-07 Koninklijke Philips N.V. Sub-population detection and quantization of receptor-ligand states for characterizing inter-cellular communication and intratumoral heterogeneity
CN110147648A (en) * 2019-06-20 2019-08-20 浙江大学 Automobile sensor fault detection method based on independent component analysis and sparse denoising self-encoding encoder
CN110890132A (en) * 2019-11-19 2020-03-17 湖南大学 Cancer mutation cluster identification method based on adaptive Gaussian mixture model
CN111428768A (en) * 2020-03-18 2020-07-17 电子科技大学 Hellinger distance-Gaussian mixture model-based clustering method
CN111785329A (en) * 2020-07-24 2020-10-16 中国人民解放军国防科技大学 Single-cell RNA sequencing clustering method based on confrontation automatic encoder
CN112464004A (en) * 2020-11-26 2021-03-09 大连理工大学 Multi-view depth generation image clustering method
CN112735536A (en) * 2020-12-23 2021-04-30 湖南大学 Single cell integrated clustering method based on subspace randomization
CN112967755A (en) * 2021-03-04 2021-06-15 深圳大学 Cell type identification method for single cell RNA sequencing data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
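栾志玲: "Research status and development trends of deep feature selection strategies for DNA genes" (DNA基因深度特征选择策略的研究现状及发展趋势), 佳木斯职业学院学报, no. 05, 15 May 2019 (2019-05-15) *
高美加: "Single-cell RNA-seq data preprocessing algorithm based on loess regression weighting" (基于loess回归加权的单细胞RNA-seq数据预处理算法), 智能计算机与应用, no. 05, 1 May 2020 (2020-05-01) *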

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm
CN115527610A (en) * 2022-11-09 2022-12-27 上海交通大学 Cluster analysis method of unicellular omics data
CN115527610B (en) * 2022-11-09 2023-11-24 上海交通大学 Cluster analysis method for single-cell histology data
CN116665786A (en) * 2023-07-21 2023-08-29 曲阜师范大学 RNA layered embedding clustering method based on graph convolution neural network

Also Published As

Publication number Publication date
CN113889192B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN113889192B (en) Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN108805167B (en) Sparse depth confidence network image classification method based on Laplace function constraint
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
CN111564183B (en) Single cell sequencing data dimension reduction method fusing gene ontology and neural network
CN110826635B (en) Sample clustering and feature identification method based on integration non-negative matrix factorization
Albergante et al. Estimating the effective dimension of large biological datasets using Fisher separability analysis
Yan et al. Unsupervised and semi‐supervised learning: The next frontier in machine learning for plant systems biology
CN112735536A (en) Single cell integrated clustering method based on subspace randomization
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
Bellazzi et al. The Gene Mover's Distance: Single-cell similarity via Optimal Transport
CN116109613A (en) Defect detection method and system based on distribution characterization
Zhang et al. SLRRSC: single-cell type recognition method based on similarity and graph regularization constraints
CN111178427A (en) Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
Wang et al. scDSSC: deep sparse subspace clustering for scRNA-seq data
CN115661498A (en) Self-optimization single cell clustering method
CN117497038A (en) Method for rapidly optimizing culture medium formula based on nuclear method
CN114997303A (en) Bladder cancer metabolic marker screening method and system based on deep learning
CN114783526A (en) Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder
CN112768001A (en) Single cell trajectory inference method based on manifold learning and main curve
Peng et al. A deep learning-based unsupervised learning method for spatially resolved transcriptomic data analysis
CN117727373B (en) Sample and feature double weighting-based intelligent C-means clustering method for feature reduction
Abou El-Naga et al. Consensus Nature Inspired Clustering of Single-Cell RNA-Sequencing Data
Hu et al. WEDGE: recovery of gene expression values for sparse single-cell RNA-seq datasets using matrix decomposition
Murugesan et al. Weighted Fuzzy Score Normalization and Bayesian Independent Principal Component Analysis Imputation for Breast Cancer Gene Expression Analysis.
CN113177604B (en) High-dimensional data feature selection method based on improved L1 regularization and clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant