CN113889192A

CN113889192A - Single cell RNA-seq data clustering method based on deep noise reduction self-encoder

Info

Publication number: CN113889192A
Application number: CN202111152923.5A
Authority: CN
Inventors: 王艺杰; 王文庆; 杨东; 胥冠军; 崔逸群; 毕玉冰; 刘超飞; 董夏昕; 刘迪; 肖力炀; 刘骁
Original assignee: Xian Thermal Power Research Institute Co Ltd
Current assignee: Xian Thermal Power Research Institute Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2022-01-04
Anticipated expiration: 2041-09-29
Also published as: CN113889192B

Abstract

The invention discloses a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder. First, the batch effect of the single-cell RNA-seq data is adjusted and the data are standardized to reduce the adverse effect of technical noise. Second, a deep noise reduction self-encoder based on the zero-inflated negative binomial (ZINB) distribution is used to mine the feature information of the single-cell RNA-seq data. Third, fast independent component analysis (FastICA) is used to reduce the dimension of the data, which improves the computational efficiency of the model. Finally, the cells are clustered more accurately with a Gaussian mixture model fitted by expectation maximization, and the final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE). The invention effectively reduces the interference of the high dimensionality and high noise of single-cell RNA-seq data with clustering, accurately learns the gene expression information needed to cluster cells, and supports gene network construction, cell type discovery, and the early discovery and treatment of cancer.

Description

Single cell RNA-seq data clustering method based on deep noise reduction self-encoder

Technical Field

The invention belongs to the technical field of single-cell RNA-seq data analysis in bioinformatics, and particularly relates to a single-cell RNA-seq (ribonucleic acid sequencing) data clustering method based on a deep noise reduction self-encoder.

Background

With the rapid development of sequencing technologies, researchers have acquired a large amount of single-cell RNA-seq data. Unsupervised clustering plays an important role in analyzing these data: clustering methods for single-cell RNA-seq data can both identify unknown cell types and reveal the heterogeneity of cells. Through research on such clustering methods, researchers can identify cell states more accurately, build network structures between cells, and gain a deeper understanding of processes such as the differentiation of cancer cells, laying a foundation for the early discovery and treatment of cancer. At present, traditional clustering methods such as hierarchical clustering, spectral clustering and density-based spatial clustering of applications with noise (DBSCAN) are widely used, but single-cell RNA-seq data have unique characteristics, so these traditional methods cannot cluster the data effectively.

Disclosure of Invention

To overcome the above technical problems, the invention provides a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder. The method combines a self-encoder with fast independent component analysis (FastICA) to perform feature learning and dimension reduction during clustering, uses Gaussian mixture clustering to cluster the single-cell RNA-seq data, and reduces the influence of data noise on the clustering result by reconstructing the data with a zero-inflated negative binomial (ZINB) distribution.

In order to achieve the purpose, the invention adopts the technical scheme that:

A single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder comprises the following steps:

1) adjusting batch effect and data standardization preprocessing:

Five public real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases are selected for single-cell clustering, so that cell subtypes can be discovered and assistance provided for the early discovery and targeted treatment of related cancers. The gene expression values in the five public data sets, GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322, are obtained from cells of various tissues. The original single-cell RNA-seq data are read and subjected to batch effect adjustment and standardization preprocessing, which avoids systematic technical deviations that are unrelated to biological state and are introduced when samples are processed and measured in different batches;

2) data reconstruction and noise reduction:

Single-cell RNA-seq data contain a large number of zero values. A zero value may indicate that a gene is truly not expressed in a cell, but it may also result from technical error, and this noise can greatly interfere with the discovery of cell subtypes. The log-normalized single-cell RNA-seq data are input into a deep noise reduction self-encoder, which reconstructs the data using a zero-inflated negative binomial (ZINB) distribution, so that the reconstructed data better preserve the original biological characteristics;

3) data dimension reduction:

The single-cell RNA-seq data reconstructed by the deep noise reduction self-encoder are still high-dimensional, which makes the identification of cell subtypes very difficult. Fast independent component analysis (FastICA) is used to reduce the dimension of the sample data and eliminate redundant parts of the data, so that they do not interfere with the early discovery and related treatment of cancer;

4) Gaussian mixture clustering and data visualization:

After the low-dimensional, low-noise single-cell RNA-seq data are obtained, a Gaussian mixture model is used to cluster the cells and determine the cell types; the resulting cell types are the potential cell subtypes that have been found. The final clustering result is visualized using t-distributed stochastic neighbor embedding (t-SNE) and analyzed in combination with existing cell and cancer databases, which helps doctors carry out early cancer discovery.

In step 1), the batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data comprise the following: first, a hierarchical Bayesian model is used to adjust the batch effect of the single-cell RNA-seq data and to address the uncertainty caused by measurement sensitivity; next, cells with normal gene expression levels are selected; finally, the data are normalized for sequencing depth and gene length using logarithmic normalization.
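The logarithmic normalization in step 1) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `log_normalize`, the `min_counts` threshold and the `target_sum` scale factor are hypothetical and do not come from the text; only sequencing depth is corrected, and the hierarchical Bayesian batch adjustment and the gene-length correction are not shown.

```python
import numpy as np

def log_normalize(counts, min_counts=1, target_sum=1e4):
    """Filter low-count cells, scale each cell to a common depth, then log-transform.

    counts: array of shape (cells, genes) with raw read counts.
    min_counts and target_sum are illustrative defaults, not values from the text.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1)                      # total counts per cell (sequencing depth)
    keep = depth >= min_counts                      # drop cells with too few counts
    scaled = counts[keep] / depth[keep, None] * target_sum
    return np.log1p(scaled), keep                   # log(1 + x) keeps zeros at zero

raw = np.array([[10, 0, 30], [0, 0, 0], [1, 1, 2]])
normalized, keep = log_normalize(raw)
```

The second cell has no counts and is filtered out; the remaining cells are on a comparable scale before being passed to the self-encoder.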

The deep noise reduction self-encoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution; the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the ZINB distribution;

The single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term. The decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data. The input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network. A zero-inflation factor is added to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero; that is, the single-cell RNA-seq data are modeled with a zero-inflated negative binomial (ZINB) distribution, formulated as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$. If $Y = \sigma_o(W'h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully-connected layers are added after it; that is, the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the ZINB distribution. The loss function of the noise-reduction part of the self-encoder is expressed as $L_d = -\log(\mathrm{ZINB}(X \mid \pi, \mu, \theta))$.
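The loss $L_d$ can be evaluated for a single count as in the sketch below. It assumes the NB term is parameterized by its mean and a dispersion parameter, a common choice that the text does not spell out; the function names `nb_log_pmf` and `zinb_nll` are hypothetical.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log of NB(x | mu, theta), assuming a mean (mu) and dispersion (theta) parameterization."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, pi, mu, theta):
    """L_d = -log ZINB(x | pi, mu, theta) for a single non-negative integer count x."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    point_mass = pi if x == 0 else 0.0      # delta_0(x): extra probability mass only at zero
    return -math.log(point_mass + (1.0 - pi) * nb)

loss_zero = zinb_nll(0, pi=0.3, mu=2.0, theta=1.5)
loss_five = zinb_nll(5, pi=0.3, mu=2.0, theta=1.5)
```

Because the point mass only contributes at zero, a larger zero-inflation factor lowers the loss of observed zeros and raises the loss of every non-zero count.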

In step 3), the dimension of the single-cell RNA-seq data is reduced using fast independent component analysis (FastICA). Independent component analysis assumes that the components of the data are mutually independent and treats all components as equally important; it decomposes the original data into a linear combination of statistically independent non-Gaussian components;

Assume that the reconstructed single-cell RNA-seq data obey the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is an unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown. Independent component analysis estimates the mixing coefficients and the independent components only from the observed signals in $X$. The original data are first centered and whitened; after this preprocessing, the sample data are processed with FastICA. First, a vector $w$ is initialized, where $W = A^{-1}$ and $w$ is a row vector of $W$. Second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function, and let $w = w^{+}/\lVert w^{+}\rVert$. If the process has not converged, this step is repeated. Finally, FastICA estimates several independent components containing important information, which achieves the purpose of reducing the dimension of the single-cell RNA-seq data.
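The centering, whitening and fixed-point update described above can be sketched in NumPy as follows. Several details are assumptions made for illustration and are not taken from the text: $g = \tanh$ as the nonlinearity, deflation (estimating one row of $W$ at a time and orthogonalizing against earlier rows), the convergence tolerance, and the function names `whiten` and `fast_ica`.

```python
import numpy as np

def whiten(X):
    """Center the rows of X (features x samples) and whiten so the covariance is the identity."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc))
    return (eigvec / np.sqrt(eigval)) @ eigvec.T @ Xc

def fast_ica(X, n_components, max_iter=200, tol=1e-6, seed=0):
    """Estimate n_components independent components with g = tanh (an assumed choice)."""
    rng = np.random.default_rng(seed)
    Z = whiten(X)
    W = np.zeros((n_components, Z.shape[0]))
    for k in range(n_components):
        w = rng.standard_normal(Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ Z
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            # w+ = E{X g(w^T X)} - E{g'(w^T X)} w
            w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
            # deflation: remove the part explained by rows found earlier
            w_new -= W[:k].T @ (W[:k] @ w_new)
            # w = w+ / ||w+||
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[k] = w
    return W @ Z

rng = np.random.default_rng(1)
sources = np.vstack([rng.uniform(-1, 1, 2000), rng.laplace(size=2000)])
mixed = np.array([[1.0, 0.5], [0.3, 1.0]]) @ sources   # X = A S with a toy mixing matrix
recovered = fast_ica(mixed, n_components=2)
```

For dimension reduction, `n_components` would be set below the number of input features so that only the leading estimated components are kept.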

In step 4), a Gaussian mixture model is used to cluster the cells and determine the cell types; the specific steps are as follows:

First, the model parameters of the Gaussian mixture distribution are initialized, and the parameters are then optimized iteratively with the expectation-maximization (EM) algorithm. E-step of the EM algorithm: calculate the posterior probability $\gamma_{ji}$ that the $j$-th sample was generated by the $i$-th Gaussian mixture component. M-step of the EM algorithm: update the remaining model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ from the posterior probabilities.

Iteration stops when the maximum number of iterations is reached in the experiment; otherwise, the parameters continue to be updated iteratively. Finally, the cluster label $\lambda_j$ of sample $x_j$ is computed as $\lambda_j = \arg\max_i \gamma_{ji}$, and the final clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE) and displayed in two-dimensional coordinates.
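The E-step and M-step formulas themselves are not reproduced in this text. For reference, the standard EM updates for a Gaussian mixture with $k$ components and $m$ samples, written in the notation used here, are shown below; that the original formulas match these term by term is an assumption.

```latex
\begin{align}
\gamma_{ji} &= \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)}
  && \text{(E-step)} \\
\mu_i &= \frac{\sum_{j=1}^{m} \gamma_{ji} \, x_j}{\sum_{j=1}^{m} \gamma_{ji}},
\qquad
\Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, (x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{m} \gamma_{ji}},
\qquad
\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji}
  && \text{(M-step)}
\end{align}
```

Here $p(x \mid \mu_i, \Sigma_i)$ is the multivariate Gaussian density and $\alpha_i$ is the mixing coefficient of component $i$.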

When initializing the Gaussian mixture model, k-means++ is used to solve the centroid initialization problem, as follows: a point is randomly selected from the input data set as the first cluster center; for each object in the data set, its distance to the nearest already-selected cluster center is calculated; a new data point is then selected as a new cluster center, such that points farther from the existing centers are selected with larger probability; these steps are repeated until k cluster centers have been selected, and the k initial cluster centers are then used to run the standard k-means algorithm.
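The seeding procedure can be sketched as below. Sampling each new center with probability proportional to the squared distance to the nearest chosen center is the usual k-means++ rule and is assumed here, since the text only states that the probability grows; the function name `kmeans_pp_init` is hypothetical, and the subsequent k-means run is not shown.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers; points far from the chosen centers are more likely to be picked."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                          # first center: uniform random point
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)           # squared distance to nearest chosen center
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample proportionally to d2
    return np.array(centers)

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                   [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]])
centers = kmeans_pp_init(points, k=3)
```

A point that is already a center has zero distance to itself and therefore cannot be drawn again, so the k centers are distinct whenever the data points are.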

The invention has the beneficial effects that:

The invention combines a self-encoder with fast independent component analysis (FastICA) to learn a representation of the single-cell RNA-seq data and reduce their dimension, reconstructs the data with a zero-inflated negative binomial (ZINB) distribution to reduce the influence of data noise on the clustering result, clusters the low-dimensional single-cell RNA-seq data with Gaussian mixture clustering, and finally visualizes the clustering result with t-distributed stochastic neighbor embedding (t-SNE). The method can identify cell subtypes and can support the early discovery, diagnosis and treatment of cancer.

Drawings

FIG. 1 is a general flow diagram of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples.

As shown in FIG. 1, the invention improves the clustering of single-cell RNA-seq data with a deep noise reduction self-encoder in four major steps: batch effect adjustment and data standardization preprocessing, data reconstruction and noise reduction, data dimension reduction, and Gaussian mixture clustering with data visualization.

The invention provides a single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder, which comprises the following steps:

Step one: batch effect adjustment and data standardization preprocessing. The invention selects five public data sets downloaded from the ArrayExpress and GEO databases to verify its effectiveness; the gene expression values in the five public data sets, GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322, are obtained from cells of various tissues. With these data as the initial raw input, the invention uses a hierarchical Bayesian model to adjust the batch effect of the single-cell RNA-seq data and, at the same time, to address the uncertainty caused by measurement sensitivity. The Python-based single-cell gene expression analysis package SCANPY is used to screen and filter the original single-cell RNA-seq data and remove data of poor sequencing quality; the data are then standardized to facilitate subsequent network learning.

Step two: data reconstruction and noise reduction. The analysis of single-cell RNA-seq data is hampered by problems such as data amplification and data loss. The invention uses a noise reduction self-encoder to map the input single-cell RNA-seq data to an embedding space. In the experiments, random Gaussian noise is added to the data and fully-connected layers are used to construct the self-encoder. To better capture single-cell RNA-seq data, the invention adds three independent fully-connected layers after the last hidden layer of the decoder; the three outputs respectively learn the impulse-function adjustment factor (zero-inflation factor) of the zero-inflated negative binomial (ZINB) distribution, and the mean and the dispersion of the negative binomial distribution. The loss function of the noise-reduction part of the self-encoder is then defined as the negative logarithm of the ZINB likelihood. The single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, where $W$ is the weight matrix of the encoding process and $b$ is the bias term; the decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data. The input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network. Recent research on single-cell RNA-seq data shows that such data are best approximated by a negative binomial (NB) distribution, written $\mathrm{NB}(X \mid \mu, \theta)$.
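The NB formula itself is not reproduced in this text. A mean-dispersion parameterization that is commonly used for count data, and that is consistent with the arguments of $\mathrm{NB}(X \mid \mu, \theta)$, is given below as an assumption rather than as the original formula.

```latex
\mathrm{NB}(X \mid \mu, \theta)
  = \frac{\Gamma(X + \theta)}{X!\,\Gamma(\theta)}
    \left(\frac{\theta}{\theta + \mu}\right)^{\theta}
    \left(\frac{\mu}{\theta + \mu}\right)^{X},
\qquad
\mathbb{E}[X] = \mu,
\quad
\mathrm{Var}[X] = \mu + \frac{\mu^{2}}{\theta}
```

Under this form the variance always exceeds the mean and grows with it, and the distribution approaches a Poisson distribution as $\theta \to \infty$.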

Because single-cell RNA-seq data are usually highly dispersed, the variance tends to be larger than the mean, so the data are not well approximated by a Poisson distribution; moreover, the variance of single-cell RNA-seq data typically changes as the mean changes. In addition, single-cell RNA-seq data contain a particularly large number of zero values. The zero values in gene expression data may come from genes that are not expressed in the biological process (true zeros) or from losses caused by technical factors during sequencing (dropout zeros). To better capture single-cell RNA-seq data, the invention improves the traditional noise reduction self-encoder by adding a zero-inflation factor to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero; that is, the single-cell RNA-seq data are modeled with a zero-inflated negative binomial (ZINB) distribution, formulated as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$. If $Y = \sigma_o(W'h(X) + b')$ denotes the last hidden layer of the decoder, the invention adds three independent fully-connected layers after it; that is, the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the ZINB distribution. The invention expresses the loss function of the noise-reduction part of the self-encoder as $L_d = -\log(\mathrm{ZINB}(X \mid \pi, \mu, \theta))$.
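The three independent fully-connected output layers can be illustrated with a forward pass in NumPy. The output activations (sigmoid for $\pi$, exponential for $\mu$, softplus for $\theta$) are assumptions chosen to keep each parameter in its valid range; the text does not specify them. The weights are random placeholders, and the name `zinb_heads` and the layer sizes are hypothetical.

```python
import numpy as np

def zinb_heads(Y, params):
    """Map the last decoder layer Y (cells x hidden) to pi, mu and theta (cells x genes)."""
    pi = 1.0 / (1.0 + np.exp(-(Y @ params["W_pi"] + params["b_pi"])))    # sigmoid: pi in (0, 1)
    mu = np.exp(Y @ params["W_mu"] + params["b_mu"])                      # exp: positive mean
    theta = np.log1p(np.exp(Y @ params["W_theta"] + params["b_theta"]))   # softplus: positive
    return pi, mu, theta

rng = np.random.default_rng(0)
hidden, genes = 8, 5
params = {}
for name in ("pi", "mu", "theta"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, genes))    # placeholder weights
    params[f"b_{name}"] = np.zeros(genes)

Y = rng.normal(size=(4, hidden))          # stand-in for the last hidden layer of the decoder
pi, mu, theta = zinb_heads(Y, params)
```

Each head produces one value per cell and gene, so the three outputs can be plugged directly into the ZINB loss element by element.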

Step three: data dimension reduction. During dimension reduction, the high-dimensional single-cell RNA-seq data are first centered and whitened, and a separation matrix is initialized and computed on the preprocessed data. The separation matrix is then optimized iteratively while checking for convergence: if it has converged, the final low-dimensional single-cell RNA-seq data are obtained; otherwise, optimization continues.

Assume that the reconstructed single-cell RNA-seq data obey the model $X = AS$, where $S$ is the unknown source data with independent components and $A$ is an unknown mixing matrix; every independent component in $S$ and every mixing coefficient in $A$ is unknown. Independent component analysis estimates the mixing coefficients and the independent components only from the observed signals in $X$. The original data are first centered and whitened; after this preprocessing, the invention processes the sample data with fast independent component analysis (FastICA). First, a vector $w$ is initialized, where $W = A^{-1}$ and $w$ is a row vector of $W$. Second, let $w^{+} = E\{X g(w^{T}X)\} - E\{g'(w^{T}X)\}\,w$, where $g$ is a nonlinear scalar function, and let $w = w^{+}/\lVert w^{+}\rVert$. If the process has not converged, this step is repeated. Finally, FastICA estimates several independent components containing important information, which achieves the purpose of reducing the dimension of the single-cell RNA-seq data.

Step four: Gaussian mixture clustering and data visualization. When initializing the Gaussian mixture model, k-means++ is used to solve the centroid initialization problem. First, the model parameters of the Gaussian mixture distribution are initialized, and the parameters are then optimized iteratively with the expectation-maximization (EM) algorithm. In the E-step of the EM algorithm, the posterior probability $\gamma_{ji}$ that the $j$-th sample was generated by the $i$-th Gaussian mixture component is calculated.

Iteration stops when the maximum number of iterations is reached in the experiment; otherwise, the parameters continue to be updated iteratively. Finally, the cluster label $\lambda_j$ of sample $x_j$ is computed as $\lambda_j = \arg\max_i \gamma_{ji}$, and the clustering result is visualized with t-distributed stochastic neighbor embedding (t-SNE).
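The clustering loop can be sketched in NumPy as below. This is an illustrative implementation of standard EM for a Gaussian mixture, not the code used in the experiments. The caller-supplied initial means (standing in for the k-means++ result), the identity initial covariances, the small regularization term `reg` and the function names are all assumptions; the t-SNE visualization is not shown.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T
    expo = -0.5 * (diff * solved).sum(axis=1)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def fit_gmm(X, init_means, max_iter=100, reg=1e-6):
    """EM for a Gaussian mixture; stops after max_iter iterations."""
    m, d = X.shape
    k = len(init_means)
    mu = np.array(init_means, dtype=float)
    sigma = np.array([np.eye(d)] * k)            # assumed initial covariances
    alpha = np.full(k, 1.0 / k)                  # equal initial mixing coefficients
    for _ in range(max_iter):
        # E-step: gamma[j, i] is the posterior of component i for sample j
        dens = np.stack([alpha[i] * gaussian_pdf(X, mu[i], sigma[i]) for i in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_i, Sigma_i and alpha_i from gamma
        weight = gamma.sum(axis=0)
        mu = (gamma.T @ X) / weight[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (gamma[:, i, None] * diff).T @ diff / weight[i] + reg * np.eye(d)
        alpha = weight / m
    return gamma.argmax(axis=1), gamma           # lambda_j = argmax_i gamma_ji

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
labels, gamma = fit_gmm(X, init_means=X[[0, 50]])
```

On this toy data the two well-separated groups receive different labels, and each row of `gamma` holds the posterior probabilities of one sample over the components.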

For the Gaussian mixture model, the invention uses the Gaussian mixture implementation in the mixture module of scikit-learn, with the parameters kept at their default values. The clustering performance evaluation indexes used by the invention are mainly Normalized Mutual Information, Clustering Accuracy and Adjusted Rand Index; the higher the values of these three indexes, the better the clustering performance of the method.
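The three evaluation indexes can be computed as in the following standard-library sketch. It is meant only to make the definitions concrete: the brute-force search over label mappings in `clustering_accuracy` is practical only for a small number of clusters, the arithmetic-mean normalization in `normalized_mutual_info` is an assumed convention, and degenerate inputs (for example a single cluster) are not handled.

```python
from collections import Counter
from itertools import permutations
from math import comb, log

def clustering_accuracy(truth, pred):
    """Best accuracy over all one-to-one mappings of predicted labels to true labels."""
    true_ids, pred_ids = sorted(set(truth)), sorted(set(pred))
    size = max(len(true_ids), len(pred_ids))
    true_ids += [None] * (size - len(true_ids))       # pad so a one-to-one mapping exists
    pred_ids += [None] * (size - len(pred_ids))
    best = 0
    for perm in permutations(true_ids):
        mapping = dict(zip(pred_ids, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(truth, pred)))
    return best / len(truth)

def adjusted_rand_index(truth, pred):
    """Pair-counting agreement between two partitions, corrected for chance."""
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(len(truth), 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

def normalized_mutual_info(truth, pred):
    """Mutual information divided by the mean of the two label entropies (assumed convention)."""
    n = len(truth)
    joint, pt, pp = Counter(zip(truth, pred)), Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in pt.values())
    h_pred = -sum(c / n * log(c / n) for c in pp.values())
    return mi / ((h_true + h_pred) / 2)

truth = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]          # same partition with permuted label names
acc = clustering_accuracy(truth, pred)
ari = adjusted_rand_index(truth, pred)
nmi = normalized_mutual_info(truth, pred)
```

All three indexes ignore the names of the cluster labels, so a prediction that matches the true partition up to relabeling scores as a perfect clustering.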

The invention has the following characteristics:

1. The influence of the batch effect of single-cell RNA-seq data on the final clustering result is reduced;

2. The influence of the high dimensionality and high noise of single-cell RNA-seq data on the clustering result is reduced;

3. During clustering of single-cell RNA-seq data, the data representation is learned effectively, giving the method strong representation capability;

4. After clustering is finished, the data can be visualized well.

The self-encoder is an unsupervised deep learning model. It can not only reduce the dimension of the input data effectively, but also learn the latent features of the analyzed data, for example by adjusting the number of neural network layers and optimizing the training process, and it recovers the data through a reconstruction process. The noise reduction self-encoder takes corrupted data containing noise as the network input, so that the reconstructed data gain a degree of robustness to noise in the input.

Single-cell RNA-seq data are high-dimensional and also very noisy. The noise usually manifests as sparsity: the data contain a large number of zero values, some of which come from genes that are truly not expressed, while others arise because expressed gene values go undetected owing to limitations of the sequencing technology.

For deep clustering of single-cell RNA-seq data, the invention designs a clustering method based on a deep noise reduction self-encoder. It combines the self-encoder, fast independent component analysis (FastICA), Gaussian mixture clustering and t-distributed stochastic neighbor embedding (t-SNE) to address feature learning, dimension reduction, clustering and data visualization in the clustering process, and reduces the influence of data noise on the clustering result by reconstructing the data with a zero-inflated negative binomial (ZINB) distribution.

Claims

1. A single-cell RNA-seq data clustering method based on a deep noise reduction self-encoder, characterized by comprising the following steps:

1) adjusting batch effect and data standardization preprocessing:

selecting five public real single-cell RNA-seq data sets downloaded from the ArrayExpress and GEO databases to cluster single cells, wherein the gene expression values in the five public data sets, including GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322, are obtained from cells of various tissues; reading the original single-cell RNA-seq data and performing batch effect adjustment and standardization preprocessing on them;

2) data reconstruction and noise reduction:

inputting the log-normalized single-cell RNA-seq data into a deep noise reduction self-encoder, wherein the deep noise reduction self-encoder reconstructs the data using a zero-inflated negative binomial (ZINB) distribution, so that the reconstructed data better preserve the original biological characteristics;

3) data dimension reduction:

4) Gaussian mixture clustering and data visualization:

2. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the batch effect adjustment and standardization preprocessing of the single-cell RNA-seq data in step 1) comprise: first, using a hierarchical Bayesian model to adjust the batch effect of the single-cell RNA-seq data and to address the uncertainty caused by measurement sensitivity; next, selecting cells with normal gene expression levels; finally, normalizing the data for sequencing depth and gene length using logarithmic normalization.

3. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein the deep noise reduction self-encoder used in step 2) reconstructs the single-cell RNA-seq data through a zero-inflated negative binomial (ZINB) distribution, and the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the ZINB distribution;

the single-cell RNA-seq data to be analyzed are denoted by $X$, and the encoding stage of the self-encoder is expressed as $h(X) = \sigma_h(WX + b)$, wherein $W$ is the weight matrix of the encoding process and $b$ is the bias term; the decoding stage of the self-encoder mirrors the encoding stage and reconstructs the encoded data; the input dimension of the self-encoder equals the dimension of the single-cell RNA-seq data used for training, and the encoder and the decoder each have a five-layer network; a zero-inflation factor is added to the negative binomial (NB) model, which can also be understood as adding an impulse function at zero, that is, the single-cell RNA-seq data are modeled with a zero-inflated negative binomial (ZINB) distribution, formulated as $\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1 - \pi)\,\mathrm{NB}(X \mid \mu, \theta)$; if $Y = \sigma_o(W'h(X) + b')$ denotes the last hidden layer of the decoder, three independent fully-connected layers are added after it, that is, the whole self-encoder has three outputs, which respectively learn the zero-inflation factor, the mean and the variance of the ZINB distribution; and the loss function of the noise-reduction part of the self-encoder is expressed as $L_d = -\log(\mathrm{ZINB}(X \mid \pi, \mu, \theta))$.

4. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein in step 3) the dimension of the single-cell RNA-seq data is reduced using fast independent component analysis (FastICA); independent component analysis assumes that the components of the data are mutually independent and treats all components as equally important, and decomposes the original data into a linear combination of statistically independent non-Gaussian components.

5. The method for clustering single-cell RNA-seq data based on the deep noise reduction self-encoder as claimed in claim 1, wherein step 4) uses a Gaussian mixture model to cluster the cells and determine the cell types, and the specific steps include:

first, initializing the model parameters of the Gaussian mixture distribution, and then iteratively optimizing the parameters with the expectation-maximization (EM) algorithm; E-step of the EM algorithm: calculating the posterior probability $\gamma_{ji}$ that the $j$-th sample was generated by the $i$-th Gaussian mixture component; M-step of the EM algorithm: updating the remaining model parameters $\mu_i$, $\Sigma_i$ and $\alpha_i$ from the posterior probabilities;

stopping the iteration when the maximum number of iterations is reached in the experiment, and otherwise continuing to update the parameters iteratively; finally, computing the cluster label $\lambda_j$ of sample $x_j$ as $\lambda_j = \arg\max_i \gamma_{ji}$, and visualizing the final clustering result with t-distributed stochastic neighbor embedding (t-SNE) so that it is displayed in two-dimensional coordinates.

6. The method according to claim 5, wherein, when initializing the Gaussian mixture model, k-means++ is used to solve the centroid initialization problem as follows: a point is randomly selected from the input data set as the first cluster center; for each object in the data set, its distance to the nearest already-selected cluster center is calculated; a new data point is then selected as a new cluster center, such that points farther from the existing centers are selected with larger probability; these steps are repeated until k cluster centers have been selected, and the k initial cluster centers are used to run the standard k-means algorithm.