CN114067915A - scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder - Google Patents

scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Info

Publication number
CN114067915A
CN114067915A
Authority
CN
China
Prior art keywords
data
scrna
model
encoder
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111388132.2A
Other languages
Chinese (zh)
Inventor
王树林 (Wang Shulin)
任亚琪 (Ren Yaqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202111388132.2A priority Critical patent/CN114067915A/en
Publication of CN114067915A publication Critical patent/CN114067915A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/30: Unsupervised data analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing data, and more specifically to a method that performs dimensionality reduction and clustering of single-cell RNA sequencing data with deep learning in order to identify cell populations effectively. The method comprises: collecting and preprocessing scRNA-seq data; constructing a deep adversarial variational autoencoder model; reducing the dimensionality of the preprocessed data with the constructed model; combining the deep adversarial variational autoencoder with the Bhattacharyya distance; and performing cluster analysis on the dimensionality-reduced result. The model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments on three real scRNA-seq datasets, evaluated with the normalized mutual information index, show that the method performs well.

Description

scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder
Technical Field
The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing data, and more specifically to a method for effectively identifying cell populations by performing dimensionality reduction and clustering on single-cell RNA sequencing data.
Background
Single-cell RNA sequencing (scRNA-seq) should not be understood literally as sequencing only one cell; rather, it is a sequencing technology that profiles the genome or transcriptome at the level of individual cells. Conventional sequencing methods are typically performed at the multicellular level and therefore lose heterogeneity information; more precisely, the information they report is the average across many cells.
With the development of scRNA-seq technology, massive scRNA-seq data have been generated, providing powerful data support for distinguishing the various cell populations in biological tissues and comprehensively revealing heterogeneity among cells. Through dimensionality reduction and clustering of scRNA-seq data, cell populations can be identified effectively, providing an important basis for research on embryonic development, the working mechanisms of the immune system, the genesis of tumor cells, and other topics. However, owing to the limitations of sequencing technology and the high complexity of gene expression, scRNA-seq data are generally noisy, high-dimensional, and strongly sparse, and valuable biological information is difficult to mine by hand. A crucial step in the analysis of scRNA-seq data is grouping cells of the same type based on gene expression, which is essentially a cell-clustering problem. Existing studies therefore reduce the dimensionality of scRNA-seq data before clustering. However, the high noise and varied shapes of scRNA-seq data render many dimensionality reduction methods inapplicable, and clustering data in which the number of samples (cells) is far smaller than the number of features (genes) is also a huge challenge. How to process the data effectively, exploit heterogeneity among cells, and distinguish different cell populations has therefore become a hot spot of current research.
Current bioinformatics methods for reducing the dimensionality of scRNA-seq data fall into three categories. The first category comprises linear methods, with two main representatives. Principal Component Analysis (PCA), one of the most commonly used dimensionality reduction methods, transforms observations into a latent space by defining linear combinations of the raw data points with successively maximal variance (the principal components). The other linear approach is the factor-analysis method ZIFA, which is similar to PCA but aims mainly to model correlation rather than covariance by describing the variability among the relevant variables. Linear methods are fast, simple to implement, and widely applied; however, scRNA-seq data are nonlinear in nature, so such methods are not suitable for all datasets. The second category comprises nonlinear methods, which are more flexible, produce visually appealing embeddings, and are easier to interpret by visual inspection. However, such methods often require users to set parameters manually and suffer from loss of important information and insufficient feature extraction. The first two categories are traditional methods; the third category comprises deep learning-based methods, whose extracted features are highly representative but poorly interpretable, and some of which use deep models with relatively long running times.
With the development of scRNA-seq technology, the requirements on methods for processing scRNA-seq data keep rising, and more accurate and efficient analysis of single-cell RNA sequencing data remains a hot research problem. In summary, existing methods for reducing the dimensionality of scRNA-seq data suffer from loss of important information, insufficient feature extraction, and similar problems, which seriously affect subsequent data analysis. How to design a more reasonable and effective dimensionality reduction method is therefore a key problem in urgent need of study.
Disclosure of Invention
In view of the problems with existing methods and the complexity of scRNA-seq data, the invention provides a scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder. The method effectively alleviates the loss of important information and the insufficient feature extraction of existing dimensionality reduction methods and achieves better clustering accuracy. The steps of the method are as follows:
1. Data preprocessing
Owing to the limitations of sequencing technology and the high complexity of gene expression, scRNA-seq data are generally noisy, high-dimensional, and strongly sparse. To eliminate the influence of the large technical noise and the differences in gene expression level generated during sequencing on subsequent processing, while keeping as much information from the original data as possible, data preprocessing is performed with a variance-filtering method.
We collected three real scRNA-seq datasets from different species, of different types and with different cell numbers, and then preprocessed the collected data with variance filtering to obtain the 720 genes with the highest variance across cells. We assume that the genes with the highest variance relative to their average expression reflect biological effects rather than technical noise. The normalized count matrix C is transformed using log2(1 + C). A code sketch of this preprocessing step is given at the end of this step.
Specifically, we perform data preprocessing operations on the following three data sets.
(1) The Cortex dataset, with ground-truth labels for 7 different cell types, consisting of 3005 cells from the somatosensory cortex and hippocampus of the mouse brain;
(2) the Macosko-44k dataset, which classifies cells of the mouse retina;
(3) the Zheng-73k dataset, consisting of fluorescence-activated sorted cells from healthy humans, mainly comprising T cells, NK cells, and B cells.
The input data are then reconstructed with a zero-inflated negative binomial (ZINB) distribution, which provides a better fit for scRNA-seq data and yields denoised data.
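The following is a minimal sketch of the variance-filtering and log-transform step, assuming the normalized count matrix is held in a NumPy array of shape (cells, genes); the function name and the exact ordering of filtering and transformation are illustrative assumptions, not prescribed by the patent:

```python
import numpy as np

def preprocess_counts(C, n_top_genes=720):
    """Variance-filter a normalized count matrix C (cells x genes),
    keeping the n_top_genes highest-variance genes, then log-transform."""
    # Rank genes by their variance across cells, highest first.
    gene_var = C.var(axis=0)
    top = np.argsort(gene_var)[::-1][:n_top_genes]
    # Transform the normalized counts as log2(1 + C).
    return np.log2(1.0 + C[:, top])
```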
2. Constructing the deep adversarial variational autoencoder model
The deep adversarial variational autoencoder is an autoencoder framework composed of a deep encoder, an intermediate hidden layer, and a deep decoder; it combines the structures of an adversarial autoencoder and a variational autoencoder.
(1) Generative adversarial network (GAN)
The idea of the generative adversarial network (GAN) framework is to set up a minimax adversarial game between two neural networks, a generative model G and a discriminative model D. The discriminator D(x) is a neural network that computes the probability that a point x in data space is a sample from the data distribution we are trying to model (a positive sample) rather than a sample from the generative model (a negative sample). At the same time, the generative model gradually learns to map samples z from the prior distribution p(z) into data space through the function G(z). G(z) is trained to confuse the discriminator as much as possible into believing that its generated samples come from the data distribution. The generative model is trained by using the gradient of D(x) with respect to x to modify its parameters, so that its generated samples completely confuse the discriminative model. This scheme can be formalized as a minimax objective of the following form:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \quad (1)$$
where $p_{data}$ is the data distribution and $p(z)$ is the model distribution.
Both the generator G and the discriminator D can be modeled as fully connected neural networks and then trained by backpropagation with a suitable optimizer. In the experiments of the invention, adaptive moment estimation (Adam) is used in place of conventional stochastic gradient descent; Adam is an extension of stochastic gradient descent that is better suited to sparse data. A minimal training-step sketch follows.
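As a concrete illustration, one training step for objective (1) with Adam might look like the sketch below. This assumes a modern TensorFlow/Keras API (the experiments later in this document use TensorFlow 1.4, whose session-style code differs), and the layer sizes and learning rates are illustrative, not the patent's:

```python
import tensorflow as tf

latent_dim, data_dim = 10, 720  # illustrative sizes

G = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(data_dim)])   # generator G(z)
D = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(1)])          # discriminator logit
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

def train_step(x_real):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])   # z ~ p(z)
    with tf.GradientTape() as dt, tf.GradientTape() as gt:
        x_fake = G(z)
        d_real, d_fake = D(x_real), D(x_fake)
        # D maximizes log D(x) + log(1 - D(G(z))), i.e. minimizes this BCE.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # G is trained to make D classify its samples as real.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(dt.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```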
(2) Adversarial autoencoder (AAE)
The adversarial autoencoder (AAE) is a probabilistic autoencoder and a variant of the generative adversarial network (GAN) model. It performs variational inference by matching the aggregated posterior of the autoencoder's hidden code vectors to an arbitrary prior distribution; matching the aggregated posterior to the prior ensures that meaningful samples are produced from every part of the prior space, and the decoder of the adversarial autoencoder then learns a deep generative model that maps the imposed prior to the data distribution. The structure of the adversarial autoencoder consists of a standard autoencoder and an adversarial network, with the encoder also serving as the generator of the adversarial network. The idea of the adversarial autoencoder is to train the adversarial network and the autoencoder simultaneously to perform inference.
(3) Variational autoencoder (VAE)
The variational autoencoder (VAE) is a variant of the autoencoder model intended to model the distribution p(x) of data points in a high-dimensional space x through a low-dimensional latent variable z. The model comprises two successive steps: first, samples of z are generated in a low-dimensional latent subspace, and then they are mapped to the original space x. The key point is that the generated z should have a high probability of recovering the observed data matrix x, so that z can capture the intrinsic information of the raw data. In theory, the best choice for generating z is the posterior P(z|x); however, it is usually too complex and intractable. The VAE therefore approximates the posterior with a variational distribution Q(z|x) by minimizing the Kullback-Leibler (KL) divergence between Q(z|x) and P(z|x). This scheme can be trained with a gradient-based approach that maximizes the following objective:
$$\mathcal{L}(x) = \mathbb{E}_{Q(z|x)}[\log p_{model}(x|z)] - D_{KL}(Q(z|x)\,\|\,p_{model}(z)) \quad (2)$$
where $D_{KL}$ in equation (2) is the Kullback-Leibler divergence, $p_{model}(z)$ is the prior over the latent variable, and $p_{model}(x|z)$ is computed by the decoder.
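For concreteness, here is a sketch of objective (2) for a Gaussian encoder, using the closed-form KL term and the reparameterization trick. The Gaussian/MSE reconstruction term is an assumption made only for this sketch (the invention's decoder uses a ZINB likelihood, described later):

```python
import tensorflow as tf

def vae_loss(x, x_recon, z_mean, z_logvar):
    """Negative ELBO of equation (2): reconstruction error plus the
    KL divergence D_KL(Q(z|x) || N(0, I)) for a Gaussian encoder."""
    # Reconstruction term E_{Q(z|x)}[log p(x|z)], here a Gaussian/MSE proxy.
    recon = tf.reduce_sum(tf.square(x - x_recon), axis=-1)
    # Closed-form KL between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * tf.reduce_sum(1.0 + z_logvar - tf.square(z_mean) - tf.exp(z_logvar),
                              axis=-1)
    return tf.reduce_mean(recon + kl)

def reparameterize(z_mean, z_logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_logvar) * eps
```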
3. Dimensionality reduction of scRNA-seq data
The preprocessed scRNA-seq data are reduced in dimensionality with the constructed deep adversarial variational autoencoder.
First, the intermediate hidden layer of the autoencoder learns the hidden-layer feature vector; a prior distribution is imposed on this feature vector so that it matches the selected prior. The specific process is as follows:
(1) first, the constructed depth-antagonistic automatic encoder is composed of a depth encoder and a depth decoder. Where x is the input of the scRNA-seq expression level (M cells N genes), z is the potential code vector of the auto-encoder, p (z) is the prior distribution imposed on the potential code vector, q (z | x) is the N coding distributions, and p (x | z) is the decoding distribution.
(2) The deep encoder uses the variational distribution q(z|x) to output a Gaussian mean and covariance. The autoencoder gradually learns to reconstruct the scRNA-seq input x by minimizing the reconstruction error, making the reconstruction as realistic as possible. The encoder of the model is also the generator of the GAN framework: it is trained to fool the discriminator of the GAN framework into believing that the latent code vectors q(z) are drawn from the true prior distribution p(z).
(3) At the same time, the discriminator is trained to distinguish samples of p(z) from the latent code vectors q(z) produced by the encoder (i.e., the generator). In this way the method learns an unsupervised representation of the probability distribution of the scRNA-seq data. A minimal sketch of this latent-matching step follows.
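The sketch below illustrates the latent-matching losses under the same Keras conventions as the GAN sketch above; E and D_z are illustrative stand-ins for the deep encoder and the latent-space discriminator, and gradient updates would be applied exactly as in the earlier training step:

```python
import tensorflow as tf

latent_dim = 10  # illustrative
E = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(latent_dim)])   # encoder = generator
D_z = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                           tf.keras.layers.Dense(1)])          # latent discriminator
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def latent_matching_losses(x):
    """D_z separates prior samples p(z) from encoder codes q(z);
    the encoder is trained to fool D_z so that q(z) matches p(z)."""
    z_prior = tf.random.normal([tf.shape(x)[0], latent_dim])    # sample of p(z)
    z_code = E(x)                                               # code from data
    d_loss = bce(tf.ones_like(D_z(z_prior)), D_z(z_prior)) + \
             bce(tf.zeros_like(D_z(z_code)), D_z(z_code))
    e_loss = bce(tf.ones_like(D_z(z_code)), D_z(z_code))        # fool the discriminator
    return d_loss, e_loss
```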
4. Combining the deep adversarial variational autoencoder with the Bhattacharyya distance
The Wasserstein distance has been shown to make GAN training more stable, and the deep adversarial variational autoencoder model can be combined with the Wasserstein distance. The Wasserstein distance, also known as the earth mover's distance, is defined in equation (3):
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\,\|x - y\|\,] \quad (3)$$
where $\Pi(p, q)$ denotes the set of all joint distributions whose marginals are p and q.
we also propose to combine this model with the Bhattacharyya distance, another measure of similarity of two probability distributions, which is closely related to the bhattacharya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Bhattacharyya distance D between p and q distributions over the same domain XB(p, q) is defined as
$$D_B(p, q) = -\ln(BC(p, q)) \quad (4)$$
where
$$BC(p, q) = \sum_{x \in X} \sqrt{p(x)\, q(x)} \quad (5)$$
In equations (4) and (5), BC is the Bhattacharyya coefficient.
Our new objective is then
[Equation (6) is presented as an image in the original publication and is not reproduced here; it combines the adversarial objective with the Bhattacharyya distance.]
where $p_{data}$ and $p(z)$ are again the data distribution and the model distribution, respectively.
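A small sketch of equations (4) and (5) for discrete distributions; the helper name and the epsilon guard are illustrative:

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """D_B(p, q) = -ln(BC(p, q)) with BC(p, q) = sum_x sqrt(p(x) q(x)),
    for two discrete distributions over the same domain X."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()      # normalize to probabilities
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return -np.log(bc + eps)

# BC = 1 (distance 0) for identical distributions; BC -> 0 as they separate.
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))  # ~0.0
print(bhattacharyya_distance([0.9, 0.1], [0.1, 0.9]))  # > 0
```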
5. K-means clustering
The method reconstructs the decoder output of the scRNA-seq data with a ZINB conditional likelihood; the ZINB distribution has been shown to describe scRNA-seq data well and is a generally accepted model of gene expression. A sketch of the ZINB log-likelihood follows.
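As an illustration, here is a NumPy/SciPy sketch of the ZINB negative log-likelihood that such a decoder would minimize. The parameter names (mu for the NB mean, theta for the dispersion, pi for the zero-inflation probability) are assumptions, since the patent does not spell out its parameterization:

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Negative log-likelihood of a zero-inflated negative binomial:
    P(x) = pi * 1[x == 0] + (1 - pi) * NB(x; mu, theta)."""
    # log NB(x; mu, theta) with mean mu and dispersion theta.
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    nb_zero = np.exp(theta * np.log(theta / (theta + mu) + eps))  # NB(0; mu, theta)
    ll_zero = np.log(pi + (1.0 - pi) * nb_zero + eps)             # x == 0 branch
    ll_pos = np.log(1.0 - pi + eps) + log_nb                      # x > 0 branch
    return -np.where(x < 0.5, ll_zero, ll_pos).sum()
```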
To evaluate the effectiveness of the method, the k-means clustering algorithm is applied to the dimensionality-reduced data, and the result is evaluated with the normalized mutual information (NMI) index. Let X be the predicted clustering result and Y the true cell-type labels; the NMI score is computed as follows:
$$NMI(X, Y) = \frac{MI(X, Y)}{\max(H(X), H(Y))} \quad (7)$$
In equation (7), MI is the mutual information between X and Y and H is the Shannon entropy.
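A minimal evaluation sketch using scikit-learn (an assumption; the patent does not name its implementation). Note that normalized_mutual_info_score averages the entropies arithmetically by default, so average_method="max" is passed here to match the max-normalization written in equation (7):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(Z, y_true, n_clusters):
    """Cluster the low-dimensional codes Z with k-means and score the
    result against the true cell-type labels with NMI."""
    y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z)
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")

# Toy usage on random codes; real input would be the reduced scRNA-seq data.
Z = np.random.rand(100, 10)
y = np.random.randint(0, 7, size=100)
print(cluster_and_score(Z, y, n_clusters=7))
```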
As can be seen from the above, the scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder provided by one or more embodiments of the present specification combines the advantages of an adversarial autoencoder and a variational autoencoder. Our model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments performed on three real scRNA-seq datasets show that the method provides a more accurate low-dimensional representation of scRNA-seq data.
Drawings
In order to more clearly illustrate one or more embodiments of this specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only of one or more embodiments of this specification, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1: flow schematic diagram of scRNA-seq data dimension reduction method based on depth invariant self-encoder
FIG. 2: experimental results when the potential dimensions are 2, 10, 20
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to experiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
1. Summary of data sets
We evaluated the proposed SCAVAE model on three real scRNA-seq datasets from different sequencing platforms. All datasets used herein are publicly available. The statistics of the datasets are summarized in Table 1; in each dataset, the 720 genes with the largest variance were selected for the subsequent experiments. The detailed information is shown in Table 1:
table 1 data set used in this experiment
[Table 1 is presented as an image in the original publication and is not reproduced here.]
2. Experimental Environment and parameter settings
The hardware environment is mainly a PC host with an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU at 2.42 GHz, 16 GB of RAM, and a 64-bit operating system. The software is implemented in Python in the PyCharm environment on Windows 10; the Python version is 3.5.0 and the TensorFlow version is 1.4.0.
In the method, the encoder, decoder, and discriminator are built as fully connected neural networks of 1, 2, 3, or 4 layers with 16, 32, 64, 128, 256, 512, or 1024 nodes per layer. The best hyperparameter set is selected from the candidates by grid search so as to maximize clustering performance on the test dataset. All neural networks are regularized with Dropout, and the activation function between hidden layers is the leaky ReLU. Deep learning models have high variance and do not give the same answer over multiple runs; to obtain reproducible results we fix the random seeds with Python and TensorFlow commands (e.g., np.random.seed), as sketched below.
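A sketch of the seeding commands alluded to above. The original text is truncated after "np.", so np.random.seed is an assumption, and the exact TensorFlow call depends on the version:

```python
import os
import random
import numpy as np
import tensorflow as tf

SEED = 0
os.environ["PYTHONHASHSEED"] = str(SEED)   # hash-based operations
random.seed(SEED)                          # Python stdlib RNG
np.random.seed(SEED)                       # NumPy RNG
tf.random.set_seed(SEED)                   # TF 2.x; TF 1.x uses tf.set_random_seed(SEED)
```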
The parameter settings for the various datasets of SCAVAE are shown in table 2.
Table 2: all parameter settings used for the experiment
[Table 2 is presented as an image in the original publication and is not reproduced here.]
3. Evaluation index
In our experiments, the SCAVAE model was evaluated with the normalized mutual information (NMI) index, which is widely used to assess model performance in unsupervised learning scenarios.
NMI (normalized mutual information): NMI measures the similarity of two clusterings from the point of view of information theory. Let X be the predicted clustering result and Y the true cell-type labels; the NMI score is computed as follows:
$$NMI(X, Y) = \frac{MI(X, Y)}{\max(H(X), H(Y))} \quad (8)$$
As in equation (7), MI in equation (8) is the mutual information between X and Y and H is the Shannon entropy.
4. Analysis of Experimental results
The method was tested mainly on the three real datasets Cortex, Macosko-44k, and Zheng-73k, and its applicability on these real scRNA-seq datasets is further shown for latent dimensions of 2, 10, and 20. Table 3 gives the detailed experimental results of the dimensionality reduction algorithms in terms of NMI score.
Table 3: detailed information of the results of the experiments
[Table 3 is presented as an image in the original publication and is not reproduced here.]
The experimental results in Table 3 show that the SCAVAE method based on the Bhattacharyya distance is a promising new method. It achieves better performance on the three real datasets, indicating that it can provide a more accurate low-dimensional representation of scRNA-seq data.
The SCAVAE method is therefore a method for dimensionality reduction and cluster analysis of single-cell RNA-seq data with the following advantages. First, SCAVAE matches the latent spatial distribution to the selected prior. Second, SCAVAE uses the ZINB distribution, a generally accepted model of gene expression. Third, the SCAVAE method incorporates the Bhattacharyya distance, which makes it more stable. Finally, the method benefits from the parallel and scalable nature of the deep neural network framework. Our model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments on three real scRNA-seq datasets, evaluated with the normalized mutual information index, show that the method performs well.

Claims (6)

1. A scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder, characterized by comprising the following implementation steps:
(1) preprocessing the data: collecting scRNA-seq datasets from different species, of different types and with different cell numbers; preprocessing the collected raw scRNA-seq data with a variance-filtering method, and reconstructing the input data with a zero-inflated negative binomial (ZINB) distribution to obtain denoised data;
(2) constructing the deep adversarial variational autoencoder model, an autoencoder framework consisting of a deep encoder, an intermediate hidden layer, and a deep decoder;
(3) reducing the dimensionality of the preprocessed scRNA-seq data with the constructed deep adversarial variational autoencoder: learning the hidden-layer feature vector with the intermediate hidden layer of the autoencoder, imposing a prior distribution on the hidden-layer feature vector, and matching it to the selected prior distribution;
(4) following the scheme based on the Wasserstein distance, combining the deep adversarial variational autoencoder with the Bhattacharyya distance;
(5) reconstructing the decoder output of the scRNA-seq data with a ZINB conditional likelihood, and clustering the dimensionality-reduced data with the k-means clustering algorithm to obtain the normalized mutual information score.
2. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the data are collected and the collected single-cell RNA sequencing data are preprocessed as follows:
three real scRNA-seq datasets from different species, of different types and with different cell numbers, are collected, and the collected data are then preprocessed with variance filtering to obtain the 720 genes with the highest variance across cells.
Specifically, we perform data preprocessing operations on the following three data sets.
(1) the Cortex dataset, with ground-truth labels for 7 different cell types, consisting of 3005 cells from the somatosensory cortex and hippocampus of the mouse brain;
(2) the Macosko-44k dataset, which classifies cells of the mouse retina;
(3) the Zheng-73k dataset, consisting of fluorescence-activated sorted cells from healthy humans, mainly comprising T cells, NK cells, and B cells;
the input data are then reconstructed with a zero-inflated negative binomial distribution, which provides a better fit for scRNA-seq data and yields denoised data.
3. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the constructed deep adversarial variational autoencoder is an autoencoder framework consisting of a deep encoder, an intermediate hidden layer, and a deep decoder, specifically:
the deep adversarial variational autoencoder combines the structures of an adversarial autoencoder and a variational autoencoder.
(1) Adversarial autoencoder (AAE)
The adversarial autoencoder (AAE) is a variant of the generative adversarial network (GAN) model that turns the autoencoder into a probabilistic, generative model by means of the GAN framework.
The idea of the generative adversarial network (GAN) framework is to set up a minimax adversarial game between two neural networks, a generative model G and a discriminative model D. The discriminator D(x) is a neural network that computes the probability that a point x in data space is a sample from the data distribution we are trying to model (a positive sample) rather than a sample from the generative model (a negative sample). At the same time, the generative model gradually learns to map samples z from the prior distribution p(z) into data space through the function G(z). G(z) is trained to confuse the discriminator as much as possible into believing that its generated samples come from the data distribution. The generative model is trained by using the gradient of D(x) with respect to x to modify its parameters, so that its generated samples completely confuse the discriminative model. This scheme can be formalized as a minimax objective of the following form:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \quad (1)$$
In equation (1), $p_{data}$ is the data distribution and $p(z)$ is the model distribution.
Both the generator G and the discriminator D can be modeled as fully connected neural networks and then trained by backpropagation with a suitable optimizer. In the experiments of the invention, adaptive moment estimation (Adam) is used in place of conventional stochastic gradient descent; Adam is an extension of stochastic gradient descent that is better suited to sparse data.
(2) Variational autoencoder (VAE)
The variational autoencoder (VAE) is a variant of the autoencoder model intended to model the distribution p(x) of data points in a high-dimensional space x through a low-dimensional latent variable z. The model comprises two successive steps: first, samples of z are generated in a low-dimensional latent subspace, and then they are mapped to the original space x. The key point is that the generated z should have a high probability of recovering the observed data matrix x, so that z can capture the intrinsic information of the raw data. In theory, the best choice for generating z is the posterior P(z|x); however, it is usually too complex and intractable. The VAE therefore approximates the posterior with a variational distribution Q(z|x) by minimizing the Kullback-Leibler (KL) divergence between Q(z|x) and P(z|x).
This scheme can be trained with a gradient-based approach that maximizes the following objective:
$$\mathcal{L}(x) = \mathbb{E}_{Q(z|x)}[\log p_{model}(x|z)] - D_{KL}(Q(z|x)\,\|\,p_{model}(z)) \quad (2)$$
where $D_{KL}$ is the Kullback-Leibler divergence, $p_{model}(z)$ is the prior over the latent variable, and $p_{model}(x|z)$ is computed by the decoder.
4. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the constructed deep adversarial variational autoencoder is used to reduce the dimensionality of the preprocessed scRNA-seq data, specifically:
learning the hidden-layer feature vector with the intermediate hidden layer of the autoencoder, imposing a prior distribution on the hidden-layer feature vector, and matching it to the selected prior distribution.
5. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the deep adversarial variational autoencoder model is combined with the Bhattacharyya distance:
the Wasserstein distance has been shown to make GAN training more stable, and the deep adversarial variational autoencoder model can be combined with the Wasserstein distance; the Wasserstein distance, also known as the earth mover's distance, is defined as follows:
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\,\|x - y\|\,] \quad (3)$$
where $\Pi(p, q)$ denotes the set of all joint distributions whose marginals are p and q.
we also propose to combine this model with the Bhattacharyya distance, another measure of similarity of two probability distributions, which is closely related to the bhattacharya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Bhattacharyya distance D between p and q distributions over the same domain XB(p, q) is defined as
DB(p,q)=-ln(BC(p,q)) (4)
Wherein:
Figure FDA0003367768830000031
in equations (4) and (5), BC is referred to as Bhattacharyya coefficient.
Our new objective is then:
[Equation (6) is presented as an image in the original publication and is not reproduced here; it combines the adversarial objective with the Bhattacharyya distance.]
where $p_{data}$ and $p(z)$ are again the data distribution and the model distribution, respectively.
6. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the dimensionality-reduced data are clustered with the k-means clustering algorithm, specifically:
the proposed method reconstructs the decoder output of the scRNA-seq data with a ZINB conditional likelihood; the ZINB distribution has been shown to describe scRNA-seq data well and is a generally accepted model of gene expression;
to evaluate the effectiveness of the method, the k-means clustering algorithm is applied to the dimensionality-reduced data and evaluated with the normalized mutual information index; experiments on three real scRNA-seq datasets show that the method can provide a more accurate low-dimensional representation of scRNA-seq data.
CN202111388132.2A 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder Pending CN114067915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111388132.2A CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111388132.2A CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Publications (1)

Publication Number Publication Date
CN114067915A true CN114067915A (en) 2022-02-18

Family

ID=80279164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111388132.2A Pending CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Country Status (1)

Country Link
CN (1) CN114067915A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021075A1 (en) * 2022-07-29 2024-02-01 Oppo广东移动通信有限公司 Training method, model usage method, and wireless communication method and apparatus
CN116992299A (en) * 2023-09-28 2023-11-03 北京邮电大学 Training method, detecting method and device of blockchain transaction anomaly detection model
CN116992299B (en) * 2023-09-28 2024-01-05 北京邮电大学 Training method, detecting method and device of blockchain transaction anomaly detection model
CN117116350A (en) * 2023-10-25 2023-11-24 中国农业科学院深圳农业基因组研究所(岭南现代农业科学与技术广东省实验室深圳分中心) Correction method and device for RNA sequencing data, electronic equipment and storage medium
CN117116350B (en) * 2023-10-25 2024-02-27 中国农业科学院深圳农业基因组研究所(岭南现代农业科学与技术广东省实验室深圳分中心) Correction method and device for RNA sequencing data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111785329B (en) Single-cell RNA sequencing clustering method based on countermeasure automatic encoder
Ding et al. A survey on feature extraction for pattern recognition
Stuhlsatz et al. Feature extraction with deep neural networks by a generalized discriminant analysis
Yin The self-organizing maps: background, theories, extensions and applications
CN114067915A (en) scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
Nejad et al. A new enhanced learning approach to automatic image classification based on Salp Swarm Algorithm
CN109389171B (en) Medical image classification method based on multi-granularity convolution noise reduction automatic encoder technology
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN111738351A (en) Model training method and device, storage medium and electronic equipment
CN106886793B (en) Hyperspectral image waveband selection method based on discrimination information and manifold information
Gu et al. A study of hierarchical correlation clustering for scientific volume data
Salazar On Statistical Pattern Recognition in Independent Component Analysis Mixture Modelling
CN113033567A (en) Oracle bone rubbing image character extraction method fusing segmentation network and generation network
Abdollahifard et al. Stochastic simulation of patterns using Bayesian pattern modeling
CN116152554A (en) Knowledge-guided small sample image recognition system
Ma et al. Joint-label learning by dual augmentation for time series classification
CN113421546A (en) Cross-tested multi-mode based speech synthesis method and related equipment
Shu et al. An anomaly detection method based on random convolutional kernel and isolation forest for equipment state monitoring
CN111401440B (en) Target classification recognition method and device, computer equipment and storage medium
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
CN114912109B (en) Abnormal behavior sequence identification method and system based on graph embedding
CN111858343A (en) Countermeasure sample generation method based on attack capability
Dong et al. Clustering human wrist pulse signals via multiple criteria decision making
CN112465054B (en) FCN-based multivariate time series data classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination