CN114067915A - scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder - Google Patents

scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Info

Publication number
CN114067915A
CN114067915A
Authority
CN
China
Prior art keywords
data
scrna
model
encoder
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111388132.2A
Other languages
Chinese (zh)
Inventor
王树林 (Wang Shulin)
任亚琪 (Ren Yaqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202111388132.2A priority Critical patent/CN114067915A/en
Publication of CN114067915A publication Critical patent/CN114067915A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/30: Unsupervised data analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing data, and more specifically to a method that performs dimensionality reduction and clustering of single-cell RNA sequencing data with deep learning in order to identify cell populations effectively. The method comprises: collecting and preprocessing scRNA-seq data; constructing a deep adversarial variational autoencoder model; reducing the dimensionality of the preprocessed data with the constructed model; combining the deep adversarial variational autoencoder with the Bhattacharyya distance; and performing cluster analysis on the dimensionality-reduced result. The model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments on three real scRNA-seq datasets, evaluated with the normalized mutual information index, show that the method performs well.

Description

scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder
Technical Field
The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing data, and more specifically to a method for effectively identifying cell populations by performing dimensionality reduction and clustering on single-cell RNA sequencing data.
Background
Single-cell RNA sequencing (scRNA-seq) should not be understood literally as sequencing only one cell; rather, it is a sequencing technology that profiles the genome or transcriptome at the level of individual cells. Conventional sequencing methods are typically performed at the multicellular level and therefore lose heterogeneity information; more precisely, the information they report is the average across many cells.
With the development of scRNA-seq technology, massive scRNA-seq data have been generated, providing powerful data support for distinguishing the various cell populations in biological tissues and comprehensively revealing heterogeneity among cells. Through dimensionality reduction and clustering of scRNA-seq data, cell populations can be identified effectively, providing an important basis for research on embryonic development, the working mechanisms of the immune system, the genesis of tumor cells, and other topics. However, owing to the limitations of sequencing technology and the high complexity of gene expression, scRNA-seq data are generally noisy, high-dimensional, and strongly sparse, and valuable biological information is difficult to mine by hand. A crucial step in the analysis of scRNA-seq data is grouping cells of the same type based on gene expression, which is essentially a cell-clustering problem. Existing studies therefore reduce the dimensionality of scRNA-seq data before clustering. However, the high noise and varied shapes of scRNA-seq data render many dimensionality reduction methods inapplicable, and clustering data in which the number of samples (cells) is far smaller than the number of features (genes) is also a huge challenge. How to process the data effectively, exploit heterogeneity among cells, and distinguish different cell populations has therefore become a hot spot of current research.
Current bioinformatics methods for reducing the dimensionality of scRNA-seq data fall into three categories. The first category comprises linear methods, with two main representatives. Principal Component Analysis (PCA), one of the most commonly used dimensionality reduction methods, transforms observations into a latent space by defining linear combinations of the raw data points with successively maximal variance (the principal components). The other linear approach is the factor-analysis method ZIFA, which is similar to PCA but aims mainly to model correlation rather than covariance by describing the variability among the relevant variables. Linear methods are fast, simple to implement, and widely applied; however, scRNA-seq data are nonlinear in nature, so such methods are not suitable for all datasets. The second category comprises nonlinear methods, which are more flexible, produce visually appealing embeddings, and are easier to interpret by visual inspection. However, such methods often require users to set parameters manually and suffer from loss of important information and insufficient feature extraction. The first two categories are traditional methods; the third category comprises deep learning-based methods, whose extracted features are highly representative but poorly interpretable, and some of which use deep models with relatively long running times.
With the development of scRNA-seq technology, the requirements on methods for processing scRNA-seq data keep rising, and more accurate and efficient analysis of single-cell RNA sequencing data remains a hot research problem. In summary, existing methods for reducing the dimensionality of scRNA-seq data suffer from loss of important information, insufficient feature extraction, and similar problems, which seriously affect subsequent data analysis. How to design a more reasonable and effective dimensionality reduction method is therefore a key problem in urgent need of study.
Disclosure of Invention
In view of the problems with existing methods and the complexity of scRNA-seq data, the invention provides a scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder. The method effectively alleviates the loss of important information and the insufficient feature extraction of existing dimensionality reduction methods and achieves better clustering accuracy. The steps of the method are as follows:
1. Data preprocessing
Owing to the limitations of sequencing technology and the high complexity of gene expression, scRNA-seq data are generally noisy, high-dimensional, and strongly sparse. To eliminate the influence of the large technical noise and the differences in gene expression level generated during sequencing on subsequent processing, while keeping as much information from the original data as possible, data preprocessing is performed with a variance-filtering method.
We collected three real scRNA-seq datasets from different species, of different types and with different cell numbers, and then preprocessed the collected data with variance filtering to obtain the 720 genes with the highest variance across cells. We assume that the genes with the highest variance relative to their average expression reflect biological effects rather than technical noise. The normalized count matrix C is transformed using log2(1 + C). A code sketch of this preprocessing step is given at the end of this step.
Specifically, we perform data preprocessing operations on the following three data sets.
(1) The Cortex dataset, with ground-truth labels for 7 different cell types, consisting of 3005 cells from the somatosensory cortex and hippocampus of the mouse brain;
(2) the Macosko-44k dataset, which classifies cells of the mouse retina;
(3) the Zheng-73k dataset, consisting of fluorescence-activated sorted cells from healthy humans, mainly comprising T cells, NK cells, and B cells.
The input data are then reconstructed with a zero-inflated negative binomial (ZINB) distribution, which provides a better fit for scRNA-seq data and yields denoised data.
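The following is a minimal sketch of the variance-filtering and log-transform step, assuming the normalized count matrix is held in a NumPy array of shape (cells, genes); the function name and the exact ordering of filtering and transformation are illustrative assumptions, not prescribed by the patent:

```python
import numpy as np

def preprocess_counts(C, n_top_genes=720):
    """Variance-filter a normalized count matrix C (cells x genes),
    keeping the n_top_genes highest-variance genes, then log-transform."""
    # Rank genes by their variance across cells, highest first.
    gene_var = C.var(axis=0)
    top = np.argsort(gene_var)[::-1][:n_top_genes]
    # Transform the normalized counts as log2(1 + C).
    return np.log2(1.0 + C[:, top])
```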
2. Constructing the deep adversarial variational autoencoder model
The deep adversarial variational autoencoder is an autoencoder framework composed of a deep encoder, an intermediate hidden layer, and a deep decoder; it combines the structures of an adversarial autoencoder and a variational autoencoder.
(1) Generative adversarial network (GAN)
The idea of the generative adversarial network (GAN) framework is to set up a minimax adversarial game between two neural networks, a generative model G and a discriminative model D. The discriminator D(x) is a neural network that computes the probability that a point x in data space is a sample from the data distribution we are trying to model (a positive sample) rather than a sample from the generative model (a negative sample). At the same time, the generative model gradually learns to map samples z from the prior distribution p(z) into data space through the function G(z). G(z) is trained to confuse the discriminator as much as possible into believing that its generated samples come from the data distribution. The generative model is trained by using the gradient of D(x) with respect to x to modify its parameters, so that its generated samples completely confuse the discriminative model. This scheme can be formalized as a minimax objective of the following form:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \quad (1)$$
where $p_{data}$ is the data distribution and $p(z)$ is the model distribution.
Both the generator G and the discriminator D can be modeled as fully connected neural networks and then trained by backpropagation with a suitable optimizer. In the experiments of the invention, adaptive moment estimation (Adam) is used in place of conventional stochastic gradient descent; Adam is an extension of stochastic gradient descent that is better suited to sparse data. A minimal training-step sketch follows.
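As a concrete illustration, one training step for objective (1) with Adam might look like the sketch below. This assumes a modern TensorFlow/Keras API (the experiments later in this document use TensorFlow 1.4, whose session-style code differs), and the layer sizes and learning rates are illustrative, not the patent's:

```python
import tensorflow as tf

latent_dim, data_dim = 10, 720  # illustrative sizes

G = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(data_dim)])   # generator G(z)
D = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(1)])          # discriminator logit
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

def train_step(x_real):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])   # z ~ p(z)
    with tf.GradientTape() as dt, tf.GradientTape() as gt:
        x_fake = G(z)
        d_real, d_fake = D(x_real), D(x_fake)
        # D maximizes log D(x) + log(1 - D(G(z))), i.e. minimizes this BCE.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # G is trained to make D classify its samples as real.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(dt.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```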
(2) Adversarial autoencoder (AAE)
The adversarial autoencoder (AAE) is a probabilistic autoencoder and a variant of the generative adversarial network (GAN) model. It performs variational inference by matching the aggregated posterior of the autoencoder's hidden code vectors to an arbitrary prior distribution; matching the aggregated posterior to the prior ensures that meaningful samples are produced from every part of the prior space, and the decoder of the adversarial autoencoder then learns a deep generative model that maps the imposed prior to the data distribution. The structure of the adversarial autoencoder consists of a standard autoencoder and an adversarial network, with the encoder also serving as the generator of the adversarial network. The idea of the adversarial autoencoder is to train the adversarial network and the autoencoder simultaneously to perform inference.
(3) Variational autoencoder (VAE)
The variational autoencoder (VAE) is a variant of the autoencoder model intended to model the distribution p(x) of data points in a high-dimensional space x through a low-dimensional latent variable z. The model comprises two successive steps: first, samples of z are generated in a low-dimensional latent subspace, and then they are mapped to the original space x. The key point is that the generated z should have a high probability of recovering the observed data matrix x, so that z can capture the intrinsic information of the raw data. In theory, the best choice for generating z is the posterior P(z|x); however, it is usually too complex and intractable. The VAE therefore approximates the posterior with a variational distribution Q(z|x) by minimizing the Kullback-Leibler (KL) divergence between Q(z|x) and P(z|x). This scheme can be trained with a gradient-based approach that maximizes the following objective:
$$\mathcal{L}(x) = \mathbb{E}_{Q(z|x)}[\log p_{model}(x|z)] - D_{KL}(Q(z|x)\,\|\,p_{model}(z)) \quad (2)$$
where $D_{KL}$ in equation (2) is the Kullback-Leibler divergence, $p_{model}(z)$ is the prior over the latent variable, and $p_{model}(x|z)$ is computed by the decoder.
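For concreteness, here is a sketch of objective (2) for a Gaussian encoder, using the closed-form KL term and the reparameterization trick. The Gaussian/MSE reconstruction term is an assumption made only for this sketch (the invention's decoder uses a ZINB likelihood, described later):

```python
import tensorflow as tf

def vae_loss(x, x_recon, z_mean, z_logvar):
    """Negative ELBO of equation (2): reconstruction error plus the
    KL divergence D_KL(Q(z|x) || N(0, I)) for a Gaussian encoder."""
    # Reconstruction term E_{Q(z|x)}[log p(x|z)], here a Gaussian/MSE proxy.
    recon = tf.reduce_sum(tf.square(x - x_recon), axis=-1)
    # Closed-form KL between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * tf.reduce_sum(1.0 + z_logvar - tf.square(z_mean) - tf.exp(z_logvar),
                              axis=-1)
    return tf.reduce_mean(recon + kl)

def reparameterize(z_mean, z_logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_logvar) * eps
```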
3. Dimensionality reduction of scRNA-seq data
The preprocessed scRNA-seq data are reduced in dimensionality with the constructed deep adversarial variational autoencoder.
First, the intermediate hidden layer of the autoencoder learns the hidden-layer feature vector; a prior distribution is imposed on this feature vector so that it matches the selected prior. The specific process is as follows:
(1) first, the constructed depth-antagonistic automatic encoder is composed of a depth encoder and a depth decoder. Where x is the input of the scRNA-seq expression level (M cells N genes), z is the potential code vector of the auto-encoder, p (z) is the prior distribution imposed on the potential code vector, q (z | x) is the N coding distributions, and p (x | z) is the decoding distribution.
(2) The deep encoder uses the variational distribution q(z|x) to output a Gaussian mean and covariance. The autoencoder gradually learns to reconstruct the scRNA-seq input x by minimizing the reconstruction error, making the reconstruction as realistic as possible. The encoder of the model is also the generator of the GAN framework: it is trained to fool the discriminator of the GAN framework into believing that the latent code vectors q(z) are drawn from the true prior distribution p(z).
(3) At the same time, the discriminator is trained to distinguish samples of p(z) from the latent code vectors q(z) produced by the encoder (i.e., the generator). In this way the method learns an unsupervised representation of the probability distribution of the scRNA-seq data. A minimal sketch of this latent-matching step follows.
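The sketch below illustrates the latent-matching losses under the same Keras conventions as the GAN sketch above; E and D_z are illustrative stand-ins for the deep encoder and the latent-space discriminator, and gradient updates would be applied exactly as in the earlier training step:

```python
import tensorflow as tf

latent_dim = 10  # illustrative
E = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                         tf.keras.layers.Dense(latent_dim)])   # encoder = generator
D_z = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                           tf.keras.layers.Dense(1)])          # latent discriminator
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def latent_matching_losses(x):
    """D_z separates prior samples p(z) from encoder codes q(z);
    the encoder is trained to fool D_z so that q(z) matches p(z)."""
    z_prior = tf.random.normal([tf.shape(x)[0], latent_dim])    # sample of p(z)
    z_code = E(x)                                               # code from data
    d_loss = bce(tf.ones_like(D_z(z_prior)), D_z(z_prior)) + \
             bce(tf.zeros_like(D_z(z_code)), D_z(z_code))
    e_loss = bce(tf.ones_like(D_z(z_code)), D_z(z_code))        # fool the discriminator
    return d_loss, e_loss
```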
4. Combining the deep adversarial variational autoencoder with the Bhattacharyya distance
The Wasserstein distance has been shown to make GAN training more stable, and the deep adversarial variational autoencoder model can be combined with the Wasserstein distance. The Wasserstein distance, also known as the earth mover's distance, is defined in equation (3):
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\,\|x - y\|\,] \quad (3)$$
where $\Pi(p, q)$ denotes the set of all joint distributions whose marginals are p and q.
we also propose to combine this model with the Bhattacharyya distance, another measure of similarity of two probability distributions, which is closely related to the bhattacharya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Bhattacharyya distance D between p and q distributions over the same domain XB(p, q) is defined as
$$D_B(p, q) = -\ln(BC(p, q)) \quad (4)$$
where
$$BC(p, q) = \sum_{x \in X} \sqrt{p(x)\, q(x)} \quad (5)$$
In equations (4) and (5), BC is the Bhattacharyya coefficient.
Our new objective is then
[Equation (6) is presented as an image in the original publication and is not reproduced here; it combines the adversarial objective with the Bhattacharyya distance.]
where $p_{data}$ and $p(z)$ are again the data distribution and the model distribution, respectively.
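A small sketch of equations (4) and (5) for discrete distributions; the helper name and the epsilon guard are illustrative:

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """D_B(p, q) = -ln(BC(p, q)) with BC(p, q) = sum_x sqrt(p(x) q(x)),
    for two discrete distributions over the same domain X."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()      # normalize to probabilities
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return -np.log(bc + eps)

# BC = 1 (distance 0) for identical distributions; BC -> 0 as they separate.
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))  # ~0.0
print(bhattacharyya_distance([0.9, 0.1], [0.1, 0.9]))  # > 0
```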
5. K-means clustering
The method reconstructs the decoder output of the scRNA-seq data with a ZINB conditional likelihood; the ZINB distribution has been shown to describe scRNA-seq data well and is a generally accepted model of gene expression. A sketch of the ZINB log-likelihood follows.
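As an illustration, here is a NumPy/SciPy sketch of the ZINB negative log-likelihood that such a decoder would minimize. The parameter names (mu for the NB mean, theta for the dispersion, pi for the zero-inflation probability) are assumptions, since the patent does not spell out its parameterization:

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Negative log-likelihood of a zero-inflated negative binomial:
    P(x) = pi * 1[x == 0] + (1 - pi) * NB(x; mu, theta)."""
    # log NB(x; mu, theta) with mean mu and dispersion theta.
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    nb_zero = np.exp(theta * np.log(theta / (theta + mu) + eps))  # NB(0; mu, theta)
    ll_zero = np.log(pi + (1.0 - pi) * nb_zero + eps)             # x == 0 branch
    ll_pos = np.log(1.0 - pi + eps) + log_nb                      # x > 0 branch
    return -np.where(x < 0.5, ll_zero, ll_pos).sum()
```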
To evaluate the effectiveness of the method, the k-means clustering algorithm is applied to the dimensionality-reduced data, and the result is evaluated with the normalized mutual information (NMI) index. Let X be the predicted clustering result and Y the true cell-type labels; the NMI score is computed as follows:
$$NMI(X, Y) = \frac{MI(X, Y)}{\max(H(X), H(Y))} \quad (7)$$
In equation (7), MI is the mutual information between X and Y and H is the Shannon entropy.
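A minimal evaluation sketch using scikit-learn (an assumption; the patent does not name its implementation). Note that normalized_mutual_info_score averages the entropies arithmetically by default, so average_method="max" is passed here to match the max-normalization written in equation (7):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(Z, y_true, n_clusters):
    """Cluster the low-dimensional codes Z with k-means and score the
    result against the true cell-type labels with NMI."""
    y_pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z)
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")

# Toy usage on random codes; real input would be the reduced scRNA-seq data.
Z = np.random.rand(100, 10)
y = np.random.randint(0, 7, size=100)
print(cluster_and_score(Z, y, n_clusters=7))
```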
As can be seen from the above, the scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder provided by one or more embodiments of the present specification combines the advantages of an adversarial autoencoder and a variational autoencoder. Our model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments performed on three real scRNA-seq datasets show that the method provides a more accurate low-dimensional representation of scRNA-seq data.
Drawings
In order to more clearly illustrate one or more embodiments of this specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only of one or more embodiments of this specification, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1: flow schematic diagram of scRNA-seq data dimension reduction method based on depth invariant self-encoder
FIG. 2: experimental results when the potential dimensions are 2, 10, 20
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to experiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
1. Summary of data sets
We evaluated the proposed SCAVAE model on three real scRNA-seq datasets from different sequencing platforms. All datasets used herein are publicly available. The statistics of the datasets are summarized in Table 1; in each dataset, the 720 genes with the largest variance were selected for the subsequent experiments. The detailed information is shown in Table 1:
table 1 data set used in this experiment
[Table 1 is presented as an image in the original publication and is not reproduced here.]
2. Experimental Environment and parameter settings
The hardware environment is mainly a PC host with an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU at 2.42 GHz, 16 GB of RAM, and a 64-bit operating system. The software is implemented in Python in the PyCharm environment on Windows 10; the Python version is 3.5.0 and the TensorFlow version is 1.4.0.
In the method, the encoder, decoder, and discriminator are built as fully connected neural networks of 1, 2, 3, or 4 layers with 16, 32, 64, 128, 256, 512, or 1024 nodes per layer. The best hyperparameter set is selected from the candidates by grid search so as to maximize clustering performance on the test dataset. All neural networks are regularized with Dropout, and the activation function between hidden layers is the leaky ReLU. Deep learning models have high variance and do not give the same answer over multiple runs; to obtain reproducible results we fix the random seeds with Python and TensorFlow commands (e.g., np.random.seed), as sketched below.
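A sketch of the seeding commands alluded to above. The original text is truncated after "np.", so np.random.seed is an assumption, and the exact TensorFlow call depends on the version:

```python
import os
import random
import numpy as np
import tensorflow as tf

SEED = 0
os.environ["PYTHONHASHSEED"] = str(SEED)   # hash-based operations
random.seed(SEED)                          # Python stdlib RNG
np.random.seed(SEED)                       # NumPy RNG
tf.random.set_seed(SEED)                   # TF 2.x; TF 1.x uses tf.set_random_seed(SEED)
```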
The parameter settings for the various datasets of SCAVAE are shown in table 2.
Table 2: all parameter settings used for the experiment
[Table 2 is presented as an image in the original publication and is not reproduced here.]
3. Evaluation index
In our experiments, the SCAVAE model was evaluated with the normalized mutual information (NMI) index, which is widely used to assess model performance in unsupervised learning scenarios.
NMI (normalized mutual information): NMI measures the similarity of two clusterings from the point of view of information theory. Let X be the predicted clustering result and Y the true cell-type labels; the NMI score is computed as follows:
$$NMI(X, Y) = \frac{MI(X, Y)}{\max(H(X), H(Y))} \quad (8)$$
As in equation (7), MI in equation (8) is the mutual information between X and Y and H is the Shannon entropy.
4. Analysis of Experimental results
The method was tested mainly on the three real datasets Cortex, Macosko-44k, and Zheng-73k, and its applicability on these real scRNA-seq datasets is further shown for latent dimensions of 2, 10, and 20. Table 3 gives the detailed experimental results of the dimensionality reduction algorithms in terms of NMI score.
Table 3: detailed information of the results of the experiments
[Table 3 is presented as an image in the original publication and is not reproduced here.]
The experimental results in Table 3 show that the SCAVAE method based on the Bhattacharyya distance is a promising new method. It achieves better performance on the three real datasets, indicating that it can provide a more accurate low-dimensional representation of scRNA-seq data.
The SCAVAE method is therefore a method for dimensionality reduction and cluster analysis of single-cell RNA-seq data with the following advantages. First, SCAVAE matches the latent spatial distribution to the selected prior. Second, SCAVAE uses the ZINB distribution, a generally accepted model of gene expression. Third, the SCAVAE method incorporates the Bhattacharyya distance, which makes it more stable. Finally, the method benefits from the parallel and scalable nature of the deep neural network framework. Our model constrains the data structure and performs dimensionality reduction through the deep adversarial variational autoencoder module. Experiments on three real scRNA-seq datasets, evaluated with the normalized mutual information index, show that the method performs well.

Claims (6)

1. A scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder, characterized by comprising the following implementation steps:
(1) preprocessing the data: collecting scRNA-seq datasets from different species, of different types and with different cell numbers; preprocessing the collected raw scRNA-seq data with a variance-filtering method, and reconstructing the input data with a zero-inflated negative binomial (ZINB) distribution to obtain denoised data;
(2) constructing the deep adversarial variational autoencoder model, an autoencoder framework consisting of a deep encoder, an intermediate hidden layer, and a deep decoder;
(3) reducing the dimensionality of the preprocessed scRNA-seq data with the constructed deep adversarial variational autoencoder: learning the hidden-layer feature vector with the intermediate hidden layer of the autoencoder, imposing a prior distribution on the hidden-layer feature vector, and matching it to the selected prior distribution;
(4) following the scheme based on the Wasserstein distance, combining the deep adversarial variational autoencoder with the Bhattacharyya distance;
(5) reconstructing the decoder output of the scRNA-seq data with a ZINB conditional likelihood, and clustering the dimensionality-reduced data with the k-means clustering algorithm to obtain the normalized mutual information score.
2. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the data are collected and the collected single-cell RNA sequencing data are preprocessed as follows:
three real scRNA-seq datasets from different species, of different types and with different cell numbers, are collected, and the collected data are then preprocessed with variance filtering to obtain the 720 genes with the highest variance across cells.
Specifically, we perform data preprocessing operations on the following three data sets.
(1) the Cortex dataset, with ground-truth labels for 7 different cell types, consisting of 3005 cells from the somatosensory cortex and hippocampus of the mouse brain;
(2) the Macosko-44k dataset, which classifies cells of the mouse retina;
(3) the Zheng-73k dataset, consisting of fluorescence-activated sorted cells from healthy humans, mainly comprising T cells, NK cells, and B cells;
the input data are then reconstructed with a zero-inflated negative binomial distribution, which provides a better fit for scRNA-seq data and yields denoised data.
3. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the constructed deep adversarial variational autoencoder is an autoencoder framework consisting of a deep encoder, an intermediate hidden layer, and a deep decoder, specifically:
the deep adversarial variational autoencoder combines the structures of an adversarial autoencoder and a variational autoencoder.
(1) Adversarial autoencoder (AAE)
The adversarial autoencoder (AAE) is a variant of the generative adversarial network (GAN) model that turns the autoencoder into a probabilistic, generative model by means of the GAN framework.
The idea of the generative adversarial network (GAN) framework is to set up a minimax adversarial game between two neural networks, a generative model G and a discriminative model D. The discriminator D(x) is a neural network that computes the probability that a point x in data space is a sample from the data distribution we are trying to model (a positive sample) rather than a sample from the generative model (a negative sample). At the same time, the generative model gradually learns to map samples z from the prior distribution p(z) into data space through the function G(z). G(z) is trained to confuse the discriminator as much as possible into believing that its generated samples come from the data distribution. The generative model is trained by using the gradient of D(x) with respect to x to modify its parameters, so that its generated samples completely confuse the discriminative model. This scheme can be formalized as a minimax objective of the following form:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \quad (1)$$
In equation (1), $p_{data}$ is the data distribution and $p(z)$ is the model distribution.
Both the generator G and the discriminator D can be modeled as fully connected neural networks and then trained by backpropagation with a suitable optimizer. In the experiments of the invention, adaptive moment estimation (Adam) is used in place of conventional stochastic gradient descent; Adam is an extension of stochastic gradient descent that is better suited to sparse data.
(2) Variational autoencoder (VAE)
The variational autoencoder (VAE) is a variant of the autoencoder model intended to model the distribution p(x) of data points in a high-dimensional space x through a low-dimensional latent variable z. The model comprises two successive steps: first, samples of z are generated in a low-dimensional latent subspace, and then they are mapped to the original space x. The key point is that the generated z should have a high probability of recovering the observed data matrix x, so that z can capture the intrinsic information of the raw data. In theory, the best choice for generating z is the posterior P(z|x); however, it is usually too complex and intractable. The VAE therefore approximates the posterior with a variational distribution Q(z|x) by minimizing the Kullback-Leibler (KL) divergence between Q(z|x) and P(z|x).
This scheme can be trained with a gradient-based approach that maximizes the following objective:
$$\mathcal{L}(x) = \mathbb{E}_{Q(z|x)}[\log p_{model}(x|z)] - D_{KL}(Q(z|x)\,\|\,p_{model}(z)) \quad (2)$$
where $D_{KL}$ is the Kullback-Leibler divergence, $p_{model}(z)$ is the prior over the latent variable, and $p_{model}(x|z)$ is computed by the decoder.
4. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the constructed deep adversarial variational autoencoder is used to reduce the dimensionality of the preprocessed scRNA-seq data, specifically:
learning the hidden-layer feature vector with the intermediate hidden layer of the autoencoder, imposing a prior distribution on the hidden-layer feature vector, and matching it to the selected prior distribution.
5. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the deep adversarial variational autoencoder model is combined with the Bhattacharyya distance:
the Wasserstein distance has been shown to make GAN training more stable, and the deep adversarial variational autoencoder model can be combined with the Wasserstein distance; the Wasserstein distance, also known as the earth mover's distance, is defined as follows:
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma}[\,\|x - y\|\,] \quad (3)$$
where $\Pi(p, q)$ denotes the set of all joint distributions whose marginals are p and q.
we also propose to combine this model with the Bhattacharyya distance, another measure of similarity of two probability distributions, which is closely related to the bhattacharya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Bhattacharyya distance D between p and q distributions over the same domain XB(p, q) is defined as
DB(p,q)=-ln(BC(p,q)) (4)
Wherein:
Figure FDA0003367768830000031
in equations (4) and (5), BC is referred to as Bhattacharyya coefficient.
Our new objective is then:
[Equation (6) is presented as an image in the original publication and is not reproduced here; it combines the adversarial objective with the Bhattacharyya distance.]
where $p_{data}$ and $p(z)$ are again the data distribution and the model distribution, respectively.
6. The scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder of claim 1, wherein the dimensionality-reduced data are clustered with the k-means clustering algorithm, specifically:
the proposed method reconstructs the decoder output of the scRNA-seq data with a ZINB conditional likelihood; the ZINB distribution has been shown to describe scRNA-seq data well and is a generally accepted model of gene expression;
to evaluate the effectiveness of the method, the k-means clustering algorithm is applied to the dimensionality-reduced data and evaluated with the normalized mutual information index; experiments on three real scRNA-seq datasets show that the method can provide a more accurate low-dimensional representation of scRNA-seq data.
CN202111388132.2A 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder Pending CN114067915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111388132.2A CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111388132.2A CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Publications (1)

Publication Number Publication Date
CN114067915A true CN114067915A (en) 2022-02-18

Family

ID=80279164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111388132.2A Pending CN114067915A (en) 2021-11-22 2021-11-22 scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder

Country Status (1)

Country Link
CN (1) CN114067915A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021075A1 (en) * 2022-07-29 2024-02-01 Oppo广东移动通信有限公司 Training method, model usage method, and wireless communication method and apparatus
CN116992299A (en) * 2023-09-28 2023-11-03 北京邮电大学 Training method, detecting method and device of blockchain transaction anomaly detection model
CN116992299B (en) * 2023-09-28 2024-01-05 北京邮电大学 Training method, detecting method and device of blockchain transaction anomaly detection model
CN117116350A (en) * 2023-10-25 2023-11-24 中国农业科学院深圳农业基因组研究所(岭南现代农业科学与技术广东省实验室深圳分中心) Correction method and device for RNA sequencing data, electronic equipment and storage medium
CN117116350B (en) * 2023-10-25 2024-02-27 中国农业科学院深圳农业基因组研究所(岭南现代农业科学与技术广东省实验室深圳分中心) Correction method and device for RNA sequencing data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111785329B (en) Single-cell RNA sequencing clustering method based on countermeasure automatic encoder
Ding et al. A survey on feature extraction for pattern recognition
Stuhlsatz et al. Feature extraction with deep neural networks by a generalized discriminant analysis
Yin The self-organizing maps: background, theories, extensions and applications
CN114067915A (en) scRNA-seq data dimensionality reduction method based on a deep adversarial variational autoencoder
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
Nejad et al. A new enhanced learning approach to automatic image classification based on Salp Swarm Algorithm
CN109389171B (en) Medical image classification method based on multi-granularity convolution noise reduction automatic encoder technology
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN111738351A (en) Model training method and device, storage medium and electronic equipment
CN106886793B (en) Hyperspectral image waveband selection method based on discrimination information and manifold information
Gu et al. A study of hierarchical correlation clustering for scientific volume data
Salazar On Statistical Pattern Recognition in Independent Component Analysis Mixture Modelling
CN113033567A (en) Oracle bone rubbing image character extraction method fusing segmentation network and generation network
Abdollahifard et al. Stochastic simulation of patterns using Bayesian pattern modeling
CN116152554A (en) Knowledge-guided small sample image recognition system
Ma et al. Joint-label learning by dual augmentation for time series classification
CN113421546A (en) Cross-tested multi-mode based speech synthesis method and related equipment
Shu et al. An anomaly detection method based on random convolutional kernel and isolation forest for equipment state monitoring
CN111401440B (en) Target classification recognition method and device, computer equipment and storage medium
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
CN114912109B (en) Abnormal behavior sequence identification method and system based on graph embedding
CN111858343A (en) Countermeasure sample generation method based on attack capability
Dong et al. Clustering human wrist pulse signals via multiple criteria decision making
CN112465054B (en) FCN-based multivariate time series data classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination