CN114022693A - Double-self-supervision-based single-cell RNA-seq data clustering method - Google Patents
- Publication number
- CN114022693A CN114022693A CN202111152906.1A CN202111152906A CN114022693A CN 114022693 A CN114022693 A CN 114022693A CN 202111152906 A CN202111152906 A CN 202111152906A CN 114022693 A CN114022693 A CN 114022693A
- Authority
- CN
- China
- Prior art keywords
- data
- neural network
- distribution
- cell rna
- graph
- Prior art date
- Legal status
- Granted
Classifications
- G06F18/23213 — Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. k-means
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
- G06N3/088 — Learning methods; non-supervised learning, e.g. competitive learning
- G16B40/30 — ICT specially adapted for biostatistics/bioinformatics machine learning; unsupervised data analysis
Abstract
The invention discloses a single-cell RNA-seq data clustering method based on double self-supervision. First, a deep neural network incorporating gene-ontology knowledge is constructed to extract features of single-cell RNA-seq data, and the data are reconstructed through a zero-inflated negative binomial (ZINB) distribution to reduce noise. Second, a graph structure is built with the uniform manifold approximation and projection (UMAP) technique, and a graph neural network is used to mine topological information among the data samples. The graph neural network and the deep neural network are then combined through a double self-supervision strategy. Finally, clustering of the single-cell RNA-seq data is achieved by minimizing a joint loss function comprising the deep neural network loss, the graph neural network loss, the double self-supervision loss, and a random Gaussian noise term. By applying self-supervision in the unsupervised setting, the method effectively addresses shortcomings of existing single-cell RNA-seq clustering methods, such as failure to learn topological information among the data and poor biological interpretability.
Description
Technical Field
The invention belongs to the technical field of single cell RNA-seq data analysis, and particularly relates to a double-self-supervision-based single cell RNA-seq data clustering method.
Background
Clustering of single-cell RNA-seq data plays an important role in research on cell heterogeneity and related topics. In the clustering problem, cells are grouped into different cell types according to their transcription profiles, each type having an expression profile distinct from the others. Through research on clustering methods for single-cell RNA-seq data, researchers can identify new cell populations in organisms, determine cell states, establish networks among cells, trace developmental lineages, and study responses in in-vitro and in-vivo experiments. Traditional clustering methods such as k-means, hierarchical clustering, and density-based clustering with noise (e.g. DBSCAN) are widely used, but single-cell RNA-seq data have unique characteristics that prevent these traditional methods from clustering them effectively.
Disclosure of Invention
To overcome the above technical problems, the invention provides a single-cell RNA-seq data clustering method based on double self-supervision, which integrates the structural information of single-cell RNA-seq data into the neural network layers and, through a double self-supervision strategy, integrates a conventional deep neural network and a graph neural network into the same model, enabling iterative optimization of the model as a whole. In this way, data structures from low order to high order are naturally combined with the multiple representations learned by the autoencoder. In constructing the autoencoder, the invention first builds the single-cell RNA-seq graph structure using the uniform manifold approximation and projection (UMAP) technique and reconstructs the sample data with a zero-inflated negative binomial (ZINB) distribution, which both denoises the single-cell RNA-seq data and improves the overall performance of the model, laying a foundation for subsequent effective clustering.
In order to achieve the purpose, the invention adopts the technical scheme that:
A single-cell RNA-seq data clustering method based on double self-supervision comprises the following steps:
1) pre-training the autoencoder in the deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases — GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322 — whose gene expression values are obtained from various tissue cells; further screening cells with normal gene expression counts, reading the raw single-cell RNA-seq data and performing normalization preprocessing, then feeding the processed single-cell RNA-seq data into the designed autoencoder for training to obtain a pre-trained model;
2) initializing a clustering center:
in the initialization stage of the experiment, the autoencoder obtained by pre-training learns a latent representation of the single-cell RNA-seq data; on the basis of this latent representation, the k-means algorithm initializes the cluster centers, the initialization is run 20 times, and the best solution is selected as the initial cluster centers;
3) randomly initializing the network parameters of the l-th layer:
initialization of the network parameters is important for training: it avoids gradient explosion and vanishing gradients, speeds up training, and accelerates network convergence; the network parameters of the l-th layer are initialized with the Xavier initialization method so that signals can propagate deeper through the neural network;
4) constructing the graph data structure:
the topological information of the single-cell RNA-seq data is learned by adding a graph neural network to the model, and a good graph data structure greatly promotes the learning of the graph neural network; these tasks are completed with the uniform manifold approximation and projection (UMAP) algorithm: first, the distance from each single-cell RNA-seq sample to its nearest neighbor is computed; next, the distance probabilities are computed; then, the matrix of a directed weighted graph is constructed, the adjacency matrix of the undirected graph is computed, and the K-nearest-neighbor graph structure of the original data is built;
5) iterative full training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology, and the data are effectively reconstructed with a zero-inflated negative binomial distribution, reducing noise and dimensionality; after the representation at the last layer of the autoencoder is obtained, the underlying data distribution of the samples and the normalized target distribution are computed; a transfer operator combines, layer by layer, the two representations learned by the deep neural network and the graph neural network, the learning of the graph neural network propagates forward continuously, and its low-dimensional distribution is computed; the target distribution supervises the learning of both networks via KL divergence; the loss functions of the three modules — graph neural network, deep neural network, and double self-supervision — are integrated as the overall loss function of the invention; the resulting new training data are fed back into the current model for training and parameter optimization, and iteration stops when the total loss function of the model converges;
6) returning the final clustering result:
through effective learning of the network, the single-cell RNA-seq data representation learned in the graph convolutional network contains two different types of information; the soft assignment values of the data distribution learned by the graph convolutional network are taken as the final clustering result to discover cell subtypes, providing support for subsequent early cancer detection and treatment.
The preprocessing of the single-cell RNA-seq data in step 1) comprises: first, retaining cells with normal gene expression counts; then normalizing the data for sequencing depth and gene length using log normalization.
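The filtering and log normalization described above can be sketched as follows. The patent does not state its exact filtering threshold, so `min_genes` is a hypothetical cutoff, and size factors are computed from library sizes as one common convention:

```python
import numpy as np

def preprocess(counts, min_genes=200):
    """Filter cells expressing too few genes, then size-factor + log normalize.

    A minimal sketch: min_genes is an assumed cutoff, not the patent's value.
    """
    # Keep cells with a "normal" number of expressed genes.
    expressed = (counts > 0).sum(axis=1)
    counts = counts[expressed >= min_genes]
    # Normalize each cell by its sequencing depth (library size), then log1p.
    size_factors = counts.sum(axis=1, keepdims=True) / np.median(counts.sum(axis=1))
    return np.log1p(counts / size_factors)
```

The log1p transform keeps zero counts at zero while compressing the heavy right tail of the expression values.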
The input dimension of the autoencoder in step 1) matches the dimensionality of the single-cell RNA-seq data used for training; the autoencoder has five layers, and the dimension of each layer in the graph neural network module matches the corresponding dimension in the autoencoder.
Step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, computing the distance $\rho_i$ to its first nearest neighbor; next, computing from $\rho_i$ the variance $\sigma_i$ of the distance probability; then computing the weights between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and further computing its directed adjacency matrix; finally, computing the adjacency matrix of the undirected graph from the directed adjacency matrix using the Hadamard product.
The distance from a data point to its first nearest neighbor is
$$\rho_i = \min_{j} d(x_i, x_j), \quad x_j \in N_k(x_i),$$
where k denotes the number of nearest neighbors considered (per the UMAP construction); $\sigma_i$ is found from
$$\sum_{j=1}^{k} \exp\!\Big(-\frac{\max(0,\, d(x_i,x_j)-\rho_i)}{\sigma_i}\Big) = \log_2 k.$$
In the directed weighted graph $\bar G$, the node set V is all single-cell RNA-seq data points, the edge set is $E = \{(x_i, x_j) \mid x_j \in N_k(x_i)\}$, and the weight between nodes is computed as
$$w(x_i, x_j) = \exp\!\Big(-\frac{\max(0,\, d(x_i,x_j)-\rho_i)}{\sigma_i}\Big).$$
After the matrix $\bar A$ of the directed weighted graph $\bar G$ is built, its directed adjacency matrix can be computed, and the adjacency matrix A of the undirected graph $G = (V, E)$ is obtained by
$$A = \bar A + \bar A^{\mathsf T} - \bar A \circ \bar A^{\mathsf T},$$
where $\circ$ denotes the Hadamard product.
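The UMAP-style graph construction can be sketched numerically as below. `fuzzy_knn_graph`, the binary search for $\sigma_i$, and the toy data are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def fuzzy_knn_graph(X, k=5, n_iter=64):
    """UMAP-style weighted k-NN graph (small-data sketch).

    rho_i  : distance from point i to its nearest neighbor
    sigma_i: found by binary search so the k weights of i sum to log2(k)
    A      : symmetrized adjacency, A = W + W.T - W * W.T (Hadamard product)
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])[1:k + 1]        # k nearest neighbors, skip self
        rho = D[i, order[0]]
        lo, hi = 1e-8, 1e4                       # binary search for sigma_i
        for _ in range(n_iter):
            sigma = (lo + hi) / 2.0
            s = np.exp(-np.maximum(D[i, order] - rho, 0.0) / sigma).sum()
            lo, hi = (sigma, hi) if s < np.log2(k) else (lo, sigma)
        W[i, order] = np.exp(-np.maximum(D[i, order] - rho, 0.0) / sigma)
    return W + W.T - W * W.T                     # fuzzy union -> undirected graph

rng = np.random.default_rng(0)
A = fuzzy_knn_graph(rng.normal(size=(30, 4)), k=5)
```

The final line implements the Hadamard-product symmetrization, so each edge weight stays in [0, 1] and the result is a valid undirected adjacency matrix.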
The iterative full training in step 5) specifically comprises: obtaining each layer representation of the deep neural network in combination with gene-ontology learning; computing the data distribution Q using the last-layer representation of the encoder; on the basis of Q, squaring and then normalizing by the soft cluster frequencies to compute the target distribution P; for each layer output of the encoder, fusing the representations of the deep neural network and the graph neural network with the transfer operator $\varepsilon$ and propagating forward to learn the next layer representation of the graph neural network; computing the low-dimensional distribution Z of the graph neural network; feeding the representation learned in the encoder into the decoder to reconstruct the original data; computing the three loss functions $L_{res}$, $L_{clu}$, $L_{gnn}$; computing the overall loss function L of the whole network structure; and updating the parameters of the whole network with the backpropagation algorithm until iteration stops.
For the i-th single-cell RNA sequencing sample and the j-th cluster center, the data distribution Q in the deep neural network uses Student's t distribution as its kernel to measure the similarity between the data representation $h_i$ and the cluster center $\mu_j$:
$$q_{ij} = \frac{\big(1 + \lVert h_i - \mu_j \rVert^2 / v\big)^{-\frac{v+1}{2}}}{\sum_{j'} \big(1 + \lVert h_i - \mu_{j'} \rVert^2 / v\big)^{-\frac{v+1}{2}}},$$
where $h_i$ is the i-th row of the autoencoder representation, $\mu_j$ is obtained by k-means initialization during autoencoder pre-training, v is the degree of freedom of the Student's t distribution, and $q_{ij}$ denotes the probability of assigning the i-th sample to the j-th cluster; $Q = [q_{ij}]$ is regarded as the distribution of all sample assignments.
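The Student's t soft assignment can be sketched as follows; the embedding and centers here are synthetic placeholders:

```python
import numpy as np

def soft_assignment(H, mu, v=1.0):
    """Student's t soft assignment between embeddings h_i and centers mu_j:
    q_ij ∝ (1 + ||h_i - mu_j||^2 / v)^(-(v+1)/2), normalized over clusters j."""
    dist2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = (1.0 + dist2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)                  # rows sum to 1

rng = np.random.default_rng(0)
Q = soft_assignment(rng.normal(size=(100, 8)), rng.normal(size=(4, 8)))
```

Each row of `Q` is a probability distribution over the cluster centers, with heavier tails than a Gaussian kernel would give.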
The target distribution P supervises the other two distributions:
$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad f_j = \sum_i q_{ij},$$
where $f_j$ denotes the soft cluster frequency. For each row $q_i$ of Q, the entries are first squared and then normalized by the soft cluster frequencies to obtain each $p_i$; $P = [p_{ij}]$ is regarded as the target distribution of all sample assignments.
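The sharpening step from Q to P can be sketched directly from the formula; the small Q matrix is illustrative:

```python
import numpy as np

def target_distribution(Q):
    """Sharpened target: p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    with soft cluster frequency f_j = sum_i q_ij."""
    f = Q.sum(axis=0)                  # soft cluster frequencies
    weight = Q ** 2 / f                # square, then divide by frequency
    return weight / weight.sum(axis=1, keepdims=True)

Q = np.array([[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]])
P = target_distribution(Q)
```

Squaring pushes each assignment toward its dominant cluster, so confident assignments become even more confident — the "high confidence" property the text relies on.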
If the encoder has L layers in total and l indexes a layer, the data learned by the l-th encoder layer is represented as
$$H^{(l)} = \phi\big(W_e^{(l)} H^{(l-1)} + b_e^{(l)}\big),$$
where $\phi$ denotes the activation function of each layer, and $W_e^{(l)}$ and $b_e^{(l)}$ are, respectively, the weight parameters and bias term learned by the l-th layer in the encoder; $H^{(0)}$ is defined as the original data X. The decoder of the model follows the encoder and reconstructs the data through several neural network layers,
$$H_d^{(l)} = \phi\big(W_d^{(l)} H_d^{(l-1)} + b_d^{(l)}\big),$$
where $W_d^{(l)}$ and $b_d^{(l)}$ are, respectively, the weight parameters and bias term learned by the l-th layer in the decoder;
since the representation $H^{(l-1)}$ learned by the autoencoder can reconstruct the single-cell RNA-seq data, it contains information different from the representation $Z^{(l-1)}$ learned by the graph neural network; combining the two representations gives
$$\tilde Z^{(l-1)} = (1-\varepsilon)\, Z^{(l-1)} + \varepsilon\, H^{(l-1)},$$
where $\varepsilon$ is the transfer operator, set to 0.5, which connects the autoencoder in the deep neural network module and the graph convolutional network in the graph neural network module layer by layer;
$\tilde Z^{(l-1)}$ is used as the input to the l-th layer of the graph convolutional network to generate $Z^{(l)}$:
$$Z^{(l)} = \phi\big(\tilde D^{-\frac12} \tilde A\, \tilde D^{-\frac12}\, \tilde Z^{(l-1)} W^{(l-1)}\big),$$
where $\tilde A$ denotes the adjacency matrix with self-loops and $\tilde D$ its degree matrix. Through the normalized adjacency matrix $\tilde D^{-1/2} \tilde A \tilde D^{-1/2}$, the representation $H^{(l-1)}$ learned by the encoder network is propagated; because each autoencoder layer learns different information, the information of every layer is integrated into the graph convolutional network, with L integration processes run in total. The last layer of the graph neural network module is a multi-classification (softmax) layer, whose final result $z_{ij} \in Z$ is the probability that the i-th sample belongs to the j-th cluster center, so Z is a probability distribution;
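One fused propagation step can be sketched as below. `gcn_layer_with_fusion`, the random graph, and the layer sizes are illustrative assumptions, not the patent's configuration:

```python
import numpy as np

def gcn_layer_with_fusion(Z_prev, H_prev, A, W, eps=0.5):
    """One fused graph-convolution step (a sketch):
    Z~ = (1 - eps) * Z + eps * H                       # fuse GCN and AE layers
    Z_next = ReLU(D~^-1/2 (A + I) D~^-1/2 · Z~ · W)    # normalized propagation
    """
    Z_tilde = (1.0 - eps) * Z_prev + eps * H_prev
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ Z_tilde @ W, 0.0)

rng = np.random.default_rng(0)
n, d_in, d_out = 20, 16, 8
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.maximum(A, A.T)                                 # symmetric toy adjacency
Z = gcn_layer_with_fusion(rng.normal(size=(n, d_in)),
                          rng.normal(size=(n, d_in)),
                          A, rng.normal(size=(d_in, d_out)))
```

With `eps=0.5` the autoencoder and graph representations contribute equally, matching the transfer-operator setting stated above.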
the final output of the decoder is the reconstructed data. Recent research on single-cell RNA-seq data shows that such data are best modeled by a negative binomial (NB) distribution,
$$\mathrm{NB}(X \mid \mu, \theta) = \frac{\Gamma(X+\theta)}{X!\,\Gamma(\theta)} \Big(\frac{\theta}{\theta+\mu}\Big)^{\theta} \Big(\frac{\mu}{\theta+\mu}\Big)^{X}.$$
Because the dispersion of single-cell RNA-seq data is usually highly skewed, the variance tends to exceed the mean, so a Poisson approximation is unsuitable; moreover, the variance of single-cell RNA-seq data typically changes with the mean. In addition, single-cell RNA-seq data contain a particularly large number of zeros: a zero in the gene expression data may come from a gene that is genuinely not expressed in the biological process (a true zero) or from technical loss during sequencing (a dropout zero). To better capture single-cell RNA-seq data, the invention improves the conventional denoising autoencoder by adding a zero-inflation factor on top of the NB model — which can also be understood as adding an impulse function at zero — i.e., modeling the single-cell RNA-seq data with a zero-inflated negative binomial (ZINB) distribution:
$$\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1-\pi)\,\mathrm{NB}(X \mid \mu, \theta).$$
Three independent fully connected layers are appended after the last hidden layer, so the whole autoencoder has three outputs that learn, respectively, the zero-inflation factor, the mean, and the dispersion of the ZINB distribution. $L_{res}$ reduces the error between the decoder's reconstructed data and the original data:
$$L_{res} = -\log\big(\mathrm{ZINB}(X \mid \pi, \mu, \theta)\big);$$
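The ZINB negative log-likelihood can be sketched numerically from the formulas above; the count vector and parameter values are illustrative, and a stabilizing `eps` is an added assumption:

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Negative log-likelihood of the zero-inflated negative binomial (a sketch).

    NB(x|mu,theta)  = Gamma(x+theta)/(x! Gamma(theta))
                      * (theta/(theta+mu))^theta * (mu/(theta+mu))^x
    ZINB(x|pi,mu,theta) = pi * [x == 0] + (1 - pi) * NB(x|mu,theta)
    """
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    lik = pi * (x == 0) + (1.0 - pi) * np.exp(log_nb)
    return -np.log(lik + eps).mean()

x = np.array([0.0, 0.0, 3.0, 7.0])      # toy counts with excess zeros
loss = zinb_nll(x, pi=0.3, mu=2.5, theta=1.0)
```

In practice pi, mu, and theta would come from the three fully connected output heads rather than being fixed scalars.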
By minimizing the KL divergence between the Q distribution and the P distribution, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task, drawing the data closer to the cluster centers. Because the target distribution P is obtained by squaring and normalizing the cluster distribution Q, the single-cell RNA-seq data points under P have high confidence; during the experiment, P serves as the supervisory information for Q, continuously optimizing the network model. This process can be regarded as a self-supervision strategy;
the objective function of the graph neural network module likewise optimizes the network via KL divergence, which makes the optimization smoother and prevents large disturbances to the representation learning of the single-cell RNA-seq data. The two different neural network models are thereby integrated into the same parameter-update framework: the target distribution P supervises both the cluster distribution Q and the distribution Z learned by the graph neural network, jointly improving the data representation and the clustering performance of the whole network.
$$L = L_{res} + \alpha L_{clu} + \beta L_{gnn} + \gamma \lVert B \rVert,$$
where $\alpha > 0$ is a hyper-parameter balancing clustering optimization of the original data against data-structure reconstruction, $\beta > 0$ is a coefficient controlling the interference of the graph neural network module with the embedding space, B denotes the random Gaussian noise added to the neural network, and $\gamma$ is a parameter adjusting the influence of this noise on the model; the whole model is updated end-to-end by optimizing this clustering loss function.
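The combined objective can be sketched as below, with $L_{clu}$ and $L_{gnn}$ written as the KL terms the text describes (P supervising Q and Z). The hyper-parameter defaults and the synthetic Q, Z, and B are illustrative assumptions:

```python
import numpy as np

def kl_div(P, Q, eps=1e-10):
    """KL(P || Q) = sum_ij p_ij log(p_ij / q_ij), summed over row distributions."""
    return (P * np.log((P + eps) / (Q + eps))).sum()

def total_loss(L_res, P, Q, Z, B, alpha=0.1, beta=0.01, gamma=0.01):
    """Overall objective sketch: L = L_res + a*KL(P||Q) + b*KL(P||Z) + g*||B||."""
    return (L_res + alpha * kl_div(P, Q) + beta * kl_div(P, Z)
            + gamma * np.linalg.norm(B))

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(3), size=50)        # DNN soft assignments (toy)
Z = rng.dirichlet(np.ones(3), size=50)        # GNN soft assignments (toy)
P = Q ** 2 / Q.sum(0); P /= P.sum(1, keepdims=True)   # target distribution
L = total_loss(L_res=1.0, P=P, Q=Q, Z=Z, B=rng.normal(size=(50, 3)))
```

Because every row of P, Q, and Z is a probability distribution, both KL terms are non-negative, so the total loss is bounded below by the reconstruction term.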
The soft distribution value of the Z distribution is used as a final clustering result, and as the data learned in the graph convolution network contains two different types of information, a label is set for the ith sample data:
the invention has the beneficial effects that:
the autoencoder used by the invention not only reduces the dimensionality of the single-cell RNA-seq data but also effectively learns its representation; reconstructing the data with the zero-inflated negative binomial distribution reduces the influence of data noise on the clustering result; the graph neural network, combined with the uniform manifold approximation and projection technique, learns the topological information among the data; and adding biological prior knowledge such as the gene ontology while constructing the encoder improves the biological interpretability of the model.
Drawings
FIG. 1 is a general flow chart of a single-cell RNA-seq data clustering method based on double self-supervision provided by the invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
As shown in FIG. 1, the invention improves single-cell RNA-seq data clustering through a double self-supervision strategy in six steps: pre-training the autoencoder in the deep neural network module, initializing the cluster centers, randomly initializing the network parameters of the l-th layer, constructing the graph data structure, iterative full training, and returning the final clustering result.
The invention provides a single-cell RNA-seq data clustering method based on double self-supervision. Five public data sets downloaded from the ArrayExpress and GEO databases — GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322 — are selected to verify the effectiveness of the invention; their gene expression values are obtained from various tissue cells. Cells with normal gene expression counts are further screened, and the data are normalized for sequencing depth and gene length using log normalization. The normalized data serve as the initial input data for the embodiment of the invention. The invention comprises the following steps:
Step one: pre-train the autoencoder, using good reconstruction of the original single-cell RNA-seq data as the indication that pre-training is complete. All selected data are trained for 30 epochs, with the learning rate set to 0.001 and the hyper-parameters set to α = 0.1, β = 0.01, and γ = 0.01. Through this training, the parameters learned by the encoder network effectively reduce the dimensionality of and extract features from the original data. The latent data representation already contains the information needed to reconstruct the original data and can be restored to the original data space by the decoding part of the autoencoder.
Step two: initialize the cluster centers. In the initialization stage of the experiment, the k-means algorithm initializes the cluster centers 20 times, and the best solution is selected as the initial cluster centers.
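The 20-restart k-means initialization can be sketched with scikit-learn. The latent matrix, cluster count, and dimensions here are synthetic placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical latent representation from the pre-trained encoder:
# 200 cells embedded in a 10-dimensional space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))

# n_init=20 mirrors the 20 initializations described above; scikit-learn
# keeps the run with the lowest inertia, i.e. the "best solution".
kmeans = KMeans(n_clusters=5, n_init=20, random_state=0).fit(latent)
centers = kmeans.cluster_centers_   # initial cluster centers mu_j
```

The resulting `centers` would seed the $\mu_j$ used by the Student's t assignment in training.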
Step three: randomly initialize the network parameters of the l-th layer with the Xavier initialization method.
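Xavier (Glorot) uniform initialization, named in step 3) above, can be sketched as follows; the layer sizes are illustrative:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: W ~ U(-a, a) with a = sqrt(6/(fan_in+fan_out)).

    Keeping activation variance roughly constant across layers helps signals
    propagate deeper, as the description notes.
    """
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_uniform(256, 64, rng)   # e.g. a 256 -> 64 encoder layer
```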
Step four: construct the graph data structure, using the uniform manifold approximation and projection technique on the single-cell RNA-seq data. The specific steps are as follows:
Step 1: compute the distance from each data point to its first nearest neighbor, $\rho_i = \min_j d(x_i, x_j)$ over $x_j \in N_k(x_i)$, where k denotes the number of nearest neighbors considered;
Step 2: from $\rho_i$, compute $\sigma_i$ by solving $\sum_{j=1}^{k} \exp\big(-\max(0,\, d(x_i,x_j)-\rho_i)/\sigma_i\big) = \log_2 k$;
Step 3: for the directed weighted graph $\bar G$, the node set V is all data points and the edge set is $E = \{(x_i, x_j) \mid x_j \in N_k(x_i)\}$; the weight between nodes is computed as $w(x_i, x_j) = \exp\big(-\max(0,\, d(x_i,x_j)-\rho_i)/\sigma_i\big)$;
Step 4: after the matrix $\bar A$ of the directed weighted graph $\bar G$ is built, its directed adjacency matrix can be computed, and the adjacency matrix of the undirected graph is obtained as $A = \bar A + \bar A^{\mathsf T} - \bar A \circ \bar A^{\mathsf T}$.
And step five, carrying out iterative full-scale training. The method comprises the following specific steps:
step 1, learning by combining terms in a Gene Ontology (GO) to obtain each layer of expression of a deep neural network;
step 2, calculating data distribution Q by using the last layer representation of the encoder;
step 3, on the basis of the data distribution Q, performing quadratic power calculation and then performing normalization according to each soft clustering frequency to calculate a target distribution P;
step 4, aiming at the output of each layer of the encoder, fusing the representation of each layer of the deep neural network and the graph neural network by using a transfer operator epsilon, and further propagating forward to learn the representation of the next layer of the graph neural network;
step 5, calculating the low-dimensional distribution Z of the graph neural network;
step 6, continuously inputting the representation obtained by learning in the encoder into a decoder to reconstruct the original data;
step 7, respectively calculating three loss functions L in the methodres,Lclu,Lgnn(ii) a Calculating an overall loss function L of the whole network structure;
step 8, updating parameters of the whole network by using a back propagation algorithm in the whole network framework until iteration stops;
Step six: return the final clustering result. The invention selects the soft assignment values of the Z distribution as the final clustering result; since the data learned in the graph convolutional network contain two different types of information, the label for the i-th sample is set as $y_i = \arg\max_j z_{ij}$. For each data set, 10 experiments are run in total and the average is taken as the final result. After the final clustering result is obtained, the clustering quality is verified with four measures: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Homogeneity, and Completeness. The results show that, compared with traditional clustering methods, the invention achieves better clustering of single-cell RNA-seq data.
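The four evaluation measures are available in scikit-learn; the toy labels below are illustrative. Note that a predicted partition identical to the ground truth up to a permutation of label ids scores 1.0 on all four:

```python
from sklearn.metrics import (normalized_mutual_info_score, adjusted_rand_score,
                             homogeneity_score, completeness_score)

# Hypothetical ground-truth cell types vs. predicted cluster labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # same partition, permuted label ids

nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
hom = homogeneity_score(y_true, y_pred)
com = completeness_score(y_true, y_pred)
```

All four metrics are invariant to relabeling, which is essential for evaluating clustering, where cluster ids carry no meaning.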
The invention has the following characteristics:
1. the influence of high dimensionality and large noise of single cell RNA-seq data on a clustering result is reduced;
2. in the clustering process of the single cell RNA-seq data, the data representation can be effectively learned, and the method has strong data representation capability;
3. the characteristic information of the data can be learned, and the topological structure information among the data can be learned;
4. has good biological interpretability.
The method belongs to self-supervised learning within the unsupervised field: the data input to the model carry no labels, but labels — also called "pseudo-labels" — are constructed artificially from the structure or characteristics of the data. Once the data have such labels, a learning mechanism similar to supervised learning can be applied to train the deep neural network.
In the process of carrying out deep clustering on single-cell RNA-seq data, in addition to effectively reducing the dimensionality and noise of the high-dimensional single-cell RNA-seq data, the invention designs a double self-supervision strategy, fuses the deep neural network and the graph neural network into a unified framework, and improves model performance by adding biological prior knowledge and adopting the uniform manifold approximation and projection technique.
Claims (10)
1. A single-cell RNA-seq data clustering method based on double self-supervision, characterized by comprising the following steps:
1) pre-training the self-encoder in the deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases, whose gene expression values are obtained from various tissue cells: GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322; further screening out cells with normal gene expression quantity, reading the original single-cell RNA-seq data and carrying out standardized preprocessing, then inputting the processed single-cell RNA-seq data into the designed self-encoder for training and obtaining a pre-training model;
2) initializing a clustering center:
in the initialization stage of an experiment, a self-encoder obtained by pre-training can learn the potential representation of single-cell RNA-seq data, a k-means algorithm is used for initializing a clustering center on the basis of the potential representation, the initialization is carried out for 20 times, and the optimal solution is selected as the initial clustering center;
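The initialization described in step 2) can be sketched with scikit-learn's KMeans, whose `n_init` parameter restarts the algorithm and keeps the solution with the lowest inertia; the latent matrix below is a random stand-in for the self-encoder output:

```python
# Sketch: initializing cluster centers with k-means on a latent representation,
# restarted 20 times with the best solution kept, as step 2) describes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 10))   # placeholder for the encoder's latent output
n_clusters = 5

# n_init=20 runs k-means 20 times and keeps the run with the lowest inertia.
km = KMeans(n_clusters=n_clusters, n_init=20, random_state=0).fit(latent)
initial_centers = km.cluster_centers_   # shape: (n_clusters, latent_dim)
```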
3) randomly initializing network parameters of the l layer;
initializing the network parameters of the l layers using the Xavier initialization method, so that signals can propagate deeper in the neural network;
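Xavier (Glorot) uniform initialization draws each weight matrix from a uniform distribution whose range depends on the layer's fan-in and fan-out; a minimal sketch with illustrative layer sizes:

```python
# Sketch: Xavier uniform initialization for an l-layer network, as step 3)
# describes. Layer sizes are illustrative placeholders.
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """W ~ U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
layer_sizes = [2000, 512, 256, 64]   # placeholder encoder dimensions
weights = [xavier_uniform(m, n, rng)
           for m, n in zip(layer_sizes, layer_sizes[1:])]
```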
4) construct graph data structure:
constructing K-nearest-neighbor graph structure data of the original data using the uniform manifold approximation and projection algorithm: first calculating the distance from each single-cell RNA-seq sample data point to its nearest neighbor, then calculating the distance probabilities, then constructing the matrix of a directed weighted graph, calculating the adjacency matrix of the undirected graph, and so constructing the K-nearest-neighbor graph structure data of the original data;
5) iterative full training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology; the zero-inflated negative binomial distribution is used to effectively reconstruct the single-cell RNA-seq data, reducing data noise and dimensionality; after the representation of the data in the last layer of the self-encoder is obtained, the data distribution of the sample data in the latent layer and the normalized target distribution are further calculated; a transfer operator is used to combine, layer by layer, the two representations learned by the deep neural network and the graph neural network, the learning of the graph neural network is continuously propagated forward, and the low-dimensional distribution of the graph neural network is calculated; the target distribution supervises the learning processes of the two neural networks in the form of KL divergence; the loss functions of the graph neural network, the deep neural network, and the double self-supervision module are integrated as the overall loss function of the invention; the new training data obtained are input into the current model again for training to optimize the model parameters, and iteration stops when the total loss function of the model converges;
6) and returning a final clustering result:
through effective learning of the network, the single-cell RNA-seq data learned in the graph convolution network comprises two different types of information; the soft assignment value of the data distribution learned by the graph convolution network is used as the final clustering result to discover cell subtypes, thereby assisting subsequent early cancer discovery and treatment.
2. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step of preprocessing the single-cell RNA-seq data in the step 1) comprises: firstly, screening out cells with normal gene expression quantity; then, the data was normalized for sequencing depth and gene length using a logarithmic normalization method.
3. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the self-encoder input dimension in step 1) is consistent with the dimension of the single-cell RNA-seq data used for training, and the dimension of each layer in the graph neural network module is consistent with the dimension in the self-encoder.
4. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, calculating the distance ρ_i between the data point and its first nearest neighbor; then, according to ρ_i, calculating the variance σ_i of the distance probability; then, calculating the weight values between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and further calculating its directed adjacency matrix; and, from the directed adjacency matrix, calculating the adjacency matrix of the undirected graph in combination with the Hadamard product operation.
5. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 4, wherein the distance from a data point x_i to its first nearest neighbor is ρ_i = min{ d(x_i, x_ij) | 1 ≤ j ≤ k, d(x_i, x_ij) > 0 }, where k denotes the number of nearest neighbors; σ_i can be found from ρ_i through the formula Σ_{j=1..k} exp( −max(0, d(x_i, x_ij) − ρ_i) / σ_i ) = log₂(k); the node set V of the directed weighted graph Ḡ is all the single-cell RNA-seq data, and the edge set is E = { (x_i, x_ij) | 1 ≤ j ≤ k, 1 ≤ i ≤ N }; the weight w between nodes is calculated using w(x_i, x_ij) = exp( −max(0, d(x_i, x_ij) − ρ_i) / σ_i ); after the matrix of the directed weighted graph Ḡ is built, the directed adjacency matrix B of Ḡ can be further calculated, and the adjacency matrix A of the undirected graph G = (V, E) is calculated through A = B + Bᵀ − B ∘ Bᵀ, where ∘ denotes the Hadamard product.
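The graph construction of claims 4 and 5 can be sketched as follows; the neighborhood size and binary-search depth are illustrative choices, not values fixed by the patent:

```python
# Hedged sketch of the UMAP-style k-nearest-neighbour graph construction:
# per-point rho_i and sigma_i, directed fuzzy weights, and the symmetrized
# undirected adjacency A = B + B^T - B∘B^T.
import numpy as np

def knn_graph(X, k=5, n_iter=64):
    n = X.shape[0]
    # pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # indices of the k nearest neighbours of each point (excluding itself)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]
    B = np.zeros((n, n))
    for i in range(n):
        dists = d[i, nbrs[i]]
        rho = dists[dists > 0].min()          # distance to first nearest neighbour
        # binary search for sigma_i so the smoothed degree equals log2(k)
        lo, hi, target = 1e-10, 1e4, np.log2(k)
        for _ in range(n_iter):
            mid = (lo + hi) / 2.0
            s = np.exp(-np.maximum(dists - rho, 0.0) / mid).sum()
            if s > target:
                hi = mid
            else:
                lo = mid
        B[i, nbrs[i]] = np.exp(-np.maximum(dists - rho, 0.0) / mid)
    # probabilistic symmetrization: A = B + B^T - B ∘ B^T
    return B + B.T - B * B.T

A = knn_graph(np.random.default_rng(0).normal(size=(30, 8)), k=5)
```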
6. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step 5) iterative full-scale training specifically comprises: obtaining each layer of the deep neural network in combination with gene ontology learning; calculating the data distribution Q using the last-layer representation of the encoder; on the basis of the data distribution Q, squaring and normalizing by each soft clustering frequency to calculate the target distribution P; for each layer output of the encoder, fusing the corresponding layer representations of the deep neural network and the graph neural network using the transfer operator ε, and propagating forward to learn the next-layer representation of the graph neural network; calculating the low-dimensional distribution Z of the graph neural network; continuing to input the representation learned in the encoder into the decoder to reconstruct the original data; respectively calculating the three loss functions L_res, L_clu and L_gnn of the method; calculating the overall loss function L of the whole network structure; and updating the parameters of the whole network using the back-propagation algorithm until iteration stops.
7. The method of claim 6, wherein, for the ith single-cell RNA sequencing sample and the jth cluster center, the data distribution Q in the deep neural network module uses the Student's t-distribution as a kernel to measure the similarity between the data representation h_i and the cluster center μ_j: q_ij = (1 + ‖h_i − μ_j‖²/ν)^(−(ν+1)/2) / Σ_{j′} (1 + ‖h_i − μ_{j′}‖²/ν)^(−(ν+1)/2), where h_i denotes the ith row of the self-encoder representation, μ_j is obtained by k-means initialization during pre-training of the self-encoder, ν is the degree of freedom of the Student's t-distribution, and q_ij denotes the probability of assigning the ith sample datum to the jth cluster, so that Q = [q_ij] is seen as the distribution of all sample assignments.
8. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the target distribution P plays the role of supervising the other two distributions: p_ij = (q_ij² / f_j) / (Σ_{j′} q_{ij′}² / f_{j′}), where f_j = Σ_i q_ij denotes the soft clustering frequency; each q_i of Q is first squared and then normalized by each soft clustering frequency to obtain each p_i, so that P = [p_ij] is seen as the target distribution of all sample assignments.
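The distributions of claims 7 and 8 can be sketched in NumPy; the shapes and the degree of freedom ν = 1 are illustrative:

```python
# Hedged sketch of the soft assignment Q (Student's t kernel, claim 7) and the
# sharpened target distribution P (claim 8). Shapes and names are illustrative.
import numpy as np

def soft_assignment(H, centers, nu=1.0):
    """q_ij ∝ (1 + ||h_i - mu_j||^2 / nu)^(-(nu+1)/2), rows sum to 1."""
    sq = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with f_j = sum_i q_ij."""
    weight = Q ** 2 / Q.sum(axis=0)          # square, divide by soft frequency
    return weight / weight.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = soft_assignment(rng.normal(size=(50, 10)), rng.normal(size=(4, 10)))
P = target_distribution(Q)
```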
9. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the encoder has L layers in total and l denotes the index of a given layer; the data learned through the lth layer of the encoder is represented as H^(l) = φ(W_e^(l) H^(l−1) + b_e^(l)), where φ denotes the activation function of each layer, W_e^(l) and b_e^(l) are respectively the weight parameters and bias term learned by the lth layer of the encoder, and H^(0) is defined as the original data X; the decoder of the model, placed immediately after the encoder, reconstructs the data through several neural network layers, H^(l) = φ(W_d^(l) H^(l−1) + b_d^(l)), where W_d^(l) and b_d^(l) are respectively the weight parameters and bias term learned by the lth layer of the decoder;
since the representation H^(l−1) learned by the self-encoder can reconstruct the single-cell RNA-seq data, it contains information different from the representation Z^(l−1) learned by the graph neural network, so the two representations are combined to yield Z̃^(l−1) = (1 − ε)Z^(l−1) + εH^(l−1), where ε is a transfer operator set to 0.5, which connects the self-encoder in the deep neural network module and the graph convolution network in the graph neural network module layer by layer;
Z̃^(l−1) is used as the input of the lth layer of the graph convolution network to generate Z^(l) = φ(D̃^(−1/2) Ã D̃^(−1/2) Z̃^(l−1) W^(l−1)), where Ã = A + I denotes the adjacency matrix with self-loops added and D̃ denotes its degree matrix; through the normalized adjacency matrix, the representation H^(l−1) learned by the encoder network is integrated into the graph convolution network; since the information learned by each layer of the self-encoder differs, L information-integration processes are run in total, and the last layer in the graph neural network module is a multi-classification layer: Z = softmax(D̃^(−1/2) Ã D̃^(−1/2) Z^(L) W^(L)); the final result z_ij ∈ Z denotes the probability that the ith sample datum belongs to the jth cluster center, and Z is a probability distribution;
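A single information-integration step (transfer-operator fusion followed by one normalized graph convolution) can be sketched as follows; the weight matrix and graph below are random stand-ins:

```python
# Hedged sketch of one fused layer: Z̃ = (1-ε)Z + εH, then a normalized
# graph convolution Z^(l) = act(D̃^(-1/2) Ã D̃^(-1/2) Z̃ W).
import numpy as np

def normalized_adj(A):
    """Compute D̃^(-1/2) (A + I) D̃^(-1/2), the renormalized adjacency."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_fused_layer(Z_prev, H_prev, A_hat, W, eps=0.5, act=np.tanh):
    """One layer: fuse via the transfer operator, then graph-convolve."""
    Z_tilde = (1.0 - eps) * Z_prev + eps * H_prev   # Z̃ = (1-ε)Z + εH
    return act(A_hat @ Z_tilde @ W)

rng = np.random.default_rng(0)
n, d_in, d_out = 20, 16, 8
M = rng.random((n, n)) < 0.2
A = np.maximum(M, M.T).astype(float)               # random symmetric graph
np.fill_diagonal(A, 0.0)
Z_next = gcn_fused_layer(rng.normal(size=(n, d_in)),
                         rng.normal(size=(n, d_in)),
                         normalized_adj(A),
                         0.1 * rng.normal(size=(d_in, d_out)))
```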
the final output of the decoder is the reconstructed data; a zero-inflation factor is added on the basis of the negative binomial (NB) distribution model, i.e. an impulse function is added at the zero point, so that the single-cell RNA-seq data are modeled with the Zero-Inflated Negative Binomial (ZINB) distribution, formulated as ZINB(X | π, μ, θ) = π δ₀(X) + (1 − π) NB(X | μ, θ); three independent fully-connected layers are appended after the last hidden layer, so the whole self-encoder has three outputs that respectively learn the zero-inflation factor π, mean μ, and dispersion θ of the zero-inflated negative binomial distribution; L_res is used to reduce the error between the decoder's reconstructed data and the original data, L_res = −log(ZINB(X | π, μ, θ));
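The ZINB reconstruction loss L_res can be sketched as a negative log-likelihood in NumPy/SciPy; the mean/dispersion parameterization of NB below is the standard one, and the sample values are illustrative:

```python
# Hedged sketch of the ZINB negative log-likelihood used as L_res. NB is
# parameterized by mean mu and dispersion theta; pi is the zero-inflation
# probability. Sample counts are illustrative.
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Mean negative log-likelihood of ZINB(x | pi, mu, theta)."""
    # log NB(x | mu, theta) in the mean/dispersion parameterization
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # zero-inflated mixture: pi * delta_0(x) + (1 - pi) * NB(x | mu, theta)
    log_zinb = np.where(
        x < 0.5,
        np.log(pi + (1.0 - pi) * np.exp(log_nb) + eps),
        np.log(1.0 - pi + eps) + log_nb,
    )
    return -log_zinb.mean()

x = np.array([0.0, 0.0, 3.0, 7.0])     # illustrative counts
loss = zinb_nll(x, pi=0.3, mu=4.0, theta=2.0)
```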
by minimizing the KL divergence between the Q distribution and the P distribution, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task; since P is obtained by squaring and normalizing the clustering distribution Q, the single-cell RNA-seq data points under P have high confidence, and during the experiment P is used as the supervision information of Q to continuously optimize the network model, a process that can be regarded as a self-supervision strategy;
the objective function of the graph neural network module also optimizes the network in the form of KL divergence, so that the target distribution P supervises both the clustering distribution Q and the distribution Z learned by the graph neural network, jointly improving the data representation and clustering performance of the whole network;
L = L_res + αL_clu + βL_gnn + γ‖B‖, where α > 0 is a hyper-parameter that balances the clustering optimization of the original data against the reconstruction of the data structure, β > 0 is a coefficient controlling the interference of the graph neural network module with the embedding space, B denotes the random Gaussian noise added to the neural network, and γ adjusts the influence of that noise on the model; through optimization of the clustering loss function, the whole model is updated in an end-to-end manner.
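The overall objective can be sketched by combining the two KL-divergence terms with a reconstruction value; the coefficients α, β, γ and the scalar stand-ins below are illustrative, not the patent's tuned values:

```python
# Hedged sketch of the double self-supervision losses: L_clu = KL(P || Q) and
# L_gnn = KL(P || Z), combined into the overall objective. The reconstruction
# term and noise norm are stand-in scalars here.
import numpy as np

def kl_div(P, Q, eps=1e-12):
    """KL(P || Q) summed over samples and clusters."""
    return float((P * np.log((P + eps) / (Q + eps))).sum())

def total_loss(L_res, P, Q, Z, noise_norm, alpha=0.1, beta=0.01, gamma=0.001):
    L_clu = kl_div(P, Q)     # target distribution supervises the autoencoder branch
    L_gnn = kl_div(P, Z)     # and the graph neural network branch
    return L_res + alpha * L_clu + beta * L_gnn + gamma * noise_norm

rng = np.random.default_rng(0)
Q = rng.random((50, 4)); Q /= Q.sum(axis=1, keepdims=True)
Z = rng.random((50, 4)); Z /= Z.sum(axis=1, keepdims=True)
P = Q ** 2 / Q.sum(axis=0); P /= P.sum(axis=1, keepdims=True)
L = total_loss(L_res=1.0, P=P, Q=Q, Z=Z, noise_norm=0.5)
```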
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152906.1A CN114022693B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on double self-supervision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114022693A true CN114022693A (en) | 2022-02-08 |
CN114022693B CN114022693B (en) | 2024-02-27 |
Family
ID=80055158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152906.1A Active CN114022693B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on double self-supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114022693B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223657A (en) * | 2022-09-20 | 2022-10-21 | 吉林农业大学 | Medicinal plant transcription regulation and control map prediction method |
CN115240772A (en) * | 2022-08-22 | 2022-10-25 | 南京医科大学 | Method for analyzing active pathway in unicellular multiomics based on graph neural network |
CN114462548B (en) * | 2022-02-23 | 2023-07-18 | 曲阜师范大学 | Method for improving accuracy of single-cell deep clustering algorithm |
CN116452910A (en) * | 2023-03-28 | 2023-07-18 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116844649A (en) * | 2023-08-31 | 2023-10-03 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259979A (en) * | 2020-02-10 | 2020-06-09 | 大连理工大学 | Deep semi-supervised image clustering method based on label self-adaptive strategy |
US20200218971A1 (en) * | 2016-12-27 | 2020-07-09 | Obschestvo S Ogranichennoy Otvetstvennostyu "Vizhnlabs" | Training of deep neural networks on the basis of distributions of paired similarity measures |
CN111785329A (en) * | 2020-07-24 | 2020-10-16 | 中国人民解放军国防科技大学 | Single-cell RNA sequencing clustering method based on confrontation automatic encoder |
Non-Patent Citations (2)
Title |
---|
REN WEI: "Intrusion Detection Method Based on Sparse Autoencoder Deep Neural Network", Mobile Communications, no. 08, 15 August 2018 (2018-08-15) *
GAO MEIJIA: "A Single-cell RNA-seq Data Preprocessing Algorithm Based on Loess Regression Weighting", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114462548B (en) * | 2022-02-23 | 2023-07-18 | 曲阜师范大学 | Method for improving accuracy of single-cell deep clustering algorithm |
CN115240772A (en) * | 2022-08-22 | 2022-10-25 | 南京医科大学 | Method for analyzing active pathway in unicellular multiomics based on graph neural network |
CN115240772B (en) * | 2022-08-22 | 2023-08-22 | 南京医科大学 | Method for analyzing single cell pathway activity based on graph neural network |
CN115223657A (en) * | 2022-09-20 | 2022-10-21 | 吉林农业大学 | Medicinal plant transcription regulation and control map prediction method |
CN115223657B (en) * | 2022-09-20 | 2022-12-06 | 吉林农业大学 | Medicinal plant transcriptional regulation map prediction method |
CN116452910A (en) * | 2023-03-28 | 2023-07-18 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116452910B (en) * | 2023-03-28 | 2023-11-28 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116844649A (en) * | 2023-08-31 | 2023-10-03 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
CN116844649B (en) * | 2023-08-31 | 2023-11-21 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
Also Published As
Publication number | Publication date |
---|---|
CN114022693B (en) | 2024-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
Obayashi et al. | Multi-objective design exploration for aerodynamic configurations | |
CN104751842B (en) | The optimization method and system of deep neural network | |
CN113889192B (en) | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder | |
CN112464004A (en) | Multi-view depth generation image clustering method | |
US20230197205A1 (en) | Bioretrosynthetic method and system based on and-or tree and single-step reaction template prediction | |
CN115952424A (en) | Graph convolution neural network clustering method based on multi-view structure | |
CN116152554A (en) | Knowledge-guided small sample image recognition system | |
CN116386729A (en) | scRNA-seq data dimension reduction method based on graph neural network | |
CN113537365A (en) | Multitask learning self-adaptive balancing method based on information entropy dynamic weighting | |
Rad et al. | GP-RVM: Genetic programing-based symbolic regression using relevance vector machine | |
CN117497038B (en) | Method for rapidly optimizing culture medium formula based on nuclear method | |
CN117556532A (en) | Optimization method for multi-element matching of novel turbine disc pre-rotation system | |
Chowdhury et al. | UICPC: centrality-based clustering for scRNA-seq data analysis without user input | |
Örkçü et al. | A hybrid applied optimization algorithm for training multi-layer neural networks in data classification | |
CN115661498A (en) | Self-optimization single cell clustering method | |
CN115906959A (en) | Parameter training method of neural network model based on DE-BP algorithm | |
CN115618272A (en) | Method for automatically identifying single cell type based on depth residual error generation algorithm | |
Bai et al. | Clustering single-cell rna sequencing data by deep learning algorithm | |
Mrabah et al. | Toward Convex Manifolds: A Geometric Perspective for Deep Graph Clustering of Single-cell RNA-seq Data. | |
CN113011091A (en) | Automatic-grouping multi-scale light-weight deep convolution neural network optimization method | |
Liu et al. | Multi-objective evolutionary algorithm for mining 3D clusters in gene-sample-time microarray data | |
CN114219069B (en) | Brain effect connection network learning method based on automatic variation self-encoder | |
CN117912573B (en) | Deep learning-based multi-level biomolecular network construction method | |
CN113506593B (en) | Intelligent inference method for large-scale gene regulation network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||