CN114022693A - Double-self-supervision-based single-cell RNA-seq data clustering method - Google Patents
- Publication number
- CN114022693A CN114022693A CN202111152906.1A CN202111152906A CN114022693A CN 114022693 A CN114022693 A CN 114022693A CN 202111152906 A CN202111152906 A CN 202111152906A CN 114022693 A CN114022693 A CN 114022693A
- Authority
- CN
- China
- Prior art keywords
- data
- neural network
- distribution
- cell rna
- graph
- Prior art date
- Legal status
- Granted
Classifications
- G06F18/23213 — Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. k-means
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
- G06N3/088 — Learning methods; non-supervised learning, e.g. competitive learning
- G16B40/30 — ICT specially adapted for biostatistics/bioinformatics machine learning; unsupervised data analysis
Abstract
The invention discloses a single-cell RNA-seq data clustering method based on double self-supervision. First, a deep neural network incorporating gene-ontology knowledge is constructed to extract features of single-cell RNA-seq data, and the data are reconstructed through a zero-inflated negative binomial (ZINB) distribution to reduce noise. Second, a graph structure is built with the uniform manifold approximation and projection (UMAP) technique, and a graph neural network is used to mine topological information among the data samples. The graph neural network and the deep neural network are then combined through a double self-supervision strategy. Finally, clustering of the single-cell RNA-seq data is achieved by minimizing a joint loss function comprising the deep neural network loss, the graph neural network loss, the double self-supervision loss, and a random Gaussian noise term. By applying self-supervision in the unsupervised setting, the method effectively addresses shortcomings of existing single-cell RNA-seq clustering methods, such as failure to learn topological information among the data and poor biological interpretability.
Description
Technical Field
The invention belongs to the technical field of single cell RNA-seq data analysis, and particularly relates to a double-self-supervision-based single cell RNA-seq data clustering method.
Background
Clustering of single-cell RNA-seq data plays an important role in research on cell heterogeneity and related topics. In the clustering problem, cells are grouped into different cell types according to their transcription profiles, each type having an expression profile distinct from the others. Through research on clustering methods for single-cell RNA-seq data, researchers can identify new cell populations in organisms, determine cell states, establish networks among cells, trace developmental lineages, and study responses in in-vitro and in-vivo experiments. Traditional clustering methods such as k-means, hierarchical clustering, and density-based clustering with noise (e.g. DBSCAN) are widely used, but single-cell RNA-seq data have unique characteristics that prevent these traditional methods from clustering them effectively.
Disclosure of Invention
To overcome the above technical problems, the invention provides a single-cell RNA-seq data clustering method based on double self-supervision, which integrates the structural information of single-cell RNA-seq data into the neural network layers and, through a double self-supervision strategy, integrates a conventional deep neural network and a graph neural network into the same model, enabling iterative optimization of the model as a whole. In this way, data structures from low order to high order are naturally combined with the multiple representations learned by the autoencoder. In constructing the autoencoder, the invention first builds the single-cell RNA-seq graph structure using the uniform manifold approximation and projection (UMAP) technique and reconstructs the sample data with a zero-inflated negative binomial (ZINB) distribution, which both denoises the single-cell RNA-seq data and improves the overall performance of the model, laying a foundation for subsequent effective clustering.
In order to achieve the purpose, the invention adopts the technical scheme that:
A single-cell RNA-seq data clustering method based on double self-supervision comprises the following steps:
1) pre-training the autoencoder in the deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases — GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322 — whose gene expression values are obtained from various tissue cells; further screening cells with normal gene expression counts, reading the raw single-cell RNA-seq data and performing normalization preprocessing, then feeding the processed single-cell RNA-seq data into the designed autoencoder for training to obtain a pre-trained model;
2) initializing a clustering center:
in the initialization stage of the experiment, the autoencoder obtained by pre-training learns a latent representation of the single-cell RNA-seq data; on the basis of this latent representation, the k-means algorithm initializes the cluster centers, the initialization is run 20 times, and the best solution is selected as the initial cluster centers;
3) randomly initializing the network parameters of the l-th layer:
initialization of the network parameters is important for training: it avoids gradient explosion and vanishing gradients, speeds up training, and accelerates network convergence; the network parameters of the l-th layer are initialized with the Xavier initialization method so that signals can propagate deeper through the neural network;
4) constructing the graph data structure:
the topological information of the single-cell RNA-seq data is learned by adding a graph neural network to the model, and a good graph data structure greatly promotes the learning of the graph neural network; these tasks are completed with the uniform manifold approximation and projection (UMAP) algorithm: first, the distance from each single-cell RNA-seq sample to its nearest neighbor is computed; next, the distance probabilities are computed; then, the matrix of a directed weighted graph is constructed, the adjacency matrix of the undirected graph is computed, and the K-nearest-neighbor graph structure of the original data is built;
5) iterative full training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology, and the data are effectively reconstructed with a zero-inflated negative binomial distribution, reducing noise and dimensionality; after the representation at the last layer of the autoencoder is obtained, the underlying data distribution of the samples and the normalized target distribution are computed; a transfer operator combines, layer by layer, the two representations learned by the deep neural network and the graph neural network, the learning of the graph neural network propagates forward continuously, and its low-dimensional distribution is computed; the target distribution supervises the learning of both networks via KL divergence; the loss functions of the three modules — graph neural network, deep neural network, and double self-supervision — are integrated as the overall loss function of the invention; the resulting new training data are fed back into the current model for training and parameter optimization, and iteration stops when the total loss function of the model converges;
6) returning the final clustering result:
through effective learning of the network, the single-cell RNA-seq data representation learned in the graph convolutional network contains two different types of information; the soft assignment values of the data distribution learned by the graph convolutional network are taken as the final clustering result to discover cell subtypes, providing support for subsequent early cancer detection and treatment.
The preprocessing of the single-cell RNA-seq data in step 1) comprises: first, retaining cells with normal gene expression counts; then normalizing the data for sequencing depth and gene length using log normalization.
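The filtering and log normalization described above can be sketched as follows. The patent does not state its exact filtering threshold, so `min_genes` is a hypothetical cutoff, and size factors are computed from library sizes as one common convention:

```python
import numpy as np

def preprocess(counts, min_genes=200):
    """Filter cells expressing too few genes, then size-factor + log normalize.

    A minimal sketch: min_genes is an assumed cutoff, not the patent's value.
    """
    # Keep cells with a "normal" number of expressed genes.
    expressed = (counts > 0).sum(axis=1)
    counts = counts[expressed >= min_genes]
    # Normalize each cell by its sequencing depth (library size), then log1p.
    size_factors = counts.sum(axis=1, keepdims=True) / np.median(counts.sum(axis=1))
    return np.log1p(counts / size_factors)
```

The log1p transform keeps zero counts at zero while compressing the heavy right tail of the expression values.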
The input dimension of the autoencoder in step 1) matches the dimensionality of the single-cell RNA-seq data used for training; the autoencoder has five layers, and the dimension of each layer in the graph neural network module matches the corresponding dimension in the autoencoder.
Step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, computing the distance $\rho_i$ to its first nearest neighbor; next, computing from $\rho_i$ the variance $\sigma_i$ of the distance probability; then computing the weights between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and further computing its directed adjacency matrix; finally, computing the adjacency matrix of the undirected graph from the directed adjacency matrix using the Hadamard product.
The distance from a data point to its first nearest neighbor is
$$\rho_i = \min_{j} d(x_i, x_j), \quad x_j \in N_k(x_i),$$
where k denotes the number of nearest neighbors considered (per the UMAP construction); $\sigma_i$ is found from
$$\sum_{j=1}^{k} \exp\!\Big(-\frac{\max(0,\, d(x_i,x_j)-\rho_i)}{\sigma_i}\Big) = \log_2 k.$$
In the directed weighted graph $\bar G$, the node set V is all single-cell RNA-seq data points, the edge set is $E = \{(x_i, x_j) \mid x_j \in N_k(x_i)\}$, and the weight between nodes is computed as
$$w(x_i, x_j) = \exp\!\Big(-\frac{\max(0,\, d(x_i,x_j)-\rho_i)}{\sigma_i}\Big).$$
After the matrix $\bar A$ of the directed weighted graph $\bar G$ is built, its directed adjacency matrix can be computed, and the adjacency matrix A of the undirected graph $G = (V, E)$ is obtained by
$$A = \bar A + \bar A^{\mathsf T} - \bar A \circ \bar A^{\mathsf T},$$
where $\circ$ denotes the Hadamard product.
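The UMAP-style graph construction can be sketched numerically as below. `fuzzy_knn_graph`, the binary search for $\sigma_i$, and the toy data are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def fuzzy_knn_graph(X, k=5, n_iter=64):
    """UMAP-style weighted k-NN graph (small-data sketch).

    rho_i  : distance from point i to its nearest neighbor
    sigma_i: found by binary search so the k weights of i sum to log2(k)
    A      : symmetrized adjacency, A = W + W.T - W * W.T (Hadamard product)
    """
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])[1:k + 1]        # k nearest neighbors, skip self
        rho = D[i, order[0]]
        lo, hi = 1e-8, 1e4                       # binary search for sigma_i
        for _ in range(n_iter):
            sigma = (lo + hi) / 2.0
            s = np.exp(-np.maximum(D[i, order] - rho, 0.0) / sigma).sum()
            lo, hi = (sigma, hi) if s < np.log2(k) else (lo, sigma)
        W[i, order] = np.exp(-np.maximum(D[i, order] - rho, 0.0) / sigma)
    return W + W.T - W * W.T                     # fuzzy union -> undirected graph

rng = np.random.default_rng(0)
A = fuzzy_knn_graph(rng.normal(size=(30, 4)), k=5)
```

The final line implements the Hadamard-product symmetrization, so each edge weight stays in [0, 1] and the result is a valid undirected adjacency matrix.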
The iterative full training in step 5) specifically comprises: obtaining each layer representation of the deep neural network in combination with gene-ontology learning; computing the data distribution Q using the last-layer representation of the encoder; on the basis of Q, squaring and then normalizing by the soft cluster frequencies to compute the target distribution P; for each layer output of the encoder, fusing the representations of the deep neural network and the graph neural network with the transfer operator $\varepsilon$ and propagating forward to learn the next layer representation of the graph neural network; computing the low-dimensional distribution Z of the graph neural network; feeding the representation learned in the encoder into the decoder to reconstruct the original data; computing the three loss functions $L_{res}$, $L_{clu}$, $L_{gnn}$; computing the overall loss function L of the whole network structure; and updating the parameters of the whole network with the backpropagation algorithm until iteration stops.
For the i-th single-cell RNA sequencing sample and the j-th cluster center, the data distribution Q in the deep neural network uses Student's t distribution as its kernel to measure the similarity between the data representation $h_i$ and the cluster center $\mu_j$:
$$q_{ij} = \frac{\big(1 + \lVert h_i - \mu_j \rVert^2 / v\big)^{-\frac{v+1}{2}}}{\sum_{j'} \big(1 + \lVert h_i - \mu_{j'} \rVert^2 / v\big)^{-\frac{v+1}{2}}},$$
where $h_i$ is the i-th row of the autoencoder representation, $\mu_j$ is obtained by k-means initialization during autoencoder pre-training, v is the degree of freedom of the Student's t distribution, and $q_{ij}$ denotes the probability of assigning the i-th sample to the j-th cluster; $Q = [q_{ij}]$ is regarded as the distribution of all sample assignments.
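The Student's t soft assignment can be sketched as follows; the embedding and centers here are synthetic placeholders:

```python
import numpy as np

def soft_assignment(H, mu, v=1.0):
    """Student's t soft assignment between embeddings h_i and centers mu_j:
    q_ij ∝ (1 + ||h_i - mu_j||^2 / v)^(-(v+1)/2), normalized over clusters j."""
    dist2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = (1.0 + dist2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)                  # rows sum to 1

rng = np.random.default_rng(0)
Q = soft_assignment(rng.normal(size=(100, 8)), rng.normal(size=(4, 8)))
```

Each row of `Q` is a probability distribution over the cluster centers, with heavier tails than a Gaussian kernel would give.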
The target distribution P supervises the other two distributions:
$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad f_j = \sum_i q_{ij},$$
where $f_j$ denotes the soft cluster frequency. For each row $q_i$ of Q, the entries are first squared and then normalized by the soft cluster frequencies to obtain each $p_i$; $P = [p_{ij}]$ is regarded as the target distribution of all sample assignments.
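The sharpening step from Q to P can be sketched directly from the formula; the small Q matrix is illustrative:

```python
import numpy as np

def target_distribution(Q):
    """Sharpened target: p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    with soft cluster frequency f_j = sum_i q_ij."""
    f = Q.sum(axis=0)                  # soft cluster frequencies
    weight = Q ** 2 / f                # square, then divide by frequency
    return weight / weight.sum(axis=1, keepdims=True)

Q = np.array([[0.8, 0.2], [0.4, 0.6], [0.7, 0.3]])
P = target_distribution(Q)
```

Squaring pushes each assignment toward its dominant cluster, so confident assignments become even more confident — the "high confidence" property the text relies on.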
If the encoder has L layers in total and l indexes a layer, the data learned by the l-th encoder layer is represented as
$$H^{(l)} = \phi\big(W_e^{(l)} H^{(l-1)} + b_e^{(l)}\big),$$
where $\phi$ denotes the activation function of each layer, and $W_e^{(l)}$ and $b_e^{(l)}$ are, respectively, the weight parameters and bias term learned by the l-th layer in the encoder; $H^{(0)}$ is defined as the original data X. The decoder of the model follows the encoder and reconstructs the data through several neural network layers,
$$H_d^{(l)} = \phi\big(W_d^{(l)} H_d^{(l-1)} + b_d^{(l)}\big),$$
where $W_d^{(l)}$ and $b_d^{(l)}$ are, respectively, the weight parameters and bias term learned by the l-th layer in the decoder;
since the representation $H^{(l-1)}$ learned by the autoencoder can reconstruct the single-cell RNA-seq data, it contains information different from the representation $Z^{(l-1)}$ learned by the graph neural network; combining the two representations gives
$$\tilde Z^{(l-1)} = (1-\varepsilon)\, Z^{(l-1)} + \varepsilon\, H^{(l-1)},$$
where $\varepsilon$ is the transfer operator, set to 0.5, which connects the autoencoder in the deep neural network module and the graph convolutional network in the graph neural network module layer by layer;
$\tilde Z^{(l-1)}$ is used as the input to the l-th layer of the graph convolutional network to generate $Z^{(l)}$:
$$Z^{(l)} = \phi\big(\tilde D^{-\frac12} \tilde A\, \tilde D^{-\frac12}\, \tilde Z^{(l-1)} W^{(l-1)}\big),$$
where $\tilde A$ denotes the adjacency matrix with self-loops and $\tilde D$ its degree matrix. Through the normalized adjacency matrix $\tilde D^{-1/2} \tilde A \tilde D^{-1/2}$, the representation $H^{(l-1)}$ learned by the encoder network is propagated; because each autoencoder layer learns different information, the information of every layer is integrated into the graph convolutional network, with L integration processes run in total. The last layer of the graph neural network module is a multi-classification (softmax) layer, whose final result $z_{ij} \in Z$ is the probability that the i-th sample belongs to the j-th cluster center, so Z is a probability distribution;
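One fused propagation step can be sketched as below. `gcn_layer_with_fusion`, the random graph, and the layer sizes are illustrative assumptions, not the patent's configuration:

```python
import numpy as np

def gcn_layer_with_fusion(Z_prev, H_prev, A, W, eps=0.5):
    """One fused graph-convolution step (a sketch):
    Z~ = (1 - eps) * Z + eps * H                       # fuse GCN and AE layers
    Z_next = ReLU(D~^-1/2 (A + I) D~^-1/2 · Z~ · W)    # normalized propagation
    """
    Z_tilde = (1.0 - eps) * Z_prev + eps * H_prev
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ Z_tilde @ W, 0.0)

rng = np.random.default_rng(0)
n, d_in, d_out = 20, 16, 8
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.maximum(A, A.T)                                 # symmetric toy adjacency
Z = gcn_layer_with_fusion(rng.normal(size=(n, d_in)),
                          rng.normal(size=(n, d_in)),
                          A, rng.normal(size=(d_in, d_out)))
```

With `eps=0.5` the autoencoder and graph representations contribute equally, matching the transfer-operator setting stated above.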
the final output of the decoder is the reconstructed data. Recent research on single-cell RNA-seq data shows that such data are best modeled by a negative binomial (NB) distribution,
$$\mathrm{NB}(X \mid \mu, \theta) = \frac{\Gamma(X+\theta)}{X!\,\Gamma(\theta)} \Big(\frac{\theta}{\theta+\mu}\Big)^{\theta} \Big(\frac{\mu}{\theta+\mu}\Big)^{X}.$$
Because the dispersion of single-cell RNA-seq data is usually highly skewed, the variance tends to exceed the mean, so a Poisson approximation is unsuitable; moreover, the variance of single-cell RNA-seq data typically changes with the mean. In addition, single-cell RNA-seq data contain a particularly large number of zeros: a zero in the gene expression data may come from a gene that is genuinely not expressed in the biological process (a true zero) or from technical loss during sequencing (a dropout zero). To better capture single-cell RNA-seq data, the invention improves the conventional denoising autoencoder by adding a zero-inflation factor on top of the NB model — which can also be understood as adding an impulse function at zero — i.e., modeling the single-cell RNA-seq data with a zero-inflated negative binomial (ZINB) distribution:
$$\mathrm{ZINB}(X \mid \pi, \mu, \theta) = \pi\,\delta_0(X) + (1-\pi)\,\mathrm{NB}(X \mid \mu, \theta).$$
Three independent fully connected layers are appended after the last hidden layer, so the whole autoencoder has three outputs that learn, respectively, the zero-inflation factor, the mean, and the dispersion of the ZINB distribution. $L_{res}$ reduces the error between the decoder's reconstructed data and the original data:
$$L_{res} = -\log\big(\mathrm{ZINB}(X \mid \pi, \mu, \theta)\big);$$
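The ZINB negative log-likelihood can be sketched numerically from the formulas above; the count vector and parameter values are illustrative, and a stabilizing `eps` is an added assumption:

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Negative log-likelihood of the zero-inflated negative binomial (a sketch).

    NB(x|mu,theta)  = Gamma(x+theta)/(x! Gamma(theta))
                      * (theta/(theta+mu))^theta * (mu/(theta+mu))^x
    ZINB(x|pi,mu,theta) = pi * [x == 0] + (1 - pi) * NB(x|mu,theta)
    """
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    lik = pi * (x == 0) + (1.0 - pi) * np.exp(log_nb)
    return -np.log(lik + eps).mean()

x = np.array([0.0, 0.0, 3.0, 7.0])      # toy counts with excess zeros
loss = zinb_nll(x, pi=0.3, mu=2.5, theta=1.0)
```

In practice pi, mu, and theta would come from the three fully connected output heads rather than being fixed scalars.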
By minimizing the KL divergence between the Q distribution and the P distribution, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task, drawing the data closer to the cluster centers. Because the target distribution P is obtained by squaring and normalizing the cluster distribution Q, the single-cell RNA-seq data points under P have high confidence; during the experiment, P serves as the supervisory information for Q, continuously optimizing the network model. This process can be regarded as a self-supervision strategy;
the objective function of the graph neural network module likewise optimizes the network via KL divergence, which makes the optimization smoother and prevents large disturbances to the representation learning of the single-cell RNA-seq data. The two different neural network models are thereby integrated into the same parameter-update framework: the target distribution P supervises both the cluster distribution Q and the distribution Z learned by the graph neural network, jointly improving the data representation and the clustering performance of the whole network.
$$L = L_{res} + \alpha L_{clu} + \beta L_{gnn} + \gamma \lVert B \rVert,$$
where $\alpha > 0$ is a hyper-parameter balancing clustering optimization of the original data against data-structure reconstruction, $\beta > 0$ is a coefficient controlling the interference of the graph neural network module with the embedding space, B denotes the random Gaussian noise added to the neural network, and $\gamma$ is a parameter adjusting the influence of this noise on the model; the whole model is updated end-to-end by optimizing this clustering loss function.
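The combined objective can be sketched as below, with $L_{clu}$ and $L_{gnn}$ written as the KL terms the text describes (P supervising Q and Z). The hyper-parameter defaults and the synthetic Q, Z, and B are illustrative assumptions:

```python
import numpy as np

def kl_div(P, Q, eps=1e-10):
    """KL(P || Q) = sum_ij p_ij log(p_ij / q_ij), summed over row distributions."""
    return (P * np.log((P + eps) / (Q + eps))).sum()

def total_loss(L_res, P, Q, Z, B, alpha=0.1, beta=0.01, gamma=0.01):
    """Overall objective sketch: L = L_res + a*KL(P||Q) + b*KL(P||Z) + g*||B||."""
    return (L_res + alpha * kl_div(P, Q) + beta * kl_div(P, Z)
            + gamma * np.linalg.norm(B))

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(3), size=50)        # DNN soft assignments (toy)
Z = rng.dirichlet(np.ones(3), size=50)        # GNN soft assignments (toy)
P = Q ** 2 / Q.sum(0); P /= P.sum(1, keepdims=True)   # target distribution
L = total_loss(L_res=1.0, P=P, Q=Q, Z=Z, B=rng.normal(size=(50, 3)))
```

Because every row of P, Q, and Z is a probability distribution, both KL terms are non-negative, so the total loss is bounded below by the reconstruction term.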
The soft distribution value of the Z distribution is used as a final clustering result, and as the data learned in the graph convolution network contains two different types of information, a label is set for the ith sample data:
the invention has the beneficial effects that:
the autoencoder used by the invention not only reduces the dimensionality of the single-cell RNA-seq data but also effectively learns its representation; reconstructing the data with the zero-inflated negative binomial distribution reduces the influence of data noise on the clustering result; the graph neural network, combined with the uniform manifold approximation and projection technique, learns the topological information among the data; and adding biological prior knowledge such as the gene ontology while constructing the encoder improves the biological interpretability of the model.
Drawings
FIG. 1 is a general flow chart of a single-cell RNA-seq data clustering method based on double self-supervision provided by the invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
As shown in FIG. 1, the invention improves single-cell RNA-seq data clustering through a double self-supervision strategy in six steps: pre-training the autoencoder in the deep neural network module, initializing the cluster centers, randomly initializing the network parameters of the l-th layer, constructing the graph data structure, iterative full training, and returning the final clustering result.
The invention provides a single-cell RNA-seq data clustering method based on double self-supervision. Five public data sets downloaded from the ArrayExpress and GEO databases — GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322 — are selected to verify the effectiveness of the invention; their gene expression values are obtained from various tissue cells. Cells with normal gene expression counts are further screened, and the data are normalized for sequencing depth and gene length using log normalization. The normalized data serve as the initial input data for the embodiment of the invention. The invention comprises the following steps:
Step one: pre-train the autoencoder, using good reconstruction of the original single-cell RNA-seq data as the indication that pre-training is complete. All selected data are trained for 30 epochs, with the learning rate set to 0.001 and the hyper-parameters set to α = 0.1, β = 0.01, and γ = 0.01. Through this training, the parameters learned by the encoder network effectively reduce the dimensionality of and extract features from the original data. The latent data representation already contains the information needed to reconstruct the original data and can be restored to the original data space by the decoding part of the autoencoder.
Step two: initialize the cluster centers. In the initialization stage of the experiment, the k-means algorithm initializes the cluster centers 20 times, and the best solution is selected as the initial cluster centers.
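The 20-restart k-means initialization can be sketched with scikit-learn. The latent matrix, cluster count, and dimensions here are synthetic placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical latent representation from the pre-trained encoder:
# 200 cells embedded in a 10-dimensional space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))

# n_init=20 mirrors the 20 initializations described above; scikit-learn
# keeps the run with the lowest inertia, i.e. the "best solution".
kmeans = KMeans(n_clusters=5, n_init=20, random_state=0).fit(latent)
centers = kmeans.cluster_centers_   # initial cluster centers mu_j
```

The resulting `centers` would seed the $\mu_j$ used by the Student's t assignment in training.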
Step three: randomly initialize the network parameters of the l-th layer with the Xavier initialization method.
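Xavier (Glorot) uniform initialization, named in step 3) above, can be sketched as follows; the layer sizes are illustrative:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: W ~ U(-a, a) with a = sqrt(6/(fan_in+fan_out)).

    Keeping activation variance roughly constant across layers helps signals
    propagate deeper, as the description notes.
    """
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_uniform(256, 64, rng)   # e.g. a 256 -> 64 encoder layer
```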
Step four: construct the graph data structure, using the uniform manifold approximation and projection technique on the single-cell RNA-seq data. The specific steps are as follows:
Step 1: compute the distance from each data point to its first nearest neighbor, $\rho_i = \min_j d(x_i, x_j)$ over $x_j \in N_k(x_i)$, where k denotes the number of nearest neighbors considered;
Step 2: from $\rho_i$, compute $\sigma_i$ by solving $\sum_{j=1}^{k} \exp\big(-\max(0,\, d(x_i,x_j)-\rho_i)/\sigma_i\big) = \log_2 k$;
Step 3: for the directed weighted graph $\bar G$, the node set V is all data points and the edge set is $E = \{(x_i, x_j) \mid x_j \in N_k(x_i)\}$; the weight between nodes is computed as $w(x_i, x_j) = \exp\big(-\max(0,\, d(x_i,x_j)-\rho_i)/\sigma_i\big)$;
Step 4: after the matrix $\bar A$ of the directed weighted graph $\bar G$ is built, its directed adjacency matrix can be computed, and the adjacency matrix of the undirected graph is obtained as $A = \bar A + \bar A^{\mathsf T} - \bar A \circ \bar A^{\mathsf T}$.
And step five, carrying out iterative full-scale training. The method comprises the following specific steps:
step 1, learning by combining terms in a Gene Ontology (GO) to obtain each layer of expression of a deep neural network;
step 2, calculating data distribution Q by using the last layer representation of the encoder;
step 3, on the basis of the data distribution Q, performing quadratic power calculation and then performing normalization according to each soft clustering frequency to calculate a target distribution P;
step 4, aiming at the output of each layer of the encoder, fusing the representation of each layer of the deep neural network and the graph neural network by using a transfer operator epsilon, and further propagating forward to learn the representation of the next layer of the graph neural network;
step 5, calculating the low-dimensional distribution Z of the graph neural network;
step 6, continuously inputting the representation obtained by learning in the encoder into a decoder to reconstruct the original data;
step 7, respectively calculating three loss functions L in the methodres,Lclu,Lgnn(ii) a Calculating an overall loss function L of the whole network structure;
step 8, updating parameters of the whole network by using a back propagation algorithm in the whole network framework until iteration stops;
Step six: return the final clustering result. The invention selects the soft assignment values of the Z distribution as the final clustering result; since the data learned in the graph convolutional network contain two different types of information, the label for the i-th sample is set as $y_i = \arg\max_j z_{ij}$. For each data set, 10 experiments are run in total and the average is taken as the final result. After the final clustering result is obtained, the clustering quality is verified with four measures: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Homogeneity, and Completeness. The results show that, compared with traditional clustering methods, the invention achieves better clustering of single-cell RNA-seq data.
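The four evaluation measures are available in scikit-learn; the toy labels below are illustrative. Note that a predicted partition identical to the ground truth up to a permutation of label ids scores 1.0 on all four:

```python
from sklearn.metrics import (normalized_mutual_info_score, adjusted_rand_score,
                             homogeneity_score, completeness_score)

# Hypothetical ground-truth cell types vs. predicted cluster labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # same partition, permuted label ids

nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
hom = homogeneity_score(y_true, y_pred)
com = completeness_score(y_true, y_pred)
```

All four metrics are invariant to relabeling, which is essential for evaluating clustering, where cluster ids carry no meaning.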
The invention has the following characteristics:
1. the influence of high dimensionality and large noise of single cell RNA-seq data on a clustering result is reduced;
2. in the clustering process of the single cell RNA-seq data, the data representation can be effectively learned, and the method has strong data representation capability;
3. the characteristic information of the data can be learned, and the topological structure information among the data can be learned;
4. has good biological interpretability.
The method belongs to self-supervised learning within the unsupervised field: the data input to the model carry no labels, but labels — also called "pseudo-labels" — are constructed artificially from the structure or characteristics of the data. Once the data have such labels, a learning mechanism similar to supervised learning can be applied to train the deep neural network.
In the process of carrying out deep clustering on single-cell RNA-seq data, in addition to effectively reducing the dimensionality and noise of the high-dimensional single-cell RNA-seq data, the invention designs a double self-supervision strategy, fuses the deep neural network and the graph neural network into a unified framework, and improves model performance by adding biological prior knowledge and adopting the uniform manifold approximation and projection technique.
Claims (10)
1. A single-cell RNA-seq data clustering method based on double self-supervision, characterized by comprising the following steps:
1) pre-training the self-encoder in the deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases, whose gene expression values are obtained from various tissue cells: GSE60361, GSE65525, GSE72056, GSE76312 and GSE103322; further screening out cells with normal gene expression quantity, reading the original single-cell RNA-seq data and carrying out standardized preprocessing, then inputting the processed single-cell RNA-seq data into the designed self-encoder for training and obtaining a pre-training model;
2) initializing a clustering center:
in the initialization stage of an experiment, a self-encoder obtained by pre-training can learn the potential representation of single-cell RNA-seq data, a k-means algorithm is used for initializing a clustering center on the basis of the potential representation, the initialization is carried out for 20 times, and the optimal solution is selected as the initial clustering center;
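The initialization described in step 2) can be sketched with scikit-learn's KMeans, whose `n_init` parameter restarts the algorithm and keeps the solution with the lowest inertia; the latent matrix below is a random stand-in for the self-encoder output:

```python
# Sketch: initializing cluster centers with k-means on a latent representation,
# restarted 20 times with the best solution kept, as step 2) describes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 10))   # placeholder for the encoder's latent output
n_clusters = 5

# n_init=20 runs k-means 20 times and keeps the run with the lowest inertia.
km = KMeans(n_clusters=n_clusters, n_init=20, random_state=0).fit(latent)
initial_centers = km.cluster_centers_   # shape: (n_clusters, latent_dim)
```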
3) randomly initializing network parameters of the l layer;
initializing the network parameters of the l layers using the Xavier initialization method, so that signals can propagate deeper in the neural network;
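Xavier (Glorot) uniform initialization draws each weight matrix from a uniform distribution whose range depends on the layer's fan-in and fan-out; a minimal sketch with illustrative layer sizes:

```python
# Sketch: Xavier uniform initialization for an l-layer network, as step 3)
# describes. Layer sizes are illustrative placeholders.
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """W ~ U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
layer_sizes = [2000, 512, 256, 64]   # placeholder encoder dimensions
weights = [xavier_uniform(m, n, rng)
           for m, n in zip(layer_sizes, layer_sizes[1:])]
```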
4) construct graph data structure:
constructing K-nearest-neighbor graph structure data of the original data using the uniform manifold approximation and projection algorithm: first calculating the distance from each single-cell RNA-seq sample data point to its nearest neighbor, then calculating the distance probabilities, then constructing the matrix of a directed weighted graph, calculating the adjacency matrix of the undirected graph, and so constructing the K-nearest-neighbor graph structure data of the original data;
5) iterative full training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology; the zero-inflated negative binomial distribution is used to effectively reconstruct the single-cell RNA-seq data, reducing data noise and dimensionality; after the representation of the data in the last layer of the self-encoder is obtained, the data distribution of the sample data in the latent layer and the normalized target distribution are further calculated; a transfer operator is used to combine, layer by layer, the two representations learned by the deep neural network and the graph neural network, the learning of the graph neural network is continuously propagated forward, and the low-dimensional distribution of the graph neural network is calculated; the target distribution supervises the learning processes of the two neural networks in the form of KL divergence; the loss functions of the graph neural network, the deep neural network, and the double self-supervision module are integrated as the overall loss function of the invention; the new training data obtained are input into the current model again for training to optimize the model parameters, and iteration stops when the total loss function of the model converges;
6) and returning a final clustering result:
through effective learning of the network, the single-cell RNA-seq data learned in the graph convolution network comprises two different types of information; the soft assignment value of the data distribution learned by the graph convolution network is used as the final clustering result to discover cell subtypes, thereby assisting subsequent early cancer discovery and treatment.
2. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step of preprocessing the single-cell RNA-seq data in the step 1) comprises: firstly, screening out cells with normal gene expression quantity; then, the data was normalized for sequencing depth and gene length using a logarithmic normalization method.
3. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the self-encoder input dimension in step 1) is consistent with the dimension of the single-cell RNA-seq data used for training, and the dimension of each layer in the graph neural network module is consistent with the dimension in the self-encoder.
4. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, calculating the distance ρ_i between the data point and its first nearest neighbor; then, according to ρ_i, calculating the variance σ_i of the distance probability; then, calculating the weight values between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and further calculating its directed adjacency matrix; and, from the directed adjacency matrix, calculating the adjacency matrix of the undirected graph in combination with the Hadamard product operation.
5. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 4, wherein the distance from a data point x_i to its first nearest neighbor is ρ_i = min{ d(x_i, x_ij) | 1 ≤ j ≤ k, d(x_i, x_ij) > 0 }, where k denotes the number of nearest neighbors; σ_i can be found from ρ_i through the formula Σ_{j=1..k} exp( −max(0, d(x_i, x_ij) − ρ_i) / σ_i ) = log₂(k); the node set V of the directed weighted graph Ḡ is all the single-cell RNA-seq data, and the edge set is E = { (x_i, x_ij) | 1 ≤ j ≤ k, 1 ≤ i ≤ N }; the weight w between nodes is calculated using w(x_i, x_ij) = exp( −max(0, d(x_i, x_ij) − ρ_i) / σ_i ); after the matrix of the directed weighted graph Ḡ is built, the directed adjacency matrix B of Ḡ can be further calculated, and the adjacency matrix A of the undirected graph G = (V, E) is calculated through A = B + Bᵀ − B ∘ Bᵀ, where ∘ denotes the Hadamard product.
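The graph construction of claims 4 and 5 can be sketched as follows; the neighborhood size and binary-search depth are illustrative choices, not values fixed by the patent:

```python
# Hedged sketch of the UMAP-style k-nearest-neighbour graph construction:
# per-point rho_i and sigma_i, directed fuzzy weights, and the symmetrized
# undirected adjacency A = B + B^T - B∘B^T.
import numpy as np

def knn_graph(X, k=5, n_iter=64):
    n = X.shape[0]
    # pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # indices of the k nearest neighbours of each point (excluding itself)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]
    B = np.zeros((n, n))
    for i in range(n):
        dists = d[i, nbrs[i]]
        rho = dists[dists > 0].min()          # distance to first nearest neighbour
        # binary search for sigma_i so the smoothed degree equals log2(k)
        lo, hi, target = 1e-10, 1e4, np.log2(k)
        for _ in range(n_iter):
            mid = (lo + hi) / 2.0
            s = np.exp(-np.maximum(dists - rho, 0.0) / mid).sum()
            if s > target:
                hi = mid
            else:
                lo = mid
        B[i, nbrs[i]] = np.exp(-np.maximum(dists - rho, 0.0) / mid)
    # probabilistic symmetrization: A = B + B^T - B ∘ B^T
    return B + B.T - B * B.T

A = knn_graph(np.random.default_rng(0).normal(size=(30, 8)), k=5)
```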
6. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the step 5) iterative full-scale training specifically comprises: obtaining each layer of the deep neural network in combination with gene ontology learning; calculating the data distribution Q using the last-layer representation of the encoder; on the basis of the data distribution Q, squaring and normalizing by each soft clustering frequency to calculate the target distribution P; for each layer output of the encoder, fusing the corresponding layer representations of the deep neural network and the graph neural network using the transfer operator ε, and propagating forward to learn the next-layer representation of the graph neural network; calculating the low-dimensional distribution Z of the graph neural network; continuing to input the representation learned in the encoder into the decoder to reconstruct the original data; respectively calculating the three loss functions L_res, L_clu and L_gnn of the method; calculating the overall loss function L of the whole network structure; and updating the parameters of the whole network using the back-propagation algorithm until iteration stops.
7. The method of claim 6, wherein, for the ith single-cell RNA sequencing sample and the jth cluster center, the data distribution Q in the deep neural network module uses the Student's t-distribution as a kernel to measure the similarity between the data representation h_i and the cluster center μ_j: q_ij = (1 + ‖h_i − μ_j‖²/ν)^(−(ν+1)/2) / Σ_{j′} (1 + ‖h_i − μ_{j′}‖²/ν)^(−(ν+1)/2), where h_i denotes the ith row of the self-encoder representation, μ_j is obtained by k-means initialization during pre-training of the self-encoder, ν is the degree of freedom of the Student's t-distribution, and q_ij denotes the probability of assigning the ith sample datum to the jth cluster, so that Q = [q_ij] is seen as the distribution of all sample assignments.
8. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the target distribution P plays the role of supervising the other two distributions: p_ij = (q_ij² / f_j) / (Σ_{j′} q_{ij′}² / f_{j′}), where f_j = Σ_i q_ij denotes the soft clustering frequency; each q_i of Q is first squared and then normalized by each soft clustering frequency to obtain each p_i, so that P = [p_ij] is seen as the target distribution of all sample assignments.
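The distributions of claims 7 and 8 can be sketched in NumPy; the shapes and the degree of freedom ν = 1 are illustrative:

```python
# Hedged sketch of the soft assignment Q (Student's t kernel, claim 7) and the
# sharpened target distribution P (claim 8). Shapes and names are illustrative.
import numpy as np

def soft_assignment(H, centers, nu=1.0):
    """q_ij ∝ (1 + ||h_i - mu_j||^2 / nu)^(-(nu+1)/2), rows sum to 1."""
    sq = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with f_j = sum_i q_ij."""
    weight = Q ** 2 / Q.sum(axis=0)          # square, divide by soft frequency
    return weight / weight.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = soft_assignment(rng.normal(size=(50, 10)), rng.normal(size=(4, 10)))
P = target_distribution(Q)
```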
9. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the encoder has L layers in total and l denotes the index of a given layer; the data learned through the lth layer of the encoder is represented as H^(l) = φ(W_e^(l) H^(l−1) + b_e^(l)), where φ denotes the activation function of each layer, W_e^(l) and b_e^(l) are respectively the weight parameters and bias term learned by the lth layer of the encoder, and H^(0) is defined as the original data X; the decoder of the model, placed immediately after the encoder, reconstructs the data through several neural network layers, H^(l) = φ(W_d^(l) H^(l−1) + b_d^(l)), where W_d^(l) and b_d^(l) are respectively the weight parameters and bias term learned by the lth layer of the decoder;
since the representation H^(l−1) learned by the self-encoder can reconstruct the single-cell RNA-seq data, it contains information different from the representation Z^(l−1) learned by the graph neural network, so the two representations are combined to yield Z̃^(l−1) = (1 − ε)Z^(l−1) + εH^(l−1), where ε is a transfer operator set to 0.5, which connects the self-encoder in the deep neural network module and the graph convolution network in the graph neural network module layer by layer;
Z̃^(l−1) is used as the input of the lth layer of the graph convolution network to generate Z^(l) = φ(D̃^(−1/2) Ã D̃^(−1/2) Z̃^(l−1) W^(l−1)), where Ã = A + I denotes the adjacency matrix with self-loops added and D̃ denotes its degree matrix; through the normalized adjacency matrix, the representation H^(l−1) learned by the encoder network is integrated into the graph convolution network; since the information learned by each layer of the self-encoder differs, L information-integration processes are run in total, and the last layer in the graph neural network module is a multi-classification layer: Z = softmax(D̃^(−1/2) Ã D̃^(−1/2) Z^(L) W^(L)); the final result z_ij ∈ Z denotes the probability that the ith sample datum belongs to the jth cluster center, and Z is a probability distribution;
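A single information-integration step (transfer-operator fusion followed by one normalized graph convolution) can be sketched as follows; the weight matrix and graph below are random stand-ins:

```python
# Hedged sketch of one fused layer: Z̃ = (1-ε)Z + εH, then a normalized
# graph convolution Z^(l) = act(D̃^(-1/2) Ã D̃^(-1/2) Z̃ W).
import numpy as np

def normalized_adj(A):
    """Compute D̃^(-1/2) (A + I) D̃^(-1/2), the renormalized adjacency."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_fused_layer(Z_prev, H_prev, A_hat, W, eps=0.5, act=np.tanh):
    """One layer: fuse via the transfer operator, then graph-convolve."""
    Z_tilde = (1.0 - eps) * Z_prev + eps * H_prev   # Z̃ = (1-ε)Z + εH
    return act(A_hat @ Z_tilde @ W)

rng = np.random.default_rng(0)
n, d_in, d_out = 20, 16, 8
M = rng.random((n, n)) < 0.2
A = np.maximum(M, M.T).astype(float)               # random symmetric graph
np.fill_diagonal(A, 0.0)
Z_next = gcn_fused_layer(rng.normal(size=(n, d_in)),
                         rng.normal(size=(n, d_in)),
                         normalized_adj(A),
                         0.1 * rng.normal(size=(d_in, d_out)))
```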
the final output of the decoder is the reconstructed data; a zero-inflation factor is added on the basis of the negative binomial (NB) distribution model, i.e. an impulse function is added at the zero point, so that the single-cell RNA-seq data are modeled with the Zero-Inflated Negative Binomial (ZINB) distribution, formulated as ZINB(X | π, μ, θ) = π δ₀(X) + (1 − π) NB(X | μ, θ); three independent fully-connected layers are appended after the last hidden layer, so the whole self-encoder has three outputs that respectively learn the zero-inflation factor π, mean μ, and dispersion θ of the zero-inflated negative binomial distribution; L_res is used to reduce the error between the decoder's reconstructed data and the original data, L_res = −log(ZINB(X | π, μ, θ));
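The ZINB reconstruction loss L_res can be sketched as a negative log-likelihood in NumPy/SciPy; the mean/dispersion parameterization of NB below is the standard one, and the sample values are illustrative:

```python
# Hedged sketch of the ZINB negative log-likelihood used as L_res. NB is
# parameterized by mean mu and dispersion theta; pi is the zero-inflation
# probability. Sample counts are illustrative.
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Mean negative log-likelihood of ZINB(x | pi, mu, theta)."""
    # log NB(x | mu, theta) in the mean/dispersion parameterization
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # zero-inflated mixture: pi * delta_0(x) + (1 - pi) * NB(x | mu, theta)
    log_zinb = np.where(
        x < 0.5,
        np.log(pi + (1.0 - pi) * np.exp(log_nb) + eps),
        np.log(1.0 - pi + eps) + log_nb,
    )
    return -log_zinb.mean()

x = np.array([0.0, 0.0, 3.0, 7.0])     # illustrative counts
loss = zinb_nll(x, pi=0.3, mu=4.0, theta=2.0)
```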
by minimizing the KL divergence between the Q distribution and the P distribution, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task; since P is obtained by squaring and normalizing the clustering distribution Q, the single-cell RNA-seq data points under P have high confidence, and during the experiment P is used as the supervision information of Q to continuously optimize the network model, a process that can be regarded as a self-supervision strategy;
the objective function of the graph neural network module also optimizes the network in the form of KL divergence, so that the target distribution P supervises both the clustering distribution Q and the distribution Z learned by the graph neural network, jointly improving the data representation and clustering performance of the whole network;
L = L_res + αL_clu + βL_gnn + γ‖B‖, where α > 0 is a hyper-parameter that balances the clustering optimization of the original data against the reconstruction of the data structure, β > 0 is a coefficient controlling the interference of the graph neural network module with the embedding space, B denotes the random Gaussian noise added to the neural network, and γ adjusts the influence of that noise on the model; through optimization of the clustering loss function, the whole model is updated in an end-to-end manner.
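The overall objective can be sketched by combining the two KL-divergence terms with a reconstruction value; the coefficients α, β, γ and the scalar stand-ins below are illustrative, not the patent's tuned values:

```python
# Hedged sketch of the double self-supervision losses: L_clu = KL(P || Q) and
# L_gnn = KL(P || Z), combined into the overall objective. The reconstruction
# term and noise norm are stand-in scalars here.
import numpy as np

def kl_div(P, Q, eps=1e-12):
    """KL(P || Q) summed over samples and clusters."""
    return float((P * np.log((P + eps) / (Q + eps))).sum())

def total_loss(L_res, P, Q, Z, noise_norm, alpha=0.1, beta=0.01, gamma=0.001):
    L_clu = kl_div(P, Q)     # target distribution supervises the autoencoder branch
    L_gnn = kl_div(P, Z)     # and the graph neural network branch
    return L_res + alpha * L_clu + beta * L_gnn + gamma * noise_norm

rng = np.random.default_rng(0)
Q = rng.random((50, 4)); Q /= Q.sum(axis=1, keepdims=True)
Z = rng.random((50, 4)); Z /= Z.sum(axis=1, keepdims=True)
P = Q ** 2 / Q.sum(axis=0); P /= P.sum(axis=1, keepdims=True)
L = total_loss(L_res=1.0, P=P, Q=Q, Z=Z, noise_norm=0.5)
```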
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152906.1A CN114022693B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on double self-supervision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114022693A true CN114022693A (en) | 2022-02-08 |
CN114022693B CN114022693B (en) | 2024-02-27 |
Family
ID=80055158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152906.1A Active CN114022693B (en) | 2021-09-29 | 2021-09-29 | Single-cell RNA-seq data clustering method based on double self-supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114022693B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223657A (en) * | 2022-09-20 | 2022-10-21 | 吉林农业大学 | Medicinal plant transcription regulation and control map prediction method |
CN115240772A (en) * | 2022-08-22 | 2022-10-25 | 南京医科大学 | Method for analyzing active pathway in unicellular multiomics based on graph neural network |
CN114462548B (en) * | 2022-02-23 | 2023-07-18 | 曲阜师范大学 | Method for improving accuracy of single-cell deep clustering algorithm |
CN116452910A (en) * | 2023-03-28 | 2023-07-18 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116844649A (en) * | 2023-08-31 | 2023-10-03 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259979A (en) * | 2020-02-10 | 2020-06-09 | 大连理工大学 | Deep semi-supervised image clustering method based on label self-adaptive strategy |
US20200218971A1 (en) * | 2016-12-27 | 2020-07-09 | Obschestvo S Ogranichennoy Otvetstvennostyu "Vizhnlabs" | Training of deep neural networks on the basis of distributions of paired similarity measures |
CN111785329A (en) * | 2020-07-24 | 2020-10-16 | 中国人民解放军国防科技大学 | Single-cell RNA sequencing clustering method based on confrontation automatic encoder |
Non-Patent Citations (2)
Title |
---|
REN WEI: "Intrusion Detection Method Based on Sparse Autoencoder Deep Neural Network", Mobile Communications, no. 08, 15 August 2018 (2018-08-15) *
GAO MEIJIA: "A Single-cell RNA-seq Data Preprocessing Algorithm Based on Loess Regression Weighting", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114462548B (en) * | 2022-02-23 | 2023-07-18 | 曲阜师范大学 | Method for improving accuracy of single-cell deep clustering algorithm |
CN115240772A (en) * | 2022-08-22 | 2022-10-25 | 南京医科大学 | Method for analyzing active pathway in unicellular multiomics based on graph neural network |
CN115240772B (en) * | 2022-08-22 | 2023-08-22 | 南京医科大学 | Method for analyzing single cell pathway activity based on graph neural network |
CN115223657A (en) * | 2022-09-20 | 2022-10-21 | 吉林农业大学 | Medicinal plant transcription regulation and control map prediction method |
CN115223657B (en) * | 2022-09-20 | 2022-12-06 | 吉林农业大学 | Medicinal plant transcriptional regulation map prediction method |
CN116452910A (en) * | 2023-03-28 | 2023-07-18 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116452910B (en) * | 2023-03-28 | 2023-11-28 | 河南科技大学 | scRNA-seq data characteristic representation and cell type identification method based on graph neural network |
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116844649A (en) * | 2023-08-31 | 2023-10-03 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
CN116844649B (en) * | 2023-08-31 | 2023-11-21 | 杭州木攸目医疗数据有限公司 | Interpretable cell data analysis method based on gene selection |
Also Published As
Publication number | Publication date |
---|---|
CN114022693B (en) | 2024-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
Obayashi et al. | Multi-objective design exploration for aerodynamic configurations | |
CN104751842B (en) | The optimization method and system of deep neural network | |
CN113889192B (en) | Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder | |
CN112464004A (en) | Multi-view depth generation image clustering method | |
US20230197205A1 (en) | Bioretrosynthetic method and system based on and-or tree and single-step reaction template prediction | |
CN115952424A (en) | Graph convolution neural network clustering method based on multi-view structure | |
CN116152554A (en) | Knowledge-guided small sample image recognition system | |
CN116386729A (en) | scRNA-seq data dimension reduction method based on graph neural network | |
CN113537365A (en) | Multitask learning self-adaptive balancing method based on information entropy dynamic weighting | |
Rad et al. | GP-RVM: Genetic programing-based symbolic regression using relevance vector machine | |
CN117497038B (en) | Method for rapidly optimizing culture medium formula based on nuclear method | |
CN117556532A (en) | Optimization method for multi-element matching of novel turbine disc pre-rotation system | |
Chowdhury et al. | UICPC: centrality-based clustering for scRNA-seq data analysis without user input | |
Örkçü et al. | A hybrid applied optimization algorithm for training multi-layer neural networks in data classification | |
CN115661498A (en) | Self-optimization single cell clustering method | |
CN115906959A (en) | Parameter training method of neural network model based on DE-BP algorithm | |
CN115618272A (en) | Method for automatically identifying single cell type based on depth residual error generation algorithm | |
Bai et al. | Clustering single-cell rna sequencing data by deep learning algorithm | |
Mrabah et al. | Toward Convex Manifolds: A Geometric Perspective for Deep Graph Clustering of Single-cell RNA-seq Data. | |
CN113011091A (en) | Automatic-grouping multi-scale light-weight deep convolution neural network optimization method | |
Liu et al. | Multi-objective evolutionary algorithm for mining 3D clusters in gene-sample-time microarray data | |
CN114219069B (en) | Brain effect connection network learning method based on automatic variation self-encoder | |
CN117912573B (en) | Deep learning-based multi-level biomolecular network construction method | |
CN113506593B (en) | Intelligent inference method for large-scale gene regulation network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||