CN114022693A - Double-self-supervision-based single-cell RNA-seq data clustering method - Google Patents


Info

Publication number
CN114022693A
CN114022693A (application CN202111152906.1A)
Authority
CN
China
Prior art keywords
data
neural network
distribution
cell rna
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111152906.1A
Other languages
Chinese (zh)
Other versions
CN114022693B (en)
Inventor
王艺杰
曾荣汉
杨东
王文庆
崔逸群
邓楠轶
朱博迪
介银娟
董夏昕
朱召鹏
崔鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Thermal Power Research Institute Co Ltd
Original Assignee
Xian Thermal Power Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Thermal Power Research Institute Co Ltd filed Critical Xian Thermal Power Research Institute Co Ltd
Priority to CN202111152906.1A priority Critical patent/CN114022693B/en
Publication of CN114022693A publication Critical patent/CN114022693A/en
Application granted granted Critical
Publication of CN114022693B publication Critical patent/CN114022693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a single-cell RNA-seq data clustering method based on double self-supervision. First, a deep neural network is constructed in combination with gene-ontology knowledge to extract features from single-cell RNA-seq data, and the data are reconstructed through a zero-inflated negative binomial (ZINB) distribution to reduce noise. Second, a graph structure is built with the uniform manifold approximation and projection (UMAP) technique, and a graph neural network mines the topological structure among data samples. A double self-supervision strategy then couples the graph neural network with the deep neural network. Finally, clustering is achieved by minimizing a joint loss function comprising the deep neural network, the graph neural network, the double self-supervision module, and a random Gaussian noise term. By bringing a self-supervised strategy into the unsupervised setting, the invention effectively addresses shortcomings of existing single-cell RNA-seq clustering methods, such as failure to learn the topological structure among the data and poor biological interpretability.

Description

Double-self-supervision-based single-cell RNA-seq data clustering method
Technical Field
The invention belongs to the technical field of single cell RNA-seq data analysis, and particularly relates to a double-self-supervision-based single cell RNA-seq data clustering method.
Background
Clustering methods for single-cell RNA-seq data play an important role in research on cell heterogeneity and related topics. In the clustering problem, cells can be grouped into different cell types according to their transcription profiles, each type having an expression profile distinct from the others. Through research on clustering methods for single-cell RNA-seq data, researchers can identify new cell populations in organisms, identify cell states, establish networks among cells, trace developmental lineages, and study the responses of in-vitro and in-vivo experiments. At present, traditional clustering methods such as k-means, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN) are widely used, but single-cell RNA-seq data have unique characteristics that prevent these traditional methods from clustering them effectively.
Disclosure of Invention
To overcome these technical problems, the invention provides a single-cell RNA-seq data clustering method based on double self-supervision, which integrates the structural information of single-cell RNA-seq data into the neural network layers, fuses a traditional deep neural network and a graph neural network into the same model through a double self-supervision strategy, and realizes iterative optimization of the model as a whole. In this way, data structures from low order to high order are naturally combined with the multiple representations learned by the autoencoder. When constructing the autoencoder, the invention first builds graph-structured data for the single-cell RNA sequencing samples using the uniform manifold approximation and projection (UMAP) technique, and reconstructs the sample data with a zero-inflated negative binomial distribution, which not only denoises the single-cell RNA-seq data but also improves the overall performance of the model and lays a foundation for effective clustering.
In order to achieve the purpose, the invention adopts the technical scheme that:
a single cell RNA-seq data clustering method based on double self-supervision comprises the following steps;
1) the self-encoder in the pre-training deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases, namely GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322, whose gene expression values come from a variety of tissue cells; screening out cells with normal gene expression levels, reading the raw single-cell RNA-seq data and preprocessing it with standardization, then feeding the processed single-cell RNA-seq data into the designed autoencoder for training to obtain a pre-trained model;
2) initializing a clustering center:
in the initialization stage of the experiment, the autoencoder obtained by pre-training learns a latent representation of the single-cell RNA-seq data; on this representation the k-means algorithm is run with 20 different initializations, and the best solution is selected as the initial cluster centers;
3) randomly initializing the network parameters of the l-th layer:
parameter initialization is critical to network training: it avoids gradient explosion and vanishing gradients, speeds up training, and accelerates convergence; the network parameters of the l-th layer are initialized with the Xavier initialization method so that signals can propagate deeper through the neural network;
4) construct graph data structure:
the topological structure of the single-cell RNA-seq data is learned by adding a graph neural network to the model, and a good graph data structure greatly promotes the learning effect of the graph neural network; the uniform manifold approximation and projection (UMAP) algorithm completes this task: first the distance from each single-cell RNA-seq sample to its nearest neighbor is computed, then the distance probabilities are computed, the matrix of the directed weighted graph is constructed, the adjacency matrix of the undirected graph is calculated, and the K-nearest-neighbor graph of the original data is built;
5) iterative full training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology, and the data are effectively reconstructed with a zero-inflated negative binomial distribution to reduce noise and dimensionality. After the representation from the last layer of the autoencoder is obtained, the data distribution of the samples in the latent layer and the normalized target distribution are computed. A transfer operator combines, layer by layer, the two representations learned by the deep neural network and the graph neural network; the learning of the graph neural network continues to propagate forward, and its low-dimensional distribution is computed. The target distribution supervises the learning processes of both neural networks through the KL divergence. The loss functions of the three modules, namely the graph neural network, the deep neural network, and the double self-supervision, are integrated as the overall loss function of the invention; the resulting new training data are fed into the current model again for training, the model parameters are optimized, and iteration stops when the total loss function of the model converges;
6) and returning a final clustering result:
through effective learning of the network, the single-cell RNA-seq representation learned in the graph convolutional network incorporates two different types of information; the soft assignment values of the data distribution learned by the graph convolutional network are taken as the final clustering result to discover cell subtypes, providing help for subsequent early cancer discovery and treatment.
The preprocessing of the single-cell RNA-seq data in step 1) comprises: first, screening out cells with normal gene expression levels; then, normalizing the data for sequencing depth and gene length with a logarithmic normalization method.
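As a concrete illustration, the logarithmic normalization step can be sketched in numpy; this is a minimal sketch under stated assumptions (the function name and the per-cell scale factor are illustrative, and the gene-length correction mentioned above is omitted for brevity):

```python
import numpy as np

def lognorm(counts, scale=1e4):
    """Library-size normalization followed by a log1p transform.

    counts: (cells, genes) raw count matrix.
    scale:  target total counts per cell after depth normalization
            (an assumed convention, not specified by the patent).
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)   # sequencing depth per cell
    lib[lib == 0] = 1.0                       # guard against empty cells
    return np.log1p(counts / lib * scale)
```

Cells sequenced at different depths map to comparable profiles: a cell with twice the depth but the same relative expression yields the same normalized values.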
The input dimension of the autoencoder in step 1) matches the dimension of the single-cell RNA-seq data used for training; the autoencoder has five layers, and the dimension of each layer in the graph neural network module matches the corresponding dimension in the autoencoder.
Step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, calculating the distance ρ_i to its first nearest neighbor; next, calculating the variance σ_i of the distance probability from ρ_i; then, calculating the weights between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and deriving its directed adjacency matrix; finally, calculating the adjacency matrix of the undirected graph from the directed adjacency matrix using the Hadamard product.
The distance ρ_i from a data point x_i to its first nearest neighbor is

ρ_i = min{ d(x_i, x_i^j) | 1 ≤ j ≤ k, d(x_i, x_i^j) > 0 }

where k denotes the number of nearest neighbors and x_i^j is the j-th nearest neighbor of x_i. σ_i is found from the equation

Σ_{j=1}^{k} exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i ) = log2(k).

For the directed weighted graph G' = (V, E, w), the node set V is all single-cell RNA-seq data points, and the edge set is E = {(x_i, x_i^j) | 1 ≤ j ≤ k, 1 ≤ i ≤ N}. The weight w between nodes is calculated as

w(x_i, x_i^j) = exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i ).

After the matrix of the directed weighted graph G' is built, its directed adjacency matrix A' can be computed, and the adjacency matrix A of the undirected graph G = (V, E) is calculated as

A = A' + A'^T - A' ∘ A'^T

where ∘ denotes the Hadamard product.
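The graph construction above can be sketched in numpy as follows; this is a simplified illustration of UMAP-style graph building under stated assumptions (the function name, the brute-force distance computation, and the fixed binary-search iteration count are assumptions, not taken from the patent):

```python
import numpy as np

def knn_graph(X, k=10, n_iter=64):
    """Fuzzy k-NN graph in the style of UMAP's graph construction.

    X: (N, d) data matrix. Returns the symmetrized adjacency
    A = A' + A'.T - A' * A'.T (Hadamard product), where A' holds the
    directed edge weights exp(-max(0, d - rho_i) / sigma_i).
    """
    N = X.shape[0]
    # brute-force pairwise Euclidean distances (fine for small N)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    A_dir = np.zeros((N, N))
    target = np.log2(k)
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]          # k nearest neighbors, skipping self
        d = D[i, nbrs]
        rho = d[d > 0].min()                      # distance to first nearest neighbor
        # binary-search sigma_i so that sum_j exp(-max(0, d_j - rho)/sigma) = log2(k)
        lo, hi = 1e-8, 1e4
        for _ in range(n_iter):
            sigma = 0.5 * (lo + hi)
            s = np.exp(-np.maximum(0.0, d - rho) / sigma).sum()
            if s > target:
                hi = sigma
            else:
                lo = sigma
        A_dir[i, nbrs] = np.exp(-np.maximum(0.0, d - rho) / sigma)
    return A_dir + A_dir.T - A_dir * A_dir.T
```

The symmetrization keeps every entry in [0, 1], so the result can be used directly as the weighted adjacency matrix of the undirected graph G.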
The iterative full training of step 5) specifically comprises: obtaining each layer's representation of the deep neural network in combination with gene ontology learning; calculating the data distribution Q from the last-layer representation of the encoder; on the basis of Q, squaring and then normalizing by each soft cluster frequency to calculate the target distribution P; for each layer's output of the encoder, fusing the representations of the deep neural network and the graph neural network with the transfer operator ε and propagating forward to learn the next layer of the graph neural network; calculating the low-dimensional distribution Z of the graph neural network; feeding the representation learned in the encoder into the decoder to reconstruct the original data; calculating the three loss functions L_res, L_clu, and L_gnn; calculating the overall loss function L of the whole network structure; and updating the parameters of the whole network with the back-propagation algorithm until iteration stops.
The data distribution Q in the deep neural network uses Student's t distribution to measure the similarity between the representation h_i of the i-th single-cell RNA sequencing sample and the j-th cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||^2 / v)^(-(v+1)/2) / Σ_{j'} (1 + ||h_i - μ_{j'}||^2 / v)^(-(v+1)/2)

h_i is the i-th row of the autoencoder's representation, μ_j is obtained by k-means initialization during autoencoder pre-training, and v is the degree of freedom of the Student's t distribution. q_ij denotes the probability of assigning the i-th sample to the j-th cluster, and Q = [q_ij] is regarded as the distribution of all sample assignments.
The target distribution P supervises the other two distributions: p_ij = (q_ij^2 / f_j) / (Σ_{j'} q_{ij'}^2 / f_{j'}), where f_j = Σ_i q_ij is the soft cluster frequency. Each q_i of Q is first squared and then normalized by the soft cluster frequencies to obtain p_i, and P = [p_ij] is regarded as the target distribution of all sample assignments.
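The two distributions can be computed directly from their formulas; a minimal numpy sketch (function names are illustrative, not from the patent):

```python
import numpy as np

def soft_assign(H, mu, v=1.0):
    """Student's t soft assignment q_ij between embeddings H and centers mu."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
    q = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)                # normalize per sample

def target_dist(q):
    """Sharpened target distribution p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j')."""
    f = q.sum(axis=0)                                      # soft cluster frequencies f_j
    p = q ** 2 / f
    return p / p.sum(axis=1, keepdims=True)
```

Squaring and renormalizing pushes each row of P toward its dominant cluster, which is what makes P usable as supervision for Q.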
If the encoder has L layers in total and l denotes a layer index, the representation learned by the l-th layer of the encoder is:

H^(l) = φ(W_e^(l) H^(l-1) + b_e^(l))

φ denotes the activation function of each layer, and W_e^(l) and b_e^(l) are the weight parameters and bias term learned by the l-th layer of the encoder; H^(0) is defined as the original data X. The decoder follows the encoder and reconstructs the data through several neural network layers:

H_d^(l) = φ(W_d^(l) H_d^(l-1) + b_d^(l))

W_d^(l) and b_d^(l) are the weight parameters and bias term learned by the l-th layer of the decoder;
representation H due to self-encoder learning(l-1)Can reconstruct single cell RNA-seq data, which comprises representation Z obtained by learning different from neural network(l-1)Information of (A) is(l-1)And H(l-1)The two representations are combined to yield: z%(l-1)=(1-ε)Z(l-1)+εH(l-1)Epsilon is a transfer operator which is set to be 0.5, and a self-encoder in the deep neural network module and a graph convolution network in the graph neural network module are connected layer by layer;
use of
Figure BDA0003287702680000065
Generating Z as input to the l-th layer of a graph convolution network(l)
Figure BDA0003287702680000066
Figure BDA0003287702680000067
A contiguous matrix is represented that is,
Figure BDA0003287702680000068
expression matrix, H learned from encoder network(l-1)By normalizing the adjacency matrix
Figure BDA0003287702680000069
The information learned by each layer of the self-encoder is integrated into the convolutional neural network due to different information learned by each layer of the self-encoder, L information integration processes are run together, and the last layer in the graph neural network module is a multi-classification layer:
Figure BDA00032877026800000610
final result zijE is Z, the ith sample data belongs to the jth clustering center data, and Z is probability distribution;
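A minimal numpy sketch of one fused propagation step; the ReLU activation and the function names are illustrative assumptions (the patent specifies only the fusion with the transfer operator and the normalized-adjacency propagation):

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_fused_layer(A_norm, Z_prev, H_prev, W, eps=0.5):
    """One fused layer: mix the GCN and autoencoder representations with the
    transfer operator eps, then propagate over the normalized adjacency."""
    Z_tilde = (1.0 - eps) * Z_prev + eps * H_prev
    return np.maximum(A_norm @ Z_tilde @ W, 0.0)   # ReLU (assumed activation)

def softmax(x):
    """Row-wise softmax for the final multi-classification layer."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

The self-loops in `normalize_adj` also guarantee a nonzero degree for every node, so the normalization never divides by zero.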
the final output part of the decoder is reconstruction data, and recent research progress aiming at single-cell RNA-seq data shows that the single-cell RNA-seq data is closest to Negative Binomial distribution (NB) and is formulated as
Figure BDA0003287702680000071
Because the dispersion of single-cell RNA-seq data is usually highly distorted, the variance tends to be larger than the mean and is therefore not suitable for approximation with a poisson distribution, whereas the variance of single-cell RNA-seq data typically changes as the mean changes. In addition to the above, single cell RNA-seq data are characterized by a particularly high number of zeros. Since the Zero values in the gene expression data may come from genes that are not expressed in the biological process (True Zero) or from technical losses in the sequencing process (Dropout Zero). In order to better capture single-cell RNA-seq data, the invention improves the traditional noise reduction self-encoder, adds a Zero-expansion factor on the basis of a Negative Binomial distribution (NB) model, and can also be understood as adding a pulse function at a Zero point, namely modeling the single-cell RNA-seq data by using Zero-expanded Negative Binomial distribution (Zero-expanded Negative Binomial). Formulated as ZINB (X | pi, μ, θ) ═ pi δ0(X) + (1-pi) N BETA (X | mu, theta), three independent full-connected layers are added behind the last hidden layer, and the whole self-encoder has three outputs to respectively learn the zero expansion factor, the mean value and the variance of the zero expansion negative binomial distribution. L isresTo reduce the error between the reconstructed data and the original data of the decoder, Lres=-log(ZINB(X|π,μ,θ));
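The ZINB negative log-likelihood can be written out with the log-gamma function; a hedged numpy/scipy sketch of the reconstruction loss (parameters are assumed to broadcast against the data, and an eps guard is added for numerical stability, both being implementation assumptions):

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Mean negative log-likelihood of the zero-inflated negative binomial.

    pi:    dropout (zero-inflation) probability, in (0, 1)
    mu:    NB mean; theta: NB dispersion; all broadcast against x.
    """
    # log NB(x | mu, theta) via log-gamma terms
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # zero entries: log(pi + (1-pi) NB(0)); nonzero: log(1-pi) + log NB(x)
    nb_zero = theta * np.log(theta / (theta + mu) + eps)
    ll = np.where(x < eps,
                  np.log(pi + (1.0 - pi) * np.exp(nb_zero) + eps),
                  np.log(1.0 - pi + eps) + log_nb)
    return -ll.mean()
```

The zero branch mixes the point mass at zero with the NB's own probability of a zero count, which is exactly why ZINB tolerates dropout-heavy data better than a plain NB or Poisson model.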
L_clu = KL(P ∥ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)

By minimizing the KL divergence between the Q distribution and the P distribution, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task, pulling the representation closer to the cluster centers. The target distribution P is obtained from the clustering distribution Q by squaring and normalization, so the single-cell RNA-seq data points under P have high confidence. P is used as the supervision information for Q throughout the experiment to continuously optimize the network model, a process that can be regarded as a self-supervision strategy;
L_gnn = KL(P ∥ Z) = Σ_i Σ_j p_ij log(p_ij / z_ij)

The objective function of the graph neural network module optimizes the network through the KL divergence, making the optimization process smoother and avoiding large disturbances to the representation learning of the single-cell RNA-seq data. The two different neural network models are integrated into the same parameter-update framework, so the target distribution P can supervise both the clustering distribution Q and the distribution Z learned by the graph neural network, jointly improving the data representation and clustering performance of the whole network.
L = L_res + αL_clu + βL_gnn + γ||B||, where α > 0 is a hyperparameter balancing the clustering optimization of the original data against the data-structure reconstruction, β > 0 is a coefficient controlling the interference of the graph neural network module with the embedding space, B denotes the random Gaussian noise added to the neural network, and γ adjusts the influence of that noise on the model. The whole model is updated end to end by optimizing this clustering loss function.
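A minimal sketch of the joint objective, using the hyperparameter values α = 0.1, β = 0.01, γ = 0.01 given in the embodiment (the two KL terms supervise Q and Z with the target distribution P; function names are illustrative):

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL(P || Q) summed over samples and clusters."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def total_loss(l_res, p, q, z, noise_norm, alpha=0.1, beta=0.01, gamma=0.01):
    """Joint objective L = L_res + alpha*L_clu + beta*L_gnn + gamma*||B||."""
    l_clu = kl(p, q)   # target distribution supervising the autoencoder's Q
    l_gnn = kl(p, z)   # target distribution supervising the GCN's Z
    return l_res + alpha * l_clu + beta * l_gnn + gamma * noise_norm
```

When Q and Z both match P exactly, the two KL terms vanish and the loss reduces to the reconstruction term plus the noise penalty.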
The soft assignment values of the Z distribution are used as the final clustering result. Since the data learned in the graph convolutional network contain two different types of information, the label of the i-th sample is set as:

y_i = argmax_j z_ij
the invention has the beneficial effects that:
the autoencoder not only reduces the dimensionality of the single-cell RNA-seq data but also effectively learns its representation; reconstructing the data with the zero-inflated negative binomial distribution reduces the influence of noise on the clustering effect; the graph neural network, combined with the uniform manifold approximation and projection technique, learns the topological structure among the data; and biological prior knowledge such as the gene ontology is added when constructing the encoder, improving the biological interpretability of the model.
Drawings
FIG. 1 is a general flow chart of a single-cell RNA-seq data clustering method based on double self-supervision provided by the invention.
Detailed Description
The present invention will be described in further detail with reference to examples.
As shown in FIG. 1, the invention improves the single-cell RNA-seq clustering effect with a double self-supervision strategy in six steps: pre-training the autoencoder in the deep neural network module, initializing the cluster centers, randomly initializing the network parameters of the l-th layer, constructing the graph data structure, iterative full training, and returning the final clustering result.
The invention provides a single-cell RNA-seq data clustering method based on double self-supervision. Five public data sets downloaded from the ArrayExpress and GEO databases are selected to verify its effectiveness: GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322, whose gene expression values come from a variety of tissue cells. Cells with normal gene expression levels are screened, and the sequencing depth and gene length of the data are normalized with a logarithmic normalization method. The normalized data serve as the initial input for this embodiment of the invention. The invention comprises the following steps:
firstly, the autoencoder is pre-trained, with good reconstruction of the original single-cell RNA-seq data taken as the sign that pre-training is complete. The invention trains on all selected data for 30 epochs with a learning rate of 0.001 and hyperparameters α = 0.1, β = 0.01, and γ = 0.01. Through this training, the parameters learned by the encoder network can effectively reduce the dimensionality of the original data and extract its features. The latent representation already contains the information needed to reconstruct the original data, which can be restored to the original dimensional space by the decoding part of the autoencoder.
And step two, initializing the cluster centers: in the initialization stage of the experiment, the k-means algorithm is run with 20 different initializations and the best solution is selected as the initial cluster centers.
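This initialization can be reproduced with scikit-learn's KMeans, whose `n_init` parameter runs multiple initializations and keeps the solution with the lowest within-cluster sum of squares (the wrapper name and fixed seed are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centers(latent, n_clusters, n_init=20, seed=0):
    """Initialize cluster centers on the pretrained latent representation.

    Runs k-means n_init times and keeps the best solution by inertia,
    matching the 20-restart initialization described above.
    """
    km = KMeans(n_clusters=n_clusters, n_init=n_init, random_state=seed)
    labels = km.fit_predict(latent)
    return km.cluster_centers_, labels
```

The returned centers play the role of the μ_j used in the soft-assignment distribution.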
And step three, randomly initializing the network parameters of the l-th layer.
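The description specifies Xavier initialization for this step; a sketch of Xavier (Glorot) uniform initialization for one layer's weight matrix (the function name and fixed seed are illustrative, and the bound follows the standard Glorot formula):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Xavier/Glorot uniform init: W ~ U(-a, a), a = sqrt(6/(fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))
```

Scaling the bound by the fan-in and fan-out keeps the variance of activations roughly constant across layers, which is what lets signals propagate deeper without exploding or vanishing.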
And step four, constructing the graph data structure, using the uniform manifold approximation and projection technique to build a graph from the single-cell RNA-seq data. The specific steps are as follows:
Step 1, calculate the distance ρ_i from each data point x_i to its first nearest neighbor:

ρ_i = min{ d(x_i, x_i^j) | 1 ≤ j ≤ k, d(x_i, x_i^j) > 0 }

where k denotes the number of nearest neighbors and x_i^j is the j-th nearest neighbor of x_i;
Step 2, find σ_i from the equation

Σ_{j=1}^{k} exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i ) = log2(k);

Step 3, for the directed weighted graph G' = (V, E, w), the node set V is all data points and the edge set is E = {(x_i, x_i^j) | 1 ≤ j ≤ k, 1 ≤ i ≤ N}; the weight w between nodes is calculated as

w(x_i, x_i^j) = exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i );

Step 4, after the matrix of the directed weighted graph G' is built, its directed adjacency matrix A' can be computed;
Step 5, the adjacency matrix A of the undirected graph G = (V, E) is calculated as

A = A' + A'^T - A' ∘ A'^T

where ∘ denotes the Hadamard product.
And step five, carrying out iterative full training. The specific steps are as follows:
Step 1, learning in combination with terms of the Gene Ontology (GO) to obtain each layer's representation of the deep neural network;
Step 2, calculating the data distribution Q from the last-layer representation of the encoder;
Step 3, on the basis of Q, squaring and then normalizing by each soft cluster frequency to calculate the target distribution P;
Step 4, for each layer's output of the encoder, fusing the representations of the deep neural network and the graph neural network with the transfer operator ε, and propagating forward to learn the next layer's representation of the graph neural network;
Step 5, calculating the low-dimensional distribution Z of the graph neural network;
Step 6, feeding the representation learned in the encoder into the decoder to reconstruct the original data;
Step 7, calculating the three loss functions L_res, L_clu, and L_gnn, and then the overall loss function L of the whole network structure;
Step 8, updating the parameters of the whole network with the back-propagation algorithm within the overall framework until iteration stops.
and step six, returning the final clustering result. The invention selects the soft distribution value of Z distribution as the final clustering result, because the data learned in the graph convolution network contains two different types of information, the label is set for the ith sample data:
Figure BDA0003287702680000111
for each data set, the invention ran 10 experiments in total and averaged as the final result. After the final clustering result is obtained, the final clustering effect is verified by using four measurement methods of standardized Mutual information NMI (normalized Mutual information), adjusted landed index ARI (adjusted Rand index), Homogeneity Homogenity and integrity, and the result shows that the single-cell RNA-seq data clustering can be better realized compared with the traditional clustering method.
The invention has the following characteristics:
1. The influence of the high dimensionality and high noise of single-cell RNA-seq data on the clustering result is reduced;
2. In the clustering process of the single-cell RNA-seq data, the data representation can be learned effectively, giving the method strong representation capability;
3. Both the feature information of the data and the topological structure among the data can be learned;
4. The method has good biological interpretability.
Self-supervision is a specific form of unsupervised learning: the data fed into the model carry no labels, but labels, also called "pseudo-labels", are constructed artificially from the structure or characteristics of the data; once the data carry such labels, a learning mechanism similar to supervised learning can be used to train the deep neural network.
In the deep clustering of single-cell RNA-seq data, besides effectively reducing the dimensionality and noise of the high-dimensional data, the invention designs a double self-supervision strategy that fuses the deep neural network and the graph neural network into a unified framework, and improves model performance by adding biological prior knowledge and adopting the uniform manifold approximation and projection technique.

Claims (10)

1. A single-cell RNA-seq data clustering method based on double self-supervision, characterized by comprising the following steps:
1) the self-encoder in the pre-training deep neural network module:
selecting 5 public data sets downloaded from the ArrayExpress and GEO databases, namely GSE60361, GSE65525, GSE72056, GSE76312, and GSE103322, whose gene expression values come from a variety of tissue cells; screening out cells with normal gene expression levels, reading the raw single-cell RNA-seq data and preprocessing it with standardization, then feeding the processed single-cell RNA-seq data into the designed autoencoder for training to obtain a pre-trained model;
2) initializing the cluster centers:
in the initialization stage of the experiment, the self-encoder obtained by pre-training learns a latent representation of the single-cell RNA-seq data; on the basis of this representation, the k-means algorithm is used to initialize the cluster centers, the initialization is run 20 times, and the best solution is selected as the initial cluster centers;
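The cluster-center initialization step can be sketched as follows. This is a minimal NumPy implementation of Lloyd's k-means with 20 random restarts; the latent matrix H, its dimensions and the cluster count are illustrative stand-ins, not values fixed by the claim:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels, inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    inertia = ((X - centers[labels]) ** 2).sum()
    return centers, labels, inertia

# Hypothetical latent embedding of 300 cells in 10 dims (stand-in data
# for the representation learned by the pre-trained self-encoder).
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(c, 0.3, size=(100, 10)) for c in (-2.0, 0.0, 2.0)])

# Run 20 random initializations and keep the lowest-inertia solution,
# matching the "initialize 20 times, select the best" step of claim 1.
best = min((kmeans(H, k=3, seed=s) for s in range(20)), key=lambda r: r[2])
mu = best[0]  # initial cluster centers for the model
```

Keeping the restart with the lowest inertia is the usual criterion for "the optimal solution" among k-means runs.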
3) randomly initializing the network parameters of each layer:
the network parameters of the l-th layer are initialized with the Xavier initialization method, so that signals can propagate deeper in the neural network;
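A minimal sketch of Xavier (Glorot) uniform initialization for a stack of layers; the layer sizes are hypothetical, not taken from the patent:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform initialization: W ~ U(-a, a) with
    a = sqrt(6 / (fan_in + fan_out)), which keeps activation variance
    roughly constant so signals propagate through deep layers."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
# Hypothetical layer sizes for the encoder (illustrative only).
sizes = [2000, 512, 256, 64]
weights = [xavier_uniform(m, n, rng) for m, n in zip(sizes[:-1], sizes[1:])]
```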
4) constructing the graph data structure:
constructing K-nearest-neighbor graph structure data of the original data with the uniform manifold approximation and projection algorithm: first calculating the distance from each single-cell RNA-seq sample to its nearest neighbor, then the distance probabilities, then constructing the matrix of a directed weighted graph and calculating the adjacency matrix of the undirected graph, thereby obtaining the K-nearest-neighbor graph structure data of the original data;
5) iterative full-scale training:
in a single training pass, the representation of the single-cell RNA-seq data in the deep neural network is learned layer by layer in combination with the gene ontology; the zero-inflated negative binomial (ZINB) distribution is used to effectively reconstruct the single-cell RNA-seq data, reducing data noise and dimensionality; after the representation of the data in the last layer of the self-encoder is obtained, the data distribution of the samples in the latent layer and the normalized target distribution are calculated; a transfer operator combines, layer by layer, the two representations learned by the deep neural network and the graph neural network, the learning of the graph neural network is propagated forward continuously, and the low-dimensional distribution of the graph neural network is calculated; the target distribution supervises the learning processes of the two neural networks through KL divergence; the loss functions of the graph neural network, the deep neural network and the double self-supervision module are integrated as the overall loss function of the invention; the new training data obtained are input into the current model again for training to optimize the model parameters, and iteration stops when the total loss function of the model converges;
6) returning the final clustering result:
through effective learning of the network, the single-cell RNA-seq representation learned in the graph convolution network contains two different types of information; the soft assignment values of the data distribution learned by the graph convolution network are used as the final clustering result to discover cell subtypes, thereby aiding subsequent early cancer discovery and treatment.
2. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein preprocessing the single-cell RNA-seq data in step 1) comprises: first, screening out cells with normal gene expression levels; then, normalizing the data for sequencing depth and gene length using a logarithmic normalization method.
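The preprocessing of claim 2 can be sketched roughly as follows; the count matrix, the filtering thresholds and the exact log-normalization formula are illustrative assumptions, since the claim does not fix them:

```python
import numpy as np

# Stand-in raw count matrix: 100 cells x 500 genes (hypothetical data).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 500)).astype(float)

# 1) Filter cells with abnormal expression: here, keep cells whose total
#    count lies within a plausible range (thresholds are illustrative).
totals = X.sum(axis=1)
keep = (totals > 50) & (totals < 5000)
X = X[keep]

# 2) Log-normalize for sequencing depth: scale each cell to a common
#    library size, then apply log1p (one common logarithmic
#    normalization scheme; the patent does not state the exact formula).
lib = X.sum(axis=1, keepdims=True)
X_norm = np.log1p(X / lib * np.median(lib))
```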
3. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the self-encoder input dimension in step 1) is consistent with the dimension of the single-cell RNA-seq data used for training, and the dimension of each layer in the graph neural network module is consistent with the dimension in the self-encoder.
4. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein step 4) specifically comprises: first, for each high-dimensional single-cell RNA-seq data point, calculating the distance ρ_i from the data point to its first nearest neighbor; then, according to ρ_i, calculating the variance σ_i of the distance probability; next, calculating the weights between nodes in the directed weighted graph, constructing the matrix of the directed weighted graph, and further calculating its directed adjacency matrix; and finally, from the directed adjacency matrix, calculating the adjacency matrix of the undirected graph using the Hadamard product operation.
5. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 4, wherein the distance ρ_i from a data point x_i to its first nearest neighbor is

ρ_i = min{ d(x_i, x_i^j) | 1 ≤ j ≤ k, d(x_i, x_i^j) > 0 },

where x_i^j denotes the j-th nearest neighbor of x_i and k the number of nearest neighbors; σ_i can be found from the formula

Σ_{j=1}^{k} exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i ) = log_2(k);

the directed weighted graph is G' = (V, E, w), in which the node set V is all single-cell RNA-seq data and the edge set is E = {(x_i, x_i^j) | 1 ≤ j ≤ k, 1 ≤ i ≤ N}; the weight w between nodes is calculated using

w(x_i, x_i^j) = exp( -max(0, d(x_i, x_i^j) - ρ_i) / σ_i );

after the matrix of the directed weighted graph G' is built, its directed adjacency matrix B can be further calculated, and the adjacency matrix A of the undirected graph G = (V, E) is obtained by

A = B + B^T - B ∘ B^T,

where ∘ denotes the Hadamard product.
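The graph construction of claims 4 and 5 can be sketched in NumPy as follows; the bisection search for σ_i and the stand-in data set are illustrative choices not specified in the claims:

```python
import numpy as np

def knn_graph_umap(X, k=5, n_bisect=40):
    """UMAP-style k-NN graph: per-point rho_i (distance to the first
    non-zero neighbor), sigma_i found by bisection so the neighbor
    weights sum to log2(k), directed weights B, and symmetrization
    A = B + B^T - B∘B^T."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    order = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbors
    B = np.zeros((n, n))
    for i in range(n):
        d = D[i, order[i]]
        rho = d[d > 0].min() if (d > 0).any() else 0.0
        lo, hi = 1e-6, 1e3                           # bisection for sigma_i
        for _ in range(n_bisect):
            mid = (lo + hi) / 2
            s = np.exp(-np.maximum(d - rho, 0.0) / mid).sum()
            lo, hi = (mid, hi) if s < np.log2(k) else (lo, mid)
        B[i, order[i]] = np.exp(-np.maximum(d - rho, 0.0) / mid)
    return B + B.T - B * B.T                         # undirected adjacency A

# Tiny stand-in dataset (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
A = knn_graph_umap(X, k=5)
```

Because each weight lies in [0, 1], the symmetrized entries b + b' - bb' also stay in [0, 1], giving a well-behaved undirected adjacency matrix.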
6. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 1, wherein the iterative full-scale training of step 5) specifically comprises: obtaining each layer of the deep neural network by learning combined with the gene ontology; calculating the data distribution Q using the last-layer representation of the encoder; on the basis of Q, squaring and then normalizing by each soft clustering frequency to calculate the target distribution P; for each layer output of the encoder, fusing the layer representations of the deep neural network and the graph neural network using the transfer operator ε and propagating forward to learn the next-layer representation of the graph neural network; calculating the low-dimensional distribution Z of the graph neural network; feeding the representation learned by the encoder into the decoder to reconstruct the original data; calculating the three loss functions L_res, L_clu and L_gnn of the method; calculating the overall loss function L of the whole network structure; and updating the parameters of the whole network with the back-propagation algorithm until iteration stops.
7. The method of claim 6, wherein the data distribution Q in the deep neural network module uses the Student's t-distribution as a kernel to measure, for the i-th single-cell RNA-seq sample and the j-th cluster center, the similarity between the data representation h_i and the cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||² / v)^(-(v+1)/2) / Σ_{j'} (1 + ||h_i - μ_{j'}||² / v)^(-(v+1)/2),

where h_i is the i-th row of the representation learned by the self-encoder, μ_j is obtained by k-means initialization during pre-training of the self-encoder, v is the degree of freedom of the Student's t-distribution, q_ij denotes the probability of assigning the i-th sample to the j-th cluster, and Q = [q_ij] is regarded as the distribution of all sample assignments.
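A minimal sketch of the Student's-t soft assignment q_ij; the embeddings and centers are random stand-ins:

```python
import numpy as np

def soft_assign(H, mu, v=1.0):
    """Student's-t soft assignment: q_ij measures the similarity between
    embedding h_i and cluster center mu_j, normalized over clusters."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    num = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return num / num.sum(axis=1, keepdims=True)

# Hypothetical embeddings (20 cells, 10 dims) and 3 cluster centers.
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 10))
mu = rng.normal(size=(3, 10))
Q = soft_assign(H, mu)
```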
8. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the target distribution P supervises the other two distributions:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}), with f_j = Σ_i q_ij denoting the soft clustering frequency;

for each row q_i of Q, the entries are first squared and then normalized by the soft clustering frequencies to obtain p_i, and P = [p_ij] is regarded as the target distribution of all sample assignments.
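The target distribution p_ij of claim 8 can be computed as below; the Q matrix is a hypothetical example:

```python
import numpy as np

def target_distribution(Q):
    """Sharpened target distribution: square q_ij, divide by the soft
    clustering frequency f_j, then renormalize each row."""
    f = Q.sum(axis=0)                # soft clustering frequencies f_j
    W = Q ** 2 / f
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical soft assignment matrix for 4 samples and 3 clusters.
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.6, 0.3, 0.1]])
P = target_distribution(Q)
```

Squaring and renormalizing pushes probability mass toward the already-confident cluster, which is why P can serve as a high-confidence supervision signal for Q.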
9. The double self-supervision-based single-cell RNA-seq data clustering method according to claim 6, wherein the encoder has L layers in total; with l denoting the index of a layer, the data learned by the l-th layer of the encoder are represented as

H^(l) = φ(W_e^(l) H^(l-1) + b_e^(l)),

where φ denotes the activation function of each layer, W_e^(l) and b_e^(l) are the weight parameters and bias term learned by the l-th layer of the encoder, and H^(0) is defined as the original data X; the decoder of the model, placed immediately after the encoder, reconstructs the data through several neural network layers:

H_d^(l) = φ(W_d^(l) H_d^(l-1) + b_d^(l)),

where W_d^(l) and b_d^(l) are the weight parameters and bias term learned by the l-th layer of the decoder;

since the representation H^(l-1) learned by the self-encoder can reconstruct the single-cell RNA-seq data, it contains information different from the representation Z^(l-1) learned by the graph neural network; combining the two representations yields

Z̃^(l-1) = (1 - ε) Z^(l-1) + ε H^(l-1),

where ε is the transfer operator, set to 0.5, which connects the self-encoder in the deep neural network module and the graph convolution network in the graph neural network module layer by layer;
Z̃^(l-1) is used as the input of the l-th layer of the graph convolution network to generate Z^(l):

Z^(l) = φ( D̃^(-1/2) Ã D̃^(-1/2) Z̃^(l-1) W^(l-1) ),

where Ã = A + I denotes the adjacency matrix with self-loops and D̃ its degree matrix; in this way, the representation H^(l-1) learned by the encoder network is propagated through the normalized adjacency matrix D̃^(-1/2) Ã D̃^(-1/2); since each layer of the self-encoder learns different information, the information learned by every layer of the self-encoder is integrated into the graph convolutional network, L information-integration steps are run in total, and the last layer of the graph neural network module is a multi-classification layer:

Z = softmax( D̃^(-1/2) Ã D̃^(-1/2) Z^(L) W^(L) );

in the final result, z_ij ∈ Z is the probability that the i-th sample belongs to the j-th cluster center, and Z is a probability distribution;
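One fused propagation step, combining the transfer-operator mixing with normalized-adjacency propagation, can be sketched as follows; the ring graph, the dimensions and the use of a softmax output layer here are illustrative stand-ins:

```python
import numpy as np

def gcn_fused_layer(Z_prev, H_prev, A, W, eps=0.5):
    """One fused GCN step: mix the GCN representation Z^(l-1) with the
    autoencoder representation H^(l-1) via the transfer operator eps,
    then propagate through the normalized adjacency matrix."""
    Z_tilde = (1 - eps) * Z_prev + eps * H_prev      # transfer-operator fusion
    A_tilde = A + np.eye(len(A))                     # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_hat @ Z_tilde @ W                       # pre-activation output

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical 6-node ring graph, 4-dim representations, 3 clusters.
rng = np.random.default_rng(0)
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
Z_prev, H_prev = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
W = rng.normal(size=(4, 3))
# Final multi-classification layer: softmax over the propagated output.
Z = softmax_rows(gcn_fused_layer(Z_prev, H_prev, A, W))
```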
the final output of the decoder is the reconstructed data; a zero-inflation factor is added on the basis of the negative binomial (NB) distribution model, i.e., an impulse function is added at the zero point, so that the single-cell RNA-seq data are modeled by the zero-inflated negative binomial (ZINB) distribution, formulated as ZINB(X | π, μ, θ) = π δ_0(X) + (1 - π) NB(X | μ, θ); three independent fully-connected layers are appended after the last hidden layer, so the whole self-encoder has three outputs that learn the zero-inflation factor, the mean and the dispersion of the ZINB distribution, respectively; L_res reduces the error between the reconstructed data of the decoder and the original data: L_res = -log(ZINB(X | π, μ, θ));
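The ZINB negative log-likelihood L_res can be evaluated pointwise as in this sketch; the NB is parameterized by mean μ and dispersion θ, and the numeric values are illustrative only:

```python
import numpy as np
from math import lgamma

def nb_log_pmf(x, mu, theta):
    """Log PMF of the negative binomial with mean mu and dispersion theta."""
    return (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def zinb_nll(x, pi, mu, theta):
    """-log ZINB(x | pi, mu, theta) with
    ZINB = pi * delta_0(x) + (1 - pi) * NB(x | mu, theta)."""
    nb = np.exp(nb_log_pmf(x, mu, theta))
    lik = pi * (x == 0) + (1 - pi) * nb   # impulse contributes only at x = 0
    return -np.log(lik)

# Illustrative evaluations: a zero count and a non-zero count.
l0 = zinb_nll(0.0, pi=0.3, mu=2.0, theta=1.0)
l3 = zinb_nll(3.0, pi=0.3, mu=2.0, theta=1.0)
```

The zero-inflation term raises the likelihood of zero counts, which is why dropout-heavy scRNA-seq data fit ZINB better than plain NB.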
L_clu = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij);

by minimizing the KL divergence between the Q and P distributions, the target distribution P helps the deep neural network module learn a data representation better suited to the clustering task; because P is obtained from the clustering distribution Q by squaring and normalization, its single-cell RNA-seq data points have high confidence; during the experiment, P serves as the supervision information for Q to continually optimize the network model, a process that can be regarded as a self-supervision strategy;
L_gnn = KL(P ‖ Z) = Σ_i Σ_j p_ij log(p_ij / z_ij);

the objective function of the graph neural network module likewise optimizes the network through KL divergence, so that the clustering distribution Q and the distribution Z learned by the graph neural network are both supervised by the target distribution P, jointly improving the data representation and clustering performance of the whole network;
L = L_res + α L_clu + β L_gnn + γ ||B||, where α > 0 is a hyper-parameter that balances the original-data clustering optimization and the data-structure reconstruction, β > 0 is a coefficient controlling the interference of the graph neural network module on the embedding space, B denotes the random Gaussian noise added to the neural network, and γ adjusts the influence of this noise on the model; through optimization of the clustering loss function, the whole model is updated in an end-to-end manner.
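The overall objective L = L_res + αL_clu + βL_gnn + γ||B|| can be assembled as in this sketch; the distributions, loss values and weights below are illustrative stand-ins (L_res stands in for the ZINB reconstruction loss):

```python
import numpy as np

def kl(P, Q_):
    """KL(P || Q) summed over all samples and clusters."""
    return float((P * np.log(P / Q_)).sum())

# Hypothetical distributions for 4 samples and 3 clusters.
Q = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]])
Z = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
              [0.3, 0.4, 0.3], [0.5, 0.4, 0.1]])
f = Q.sum(axis=0)
W = Q ** 2 / f
P = W / W.sum(axis=1, keepdims=True)          # sharpened target distribution

# Illustrative loss terms and weighting hyper-parameters.
L_res, alpha, beta, gamma, B_norm = 1.5, 0.1, 0.01, 0.001, 0.5
L_clu = kl(P, Q)                              # KL(P || Q)
L_gnn = kl(P, Z)                              # KL(P || Z)
L = L_res + alpha * L_clu + beta * L_gnn + gamma * B_norm
```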
10. The method of claim 6, wherein the soft assignment values of the Z distribution are used as the final clustering result; since the data learned in the graph convolution network contain two different types of information, the label of the i-th sample is

y_i = argmax_j z_ij,

i.e., the cluster center with the highest probability in the i-th row of Z.
CN202111152906.1A 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on double self-supervision Active CN114022693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111152906.1A CN114022693B (en) 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on double self-supervision


Publications (2)

Publication Number Publication Date
CN114022693A true CN114022693A (en) 2022-02-08
CN114022693B CN114022693B (en) 2024-02-27

Family

ID=80055158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111152906.1A Active CN114022693B (en) 2021-09-29 2021-09-29 Single-cell RNA-seq data clustering method based on double self-supervision

Country Status (1)

Country Link
CN (1) CN114022693B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223657A (en) * 2022-09-20 2022-10-21 吉林农业大学 Medicinal plant transcription regulation and control map prediction method
CN115240772A (en) * 2022-08-22 2022-10-25 南京医科大学 Method for analyzing active pathway in unicellular multiomics based on graph neural network
CN114462548B (en) * 2022-02-23 2023-07-18 曲阜师范大学 Method for improving accuracy of single-cell deep clustering algorithm
CN116452910A (en) * 2023-03-28 2023-07-18 河南科技大学 scRNA-seq data characteristic representation and cell type identification method based on graph neural network
CN116665786A (en) * 2023-07-21 2023-08-29 曲阜师范大学 RNA layered embedding clustering method based on graph convolution neural network
CN116844649A (en) * 2023-08-31 2023-10-03 杭州木攸目医疗数据有限公司 Interpretable cell data analysis method based on gene selection

Citations (3)

Publication number Priority date Publication date Assignee Title
CN111259979A (en) * 2020-02-10 2020-06-09 大连理工大学 Deep semi-supervised image clustering method based on label self-adaptive strategy
US20200218971A1 (en) * 2016-12-27 2020-07-09 Obschestvo S Ogranichennoy Otvetstvennostyu "Vizhnlabs" Training of deep neural networks on the basis of distributions of paired similarity measures
CN111785329A (en) * 2020-07-24 2020-10-16 中国人民解放军国防科技大学 Single-cell RNA sequencing clustering method based on confrontation automatic encoder


Non-Patent Citations (2)

Title
REN Wei, "Intrusion Detection Method Based on Sparse Autoencoder Deep Neural Network", Mobile Communications, no. 08, 15 August 2018 (2018-08-15) *
GAO Meijia, "A Preprocessing Algorithm for Single-cell RNA-seq Data Based on Loess Regression Weighting", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *


Also Published As

Publication number Publication date
CN114022693B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
Obayashi et al. Multi-objective design exploration for aerodynamic configurations
CN104751842B (en) The optimization method and system of deep neural network
CN113889192B (en) Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN112464004A (en) Multi-view depth generation image clustering method
US20230197205A1 (en) Bioretrosynthetic method and system based on and-or tree and single-step reaction template prediction
CN115952424A (en) Graph convolution neural network clustering method based on multi-view structure
CN116152554A (en) Knowledge-guided small sample image recognition system
CN116386729A (en) scRNA-seq data dimension reduction method based on graph neural network
CN113537365A (en) Multitask learning self-adaptive balancing method based on information entropy dynamic weighting
Rad et al. GP-RVM: Genetic programing-based symbolic regression using relevance vector machine
CN117497038B (en) Method for rapidly optimizing culture medium formula based on nuclear method
CN117556532A (en) Optimization method for multi-element matching of novel turbine disc pre-rotation system
Chowdhury et al. UICPC: centrality-based clustering for scRNA-seq data analysis without user input
Örkçü et al. A hybrid applied optimization algorithm for training multi-layer neural networks in data classification
CN115661498A (en) Self-optimization single cell clustering method
CN115906959A (en) Parameter training method of neural network model based on DE-BP algorithm
CN115618272A (en) Method for automatically identifying single cell type based on depth residual error generation algorithm
Bai et al. Clustering single-cell rna sequencing data by deep learning algorithm
Mrabah et al. Toward Convex Manifolds: A Geometric Perspective for Deep Graph Clustering of Single-cell RNA-seq Data.
CN113011091A (en) Automatic-grouping multi-scale light-weight deep convolution neural network optimization method
Liu et al. Multi-objective evolutionary algorithm for mining 3D clusters in gene-sample-time microarray data
CN114219069B (en) Brain effect connection network learning method based on automatic variation self-encoder
CN117912573B (en) Deep learning-based multi-level biomolecular network construction method
CN113506593B (en) Intelligent inference method for large-scale gene regulation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant