CN116665786A - RNA layered embedding clustering method based on graph convolution neural network - Google Patents

RNA layered embedding clustering method based on graph convolution neural network

Info

Publication number
CN116665786A
CN116665786A (application CN202310896057.3A)
Authority
CN
China
Prior art keywords
data
clustering
encoder
graph
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310896057.3A
Other languages
Chinese (zh)
Inventor
鲁大营
刘化社
孔晨曦
鲁克
曹鲁成
柴华
樊稳稳
刘原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qufu Normal University
Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd
Original Assignee
Qufu Normal University
Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qufu Normal University, Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd filed Critical Qufu Normal University
Priority to CN202310896057.3A priority Critical patent/CN116665786A/en
Publication of CN116665786A publication Critical patent/CN116665786A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 - Unsupervised data analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of bioinformatics, in particular to an RNA layered embedding clustering method based on a graph convolution neural network, which comprises the following steps: S1, data preprocessing; S2, data noise reduction; S3, data dimension reduction; S4, data clustering. In the noise-reduction step, the negative log-likelihood function of the zero-inflated negative binomial (ZINB) distribution is used as the loss function of the noise reduction self-encoder to handle the dropout noise in the data; in the dimension-reduction task, a dual-decoder graph convolution self-encoder is used to obtain the low-dimensional features of the data; and in the clustering task, the KL divergence function is used as the clustering loss function to perform deep embedded clustering, which promotes the improvement of clustering precision and achieves a better clustering effect.

Description

RNA layered embedding clustering method based on graph convolution neural network
Technical Field
The invention relates to the technical field of bioinformatics, in particular to an RNA layered embedding clustering method based on a graph convolution neural network.
Background
The thousands of gene species measured in a single cell cause a curse of dimensionality in RNA sequencing data, and the low RNA capture rate produces dropout noise, so the data contain a large number of false zero values. Single-cell RNA sequencing data therefore suffer from high dimensionality and strong noise at the same time.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an RNA layered embedding clustering method based on a graph convolution neural network, which solves the dropout-noise and curse-of-dimensionality problems existing in the data at the same time by fusing a data noise-reduction self-encoder with a data dimension-reduction graph convolution self-encoder, optimizes the clustering result by deep embedded clustering, and improves the clustering precision.
The invention is realized by the following technical scheme:
the RNA hierarchical embedding clustering method based on the graph convolution neural network comprises the following steps:
s1, data preprocessing: processing the read count data of single-cell RNA sequencing to obtain a data matrix X, wherein rows in the matrix represent cells, columns represent genes, and each cell has the same kind of genes;
s2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data;
s3, data dimension reduction:
selecting a graph convolution self-encoder with two decoders, generating a KNN graph from the noise-reduced data by using the K nearest neighbor algorithm, using the adjacency matrix A representing the graph structure as one input of the graph convolution self-encoder and the noise-reduced data matrix X representing the gene features of the graph nodes as the other input, and reconstructing the adjacency matrix and the node feature matrix of the input KNN graph data with the two decoders respectively, so that the latent space of the graph convolution self-encoder yields the learned low-dimensional data features together with the structural features among the data;
saving coding layer learned data feature vectors as potential variables of potential spaceh=e (X, a), variable is passed through two decodershDecoding to reconstruct data node matrix X r =D Xh) And reconstructing an edge matrix ar=d Ah) The total loss of the adjacency matrix a and the loss of the data node X are generated, and the whole learning process is a process of gradually reducing the total loss function until reaching a minimum value, and the total loss calculation formula of the graph convolution self-encoder is as follows:
Wherein lambda is a super parameter, the experiment sets the parameter lambda to 0.6, and the optimization loss function is continuously trained, so that the graph convolution self-encoder realizes the dimension reduction of single-cell RNA sequencing data and simultaneously maintains the topological structure characteristics among high-dimension data;
s4, data clustering:
initializing the graph convolution self-encoder to generate a target distribution P, using the KL divergence function as the clustering loss function for the latent-space data of the graph convolution self-encoder, and continuously updating the clustering parameters of the graph convolution self-encoder through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
Further, in step S1, the read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
Further, in step S2, the expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability;
the corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values.
Further, in step S2, the structure of the noise reduction self-encoder is an encoding layer, a bottleneck layer, and a decoding layer, the encoding layer and the decoding layer are symmetrical with respect to the bottleneck layer, and are all fully connected neural network layers, and a ReLU is used as an activation function of each layer of neural network.
Further, in step S2, each layer of neuron structure adopted by the coding layer of the model is d-256-54-32, where d is the dimension of input data, 32 is the number of neurons of the bottleneck layer, the structure of the decoding layer from left to right is 32-54-256-d, d is the dimension of output data, and is symmetrical to the structure of the coding layer, the batch size selected in the noise reduction self-encoder training process is 256, the gaussian noise intensity introduced in the coding layer is 2.5, and the model is optimized by using Adam optimization algorithm.
Further, in step S4, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
The invention has the beneficial effects that:
the fusion noise reduction and dimension reduction method in the clustering method constructed aiming at single-cell RNA sequencing data effectively solves the problems of high dimensionality and strong noise in the data at the same time, and improves the clustering precision. The deep noise reduction self-encoder with the function of interpolating a dropout event is realized in a layered mode, the double-decoding-diagram convolution dimension reduction self-encoder which simultaneously captures the topological structure characteristics of data and the self characteristics of the data is realized, and the KL divergence function is used as a clustered loss function to carry out deep embedding clustering.
In the noise reduction step, the ZINB distribution is adopted to fit the distribution of single-cell RNA sequencing data, and its negative log-likelihood function is used as the loss function to optimize the three target parameters, namely the mean μ, the dispersion θ and the dropout probability π, so that the dropout values in the data are effectively interpolated.
The graph self-encoder based on the graph convolution network reduces the dimensionality of the data while preserving the topological structure among the data, and effectively improves clustering efficiency by using the neighborhood information between cells. Experimental results show that fusing the data noise-reduction self-encoder with the data dimension-reduction graph convolution self-encoder solves the dropout-noise and curse-of-dimensionality problems existing in the data at the same time, and deep embedded clustering optimizes the clustering result. Data noise reduction, dimension reduction and clustering are realized hierarchically, each layer promoting the improvement of clustering precision; experiments on 9 real high-dimensional, high-noise data sets show that the clustering method achieves a better clustering effect than other traditional clustering methods.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Fig. 2 is a block diagram of a noise reduction self-encoder employed in the present invention.
FIG. 3 is a flow chart of the operation of the convolutional dimension-reduction self-encoder of the present invention.
Detailed Description
In order to clearly illustrate the technical characteristics of the scheme, the scheme is explained below through a specific embodiment.
An RNA layered embedding clustering method based on a graph convolution neural network comprises the following steps:
s1, data preprocessing: the read count data from single cell RNA sequencing is processed to obtain a data matrix X in which rows represent cells and columns represent genes, each cell having the same species of gene.
The read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
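By way of a non-limiting illustration, the preprocessing described above can be sketched with the scanpy package roughly as follows; the input file name, the h5ad format and the absence of explicit parameter values are assumptions rather than details taken from this disclosure:

import scanpy as sc

# illustrative sketch of step S1 (assumed file name and format)
adata = sc.read_h5ad("counts.h5ad")        # read count matrix: rows are cells, columns are genes

sc.pp.filter_genes(adata, min_cells=1)     # filter out genes with no expressed count value in any cell
sc.pp.normalize_total(adata)               # normalize the count matrix by library size (size factors)
sc.pp.log1p(adata)                         # logarithmic transformation of the read counts
sc.pp.scale(adata)                         # scale each gene to zero mean and unit variance

X = adata.X                                # preprocessed data matrix X used by the later steps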
S2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data.
The expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability.
The noise reduction self-encoder is used to estimate the mean μ, the dispersion θ and the dropout probability π of the ZINB distribution; the neural network of the DAE model therefore has three output layers instead of one, representing the three gene-wise parameters (μ, θ, π), and the three output layers keep the same dimension as the input layer.
The corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values. Because the mean μ and the dispersion θ are always positive, an exponential activation function is selected for their output layers; the additional parameter π represents the dropout probability of the input and lies in [0, 1], so a sigmoid activation function is selected for it.
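For illustration only, a minimal sketch of the ZINB negative log-likelihood used as the noise-reduction loss is given below; PyTorch is assumed as the framework, and the eps constant and the averaging over all matrix entries are assumptions:

import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Negative log-likelihood of the ZINB distribution.
    x: observed counts; mu: mean; theta: dispersion; pi: dropout probability."""
    # log-likelihood of the negative binomial component
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1.0)
              + theta * torch.log(theta / (theta + mu) + eps)
              + x * torch.log(mu / (theta + mu) + eps))
    nb_zero = torch.pow(theta / (theta + mu), theta)        # NB probability of observing a zero
    ll_zero = torch.log(pi + (1.0 - pi) * nb_zero + eps)    # zero-inflated case, x == 0
    ll_pos = torch.log(1.0 - pi + eps) + log_nb             # x > 0 case
    ll = torch.where(x < eps, ll_zero, ll_pos)
    return -ll.mean()                                       # NLL_ZINB averaged over all entries

Minimizing this quantity with respect to the three output heads optimizes the mean, dispersion and dropout probability as described above.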
As shown in fig. 2, the noise reduction self-encoder has a structure of an encoding layer, a bottleneck layer and a decoding layer, the neural network structures of the encoding layer and the decoding layer are symmetrical about the bottleneck layer and are all fully connected neural network layers, and a ReLU is adopted as an activation function of each layer of the neural network.
The coding layer of the model adopts a neuron structure of d-256-54-32, wherein d is the dimension of input data, 32 is the neuron number of a bottleneck layer, the structure of the decoding layer from left to right is 32-54-256-d, d is the dimension of output data, the decoding layer is symmetrical to the coding layer structure, the batch size selected in the noise reduction self-encoder training process is 256, the Gaussian noise intensity introduced in the coding layer is 2.5, and the model is optimized by using an Adam optimization algorithm.
The self-encoder is an artificial neural network that learns data features in an unsupervised manner. The noise reduction self-encoder is a variant of the self-encoder that becomes more robust in learning data features after noise is added at the coding layer, because it must learn from input data corrupted by small perturbations. The data are corrupted by introducing Gaussian noise at the coding layer, the main data features are learned by the multi-layer coding layers of the encoder, and the decoder restores the uncorrupted original data from these main features. The middle of the noise reduction self-encoder is a low-dimensional bottleneck layer, through which the data features learned by the encoder are output, and the output low-dimensional data are stored in the latent space.
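A minimal sketch of a noise reduction self-encoder with the d-256-54-32 coding layers, the mirrored decoding layers and the three ZINB output heads described in this embodiment is shown below; for brevity the Gaussian noise is injected only once at the input instead of at every coding layer, and the class and attribute names are illustrative assumptions:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d, noise_sd=2.5):
        super().__init__()
        self.noise_sd = noise_sd                           # Gaussian noise intensity
        self.encoder = nn.Sequential(                      # coding layers d-256-54-32
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 54), nn.ReLU(),
            nn.Linear(54, 32), nn.ReLU(),                  # 32-neuron bottleneck layer
        )
        self.decoder = nn.Sequential(                      # decoding layers 32-54-256
            nn.Linear(32, 54), nn.ReLU(),
            nn.Linear(54, 256), nn.ReLU(),
        )
        # three output heads with the same dimension as the input, one per ZINB parameter
        self.mu_head = nn.Linear(256, d)
        self.theta_head = nn.Linear(256, d)
        self.pi_head = nn.Linear(256, d)

    def forward(self, x):
        x_corrupt = x + self.noise_sd * torch.randn_like(x)   # X_corrupt = X + e, equation (1)
        z = self.encoder(x_corrupt)                            # latent (bottleneck) features
        h = self.decoder(z)
        mu = torch.exp(self.mu_head(h))                        # exponential keeps the mean positive
        theta = torch.exp(self.theta_head(h))                  # exponential keeps the dispersion positive
        pi = torch.sigmoid(self.pi_head(h))                    # dropout probability in [0, 1]
        return z, mu, theta, pi

Training would then minimize zinb_nll(x, mu, theta, pi) from the sketch above, for example with the Adam optimizer and batch size 256 as stated in this embodiment.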
S3, data dimension reduction:
the automatic graph rolling encoder with double decoders is selected, and is an artificial neural network for performing unsupervised learning and feature extraction on graph structure data, and a bottleneck layer of the automatic graph rolling encoder can obtain data features with lower dimensionality, so that the automatic graph rolling encoder is a good dimension reduction method for the graph structure data. And generating a KNN graph from the denoised data by using a K nearest neighbor algorithm, using an adjacent matrix A to represent a graph structure as one input of the graph rolling automatic encoder, using a denoised data matrix X representing graph node gene characteristics as the other input of the graph rolling automatic encoder, and respectively reconstructing the adjacent matrix and the node characteristic matrix of the input KNN graph data by using two decoders so that the graph rolling automatic encoder potential space generates learned low-dimensional data characteristics and structural characteristics among the data.
Saving the data feature vectors learned by the coding layer as the latent variable h = E(X, A) of the latent space; the variable h is decoded by the two decoders to reconstruct the data node matrix X_r = D_X(h) and the reconstructed edge matrix A_r = D_A(h), which yield the loss of the adjacency matrix A and the loss of the data node matrix X; the whole learning process gradually reduces the total loss function until it reaches a minimum value, and the total loss of the graph convolution self-encoder is calculated as:
L_r = L_X + λ·L_A #(4);
where L_X is the reconstruction loss of the data node matrix X, L_A is the reconstruction loss of the adjacency matrix A, and λ is a hyperparameter which the experiments set to 0.6; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while preserving the topological structure characteristics among the high-dimensional data.
The coding layer is set as two graph convolution layers. Bottleneck-layer sizes of 5, 10, 15, 20 and 30 neurons were tested, and the clustering indices on the data sets are generally highest when the bottleneck layer has 10 neurons, so the neuron structure of the two-layer graph convolution network in the encoder is 123-10. The data feature decoder D_X consists of four fully connected neural network layers whose neuron structure is set to 10-64-256-512. The adjacency matrix decoder D_A consists of a fully connected layer together with orthogonalization and activation functions. The dual-decoder structure of the graph convolution self-encoder learns both the features of the single-cell RNA sequencing data themselves and the structural features among cells; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while retaining the topological structure characteristics among the high-dimensional data.
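A simplified sketch of the dual-decoder graph convolution self-encoder described above follows (two graph convolution layers with a 10-dimensional bottleneck, the 10-64-256-512 feature decoder D_X and a fully connected adjacency decoder D_A). The hidden width of 123, the final projection back to the input dimension, the inner-product edge reconstruction and the concrete reconstruction losses are illustrative assumptions rather than details fixed by this disclosure:

import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + torch.eye(A.size(0), device=A.device)
    d = A_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(d.pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_norm, H):
        return F.relu(A_norm @ self.lin(H))            # H' = ReLU(A_norm * H * W)

class DualDecoderGCNAE(nn.Module):
    def __init__(self, in_dim, hid_dim=123, z_dim=10):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hid_dim)           # encoder: two graph convolution layers (123-10)
        self.gc2 = GCNLayer(hid_dim, z_dim)            # 10-dimensional bottleneck (latent space)
        self.dec_x = nn.Sequential(                    # D_X: node-feature decoder 10-64-256-512
            nn.Linear(z_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, in_dim),                    # assumed projection back to the input dimension
        )
        self.dec_a = nn.Linear(z_dim, z_dim)           # D_A: fully connected layer before edge reconstruction

    def forward(self, A, X):
        A_norm = normalize_adj(A)
        h = self.gc2(A_norm, self.gc1(A_norm, X))      # latent variable h = E(X, A)
        X_r = self.dec_x(h)                            # reconstructed data node matrix X_r = D_X(h)
        A_r = torch.sigmoid(self.dec_a(h) @ h.t())     # reconstructed edge matrix A_r = D_A(h)
        return h, X_r, A_r

def gcnae_loss(X, A, X_r, A_r, lam=0.6):
    """Total reconstruction loss: node-feature term plus lambda-weighted adjacency term."""
    return F.mse_loss(X_r, X) + lam * F.binary_cross_entropy(A_r, A)

Here gcnae_loss stands in for the total loss L_r of equation (4).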
S4, data clustering:
the noise reduction and dimension reduction steps before the step are used for reducing the interference noise of data and removing the redundant dimension of the data, and more accurate cell clustering is carried out, wherein the clustering is an important link of single-cell RNA sequencing data analysis, and the clustering is used for enabling the distance between cell clusters with similarity gene characteristics to be closer and enabling the distance between the cell clusters with different gene characteristics to be further.
Unlike the traditional K-means algorithm, which clusters the low-dimensional data directly, the method selects a deep-learning-based deep embedded clustering approach to cluster the low-dimensional data in the latent space generated by the graph convolution self-encoder module. Deep embedded clustering uses a deep neural network to learn the mapping of the data from the high-dimensional space to the low-dimensional space, updating the clustering distribution while updating the learned data features, and iteratively optimizing the clustering objective.
The graph convolution self-encoder is initialized to generate a target distribution P, the KL divergence function is used as the clustering loss function for the latent-space data of the graph convolution self-encoder, and the clustering parameters of the graph convolution self-encoder are continuously updated through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
Specifically, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
In a specific experiment, the initial cluster centroids are initialized by Louvain clustering on the initial embedded data obtained from neural network pretraining, and the model minimizes the clustering loss L_c by a stochastic gradient descent algorithm to update the set of cluster centroids. For the clustering loss L_c, a high-performance soft-assignment loss is chosen that uses the Student's t distribution as a kernel to measure the similarity between embedded nodes and centroids; L_c is the KL divergence of the empirical cluster distribution Q from the target distribution P. Pretraining the graph convolution self-encoder initializes the parameters and yields the initialization of the clustering target distribution P and the latent-space clustering distribution Q; the KL divergence loss between P and Q is called the clustering loss, which represents the difference between the two distributions, and the smaller this loss is, the more similar the generated clustering distribution is to the target distribution.
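A sketch of the deep embedded clustering computations described above is given below: the Student's t soft assignment Q, the sharpened target distribution P and the KL divergence clustering loss L_c. The α parameter of the Student's t kernel and the way the centroids are passed in are assumptions; centroid initialization by Louvain clustering is only indicated in a comment:

import torch
import torch.nn.functional as F

def soft_assign(h, centroids, alpha=1.0):
    """q_ij: Student's t similarity between embedded point h_i and cluster centroid j."""
    dist2 = torch.sum((h.unsqueeze(1) - centroids.unsqueeze(0)) ** 2, dim=2)
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)            # normalize so each row of Q sums to 1

def target_distribution(q):
    """p_ij: sharpen Q by squaring and renormalizing, first per cluster then per point."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    """L_c = KL(P || Q) = sum_i sum_j p_ij * log(p_ij / q_ij), as in equation (6)."""
    return F.kl_div(q.log(), p, reduction="batchmean")

# usage sketch: centroids initialized from Louvain clustering of the pretrained embedding,
# then the combined loss L = L_r + gamma * L_c with gamma = 2.5 is minimized:
# q = soft_assign(h, centroids)
# p = target_distribution(q).detach()
# loss = gcnae_loss(X, A, X_r, A_r) + 2.5 * clustering_loss(q, p)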
The clustering method constructed in the invention realizes noise reduction, dimension reduction and clustering of the data in a hierarchical manner; the overall architecture is shown in Fig. 1 and mainly comprises four modules: data preprocessing, data noise reduction, dimension reduction and data clustering. First, the original data are preprocessed and input into the deep noise reduction self-encoder (DDAE) module to reduce the noise of the data and resolve the dropout events in the data; a KNN graph is then constructed from the denoised data, and the graph-structured data are input into the graph convolution self-encoder (GCNAE) module, which reconstructs the node information and the graph adjacency matrix, obtains the low-dimensional data features and the intercellular structural features in the latent space of the graph convolution self-encoder, and thereby realizes dimension reduction of the data; finally, the cells are clustered from the GCNAE latent-space data by the deep embedded clustering method.
The preprocessed data set is input to the first core module, the deep noise reduction self-encoder. The noise reduction module adopts a stacked self-encoding neural network whose loss function is changed from the traditional mean squared error to the negative log-likelihood of the ZINB distribution, which realizes calculation and interpolation of the dropout noise in the single-cell RNA sequencing data; the dimensionality of the data is reduced through the coding layers of the self-encoder, and the learned denoised data feature vectors are output to the latent space. After the denoised data are saved, a KNN graph is constructed with the K nearest neighbor algorithm, and the adjacency matrix A and the node feature matrix X of the graph are input simultaneously into the second core module, the graph convolution self-encoder; its two decoders decode the node feature information and the adjacency matrix information respectively, so that the features of the single-cell RNA sequencing data and the structural features among cells are captured at the same time, and the low-dimensional data space contains both the cell gene feature information and the intercellular structural information. Finally, deep embedded clustering is carried out in the third core module: the graph convolution self-encoder is initialized to generate the target distribution P, the KL divergence function is used as the clustering loss function for the latent-space data of the graph convolution self-encoder, and the clustering parameters are continuously updated through iterative training until the clustering distribution Q output by the coding layer essentially fits the target distribution P; the optimal clustering result is generated over the whole clustering update process, and the model thus completes noise reduction, dimension reduction and clustering of the single-cell RNA sequencing data.
Of course, the present invention is not limited to the above examples; technical features of the present invention that are not described may be implemented by means of the prior art and are not described here again. The above examples and drawings are only intended to illustrate the technical scheme of the present invention and not to limit it. The present invention has been described in detail with reference to preferred embodiments, and those skilled in the art should understand that changes, modifications, additions or substitutions made without departing from the spirit of the present invention also fall within the scope of the appended claims.

Claims (6)

1. An RNA layered embedding clustering method based on a graph convolution neural network, characterized by comprising the following steps:
s1, data preprocessing: processing the read count data of single-cell RNA sequencing to obtain a data matrix X, wherein rows in the matrix represent cells, columns represent genes, and each cell has the same kind of genes;
s2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data;
s3, data dimension reduction:
selecting a graph convolution self-encoder with two decoders, generating a KNN graph from the noise-reduced data by using the K nearest neighbor algorithm, using the adjacency matrix A representing the graph structure as one input of the graph convolution self-encoder and the noise-reduced data matrix X representing the gene features of the graph nodes as the other input, and reconstructing the adjacency matrix and the node feature matrix of the input KNN graph data with the two decoders respectively, so that the latent space of the graph convolution self-encoder yields the learned low-dimensional data features together with the structural features among the data;
saving the data feature vectors learned by the coding layer as the latent variable h = E(X, A) of the latent space; the variable h is decoded by the two decoders to reconstruct the data node matrix X_r = D_X(h) and the reconstructed edge matrix A_r = D_A(h), which yield the loss of the adjacency matrix A and the loss of the data node matrix X; the whole learning process gradually reduces the total loss function until it reaches a minimum value, and the total loss of the graph convolution self-encoder is calculated as:
L_r = L_X + λ·L_A #(4);
where L_X is the reconstruction loss of the data node matrix X, L_A is the reconstruction loss of the adjacency matrix A, and λ is a hyperparameter which the experiments set to 0.6; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while preserving the topological structure characteristics among the high-dimensional data;
s4, data clustering:
initializing the graph convolution self-encoder to generate a target distribution P, using the KL divergence function as the clustering loss function for the latent-space data of the graph convolution self-encoder, and continuously updating the clustering parameters of the graph convolution self-encoder through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
2. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S1, the read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
3. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S2, the expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability;
the corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values.
4. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S2, the structure of the noise reduction self-encoder is an encoding layer, a bottleneck layer and a decoding layer; the encoding layer and the decoding layer are symmetrical with respect to the bottleneck layer and are all fully connected neural network layers, and a ReLU is used as the activation function of each neural network layer.
5. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 4, characterized in that: in step S2, the neuron structure of the coding layer of the model is d-256-54-32, where d is the dimension of the input data and 32 is the number of neurons of the bottleneck layer; the decoding layer structure from left to right is 32-54-256-d, where d is the dimension of the output data, symmetrical to the coding layer structure; the batch size selected in the noise reduction self-encoder training process is 256, the intensity of the Gaussian noise introduced in the coding layer is 2.5, and the model is optimized with the Adam optimization algorithm.
6. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S4, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
CN202310896057.3A 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network Pending CN116665786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310896057.3A CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310896057.3A CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Publications (1)

Publication Number Publication Date
CN116665786A true CN116665786A (en) 2023-08-29

Family

ID=87715486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310896057.3A Pending CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN116665786A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235584A (en) * 2023-11-15 2023-12-15 之江实验室 Picture data classification method, device, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889192A (en) * 2021-09-29 2022-01-04 西安热工研究院有限公司 Single cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN114022693A (en) * 2021-09-29 2022-02-08 西安热工研究院有限公司 Double-self-supervision-based single-cell RNA-seq data clustering method
CN115661498A (en) * 2022-11-09 2023-01-31 云南大学 Self-optimization single cell clustering method
CN116386729A (en) * 2022-12-23 2023-07-04 湖南大学 scRNA-seq data dimension reduction method based on graph neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889192A (en) * 2021-09-29 2022-01-04 西安热工研究院有限公司 Single cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN114022693A (en) * 2021-09-29 2022-02-08 西安热工研究院有限公司 Double-self-supervision-based single-cell RNA-seq data clustering method
CN115661498A (en) * 2022-11-09 2023-01-31 云南大学 Self-optimization single cell clustering method
CN116386729A (en) * 2022-12-23 2023-07-04 湖南大学 scRNA-seq data dimension reduction method based on graph neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GOKCEN ERASLAN et al.: "Single-cell RNA-seq denoising using a deep count autoencoder", NATURE COMMUNICATIONS, pages 1 - 3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235584A (en) * 2023-11-15 2023-12-15 之江实验室 Picture data classification method, device, electronic device and storage medium
CN117235584B (en) * 2023-11-15 2024-04-02 之江实验室 Picture data classification method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN107622182B (en) Method and system for predicting local structural features of protein
CN112418406B (en) Wind power tower inclination angle missing data supplementing method based on SSA-LSTM model
JP2002230514A (en) Evolutionary optimizing method
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
CN112464004A (en) Multi-view depth generation image clustering method
CN107609648B (en) Genetic algorithm combined with stacking noise reduction sparse automatic encoder
CN114970774B (en) Intelligent transformer fault prediction method and device
CN116665786A (en) RNA layered embedding clustering method based on graph convolution neural network
CN113361606A (en) Deep map attention confrontation variational automatic encoder training method and system
CN110674774A (en) Improved deep learning facial expression recognition method and system
CN112733273A (en) Method for determining Bayesian network parameters based on genetic algorithm and maximum likelihood estimation
CN112884149A (en) Deep neural network pruning method and system based on random sensitivity ST-SM
CN115732034A (en) Identification method and system of spatial transcriptome cell expression pattern
CN114925767A (en) Scene generation method and device based on variational self-encoder
CN115062587B (en) Knowledge graph embedding and replying generation method based on surrounding information
CN109214401B (en) SAR image classification method and device based on hierarchical automatic encoder
CN115640842A (en) Network representation learning method based on graph attention self-encoder
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN113743301B (en) Solid-state nanopore sequencing electric signal noise reduction processing method based on residual self-encoder convolutional neural network
CN112329918A (en) Anti-regularization network embedding method based on attention mechanism
CN117375983A (en) Power grid false data injection identification method based on improved CNN-LSTM
CN117196963A (en) Point cloud denoising method based on noise reduction self-encoder
CN112668378A (en) Facial expression recognition method based on combination of image fusion and convolutional neural network
CN115661498A (en) Self-optimization single cell clustering method
CN113344202A (en) Novel deep multi-core learning network model training method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230829