CN116665786A - RNA layered embedding clustering method based on graph convolution neural network - Google Patents

RNA layered embedding clustering method based on graph convolution neural network

Info

Publication number
CN116665786A
CN116665786A (application CN202310896057.3A)
Authority
CN
China
Prior art keywords
data
clustering
encoder
graph
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310896057.3A
Other languages
Chinese (zh)
Inventor
鲁大营
刘化社
孔晨曦
鲁克
曹鲁成
柴华
樊稳稳
刘原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qufu Normal University
Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd
Original Assignee
Qufu Normal University
Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qufu Normal University, Shanxian Power Supply Co of State Grid Shandong Electric Power Co Ltd filed Critical Qufu Normal University
Priority to CN202310896057.3A priority Critical patent/CN116665786A/en
Publication of CN116665786A publication Critical patent/CN116665786A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 - Unsupervised data analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of bioinformatics, in particular to an RNA layered embedding clustering method based on a graph convolution neural network, which comprises the following steps: S1, data preprocessing; S2, data noise reduction; S3, data dimension reduction; S4, data clustering. In the noise-reduction step, the negative log-likelihood function of the zero-inflated negative binomial (ZINB) distribution is used as the loss function of the noise reduction self-encoder to handle the dropout noise in the data; in the dimension-reduction task, a dual-decoder graph convolution self-encoder is used to obtain the low-dimensional features of the data; and in the clustering task, the KL divergence function is used as the clustering loss function to perform deep embedded clustering, which promotes the improvement of clustering precision and achieves a better clustering effect.

Description

RNA layered embedding clustering method based on graph convolution neural network
Technical Field
The invention relates to the technical field of bioinformatics, in particular to an RNA layered embedding clustering method based on a graph convolution neural network.
Background
The thousands of gene species measured in a single cell cause a curse of dimensionality in RNA sequencing data, and the low RNA capture rate produces dropout noise, so the data contain a large number of false zero values. Single-cell RNA sequencing data therefore suffer from high dimensionality and strong noise at the same time.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an RNA layered embedding clustering method based on a graph convolution neural network, which solves the dropout-noise and curse-of-dimensionality problems existing in the data at the same time by fusing a data noise-reduction self-encoder with a data dimension-reduction graph convolution self-encoder, optimizes the clustering result by deep embedded clustering, and improves the clustering precision.
The invention is realized by the following technical scheme:
the RNA hierarchical embedding clustering method based on the graph convolution neural network comprises the following steps:
s1, data preprocessing: processing the read count data of single-cell RNA sequencing to obtain a data matrix X, wherein rows in the matrix represent cells, columns represent genes, and each cell has the same kind of genes;
s2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data;
s3, data dimension reduction:
selecting a graph convolution self-encoder with two decoders, generating a KNN graph from the noise-reduced data by using the K nearest neighbor algorithm, using the adjacency matrix A representing the graph structure as one input of the graph convolution self-encoder and the noise-reduced data matrix X representing the gene features of the graph nodes as the other input, and reconstructing the adjacency matrix and the node feature matrix of the input KNN graph data with the two decoders respectively, so that the latent space of the graph convolution self-encoder yields the learned low-dimensional data features together with the structural features among the data;
saving coding layer learned data feature vectors as potential variables of potential spaceh=e (X, a), variable is passed through two decodershDecoding to reconstruct data node matrix X r =D Xh) And reconstructing an edge matrix ar=d Ah) The total loss of the adjacency matrix a and the loss of the data node X are generated, and the whole learning process is a process of gradually reducing the total loss function until reaching a minimum value, and the total loss calculation formula of the graph convolution self-encoder is as follows:
Wherein lambda is a super parameter, the experiment sets the parameter lambda to 0.6, and the optimization loss function is continuously trained, so that the graph convolution self-encoder realizes the dimension reduction of single-cell RNA sequencing data and simultaneously maintains the topological structure characteristics among high-dimension data;
s4, data clustering:
initializing the graph convolution self-encoder to generate a target distribution P, using the KL divergence function as the clustering loss function for the latent-space data of the graph convolution self-encoder, and continuously updating the clustering parameters of the graph convolution self-encoder through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
Further, in step S1, the read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
Further, in step S2, the expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability;
the corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values.
Further, in step S2, the structure of the noise reduction self-encoder is an encoding layer, a bottleneck layer, and a decoding layer, the encoding layer and the decoding layer are symmetrical with respect to the bottleneck layer, and are all fully connected neural network layers, and a ReLU is used as an activation function of each layer of neural network.
Further, in step S2, each layer of neuron structure adopted by the coding layer of the model is d-256-54-32, where d is the dimension of input data, 32 is the number of neurons of the bottleneck layer, the structure of the decoding layer from left to right is 32-54-256-d, d is the dimension of output data, and is symmetrical to the structure of the coding layer, the batch size selected in the noise reduction self-encoder training process is 256, the gaussian noise intensity introduced in the coding layer is 2.5, and the model is optimized by using Adam optimization algorithm.
Further, in step S4, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
The invention has the beneficial effects that:
the fusion noise reduction and dimension reduction method in the clustering method constructed aiming at single-cell RNA sequencing data effectively solves the problems of high dimensionality and strong noise in the data at the same time, and improves the clustering precision. The deep noise reduction self-encoder with the function of interpolating a dropout event is realized in a layered mode, the double-decoding-diagram convolution dimension reduction self-encoder which simultaneously captures the topological structure characteristics of data and the self characteristics of the data is realized, and the KL divergence function is used as a clustered loss function to carry out deep embedding clustering.
In the noise reduction step, the ZINB distribution is adopted to fit the distribution of single-cell RNA sequencing data, and its negative log-likelihood function is used as the loss function to optimize the three target parameters, namely the mean μ, the dispersion θ and the dropout probability π, so that the dropout values in the data are effectively interpolated.
The graph self-encoder based on the graph convolution network reduces the dimensionality of the data while preserving the topological structure among the data, and effectively improves clustering efficiency by using the neighborhood information between cells. Experimental results show that fusing the data noise-reduction self-encoder with the data dimension-reduction graph convolution self-encoder solves the dropout-noise and curse-of-dimensionality problems existing in the data at the same time, and deep embedded clustering optimizes the clustering result. Data noise reduction, dimension reduction and clustering are realized hierarchically, each layer promoting the improvement of clustering precision; experiments on 9 real high-dimensional, high-noise data sets show that the clustering method achieves a better clustering effect than other traditional clustering methods.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Fig. 2 is a block diagram of a noise reduction self-encoder employed in the present invention.
FIG. 3 is a flow chart of the operation of the convolutional dimension-reduction self-encoder of the present invention.
Detailed Description
In order to clearly illustrate the technical characteristics of the scheme, the scheme is explained below through a specific embodiment.
An RNA layered embedding clustering method based on a graph convolution neural network comprises the following steps:
s1, data preprocessing: the read count data from single cell RNA sequencing is processed to obtain a data matrix X in which rows represent cells and columns represent genes, each cell having the same species of gene.
The read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
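By way of a non-limiting illustration, the preprocessing described above can be sketched with the scanpy package roughly as follows; the input file name, the h5ad format and the absence of explicit parameter values are assumptions rather than details taken from this disclosure:

import scanpy as sc

# illustrative sketch of step S1 (assumed file name and format)
adata = sc.read_h5ad("counts.h5ad")        # read count matrix: rows are cells, columns are genes

sc.pp.filter_genes(adata, min_cells=1)     # filter out genes with no expressed count value in any cell
sc.pp.normalize_total(adata)               # normalize the count matrix by library size (size factors)
sc.pp.log1p(adata)                         # logarithmic transformation of the read counts
sc.pp.scale(adata)                         # scale each gene to zero mean and unit variance

X = adata.X                                # preprocessed data matrix X used by the later steps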
S2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data.
The expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability.
The noise reduction self-encoder is used to estimate the mean μ, the dispersion θ and the dropout probability π of the ZINB distribution; the neural network of the DAE model therefore has three output layers instead of one, representing the three gene-wise parameters (μ, θ, π), and the three output layers keep the same dimension as the input layer.
The corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values. Because the mean μ and the dispersion θ are always positive, an exponential activation function is selected for their output layers; the additional parameter π represents the dropout probability of the input and lies in [0, 1], so a sigmoid activation function is selected for it.
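For illustration only, a minimal sketch of the ZINB negative log-likelihood used as the noise-reduction loss is given below; PyTorch is assumed as the framework, and the eps constant and the averaging over all matrix entries are assumptions:

import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Negative log-likelihood of the ZINB distribution.
    x: observed counts; mu: mean; theta: dispersion; pi: dropout probability."""
    # log-likelihood of the negative binomial component
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1.0)
              + theta * torch.log(theta / (theta + mu) + eps)
              + x * torch.log(mu / (theta + mu) + eps))
    nb_zero = torch.pow(theta / (theta + mu), theta)        # NB probability of observing a zero
    ll_zero = torch.log(pi + (1.0 - pi) * nb_zero + eps)    # zero-inflated case, x == 0
    ll_pos = torch.log(1.0 - pi + eps) + log_nb             # x > 0 case
    ll = torch.where(x < eps, ll_zero, ll_pos)
    return -ll.mean()                                       # NLL_ZINB averaged over all entries

Minimizing this quantity with respect to the three output heads optimizes the mean, dispersion and dropout probability as described above.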
As shown in fig. 2, the noise reduction self-encoder has a structure of an encoding layer, a bottleneck layer and a decoding layer, the neural network structures of the encoding layer and the decoding layer are symmetrical about the bottleneck layer and are all fully connected neural network layers, and a ReLU is adopted as an activation function of each layer of the neural network.
The coding layer of the model adopts a neuron structure of d-256-54-32, wherein d is the dimension of input data, 32 is the neuron number of a bottleneck layer, the structure of the decoding layer from left to right is 32-54-256-d, d is the dimension of output data, the decoding layer is symmetrical to the coding layer structure, the batch size selected in the noise reduction self-encoder training process is 256, the Gaussian noise intensity introduced in the coding layer is 2.5, and the model is optimized by using an Adam optimization algorithm.
The self-encoder is an artificial neural network that learns data features in an unsupervised manner. The noise reduction self-encoder is a variant of the self-encoder that becomes more robust in learning data features after noise is added at the coding layer, because it must learn from input data corrupted by small perturbations. The data are corrupted by introducing Gaussian noise at the coding layer, the main data features are learned by the multi-layer coding layers of the encoder, and the decoder restores the uncorrupted original data from these main features. The middle of the noise reduction self-encoder is a low-dimensional bottleneck layer, through which the data features learned by the encoder are output, and the output low-dimensional data are stored in the latent space.
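A minimal sketch of a noise reduction self-encoder with the d-256-54-32 coding layers, the mirrored decoding layers and the three ZINB output heads described in this embodiment is shown below; for brevity the Gaussian noise is injected only once at the input instead of at every coding layer, and the class and attribute names are illustrative assumptions:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d, noise_sd=2.5):
        super().__init__()
        self.noise_sd = noise_sd                           # Gaussian noise intensity
        self.encoder = nn.Sequential(                      # coding layers d-256-54-32
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 54), nn.ReLU(),
            nn.Linear(54, 32), nn.ReLU(),                  # 32-neuron bottleneck layer
        )
        self.decoder = nn.Sequential(                      # decoding layers 32-54-256
            nn.Linear(32, 54), nn.ReLU(),
            nn.Linear(54, 256), nn.ReLU(),
        )
        # three output heads with the same dimension as the input, one per ZINB parameter
        self.mu_head = nn.Linear(256, d)
        self.theta_head = nn.Linear(256, d)
        self.pi_head = nn.Linear(256, d)

    def forward(self, x):
        x_corrupt = x + self.noise_sd * torch.randn_like(x)   # X_corrupt = X + e, equation (1)
        z = self.encoder(x_corrupt)                            # latent (bottleneck) features
        h = self.decoder(z)
        mu = torch.exp(self.mu_head(h))                        # exponential keeps the mean positive
        theta = torch.exp(self.theta_head(h))                  # exponential keeps the dispersion positive
        pi = torch.sigmoid(self.pi_head(h))                    # dropout probability in [0, 1]
        return z, mu, theta, pi

Training would then minimize zinb_nll(x, mu, theta, pi) from the sketch above, for example with the Adam optimizer and batch size 256 as stated in this embodiment.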
S3, data dimension reduction:
the automatic graph rolling encoder with double decoders is selected, and is an artificial neural network for performing unsupervised learning and feature extraction on graph structure data, and a bottleneck layer of the automatic graph rolling encoder can obtain data features with lower dimensionality, so that the automatic graph rolling encoder is a good dimension reduction method for the graph structure data. And generating a KNN graph from the denoised data by using a K nearest neighbor algorithm, using an adjacent matrix A to represent a graph structure as one input of the graph rolling automatic encoder, using a denoised data matrix X representing graph node gene characteristics as the other input of the graph rolling automatic encoder, and respectively reconstructing the adjacent matrix and the node characteristic matrix of the input KNN graph data by using two decoders so that the graph rolling automatic encoder potential space generates learned low-dimensional data characteristics and structural characteristics among the data.
Saving the data feature vectors learned by the coding layer as the latent variable h = E(X, A) of the latent space; the variable h is decoded by the two decoders to reconstruct the data node matrix X_r = D_X(h) and the reconstructed edge matrix A_r = D_A(h), which yield the loss of the adjacency matrix A and the loss of the data node matrix X; the whole learning process gradually reduces the total loss function until it reaches a minimum value, and the total loss of the graph convolution self-encoder is calculated as:
L_r = L_X + λ·L_A #(4);
where L_X is the reconstruction loss of the data node matrix X, L_A is the reconstruction loss of the adjacency matrix A, and λ is a hyperparameter which the experiments set to 0.6; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while preserving the topological structure characteristics among the high-dimensional data.
The coding layer is set as two graph convolution layers. Bottleneck-layer sizes of 5, 10, 15, 20 and 30 neurons were tested, and the clustering indices on the data sets are generally highest when the bottleneck layer has 10 neurons, so the neuron structure of the two-layer graph convolution network in the encoder is 123-10. The data feature decoder D_X consists of four fully connected neural network layers whose neuron structure is set to 10-64-256-512. The adjacency matrix decoder D_A consists of a fully connected layer together with orthogonalization and activation functions. The dual-decoder structure of the graph convolution self-encoder learns both the features of the single-cell RNA sequencing data themselves and the structural features among cells; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while retaining the topological structure characteristics among the high-dimensional data.
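A simplified sketch of the dual-decoder graph convolution self-encoder described above follows (two graph convolution layers with a 10-dimensional bottleneck, the 10-64-256-512 feature decoder D_X and a fully connected adjacency decoder D_A). The hidden width of 123, the final projection back to the input dimension, the inner-product edge reconstruction and the concrete reconstruction losses are illustrative assumptions rather than details fixed by this disclosure:

import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + torch.eye(A.size(0), device=A.device)
    d = A_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(d.pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_norm, H):
        return F.relu(A_norm @ self.lin(H))            # H' = ReLU(A_norm * H * W)

class DualDecoderGCNAE(nn.Module):
    def __init__(self, in_dim, hid_dim=123, z_dim=10):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hid_dim)           # encoder: two graph convolution layers (123-10)
        self.gc2 = GCNLayer(hid_dim, z_dim)            # 10-dimensional bottleneck (latent space)
        self.dec_x = nn.Sequential(                    # D_X: node-feature decoder 10-64-256-512
            nn.Linear(z_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, in_dim),                    # assumed projection back to the input dimension
        )
        self.dec_a = nn.Linear(z_dim, z_dim)           # D_A: fully connected layer before edge reconstruction

    def forward(self, A, X):
        A_norm = normalize_adj(A)
        h = self.gc2(A_norm, self.gc1(A_norm, X))      # latent variable h = E(X, A)
        X_r = self.dec_x(h)                            # reconstructed data node matrix X_r = D_X(h)
        A_r = torch.sigmoid(self.dec_a(h) @ h.t())     # reconstructed edge matrix A_r = D_A(h)
        return h, X_r, A_r

def gcnae_loss(X, A, X_r, A_r, lam=0.6):
    """Total reconstruction loss: node-feature term plus lambda-weighted adjacency term."""
    return F.mse_loss(X_r, X) + lam * F.binary_cross_entropy(A_r, A)

Here gcnae_loss stands in for the total loss L_r of equation (4).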
S4, data clustering:
the noise reduction and dimension reduction steps before the step are used for reducing the interference noise of data and removing the redundant dimension of the data, and more accurate cell clustering is carried out, wherein the clustering is an important link of single-cell RNA sequencing data analysis, and the clustering is used for enabling the distance between cell clusters with similarity gene characteristics to be closer and enabling the distance between the cell clusters with different gene characteristics to be further.
Unlike the traditional K-means algorithm, which clusters the low-dimensional data directly, the method selects a deep-learning-based deep embedded clustering approach to cluster the low-dimensional data in the latent space generated by the graph convolution self-encoder module. Deep embedded clustering uses a deep neural network to learn the mapping of the data from the high-dimensional space to the low-dimensional space, updating the clustering distribution while updating the learned data features, and iteratively optimizing the clustering objective.
The graph convolution self-encoder is initialized to generate a target distribution P, the KL divergence function is used as the clustering loss function for the latent-space data of the graph convolution self-encoder, and the clustering parameters of the graph convolution self-encoder are continuously updated through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
Specifically, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
In a specific experiment, the initial cluster centroids are initialized by Louvain clustering on the initial embedded data obtained from neural network pretraining, and the model minimizes the clustering loss L_c by a stochastic gradient descent algorithm to update the set of cluster centroids. For the clustering loss L_c, a high-performance soft-assignment loss is chosen that uses the Student's t distribution as a kernel to measure the similarity between embedded nodes and centroids; L_c is the KL divergence of the empirical cluster distribution Q from the target distribution P. Pretraining the graph convolution self-encoder initializes the parameters and yields the initialization of the clustering target distribution P and the latent-space clustering distribution Q; the KL divergence loss between P and Q is called the clustering loss, which represents the difference between the two distributions, and the smaller this loss is, the more similar the generated clustering distribution is to the target distribution.
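A sketch of the deep embedded clustering computations described above is given below: the Student's t soft assignment Q, the sharpened target distribution P and the KL divergence clustering loss L_c. The α parameter of the Student's t kernel and the way the centroids are passed in are assumptions; centroid initialization by Louvain clustering is only indicated in a comment:

import torch
import torch.nn.functional as F

def soft_assign(h, centroids, alpha=1.0):
    """q_ij: Student's t similarity between embedded point h_i and cluster centroid j."""
    dist2 = torch.sum((h.unsqueeze(1) - centroids.unsqueeze(0)) ** 2, dim=2)
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)            # normalize so each row of Q sums to 1

def target_distribution(q):
    """p_ij: sharpen Q by squaring and renormalizing, first per cluster then per point."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    """L_c = KL(P || Q) = sum_i sum_j p_ij * log(p_ij / q_ij), as in equation (6)."""
    return F.kl_div(q.log(), p, reduction="batchmean")

# usage sketch: centroids initialized from Louvain clustering of the pretrained embedding,
# then the combined loss L = L_r + gamma * L_c with gamma = 2.5 is minimized:
# q = soft_assign(h, centroids)
# p = target_distribution(q).detach()
# loss = gcnae_loss(X, A, X_r, A_r) + 2.5 * clustering_loss(q, p)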
The clustering method constructed in the invention realizes noise reduction, dimension reduction and clustering of the data in a hierarchical manner; the overall architecture is shown in Fig. 1 and mainly comprises four modules: data preprocessing, data noise reduction, dimension reduction and data clustering. First, the original data are preprocessed and input into the deep noise reduction self-encoder (DDAE) module to reduce the noise of the data and resolve the dropout events in the data; a KNN graph is then constructed from the denoised data, and the graph-structured data are input into the graph convolution self-encoder (GCNAE) module, which reconstructs the node information and the graph adjacency matrix, obtains the low-dimensional data features and the intercellular structural features in the latent space of the graph convolution self-encoder, and thereby realizes dimension reduction of the data; finally, the cells are clustered from the GCNAE latent-space data by the deep embedded clustering method.
The preprocessed data set is input to the first core module, the deep noise reduction self-encoder. The noise reduction module adopts a stacked self-encoding neural network whose loss function is changed from the traditional mean squared error to the negative log-likelihood of the ZINB distribution, which realizes calculation and interpolation of the dropout noise in the single-cell RNA sequencing data; the dimensionality of the data is reduced through the coding layers of the self-encoder, and the learned denoised data feature vectors are output to the latent space. After the denoised data are saved, a KNN graph is constructed with the K nearest neighbor algorithm, and the adjacency matrix A and the node feature matrix X of the graph are input simultaneously into the second core module, the graph convolution self-encoder; its two decoders decode the node feature information and the adjacency matrix information respectively, so that the features of the single-cell RNA sequencing data and the structural features among cells are captured at the same time, and the low-dimensional data space contains both the cell gene feature information and the intercellular structural information. Finally, deep embedded clustering is carried out in the third core module: the graph convolution self-encoder is initialized to generate the target distribution P, the KL divergence function is used as the clustering loss function for the latent-space data of the graph convolution self-encoder, and the clustering parameters are continuously updated through iterative training until the clustering distribution Q output by the coding layer essentially fits the target distribution P; the optimal clustering result is generated over the whole clustering update process, and the model thus completes noise reduction, dimension reduction and clustering of the single-cell RNA sequencing data.
Of course, the present invention is not limited to the above examples; technical features of the present invention that are not described may be implemented by means of the prior art and are not described here again. The above examples and drawings are only intended to illustrate the technical scheme of the present invention and not to limit it. The present invention has been described in detail with reference to preferred embodiments, and those skilled in the art should understand that changes, modifications, additions or substitutions made without departing from the spirit of the present invention also fall within the scope of the appended claims.

Claims (6)

1. An RNA layered embedding clustering method based on a graph convolution neural network, characterized by comprising the following steps:
s1, data preprocessing: processing the read count data of single-cell RNA sequencing to obtain a data matrix X, wherein rows in the matrix represent cells, columns represent genes, and each cell has the same kind of genes;
s2, data noise reduction:
inputting the preprocessed data matrix X into the noise reduction self-encoder, and introducing random Gaussian noise e into each layer of the coding layer of the noise reduction self-encoder; adding the noise to the data matrix X gives the corrupted matrix:
X_corrupt = X + e #(1);
with encoder function z = f_w(X_corrupt) and decoder function X' = g_w'(z); the noise reduction self-encoder adopts the negative log-likelihood function of the ZINB distribution as its loss function, and realizes data noise reduction by calculating and interpolating the dropout values in the data;
s3, data dimension reduction:
selecting a graph convolution self-encoder with two decoders, generating a KNN graph from the noise-reduced data by using the K nearest neighbor algorithm, using the adjacency matrix A representing the graph structure as one input of the graph convolution self-encoder and the noise-reduced data matrix X representing the gene features of the graph nodes as the other input, and reconstructing the adjacency matrix and the node feature matrix of the input KNN graph data with the two decoders respectively, so that the latent space of the graph convolution self-encoder yields the learned low-dimensional data features together with the structural features among the data;
saving the data feature vectors learned by the coding layer as the latent variable h = E(X, A) of the latent space; the variable h is decoded by the two decoders to reconstruct the data node matrix X_r = D_X(h) and the reconstructed edge matrix A_r = D_A(h), which yield the loss of the adjacency matrix A and the loss of the data node matrix X; the whole learning process gradually reduces the total loss function until it reaches a minimum value, and the total loss of the graph convolution self-encoder is calculated as:
L_r = L_X + λ·L_A #(4);
where L_X is the reconstruction loss of the data node matrix X, L_A is the reconstruction loss of the adjacency matrix A, and λ is a hyperparameter which the experiments set to 0.6; the loss function is continuously trained and optimized, so that the graph convolution self-encoder realizes dimension reduction of the single-cell RNA sequencing data while preserving the topological structure characteristics among the high-dimensional data;
s4, data clustering:
initializing the graph convolution self-encoder to generate a target distribution P, using the KL divergence function as the clustering loss function for the latent-space data of the graph convolution self-encoder, and continuously updating the clustering parameters of the graph convolution self-encoder through iterative training until the clustering distribution Q output by the coding layer fits the target distribution P, so that the optimal clustering result is generated over the whole clustering update process.
2. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S1, the read count data of single-cell RNA sequencing are preprocessed with the scanpy Python package: genes that have no count value expressed in any cell are filtered out, the count matrix is normalized by library size and the size factors are calculated, and the read counts are then log-transformed and scaled so that the count values follow zero mean and unit variance.
3. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S2, the expression of the ZINB distribution is:
ZINB(x; π, μ, θ) = π·δ_0(x) + (1 − π)·NB(x; μ, θ), NB(x; μ, θ) = (Γ(x + θ) / (Γ(θ)·Γ(x + 1))) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x #(2);
where μ is the mean, θ is the dispersion, and π is the dropout probability;
the corresponding optimization target is:
min over M, Θ, Π: NLL_ZINB(X; M, Θ, Π) #(3);
where M, Θ and Π denote the matrix forms of the mean μ, the dispersion θ and the dropout probability π respectively, and NLL_ZINB denotes the negative log-likelihood function of the ZINB distribution; this loss function optimizes the mean, the dispersion and the dropout probability, and data noise reduction is realized by calculating and interpolating the dropout values.
4. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S2, the structure of the noise reduction self-encoder is an encoding layer, a bottleneck layer and a decoding layer; the encoding layer and the decoding layer are symmetrical with respect to the bottleneck layer and are all fully connected neural network layers, and a ReLU is used as the activation function of each neural network layer.
5. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 4, characterized in that: in step S2, the neuron structure of the coding layer of the model is d-256-54-32, where d is the dimension of the input data and 32 is the number of neurons of the bottleneck layer; the decoding layer structure from left to right is 32-54-256-d, where d is the dimension of the output data, symmetrical to the coding layer structure; the batch size selected in the noise reduction self-encoder training process is 256, the intensity of the Gaussian noise introduced in the coding layer is 2.5, and the model is optimized with the Adam optimization algorithm.
6. The RNA layered embedding clustering method based on a graph convolution neural network according to claim 1, characterized in that: in step S4, the graph convolution self-encoder is first trained by minimizing the loss function L_r of equation (4); the alternating training allows the neural network parameters obtained from this training to initialize the parameter optimization of the second stage, whose loss function is defined as L and jointly governs the graph self-encoder loss L_r and the clustering loss L_c:
L = L_r + γ·L_c #(5);
where γ is a hyperparameter of the clustering model which the experiments set to 2.5; a group of initial cluster centroids is given at the start of training, the data embedded in the latent space by the coding network layers are updated by minimizing L, and the cluster centroids are updated by clustering iterations over the embedded data; these two steps alternate until the loss function converges; the clustering loss L_c adopts the KL divergence loss:
L_c = KL(P || Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij) #(6);
the clustering result is continuously updated through iterative training until the loss function reaches its minimum value, and the optimal clustering result is generated for the latent-space data.
CN202310896057.3A 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network Pending CN116665786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310896057.3A CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310896057.3A CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Publications (1)

Publication Number Publication Date
CN116665786A true CN116665786A (en) 2023-08-29

Family

ID=87715486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310896057.3A Pending CN116665786A (en) 2023-07-21 2023-07-21 RNA layered embedding clustering method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN116665786A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235584A (en) * 2023-11-15 2023-12-15 之江实验室 Picture data classification method, device, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889192A (en) * 2021-09-29 2022-01-04 西安热工研究院有限公司 Single cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN114022693A (en) * 2021-09-29 2022-02-08 西安热工研究院有限公司 Double-self-supervision-based single-cell RNA-seq data clustering method
CN115661498A (en) * 2022-11-09 2023-01-31 云南大学 Self-optimization single cell clustering method
CN116386729A (en) * 2022-12-23 2023-07-04 湖南大学 scRNA-seq data dimension reduction method based on graph neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889192A (en) * 2021-09-29 2022-01-04 西安热工研究院有限公司 Single cell RNA-seq data clustering method based on deep noise reduction self-encoder
CN114022693A (en) * 2021-09-29 2022-02-08 西安热工研究院有限公司 Double-self-supervision-based single-cell RNA-seq data clustering method
CN115661498A (en) * 2022-11-09 2023-01-31 云南大学 Self-optimization single cell clustering method
CN116386729A (en) * 2022-12-23 2023-07-04 湖南大学 scRNA-seq data dimension reduction method based on graph neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GOKCEN ERASLAN et al.: "Single-cell RNA-seq denoising using a deep count autoencoder", NATURE COMMUNICATIONS, pages 1 - 3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235584A (en) * 2023-11-15 2023-12-15 之江实验室 Picture data classification method, device, electronic device and storage medium
CN117235584B (en) * 2023-11-15 2024-04-02 之江实验室 Picture data classification method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN107622182B (en) Method and system for predicting local structural features of protein
CN112418406B (en) Wind power tower inclination angle missing data supplementing method based on SSA-LSTM model
JP2002230514A (en) Evolutionary optimizing method
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
CN112464004A (en) Multi-view depth generation image clustering method
CN107609648B (en) Genetic algorithm combined with stacking noise reduction sparse automatic encoder
CN114970774B (en) Intelligent transformer fault prediction method and device
CN116665786A (en) RNA layered embedding clustering method based on graph convolution neural network
CN113361606A (en) Deep map attention confrontation variational automatic encoder training method and system
CN110674774A (en) Improved deep learning facial expression recognition method and system
CN112733273A (en) Method for determining Bayesian network parameters based on genetic algorithm and maximum likelihood estimation
CN112884149A (en) Deep neural network pruning method and system based on random sensitivity ST-SM
CN115732034A (en) Identification method and system of spatial transcriptome cell expression pattern
CN114925767A (en) Scene generation method and device based on variational self-encoder
CN115062587B (en) Knowledge graph embedding and replying generation method based on surrounding information
CN109214401B (en) SAR image classification method and device based on hierarchical automatic encoder
CN115640842A (en) Network representation learning method based on graph attention self-encoder
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN113743301B (en) Solid-state nanopore sequencing electric signal noise reduction processing method based on residual self-encoder convolutional neural network
CN112329918A (en) Anti-regularization network embedding method based on attention mechanism
CN117375983A (en) Power grid false data injection identification method based on improved CNN-LSTM
CN117196963A (en) Point cloud denoising method based on noise reduction self-encoder
CN112668378A (en) Facial expression recognition method based on combination of image fusion and convolutional neural network
CN115661498A (en) Self-optimization single cell clustering method
CN113344202A (en) Novel deep multi-core learning network model training method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230829