CN114783526A

CN114783526A - Deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder

Info

Publication number: CN114783526A
Application number: CN202210506799.6A
Authority: CN
Inventors: 曾婉雯; 张爽; 范蕊
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2022-05-11
Filing date: 2022-05-11
Publication date: 2022-07-22

Abstract

The invention discloses a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder. The method initializes a gene regulatory network A by using protein-protein interactions (PPIs) or regulatory element interactions (HiChIP); initializes a cell cluster C for each cell by using the K-means method; passes the gene regulatory network A and the single-cell gene expression data X (or regulatory element openness data X) through a graph encoder to obtain a hidden layer; obtains the cell cluster C and samples from a Gaussian mixture model (GMM) to obtain a low-dimensional cell representation Z; predicts the gene regulatory network A by using a decoder; calculates a loss function and updates A and the GCN by back-propagation, repeating these steps until convergence; and outputs the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C. The invention thus completes cell clustering and dimensionality reduction of the cell representation in the course of constructing the gene regulatory network A.

Description

Deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder

Technical Field

The invention relates to the technical field of single-cell clustering, and in particular to a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder.

Background

The single cell sequencing technology is a technology for performing high-throughput sequencing analysis on genomes, transcriptomes and epigenomes on the level of single cells, can reveal the gene structure and gene expression state of the single cells, reflects the heterogeneity among the cells, plays an important role in the fields of tumors, developmental biology, microbiology, neuroscience and the like, and is the focus of life science research. One of the first tasks to be solved for single cell sequencing is to identify the cell types contained in the sample, i.e. to perform cell clustering based on unsupervised algorithms. Single cell clustering can establish a comprehensive reference for all cell types in different developmental stages in an organism or tissue, and can also serve as a reference basis for disease research in addition to providing a deeper understanding of basic biology. Since many downstream analyses are based on cell clustering, single cell clustering results may have a significant impact on downstream applications. Therefore, how to obtain reliable single-cell clustering is one of the key challenges in the single-cell field.

Some existing methods are designed to process high-dimensional, sparse genome sequencing data, especially single-cell RNA-seq (scRNA-seq) data and single-cell ATAC-seq (scATAC-seq) data. RNA-seq is a transcriptome sequencing technology in which mRNA, small RNA, non-coding RNA and the like, or some of them, are analyzed by high-throughput sequencing to reflect their expression levels. ATAC-seq is an assay for transposase-accessible chromatin using sequencing, which uses the Tn5 transposase to tag open chromatin positions across the whole genome with sequencing adapters. Because genome sequencing data are high-dimensional and sparse, the distances between cells become similar and the differences between distances tend to be small, which makes it difficult to identify cell types. Feature selection or dimensionality reduction can reduce noise and increase computation speed, and dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are often used to visualize data and to check the distribution of the input data. There are many types of single-cell clustering methods. The most popular clustering algorithm is k-means, which iteratively identifies k cluster centers and assigns each cell to the nearest center. However, the algorithm is greedy and does not guarantee that a global minimum is found. Another disadvantage is that it tends to produce clusters of equal size, which may cause rare cell types to be hidden within larger clusters. Another clustering algorithm widely used for scRNA-seq is hierarchical clustering, which either merges individual cells successively into larger clusters or splits clusters into smaller groups. Its disadvantage is that time and memory requirements grow at least quadratically with the number of data points, so applying hierarchical clustering to large datasets is very costly.
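The k-means iteration described above (alternately assigning each cell to its nearest center and recomputing the centers) can be illustrated with a minimal NumPy sketch. The function name `kmeans`, the random choice of initial centers and the toy data are illustrative assumptions and are not taken from any cited method.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm on X of shape (n_cells, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct cells chosen at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every cell to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its cells; empty clusters stay in place.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of five "cells" each.
X_toy = np.vstack([np.zeros((5, 3)), np.full((5, 3), 10.0)])
labels_toy, centers_toy = kmeans(X_toy, k=2)
```

Because the result depends on the initial centers, the procedure only reaches a local optimum in general, which is the limitation noted in the text.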

Recent studies have developed several methods specifically for scATAC-seq data analysis: chromVAR evaluates sets of peaks that share the same motif or functional annotation; scABC weights cells according to sequencing depth and applies weighted K-medoids clustering to reduce the impact of missing values, then calculates a landmark for each cluster and assigns cells to the nearest landmark based on Spearman correlation. However, each method has significant problems: chromVAR can only analyze groups of peaks and cannot assess individual peaks; scABC relies heavily on landmark samples with high sequencing depth, and for data with a large number of missing values, as is typical of scATAC-seq data, the Spearman correlation coefficient may be unreliable.

In summary, most existing single-cell clustering methods take cell gene expression or chromatin openness as input in plain vector form and ignore the interactions among genes or open regions. Moreover, because they do not consider interactions between genes or open regions, existing methods can only perform single-cell clustering and cannot simultaneously predict cell-type-specific gene regulatory networks.

Disclosure of Invention

The invention aims to address the technical shortcomings of the prior art by providing a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder.

The invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, which comprises the following steps:

S1, inputting protein-protein interaction (PPI) data and single-cell gene expression data X;

S2, initializing a trainable gene regulatory network A by using the PPIs;

S3, initializing the cell cluster C of each cell by using the K-means method;

S4, passing A and X through the Graph Encoder to obtain a hidden layer:

hidden＝GCN(A,X)，

S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain a low-dimensional cell representation Z:

S5.1, updating the parameters of the GMM based on the assumption that the hidden layer obeys a Gaussian mixture distribution:

μ＝GCN(hidden,X)

σ＝GCN(hidden,X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k², and α_k denotes the probability that sample m obeys this Gaussian distribution;

S5.2, obtaining the cell cluster C according to the GMM:

C＝argmax_k(α_k)

S5.3, sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming by the standard deviation and the mean:

Z＝μ+N(0,1)*σ

S6, predicting the gene regulatory network A from the low-dimensional cell representation Z by using the Graph Decoder:

A＝ReLU(Z^T*Z)

S7, calculating a loss function and updating A and the GCN by back-propagation, wherein the loss function comprises the cross-entropy loss CrossEntropy and the KL divergence, and is calculated as follows:

loss＝CrossEntropy-KL

S8, repeating steps S2-S5 until convergence;

S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
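As an illustration of steps S5.3 and S6, the following minimal NumPy sketch samples Z through the transformation Z = μ + N(0,1)*σ and decodes A = ReLU(Z^T*Z). The function names, the toy sizes, and the layout of Z (one column per gene, so that Z^T*Z is a gene-by-gene matrix) are assumptions made for this sketch; the text does not specify them.

```python
import numpy as np

def sample_z(mu, sigma, rng):
    """S5.3: draw noise from N(0, 1), then scale by sigma and shift by mu."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

def decode_adjacency(Z):
    """S6: A = ReLU(Z^T * Z); the result is symmetric and non-negative."""
    return np.maximum(Z.T @ Z, 0.0)

rng = np.random.default_rng(0)
d, n_genes = 2, 4                      # hypothetical latent size and gene count
mu = rng.normal(size=(d, n_genes))     # stand-in for the encoder output mu
sigma = np.full((d, n_genes), 0.1)     # stand-in standard deviations (non-negative)
Z = sample_z(mu, sigma, rng)
A_pred = decode_adjacency(Z)
```

Writing the sample as a deterministic function of μ, σ and external noise keeps the sampling step differentiable with respect to μ and σ, which is what allows back-propagation through it in step S7.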

The steps for initializing the gene regulatory network A are as follows:

numbering the genes numerically;

if the protein expressed by the ith gene and the protein expressed by the jth gene in the PPIs have an interaction relationship, the ith row and the jth column of the gene regulatory network A are initialized to 1, otherwise, the ith row and the jth column of the gene regulatory network A are initialized to 0.
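The initialization rule above can be sketched in a few lines of NumPy. The function name `init_adjacency`, the placeholder gene names and the toy interaction list are hypothetical; filling both (i, j) and (j, i) is an assumption that follows from treating the network as an undirected graph.

```python
import numpy as np

def init_adjacency(genes, ppi_pairs):
    """Build the 0/1 matrix A from a list of interacting gene pairs."""
    index = {g: i for i, g in enumerate(genes)}    # numeric numbering of the genes
    A = np.zeros((len(genes), len(genes)), dtype=np.float32)
    for g1, g2 in ppi_pairs:
        if g1 in index and g2 in index:            # ignore pairs outside the gene list
            i, j = index[g1], index[g2]
            A[i, j] = 1.0
            A[j, i] = 1.0                           # assumption: interactions are symmetric
    return A

genes = ["g0", "g1", "g2", "g3"]                   # placeholder gene identifiers
ppis = [("g0", "g1"), ("g0", "g2"), ("g1", "gX")]  # "gX" is not in the gene list
A0 = init_adjacency(genes, ppis)
```

In a real setting A would afterwards be wrapped as a trainable parameter of the model, since the method updates it by back-propagation.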

The invention takes not only gene expression data as features but also the relationships among genes, and uses the graph convolutional neural network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clusters and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent genes and edges represent gene-gene interactions. The cell representation is learned by an improved variational graph autoencoder; the model assumes that the hidden layer obeys a Gaussian mixture model (GMM), the cells are clustered according to the hidden layer, and cell-type-specific gene-gene interactions are learned through a trainable adjacency matrix.

By comparing the clustered regulatory networks, the similarities and differences of the gene regulatory networks in the structure under different background states can be found, and the practical significance of the gene regulatory networks in biology can be further analyzed.

S1, inputting HiChIP data on regulatory element interactions, together with single-cell gene expression data or regulatory element openness data X;

S2, initializing a trainable gene regulatory network A by using the HiChIP data;

S3, initializing the cell cluster C of each cell by using the K-means method;

S4, passing A and X through the Graph Encoder to obtain a hidden layer:

hidden＝GCN(A,X)，

S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain a low-dimensional cell representation Z:

μ＝GCN(hidden,X)

σ＝GCN(hidden,X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k²;

S5.2, obtaining the cell cluster C according to the GMM:

C＝argmax_k(α_k)

S5.3, sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming by the standard deviation and the mean:

Z＝μ+N(0,1)*σ

A＝ReLU(Z^T*Z)

loss＝CrossEntropy-KL

S8, repeating steps S2-S5 until convergence;

S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.

The steps for initializing the gene regulation network A are as follows:

numbering the genes numerically;

if there is an interaction between the ith and jth regulatory elements in the HiChIP, row i and column j of the gene regulatory network a are initialized to 1, otherwise to 0.

The invention takes not only chromatin openness as features but also the relationships among regulatory elements, and uses the graph convolutional neural network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clusters and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent regulatory elements and edges represent regulatory element-regulatory element interactions. The cell representation is learned by an improved variational graph autoencoder; the model assumes that the hidden layer (H) obeys a Gaussian mixture model (GMM), the cells are clustered according to the hidden layer, and a trainable adjacency matrix is used to learn the cell-type-specific regulatory element-regulatory element interactions.

The cell data are analyzed by extracting hidden features. The algorithm framework combines the Variational Graph Auto-Encoder (VGAE) with the Gaussian Mixture Model (GMM): low-dimensional hidden-layer features of the data are extracted by a deep learning method in order to cluster the cell data and mine biological information, which effectively addresses the high dimensionality and sparsity of the cell data. The encoder (Graph Encoder) of the variational graph autoencoder extracts low-dimensional hidden-layer features that satisfy the Gaussian mixture model, so that different dimensions are only loosely coupled and respectively represent the patterns of different cell types. Cells of the same type share the parameters of the Gaussian mixture model, and this sharing of information compensates for missing information. Finally, the gene regulatory network A is obtained by prediction with the decoder (Graph Decoder).

Drawings

FIG. 1 is a schematic processing flow diagram of the deep unsupervised single-cell clustering method based on the Gaussian mixture variational graph autoencoder of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

As shown in FIG. 1, a first aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder (VGAE), which obtains a gene regulatory network A, a low-dimensional cell representation Z and cell clusters C by using a variational graph autoencoder based on a Gaussian mixture distribution, and comprises the following steps:

Step 1: inputting protein-protein interactions (PPIs) and single-cell gene expression data X;

Step 2: initializing a trainable gene regulatory network A by using the PPIs, specifically comprising the following steps:

Step 2.1: numbering the genes numerically;

Step 2.2: if the protein expressed by the ith gene and the protein expressed by the jth gene have an interaction relationship in the PPIs, initializing the ith row and jth column of the gene regulatory network A to 1, otherwise initializing it to 0;

Step 3: initializing the cell cluster C of each cell by using the K-means method;

Step 4: passing A and X through the Graph Encoder to obtain the hidden layer:

hidden＝GCN(A,X)

where GCN denotes the graph convolutional neural network;

Step 5: obtaining the cell clusters and the low-dimensional cell representation, specifically comprising the following steps:

Step 5.1: updating the parameters of the Gaussian mixture model (GMM) based on the assumption that the hidden layer (H) obeys a Gaussian mixture distribution:

μ＝GCN(hidden,X)

σ＝GCN(hidden,X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k², and α_k denotes the probability that sample m obeys this Gaussian distribution;

Step 5.2: obtaining the cell cluster C according to the Gaussian mixture model (GMM):

C＝argmax_k(α_k)

Step 5.3: sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming by the standard deviation and the mean:

Z＝μ+N(0,1)*σ

Step 6: predicting the gene regulatory network A by using the Graph Decoder:

A＝ReLU(Z^T*Z)

Step 7: computing the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back-propagation:

loss＝CrossEntropy-KL

Step 8: repeating steps 2-5 until convergence;

Step 9: outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
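The cluster assignment of step 5.2 can be sketched as follows. The sketch assumes that α_k is the posterior probability (responsibility) of mixture component k for a given sample, that each component has a diagonal covariance, and that mixture weights are available; the function name `gmm_assign`, the parameter names and the toy values are hypothetical and are not specified in the text.

```python
import numpy as np

def gmm_assign(H, weights, means, variances):
    """Return responsibilities alpha of shape (n_samples, K) and hard clusters C."""
    # log N(h | mu_k, diag(var_k)) for every sample/component pair.
    diff = H[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)   # for numerical stability
    alpha = np.exp(log_post)
    alpha /= alpha.sum(axis=1, keepdims=True)         # normalize over components
    return alpha, alpha.argmax(axis=1)                # C = argmax_k(alpha_k)

# Toy hidden-layer vectors grouped around two component means.
H = np.array([[0.1, -0.2], [0.0, 0.3], [4.9, 5.2], [5.1, 4.8]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
alpha, C = gmm_assign(H, weights, means, variances)
```

Working in log space and subtracting the row maximum before exponentiating avoids underflow when a sample lies far from every component.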

In the embodiment of the invention, a Variational Graph Auto-Encoder (VGAE) is adopted. An n × n convolution kernel is used in the encoding process, which turns it into the graph convolution process of a graph convolutional neural network (GCN), and the decoding process becomes a graph decoding process; in the specific implementation, the change between the adjacency matrix before and after is used as the loss. The input of the VGAE is the adjacency matrix of an undirected graph (for example, the gene regulatory network A) and the feature matrix of the nodes in the undirected graph (the single-cell gene expression data X). The Gaussian distribution (μ, σ²) of the low-dimensional representation of the node vectors is obtained through the GCN, and the adjacency matrix A' of the graph is then generated through the decoder to obtain the gene regulatory network.
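For illustration, one graph convolution of the form hidden = GCN(A, X) can be sketched with the widely used propagation rule ReLU(D^(-1/2) (A + I) D^(-1/2) X W), where D is the degree matrix of A + I. Whether the embodiment uses exactly this normalization is an assumption; the function names, the weight matrix W and the toy graph are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2), with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1)                  # >= 1 for a non-negative A
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A, X, W):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(A) @ X @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)       # toy undirected graph with 3 nodes
X = rng.normal(size=(3, 5))                  # 5 features per node
W = rng.normal(size=(5, 2))                  # stand-in for trainable weights
hidden = gcn_layer(A, X, W)
```

In this form each node's output mixes its own features with those of its neighbors in A, which is how the interaction relationships enter the learned representation.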

As shown in FIG. 1, a second aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, which obtains a gene regulatory network A, a low-dimensional cell representation Z and cell clusters C by using a variational graph autoencoder based on a Gaussian mixture distribution, and comprises the following steps:

Step 1: inputting HiChIP data on regulatory element interactions, together with single-cell gene expression data or regulatory element openness data X;

Step 2: initializing a trainable gene regulatory network A by using the HiChIP data, specifically comprising the following steps:

Step 2.1: numbering the genes numerically;

Step 2.2: if the ith regulatory element and the jth regulatory element have an interaction relationship in the HiChIP data, initializing the ith row and jth column of the gene regulatory network A to 1, otherwise initializing it to 0;

Step 4: passing A and X through the Graph Encoder to obtain the hidden layer:

hidden＝GCN(A,X)

Step 5.1: updating the parameters of the Gaussian mixture model (GMM) based on the assumption that the hidden layer (H) obeys a Gaussian mixture distribution:

μ＝GCN(hidden,X)

σ＝GCN(hidden,X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k², and α_k denotes the probability that sample m obeys this Gaussian distribution.

Step 5.2: obtaining a cell cluster C according to a Gaussian mixture model GMM:

C＝argmax_k(α_k)

step 5.3: sampling from the gaussian mixture model GMM yields a low dimensional representation Z of the cells: first, sampling is performed from a standard gaussian distribution with a mean of 0 and a variance of 1, and then transformation is performed by standard deviation and mean:

Z＝μ+N(0,1)*σ

A＝ReLU(Z^T*Z)

Step 7: computing the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back-propagation:

loss＝CrossEntropy-KL

Step 8: repeating steps 2-5 until convergence;
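The loss of step 7 can be illustrated under simplifying assumptions. The sketch below uses a binary cross entropy between a 0/1 reference adjacency matrix and the predicted one, and takes the "KL" term to be 0.5 * mean(1 + log σ² - μ² - σ²), which equals the negative of the KL divergence from N(μ, σ²) to a standard normal, averaged over entries; subtracting it therefore penalizes the divergence. The method itself uses a Gaussian mixture prior whose KL term is not spelled out in the text, so the standard normal prior, the clipping of predictions into (0, 1) and all function names are assumptions.

```python
import numpy as np

def cross_entropy(A_ref, A_pred, eps=1e-7):
    """Mean binary cross entropy between a 0/1 reference and predictions."""
    p = np.clip(A_pred, eps, 1.0 - eps)   # assumption: clip predictions into (0, 1)
    return -np.mean(A_ref * np.log(p) + (1.0 - A_ref) * np.log(1.0 - p))

def kl_term(mu, sigma):
    """0.5 * mean(1 + log(sigma^2) - mu^2 - sigma^2); always <= 0."""
    return 0.5 * np.mean(1.0 + 2.0 * np.log(sigma) - mu ** 2 - sigma ** 2)

def total_loss(A_ref, A_pred, mu, sigma):
    """loss = CrossEntropy - KL, with KL taken as kl_term above."""
    return cross_entropy(A_ref, A_pred) - kl_term(mu, sigma)

A_ref = np.array([[0.0, 1.0], [1.0, 0.0]])
A_pred = np.array([[0.1, 0.9], [0.9, 0.1]])
mu = np.zeros((2, 2))
sigma = np.ones((2, 2))
loss_at_prior = total_loss(A_ref, A_pred, mu, sigma)        # KL term is zero here
loss_shifted = total_loss(A_ref, A_pred, mu + 1.0, sigma)   # mu moved away from 0
```

With these toy values the second loss exceeds the first by 0.5, which shows how the subtracted term discourages the latent distribution from drifting away from the prior.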

The invention learns the cell-type-specific gene regulatory network A by using a learnable adjacency matrix, and obtains the low-dimensional cell representation Z and the cell clusters C by using a Gaussian mixture distribution in a variational graph autoencoder (VGAE). By applying the Gaussian mixture distribution model to the VGAE, the invention completes cell clustering and dimensionality reduction of the cell representation in the course of constructing the gene regulatory network A, thereby achieving three purposes at once. In addition, compared with conventional methods that cluster single cells by cell features alone, the method introduces gene-gene or regulatory element-regulatory element interactions and provides a basis for cell clustering from the perspective of the gene regulatory network.

While there have been shown and described what are at present considered to be the basic principles and essential features of the invention and advantages thereof, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, but is capable of other embodiments without departing from the spirit or essential characteristics thereof;

the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

1. A deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, characterized by comprising the following steps:

S3, initializing the cell cluster C of each cell by using the K-means method;

S4, passing A and X through the Graph Encoder to obtain a hidden layer:

hidden＝GCN(A，X)，

μ＝GCN(hidden，X)

σ＝GCN(hidden，X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k²;

S5.2, obtaining the cell cluster C according to the Gaussian mixture model (GMM):

C＝argmax_k(α_k)

S5.3, sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming by the standard deviation and the mean:

Z＝μ+N(0，1)*σ

A＝ReLU(Z^T*Z)

loss＝CrossEntropy-KL

S8, repeating steps S2-S5 until convergence;

2. The deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder as claimed in claim 1, wherein the steps for initializing the gene regulatory network A are as follows:

numbering the genes numerically;

if the protein expressed by the ith gene and the protein expressed by the jth gene have an interaction relationship in the PPIs, the ith row and jth column of the gene regulatory network A is initialized to 1; otherwise, it is initialized to 0.

3. A deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, characterized by comprising the following steps:

S3, initializing the cell cluster C of each cell by using the K-means method;

S4, passing A and X through the Graph Encoder to obtain a hidden layer:

hidden＝GCN(A，X)，

μ＝GCN(hidden，X)

σ＝GCN(hidden，X)

wherein N(μ_k, σ_k) denotes the Gaussian distribution with mean μ_k and variance σ_k²;

S5.2, obtaining the cell cluster C according to the Gaussian mixture model (GMM):

C＝argmax_k(α_k)

S5.3, sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming by the standard deviation and the mean:

Z＝μ+N(0，1)*σ

A＝ReLU(Z^T*Z)

S7, calculating a loss function and updating A and the GCN by back-propagation, wherein the loss function comprises the cross-entropy loss CrossEntropy and the KL divergence, and is calculated as follows:

loss＝CrossEntropy-KL

S8, repeating steps S2-S5 until convergence;

4. The deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder as claimed in claim 3, wherein the steps for initializing the gene regulatory network A are as follows:

numbering the genes numerically;

if the ith regulatory element and the jth regulatory element in the HiChIP have an interaction relationship, the ith row and the jth column of the gene regulatory network A are initialized to 1, otherwise, the ith row and the jth column of the gene regulatory network A are initialized to 0.