CN114783526A - Deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder
- Publication number
- CN114783526A (application CN202210506799.6A)
- Authority
- CN
- China
- Prior art keywords
- cell
- gaussian mixture
- gene
- encoder
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Genetics & Genomics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biotechnology (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Probability & Statistics with Applications (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder. The method initializes a gene regulatory network A from protein-protein interactions (PPIs) or from regulatory element interactions (HiChIP); initializes the cell cluster C of each cell with the K-means method; passes the gene regulatory network A and the single-cell gene expression data X (or regulatory element openness data X) through a graph encoder to obtain a hidden layer; obtains the cell cluster C and samples from a Gaussian mixture model (GMM) to obtain the low-dimensional cell representation Z; predicts the gene regulatory network A with a decoder; calculates a loss function and updates A and the GCN by back-propagation, repeating these steps until convergence; and outputs the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C. The invention thus completes cell clustering and dimensionality reduction of the cell representation while constructing the gene regulatory network A.
Description
Technical Field
The invention relates to the technical field of single-cell clustering, and in particular to a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder.
Background
The single cell sequencing technology is a technology for performing high-throughput sequencing analysis on genomes, transcriptomes and epigenomes on the level of single cells, can reveal the gene structure and gene expression state of the single cells, reflects the heterogeneity among the cells, plays an important role in the fields of tumors, developmental biology, microbiology, neuroscience and the like, and is the focus of life science research. One of the first tasks to be solved for single cell sequencing is to identify the cell types contained in the sample, i.e. to perform cell clustering based on unsupervised algorithms. Single cell clustering can establish a comprehensive reference for all cell types in different developmental stages in an organism or tissue, and can also serve as a reference basis for disease research in addition to providing a deeper understanding of basic biology. Since many downstream analyses are based on cell clustering, single cell clustering results may have a significant impact on downstream applications. Therefore, how to obtain reliable single-cell clustering is one of the key challenges in the single-cell field.
Some existing methods are designed to process high-dimensional, sparse genome sequencing data, especially single-cell RNA-seq (scRNA-seq) data and single-cell ATAC-seq (scATAC-seq) data. RNA-seq is a transcriptome sequencing technology in which mRNA, small RNA, non-coding RNA and the like, or some of them, are sequenced by high-throughput sequencing to reflect their expression levels. ATAC-seq is a transposase-accessible chromatin sequencing technology that uses Tn5 transposase to label open chromatin positions across the whole genome with sequencing adapters. Because genome sequencing data are high-dimensional and sparse, the distances between cells become similar and the differences between distances tend to be small, which makes it difficult to identify cell types. Feature selection or dimensionality reduction can reduce noise and speed up computation, and dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are often used to visualize the data and to inspect the distribution of the input data. There are many types of single-cell clustering methods. The most popular clustering algorithm is k-means, which iteratively identifies k cluster centers and assigns each cell to the nearest center. However, the algorithm is greedy and is not guaranteed to find a global minimum. Another disadvantage is that it tends to produce clusters of similar size, which may cause rare cell types to be hidden within larger clusters. Another clustering algorithm widely used for scRNA-seq is hierarchical clustering, which sequentially merges individual cells into larger clusters or splits clusters into smaller groups. Its disadvantage is that time and memory requirements grow at least quadratically with the number of data points, so applying hierarchical clustering to large datasets is very costly.
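As a point of reference for the k-means procedure described above, the following is a minimal NumPy sketch of the standard Lloyd iteration. It is not code from the patent: the function name `kmeans`, the `init_idx` parameter and the toy matrix are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, init_idx=None, n_iter=100, seed=0):
    """Lloyd iteration: alternate nearest-center assignment and center update."""
    X = np.asarray(X, dtype=float)
    if init_idx is None:
        # Assumption: start from k distinct cells chosen at random.
        rng = np.random.default_rng(seed)
        init_idx = rng.choice(len(X), size=k, replace=False)
    centers = X[np.asarray(init_idx)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every cell to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: a local, not necessarily global, optimum
        centers = new_centers
    return labels, centers

# Toy matrix (hypothetical values): 3 cells near 0 and 3 cells near 10, 2 genes each.
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [10.0, 10.1], [10.2, 9.9], [9.9, 10.0]])
labels_toy, centers_toy = kmeans(X_toy, k=2, init_idx=(0, 3))
```

The result depends on the initial centers, which is the greedy behaviour noted above; fixing `init_idx` here only makes the toy run reproducible.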
Recent studies have developed several methods specifically for scATAC-seq data analysis: chromVAR evaluates sets of peaks that share the same motif or functional annotation; scABC weights the cells according to sequencing depth and applies weighted K-medoid clustering to reduce the impact of missing values, then calculates a landmark for each cluster and assigns the cells to the nearest landmark based on Spearman correlation. However, each method has significant problems: chromVAR can only analyze groups of peaks and cannot assess individual peaks; scABC relies heavily on landmark samples with high sequencing depth, and for data with many missing values, as is typical of scATAC-seq data, the Spearman correlation coefficients may be unreliable.
In summary, most existing single-cell clustering methods take cell gene expression or chromatin openness as input in vector form only and ignore the interactions among genes or open regions. Moreover, because these interactions are not considered, existing methods can only perform single-cell clustering and cannot simultaneously predict cell-type-specific gene regulatory networks.
Disclosure of Invention
The invention aims to provide a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder that addresses the technical deficiencies of the prior art.
The invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder, comprising the following steps:
S1, inputting protein-protein interaction (PPI) data and single-cell gene expression data X;
S2, initializing a trainable gene regulatory network A using the PPIs;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM under the assumption that the hidden layer follows a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
where $\mathcal{N}(\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, and $\alpha_k$ denotes the probability that sample m follows that Gaussian distribution;
S5.2, obtaining the cell cluster C from the GMM:
$C=\arg\max_k(\alpha_k)$
S5.3, sampling from the GMM to obtain the low-dimensional cell representation Z: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z=μ+N(0,1)*σ
S6, based on the low-dimensional cell representation Z, predicting the gene regulatory network A using a graph decoder (Graph Decoder):
$A=\mathrm{ReLU}(Z^{T}Z)$
S7, calculating a loss function and updating A and the GCN by back-propagation, where the loss function comprises a cross-entropy loss (CrossEntropy) and a KL divergence (KL) and is calculated as:
loss=CrossEntropy-KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
The gene regulatory network A is initialized as follows:
numbering the genes;
if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact according to the PPIs, the element in row i and column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
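The initialization rule above can be sketched as follows. This is a minimal illustration, assuming the PPIs are available as a list of gene-name pairs; the function `init_adjacency`, the placeholder gene names and the symmetric fill (chosen because the method describes each cell's graph as undirected) are assumptions rather than part of the patent text.

```python
import numpy as np

def init_adjacency(genes, interactions):
    """Build the initial 0/1 matrix A from pairwise interactions.

    genes: list of gene identifiers; position in the list is the gene number.
    interactions: iterable of (gene_a, gene_b) pairs (hypothetical input format).
    """
    index = {g: i for i, g in enumerate(genes)}  # gene -> row/column number
    A = np.zeros((len(genes), len(genes)), dtype=np.float32)
    for a, b in interactions:
        # Pairs naming a gene outside the expression matrix are skipped (assumption).
        if a in index and b in index:
            i, j = index[a], index[b]
            A[i, j] = 1.0
            A[j, i] = 1.0  # undirected graph: mirror the entry
    return A

# Placeholder identifiers, not real PPI data.
genes = ["g0", "g1", "g2", "g3"]
ppi_pairs = [("g0", "g1"), ("g2", "g3"), ("g2", "not_measured")]
A0 = init_adjacency(genes, ppi_pairs)
```

In a training framework this matrix would typically be wrapped as a trainable parameter so that back-propagation can update it; how that is done is not specified in the text.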
The invention not only uses gene expression data as features but also adds the relationships among genes, and uses a graph convolutional neural network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clustering and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent genes and edges represent gene-gene interactions. The cell representation is learned by a modified variational graph auto-encoder: the model assumes that the hidden layer follows a Gaussian mixture model (GMM), cells are clustered according to the hidden layer, and cell-type-specific gene-gene interactions are learned through a trainable adjacency matrix.
By comparing the regulatory networks of the clusters, the structural similarities and differences of gene regulatory networks under different background states can be found, and their biological significance can be further analyzed.
The invention also provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder, comprising the following steps:
S1, inputting regulatory element interaction (HiChIP) data and single-cell gene expression data or regulatory element openness data X;
S2, initializing a trainable gene regulatory network A using the HiChIP data;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM under the assumption that the hidden layer follows a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
where $\mathcal{N}(\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, and $\alpha_k$ denotes the probability that sample m follows that Gaussian distribution;
S5.2, obtaining the cell cluster C from the GMM:
$C=\arg\max_k(\alpha_k)$
S5.3, sampling from the GMM to obtain the low-dimensional cell representation Z: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z=μ+N(0,1)*σ
S6, based on the low-dimensional cell representation Z, predicting the gene regulatory network A using a graph decoder (Graph Decoder):
$A=\mathrm{ReLU}(Z^{T}Z)$
S7, calculating a loss function and updating A and the GCN by back-propagation, where the loss function comprises a cross-entropy loss (CrossEntropy) and a KL divergence (KL) and is calculated as:
loss=CrossEntropy-KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
The gene regulatory network A is initialized as follows:
numbering the genes;
if the i-th regulatory element and the j-th regulatory element interact according to the HiChIP data, the element in row i and column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
The invention not only uses chromatin openness as features but also adds the relationships among regulatory elements, and uses a graph convolutional neural network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clustering and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent regulatory elements and edges represent regulatory element-regulatory element interactions. The cell representation is learned by a modified variational graph auto-encoder: the model assumes that the hidden layer (H) follows a Gaussian mixture model (GMM), cells are clustered according to the hidden layer, and cell-type-specific regulatory element-regulatory element interactions are learned through a trainable adjacency matrix.
By comparing the regulatory networks of the clusters, the structural similarities and differences of gene regulatory networks under different background states can be found, and their biological significance can be further analyzed.
The cell data are analyzed by extracting hidden features. The algorithmic framework combines a variational graph auto-encoder (Variational Graph Auto-Encoder, VGAE) with a Gaussian mixture model (GMM): low-dimensional hidden-layer features of the data are extracted by deep learning in order to cluster the cell data and mine biological information, which effectively addresses the high dimensionality and sparsity of the cell data. The encoder (Graph Encoder) of the VGAE extracts low-dimensional hidden-layer features that satisfy the Gaussian mixture model, so that different dimensions are only loosely coupled and each represents the pattern of a different cell type. Cells of the same type share the parameters of the Gaussian mixture model, and this sharing of information lets them compensate for each other's missing information. Finally, the decoder (Graph Decoder) predicts the final gene regulatory network A.
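The cluster assignment C=argmax_k(α_k) used by the method can be illustrated with a small sketch. It assumes, as one plausible reading, that α_k is the posterior probability (responsibility) of component k for a sample under a diagonal-covariance GMM with mixing weights `pi`; the function name `gmm_assign` and all toy values are hypothetical.

```python
import numpy as np

def gmm_assign(H, pi, mu, sigma):
    """Return per-sample component probabilities alpha and hard clusters C.

    H: (n, d) hidden features; pi: (K,) mixing weights;
    mu, sigma: (K, d) means and standard deviations of diagonal Gaussians.
    """
    H, pi = np.asarray(H, float), np.asarray(pi, float)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    diff = H[:, None, :] - mu[None, :, :]                      # (n, K, d)
    # log N(h | mu_k, sigma_k^2), summed over the d independent dimensions.
    log_pdf = -0.5 * (np.log(2 * np.pi) + 2 * np.log(sigma)[None]
                      + (diff / sigma[None]) ** 2).sum(axis=2)  # (n, K)
    log_w = np.log(pi)[None, :] + log_pdf
    log_w -= log_w.max(axis=1, keepdims=True)                  # numerical stability
    alpha = np.exp(log_w)
    alpha /= alpha.sum(axis=1, keepdims=True)                  # rows sum to 1
    return alpha, alpha.argmax(axis=1)                         # C = argmax_k alpha_k

# Toy example with two well-separated components.
H_toy = np.array([[0.0, 0.0], [5.0, 5.0], [0.2, -0.1]])
alpha_toy, C_toy = gmm_assign(H_toy, pi=[0.5, 0.5],
                              mu=[[0.0, 0.0], [5.0, 5.0]],
                              sigma=[[1.0, 1.0], [1.0, 1.0]])
```

Because all cells assigned to a component share that component's mean and variance, sparse cells borrow information from better-covered cells of the same type, which is the compensation effect described above.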
Drawings
FIG. 1 is a schematic flow chart of the deep unsupervised single-cell clustering method based on the Gaussian mixture variational graph auto-encoder of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in FIG. 1, a first aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder (VGAE), which obtains a gene regulatory network A, a low-dimensional cell representation Z and cell clusters C using a variational graph auto-encoder based on a Gaussian mixture distribution, and comprises the following steps:
Step 1: inputting protein-protein interactions (PPIs) and single-cell gene expression data X;
Step 2: initializing a trainable gene regulatory network A using the PPIs, specifically:
Step 2.1: numbering the genes;
Step 2.2: if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact according to the PPIs, initializing the element in row i and column j of the gene regulatory network A to 1, otherwise initializing it to 0;
Step 3: initializing the cell cluster C of each cell using the K-means method;
Step 4: passing A and X through the graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
where GCN denotes a graph convolutional neural network;
Step 5: obtaining the cell clusters and the low-dimensional cell representation, specifically:
Step 5.1: updating the parameters of the Gaussian mixture model (GMM) under the assumption that the hidden layer (H) follows a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
where $\mathcal{N}(\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, and $\alpha_k$ denotes the probability that sample m follows that Gaussian distribution;
Step 5.2: obtaining the cell cluster C from the GMM:
$C=\arg\max_k(\alpha_k)$
Step 5.3: sampling from the GMM to obtain the low-dimensional cell representation Z: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z=μ+N(0,1)*σ
Step 6: predicting the gene regulatory network A using the graph decoder (Graph Decoder):
$A=\mathrm{ReLU}(Z^{T}Z)$
Step 7: calculating the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back-propagation:
loss=CrossEntropy-KL
Step 8: repeating steps 2-5 until convergence;
Step 9: outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
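The encoder of steps 4 and 5.1 can be sketched in NumPy as below. This is an illustrative forward pass only, under several assumptions not stated in the text: the symmetric normalization D^(-1/2)(A+I)D^(-1/2) commonly used for GCN layers, a single hidden layer, a log-σ output head so that σ stays positive, and the adjacency matrix being passed to every layer. The names `normalize_adjacency`, `gcn_layer`, `encode` and the weight keys are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops (an assumed, common GCN choice)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, X, W, relu=False):
    """One graph convolution: aggregate neighbours, then apply a linear map."""
    out = A_norm @ X @ W
    return np.maximum(out, 0.0) if relu else out

def encode(A, X, params):
    """hidden = GCN(A, X), then two heads producing mu and sigma."""
    A_norm = normalize_adjacency(A)
    hidden = gcn_layer(A_norm, X, params["W0"], relu=True)
    mu = gcn_layer(A_norm, hidden, params["W_mu"])
    log_sigma = gcn_layer(A_norm, hidden, params["W_sigma"])
    return mu, np.exp(log_sigma)  # exp keeps sigma strictly positive

# Toy graph with 3 nodes and 4 input features per node (random placeholder values).
rng = np.random.default_rng(0)
A_toy = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
X_toy = rng.random((3, 4))
params_toy = {"W0": 0.1 * rng.standard_normal((4, 5)),
              "W_mu": 0.1 * rng.standard_normal((5, 2)),
              "W_sigma": 0.1 * rng.standard_normal((5, 2))}
mu_toy, sigma_toy = encode(A_toy, X_toy, params_toy)
```

Training the weights and the adjacency matrix by back-propagation would require an automatic differentiation framework, which this sketch deliberately leaves out.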
In the embodiment of the invention, a variational graph auto-encoder (Variational Graph Auto-Encoder, VGAE) is adopted: an n × n convolution kernel is used in the encoding process, turning it into the graph convolution of a graph convolutional neural network (GCN), and the decoding process becomes a graph decoding process; in the specific implementation, the loss is the change between the adjacency matrix before and after. The input of the VGAE is the adjacency matrix of an undirected graph (e.g. the gene regulatory network A) and the feature matrix of the nodes in the undirected graph (the single-cell gene expression data X). The GCN yields the Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ of the low-dimensional representation of the node vectors, and the decoder then generates the adjacency matrix (A') of the graph, which gives the gene regulatory network.
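The sampling rule Z=μ+N(0,1)*σ and the inner-product decoder with ReLU can be sketched as follows. The layout of Z is an assumption: here Z has one row per node, so the product is written `Z @ Z.T`, which is the same matrix as ZᵀZ when Z is stored with one column per node. The function names and toy values are hypothetical.

```python
import numpy as np

def sample_z(mu, sigma, rng):
    """Reparameterization: draw eps ~ N(0, 1), then shift and scale it."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

def decode_adjacency(Z):
    """Predict A' as ReLU of pairwise inner products between node embeddings."""
    return np.maximum(Z @ Z.T, 0.0)

rng = np.random.default_rng(0)
mu_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # 3 nodes, 2 latent dims
sigma_toy = np.full_like(mu_toy, 0.1)
Z_toy = sample_z(mu_toy, sigma_toy, rng)
A_pred = decode_adjacency(Z_toy)
```

Writing the sample as a deterministic function of μ, σ and independent noise is what allows gradients to flow back to the encoder outputs during training.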
As shown in FIG. 1, a second aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder, which obtains a gene regulatory network A, a low-dimensional cell representation Z and cell clusters C using a variational graph auto-encoder based on a Gaussian mixture distribution, and comprises the following steps:
Step 1: inputting regulatory element interaction (HiChIP) data and single-cell gene expression data or regulatory element openness data X;
Step 2: initializing a trainable gene regulatory network A using the HiChIP data, specifically:
Step 2.1: numbering the genes;
Step 2.2: if the i-th regulatory element and the j-th regulatory element interact according to the HiChIP data, initializing the element in row i and column j of the gene regulatory network A to 1, otherwise initializing it to 0;
Step 3: initializing the cell cluster C of each cell using the K-means method;
Step 4: passing A and X through the graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
Step 5: obtaining the cell clusters and the low-dimensional cell representation, specifically:
Step 5.1: updating the parameters of the Gaussian mixture model (GMM) under the assumption that the hidden layer (H) follows a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
where $\mathcal{N}(\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, and $\alpha_k$ denotes the probability that sample m follows that Gaussian distribution;
Step 5.2: obtaining the cell cluster C from the GMM:
$C=\arg\max_k(\alpha_k)$
Step 5.3: sampling from the GMM to obtain the low-dimensional cell representation Z: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z=μ+N(0,1)*σ
Step 6: predicting the gene regulatory network A using the graph decoder (Graph Decoder):
$A=\mathrm{ReLU}(Z^{T}Z)$
Step 7: calculating the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back-propagation:
loss=CrossEntropy-KL
Step 8: repeating steps 2-5 until convergence;
Step 9: outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
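The two terms of the loss in step 7 can be illustrated as below. This is a simplified sketch under stated assumptions: the cross-entropy is a binary cross-entropy between the input adjacency matrix and the clipped decoder output, and the KL term is computed against a standard normal prior rather than the Gaussian mixture prior of the method. The text combines the terms as loss=CrossEntropy-KL without stating how the sign of the KL term is defined, so the sketch returns the two terms separately instead of guessing the combination. All function names are hypothetical.

```python
import numpy as np

def cross_entropy(A_true, A_pred, eps=1e-7):
    """Mean binary cross-entropy between a 0/1 matrix and predicted edge scores."""
    p = np.clip(A_pred, eps, 1.0 - eps)  # ReLU outputs may leave [0, 1]; clip them
    return -np.mean(A_true * np.log(p) + (1.0 - A_true) * np.log(1.0 - p))

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions, averaged over nodes."""
    per_node = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma), axis=1)
    return np.mean(per_node)

# Toy values (placeholders): a 2-node graph and 3 latent dimensions.
A_true = np.eye(2)
ce_good = cross_entropy(A_true, A_true)        # prediction equal to the target
ce_bad = cross_entropy(A_true, 1.0 - A_true)   # every entry predicted wrongly
kl_zero = kl_to_standard_normal(np.zeros((2, 3)), np.ones((2, 3)))
kl_shift = kl_to_standard_normal(np.ones((2, 3)), np.ones((2, 3)))
```

The cross-entropy term rewards reconstructing the known interactions, while the KL term keeps the learned latent distribution close to the chosen prior.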
The invention learns the cell-type-specific gene regulatory network A through a learnable adjacency matrix, and obtains the low-dimensional cell representation Z and the cell clusters C using a Gaussian mixture distribution in a variational graph auto-encoder (VGAE). By applying a Gaussian mixture model within the VGAE, cell clustering and dimensionality reduction of the cell representation are completed while the gene regulatory network A is constructed, achieving three goals at once. In addition, compared with traditional methods that cluster single cells from cell features alone, the method introduces gene-gene or regulatory element-regulatory element interactions, providing a basis for cell clustering from the perspective of the gene regulatory network.
While there have been shown and described what are at present considered to be the basic principles and essential features of the invention and advantages thereof, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, but is capable of other embodiments without departing from the spirit or essential characteristics thereof;
the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way only for clarity; those skilled in the art should treat the description as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (4)
1. A deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder, characterized by comprising the following steps:
S1, inputting protein-protein interaction (PPI) data and single-cell gene expression data X;
S2, initializing a trainable gene regulatory network A using the PPIs;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM under the assumption that the hidden layer follows a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
where $\mathcal{N}(\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$, and $\alpha_k$ denotes the probability that sample m follows that Gaussian distribution;
S5.2, obtaining the cell cluster C from the GMM:
$C=\arg\max_k(\alpha_k)$
S5.3, sampling from the GMM to obtain the low-dimensional cell representation Z: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z=μ+N(0,1)*σ
S6, based on the low-dimensional cell representation Z, predicting the gene regulatory network A using a graph decoder (Graph Decoder):
$A=\mathrm{ReLU}(Z^{T}Z)$
S7, calculating a loss function and updating A and the GCN by back-propagation, where the loss function comprises a cross-entropy loss (CrossEntropy) and a KL divergence (KL) and is calculated as:
loss=CrossEntropy-KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
2. The deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph auto-encoder as claimed in claim 1, wherein the gene regulatory network A is initialized as follows:
numbering the genes;
if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact according to the PPIs, the element in row i and column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
3. The deep unsupervised single cell clustering method based on the Gaussian mixture graph variation self-encoder is characterized by comprising the following steps of:
s1, inputting HiChIP data of interaction of a regulatory element, unicell gene expression data or regulatory element openness data X;
s2, initializing a trainable gene regulation and control network A by using HiChIP;
s3, initializing a cell cluster C of each cell by using a K-means method;
s4, enabling A, X to pass through a Graph Encoder to obtain a hidden layer;
hidden=GCN(A,X),
s5, obtaining cell cluster C based on a Gaussian mixture model GMM, and then sampling from the Gaussian mixture model GMM to obtain cell low-dimensional representation Z:
s5.1, updating parameters of a Gaussian mixture model GMM based on the assumption that the hidden layer obeys Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
wherein N (. mu.) isk,σk) Represents the mean value of μkVariance isGaussian distribution of alphakRepresents the probability that sample m obeys the gaussian distribution;
s5.2, obtaining a cell cluster C according to a Gaussian mixture model GMM:
C=argmaxk(αk)
s5.3, sampling from a Gaussian mixture model GMM to obtain a low-dimensional representation Z of the cell: first, sampling is performed from a standard gaussian distribution with a mean of 0 and a variance of 1, and then transformation is performed by standard deviation and mean:
Z=μ+N(0,1)*σ
s6, expressing Z based on the low dimension of the cell, and predicting a gene regulation network A by using a Decoder Graph Decoder;
A=ReLU(ZT*Z)
s7, calculating a loss function, and reversely propagating and updating A, GCN, wherein the loss function comprises cross entropy loss cross Entrophy and KL divergence, and the calculation formula is as follows:
loss=CrossEntropy-KL
s8, repeating the steps S2-S5 until convergence;
and S9, outputting a gene regulation network A, expressing the cell low dimension Z, and clustering the cells C.
4. The deep unsupervised single-cell clustering method based on the Gaussian mixture graph variational autoencoder as claimed in claim 3, wherein the gene regulatory network A is initialized as follows:
assigning a numeric index to each gene;
if the i-th regulatory element and the j-th regulatory element have an interaction relationship in the HiChIP data, the entry in row i, column j of the gene regulatory network A is initialized to 1; otherwise, it is initialized to 0.
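The initialization in claim 4 amounts to building a 0/1 adjacency matrix from a list of interacting pairs. The sketch below shows one way this could look in plain Python. The function name `init_network`, the element labels, and the pair-list input format are hypothetical, and writing each interaction symmetrically is an assumption that the claim does not state.

```python
def init_network(elements, interactions):
    """Number the elements, then set A[i][j] = 1 for interacting pairs and 0 elsewhere."""
    index = {name: i for i, name in enumerate(elements)}  # numeric index per element
    n = len(elements)
    a = [[0] * n for _ in range(n)]
    for left, right in interactions:
        i, j = index[left], index[right]
        a[i][j] = 1
        a[j][i] = 1  # assumption: an interaction is undirected
    return a


# Hypothetical labels and a single made-up interaction, for illustration only.
elements = ["e0", "e1", "e2"]
interactions = [("e0", "e2")]
network = init_network(elements, interactions)
```

In a trainable setting this matrix would only serve as the starting value of A, which step S7 then updates.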
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210506799.6A CN114783526A (en) | 2022-05-11 | 2022-05-11 | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114783526A (en) | 2022-07-22 |
Family
ID=82436574
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210506799.6A Pending CN114783526A (en) | 2022-05-11 | 2022-05-11 | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783526A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115240772A (en) * | 2022-08-22 | 2022-10-25 | Nanjing Medical University | Method for analyzing active pathway in unicellular multiomics based on graph neural network |
CN115240772B (en) * | 2022-08-22 | 2023-08-22 | Nanjing Medical University | Method for analyzing single cell pathway activity based on graph neural network |
CN117854592A (en) * | 2024-03-04 | 2024-04-09 | National University of Defense Technology | Gene regulation network construction method, device, equipment and storage medium |
CN117854592B (en) * | 2024-03-04 | 2024-06-04 | National University of Defense Technology | Gene regulation network construction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||