CN114783526A - Deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder - Google Patents

Info

Publication number
CN114783526A
CN114783526A CN202210506799.6A CN202210506799A CN114783526A CN 114783526 A CN114783526 A CN 114783526A CN 202210506799 A CN202210506799 A CN 202210506799A CN 114783526 A CN114783526 A CN 114783526A
Authority
CN
China
Prior art keywords
cell
gaussian mixture
gene
encoder
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210506799.6A
Other languages
Chinese (zh)
Inventor
曾婉雯
张爽
范蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202210506799.6A
Publication of CN114783526A
Legal status: Pending

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder. The method initializes a gene regulatory network A from protein-protein interactions (PPIs) or from regulatory element interactions (HiChIP); initializes the cell cluster C of each cell using the K-means method; passes the gene regulatory network A and the single-cell gene expression data X (or regulatory element openness data X) through a graph encoder to obtain a hidden layer; obtains the cell cluster C and samples from a Gaussian mixture model (GMM) to obtain the low-dimensional cell representation Z; predicts the gene regulatory network A using a decoder; calculates a loss function and updates A and the GCN by back propagation, repeating these steps until convergence; and outputs the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C. The invention thus completes cell clustering and dimensionality reduction of the cell representation in the course of constructing the gene regulatory network A.

Description

Deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder
Technical Field
The invention relates to the technical field of single-cell clustering, and in particular to a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder.
Background
The single cell sequencing technology is a technology for performing high-throughput sequencing analysis on genomes, transcriptomes and epigenomes on the level of single cells, can reveal the gene structure and gene expression state of the single cells, reflects the heterogeneity among the cells, plays an important role in the fields of tumors, developmental biology, microbiology, neuroscience and the like, and is the focus of life science research. One of the first tasks to be solved for single cell sequencing is to identify the cell types contained in the sample, i.e. to perform cell clustering based on unsupervised algorithms. Single cell clustering can establish a comprehensive reference for all cell types in different developmental stages in an organism or tissue, and can also serve as a reference basis for disease research in addition to providing a deeper understanding of basic biology. Since many downstream analyses are based on cell clustering, single cell clustering results may have a significant impact on downstream applications. Therefore, how to obtain reliable single-cell clustering is one of the key challenges in the single-cell field.
Some existing methods are designed to process high-dimensional, sparse genome sequencing data, especially single-cell RNA-seq (scRNA-seq) data and single-cell ATAC-seq (scATAC-seq) data. RNA-seq is a transcriptome sequencing technology in which mRNA, small RNA, non-coding RNA and the like, or a subset of them, are sequenced with high-throughput sequencing to reflect their expression levels. ATAC-seq is a transposase-accessible chromatin sequencing technology that uses the Tn5 transposase to tag open chromatin positions across the whole genome with sequencing adapters. Because genome sequencing data are high-dimensional and sparse, the distances between cells become similar and the differences between distances tend to be small, which makes it difficult to identify cell types. Feature selection or dimensionality reduction can reduce noise and speed up computation, and dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are often used to visualize the data and to check the distribution of the input data. There are many types of single-cell clustering methods. The most popular clustering algorithm is k-means, which iteratively identifies k cluster centers and assigns each cell to the nearest center. However, the algorithm is greedy and is not guaranteed to find a global minimum. Another disadvantage is that it tends to produce clusters of similar size, which may hide rare cell types within larger clusters. Another clustering algorithm widely used for scRNA-seq is hierarchical clustering, which either merges individual cells successively into larger clusters or splits clusters into smaller groups. Its disadvantage is that time and memory requirements grow at least quadratically with the number of data points, so hierarchical clustering is very costly for large datasets.
Recent studies have developed several methods specifically for scATAC-seq data analysis: chromVAR evaluates groups of peaks that share the same motif or functional annotation; scABC weights the cells according to sequencing depth and applies weighted K-medoids clustering to reduce the impact of missing values, then calculates a label for each cluster and assigns each cell to the nearest label based on Spearman correlation. However, each method has significant problems: chromVAR can only analyze groups of peaks and cannot assess individual peaks; scABC relies heavily on label samples with high sequencing depth, and for data with many missing values, as is typical of scATAC-seq data, the Spearman correlation coefficients may be unreliable.
In summary, most existing single-cell clustering methods take cell gene expression or chromatin openness as input in plain vector form and ignore the interactions among genes or open regions. Moreover, because they do not consider interactions between genes or open regions, existing methods can only perform single-cell clustering and cannot simultaneously predict cell-type-specific gene regulatory networks.
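By way of non-limiting illustration, the k-means procedure described in the background can be sketched as follows. This is a minimal sketch of Lloyd's algorithm in plain Python, not code from the patent; the function name `kmeans`, the toy coordinates and the choice of passing the initial centers explicitly are assumptions made for the example (in practice the initial centers are usually chosen at random, which is why the result depends on initialization).

```python
def kmeans(points, init_centers, iters=20):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each cell goes to its nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # Update step: each center moves to the mean of the cells assigned to it.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy data: six "cells" described by two features each.
cells = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans(cells, init_centers=[cells[0], cells[1]])
```

Both initial centers here lie in the same group of cells, yet the iteration still separates the two groups; on less clearly separated data a poor initialization can leave the algorithm in a local minimum.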
Disclosure of Invention
The invention aims to provide a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder that addresses the technical deficiencies of the prior art.
The invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, which comprises the following steps:
S1, inputting protein-protein interaction (PPI) data and single-cell gene expression data X;
S2, initializing a trainable gene regulatory network A using the PPIs;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden = GCN(A, X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM based on the assumption that the hidden layer obeys a Gaussian mixture distribution:
μ = GCN(hidden, X)
σ = GCN(hidden, X)
hidden ~ Σ_k α_k * N(μ_k, σ_k)
wherein N(μ_k, σ_k) represents a Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys that Gaussian distribution;
S5.2, obtaining the cell cluster C according to the GMM:
C = argmax_k(α_k)
S5.3, sampling from the GMM to obtain the low-dimensional representation Z of a cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z = μ + N(0,1) * σ
S6, predicting the gene regulatory network A from the low-dimensional cell representation Z using a graph decoder (Graph Decoder):
A = ReLU(Z^T * Z)
S7, calculating a loss function and updating A and the GCN by back propagation, wherein the loss function comprises the cross-entropy loss CrossEntropy and the KL divergence, and is calculated as:
loss = CrossEntropy - KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
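By way of non-limiting illustration, the encoding and sampling steps listed above (the graph encoder, the computation of μ and σ, and the sampling of Z) can be sketched in plain Python. The source does not specify the internals of GCN; the sketch assumes the commonly used layer ReLU(Â·H·W) with Â = D^(-1/2)(A + I)D^(-1/2), reads μ and σ as two further graph convolutions applied to the hidden layer, and parameterizes σ through its logarithm so that it stays positive. All function names, weight matrices and toy values are hypothetical.

```python
import math
import random

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def normalize_adjacency(A):
    """Return D^(-1/2) (A + I) D^(-1/2); an assumed, commonly used GCN normalization."""
    n = len(A)
    A_self = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_self]
    return [[A_self[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(A_norm, H, W, relu=True):
    """One graph convolution: activation(A_norm @ H @ W)."""
    out = matmul(matmul(A_norm, H), W)
    return [[max(0.0, v) for v in row] for row in out] if relu else out

def encode(A, X, W_hidden, W_mu, W_log_sigma):
    """hidden = GCN(A, X); mu and sigma come from two further graph convolutions."""
    A_norm = normalize_adjacency(A)
    hidden = gcn_layer(A_norm, X, W_hidden)
    mu = gcn_layer(A_norm, hidden, W_mu, relu=False)
    log_sigma = gcn_layer(A_norm, hidden, W_log_sigma, relu=False)
    sigma = [[math.exp(v) for v in row] for row in log_sigma]  # exp keeps sigma positive
    return mu, sigma

def reparameterize(mu, sigma, rng):
    """Z = mu + eps * sigma with eps ~ N(0, 1), drawn independently per entry."""
    return [[m + rng.gauss(0.0, 1.0) * s for m, s in zip(mr, sr)]
            for mr, sr in zip(mu, sigma)]

# Toy graph: 3 nodes, 2 input features, 2 hidden units, 2 latent dimensions.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
W_hidden = [[0.5, -0.2], [0.1, 0.3]]
W_mu = [[1.0, 0.0], [0.0, 1.0]]
W_log_sigma = [[0.0, 0.0], [0.0, 0.0]]  # log sigma = 0, so sigma = 1 everywhere

mu, sigma = encode(A, X, W_hidden, W_mu, W_log_sigma)
Z = reparameterize(mu, sigma, random.Random(0))
```

Writing the sample as μ plus scaled standard-normal noise keeps the randomness outside the learned quantities, which is what allows gradients to reach μ and σ during back propagation.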
The gene regulatory network A is initialized as follows:
numbering the genes numerically;
if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact in the PPIs, row i, column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
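By way of non-limiting illustration, the initialization of A described above can be sketched as follows. The function name `init_adjacency`, the representation of the PPIs as a list of gene-name pairs and the example gene names are assumptions made for the sketch; the matrix is filled symmetrically because the method represents each cell as an undirected graph.

```python
def init_adjacency(genes, ppi_pairs):
    """Build the initial 0/1 matrix A from a list of interacting gene pairs."""
    index = {gene: i for i, gene in enumerate(genes)}  # number the genes
    n = len(genes)
    A = [[0.0] * n for _ in range(n)]
    for gene_a, gene_b in ppi_pairs:
        # Pairs that mention a gene absent from the expression data are skipped.
        if gene_a in index and gene_b in index:
            i, j = index[gene_a], index[gene_b]
            A[i][j] = 1.0
            A[j][i] = 1.0  # undirected graph: mirror the entry
    return A, index

# Hypothetical inputs, chosen only for illustration.
genes = ["TP53", "MDM2", "GAPDH"]
ppi_pairs = [("TP53", "MDM2"), ("MDM2", "NOT_MEASURED")]
A, index = init_adjacency(genes, ppi_pairs)
```

Every entry without a listed interaction stays at 0, matching the "otherwise" branch of the initialization rule.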
The invention not only uses gene expression data as features but also incorporates the relationships among genes, and uses a graph convolutional network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clustering and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent genes and edges represent gene-gene interactions. The cell representation is learned by an improved variational graph autoencoder: the model assumes that the hidden layer obeys a Gaussian mixture model (GMM), clusters the cells according to the hidden layer, and learns cell-type-specific gene-gene interactions through a trainable adjacency matrix.
By comparing the regulatory networks of the clusters, structural similarities and differences among gene regulatory networks under different background states can be identified, and their biological significance can be analyzed further.
The invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, which comprises the following steps:
S1, inputting HiChIP data on regulatory element interactions, and single-cell gene expression data or regulatory element openness data X;
S2, initializing a trainable gene regulatory network A using the HiChIP data;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden = GCN(A, X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM based on the assumption that the hidden layer obeys a Gaussian mixture distribution:
μ = GCN(hidden, X)
σ = GCN(hidden, X)
hidden ~ Σ_k α_k * N(μ_k, σ_k)
wherein N(μ_k, σ_k) represents a Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys that Gaussian distribution;
S5.2, obtaining the cell cluster C according to the GMM:
C = argmax_k(α_k)
S5.3, sampling from the GMM to obtain the low-dimensional representation Z of the cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z = μ + N(0,1) * σ
S6, predicting the gene regulatory network A from the low-dimensional cell representation Z using a graph decoder (Graph Decoder):
A = ReLU(Z^T * Z)
S7, calculating a loss function and updating A and the GCN by back propagation, wherein the loss function comprises the cross-entropy loss CrossEntropy and the KL divergence, and is calculated as:
loss = CrossEntropy - KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
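By way of non-limiting illustration, the decoding step S6 above, in which A is predicted as the ReLU of the product of Z transposed with Z, can be sketched as follows. The sketch assumes that Z is stored with one column per node (d rows, n columns), so that the product is an n × n matrix of inner products between node embeddings; the source does not state the orientation of Z, and the function name and toy values are hypothetical.

```python
def decode_adjacency(Z):
    """Return ReLU(Z^T Z) for Z given as d rows by n columns (one column per node)."""
    columns = list(zip(*Z))  # one embedding vector per node
    return [[max(0.0, sum(a * b for a, b in zip(ci, cj))) for cj in columns]
            for ci in columns]

# Toy latent representation: 2 latent dimensions, 3 nodes.
Z = [[1.0, -1.0, 0.5],
     [0.0, 2.0, 1.0]]
A_pred = decode_adjacency(Z)
```

Nodes whose embeddings point in similar directions receive a large positive entry, while ReLU sets the entries of dissimilar pairs (negative inner product) to zero; the result is symmetric, as expected for an undirected graph.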
The gene regulatory network A is initialized as follows:
numbering the genes numerically;
if the i-th and j-th regulatory elements interact in the HiChIP data, row i, column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
The invention not only uses chromatin openness as features but also incorporates the relationships among regulatory elements, and uses a graph convolutional network (GCN) to simultaneously learn the low-dimensional single-cell representation, the cell clustering and the cell-type-specific gene regulatory network. Each single cell is represented by an undirected graph in which nodes represent regulatory elements and edges represent interactions between regulatory elements. The cell representation is learned by an improved variational graph autoencoder: the model assumes that the hidden layer (H) obeys a Gaussian mixture model (GMM), clusters the cells according to the hidden layer, and uses a trainable adjacency matrix to learn cell-type-specific interactions between regulatory elements.
By comparing the regulatory networks of the clusters, structural similarities and differences among gene regulatory networks under different background states can be identified, and their biological significance can be analyzed further.
The invention analyzes cell data by extracting latent features. The algorithmic framework combines a variational graph autoencoder (Variational Graph Auto-Encoder, VGAE) with a Gaussian mixture model (GMM), and uses deep learning to extract low-dimensional hidden-layer features with which the cells are clustered and biological information is mined, effectively addressing the high dimensionality and sparsity of cell data. The graph encoder (Graph Encoder) of the VGAE extracts low-dimensional hidden-layer features that follow the GMM, so that different dimensions are only weakly coupled and each represents the pattern of a different cell type. Cells of the same type share the parameters of the Gaussian mixture model, and this sharing of information allows them to compensate for one another's missing information. Finally, the graph decoder (Graph Decoder) predicts the final gene regulatory network A.
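By way of non-limiting illustration, clustering cells according to a Gaussian mixture over the hidden-layer features can be sketched as follows. The sketch assumes diagonal covariances and interprets the per-sample probability used for the cluster assignment as the posterior responsibility of each mixture component; neither detail is stated in the source, and all names and toy values are hypothetical.

```python
import math

def log_gaussian(h, mu, sigma):
    """Log density of a diagonal Gaussian with mean mu and standard deviation sigma."""
    return sum(-0.5 * math.log(2.0 * math.pi * s * s) - (x - m) ** 2 / (2.0 * s * s)
               for x, m, s in zip(h, mu, sigma))

def responsibilities(h, weights, mus, sigmas):
    """Posterior probability of each component given the hidden vector h."""
    logs = [math.log(w) + log_gaussian(h, m, s)
            for w, m, s in zip(weights, mus, sigmas)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def assign_cluster(h, weights, mus, sigmas):
    """C = argmax over k of the per-sample component probability."""
    probs = responsibilities(h, weights, mus, sigmas)
    return probs.index(max(probs))

# Toy mixture with two components in a 2-dimensional hidden space.
weights = [0.5, 0.5]
mus = [[0.0, 0.0], [3.0, 3.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
cluster_a = assign_cluster([0.1, -0.2], weights, mus, sigmas)
cluster_b = assign_cluster([2.9, 3.2], weights, mus, sigmas)
```

Because every cell assigned to a component is described by that component's mean and standard deviation, cells of the same type effectively pool their information through the shared parameters.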
Drawings
FIG. 1 is a schematic processing flow diagram of the deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in FIG. 1, a first aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder (VGAE), which uses a VGAE based on a Gaussian mixture distribution to obtain a gene regulatory network A, a low-dimensional cell representation Z and a cell cluster C, and includes the following steps:
Step 1: inputting protein-protein interactions (PPIs) and single-cell gene expression data X;
Step 2: initializing a trainable gene regulatory network A using the PPIs, specifically:
Step 2.1: numbering the genes numerically;
Step 2.2: if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact in the PPIs, initializing row i, column j of the gene regulatory network A to 1, and otherwise to 0;
Step 3: initializing the cell cluster C of each cell using the K-means method;
Step 4: passing A and X through the graph encoder (Graph Encoder) to obtain a hidden layer:
hidden = GCN(A, X)
wherein GCN denotes a graph convolutional network;
Step 5: obtaining the cell clusters and the low-dimensional cell representation, specifically:
Step 5.1: updating the parameters of the Gaussian mixture model (GMM) based on the assumption that the hidden layer (H) obeys a Gaussian mixture distribution:
μ = GCN(hidden, X)
σ = GCN(hidden, X)
hidden ~ Σ_k α_k * N(μ_k, σ_k)
wherein N(μ_k, σ_k) represents a Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys that Gaussian distribution;
Step 5.2: obtaining the cell cluster C according to the GMM:
C = argmax_k(α_k)
Step 5.3: sampling from the GMM to obtain the low-dimensional representation Z of the cells: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z = μ + N(0,1) * σ
Step 6: predicting the gene regulatory network A using the graph decoder (Graph Decoder):
A = ReLU(Z^T * Z)
Step 7: calculating the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back propagation:
loss = CrossEntropy - KL
Step 8: repeating steps 2-5 until convergence;
Step 9: outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
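By way of non-limiting illustration, the loss of step 7 above can be sketched as follows. The source gives only the form "cross entropy minus KL". The sketch assumes a binary cross entropy between the initial and the predicted adjacency matrices (with predictions clipped into the open interval (0, 1), since a ReLU output is not a probability), and it assumes that the "KL" term is the quantity 0.5 · Σ(1 + log σ^2 - μ^2 - σ^2) averaged over nodes, as in commonly used variational graph autoencoder code. That quantity is the negative of the KL divergence to a standard normal prior, so subtracting it adds a non-negative penalty. The simplification to a standard normal prior instead of the Gaussian mixture prior, and all names, are assumptions.

```python
import math

def binary_cross_entropy(A_true, A_pred, eps=1e-7):
    """Mean binary cross entropy over all entries of the adjacency matrix."""
    total, count = 0.0, 0
    for row_true, row_pred in zip(A_true, A_pred):
        for t, p in zip(row_true, row_pred):
            p = min(max(p, eps), 1.0 - eps)  # clip so that both logarithms are finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

def kl_term(mu, sigma):
    """0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2), averaged over nodes; always <= 0."""
    per_node = [0.5 * sum(1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mr, sr))
                for mr, sr in zip(mu, sigma)]
    return sum(per_node) / len(per_node)

def vgae_loss(A_true, A_pred, mu, sigma):
    """loss = CrossEntropy - KL, with KL defined as in kl_term above."""
    return binary_cross_entropy(A_true, A_pred) - kl_term(mu, sigma)

# Toy values: 2 nodes, 2 latent dimensions.
A_true = [[1.0, 0.0], [0.0, 1.0]]
A_pred = [[0.9, 0.2], [0.2, 0.9]]
mu = [[0.5, -0.5], [0.0, 1.0]]
sigma = [[1.0, 0.8], [1.2, 1.0]]
loss = vgae_loss(A_true, A_pred, mu, sigma)
```

Under this reading the minus sign in the formula is consistent with a regularizer that pulls the latent distribution toward the prior; with μ = 0 and σ = 1 the term vanishes and the loss reduces to the cross entropy alone.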
In the embodiment of the invention, a variational graph autoencoder (Variational Graph Auto-Encoder, VGAE) is adopted. An n × n convolution kernel is used in the encoding process, turning it into the graph convolution of a graph convolutional network (GCN), and the decoding process becomes a graph decoding process; in the specific implementation, the change between the adjacency matrix before and after is used as the loss. The input of the VGAE is the adjacency matrix of an undirected graph (here the gene regulatory network A) and the feature matrix of the nodes in that graph (the single-cell gene expression data X). The GCN yields the Gaussian distribution (μ, σ^2) of the low-dimensional representation of the node vectors, and the decoder then generates the adjacency matrix (A') of the graph, which gives the gene regulatory network.
As shown in FIG. 1, a second aspect of the embodiments of the present invention provides a deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, which uses a variational graph autoencoder based on a Gaussian mixture distribution to obtain a gene regulatory network A, a low-dimensional cell representation Z and a cell cluster C, and includes the following steps:
Step 1: inputting HiChIP data on regulatory element interactions, and single-cell gene expression data or regulatory element openness data X;
Step 2: initializing a trainable gene regulatory network A using the HiChIP data, specifically:
Step 2.1: numbering the genes numerically;
Step 2.2: if the i-th regulatory element and the j-th regulatory element interact in the HiChIP data, initializing row i, column j of the gene regulatory network A to 1, and otherwise to 0;
Step 3: initializing the cell cluster C of each cell using the K-means method;
Step 4: passing A and X through the graph encoder (Graph Encoder) to obtain a hidden layer:
hidden = GCN(A, X)
Step 5: obtaining the cell clusters and the low-dimensional cell representation, specifically:
Step 5.1: updating the parameters of the Gaussian mixture model (GMM) based on the assumption that the hidden layer (H) obeys a Gaussian mixture distribution:
μ = GCN(hidden, X)
σ = GCN(hidden, X)
hidden ~ Σ_k α_k * N(μ_k, σ_k)
wherein N(μ_k, σ_k) represents a Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys that Gaussian distribution;
Step 5.2: obtaining the cell cluster C according to the GMM:
C = argmax_k(α_k)
Step 5.3: sampling from the GMM to obtain the low-dimensional representation Z of the cells: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z = μ + N(0,1) * σ
Step 6: predicting the gene regulatory network A using the graph decoder (Graph Decoder):
A = ReLU(Z^T * Z)
Step 7: calculating the loss function (comprising the cross-entropy loss CrossEntropy and the KL divergence) and updating A and the GCN by back propagation:
loss = CrossEntropy - KL
Step 8: repeating steps 2-5 until convergence;
Step 9: outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
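By way of non-limiting illustration, "repeating until convergence" can be sketched as a loop that stops once the loss no longer changes appreciably. The source does not define a convergence criterion; the tolerance `tol`, the iteration cap `max_iters`, the function names and the toy objective below are all assumptions. The toy step minimizes (w - 3)^2 by gradient descent and merely stands in for one pass of encoding, sampling, decoding and back propagation.

```python
def train_until_converged(step, tol=1e-6, max_iters=1000):
    """Call step() until the loss changes by less than tol or max_iters is reached."""
    previous = float("inf")
    for iteration in range(1, max_iters + 1):
        loss = step()
        if abs(previous - loss) < tol:
            return loss, iteration
        previous = loss
    return previous, max_iters

# Stand-in for one training pass: gradient descent on the toy objective (w - 3)^2.
state = {"w": 0.0}

def toy_step(learning_rate=0.1):
    gradient = 2.0 * (state["w"] - 3.0)
    state["w"] -= learning_rate * gradient
    return (state["w"] - 3.0) ** 2

final_loss, n_iters = train_until_converged(toy_step)
```

A cap on the number of iterations is a practical safeguard for the case in which the loss oscillates and the tolerance is never met.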
The invention learns the cell-type-specific gene regulatory network A using a learnable adjacency matrix, and obtains the low-dimensional cell representation Z and the cell cluster C using a Gaussian mixture distribution within a variational graph autoencoder (VGAE). By applying a Gaussian mixture distribution model to the VGAE, the invention completes cell clustering and dimensionality reduction of the cell representation in the course of constructing the gene regulatory network A, achieving three goals at once. In addition, compared with conventional methods that cluster single cells by cell features alone, the method introduces gene-gene or regulatory element-regulatory element interactions, and thereby provides a basis for cell clustering from the perspective of the gene regulatory network.
While there have been shown and described what are at present considered to be the basic principles and essential features of the invention and advantages thereof, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, but is capable of other embodiments without departing from the spirit or essential characteristics thereof;
the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (4)

1. A deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder, characterized by comprising the following steps:
S1, inputting protein-protein interaction (PPI) data and single-cell gene expression data X;
S2, initializing a trainable gene regulatory network A using the PPIs;
S3, initializing the cell cluster C of each cell using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden = GCN(A, X)
S5, obtaining the cell cluster C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM based on the assumption that the hidden layer obeys a Gaussian mixture distribution:
μ = GCN(hidden, X)
σ = GCN(hidden, X)
hidden ~ Σ_k α_k * N(μ_k, σ_k)
wherein N(μ_k, σ_k) represents a Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys that Gaussian distribution;
S5.2, obtaining the cell cluster C according to the GMM:
C = argmax_k(α_k)
S5.3, sampling from the GMM to obtain the low-dimensional representation Z of a cell: first sampling from a standard Gaussian distribution with mean 0 and variance 1, and then transforming the sample by the standard deviation and the mean:
Z = μ + N(0,1) * σ
S6, predicting the gene regulatory network A from the low-dimensional cell representation Z using a graph decoder (Graph Decoder):
A = ReLU(Z^T * Z)
S7, calculating a loss function and updating A and the GCN by back propagation, wherein the loss function comprises the cross-entropy loss CrossEntropy and the KL divergence, and is calculated as:
loss = CrossEntropy - KL
S8, repeating steps S2-S5 until convergence;
S9, outputting the gene regulatory network A, the low-dimensional cell representation Z and the cell clusters C.
2. The deep unsupervised single-cell clustering method based on a Gaussian mixture variational graph autoencoder as claimed in claim 1, wherein the gene regulatory network A is initialized as follows:
numbering the genes numerically;
if the protein expressed by the i-th gene and the protein expressed by the j-th gene interact in the PPIs, row i, column j of the gene regulatory network A is initialized to 1; otherwise it is initialized to 0.
3. A deep unsupervised single-cell clustering method based on a Gaussian mixture graph variation self-encoder, characterized by comprising the following steps:
S1, inputting HiChIP data on regulatory element interactions, and single-cell gene expression data or regulatory element openness (accessibility) data X;
S2, initializing a trainable gene regulatory network A by using the HiChIP data;
S3, initializing the cell cluster C of each cell by using the K-means method;
S4, passing A and X through a graph encoder (Graph Encoder) to obtain a hidden layer:
hidden=GCN(A,X)
S5, obtaining the cell clusters C based on a Gaussian mixture model (GMM), and then sampling from the GMM to obtain the low-dimensional cell representation Z:
S5.1, updating the parameters of the GMM based on the assumption that the hidden layer obeys a Gaussian mixture distribution:
μ=GCN(hidden,X)
σ=GCN(hidden,X)
[Equation image FDA0003637680370000031, not reproduced in the extracted text]
wherein N(μ_k, σ_k) represents the Gaussian distribution with mean μ_k and variance σ_k^2, and α_k represents the probability that sample m obeys this Gaussian distribution;
S5.2, obtaining the cell clusters C according to the Gaussian mixture model (GMM):
C=argmax_k(α_k)
S5.3, sampling from the Gaussian mixture model (GMM) to obtain the low-dimensional cell representation Z: first, a sample is drawn from a standard Gaussian distribution with mean 0 and variance 1, and it is then scaled by the standard deviation and shifted by the mean:
Z=μ+N(0,1)*σ
S6, predicting the gene regulatory network A from the low-dimensional cell representation Z by using a graph decoder (Graph Decoder):
A=ReLU(Z^T*Z)
S7, calculating the loss function and updating A and the GCN by back-propagation, wherein the loss function comprises the cross-entropy loss (CrossEntropy) and the KL divergence, and is calculated as follows:
loss=CrossEntropy-KL
S8, repeating steps S2-S5 until convergence;
and S9, outputting the gene regulatory network A, the low-dimensional cell representation Z, and the cell clusters C.
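Steps S4 and S5.1 to S5.2 of this claim can be illustrated with the NumPy sketch below. It is a sketch under assumptions rather than the patented network: the claim only writes hidden=GCN(A,X), so the single graph-convolution layer with self-loops and symmetric normalization is an assumed choice, and the mixture means, standard deviations and weights are passed in as plain arrays instead of being produced by a GCN. All function and parameter names are hypothetical.

```python
import numpy as np


def gcn_layer(a, x, w):
    """S4 (assumed form): ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = a + np.eye(a.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)


def responsibilities(h, means, stds, weights):
    """S5.1: probability alpha[m, k] that sample m belongs to component k.

    h: (m, d) hidden vectors; means, stds: (k, d); weights: (k,).
    Assumption: each component is a diagonal Gaussian.
    """
    diff = (h[:, None, :] - means[None, :, :]) / stds[None, :, :]
    log_pdf = -0.5 * np.sum(
        diff**2 + 2.0 * np.log(stds)[None, :, :] + np.log(2.0 * np.pi), axis=2
    )
    log_post = np.log(weights)[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def assign_clusters(alpha):
    """S5.2: C = argmax_k(alpha_k), evaluated per sample."""
    return np.argmax(alpha, axis=1)


if __name__ == "__main__":
    h = np.array([[0.0, 0.1], [4.9, 5.0]])        # hypothetical hidden vectors
    means = np.array([[0.0, 0.0], [5.0, 5.0]])
    stds = np.ones((2, 2))
    weights = np.array([0.5, 0.5])
    print(assign_clusters(responsibilities(h, means, stds, weights)))
```

The K-means initialization of step S3 could supply the starting `means`, for example from the centroids of an off-the-shelf K-means run on the hidden layer; that link is likewise an assumption.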
4. The deep unsupervised single-cell clustering method based on the Gaussian mixture graph variation self-encoder as claimed in claim 3, wherein the gene regulatory network A is initialized as follows:
assigning a numeric index to each gene;
if the i-th regulatory element and the j-th regulatory element have an interaction relationship in the HiChIP data, the entry in the i-th row and j-th column of the gene regulatory network A is initialized to 1; otherwise, it is initialized to 0.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506799.6A CN114783526A (en) 2022-05-11 2022-05-11 Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506799.6A CN114783526A (en) 2022-05-11 2022-05-11 Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder

Publications (1)

Publication Number Publication Date
CN114783526A (en) 2022-07-22

Family

ID=82436574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506799.6A Pending CN114783526A (en) 2022-05-11 2022-05-11 Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder

Country Status (1)

Country Link
CN (1) CN114783526A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240772A (en) * 2022-08-22 2022-10-25 南京医科大学 Method for analyzing active pathway in unicellular multiomics based on graph neural network
CN115240772B (en) * 2022-08-22 2023-08-22 南京医科大学 Method for analyzing single cell pathway activity based on graph neural network
CN117854592A (en) * 2024-03-04 2024-04-09 中国人民解放军国防科技大学 Gene regulation network construction method, device, equipment and storage medium
CN117854592B (en) * 2024-03-04 2024-06-04 中国人民解放军国防科技大学 Gene regulation network construction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Hu et al. Mining coherent dense subgraphs across massive biological networks for functional discovery
Maulik et al. Simulated annealing based automatic fuzzy clustering combined with ANN classification for analyzing microarray data
CN111564183B (en) Single cell sequencing data dimension reduction method fusing gene ontology and neural network
Yan et al. Unsupervised and semi‐supervised learning: The next frontier in machine learning for plant systems biology
Zheng et al. Emerging deep learning methods for single-cell RNA-seq data analysis
CN114927162A (en) Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution
Cheng et al. DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data
Zeng et al. A novel HMM-based clustering algorithm for the analysis of gene expression time-course data
Erfanian et al. Deep learning applications in single-cell omics data analysis
Du et al. Model-based trajectory inference for single-cell rna sequencing using deep learning with a mixture prior
CN114783526A (en) Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder
Perera et al. Generative moment matching networks for genotype simulation
Ji et al. scAnnotate: an automated cell-type annotation tool for single-cell RNA-sequencing data
Banu et al. Performance analysis of hard and soft clustering approaches for gene expression data
Chatzilygeroudis et al. Feature Selection in single-cell RNA-seq data via a Genetic Algorithm
CN115661498A (en) Self-optimization single cell clustering method
Zhen et al. A novel framework for single-cell hi-c clustering based on graph-convolution-based imputation and two-phase-based feature extraction
CN112071362A (en) Detection method of protein complex fusing global and local topological structures
Min et al. Structured sparse non-negative matrix factorization with L20-norm for scRNA-seq data analysis
Liu et al. An overview of biological data generation using generative adversarial networks
Wang et al. An interpretation of convolutional neural networks for motif finding from the view of probability
Abou El-Naga et al. Consensus Nature Inspired Clustering of Single-Cell RNA-Sequencing Data
CN114512188B (en) DNA binding protein recognition method based on improved protein sequence position specificity matrix
Padma et al. A modified algorithm for clustering based on particle swarm optimization and K-means
Wong et al. Computational Systems Bioinformatics-Methods And Biomedical Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination