CN116386729A - scRNA-seq data dimension reduction method based on graph neural network - Google Patents
scRNA-seq data dimension reduction method based on graph neural network
- Publication number
- CN116386729A CN116386729A CN202211716676.1A CN202211716676A CN116386729A CN 116386729 A CN116386729 A CN 116386729A CN 202211716676 A CN202211716676 A CN 202211716676A CN 116386729 A CN116386729 A CN 116386729A
- Authority
- CN
- China
- Prior art keywords
- cell
- data
- neural network
- scrna
- gene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 54
- 238000012174 single-cell RNA sequencing Methods 0.000 title claims abstract description 54
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000009467 reduction Effects 0.000 title claims abstract description 33
- 238000002474 experimental method Methods 0.000 claims abstract description 12
- 238000011156 evaluation Methods 0.000 claims abstract description 6
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 108090000623 proteins and genes Proteins 0.000 claims description 44
- 210000004027 cell Anatomy 0.000 claims description 36
- 230000003993 interaction Effects 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 26
- 238000004458 analytical method Methods 0.000 claims description 8
- 238000004422 calculation algorithm Methods 0.000 claims description 7
- 230000014509 gene expression Effects 0.000 claims description 6
- 238000003559 RNA-seq method Methods 0.000 claims description 5
- 102000004058 Leukemia inhibitory factor Human genes 0.000 claims description 4
- 108090000581 Leukemia inhibitory factor Proteins 0.000 claims description 4
- 238000010276 construction Methods 0.000 claims description 4
- 210000001671 embryonic stem cell Anatomy 0.000 claims description 4
- 238000003064 k means clustering Methods 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 230000001105 regulatory effect Effects 0.000 claims description 3
- 241000894007 species Species 0.000 claims description 3
- 241000244203 Caenorhabditis elegans Species 0.000 claims description 2
- 229920002430 Fibre-reinforced plastic Polymers 0.000 claims description 2
- 230000004931 aggregating effect Effects 0.000 claims description 2
- 210000003443 bladder cell Anatomy 0.000 claims description 2
- 210000005068 bladder tissue Anatomy 0.000 claims description 2
- 239000002131 composite material Substances 0.000 claims description 2
- 239000011151 fibre-reinforced plastic Substances 0.000 claims description 2
- 230000000971 hippocampal effect Effects 0.000 claims description 2
- 230000001418 larval effect Effects 0.000 claims description 2
- 238000004519 manufacturing process Methods 0.000 claims description 2
- 210000002569 neuron Anatomy 0.000 claims description 2
- 210000003819 peripheral blood mononuclear cell Anatomy 0.000 claims description 2
- 230000000717 retained effect Effects 0.000 claims description 2
- 238000000547 structure data Methods 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 238000002759 z-score normalization Methods 0.000 claims description 2
- 239000013598 vector Substances 0.000 claims 4
- 230000002829 reductive effect Effects 0.000 claims 2
- 238000006243 chemical reaction Methods 0.000 claims 1
- 238000010606 normalization Methods 0.000 claims 1
- 230000000452 restraining effect Effects 0.000 claims 1
- 238000007621 cluster analysis Methods 0.000 abstract description 3
- 238000013135 deep learning Methods 0.000 abstract description 3
- 230000006835 compression Effects 0.000 abstract description 2
- 238000007906 compression Methods 0.000 abstract description 2
- 238000007418 data mining Methods 0.000 abstract description 2
- 238000005065 mining Methods 0.000 abstract description 2
- 238000003062 neural network model Methods 0.000 abstract 1
- 230000004850 protein–protein interaction Effects 0.000 description 6
- 230000008901 benefit Effects 0.000 description 3
- 230000001413 cellular effect Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000000513 principal component analysis Methods 0.000 description 2
- 238000012163 sequencing technique Methods 0.000 description 2
- 230000031018 biological processes and functions Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000002224 dissection Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000007717 exclusion Effects 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000037361 pathway Effects 0.000 description 1
- 229920000333 poly(propyleneimine) Polymers 0.000 description 1
- 230000008844 regulatory mechanism Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000002195 synergetic effect Effects 0.000 description 1
- 210000001519 tissue Anatomy 0.000 description 1
- 230000002103 transcriptional effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention relates to data mining in bioinformatics, and in particular to mining single-cell RNA sequencing data. More specifically, it relates to a method for compressing and clustering single-cell RNA sequencing data with deep learning so as to effectively identify cell populations. The method of the invention comprises collecting and preprocessing scRNA-seq data; constructing a graph neural network model; performing dimension reduction on the preprocessed data with the constructed model; and carrying out cluster analysis on the dimension-reduced result. The model constrains the data structure, reduces the dimension through graph neural network modules, and preserves both cell-cell and gene-gene relationships in the dimension-reduction result. Experiments on five real scRNA-seq datasets, with normalized mutual information and the adjusted Rand index as evaluation indexes, show that the method performs well.
Description
Technical Field
The present invention relates to data mining in bioinformatics, and in particular to mining single-cell RNA sequencing data. More specifically, it relates to a method that effectively identifies cell populations by compressing and clustering single-cell RNA sequencing data.
Background
With the explosive growth of single-cell RNA sequencing (scRNA-seq) technology in recent years, unprecedented opportunities for single-cell transcriptional analysis have emerged. Traditional bulk RNA sequencing methods sequence a mixture of millions of cells, so the measured expression of a gene reflects its average expression over all cells and ignores heterogeneity between cells. Unlike bulk RNA-seq, scRNA-seq first isolates individual cells and then sequences thousands of genes per cell. Depending on the sequencing scheme, millions of expression values are collected for each gene, making it possible to identify new cell types, determine gene regulatory mechanisms, and resolve questions of cell dynamics during development.
Single-cell RNA sequencing (scRNA-seq) is an ideal method to study intercellular variation. Conventional dimension reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are applied to scRNA-seq data for visualization and downstream analysis, which has significantly increased our understanding of cellular heterogeneity and developmental progress. The recent advent of massively parallel scRNA-seq (e.g., droplet platforms) enables sequencing of millions of cells in complex biological systems, offering excellent potential for dissecting tissue and cell microenvironments, identifying rare or new cell types, inferring developmental lineages, and elucidating the mechanisms by which cells respond to stimuli. However, the data generated by massively parallel scRNA-seq are characterized by high dropout rates, high noise and complex structure, which poses a series of challenges for dimension reduction. In particular, preserving the complex topology between cells is a great challenge.
Over the past few years, a number of dimension reduction methods have been developed or introduced for scRNA-seq data analysis. Recently developed competing methods include DCA, scVI, scDeepCluster, PHATE, SAUCIE, scGNN, ZINB-WaVE and Ivis. Among them, deep learning shows the greatest potential. For example, DCA, scDeepCluster, Ivis and SAUCIE adapt the autoencoder to denoise, visualize and cluster scRNA-seq data. However, these deep learning-based models embed only the features of individual cells and ignore cell-cell relationships, which limits their ability to reveal the complex topology between cells and also makes it difficult to elucidate developmental trajectories. The recently proposed graph autoencoder is very promising because it preserves long-range relationships between data points in the latent space.
Moreover, studies have shown that gene interactions involved in gene regulatory networks or protein-protein interaction (PPI) networks are informative in different biological contexts. Previous studies have also shown that combining scRNA-seq data with prior gene interaction information can lead to a more meaningful understanding of the data. NetNMF-sc is a network-regularized non-negative matrix factorization designed specifically for scRNA-seq analysis that uses a prior gene network to obtain a more meaningful low-dimensional representation of genes. Correspondingly, scRNA-seq data also contain rich information for inferring gene-gene interactions.
In light of the above, we propose scTPGAE, a graph neural network-based computational method that uses two graph neural networks to simultaneously retain cell-cell and gene-gene relationships in the dimension-reduction result, achieving better downstream analysis.
Disclosure of Invention
To address the shortcomings of existing methods and the complexity of scRNA-seq data, the invention provides a graph neural network-based dimension reduction method for scRNA-seq data. The method effectively alleviates problems of existing dimension reduction methods such as loss of important information and insufficient feature extraction, preserves both cell-cell and gene-gene relationships in the dimension-reduction result, and achieves better clustering accuracy. The steps of the method are as follows:
1. data preprocessing
First, we assume an original scRNA-seq count matrix C from which genes not expressed in any cell have been filtered out. C can be expressed as a P × N matrix, where P is the total number of genes, N is the total number of cells, and C_ij denotes the expression value of gene i in cell j.
In this work, we first pre-process the raw scRNA-seq count data with a logarithmic transformation and z-score normalization, giving a normalized output X:

X′_ij = log(1 + C_ij / S_j),  X = zscore(X′)

where S_j is the size factor of cell j. This preprocessing reduces the effect of differences in library size and converts discrete counts into continuous values, providing greater flexibility for subsequent modeling.
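As an illustration, the preprocessing can be sketched as follows. The text does not specify how the size factor S_j is computed or along which axis the z-score is taken, so both choices below (library size divided by the median library size; per-gene z-score) are assumptions.

```python
import numpy as np

def preprocess_counts(C):
    """Preprocess a genes-by-cells count matrix C (P x N); returns the normalized matrix X."""
    C = np.asarray(C, dtype=float)

    # Filter out genes that are not counted in any cell
    C = C[C.sum(axis=1) > 0, :]

    # Size factor S_j per cell j (assumed: library size / median library size)
    lib_size = C.sum(axis=0)
    S = lib_size / np.median(lib_size)

    # Logarithmic transformation of size-factor-normalized counts: X'_ij = log(1 + C_ij / S_j)
    X_prime = np.log1p(C / S)

    # z-score normalization (assumed per gene)
    mean = X_prime.mean(axis=1, keepdims=True)
    std = X_prime.std(axis=1, keepdims=True) + 1e-8
    return (X_prime - mean) / std
```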
In addition to the gene-cell expression matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.
The cell-cell relationship graph is constructed with the K-nearest-neighbor (KNN) algorithm in the Scikit-learn Python package. The default K was set to 35 in this study and is adjusted per dataset in our experiments. The resulting adjacency matrix is a 0-1 matrix, with 1 indicating connected and 0 indicating not connected.
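A minimal scikit-learn sketch of the cell-cell graph construction; symmetrizing the KNN graph is an assumption, since the text does not say whether the adjacency matrix is made symmetric.

```python
from sklearn.neighbors import kneighbors_graph

def build_cell_graph(X_cells, k=35):
    """X_cells: cells-by-features matrix (e.g. the preprocessed expression matrix transposed)."""
    # 'connectivity' mode yields a 0-1 adjacency: 1 = connected, 0 = not connected
    A = kneighbors_graph(X_cells, n_neighbors=k, mode='connectivity', include_self=False)
    # Make the cell-cell graph undirected (assumed)
    A = A.maximum(A.T)
    return A  # scipy sparse matrix
```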
For the gene-gene interaction networks, we collected seven different human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE with existing data. One of the best-known gene interaction networks is the STRING database, a PPI network that collects and integrates protein-protein association information from a variety of sources, including literature and experiments. HumanNet is a human functional gene network that integrates multiple types of omics data through a Bayesian statistical framework. HumanNet has a hierarchical structure, including human-derived PPIs, co-functional links, co-citation links, and links transferred from other species. In particular, we use two versions of HumanNet, HumanNet-CF and HumanNet-PI, which consist of the co-functional network and the PPI network, respectively. FunCoup is a genome-wide functional association network that combines ten different types of functional-association evidence by redundancy-weighted Bayesian integration. GeneMANIA creates a combined gene network by weighting multiple functional genomics datasets. Furthermore, we collected two functional similarity matrices from pgWalk, derived from KEGG pathways and Gene Ontology biological processes, respectively. We then transform the two similarity matrices into gene networks by filtering out gene pairs whose similarity values are below a threshold (0.9). The two resulting networks are referred to as pgWalk-kegg and pgWalk-gobp, respectively.
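The thresholding that turns the pgWalk similarity matrices into gene networks could look like the following sketch; the dense similarity matrix S and the gene_names list are hypothetical inputs.

```python
import numpy as np

def similarity_to_network(S, gene_names, threshold=0.9):
    """Keep gene pairs whose functional similarity is at least the threshold (0.9 in the text)."""
    S = np.asarray(S)
    i_idx, j_idx = np.triu_indices_from(S, k=1)   # upper triangle, excluding the diagonal
    keep = S[i_idx, j_idx] >= threshold
    return [(gene_names[i], gene_names[j]) for i, j in zip(i_idx[keep], j_idx[keep])]
```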
2. Construction of a graph neural network for dimension reduction
(1) Graph neural network G1 retaining the cell-cell relationship
The graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. It has a low-dimensional bottleneck layer and can therefore be used as a dimension-reduction model. Assume the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. In our joint graph autoencoder there is one encoder E for the whole graph and two decoders, D_X and D_A, for nodes and edges, respectively. In practice, we first encode the input graph into the latent variable h = E(X, A), and then decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss, a weighted sum of the node reconstruction loss and the edge reconstruction loss, where the weight is a hyperparameter (set to 0.6 in our experiments).
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:

A_r = D_A(h) = σ(ZZ^T)

where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
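A minimal Keras-style sketch of the two decoders described above; the graph attention encoder is omitted because its exact layer and hyperparameters are not given here, and the latent dimension, gene count and use of tf.keras (rather than the TensorFlow 1.4 used in the experiments) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32      # assumed bottleneck size
n_genes = 2000       # assumed number of genes (input features per cell)

# Feature decoder D_X: fully connected network with hidden layers of 64, 256 and 512 nodes
feature_decoder = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(n_genes),                # reconstructed node features X_r
])

# Edge decoder D_A: one fully connected layer, then inner product and activation
class EdgeDecoder(layers.Layer):
    def __init__(self, units=latent_dim):
        super().__init__()
        self.fc = layers.Dense(units, activation='relu')      # Z = sigma(W h)

    def call(self, h):
        Z = self.fc(h)                                        # (n_cells, units)
        return tf.nn.relu(tf.matmul(Z, Z, transpose_b=True))  # A_r = sigma(Z Z^T)
```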
(2) Graph neural network G2 retaining gene-gene relationship
We note that when a gene interaction network is applied to a dataset, only those interaction pairs whose two interacting genes both occur in the dataset are retained; the remaining pairs are discarded. In other words, the number of interaction pairs may differ between datasets. To capture both the regulatory direction and the corresponding strength within a pair of genes, the gene interaction network is treated as a directed graph, so an edge between genes A and B from an undirected gene network (e.g., the STRING PPI network) is regarded as a pair of directed edges (i.e., an edge from A to B and an edge from B to A).
The construction of this graph neural network is the same as that of the network preserving the cell-cell relationship, except that its input is the gene-gene interaction (PPI) network instead of the cell-cell relationship graph. Interaction relationships between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolution stack, each node represents one gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolution layer that updates each node by aggregating information from its neighboring nodes.
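Under the assumption that the network is given as a list of undirected gene pairs, this edge handling could be sketched as:

```python
def to_directed_edges(undirected_pairs, genes_in_dataset):
    """Keep pairs whose genes both occur in the dataset and expand each
    undirected edge (A, B) into the directed edges A->B and B->A."""
    present = set(genes_in_dataset)
    directed = []
    for a, b in undirected_pairs:
        if a in present and b in present:
            directed.append((a, b))
            directed.append((b, a))
    return directed
```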
3. Dimension reduction for scRNA-seq data
The preprocessed scRNA-seq data are reduced in dimension using the constructed graph neural networks.
The gene-cell count matrix and the cell-cell relationship graph are input into graph neural network G1 to obtain the dimension-reduced cell features θ1.
The gene-cell count matrix and the gene-gene interaction network are input into graph neural network G2 to obtain the dimension-reduced cell features θ2.
The learned cell features are concatenated as the dimension-reduction result for subsequent downstream analysis.
4. K-means clustering
The method uses the ZINB conditional likelihood to reconstruct the decoder output for the scRNA-seq data; the ZINB distribution has proven to be a good model for scRNA-seq data and is a widely accepted model of the gene expression count distribution.
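The patent does not write out the ZINB distribution; for reference, its commonly used parameterization in the scRNA-seq literature (mean μ, dispersion θ, dropout probability π) is:

```latex
\mathrm{NB}(x;\mu,\theta)=\frac{\Gamma(x+\theta)}{x!\,\Gamma(\theta)}
  \left(\frac{\theta}{\theta+\mu}\right)^{\!\theta}
  \left(\frac{\mu}{\theta+\mu}\right)^{\!x},
\qquad
\mathrm{ZINB}(x;\mu,\theta,\pi)=\pi\,\delta_{0}(x)+(1-\pi)\,\mathrm{NB}(x;\mu,\theta)
```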
In order to evaluate the effectiveness of the method, the k-means clustering algorithm is applied to cluster the dimension-reduced data, and normalized mutual information is used as the evaluation index. Let X be the predicted clustering result and Y the true labeled cell types; the NMI score is the mutual information MI(X, Y) normalized by the Shannon entropies H(X) and H(Y).
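A minimal sketch of the clustering and evaluation step with scikit-learn; theta1 and theta2 denote the cell embeddings learned by G1 and G2, and since the exact NMI normalization is not stated here, the sketch relies on scikit-learn's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def cluster_and_evaluate(theta1, theta2, true_labels, n_clusters):
    # Concatenate the cell embeddings learned by the two graph neural networks
    embedding = np.concatenate([theta1, theta2], axis=1)

    # K-means with K set to the true number of clusters in the dataset
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embedding)

    nmi = normalized_mutual_info_score(true_labels, pred)
    ari = adjusted_rand_score(true_labels, pred)
    return nmi, ari
```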
From the foregoing, it can be seen that the scRNA-seq data dimension reduction method based on the graph neural network provided in one or more embodiments of the present disclosure retains both cell-cell and gene-gene relationships in the dimension reduction results. Our model constrains the data structure and dimension reduction is performed by two graph neural network modules. Experiments performed on five real scRNA-seq datasets indicate that the present method can provide a more accurate low-dimensional representation of the scRNA-seq data.
Detailed Description
The present invention will be described in further detail with reference to the following experiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Overview of data set
To evaluate the performance of scTPGAE, we focused on relatively large datasets and selected five real scRNA-seq datasets with known cell types, which are described below.
(i) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from a healthy human; (ii) the mouse embryonic stem cell dataset, describing the transcriptome of mouse embryonic stem cells undergoing heterogeneous differentiation after withdrawal of leukemia inhibitory factor (LIF); (iii) the mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097), from whose original count matrix we selected about 2700 cells from bladder tissue; (iv) the worm neuron cell dataset, from L2 larval-stage Caenorhabditis elegans analyzed by single-cell combinatorial-indexing RNA sequencing; (v) the Zeisel dataset, containing 3005 cells from mouse cortex and hippocampus (GSE60361).
2. Experimental environment and parameter setting
The hardware environment is a PC host with an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU at 2.42 GHz, 16 GB RAM and a 64-bit operating system. The software is implemented in Python under the PyCharm environment on Windows 10, with Python version 3.5.0 and TensorFlow version 1.4.0.
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:
A_r = D_A(h) = σ(ZZ^T)
where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
In addition to the gene-cell expression matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.
The cell-cell relationship graph is constructed with the K-nearest-neighbor (KNN) algorithm in the Scikit-learn Python package. The default K was set to 35 in this study and is adjusted per dataset in our experiments. The resulting adjacency matrix is a 0-1 matrix, with 1 indicating connected and 0 indicating not connected.
For the gene-gene interaction networks, we collected seven different human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE with existing data.
3. Evaluation index
In order to make the results of the different methods easy to compare, we use K-means for cluster analysis and set the parameter K to the true number of clusters in each dataset. In our experiments, the scTPGAE model is evaluated using two indices, normalized mutual information (NMI) and adjusted Rand index (ARI), which are widely used for model performance evaluation in unsupervised learning scenarios.
4. Analysis of experimental results
Here, experiments are performed on the five real datasets with the present method; the resulting normalized mutual information (NMI) and adjusted Rand index (ARI) values are reported in Figures 2 and 3, respectively.
The experimental result shows that the scTPGAE method based on the graph neural network is a promising new method. The present method achieves better performance over five real datasets, indicating that the present method can provide a more accurate low-dimensional representation of the scRNA-seq data.
It can be seen that the proposed scTPGAE method performs dimension reduction and cluster analysis on single-cell RNA-seq data and has the following advantages: first, scTPGAE matches the latent space distribution to a selected prior; second, scTPGAE retains the cell-cell relationship in the dimension-reduction result; third, scTPGAE retains the gene-gene relationship in addition to the cell-cell relationship; finally, the method takes advantage of the parallelism and scalability of the deep neural network framework. Our model constrains the data structure and performs dimension reduction through graph neural network modules. Experiments on five real scRNA-seq datasets, with normalized mutual information and the adjusted Rand index as evaluation indexes, show that the method performs well.
Drawings
Fig. 1: a flow diagram of a scRNA-seq data dimension reduction method based on a graph neural network;
fig. 2: experimental results with Normalized Mutual Information (NMI) as a measure;
fig. 3: experimental results with the Adjusted Rand Index (ARI) as a measure.
Claims (5)
1. A scRNA-seq data dimension reduction method based on a graph neural network is characterized by comprising the following implementation steps:
(1) Preprocessing the data: collecting scRNA-seq datasets of different species, types and cell numbers; preprocessing the collected raw scRNA-seq data by logarithmic transformation and z-score normalization, and reconstructing the input data with a zero-inflated negative binomial distribution to obtain denoised data;
(2) Constructing graph neural networks for dimension reduction, each an autoencoder framework consisting of a deep encoder, an intermediate hidden layer and a deep decoder, so that the topological structure between cells and the topological structure between genes are simultaneously retained in the dimension-reduction result;
(3) Reducing the dimension of the preprocessed scRNA-seq data with the constructed graph neural networks: the intermediate hidden layer of the autoencoder learns a hidden-layer feature vector whose prior distribution is constrained so that it matches the selected prior distribution; the hidden-layer feature vectors learned by the two graph neural networks are concatenated for subsequent downstream analysis;
(4) Clustering the dimension-reduced data with the k-means clustering algorithm, and computing the normalized mutual information score and the adjusted Rand index.
2. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the data collection and the preprocessing of the collected single-cell RNA sequencing data comprise:
we collected five scRNA-seq datasets from different species, different types, different cell numbers, and were then preprocessed using the method of logarithmic transformation and z-score normalization.
Specifically, we performed data preprocessing operations on the following five data sets.
(1) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from a healthy human;
(2) The mouse embryonic stem cell dataset, describing the transcriptome of mouse embryonic stem cells undergoing heterogeneous differentiation after withdrawal of leukemia inhibitory factor (LIF);
(3) The mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097); from the original count matrix we selected about 2700 cells from bladder tissue;
(4) The worm neuron cell dataset, from L2 larval-stage Caenorhabditis elegans analyzed by single-cell combinatorial-indexing RNA sequencing;
(5) The Zeisel dataset, containing 3005 cells from mouse cortex and hippocampus (GSE60361).
3. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the constructed graph neural network is an autoencoder framework consisting of a deep encoder, an intermediate hidden layer and a deep decoder, specifically comprising:
(1) Graph neural network G1 retaining the cell-cell relationship
The graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. It has a low-dimensional bottleneck layer and can therefore be used as a dimension-reduction model. Assume the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. In our joint graph autoencoder there is one encoder E for the whole graph and two decoders, D_X and D_A, for nodes and edges, respectively. In practice, we first encode the input graph into the latent variable h = E(X, A), and then decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss, a weighted sum of the node reconstruction loss and the edge reconstruction loss, where the weight is a hyperparameter (set to 0.6 in our experiments).
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:
A_r = D_A(h) = σ(ZZ^T)
where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
(2) Graph neural network G2 retaining gene-gene relationship
We note that when a gene interaction network is applied to a dataset, only those interaction pairs whose two interacting genes both occur in the dataset are retained; the remaining pairs are discarded. In other words, the number of interaction pairs may differ between datasets. To capture both the regulatory direction and the corresponding strength within a pair of genes, the gene interaction network is treated as a directed graph, so an edge between genes A and B from an undirected gene network (e.g., the STRING PPI network) is regarded as a pair of directed edges (i.e., an edge from A to B and an edge from B to A).
The construction of this graph neural network is the same as that of the network preserving the cell-cell relationship, except that its input is the gene-gene interaction (PPI) network instead of the cell-cell relationship graph. Interaction relationships between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolution stack, each node represents one gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolution layer that updates each node by aggregating information from its neighboring nodes.
4. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the preprocessed scRNA-seq data are reduced in dimension using the constructed graph neural networks, specifically comprising the following steps:
The gene-cell count matrix and the cell-cell relationship graph are input into graph neural network G1 to obtain the dimension-reduced cell features θ1.
The gene-cell count matrix and the gene-gene interaction network are input into graph neural network G2 to obtain the dimension-reduced cell features θ2.
The learned cell features are concatenated as the dimension-reduction result for subsequent downstream analysis.
5. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the k-means clustering algorithm is applied to cluster the dimension-reduced data, specifically comprising:
the present method uses the ZINB conditional likelihood to reconstruct the decoder output of the scRNA-seq data, and the ZINB distribution has proven to be a better model for describing the scRNA-seq data and is a widely accepted gene expression distribution structure.
To evaluate the effectiveness of the method, the k-means clustering algorithm is applied to cluster the dimension-reduced data, with normalized mutual information and the adjusted Rand index as evaluation indexes. Experiments on five real scRNA-seq datasets show that the method provides a more accurate low-dimensional representation of scRNA-seq data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211716676.1A CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211716676.1A CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116386729A true CN116386729A (en) | 2023-07-04 |
Family
ID=86975628
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211716676.1A Pending CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116386729A (en) |
-
2022
- 2022-12-23 CN CN202211716676.1A patent/CN116386729A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116825204A (en) * | 2023-08-30 | 2023-09-29 | 鲁东大学 | Single-cell RNA sequence gene regulation inference method based on deep learning |
CN116825204B (en) * | 2023-08-30 | 2023-11-07 | 鲁东大学 | Single-cell RNA sequence gene regulation inference method based on deep learning |
CN118335192A (en) * | 2024-06-13 | 2024-07-12 | 杭州电子科技大学 | Single-cell sequencing data clustering method based on self-attention network and contrast learning |
CN118645154A (en) * | 2024-08-12 | 2024-09-13 | 中国医学科学院基础医学研究所 | Single-cell Hi-C map prediction method based on single-cell RNA expression data |
CN118645154B (en) * | 2024-08-12 | 2024-11-08 | 中国医学科学院基础医学研究所 | Single-cell Hi-C map prediction method based on single-cell RNA expression data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107622182B (en) | Method and system for predicting local structural features of protein | |
CN116386729A (en) | scRNA-seq data dimension reduction method based on graph neural network | |
CN111785329A (en) | Single-cell RNA sequencing clustering method based on confrontation automatic encoder | |
CN111210871A (en) | Protein-protein interaction prediction method based on deep forest | |
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
Wang et al. | Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis | |
Wang et al. | Graph neural networks: Self-supervised learning | |
CN115732034A (en) | Identification method and system of spatial transcriptome cell expression pattern | |
CN113571125A (en) | Drug target interaction prediction method based on multilayer network and graph coding | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN111276187A (en) | Gene expression profile feature learning method based on self-encoder | |
CN114067915A (en) | scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder | |
CN114783526A (en) | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder | |
Celik et al. | Biological cartography: Building and benchmarking representations of life | |
CN115881232A (en) | ScRNA-seq cell type annotation method based on graph neural network and feature fusion | |
Wu et al. | AAE-SC: A scRNA-seq clustering framework based on adversarial autoencoder | |
Zhang et al. | Feature selection algorithm for high-dimensional biomedical data using information gain and improved chemical reaction optimization | |
Wen et al. | CellPLM: pre-training of cell language model beyond single cells | |
CN117594132A (en) | Single-cell RNA sequence data clustering method based on robust residual error map convolutional network | |
Bagyamani et al. | Biological significance of gene expression data using similarity based biclustering algorithm | |
CN112071362A (en) | Detection method of protein complex fusing global and local topological structures | |
Chen et al. | A deep graph convolution network with attention for clustering scRNA-seq data | |
Pavlov et al. | Recognition of DNA secondary structures as nucleosome barriers with deep learning methods | |
Leoshchenko et al. | Sequencing for Encoding in Neuroevolutionary Synthesis of Neural Network Models for Medical Diagnosis. | |
Deng | Algorithms for reconstruction of gene regulatory networks from high-throughput gene expression data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |