CN116386729A - scRNA-seq data dimension reduction method based on graph neural network - Google Patents
scRNA-seq data dimension reduction method based on graph neural network
- Publication number
- CN116386729A CN116386729A CN202211716676.1A CN202211716676A CN116386729A CN 116386729 A CN116386729 A CN 116386729A CN 202211716676 A CN202211716676 A CN 202211716676A CN 116386729 A CN116386729 A CN 116386729A
- Authority
- CN
- China
- Prior art keywords
- cell
- data
- neural network
- scrna
- gene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 54
- 238000012174 single-cell RNA sequencing Methods 0.000 title claims abstract description 54
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000009467 reduction Effects 0.000 title claims abstract description 33
- 238000002474 experimental method Methods 0.000 claims abstract description 12
- 238000011156 evaluation Methods 0.000 claims abstract description 6
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 108090000623 proteins and genes Proteins 0.000 claims description 44
- 210000004027 cell Anatomy 0.000 claims description 36
- 230000003993 interaction Effects 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 26
- 238000004458 analytical method Methods 0.000 claims description 8
- 238000004422 calculation algorithm Methods 0.000 claims description 7
- 230000014509 gene expression Effects 0.000 claims description 6
- 238000003559 RNA-seq method Methods 0.000 claims description 5
- 102000004058 Leukemia inhibitory factor Human genes 0.000 claims description 4
- 108090000581 Leukemia inhibitory factor Proteins 0.000 claims description 4
- 238000010276 construction Methods 0.000 claims description 4
- 210000001671 embryonic stem cell Anatomy 0.000 claims description 4
- 238000003064 k means clustering Methods 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 230000001105 regulatory effect Effects 0.000 claims description 3
- 241000894007 species Species 0.000 claims description 3
- 241000244203 Caenorhabditis elegans Species 0.000 claims description 2
- 229920002430 Fibre-reinforced plastic Polymers 0.000 claims description 2
- 230000004931 aggregating effect Effects 0.000 claims description 2
- 210000003443 bladder cell Anatomy 0.000 claims description 2
- 210000005068 bladder tissue Anatomy 0.000 claims description 2
- 239000002131 composite material Substances 0.000 claims description 2
- 239000011151 fibre-reinforced plastic Substances 0.000 claims description 2
- 230000000971 hippocampal effect Effects 0.000 claims description 2
- 230000001418 larval effect Effects 0.000 claims description 2
- 238000004519 manufacturing process Methods 0.000 claims description 2
- 210000002569 neuron Anatomy 0.000 claims description 2
- 210000003819 peripheral blood mononuclear cell Anatomy 0.000 claims description 2
- 230000000717 retained effect Effects 0.000 claims description 2
- 238000000547 structure data Methods 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 238000002759 z-score normalization Methods 0.000 claims description 2
- 239000013598 vector Substances 0.000 claims 4
- 230000002829 reductive effect Effects 0.000 claims 2
- 238000006243 chemical reaction Methods 0.000 claims 1
- 238000010606 normalization Methods 0.000 claims 1
- 230000000452 restraining effect Effects 0.000 claims 1
- 238000007621 cluster analysis Methods 0.000 abstract description 3
- 238000013135 deep learning Methods 0.000 abstract description 3
- 230000006835 compression Effects 0.000 abstract description 2
- 238000007906 compression Methods 0.000 abstract description 2
- 238000007418 data mining Methods 0.000 abstract description 2
- 238000005065 mining Methods 0.000 abstract description 2
- 238000003062 neural network model Methods 0.000 abstract 1
- 230000004850 protein–protein interaction Effects 0.000 description 6
- 230000008901 benefit Effects 0.000 description 3
- 230000001413 cellular effect Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000000513 principal component analysis Methods 0.000 description 2
- 238000012163 sequencing technique Methods 0.000 description 2
- 230000031018 biological processes and functions Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000002224 dissection Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000007717 exclusion Effects 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000037361 pathway Effects 0.000 description 1
- 229920000333 poly(propyleneimine) Polymers 0.000 description 1
- 230000008844 regulatory mechanism Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000002195 synergetic effect Effects 0.000 description 1
- 210000001519 tissue Anatomy 0.000 description 1
- 230000002103 transcriptional effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention relates to data mining in bioinformatics, and in particular to mining single-cell RNA sequencing data. More specifically, it relates to a method for compressing and clustering single-cell RNA sequencing data with deep learning so as to effectively identify cell populations. The method of the invention comprises collecting and preprocessing scRNA-seq data; constructing a graph neural network model; performing dimension reduction on the preprocessed data with the constructed model; and carrying out cluster analysis on the dimension-reduced result. The model constrains the data structure, reduces the dimension through graph neural network modules, and preserves both cell-cell and gene-gene relationships in the dimension-reduction result. Experiments on five real scRNA-seq datasets, with normalized mutual information and the adjusted Rand index as evaluation indexes, show that the method performs well.
Description
Technical Field
The present invention relates to data mining in bioinformatics, and in particular to mining single-cell RNA sequencing data. More specifically, it relates to a method that effectively identifies cell populations by compressing and clustering single-cell RNA sequencing data.
Background
With the explosive growth of single-cell RNA sequencing (scRNA-seq) technology in recent years, unprecedented opportunities for single-cell transcriptional analysis have emerged. Traditional bulk RNA sequencing methods sequence a mixture of millions of cells, so the measured expression of a gene reflects its average expression over all cells and ignores heterogeneity between cells. Unlike bulk RNA-seq, scRNA-seq first isolates individual cells and then sequences thousands of genes per cell. Depending on the sequencing scheme, millions of expression values are collected for each gene, making it possible to identify new cell types, determine gene regulatory mechanisms, and resolve questions of cell dynamics during development.
Single-cell RNA sequencing (scRNA-seq) is an ideal method to study intercellular variation. Conventional dimension reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are applied to scRNA-seq data for visualization and downstream analysis, which has significantly increased our understanding of cellular heterogeneity and developmental progress. The recent advent of massively parallel scRNA-seq (e.g., droplet platforms) enables sequencing of millions of cells in complex biological systems, offering excellent potential for dissecting tissue and cell microenvironments, identifying rare or new cell types, inferring developmental lineages, and elucidating the mechanisms by which cells respond to stimuli. However, the data generated by massively parallel scRNA-seq are characterized by high dropout rates, high noise and complex structure, which poses a series of challenges for dimension reduction. In particular, preserving the complex topology between cells is a great challenge.
Over the past few years, a number of dimension reduction methods have been developed or introduced for scRNA-seq data analysis. Recently developed competing methods include DCA, scVI, scDeepCluster, PHATE, SAUCIE, scGNN, ZINB-WaVE and Ivis. Among them, deep learning shows the greatest potential. For example, DCA, scDeepCluster, Ivis and SAUCIE adapt the autoencoder to denoise, visualize and cluster scRNA-seq data. However, these deep learning-based models embed only the features of individual cells and ignore cell-cell relationships, which limits their ability to reveal the complex topology between cells and also makes it difficult to elucidate developmental trajectories. The recently proposed graph autoencoder is very promising because it preserves long-range relationships between data points in the latent space.
Moreover, studies have shown that gene interactions involved in gene regulatory networks or protein-protein interaction (PPI) networks are informative in different biological contexts. Previous studies have also shown that combining scRNA-seq data with prior gene interaction information can lead to a more meaningful understanding of the data. NetNMF-sc is a network-regularized non-negative matrix factorization designed specifically for scRNA-seq analysis that uses a prior gene network to obtain a more meaningful low-dimensional representation of genes. Correspondingly, scRNA-seq data also contain rich information for inferring gene-gene interactions.
In light of the above, we propose scTPGAE, a graph neural network-based computational method that uses two graph neural networks to simultaneously retain cell-cell and gene-gene relationships in the dimension-reduction result, achieving better downstream analysis.
Disclosure of Invention
To address the shortcomings of existing methods and the complexity of scRNA-seq data, the invention provides a graph neural network-based dimension reduction method for scRNA-seq data. The method effectively alleviates problems of existing dimension reduction methods such as loss of important information and insufficient feature extraction, preserves both cell-cell and gene-gene relationships in the dimension-reduction result, and achieves better clustering accuracy. The steps of the method are as follows:
1. data preprocessing
First, we assume an original scRNA-seq count matrix C from which genes not expressed in any cell have been filtered out. C can be expressed as a P × N matrix, where P is the total number of genes, N is the total number of cells, and C_ij denotes the expression value of gene i in cell j.
In this work, we first pre-process the raw scRNA-seq count data with a logarithmic transformation and z-score normalization, giving a normalized output X:

X′_ij = log(1 + C_ij / S_j),  X = zscore(X′)

where S_j is the size factor of cell j. This preprocessing reduces the effect of differences in library size and converts discrete counts into continuous values, providing greater flexibility for subsequent modeling.
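As an illustration, the preprocessing can be sketched as follows. The text does not specify how the size factor S_j is computed or along which axis the z-score is taken, so both choices below (library size divided by the median library size; per-gene z-score) are assumptions.

```python
import numpy as np

def preprocess_counts(C):
    """Preprocess a genes-by-cells count matrix C (P x N); returns the normalized matrix X."""
    C = np.asarray(C, dtype=float)

    # Filter out genes that are not counted in any cell
    C = C[C.sum(axis=1) > 0, :]

    # Size factor S_j per cell j (assumed: library size / median library size)
    lib_size = C.sum(axis=0)
    S = lib_size / np.median(lib_size)

    # Logarithmic transformation of size-factor-normalized counts: X'_ij = log(1 + C_ij / S_j)
    X_prime = np.log1p(C / S)

    # z-score normalization (assumed per gene)
    mean = X_prime.mean(axis=1, keepdims=True)
    std = X_prime.std(axis=1, keepdims=True) + 1e-8
    return (X_prime - mean) / std
```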
In addition to the gene-cell expression matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.
The cell-cell relationship graph is constructed with the K-nearest-neighbor (KNN) algorithm in the Scikit-learn Python package. The default K was set to 35 in this study and is adjusted per dataset in our experiments. The resulting adjacency matrix is a 0-1 matrix, with 1 indicating connected and 0 indicating not connected.
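A minimal scikit-learn sketch of the cell-cell graph construction; symmetrizing the KNN graph is an assumption, since the text does not say whether the adjacency matrix is made symmetric.

```python
from sklearn.neighbors import kneighbors_graph

def build_cell_graph(X_cells, k=35):
    """X_cells: cells-by-features matrix (e.g. the preprocessed expression matrix transposed)."""
    # 'connectivity' mode yields a 0-1 adjacency: 1 = connected, 0 = not connected
    A = kneighbors_graph(X_cells, n_neighbors=k, mode='connectivity', include_self=False)
    # Make the cell-cell graph undirected (assumed)
    A = A.maximum(A.T)
    return A  # scipy sparse matrix
```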
For the gene-gene interaction networks, we collected seven different human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE with existing data. One of the best-known gene interaction networks is the STRING database, a PPI network that collects and integrates protein-protein association information from a variety of sources, including literature and experiments. HumanNet is a human functional gene network that integrates multiple types of omics data through a Bayesian statistical framework. HumanNet has a hierarchical structure, including human-derived PPIs, co-functional links, co-citation links, and links transferred from other species. In particular, we use two versions of HumanNet, HumanNet-CF and HumanNet-PI, which consist of the co-functional network and the PPI network, respectively. FunCoup is a genome-wide functional association network that combines ten different types of functional-association evidence by redundancy-weighted Bayesian integration. GeneMANIA creates a combined gene network by weighting multiple functional genomics datasets. Furthermore, we collected two functional similarity matrices from pgWalk, derived from KEGG pathways and Gene Ontology biological processes, respectively. We then transform the two similarity matrices into gene networks by filtering out gene pairs whose similarity values are below a threshold (0.9). The two resulting networks are referred to as pgWalk-kegg and pgWalk-gobp, respectively.
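The thresholding that turns the pgWalk similarity matrices into gene networks could look like the following sketch; the dense similarity matrix S and the gene_names list are hypothetical inputs.

```python
import numpy as np

def similarity_to_network(S, gene_names, threshold=0.9):
    """Keep gene pairs whose functional similarity is at least the threshold (0.9 in the text)."""
    S = np.asarray(S)
    i_idx, j_idx = np.triu_indices_from(S, k=1)   # upper triangle, excluding the diagonal
    keep = S[i_idx, j_idx] >= threshold
    return [(gene_names[i], gene_names[j]) for i, j in zip(i_idx[keep], j_idx[keep])]
```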
2. Construction of a graph neural network for dimension reduction
(1) Graph neural network G1 retaining the cell-cell relationship
The graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. It has a low-dimensional bottleneck layer and can therefore be used as a dimension-reduction model. Assume the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. In our joint graph autoencoder there is one encoder E for the whole graph and two decoders, D_X and D_A, for nodes and edges, respectively. In practice, we first encode the input graph into the latent variable h = E(X, A), and then decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss, a weighted sum of the node reconstruction loss and the edge reconstruction loss, where the weight is a hyperparameter (set to 0.6 in our experiments).
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:

A_r = D_A(h) = σ(ZZ^T)

where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
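A minimal Keras-style sketch of the two decoders described above; the graph attention encoder is omitted because its exact layer and hyperparameters are not given here, and the latent dimension, gene count and use of tf.keras (rather than the TensorFlow 1.4 used in the experiments) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32      # assumed bottleneck size
n_genes = 2000       # assumed number of genes (input features per cell)

# Feature decoder D_X: fully connected network with hidden layers of 64, 256 and 512 nodes
feature_decoder = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(n_genes),                # reconstructed node features X_r
])

# Edge decoder D_A: one fully connected layer, then inner product and activation
class EdgeDecoder(layers.Layer):
    def __init__(self, units=latent_dim):
        super().__init__()
        self.fc = layers.Dense(units, activation='relu')      # Z = sigma(W h)

    def call(self, h):
        Z = self.fc(h)                                        # (n_cells, units)
        return tf.nn.relu(tf.matmul(Z, Z, transpose_b=True))  # A_r = sigma(Z Z^T)
```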
(2) Graph neural network G2 retaining gene-gene relationship
We note that when a gene interaction network is applied to a dataset, only those interaction pairs whose two interacting genes both occur in the dataset are retained; the remaining pairs are discarded. In other words, the number of interaction pairs may differ between datasets. To capture both the regulatory direction and the corresponding strength within a pair of genes, the gene interaction network is treated as a directed graph, so an edge between genes A and B from an undirected gene network (e.g., the STRING PPI network) is regarded as a pair of directed edges (i.e., an edge from A to B and an edge from B to A).
The construction of this graph neural network is the same as that of the network preserving the cell-cell relationship, except that its input is the gene-gene interaction (PPI) network instead of the cell-cell relationship graph. Interaction relationships between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolution stack, each node represents one gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolution layer that updates each node by aggregating information from its neighboring nodes.
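Under the assumption that the network is given as a list of undirected gene pairs, this edge handling could be sketched as:

```python
def to_directed_edges(undirected_pairs, genes_in_dataset):
    """Keep pairs whose genes both occur in the dataset and expand each
    undirected edge (A, B) into the directed edges A->B and B->A."""
    present = set(genes_in_dataset)
    directed = []
    for a, b in undirected_pairs:
        if a in present and b in present:
            directed.append((a, b))
            directed.append((b, a))
    return directed
```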
3. Dimension reduction for scRNA-seq data
The preprocessed scRNA-seq data are reduced in dimension using the constructed graph neural networks.
The gene-cell count matrix and the cell-cell relationship graph are input into graph neural network G1 to obtain the dimension-reduced cell features θ1.
The gene-cell count matrix and the gene-gene interaction network are input into graph neural network G2 to obtain the dimension-reduced cell features θ2.
The learned cell features are concatenated as the dimension-reduction result for subsequent downstream analysis.
4. K-means clustering
The method uses the ZINB conditional likelihood to reconstruct the decoder output for the scRNA-seq data; the ZINB distribution has proven to be a good model for scRNA-seq data and is a widely accepted model of the gene expression count distribution.
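The patent does not write out the ZINB distribution; for reference, its commonly used parameterization in the scRNA-seq literature (mean μ, dispersion θ, dropout probability π) is:

```latex
\mathrm{NB}(x;\mu,\theta)=\frac{\Gamma(x+\theta)}{x!\,\Gamma(\theta)}
  \left(\frac{\theta}{\theta+\mu}\right)^{\!\theta}
  \left(\frac{\mu}{\theta+\mu}\right)^{\!x},
\qquad
\mathrm{ZINB}(x;\mu,\theta,\pi)=\pi\,\delta_{0}(x)+(1-\pi)\,\mathrm{NB}(x;\mu,\theta)
```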
In order to evaluate the effectiveness of the method, the k-means clustering algorithm is applied to cluster the dimension-reduced data, and normalized mutual information is used as the evaluation index. Let X be the predicted clustering result and Y the true labeled cell types; the NMI score is the mutual information MI(X, Y) normalized by the Shannon entropies H(X) and H(Y).
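A minimal sketch of the clustering and evaluation step with scikit-learn; theta1 and theta2 denote the cell embeddings learned by G1 and G2, and since the exact NMI normalization is not stated here, the sketch relies on scikit-learn's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def cluster_and_evaluate(theta1, theta2, true_labels, n_clusters):
    # Concatenate the cell embeddings learned by the two graph neural networks
    embedding = np.concatenate([theta1, theta2], axis=1)

    # K-means with K set to the true number of clusters in the dataset
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embedding)

    nmi = normalized_mutual_info_score(true_labels, pred)
    ari = adjusted_rand_score(true_labels, pred)
    return nmi, ari
```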
From the foregoing, it can be seen that the scRNA-seq data dimension reduction method based on the graph neural network provided in one or more embodiments of the present disclosure retains both cell-cell and gene-gene relationships in the dimension reduction results. Our model constrains the data structure and dimension reduction is performed by two graph neural network modules. Experiments performed on five real scRNA-seq datasets indicate that the present method can provide a more accurate low-dimensional representation of the scRNA-seq data.
Detailed Description
The present invention will be described in further detail with reference to the following experiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Overview of data set
To evaluate the performance of scTPGAE, we focused on relatively large datasets and selected five real scRNA-seq datasets with known cell types, which are described below.
(i) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from a healthy human; (ii) the mouse embryonic stem cell dataset, describing the transcriptome of mouse embryonic stem cells undergoing heterogeneous differentiation after withdrawal of leukemia inhibitory factor (LIF); (iii) the mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097), from whose original count matrix we selected about 2700 cells from bladder tissue; (iv) the worm neuron cell dataset, from L2 larval-stage Caenorhabditis elegans analyzed by single-cell combinatorial-indexing RNA sequencing; (v) the Zeisel dataset, containing 3005 cells from mouse cortex and hippocampus (GSE60361).
2. Experimental environment and parameter setting
The hardware environment is a PC host with an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU at 2.42 GHz, 16 GB RAM and a 64-bit operating system. The software is implemented in Python under the PyCharm environment on Windows 10, with Python version 3.5.0 and TensorFlow version 1.4.0.
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:
A_r = D_A(h) = σ(ZZ^T)
where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
In addition to the gene-cell expression matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.
The cell-cell relationship graph is constructed with the K-nearest-neighbor (KNN) algorithm in the Scikit-learn Python package. The default K was set to 35 in this study and is adjusted per dataset in our experiments. The resulting adjacency matrix is a 0-1 matrix, with 1 indicating connected and 0 indicating not connected.
For the gene-gene interaction networks, we collected seven different human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE with existing data.
3. Evaluation index
In order to make the results of the different methods easy to compare, we use K-means for cluster analysis and set the parameter K to the true number of clusters in each dataset. In our experiments, the scTPGAE model is evaluated using two indices, normalized mutual information (NMI) and adjusted Rand index (ARI), which are widely used for model performance evaluation in unsupervised learning scenarios.
4. Analysis of experimental results
Here, experiments are performed on the five real datasets with the present method; the resulting normalized mutual information (NMI) and adjusted Rand index (ARI) values are reported in Figures 2 and 3, respectively.
The experimental result shows that the scTPGAE method based on the graph neural network is a promising new method. The present method achieves better performance over five real datasets, indicating that the present method can provide a more accurate low-dimensional representation of the scRNA-seq data.
It can be seen that the proposed scTPGAE method performs dimension reduction and cluster analysis on single-cell RNA-seq data and has the following advantages: first, scTPGAE matches the latent space distribution to a selected prior; second, scTPGAE retains the cell-cell relationship in the dimension-reduction result; third, scTPGAE retains the gene-gene relationship in addition to the cell-cell relationship; finally, the method takes advantage of the parallelism and scalability of the deep neural network framework. Our model constrains the data structure and performs dimension reduction through graph neural network modules. Experiments on five real scRNA-seq datasets, with normalized mutual information and the adjusted Rand index as evaluation indexes, show that the method performs well.
Drawings
Fig. 1: a flow diagram of a scRNA-seq data dimension reduction method based on a graph neural network;
fig. 2: experimental results with Normalized Mutual Information (NMI) as a measure;
fig. 3: experimental results with the Adjusted Rand Index (ARI) as a measure.
Claims (5)
1. A scRNA-seq data dimension reduction method based on a graph neural network is characterized by comprising the following implementation steps:
(1) Preprocessing the data: collecting scRNA-seq datasets of different species, types and cell numbers; preprocessing the collected raw scRNA-seq data by logarithmic transformation and z-score normalization, and reconstructing the input data with a zero-inflated negative binomial distribution to obtain denoised data;
(2) Constructing graph neural networks for dimension reduction, each an autoencoder framework consisting of a deep encoder, an intermediate hidden layer and a deep decoder, so that the topological structure between cells and the topological structure between genes are simultaneously retained in the dimension-reduction result;
(3) Reducing the dimension of the preprocessed scRNA-seq data with the constructed graph neural networks: the intermediate hidden layer of the autoencoder learns a hidden-layer feature vector whose prior distribution is constrained so that it matches the selected prior distribution; the hidden-layer feature vectors learned by the two graph neural networks are concatenated for subsequent downstream analysis;
(4) Clustering the dimension-reduced data with the k-means clustering algorithm, and computing the normalized mutual information score and the adjusted Rand index.
2. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the data collection and the preprocessing of the collected single-cell RNA sequencing data comprise:
we collected five scRNA-seq datasets from different species, different types, different cell numbers, and were then preprocessed using the method of logarithmic transformation and z-score normalization.
Specifically, we performed data preprocessing operations on the following five data sets.
(1) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from a healthy human;
(2) The mouse embryonic stem cell dataset, describing the transcriptome of mouse embryonic stem cells undergoing heterogeneous differentiation after withdrawal of leukemia inhibitory factor (LIF);
(3) The mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097); from the original count matrix we selected about 2700 cells from bladder tissue;
(4) The worm neuron cell dataset, from L2 larval-stage Caenorhabditis elegans analyzed by single-cell combinatorial-indexing RNA sequencing;
(5) The Zeisel dataset, containing 3005 cells from mouse cortex and hippocampus (GSE60361).
3. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the constructed graph neural network is an autoencoder framework consisting of a deep encoder, an intermediate hidden layer and a deep decoder, specifically comprising:
(1) Graph neural network G1 retaining the cell-cell relationship
The graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. It has a low-dimensional bottleneck layer and can therefore be used as a dimension-reduction model. Assume the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. In our joint graph autoencoder there is one encoder E for the whole graph and two decoders, D_X and D_A, for nodes and edges, respectively. In practice, we first encode the input graph into the latent variable h = E(X, A), and then decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss, a weighted sum of the node reconstruction loss and the edge reconstruction loss, where the weight is a hyperparameter (set to 0.6 in our experiments).
We use the Python package Spektral to implement our model. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as encoders in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by an inner-product and activation component:
A_r = D_A(h) = σ(ZZ^T)
where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
(2) Graph neural network G2 retaining gene-gene relationship
We note that when a gene interaction network is applied to a dataset, only those interaction pairs whose two interacting genes both occur in the dataset are retained; the remaining pairs are discarded. In other words, the number of interaction pairs may differ between datasets. To capture both the regulatory direction and the corresponding strength within a pair of genes, the gene interaction network is treated as a directed graph, so an edge between genes A and B from an undirected gene network (e.g., the STRING PPI network) is regarded as a pair of directed edges (i.e., an edge from A to B and an edge from B to A).
The construction of this graph neural network is the same as that of the network preserving the cell-cell relationship, except that its input is the gene-gene interaction (PPI) network instead of the cell-cell relationship graph. Interaction relationships between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolution stack, each node represents one gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolution layer that updates each node by aggregating information from its neighboring nodes.
4. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the preprocessed scRNA-seq data are reduced in dimension using the constructed graph neural networks, specifically comprising the following steps:
The gene-cell count matrix and the cell-cell relationship graph are input into graph neural network G1 to obtain the dimension-reduced cell features θ1.
The gene-cell count matrix and the gene-gene interaction network are input into graph neural network G2 to obtain the dimension-reduced cell features θ2.
The learned cell features are concatenated as the dimension-reduction result for subsequent downstream analysis.
5. The graph neural network-based scRNA-seq data dimension reduction method according to claim 1, wherein the k-means clustering algorithm is applied to cluster the dimension-reduced data, specifically comprising:
the present method uses the ZINB conditional likelihood to reconstruct the decoder output of the scRNA-seq data, and the ZINB distribution has proven to be a better model for describing the scRNA-seq data and is a widely accepted gene expression distribution structure.
To evaluate the effectiveness of the method, the k-means clustering algorithm is applied to cluster the dimension-reduced data, with normalized mutual information and the adjusted Rand index as evaluation indexes. Experiments on five real scRNA-seq datasets show that the method provides a more accurate low-dimensional representation of scRNA-seq data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211716676.1A CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211716676.1A CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116386729A true CN116386729A (en) | 2023-07-04 |
Family
ID=86975628
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211716676.1A Pending CN116386729A (en) | 2022-12-23 | 2022-12-23 | scRNA-seq data dimension reduction method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116386729A (en) |
-
2022
- 2022-12-23 CN CN202211716676.1A patent/CN116386729A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665786A (en) * | 2023-07-21 | 2023-08-29 | 曲阜师范大学 | RNA layered embedding clustering method based on graph convolution neural network |
CN116825204A (en) * | 2023-08-30 | 2023-09-29 | 鲁东大学 | Single-cell RNA sequence gene regulation inference method based on deep learning |
CN116825204B (en) * | 2023-08-30 | 2023-11-07 | 鲁东大学 | Single-cell RNA sequence gene regulation inference method based on deep learning |
CN118335192A (en) * | 2024-06-13 | 2024-07-12 | 杭州电子科技大学 | Single-cell sequencing data clustering method based on self-attention network and contrast learning |
CN118645154A (en) * | 2024-08-12 | 2024-09-13 | 中国医学科学院基础医学研究所 | Single-cell Hi-C map prediction method based on single-cell RNA expression data |
CN118645154B (en) * | 2024-08-12 | 2024-11-08 | 中国医学科学院基础医学研究所 | Single-cell Hi-C map prediction method based on single-cell RNA expression data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107622182B (en) | Method and system for predicting local structural features of protein | |
CN116386729A (en) | scRNA-seq data dimension reduction method based on graph neural network | |
CN111785329A (en) | Single-cell RNA sequencing clustering method based on confrontation automatic encoder | |
CN111210871A (en) | Protein-protein interaction prediction method based on deep forest | |
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
Wang et al. | Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis | |
Wang et al. | Graph neural networks: Self-supervised learning | |
CN115732034A (en) | Identification method and system of spatial transcriptome cell expression pattern | |
CN113571125A (en) | Drug target interaction prediction method based on multilayer network and graph coding | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN111276187A (en) | Gene expression profile feature learning method based on self-encoder | |
CN114067915A (en) | scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder | |
CN114783526A (en) | Depth unsupervised single cell clustering method based on Gaussian mixture graph variation self-encoder | |
Celik et al. | Biological cartography: Building and benchmarking representations of life | |
CN115881232A (en) | ScRNA-seq cell type annotation method based on graph neural network and feature fusion | |
Wu et al. | AAE-SC: A scRNA-seq clustering framework based on adversarial autoencoder | |
Zhang et al. | Feature selection algorithm for high-dimensional biomedical data using information gain and improved chemical reaction optimization | |
Wen et al. | CellPLM: pre-training of cell language model beyond single cells | |
CN117594132A (en) | Single-cell RNA sequence data clustering method based on robust residual error map convolutional network | |
Bagyamani et al. | Biological significance of gene expression data using similarity based biclustering algorithm | |
CN112071362A (en) | Detection method of protein complex fusing global and local topological structures | |
Chen et al. | A deep graph convolution network with attention for clustering scRNA-seq data | |
Pavlov et al. | Recognition of DNA secondary structures as nucleosome barriers with deep learning methods | |
Leoshchenko et al. | Sequencing for Encoding in Neuroevolutionary Synthesis of Neural Network Models for Medical Diagnosis. | |
Deng | Algorithms for reconstruction of gene regulatory networks from high-throughput gene expression data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |