CN117877585A

CN117877585A - Sequencing data characteristic gene extraction method based on interpretable deep learning

Info

Publication number: CN117877585A
Application number: CN202410064040.6A
Authority: CN
Inventors: 杨朝勇; 钟智星; 丘宇童
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2024-01-17
Filing date: 2024-01-17
Publication date: 2024-04-12

Abstract

The method uses interpretable deep learning to extract the characteristic genes that play an important role in the classification decision process. It takes gene expression as input and performs a classification task on the target cell/spot type (a spot is the measurement unit of a spatial transcriptome). During classification, the gradient values of the gene features are computed as the importance (saliency) of each gene to the classification of the target cell/spot type, and the genes are ranked accordingly. The top-ranked genes are taken as the characteristic gene signature, and the procedure is repeated until the ranking is stable. The stable characteristic gene signature can be used to analyze patient prognosis, cell function, and so on. The method is applicable to transcriptome data based on next-generation sequencing, such as spatial and single-cell transcriptomes, and has wide applicability.

Description

Sequencing data characteristic gene extraction method based on interpretable deep learning

Technical Field

The invention relates to the field of transcriptome sequencing data analysis, in particular to a sequencing data characteristic gene extraction method based on interpretable deep learning.

Background

Single-cell transcriptome sequencing measures gene expression at single-cell resolution, allowing researchers to study the dynamic changes of cells during disease at the level of individual cells. Spatial transcriptome technology measures the gene expression of groups of several cells (called spots) in a tissue while preserving spatial information, allowing researchers to explore the spatial relationships and interactions among cells. As this research advances, the volume of publicly available single-cell and spatial transcriptome data has grown very large. Deep learning, which is well suited to analyzing large amounts of data, is widely used on single-cell and spatial transcriptome data; a typical application is classifying single cells or spots. However, the interior of a deep learning model for such data is generally treated as a black box: researchers see the results but not the process, the classification decisions lack explanation, and the characteristic genes that drive the decision for the classification targets (cells or spots) cannot be extracted. This limits the application of deep learning to these data in biology and medicine. A method is therefore needed to explain the classification process and decision basis of deep learning, so as to expand its application in biology and medicine.

Disclosure of Invention

The invention aims to solve the problems in the prior art by providing a sequencing data characteristic gene extraction method based on interpretable deep learning, which extracts the characteristic genes of a target cell or spot type during the classification process using interpretable deep learning.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A sequencing data characteristic gene extraction method based on interpretable deep learning comprises the following steps:

S1: taking high-variance genes as input, performing a classification task on the target cell/spot type with an interpretable deep neural network;

S2: during classification, computing the feature gradients by back-propagation of the loss function and taking them as the saliency of the gene features;

S3: ranking the gene features by saliency and taking the top-ranked genes as the characteristic gene signature;

S4: repeating S1-S3 until a stop condition is reached, to obtain a stable characteristic gene signature.

The characteristic gene signature is a gene set of roughly ten to twenty genes. It represents the combination of genes specifically expressed by a given cell or spot in its microenvironment, can serve as an identifier of the cell/spot in the microenvironment, and is related to the biological function of the cell/spot. The characteristic gene signature of disease-related cells/spots can be used as a prognostic index of the disease.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention provides a sequencing data characteristic gene extraction method based on interpretable deep learning, which addresses the lack of decision basis and interpretation when deep learning is applied to the classification of single-cell and spatial transcriptome data. The method uses interpretable deep learning to extract the characteristic genes that play an important role in the classification decision process. It takes the gene expression matrix of a single-cell or spatial transcriptome as input and uses the high-variance genes as features. A classification task is performed on the target cell/spot type; during classification, the gradient values of the gene features are computed as the importance (saliency) of each gene to the classification of the target cell/spot type, and the genes are ranked accordingly. The top-ranked genes are taken as the characteristic gene signature, and the procedure is repeated until the ranking is stable; the stable characteristic gene signature can then be used to analyze patient prognosis, cell function, and so on. The method is applicable to transcriptome data based on next-generation sequencing, such as spatial and single-cell transcriptomes, and has wide applicability.

Drawings

FIG. 1 shows the framework of the sequencing data characteristic gene extraction method based on interpretable deep learning according to an embodiment of the present invention;

FIG. 2 illustrates the training process of the sequencing data characteristic gene extraction method based on interpretable deep learning according to an embodiment of the present invention;

FIG. 3 shows an example output of the sequencing data characteristic gene extraction method based on interpretable deep learning, applied to cancer prognosis, according to an embodiment of the present invention.

Detailed Description

To make the technical problems to be solved, the technical schemes, and the beneficial effects of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments.

The invention is designed for extracting the characteristic genes of specific cell/spot types from single-cell and spatial transcriptome sequencing data. Characteristic genes are the combination of genes specifically expressed by a cell/spot in its biological microenvironment; they can serve as an identifier of the cell/spot in the microenvironment and are closely related to its function and role. Interpretable deep learning is a research area of deep learning that asks how to describe the decision basis of a model during training and prediction. One popular approach is the saliency-map method, which obtains the importance of features in the model's decision process by computing feature saliency during training.

Building on the existing saliency-map method, the sequencing data characteristic gene extraction method based on interpretable deep learning disclosed by the invention comprises the following steps:

S1: Taking high-variance genes as input, a classification task is performed on the target cell/spot type with an interpretable deep neural network (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency are shown in FIG. 2, panels A), B) and C)).

Step S1 further comprises the steps of:

S101: Single-cell transcriptome or spatial transcriptome sequencing data are obtained from public databases or experiments, in the form of a single-cell expression matrix or a spatial spot expression matrix (a spot is the measurement unit of the spatial transcriptome, about 5-10 cells in size). The rows of the matrix are genes, the columns are the sample measurement units (cells in single-cell transcriptomes, spots in spatial transcriptomes), and each entry is the expression count of the row's gene in the column's cell/spot. The matrix is normalized, and the normalized expression matrix is retained as training data.

S102: The cell or spot categories in the matrix are obtained by conventional methods and used as training labels. For example, cell types in a single-cell transcriptome can be annotated by combining unsupervised clustering with cell-specific marker genes in the resulting clusters; regions on a spatial transcriptome can be annotated manually by a pathologist. The cell type or tissue region of interest is labeled as class "1", and the remaining cells/spots are labeled as class "0".

S103: The data and labels are divided into a training set and a validation set at a ratio of 4:1. The variance of each gene is calculated on the training set and the genes are ranked by variance. The 5000 genes with the highest variance are selected as features, and the data and labels are input into a neural network for classification training.
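The data preparation in S102-S103 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the implementation of the embodiment: the function name `split_and_select_hvg`, the random shuffle before the 4:1 split, and the `seed` parameter are hypothetical; the text only specifies the 4:1 ratio, variance ranking on the training set, and 5000 features.

```python
import numpy as np

def split_and_select_hvg(expr, labels, n_top=5000, train_frac=0.8, seed=0):
    """expr: genes x units matrix (rows = genes, columns = cells/spots).
    labels: 1 for the target cell/spot type, 0 otherwise (one per column)."""
    rng = np.random.default_rng(seed)
    n_units = expr.shape[1]
    order = rng.permutation(n_units)           # assumed: random split
    n_train = int(round(train_frac * n_units))  # 4:1 -> 80% training
    train_idx, val_idx = order[:n_train], order[n_train:]

    # Gene variance is computed on the training columns only.
    gene_var = expr[:, train_idx].var(axis=1)
    n_top = min(n_top, expr.shape[0])
    hvg_idx = np.argsort(gene_var)[::-1][:n_top]  # highest variance first

    # Transpose so that each cell/spot becomes one row of gene features.
    X_train = expr[hvg_idx][:, train_idx].T
    X_val = expr[hvg_idx][:, val_idx].T
    return X_train, labels[train_idx], X_val, labels[val_idx], hvg_idx
```

Binary labels as described in S102 could be built with, for example, `labels = (annotation == "malignant").astype(int)`, where `annotation` is a hypothetical array of per-unit category names.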

In the step S101, the normalization method is as follows:

where n represents the number of cells contained in a batch (single-cell transcriptome) or the number of spots (spatial transcriptome), Norm_i represents the normalized expression level of the i-th cell, and count_i represents the expression count measured directly from the i-th cell.

In step S103, the neural network model is denoted f_θ(·). The network takes the expression values of the high-variance genes as input and is trained on the classification task for the target cells/spots. It consists of an input layer, a hidden layer, and a classification layer. The network receives the expression of the high-variance genes as features, extracts features with the hidden layer, and passes them to the classification layer for classification. The classification layer contains two neurons as classification heads, which output the score g_1 of the target class and the score g_0 of the non-target class, respectively. Namely:

(g_1, g_0) = f_θ(X)

where X represents the training data.

The neural network is trained with a cross-entropy loss, whose loss function is expressed as:

where l_θ(·) represents the loss function, X represents the training data, y represents the training labels, and e represents the natural constant.
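One loss consistent with the symbols listed above is the standard two-class softmax cross-entropy, l = -log(e^{g_y} / (e^{g_0} + e^{g_1})), where g_y is the score of the labeled class. The sketch below assumes that form; the function name `two_class_cross_entropy` and the averaging over the batch are assumptions made for illustration.

```python
import numpy as np

def two_class_cross_entropy(g0, g1, y):
    """Mean softmax cross-entropy over a batch.
    g0, g1: scores of the non-target and target class; y: labels in {0, 1}."""
    scores = np.stack([np.asarray(g0, float), np.asarray(g1, float)], axis=-1)
    y = np.asarray(y, dtype=int)
    # Subtract the row maximum for numerical stability (does not change softmax).
    shifted = scores - scores.max(axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the labeled class for each sample.
    picked = np.take_along_axis(log_softmax, y[:, None], axis=-1)[:, 0]
    return -picked.mean()
```

With equal scores the two classes are equally likely, so the loss equals ln 2; it approaches 0 as the score of the labeled class dominates.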

S2: During classification, the feature gradients are computed by back-propagation of the loss function and taken as the saliency of the gene features (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). In each training step, back-propagation of the loss function yields the gradient of the loss function, denoted Gradient(·), namely:

where m represents the number of characteristic genes; in this example, m = 5000. The input characteristic gene expression values can be written as {Gene_1, Gene_2, ..., Gene_m}. θ represents the parameters that the neural network needs to optimize, i.e., the neuron weights. θ_{1,·}, the neuron weights of the first layer (the input layer), is a vector of m elements, namely θ_{1,·} = {θ_{1,1}, θ_{1,2}, ..., θ_{1,m}}, where θ_{1,i} represents the weight of the neuron corresponding to the input Gene_i. The gradient value is denoted W, that is:

W = {w_1, w_2, ..., w_m} = Gradient(l_θ(X, y))

W is a vector of m elements, where w_i is the component of the gradient corresponding to Gene_i in one training step. It represents the contribution of that gene to this training step, i.e., the importance of the gene in this step. Over the whole training process, the total saliency of a gene can be expressed as:

where k represents the total number of training steps in the whole training process. The total saliency Saliency(Gene_i) represents the gradient components of Gene_i accumulated over the whole training process, i.e., its contribution to, and importance in, the whole training process.
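The accumulation of first-layer gradients over k training steps can be sketched in plain NumPy as below. Several points are assumptions rather than statements of the embodiment: the input layer is modeled as one scaling weight θ_{1,i} per gene, the hidden layer uses ReLU with 16 units, training is plain full-batch gradient descent, Gradient(·) is taken as the partial derivative of the loss with respect to θ_{1,i}, and absolute values are accumulated by default so that opposite-signed steps do not cancel (set `absolute=False` for a signed sum). All function names are hypothetical.

```python
import numpy as np

def init_params(m, hidden=16, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "theta1": np.ones(m),  # assumed: one input-layer weight per gene
        "W2": rng.normal(0, np.sqrt(2.0 / m), (m, hidden)),
        "b2": np.zeros(hidden),
        "W3": rng.normal(0, np.sqrt(2.0 / hidden), (hidden, 2)),
        "b3": np.zeros(2),
    }

def loss_and_grads(p, X, y):
    """Cross-entropy loss and gradients for one batch.
    X: n x m expression matrix, y: n integer labels in {0, 1}."""
    n = X.shape[0]
    a = X * p["theta1"]                  # input layer: per-gene scaling
    z = a @ p["W2"] + p["b2"]
    h = np.maximum(z, 0.0)               # hidden layer (ReLU)
    g = h @ p["W3"] + p["b3"]            # columns: g_0 (non-target), g_1 (target)
    g = g - g.max(axis=1, keepdims=True)
    logp = g - np.log(np.exp(g).sum(axis=1, keepdims=True))
    loss = -logp[np.arange(n), y].mean()
    # Back-propagation, layer by layer.
    dg = np.exp(logp)
    dg[np.arange(n), y] -= 1.0
    dg /= n
    dz = (dg @ p["W3"].T) * (z > 0)
    da = dz @ p["W2"].T
    grads = {
        "theta1": (da * X).sum(axis=0),  # W = {w_1, ..., w_m}
        "W2": a.T @ dz, "b2": dz.sum(axis=0),
        "W3": h.T @ dg, "b3": dg.sum(axis=0),
    }
    return loss, grads

def train_with_saliency(X, y, steps=100, lr=0.05, absolute=True, seed=0):
    """Train for `steps` iterations and accumulate per-gene gradient components."""
    p = init_params(X.shape[1], seed=seed)
    saliency = np.zeros(X.shape[1])
    for _ in range(steps):               # k training steps in total
        _, grads = loss_and_grads(p, X, y)
        w = grads["theta1"]
        saliency += np.abs(w) if absolute else w
        for name in p:                   # plain gradient-descent update
            p[name] -= lr * grads[name]
    return saliency, p
```

The returned `saliency` vector has one entry per input gene and can be passed directly to a ranking step.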

S3: The gene features are ranked by saliency, and the top-ranked genes are taken as the characteristic gene signature (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). Each Gene_i has a saliency value Saliency(Gene_i), indicating how important it is over the whole training process. The genes are ranked by their saliency values to obtain the characteristic gene signature. The characteristic gene signature is a gene set of 10 to 20 genes. It represents the combination of genes specifically expressed by a given cell or spot in its microenvironment, can serve as an identifier of the cell/spot in the microenvironment, and is related to the biological function of the cell/spot. The characteristic gene signature of disease-related cells/spots is of value in analyzing patient prognosis. In this example, the ten genes with the highest accumulated saliency are used as the characteristic gene signature.
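The ranking in S3 amounts to sorting genes by accumulated saliency and keeping the first ten. A minimal sketch, with the hypothetical helper name `top_genes`:

```python
import numpy as np

def top_genes(saliency, gene_names, k=10):
    """Return the k gene names with the highest accumulated saliency, best first."""
    order = np.argsort(np.asarray(saliency))[::-1][:k]
    return [gene_names[i] for i in order]
```

For example, `top_genes([0.1, 0.9, 0.5], ["A", "B", "C"], k=2)` returns `["B", "C"]`.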

S4: S1-S3 are repeated until a stop condition is reached, yielding a stable characteristic gene signature (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). The stop condition is that the ranking of the top-ranked genes remains unchanged for several consecutive training rounds, i.e., a steady state is reached. In this example, the steady state is considered reached when the ranking of the top ten genes remains unchanged for 20 consecutive training rounds. In this steady state, the top-ranked genes are selected as the output gene signature, which represents the basis on which the model judges the class and comprises the characteristic functional genes of that class of cells/spots.
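The stop condition in S4 can be sketched as a streak counter over successive rankings. This sketch assumes that "unchanged" means the ordered top-ten list is identical from one round to the next (comparing unordered sets would be an alternative reading); the function name `run_until_stable` and its parameters are hypothetical.

```python
def run_until_stable(rank_stream, top_k=10, patience=20):
    """rank_stream yields, once per training round, the gene names ordered by
    accumulated saliency. Stops when the top_k list has been identical for
    `patience` consecutive rounds and returns (signature, round_index)."""
    previous, streak = None, 0
    for round_idx, ranking in enumerate(rank_stream, start=1):
        current = tuple(ranking[:top_k])
        # Extend the streak if nothing changed, otherwise start a new one.
        streak = streak + 1 if current == previous else 1
        previous = current
        if streak >= patience:
            return list(current), round_idx
    return None, None  # the stream ended before a steady state was reached
```

With `top_k=2` and `patience=3`, the stream `[["a", "b"], ["b", "a"], ["b", "a"], ["b", "a"]]` stops at round 4 with the signature `["b", "a"]`.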

The invention provides a sequencing data characteristic gene extraction method based on interpretable deep learning, which addresses the lack of decision basis and interpretation when deep learning is applied to the classification of single-cell and spatial transcriptome data. The method uses interpretable deep learning to extract the characteristic genes that play an important role in the classification decision process. It takes the gene expression matrix of a single-cell or spatial transcriptome as input and uses the high-variance genes as features. A classification task is performed on the target cell/spot type; during classification, the gradient values of the gene features are computed as the importance (saliency) of each gene to the classification of the target cell/spot type, and the genes are ranked accordingly. The top-ranked genes are taken as the characteristic gene signature, and the procedure is repeated until the ranking is stable; the stable characteristic gene signature can then be used to analyze patient prognosis, cell function, and so on.

In the specific case of this embodiment, as shown in FIG. 3, taking the malignant region of a tumor microenvironment as the target region of interest in the spatial transcriptome data, the method yields a characteristic gene signature associated with tumor malignancy. Several genes in this signature, such as malt 1 and COL1A1, have been shown in previous studies to be related to cancer invasion and metastasis. By scoring each patient's expression of the characteristic gene signature, patients can be divided into a high-risk group and a low-risk group, with a significant difference in prognosis between the two groups (log-rank test, p < 0.05). This demonstrates that the extracted characteristic gene signature associated with the target region type has good biological and medical value.
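The scoring and grouping step can be illustrated as follows. The text does not specify the scoring rule or the cut-off, so this sketch assumes the score is the mean expression of the signature genes and that patients are split at the median score; treating the higher-score group as the high-risk group is likewise an assumption, and the function name `risk_groups` is hypothetical. The survival comparison between the groups (log-rank test) is not shown.

```python
import numpy as np

def risk_groups(expr, gene_names, signature):
    """expr: patients x genes expression matrix; gene_names: list of column names.
    Returns the per-patient signature score and a boolean mask of the
    higher-score group (assumed here to be the high-risk group)."""
    # Keep only signature genes that are present in the expression matrix.
    cols = [gene_names.index(g) for g in signature if g in gene_names]
    score = expr[:, cols].mean(axis=1)   # assumed: mean expression as score
    high = score > np.median(score)      # assumed: median split
    return score, high
```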

The foregoing is merely illustrative of specific embodiments of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the design concept shall fall within the scope of the present invention.

Claims

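1. A sequencing data characteristic gene extraction method based on interpretable deep learning, characterized by comprising the following steps:

S1: taking high-variance genes as input, performing a classification task on the target cell/spot type with an interpretable deep neural network;

S2: during classification, computing the feature gradients by back-propagation of the loss function and taking them as the saliency of the gene features;

S3: ranking the gene features by saliency and taking the top-ranked genes as the characteristic gene signature;

S4: repeating S1-S3 until a stop condition is reached, to obtain a stable characteristic gene signature.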

2. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein the classification task using the interpretable deep neural network in step S1 is specifically as follows: the neural network takes the expression values of the high-variance genes as input and is trained on the classification task for the target cells/spots, and consists of an input layer, a hidden layer, and a classification layer that are linearly fully connected; the network receives the expression of the high-variance genes as input data, extracts features with the hidden layer, and passes them to the classification layer for classification; the classification layer contains two neurons as classification heads, which output the score g_1 of the target class and the score g_0 of the non-target class, respectively; during classification, the gradient components of the first-layer neurons of the neural network are used to represent the degree of contribution of the gene features to the classification of the target cell type/spot, i.e., the importance of the genes in the classification process.

3. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein step S2 is specifically:

in each training step of the classification, computing, by back-propagation of the loss function, the gradient Gradient(l_θ(X, y)) of the parameters of the first-layer (input-layer) neurons as the saliency of the gene features, namely:

wherein X represents the data of one training step, y represents the labels of one training step, and l_θ(X, y) represents the loss function of one training step based on data X and labels y; m represents the number of characteristic genes, and the input characteristic gene expression values are written as {Gene_1, Gene_2, ..., Gene_m}; θ represents the parameters that the neural network needs to optimize, i.e., the neuron weights; θ_{1,·}, the neuron weights of the first layer (the input layer), is a vector of m elements, namely θ_{1,·} = {θ_{1,1}, θ_{1,2}, ..., θ_{1,m}}, wherein θ_{1,i} represents the weight of the neuron corresponding to the input Gene_i;

W represents the gradient value, namely:

W = {w_1, w_2, ..., w_m} = Gradient(l_θ(X, y))

W is a vector of m elements, wherein w_i is the component of the gradient corresponding to Gene_i in one training step, which represents the contribution of that gene to the training step, i.e., the importance of the gene in that step; over the whole training process, the total saliency of a gene is expressed as:

wherein k represents the total number of training steps in the whole training process, and the total saliency Saliency(Gene_i) represents the gradient components of Gene_i accumulated over the whole training process, i.e., its contribution to, and importance in, the whole training process.

4. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein in step S3 the characteristic gene signature is specifically: a gene signature that represents the combination of genes specifically expressed by a given cell or spot in its microenvironment, can serve as an identifier of the cell/spot in the microenvironment, and is related to the biological function of the cell/spot; in particular, the characteristic gene signature of disease-related cells/spots is of value in analyzing patient prognosis.

5. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein in step S4 the stop condition is specifically: the ranking of the top-ranked genes remains unchanged over consecutive training rounds, i.e., a steady state is reached, and in this steady state the top-ranked genes are selected as the output gene signature.