CN117877585A - Sequencing data characteristic gene extraction method based on interpretable deep learning - Google Patents

Info

Publication number
CN117877585A
Authority
CN
China
Prior art keywords
gene
classification
genes
training
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410064040.6A
Other languages
Chinese (zh)
Inventor
杨朝勇
钟智星
丘宇童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202410064040.6A
Publication of CN117877585A
Legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The method uses interpretable deep learning to extract the characteristic genes that play an important role in the classification decision process. Genes are used as input to a classification task on the type of the target cells/spots (a spot is the measurement unit of a spatial transcriptome). During classification, the gradient values of the gene features are calculated as the importance (saliency) of each gene to the classification of the target cell/spot type, and the genes are ranked accordingly. The top-ranked genes are taken as the characteristic gene signature, and the cycle continues until the ranking is stable. The stable characteristic gene signature can be used to analyze patient prognosis, cell function, and so on. The method is suitable for transcriptome data based on next-generation sequencing, such as spatial transcriptomes and single-cell transcriptomes, and has wide applicability.

Description

Sequencing data characteristic gene extraction method based on interpretable deep learning
Technical Field
The invention relates to the field of transcriptome sequencing data analysis, in particular to a sequencing data characteristic gene extraction method based on interpretable deep learning.
Background
Single-cell transcriptome sequencing technology measures gene expression at single-cell resolution, which enables researchers to study the dynamic changes of cells during disease at the level of individual cells. Spatial transcriptome technology measures the gene expression of small groups of cells (called spots) in a tissue while preserving spatial information, which allows researchers to explore the relationships and interactions between cells in space. As related research proceeds, the amount of publicly available single-cell and spatial transcriptome data is growing to a very large scale. Deep learning, which is well suited to analyzing large amounts of data, is widely used on single-cell and spatial transcriptome data, and a typical application is to classify single cells or spots. However, the interior of a deep-learning model for single-cell and spatial transcriptome analysis is generally regarded as a black box: researchers can see the results but not the process. The classification decisions lack explanation, and the characteristic genes that play an important role in deciding the class of a target (cell or spot) cannot be extracted. This limits the application of deep learning to these data in biology and medicine. A method is therefore needed that explains the classification process and the decision basis of deep learning, so as to expand its application in biology and medicine.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a sequencing data characteristic gene extraction method based on interpretable deep learning, which extracts the characteristic genes of a target cell or spot type during classification by means of interpretable deep learning.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A sequencing data characteristic gene extraction method based on interpretable deep learning comprises the following steps:
S1: using high-variance genes as input, performing a classification task on the target cell/spot type with an interpretable deep neural network;
S2: during classification, calculating feature gradients from the back-propagation of the loss function and taking them as the saliency of the gene features;
S3: ranking the gene features by saliency, and taking the top-ranked genes as the characteristic gene signature;
S4: executing S1-S3 cyclically until a cycle stop condition is reached, to obtain a stable characteristic gene signature.
The gene signature is specifically a set of genes consisting of tens of genes. It represents the combination of genes specifically expressed by a certain cell or spot in the microenvironment, can serve as the identifier of that cell/spot in the microenvironment, and is related to the biological function of the cell/spot. The characteristic gene signature of disease-related cells/spots can be used as a prognostic index of the disease.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The invention provides a sequencing data characteristic gene extraction method based on interpretable deep learning, which addresses the lack of decision basis and interpretation when deep learning is applied to the classification of single-cell and spatial transcriptome data. The method uses interpretable deep learning to extract the characteristic genes that play an important role in the classification decision process. It takes the gene expression matrix of a single-cell or spatial transcriptome as input and uses the high-variance genes as features. A classification task is performed on the target cell/spot type; during classification, the gradient values of the gene features are calculated as the importance (saliency) of each gene to the classification of the target cell/spot type, and the genes are ranked by this importance. The top-ranked genes are taken as the characteristic gene signature, and the procedure is repeated until the ranking is stable. The stable characteristic gene signature can be used to analyze patient prognosis, cell function, and so on. The method is suitable for transcriptome data based on next-generation sequencing, such as spatial transcriptomes and single-cell transcriptomes, and has wide applicability.
Drawings
FIG. 1 is the framework of a sequencing data characteristic gene extraction method based on interpretable deep learning according to an embodiment of the present invention;
FIG. 2 illustrates the training process of a sequencing data characteristic gene extraction method based on interpretable deep learning according to an embodiment of the present invention;
FIG. 3 is an example of the output of a sequencing data characteristic gene extraction method based on interpretable deep learning, applied to cancer prognosis, according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects to be solved more clear and obvious, the invention is further described in detail below with reference to the accompanying drawings and embodiments.
The invention is designed and developed for extracting the characteristic genes of specific cell/spot types in single-cell and spatial transcriptome sequencing data. The characteristic genes are the combination of genes specifically expressed by the cell/spot in the biological microenvironment; they can serve as the identifier of the cell/spot in the microenvironment and are closely related to its functions and actions. Interpretable deep learning is a research area of deep learning concerned with how to describe the decision basis of a model during training and prediction. One popular interpretable deep learning method is the saliency-map-based method, which obtains the importance of features in the model's decision process by calculating feature saliency during training.
Building on the existing saliency map method, the sequencing data characteristic gene extraction method based on interpretable deep learning disclosed by the invention specifically comprises the following steps:
S1: using high-variance genes as input, a classification task is performed on the target cell/spot type with an interpretable deep neural network (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in panels A), B) and C) of FIG. 2).
Step S1 further comprises the steps of:
s101: single cell transcriptome or spatial transcriptome sequencing data is obtained according to public databases or experiments. In the form of a single cell expression matrix or a spatial spot (measurement unit of the spatial transcriptome, size of about 5-10 cells) expression matrix. The row genes of the matrix, the columns of the matrix are the numbers of the sample measurement units (cells in single cell transcriptomes and spots in space transcriptomes), and the values of the matrix are the number counts of row-corresponding genes expressed by column-corresponding cells/spots. The matrix is normalized. The normalized expression matrix is retained as training data.
S102: The cell or spot categories in the matrix are obtained as training labels by conventional methods. For example, cell types in a single-cell transcriptome can be annotated using cell-specific marker genes in the clusters obtained by unsupervised clustering; the regions of a spatial transcriptome can be identified manually by a pathologist. The cell type or tissue region of interest is labeled as class "1", and the remaining cells/spots are labeled as class "0"; these are the training labels.
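The labeling in S102 can be illustrated with a minimal sketch. The function name `make_binary_labels` and the annotation strings are hypothetical and chosen only for illustration; the source does not prescribe any particular implementation.

```python
def make_binary_labels(annotations, target):
    """Return 1 for units annotated as the target class and 0 for all others."""
    return [1 if a == target else 0 for a in annotations]


# Hypothetical annotations for six spots (not data from the source).
annotations = ["malignant", "stroma", "malignant", "immune", "stroma", "malignant"]
labels = make_binary_labels(annotations, target="malignant")
print(labels)  # [1, 0, 1, 0, 0, 1]
```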
S103: The data and labels are divided into a training set and a validation set at a ratio of 4:1. The variance of each gene is calculated on the training set and the genes are ranked by variance. The 5000 genes with the highest variance are selected as features, and the data and labels are input into a neural network for classification training.
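A minimal sketch of the 4:1 split and the variance ranking in S103 follows, using only the Python standard library. The helper names, the fixed random seed, and the use of population variance are assumptions; the toy matrix follows the genes-as-rows layout described in S101, and `n_top` would be 5000 in the embodiment.

```python
import random
import statistics


def split_indices(n_units, train_fraction=0.8, seed=0):
    """Shuffle unit (column) indices and split them 4:1 into training and validation."""
    idx = list(range(n_units))
    random.Random(seed).shuffle(idx)
    cut = int(round(n_units * train_fraction))
    return idx[:cut], idx[cut:]


def top_variance_genes(expr, train_idx, n_top):
    """Rank genes by variance over the training units only and return the top row indices."""
    variances = [statistics.pvariance([row[j] for j in train_idx]) for row in expr]
    order = sorted(range(len(expr)), key=lambda g: variances[g], reverse=True)
    return order[:n_top]


# Toy normalized matrix: 4 genes (rows) x 10 units (columns).
expr = [
    [0.0] * 10,                      # constant gene, variance 0
    [float(j) for j in range(10)],   # strongly varying gene
    [0.1 * j for j in range(10)],    # weakly varying gene
    [1.0] * 10,                      # constant gene, variance 0
]
train_idx, val_idx = split_indices(10)
hvg = top_variance_genes(expr, train_idx, n_top=2)
print(len(train_idx), len(val_idx), hvg)
```

Computing the variance on the training units only, as the text specifies, keeps the validation set from influencing feature selection.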
In the step S101, the normalization method is as follows:
where $n$ represents the number of cells (single-cell transcriptome) or the number of spots (spatial transcriptome) contained in a batch, $\mathrm{Norm}_i$ represents the normalized expression level of the $i$-th cell, and $\mathrm{count}_i$ represents the expression count measured directly from the $i$-th cell.
In step S103, the neural network model is denoted $f_\theta(\cdot)$. The network takes the expression values of the high-variance genes as input and is trained on the task of classifying the target cells/spots. It consists of an input layer, a hidden layer, and a classification layer. The network receives the expression of the high-variance genes as features, uses the hidden layer for feature extraction, and passes the result to the classification layer for classification. The classification layer contains two neurons as classification heads, which output the score $g_1$ of the target class and the score $g_0$ of the non-target class, respectively. Namely:
$$(g_1, g_0) = f_\theta(X)$$
wherein $X$ represents training data.
The neural network is trained using the cross-entropy loss, whose loss function is expressed as:
$$l_\theta(X, y) = -\left[\, y \log\frac{e^{g_1}}{e^{g_0} + e^{g_1}} + (1 - y)\log\frac{e^{g_0}}{e^{g_0} + e^{g_1}} \right]$$
wherein $l_\theta(\cdot)$ represents the loss function, $X$ represents training data, $y$ represents training labels, and $e$ represents the natural constant.
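The classifier and its loss can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the patented implementation: the ReLU activation, the single hidden layer of two units, and all weight values are hypothetical, and the scores are returned in the order (g0, g1) so that the label can index them directly.

```python
import math


def forward(x, w_hidden, w_out):
    """Input -> ReLU hidden layer -> two-neuron classification head; returns (g0, g1)."""
    # w_hidden[j] holds the input weights of hidden unit j.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit))) for unit in w_hidden]
    # w_out[c] holds the hidden-layer weights of the score for class c.
    return tuple(sum(h * w for h, w in zip(hidden, unit)) for unit in w_out)


def cross_entropy(scores, y):
    """Softmax cross-entropy for one sample: -log(e^{g_y} / (e^{g_0} + e^{g_1}))."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[y]


x = [1.0, 2.0]                            # expression of two genes in one cell/spot
w_hidden = [[0.5, 0.0], [0.0, -1.0]]      # hypothetical hidden-layer weights
w_out = [[1.0, 0.0], [-1.0, 0.0]]         # hypothetical classification-layer weights
scores = forward(x, w_hidden, w_out)
print(scores, cross_entropy(scores, y=0))
```

When both scores are equal, the loss reduces to log 2, which is a quick sanity check for any implementation of this loss.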
S2: During classification, feature gradients are calculated from the back-propagation of the loss function and taken as the saliency of the gene features (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). In each training iteration, back-propagation of the loss function yields the gradient of the loss function, denoted $\mathrm{Gradient}(\cdot)$, namely:
$$\mathrm{Gradient}(l_\theta(X, y)) = \frac{\partial l_\theta(X, y)}{\partial \theta_{1,\cdot}} = \left\{ \frac{\partial l_\theta(X, y)}{\partial \theta_{1,1}}, \frac{\partial l_\theta(X, y)}{\partial \theta_{1,2}}, \ldots, \frac{\partial l_\theta(X, y)}{\partial \theta_{1,m}} \right\}$$
where $m$ represents the number of characteristic genes; in this example, $m = 5000$. The input characteristic gene expression can be expressed as $\{\mathrm{Gene}_1, \mathrm{Gene}_2, \ldots, \mathrm{Gene}_m\}$. $\theta$ represents the parameters that the neural network needs to optimize, i.e., the weights of the neurons. $\theta_{1,\cdot}$ represents the neuron weights of the first layer, the input layer, and is a vector of $m$ elements, namely $\theta_{1,\cdot} = \{\theta_{1,1}, \theta_{1,2}, \ldots, \theta_{1,m}\}$, where $\theta_{1,i}$ represents the weight of the neuron corresponding to the input $\mathrm{Gene}_i$. Here, the gradient value is denoted by $W$, that is:
$$W = \{w_1, w_2, \ldots, w_m\} = \mathrm{Gradient}(l_\theta(X, y))$$
$W$ is a vector containing $m$ elements, where $w_i$ represents the component of the gradient corresponding to $\mathrm{Gene}_i$ in one training iteration. It represents the contribution of the gene to this iteration, i.e., the importance of the gene during this iteration. Over the whole training process, the total saliency value of a gene can be expressed as:
$$\mathrm{Saliency}(\mathrm{Gene}_i) = \sum_{j=1}^{k} w_i^{(j)}$$
where $k$ represents the total number of training iterations in the whole training process and $w_i^{(j)}$ is the value of $w_i$ in the $j$-th iteration. The total saliency value $\mathrm{Saliency}(\mathrm{Gene}_i)$ represents the gradient components of $\mathrm{Gene}_i$ accumulated over the whole training process, i.e., its contribution to, and importance in, the whole training process.
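The accumulation of first-layer gradient components can be sketched as follows. Several choices here are assumptions made only to keep the example small and self-contained: the first layer is modeled as one scaling weight per gene (matching the description of the first-layer weights as a vector of m elements), the hidden layer uses ReLU, only the first-layer weights are updated by plain gradient descent, the signed gradient components are summed, and all weights and samples are toy values.

```python
import math


def first_layer_gradient(x, y, theta, w_hidden, w_out):
    """Return dLoss/dtheta for one sample, where theta[i] scales the input of gene i."""
    a = [t * xi for t, xi in zip(theta, x)]                          # gene-wise input layer
    z = [sum(ai * w for ai, w in zip(a, unit)) for unit in w_hidden]
    h = [max(0.0, zj) for zj in z]                                   # ReLU hidden layer
    g = [sum(hj * w for hj, w in zip(h, unit)) for unit in w_out]    # scores (g0, g1)
    m = max(g)
    exp_g = [math.exp(s - m) for s in g]
    total = sum(exp_g)
    p = [e / total for e in exp_g]                                   # softmax probabilities
    dg = [p[c] - (1.0 if c == y else 0.0) for c in range(2)]         # cross-entropy gradient
    dh = [sum(dg[c] * w_out[c][j] for c in range(2)) for j in range(len(h))]
    dz = [dh[j] if z[j] > 0 else 0.0 for j in range(len(h))]         # ReLU gate
    da = [sum(dz[j] * w_hidden[j][i] for j in range(len(h))) for i in range(len(x))]
    return [da[i] * x[i] for i in range(len(x))]


def accumulate_saliency(samples, theta, w_hidden, w_out, epochs=2, lr=0.1):
    """Sum the per-gene gradient components w_i over all training iterations."""
    theta = list(theta)
    saliency = [0.0] * len(theta)
    for _ in range(epochs):
        for x, y in samples:
            w = first_layer_gradient(x, y, theta, w_hidden, w_out)
            saliency = [s + wi for s, wi in zip(saliency, w)]
            theta = [t - lr * wi for t, wi in zip(theta, w)]         # update first layer only
    return saliency


# Toy data: gene 0 varies between samples, gene 1 is never expressed.
samples = [([2.0, 0.0], 1), ([0.5, 0.0], 0)]
w_hidden = [[1.0, 1.0], [-1.0, 1.0]]
w_out = [[-1.0, 0.5], [1.0, -0.5]]
saliency = accumulate_saliency(samples, [1.0, 1.0], w_hidden, w_out)
print(saliency)
```

In this toy setup the unexpressed gene receives zero accumulated gradient, while the expressed gene does not. Summing magnitudes instead of signed values is a common variant that avoids cancellation between iterations; the source only states that the components are accumulated.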
S3: The gene features are ranked by saliency, and the top-ranked genes are taken as the characteristic gene signature (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). Each $\mathrm{Gene}_i$ has a saliency value $\mathrm{Saliency}(\mathrm{Gene}_i)$ indicating how important it is in the whole training process. The genes are ranked by their saliency values to obtain the characteristic gene signature. The characteristic gene signature is a set of genes on the order of tens of genes, specifically 10 to 20 genes. It represents the combination of genes specifically expressed by a certain cell or spot in the microenvironment, can serve as the identifier of that cell/spot in the microenvironment, and is related to the biological function of the cell/spot. The characteristic gene signature of disease-related cells/spots is of value in analyzing patient prognosis. In this example, the ten genes with the highest accumulated saliency are used as the characteristic gene signature.
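The ranking in S3 amounts to sorting genes by accumulated saliency and keeping the top entries. The sketch below assumes a descending sort on the raw saliency values; the function name, gene names, and numbers are hypothetical.

```python
def top_genes(saliency, gene_names, n_top=10):
    """Return the names of the n_top genes with the highest accumulated saliency."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return [gene_names[i] for i in order[:n_top]]


# Hypothetical accumulated saliency values for four genes.
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]
saliency = [0.2, 1.5, -0.3, 0.9]
signature = top_genes(saliency, gene_names, n_top=2)
print(signature)  # ['gene_b', 'gene_d']
```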
S4: S1-S3 are executed cyclically until a cycle stop condition is reached, yielding a stable characteristic gene signature (see FIG. 1; the loss function value, validation-set classification accuracy, and gene feature saliency during implementation are shown in FIG. 2). The cycle stop condition is that the ranking of the top-ranked genes remains unchanged for several consecutive training rounds, i.e., a steady state is reached. In this example, the steady state is considered reached when the ranking of the top ten genes remains unchanged for 20 consecutive rounds of training. In this steady state, the top-ranked genes are selected as the output gene signature, which represents the basis on which the model judges the class and constitutes the characteristic functional genes of that class of cells/spots.
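The stop condition in S4 can be sketched as a streak counter over the per-round top-gene rankings. The function name `stable_signature` and the toy history are hypothetical; `patience` would be 20 and each ranking would hold ten genes in the embodiment, and counting the first appearance of a ranking as round one of the streak is an assumption.

```python
def stable_signature(top_rankings, patience=20):
    """Return (round_index, ranking) once the same ordered top-gene list has been
    seen for `patience` consecutive rounds, or None if that never happens."""
    previous, streak = None, 0
    for round_index, ranking in enumerate(top_rankings):
        ranking = tuple(ranking)
        streak = streak + 1 if ranking == previous else 1
        previous = ranking
        if streak >= patience:
            return round_index, list(ranking)
    return None


# Toy history: the ranking changes after two rounds and then stays fixed.
history = [("a", "b"), ("a", "b"), ("b", "a"), ("b", "a"), ("b", "a")]
print(stable_signature(history, patience=3))  # (4, ['b', 'a'])
```

Comparing ordered tuples rather than sets means a swap of two genes inside the top list resets the streak, which matches the text's requirement that the ranking itself remain unchanged.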
The invention thus provides a sequencing data characteristic gene extraction method based on interpretable deep learning, which addresses the lack of decision basis and interpretation when deep learning is applied to the classification of single-cell and spatial transcriptome data. The method takes the gene expression matrix of a single-cell or spatial transcriptome as input and uses the high-variance genes as features. A classification task is performed on the target cell/spot type; during classification, the gradient values of the gene features are calculated as the importance (saliency) of each gene to the classification, and the genes are ranked by this importance. The top-ranked genes are taken as the characteristic gene signature, and the procedure is repeated until the ranking is stable. The stable characteristic gene signature can be used to analyze patient prognosis, cell function, and so on.
Consider the specific case of this embodiment, shown in FIG. 3. In a tumor microenvironment, taking the malignant region in the spatial transcriptome data as the target region of interest, the method yields a characteristic gene signature associated with tumor malignancy. Several genes in this signature, such as malt 1 and COL1A1, have been shown in previous studies to be related to cancer invasion and metastasis. By scoring each patient's expression of the signature gene set, patients can be divided into a high-risk group and a low-risk group, and the prognosis differs significantly between the two groups (log-rank test, p < 0.05). This demonstrates that the extracted characteristic gene signature associated with the target region type has good biological and medical value.
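The source does not state how the signature expression is scored or where the risk cutoff lies. As one possible illustration, the sketch below assumes the score is the mean expression of the signature genes and that patients are split at the median score; the patient identifiers, gene names, and values are hypothetical, and the log-rank comparison of the two groups is not shown.

```python
import statistics


def signature_scores(expr_by_patient, signature):
    """Score each patient as the mean expression of the signature genes (assumed scoring)."""
    return {
        patient: statistics.mean([genes[g] for g in signature])
        for patient, genes in expr_by_patient.items()
    }


def split_by_median(scores):
    """Assign patients above the median score to the high-risk group (assumed cutoff)."""
    cutoff = statistics.median(scores.values())
    high = sorted(p for p, s in scores.items() if s > cutoff)
    low = sorted(p for p, s in scores.items() if s <= cutoff)
    return high, low


# Hypothetical expression values for a two-gene signature.
expr_by_patient = {
    "P1": {"GENE_A": 0.5, "GENE_B": 1.5},
    "P2": {"GENE_A": 1.0, "GENE_B": 3.0},
    "P3": {"GENE_A": 2.5, "GENE_B": 3.5},
    "P4": {"GENE_A": 4.0, "GENE_B": 4.0},
}
scores = signature_scores(expr_by_patient, ["GENE_A", "GENE_B"])
high_risk, low_risk = split_by_median(scores)
print(high_risk, low_risk)
```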
The foregoing is merely illustrative of specific embodiments of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the design concept shall fall within the scope of the present invention.

Claims (5)

1. A sequencing data characteristic gene extraction method based on interpretable deep learning, characterized by comprising the following steps:
S1: using high-variance genes as input, performing a classification task on the target cell/spot type with an interpretable deep neural network;
S2: during classification, calculating feature gradients from the back-propagation of the loss function and taking them as the saliency of the gene features;
S3: ranking the gene features by saliency, and taking the top-ranked genes as the characteristic gene signature;
S4: executing S1-S3 cyclically until a cycle stop condition is reached, to obtain a stable characteristic gene signature.
2. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein the classification task using the interpretable deep neural network in step S1 is specifically as follows: the neural network takes the expression values of the high-variance genes as input to perform classification training on the target cells/spots, and consists of an input layer, a hidden layer, and a classification layer which are linearly fully connected; the network receives the expression of the high-variance genes as input data, uses the hidden layer for feature extraction, and passes the result to the classification layer for classification; the classification layer comprises two neurons as classification heads, which output the score $g_1$ of the target class and the score $g_0$ of the non-target class, respectively; during classification, the gradient components of the first-layer neurons of the neural network are used to represent the contribution of each gene feature to the classification of the target cell type/spot, i.e., the importance of the gene in the classification process.
3. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein step S2 is specifically:
in each training iteration of the classification, the gradient $\mathrm{Gradient}(l_\theta(X, y))$ of the loss function with respect to the parameters of the first-layer input neurons is calculated by back-propagation and taken as the saliency of the gene features, namely:
$$\mathrm{Gradient}(l_\theta(X, y)) = \frac{\partial l_\theta(X, y)}{\partial \theta_{1,\cdot}} = \left\{ \frac{\partial l_\theta(X, y)}{\partial \theta_{1,1}}, \frac{\partial l_\theta(X, y)}{\partial \theta_{1,2}}, \ldots, \frac{\partial l_\theta(X, y)}{\partial \theta_{1,m}} \right\}$$
wherein $X$ represents the data of one training iteration, $y$ represents the labels of one training iteration, and $l_\theta(X, y)$ represents the loss function of one training iteration based on data $X$ and labels $y$; $m$ represents the number of characteristic genes, and the input characteristic gene expression is expressed as $\{\mathrm{Gene}_1, \mathrm{Gene}_2, \ldots, \mathrm{Gene}_m\}$; $\theta$ represents the parameters that the neural network needs to optimize, i.e., the weights of the neurons; $\theta_{1,\cdot}$ represents the neuron weights of the first layer, the input layer, and is a vector of $m$ elements, namely $\theta_{1,\cdot} = \{\theta_{1,1}, \theta_{1,2}, \ldots, \theta_{1,m}\}$, wherein $\theta_{1,i}$ represents the weight of the neuron corresponding to the input $\mathrm{Gene}_i$;
$W$ represents the gradient value, namely:
$$W = \{w_1, w_2, \ldots, w_m\} = \mathrm{Gradient}(l_\theta(X, y))$$
$W$ is a vector containing $m$ elements, where $w_i$ represents the component of the gradient corresponding to $\mathrm{Gene}_i$ in one training iteration, which represents the contribution of the gene to that iteration, i.e., the importance of the gene during that iteration; over the whole training process, the total saliency value of a gene is expressed as:
$$\mathrm{Saliency}(\mathrm{Gene}_i) = \sum_{j=1}^{k} w_i^{(j)}$$
where $k$ represents the total number of training iterations in the whole training process and $w_i^{(j)}$ is the value of $w_i$ in the $j$-th iteration; the total saliency value $\mathrm{Saliency}(\mathrm{Gene}_i)$ represents the gradient components of $\mathrm{Gene}_i$ accumulated over the whole training process, i.e., its contribution to, and importance in, the whole training process.
4. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein in step S3 the characteristic gene signature is specifically: a gene signature that represents the combination of genes specifically expressed by a certain cell or spot in the microenvironment, can serve as the identifier of that cell/spot in the microenvironment, and is related to the biological function of the cell/spot; in particular, the characteristic gene signature of disease-related cells/spots is of value in analyzing patient prognosis.
5. The sequencing data characteristic gene extraction method based on interpretable deep learning of claim 1, wherein in step S4 the cycle stop condition is specifically: the ranking of the top-ranked genes remains unchanged over consecutive training rounds, i.e., a steady state is reached, and in this steady state the top-ranked genes are selected as the output gene signature.
CN202410064040.6A 2024-01-17 2024-01-17 Sequencing data characteristic gene extraction method based on interpretable deep learning Pending CN117877585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410064040.6A CN117877585A (en) 2024-01-17 2024-01-17 Sequencing data characteristic gene extraction method based on interpretable deep learning

Publications (1)

Publication Number Publication Date
CN117877585A true CN117877585A (en) 2024-04-12

Family

ID=90591716

Country Status (1)

Country Link
CN (1) CN117877585A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination