CN114334014A

CN114334014A - Cancer subtype identification method and system based on self-attention deep learning

Info

Publication number: CN114334014A
Application number: CN202111677858.8A
Authority: CN
Inventors: 巩萍; 孙秋文; 程磊; 张志远; 孟军; 葛海涛; 陈洁; 章龙珍
Original assignee: Xuzhou Medical University
Current assignee: Xuzhou Medical University
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-12

Abstract

The invention provides a cancer subtype identification method and system based on self-attention deep learning. The method comprises the following steps: first, multi-omics cancer data are preprocessed; the low-dimensional features of each omics are then learned separately using a deep learning Dense network, and the features of the different omics are preliminarily integrated by splicing; a similarity matrix between samples is then constructed using self-attention, and feature fusion is performed according to the matrix weights and the spliced features to obtain the final integrated feature representation. A decoder is used to minimize the error between the fused features and the original omics features, and a discriminator is used to perform adversarial learning on the integrated feature distribution. Finally, the learned integrated feature distribution is clustered by a Gaussian mixture model to identify cancer subtypes. The invention can effectively integrate multi-omics data, adaptively model the relationships between samples, learn a better feature representation, obtain better clustering results, and achieve accurate identification of cancer subtypes.

Description

Cancer subtype identification method and system based on self-attention deep learning

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a cancer subtype identification method and system based on self-attention deep learning.

Background

The diagnosis, treatment, and prognosis evaluation of cancer is one of the most urgent and important research subjects in the life sciences and medicine today. Research shows that cancers are highly heterogeneous: cancers with the same clinical stage or tissue morphology can differ greatly in molecular type. Molecular typing plays a crucial role in the selection of preoperative treatment schemes and in patient prognosis, and is an important basis for individualized treatment, particularly endocrine therapy and targeted therapy.

Early cancer molecular typing studies mainly used single-omics data. Such typing depends on the type of data used, and the results obtained from different types of omics data do not agree, resulting in low model accuracy. Cancer heterogeneity is not expressed at a single omics level only, but at the genomic, transcriptomic, epigenomic, and other omics levels. Because cancer is a class of highly complex diseases caused by different factors, research based on single-omics data can no longer meet the requirements of scientific research. Different omics data are complementary, and combining multi-omics data can better reveal the mechanisms of tumorigenesis and tumor development, which provides a new research direction for tumor molecular typing.

Feature extraction is the basis of multi-omics data research. Good features reflect the subtle differences and deeper information of a tumor, and are discriminative, robust, and reproducible. Omics data are usually high-dimensional, small-sample data, and results obtained by directly applying traditional data mining methods to them often generalize poorly. This is because a high-dimensional feature space combined with a small number of samples causes the curse of dimensionality: as the feature dimension increases, the difficulty of building a model that generalizes grows exponentially, which leads to overfitting.

To overcome the curse of dimensionality in high-dimensional omics data analysis, feature extraction must be performed on the original data to reduce the dimensionality of each omics dataset. In recent years, deep learning has gradually been applied to feature extraction from multi-omics data because of its strong feature learning capability. Deep learning simulates the learning process of the human brain through a multilayer neural network, drawing on the brain's multilayer abstraction mechanism to obtain abstract representations of data and thereby learn more useful features. Cancer typing based on deep learning is a current research focus.

Most current deep learning methods for multi-omics cancer subtype identification integrate the multi-omics data at the front end and then learn features with a deep learning model. These methods ignore the differences in data features between omics and the relationships between samples. To solve these problems, the present invention proposes a new cancer subtype identification method based on self-attention deep learning. The method fully considers the differences between the features of the various omics and the relationships between samples over the global features.

Disclosure of Invention

The purpose of the invention is as follows: in view of the problems and deficiencies of the prior art as described above, the present invention is directed to a cancer subtype identification method and system based on self-attention deep learning.

Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:

a cancer subtype identification method based on self-attention deep learning comprises the following steps:

respectively learning the low-dimensional features of each omic by utilizing a deep learning Dense network, and splicing the low-dimensional features of different omics obtained by learning to obtain spliced features;

constructing a similarity matrix between samples by using a self-attention mechanism, and performing feature fusion according to the matrix weight of the similarity matrix and the spliced features to obtain final integrated feature representation;

minimizing the error between the original features and the integrated features through a decoder, performing adversarial learning on the integrated feature distribution through a discriminator, and training to obtain the optimal integrated feature distribution;

and clustering the integrated feature distribution after training and learning by using a Gaussian mixture model to obtain the subtype of the cancer sample.

Preferably, the method further comprises the step of performing data preprocessing on the omics data of the cancer sample, and comprises the following steps:

preprocessing four different omics data of a cancer sample; wherein, the four different omics data are respectively mRNA data, miRNA data, DNA copy number variation data and DNA methylation data;

carrying out a logarithmic transformation on the mRNA and miRNA data to reduce the absolute values of the data;

removing the repeated regions of the DNA copy number variation data, and constructing features according to the correspondence between the samples and the genomic regions;

for DNA methylation data, integrating the DNA methylation information and calculating the average for each sample;

and carrying out normalization processing on the omics data.

Preferably, learning the low-dimensional features of each omics separately by using the deep learning Dense network comprises the following steps:

extracting the features of the multi-omics data separately by using a deep learning Dense network:

let $x^k \in \mathbb{R}^{N \times D}$ represent the input data of the k-th omics and $y^k \in \mathbb{R}^{N \times d}$ represent the output features of the k-th omics after it passes through the network, where N is the sample size, and D and d respectively represent the dimensionality of the input data and the dimensionality of the output features;

after the Dense network, $y^k$ is expressed as:

$y^k = W_k x^k + b^k$

where $W_k$ represents the weight matrix of the network and $b^k$ represents the bias;

the $y^k$ are spliced to obtain the spliced feature matrix Y:

$Y = \mathrm{Concat}(y^1, \ldots, y^4)$

the size of the spliced feature matrix Y is $N \times 4d$; to prevent the network from overfitting, a batch normalization layer (BN) is added after the Dense network, and the GELU function is used as the nonlinear activation function to obtain the spliced feature matrix Y′:

$Y' = \mathrm{GELU}(\mathrm{BN}(Y))$

Preferably, constructing the similarity matrix between samples by using the self-attention mechanism, and performing feature fusion according to the matrix weights of the similarity matrix and the spliced features to obtain the final integrated feature distribution, comprises the following steps:

regarding each spliced feature as a word in a sentence, let:

$d_k = 4d$

$Q = K = V = Y'$

$Q = Y' W^Q$

$K = Y' W^K$

$V = Y' W^V$

where Q, K, and V denote the query, key, and value matrices, and $W^Q$, $W^K$, $W^V$ represent linear projection parameters;

the similarity between samples i and j is then expressed as:

$s_{ij} = \dfrac{q_i k_j^{\top}}{\sqrt{d_k}}$

where $q_i$ is the i-th row of Q, $k_j$ is the j-th row of K, and $\sqrt{d_k}$ is the scaling factor;

the fused feature vector $z_i$ of sample i is calculated as follows:

let $\alpha_i$ be the similarity weight vector of sample i with all other samples; $\alpha_i$ is expressed as:

$\alpha_i = \mathrm{softmax}(s_{i1}, s_{i2}, \ldots, s_{iN})$

assuming that the fused feature vector of sample i is $z_i$, each row vector $v_j$ of V is multiplied by its weight value and the results are summed, giving:

$z_i = \sum_{j=1}^{N} \alpha_{ij} v_j$

the integrated features of all samples are expressed as:

$Z = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$

a batch normalization layer is added after the self-attention model to keep the data distribution unchanged; assuming that Z follows a Gaussian distribution $Z \sim N(u, \sigma^2)$, the mean u and variance $\sigma^2$ of Z are learned directly by fully connected layers to obtain the integrated feature distribution S(z).

Preferably, minimizing the error between the original features and the integrated features by the decoder comprises the following steps:

assume the inputs to the network are:

$X = \{x^1, x^2, x^3, x^4\}$

the output of the decoder is:

$X' = \{x^{1\prime}, x^{2\prime}, x^{3\prime}, x^{4\prime}\}$

the loss function $L_1$ between X and X′, based on the Euclidean distance, is expressed as:

Preferably, so that S(z) better fits a Gaussian distribution, adversarial learning is performed using a discriminator: a normal distribution P(z) with mean u and variance $\sigma^2$ is randomly generated; the generated normal distribution P(z) and the learned integrated feature distribution S(z) are input into the discriminator for adversarial learning, so that S(z) approaches P(z); the discriminator uses a binary cross-entropy loss function, given by:

$L_2 = -\mathbb{E}_{z' \sim P(z)}[\log D(z')] - \mathbb{E}_{z \sim S(z)}[\log(1 - D(S(z)))]$

the final network training loss function comprises $L_1$ and $L_2$, as follows:

$L = \lambda_1 L_1 + \lambda_2 L_2$

where $\lambda_1, \lambda_2 \in [0, 1]$ are the weight parameters of the respective loss functions.

Preferably, clustering the integrated feature distribution after training and learning comprises the following steps:

given the fused features $Z = \{z_1, z_2, \ldots, z_N\}$, where K is the number of clusters and $p(z_n)$ represents the probability density function of the mixture of Gaussian distributions, the clustering process based on the Gaussian mixture model is expressed as:

$p(z_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_n \mid \mu_k, \Sigma_k)$

where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$, and $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_K)$ respectively represent the weights, means, and covariances of the clustering model;

the Gaussian mixture model updates the parameters $\theta = (\pi, \mu, \Sigma)$ using the EM algorithm;

after the training process is finished, the most appropriate subtype label is obtained for each sample according to its maximum probability density across the different clusters.

A cancer subtype identification system based on self-attention deep learning comprising:

the deep learning Dense network is used for learning the low-dimensional features of the omics;

the characteristic splicing module is used for splicing the low-dimensional characteristics of different omics to obtain splicing characteristics;

the self-attention mechanism module is used for constructing a similarity matrix between the samples;

the feature fusion module is used for fusing the spliced features according to the matrix weight of the similarity matrix to obtain integrated feature representation;

a decoder for minimizing an error between the original feature and the integrated feature;

the discriminator is used for performing adversarial learning on the integrated feature distribution to obtain the optimal integrated feature distribution;

and a clusterer for clustering the optimal integrated feature distribution to obtain the subtype of the cancer sample.

The invention has the beneficial effects that:

The invention provides a cancer subtype identification method based on self-attention deep learning that combines the characteristics of multi-omics data with the advantages of the self-attention mechanism. The main contributions of the invention are: (1) the features of each omics are learned separately, according to the characteristics of multi-omics data; (2) the self-attention mechanism is used to take full account of the omics data characteristics and to adaptively construct the relationship weights between samples; (3) the constructed model can effectively integrate multi-omics data and obtain a better clustering effect.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example 1

The invention discloses a cancer subtype identification method and system based on self-attention deep learning, which are shown in figure 1 and specifically comprise the following steps:

Step 1: preprocess the four omics data of the cancer samples separately. For mRNA and miRNA expression data, a log transformation is first performed to reduce the absolute values of the data. For DNA copy number variation data, the repeated regions are first removed, and features are then constructed according to the correspondence between the samples and the genomic regions. For DNA methylation data, since each sample corresponds to many methylation sites, the DNA methylation information is first integrated and the average for each sample is calculated. Multi-omics cancer data contain missing values to varying degrees; for each omics dataset, missing values are filled with the sample mean. Finally, the omics data are normalized.
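By way of non-limiting illustration, the preprocessing of an expression matrix (mRNA or miRNA) in step 1 may be sketched as follows. The source does not specify the logarithm base, the exact imputation rule, or the normalization scheme, so the choices below (log2(x + 1), imputation with the mean of each sample's observed values, and per-feature min-max scaling) are assumptions, and the function name `preprocess_expression` is hypothetical.

```python
import numpy as np

def preprocess_expression(x):
    """Log-transform, mean-impute, and min-max normalize one omics matrix.

    x: array of shape (n_samples, n_features); NaN marks a missing value.
    """
    x = np.asarray(x, dtype=float)
    # Log transform to compress the range of expression values.
    # log2(x + 1) is an assumed choice; the source only says "log transformation".
    x = np.log2(x + 1.0)
    # Assumed imputation rule: fill each missing entry with the mean of the
    # observed values of the same sample (row).
    row_mean = np.nanmean(x, axis=1, keepdims=True)
    x = np.where(np.isnan(x), row_mean, x)
    # Assumed normalization: scale each feature (column) to [0, 1].
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span

raw = np.array([[0.0, 3.0, np.nan],
                [1.0, 7.0, 15.0]])
clean = preprocess_expression(raw)
```

Copy number variation and methylation data would need their own feature construction before the same normalization is applied; that part is not sketched here.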

Step 2: extract the features of each omics separately by using a deep learning Dense network, and preliminarily integrate the extracted features of the different omics by splicing.

Let $x^k \in \mathbb{R}^{N \times D}$ represent the input data of the k-th omics and $y^k \in \mathbb{R}^{N \times d}$ represent the output features of the k-th omics after it passes through the network, where N is the sample size, and D and d respectively represent the dimensionality of the input data and the dimensionality of the output features. After the Dense network, $y^k$ is expressed as:

$y^k = W_k x^k + b^k$

where $W_k$ represents the weight matrix of the network and $b^k$ represents the bias. The $y^k$ are spliced to obtain the spliced feature matrix Y:

$Y = \mathrm{Concat}(y^1, \ldots, y^4)$

The size of Y is $N \times 4d$. To prevent network overfitting, a batch normalization layer (BN) is added after the Dense network and the GELU function is used as the nonlinear activation function, i.e.:

$Y' = \mathrm{GELU}(\mathrm{BN}(Y))$
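By way of non-limiting illustration, the forward pass of step 2 may be sketched in NumPy as follows. The sketch assumes that samples are stored as rows (so the dense layer is written `x @ w + b`), that batch normalization has no learned scale or shift, and that GELU is computed with its common tanh approximation; the input dimensions and random weights are hypothetical placeholders, not trained parameters.

```python
import numpy as np

def dense(x, w, b):
    # One Dense layer per omics: y^k = x^k W_k + b^k, samples as rows.
    return x @ w + b

def batch_norm(y, eps=1e-5):
    # Standardize each feature over the batch (no learned scale/shift here).
    return (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + eps)

def gelu(y):
    # Tanh approximation of the GELU activation.
    return 0.5 * y * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (y + 0.044715 * y**3)))

def encode_and_splice(omics, weights, biases):
    # Encode every omics separately, then splice along the feature axis.
    feats = [dense(x, w, b) for x, w, b in zip(omics, weights, biases)]
    y = np.concatenate(feats, axis=1)      # shape (N, 4d) for four omics
    return gelu(batch_norm(y))             # spliced feature matrix Y'

rng = np.random.default_rng(0)
N, d = 8, 3
dims = [20, 10, 15, 12]                    # hypothetical input dimensions D
omics = [rng.normal(size=(N, D)) for D in dims]
weights = [rng.normal(scale=0.1, size=(D, d)) for D in dims]
biases = [np.zeros(d) for _ in dims]
y_prime = encode_and_splice(omics, weights, biases)
```

Encoding each omics with its own layer before splicing lets inputs of very different dimensionality contribute the same number of features, d, to Y.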

Step 3: construct a similarity matrix between the samples by using a self-attention mechanism, and perform feature fusion according to the matrix weights and the spliced features to obtain the final integrated feature representation and integrated feature distribution.

Let $d_k = 4d$,

$Q = K = V = Y'$, $Q = Y' W^Q$, $K = Y' W^K$, $V = Y' W^V$, where Q, K, and V represent the query, key, and value matrices respectively, and $W^Q$, $W^K$, $W^V$ represent linear projection parameters. The self-attention based sample fusion process is as follows: the similarity between samples i and j is first calculated:

$s_{ij} = \dfrac{q_i k_j^{\top}}{\sqrt{d_k}}$

where $q_i$ is the i-th row of Q, $k_j$ is the j-th row of K, and $\sqrt{d_k}$ is the scaling factor. Let $\alpha_i$ be the similarity weight vector of sample i with all other samples; $\alpha_i$ is expressed as:

$\alpha_i = \mathrm{softmax}(s_{i1}, s_{i2}, \ldots, s_{iN})$

Assuming that the fused feature of sample i is $z_i$, each row vector $v_j$ of V is multiplied by the corresponding weight in $\alpha_i$ and the results are summed, giving:

$z_i = \sum_{j=1}^{N} \alpha_{ij} v_j$

The integrated features of all samples are then expressed as:

$Z = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$

A batch normalization layer (BN) is added after the self-attention model to keep the data distribution unchanged; assuming that Z follows a Gaussian distribution $Z \sim N(u, \sigma^2)$, the mean u and variance $\sigma^2$ of Z are learned directly by fully connected layers to obtain the integrated feature distribution S(z).
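By way of non-limiting illustration, the sample-level self-attention of step 3 may be sketched as follows. The sketch assumes standard scaled dot-product attention with a row-wise softmax, which is consistent with the scaling by $d_k$ described above; the projection matrices are random placeholders, and the subsequent batch normalization and the fully connected layers for u and $\sigma^2$ are omitted.

```python
import numpy as np

def softmax(s, axis=-1):
    # Numerically stable softmax.
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def sample_self_attention(y_prime, w_q, w_k, w_v):
    """Fuse features across samples; each row of y_prime is one sample."""
    q, k, v = y_prime @ w_q, y_prime @ w_k, y_prime @ w_v
    d_k = k.shape[1]
    scores = q @ k.T / np.sqrt(d_k)    # (N, N) sample-to-sample similarity
    alpha = softmax(scores, axis=1)    # row i holds the weights of sample i
    z = alpha @ v                      # z_i = sum_j alpha_ij * v_j
    return z, alpha

rng = np.random.default_rng(0)
N, d_k = 6, 12                         # d_k = 4d with a hypothetical d = 3
y_prime = rng.normal(size=(N, d_k))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_k, d_k)) for _ in range(3))
z, alpha = sample_self_attention(y_prime, w_q, w_k, w_v)
```

Because attention is taken over samples rather than over positions within one sample, the N × N matrix `alpha` plays the role of the adaptive sample similarity matrix.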

Step 4: network training. To obtain a good feature representation, the error between the original features and the integrated features is minimized through a decoder; so that the learned integrated feature distribution better fits a Gaussian distribution, adversarial learning is performed through a discriminator.

Let the input of the network be $X = \{x^1, x^2, x^3, x^4\}$ and the output of the decoder be $X' = \{x^{1\prime}, x^{2\prime}, x^{3\prime}, x^{4\prime}\}$. The loss function $L_1$ between X and X′ is expressed as:

To better fit S(z) to a Gaussian distribution, a discriminator is used for adversarial learning: a normal distribution P(z) with mean u and variance $\sigma^2$ is randomly generated; the integrated feature distribution S(z) and the normal distribution P(z) are input into the discriminator, and S(z) is made to approach P(z) through learning. The discriminator uses a binary cross-entropy loss function, defined as follows:

$L_2 = -\mathbb{E}_{z' \sim P(z)}[\log D(z')] - \mathbb{E}_{z \sim S(z)}[\log(1 - D(S(z)))]$

The final loss function of the network training consists of two parts, $L_1$ and $L_2$:

$L = \lambda_1 L_1 + \lambda_2 L_2$

where $\lambda_1, \lambda_2 \in [0, 1]$ are weight parameters.
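By way of non-limiting illustration, the two loss terms of step 4 may be sketched as follows. The exact form of $L_1$ is not reproduced in the text, so the squared Euclidean distance averaged over samples and summed over omics is an assumption; the discriminator outputs are passed in as arrays of probabilities, and all function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(xs, xs_rec):
    # Assumed form of L1: squared Euclidean distance between each omics
    # matrix and its reconstruction, averaged over samples, summed over omics.
    return sum(np.mean(np.sum((x - xr) ** 2, axis=1)) for x, xr in zip(xs, xs_rec))

def discriminator_loss(d_prior, d_learned, eps=1e-12):
    # Binary cross-entropy L2: samples from the prior P(z) are labeled 1,
    # learned integrated features from S(z) are labeled 0.
    # d_prior and d_learned are discriminator outputs in (0, 1).
    return -np.mean(np.log(d_prior + eps)) - np.mean(np.log(1.0 - d_learned + eps))

def total_loss(l1, l2, lam1=1.0, lam2=1.0):
    # L = lambda_1 * L1 + lambda_2 * L2, with both weights in [0, 1].
    return lam1 * l1 + lam2 * l2

l1 = reconstruction_loss([np.zeros((2, 3))], [np.ones((2, 3))])
l2 = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
loss = total_loss(l1, l2, lam1=0.5, lam2=0.5)
```

In an actual training loop the encoder and decoder would be updated against the discriminator in alternation; the sketch only evaluates the loss values.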

Step 5: cluster the integrated feature distribution obtained by training and learning by using a Gaussian mixture model to obtain the subtype of the cancer sample.

Given the number of clusters K, for the fused feature distribution, the clustering based on the Gaussian mixture model is expressed as:

$p(z_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_n \mid \mu_k, \Sigma_k)$

where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$, and $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_K)$ respectively represent the weights, means, and covariances of the clustering model. The parameters $\theta = (\pi, \mu, \Sigma)$ are updated using the EM algorithm. After the training process is finished, the most appropriate subtype label is obtained for each sample according to its maximum probability density across the different clusters.
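By way of non-limiting illustration, the EM updates of step 5 may be sketched with a small full-covariance Gaussian mixture model in NumPy. The source does not state how the parameters are initialized, so the farthest-point initialization of the means, the isotropic initial covariance, the fixed iteration count, and the small regularization term `reg` are all assumptions; the two synthetic blobs stand in for learned integrated features.

```python
import numpy as np

def log_gaussian(z, mu, cov):
    # log N(z_n | mu, cov) for every row z_n, via a Cholesky factorization.
    d = z.shape[1]
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, (z - mu).T)          # shape (d, N)
    maha = np.sum(sol ** 2, axis=0)                  # squared Mahalanobis distance
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def init_means(z, k):
    # Assumed deterministic farthest-point initialization.
    centers = [z[np.argmax(np.linalg.norm(z - z.mean(axis=0), axis=1))]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(z - c, axis=1) for c in centers], axis=0)
        centers.append(z[np.argmax(dist)])
    return np.array(centers)

def fit_gmm(z, k, n_iter=100, reg=1e-6):
    n, d = z.shape
    mu = init_means(z, k)
    cov = np.stack([np.eye(d) * z.var(axis=0).mean()] * k)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each sample.
        log_p = np.stack(
            [np.log(pi[j]) + log_gaussian(z, mu[j], cov[j]) for j in range(k)], axis=1
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and covariances.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (resp.T @ z) / nk[:, None]
        for j in range(k):
            diff = z - mu[j]
            cov[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    # Each sample takes the label of its most probable cluster.
    return pi, mu, cov, resp.argmax(axis=1)

rng = np.random.default_rng(1)
z = np.vstack([rng.normal(-5.0, 0.5, size=(30, 2)),
               rng.normal(5.0, 0.5, size=(30, 2))])
pi, mu, cov, labels = fit_gmm(z, k=2)
```

In practice a library implementation of Gaussian mixture models could replace this sketch; the manual version is shown only to make the E-step and M-step explicit.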

The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A cancer subtype identification method based on self-attention deep learning is characterized by comprising the following steps:

minimizing the error between the original features and the integrated features through a decoder, and performing adversarial learning on the integrated feature distribution through a discriminator to obtain the optimal integrated feature distribution after training and learning;

and clustering the optimal integrated feature distribution after training and learning to obtain the subtype of the cancer sample.

2. The cancer subtype identification method based on self-attention deep learning according to claim 1, further comprising data preprocessing of the multi-omics data of the cancer sample, comprising the following steps:

and carrying out normalization processing on the omics data.

3. The cancer subtype identification method based on self-attention deep learning according to claim 1, characterized in that learning the low-dimensional features of each omics separately by using the deep learning Dense network comprises the following steps:

let $x^k$ represent the input data of the k-th omics;

after the Dense network, $y^k$ is expressed as:

$y^k = W_k x^k + b^k$

where $W_k$ represents the weight matrix of the network and $b^k$ represents the bias;

the $y^k$ are spliced to obtain the spliced feature matrix Y:

$Y = \mathrm{Concat}(y^1, \ldots, y^4)$

the size of the spliced feature matrix Y is $N \times 4d$; to prevent network overfitting, a batch normalization layer (BN) is added after the Dense network, and the spliced feature matrix Y′ is obtained using the GELU function as the nonlinear activation function:

$Y' = \mathrm{GELU}(\mathrm{BN}(Y))$

4. The cancer subtype identification method based on self-attention deep learning according to claim 3, characterized in that constructing the similarity matrix between the samples by using the self-attention mechanism and performing feature fusion according to the matrix weights of the similarity matrix and the spliced features to obtain the final integrated feature representation comprises the following steps:

regarding each spliced feature as a word in a sentence, let:

$d_k = 4d$

$Q = K = V = Y'$

$Q = Y' W^Q$

$K = Y' W^K$

$V = Y' W^V$

the similarity between samples i and j is then expressed as:

$s_{ij} = \dfrac{q_i k_j^{\top}}{\sqrt{d_k}}$

where $q_i$ is the i-th row of Q, $k_j$ is the j-th row of K, and $\sqrt{d_k}$ is the scaling factor;

the fused feature vector $z_i$ of sample i is calculated as follows:

$z_i = \sum_{j=1}^{N} \alpha_{ij} v_j, \quad \alpha_i = \mathrm{softmax}(s_{i1}, s_{i2}, \ldots, s_{iN})$

the integrated features of all samples are expressed as:

$Z = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$

5. The cancer subtype identification method based on self-attention deep learning according to claim 1, characterized in that minimizing the error between the original features and the integrated features by the decoder comprises the following steps:

assume the inputs to the network are:

$X = \{x^1, x^2, x^3, x^4\}$

the output of the decoder is:

$X' = \{x^{1\prime}, x^{2\prime}, x^{3\prime}, x^{4\prime}\}$

the loss function $L_1$ between X and X′, based on the Euclidean distance, is expressed as:

6. The cancer subtype identification method based on self-attention deep learning according to claim 5, characterized in that the adversarial learning of the integrated feature distribution by the discriminator comprises the following steps:

randomly generating a normal distribution P(z) with mean u and variance $\sigma^2$;

inputting the generated normal distribution P(z) and the learned integrated feature distribution S(z) into the discriminator for adversarial learning, and defining the loss function of the discriminator by binary cross-entropy, as follows:

$L_2 = -\mathbb{E}_{z' \sim P(z)}[\log D(z')] - \mathbb{E}_{z \sim S(z)}[\log(1 - D(S(z)))]$

the final network training loss function comprises $L_1$ and $L_2$, as follows:

$L = \lambda_1 L_1 + \lambda_2 L_2$

7. The cancer subtype identification method based on self-attention deep learning according to claim 4, characterized in that clustering the integrated feature representation after training and learning comprises the following steps:

given the integrated features $Z = \{z_1, z_2, \ldots, z_N\}$, the clustering based on the Gaussian mixture model is expressed as:

$p(z_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_n \mid \mu_k, \Sigma_k)$

where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$, and $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_K)$ respectively represent the weights, means, and covariances of the clustering model;

8. A cancer subtype recognition system based on self-attention deep learning, comprising: