CN114530222A - Cancer patient classification system based on multiomics and image data fusion - Google Patents

Cancer patient classification system based on multiomics and image data fusion

Info

Publication number
CN114530222A
CN114530222A (application CN202210034741.6A)
Authority
CN
China
Prior art keywords
data, module, fusion, omics, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210034741.6A
Other languages
Chinese (zh)
Inventor
董守斌
黄薇娴
谭凯文
胡金龙
张子烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210034741.6A priority Critical patent/CN114530222A/en
Publication of CN114530222A publication Critical patent/CN114530222A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/2431 - Multiple classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 - ICT specially adapted for the handling or processing of medical images
    • G16H30/20 - ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention discloses a cancer patient classification system based on multi-omics and image data fusion. The system performs end-to-end loading and preprocessing of multi-omics data and image data, introduces additional feature information from an external knowledge database to carry out feature dimension reduction and information aggregation on specific omics data, supplements additional sample information by computing the similarity between cancer patients (i.e. samples), and finally fuses the classification results of the multi-omics data and the image data through a multi-modal cross-fusion method to produce the final classification result. The system comprises the following functional modules: a data loading and preprocessing module, a multi-omics processing module, an image processing module and a fusion module. The invention can effectively fuse multi-omics and image data and can be used to classify various cancer patients accurately.

Description

Cancer patient classification system based on multiomics and image data fusion
Technical Field
The invention relates to the technical field of cancer patient classification, in particular to a cancer patient classification system based on multiomics and image data fusion.
Background
Cancer is a disease with complex underlying molecular mechanisms and contributing factors, and large amounts of data are required to describe, diagnose and treat a patient accurately. Omics data are the main data used by researchers to study the mechanisms of cancer. In recent years, advances in gene sequencing technology have greatly shortened sequencing time, reduced sequencing cost and lowered the manual effort involved, promoting the rapid development of genomics, proteomics and other omics fields. Meanwhile, with the development of modern computing and medical imaging, images have become an effective means of studying cancer, and pathological images are increasingly regarded as the "gold standard" for diagnosis. Multi-omics and image data provide disease information about a patient from different levels: genomics, transcriptomics and proteomics provide molecular-level analyses of cancer patients at the gene, transcript and protein expression levels respectively, while image data directly represent the patient's current physical condition. More and more research is devoted to fusing omics data and image data in order to diagnose and treat cancer patients more comprehensively, but such fusion faces challenges including the curse of dimensionality, data heterogeneity and data imbalance.
Over the years, many methods have been proposed to fuse omics data and image data for various problems. However, most existing work focuses on unsupervised fusion of multi-omics and image data, or simply obtains additional information from either features or samples. With the development of public and personalized medicine, more and more organizations and institutions provide literature and data sets related to cancer, encouraging research into supervised fusion methods for multi-omics and image data, which can identify disease-related biomarkers and predict labels for new samples. Early attempts at this type of approach include feature-concatenation-based methods and ensemble-based methods. On the one hand, concatenation-based methods integrate multi-omics and image data by directly concatenating the features of the input data before learning a classification model. On the other hand, ensemble-based methods integrate the predictions of different classifiers, each trained on one type of input data. However, these methods do not take into account the correlations between different input data types and may be biased towards certain input data types.
In summary, by obtaining additional information from both features and samples, a new multi-modal data fusion method is needed to realize information interaction between different input data and complete the fusion of multi-omics data and image data.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a cancer patient classification system based on the fusion of multi-omics and image data, which can realize information fusion between multi-omics data and image data and accurately classify cancer patients using the fused multi-omics and image data.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a cancer patient classification system based on multi-omics and image data fusion, which performs end-to-end loading and preprocessing of multi-omics data and image data, introduces additional feature information from an external knowledge database to carry out feature dimension reduction and information aggregation on specific omics data, supplements additional sample information by computing the similarity between cancer patients (i.e. samples), and finally fuses the classification results of the multi-omics data and the image data through a multi-modal cross-fusion method. The system specifically comprises the following functional modules:
the data loading and preprocessing module, which is used for importing multi-omics data and image data and preprocessing the imported data;
the multi-omics processing module, which comprises an integrated gene network module and a supervised graph convolution module, wherein the integrated gene network module uses the gene-gene interaction information provided by the external database HINT to construct an adjacency matrix between genes and uses a graph convolutional network (GCN) together with this gene adjacency matrix to characterize genes, and the supervised graph convolution module uses cosine similarity to construct an adjacency matrix between samples and uses a GCN together with this sample adjacency matrix to characterize samples, obtaining a preliminary prediction classification result with the omics data as input;
the image processing module, which performs representation learning on the image data with a convolutional neural network to obtain a preliminary prediction classification result with the image data as input;
and the fusion module, which comprises a preliminary cross-fusion module and a network fusion module, wherein the preliminary cross-fusion module constructs a multi-modal data cross-fusion vector and inputs a reconstructed vector of this vector into the network fusion module, realizing the fusion of the classification results of the omics and image data and producing the final classification result.
Further, the data loading and preprocessing module comprises a data loading module and a data preprocessing module. The data loading module is used for loading multi-omics data and image data, wherein the multi-omics data comprise genomics data, transcriptomics data and epigenomics data, each row of the multi-omics data represents the expression values of all samples on the corresponding feature, each column represents the expression values of one sample on the corresponding features, and the image data are cancer patient pathology image data. The preprocessing performed by the data preprocessing module comprises sample alignment and feature alignment: features of the multi-omics data whose null-value proportion exceeds a% of all samples are deleted, null values whose proportion is lower than b% of all samples are filled with the software IMPUTE2, features whose variance is lower than a threshold are removed, feature alignment of specific omics data is carried out with the software TCGA-Assembler, the pathology images are analysed with the software HistomicsTK, and the pathology images are cropped with the OpenSlide tool, so that each sample obtains z regions of interest (ROI), z ≥ 1, each ROI having a pixel size of r1 × r2, where r1 and r2 correspond to the length and width in pixels of each ROI; finally, the omics and image data are divided according to a specified proportion to obtain a training set and a test set. The data output by the data preprocessing module consist of a plurality of samples, each sample comprising multiple omics data and a number of pathology images.
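By way of illustration, the omics-side filtering and filling described above might be sketched as follows; the concrete thresholds and the median fill (standing in for IMPUTE2) are placeholder assumptions, and pandas is used only for convenience.

```python
import pandas as pd

def preprocess_omics(df: pd.DataFrame, max_null_frac: float = 0.2,
                     var_threshold: float = 1e-3) -> pd.DataFrame:
    """Rough sketch of the omics feature filtering described above.
    df: features x samples matrix (rows = features, columns = samples)."""
    null_frac = df.isna().mean(axis=1)
    df = df.loc[null_frac <= max_null_frac]          # drop features missing in more than a% of samples
    df = df.apply(lambda row: row.fillna(row.median()), axis=1)  # stand-in for IMPUTE2 filling
    return df.loc[df.var(axis=1) >= var_threshold]   # remove near-constant features

def align_samples(tables: list) -> list:
    """Keep only the samples (columns) present in every omics table."""
    common = sorted(set.intersection(*(set(t.columns) for t in tables)))
    return [t[common] for t in tables]
```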
Further, the omics processing module comprises an integrated gene network module and a supervised graph convolution module;
the integrated gene Network module utilizes interaction information between introduced genes of an external database HINT to realize information aggregation and feature screening of a plurality of omics data feature levels through a Graph convolution neural Network (GCN), and comprises the following steps:
a1) construction of an adjacency matrix A between genes Using Chiense-binary-physical-interaction datasets provided by the external database HINT(g)∈R(p×p)R is a real number set, and p is a characteristic number;
a2) using the adjacency matrix A obtained in step a1)(g)Constructing a Graph convolution neural Network (GCN) to obtain neighbor information of a feature space:
Figure BDA0003467867200000041
wherein, omics U is {1, 2., U }, and U is a plurality of groups of mathematical numbers,
Figure BDA0003467867200000042
respectively pretreated groupsLearning the training set and the test set of u, respectively, in the training phase and the test phase into the formula of step a2),
Figure BDA0003467867200000043
for the implicit layer characterization of omics u, σ (·) is an activation function ReLU (·) ═ max (0, ·), max (0, ·) indicates a larger number of 0 and · which is an hadamard product,
Figure BDA0003467867200000044
for parameters needing to be learned in the training process of a Graph Convolutional neural Network (GCN) in omics u, the integrated gene Network module learns the parameters of the Graph Convolutional neural Network (GCN) only in the training stage;
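By way of illustration only, a minimal sketch of the feature-level layer of step a2) is given below (in PyTorch, which the patent does not specify), under the assumption stated above that the HINT-derived gene adjacency masks a learnable p × p weight matrix; the class name and initialization are hypothetical.

```python
import torch
import torch.nn as nn

class GeneNetworkLayer(nn.Module):
    """Feature-level aggregation: H = ReLU(X @ (A_gene * W)).

    A_gene is the fixed p x p gene adjacency built from HINT interactions;
    the Hadamard mask keeps only weights between interacting genes.
    """
    def __init__(self, adjacency: torch.Tensor):
        super().__init__()
        p = adjacency.shape[0]
        self.register_buffer("adjacency", adjacency)       # fixed gene graph
        self.weight = nn.Parameter(torch.empty(p, p))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (n_samples, p)
        return torch.relu(x @ (self.adjacency * self.weight))
```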
the supervised image convolution module constructs a sample adjacency matrix A according to cosine similarity between samples(s)The method for obtaining the preliminary prediction classification result of each omics through the Graph convolution neural Network (GCN) comprises the following steps:
b1) construction of adjacency matrix A according to similarity between samples(s)
b1.1) in the training stage, calculating the cosine similarity between the samples in the training set to obtain the adjacency matrix of the training samples
Figure BDA0003467867200000045
Figure BDA0003467867200000046
Wherein the content of the first and second substances,
Figure BDA0003467867200000047
an adjacency matrix representing samples i and j,
Figure BDA0003467867200000051
denotes the cosine similarity between sample i and sample j, xiAnd xiExpression of sample i and sample j in omics, respectivelyValue, | | · | luminance2Representing a 2-norm operation on a,
Figure BDA0003467867200000052
is a contiguous matrix
Figure BDA0003467867200000053
I denotes the identity matrix, ∈ being determined by a given parameter k, k denotes the average number of edges retained by each node, including self-join, whose formula is as follows:
Figure BDA0003467867200000054
wherein I (·) is an indicator function, when sim (x)i,xj) When the k is equal to 1, each node is only connected, and a Graph Convolutional neural Network (GCN) at the moment is equal to a full connection layer;
b1.2) in the testing stage, calculating cosine similarity between the training sample and the testing sample and between the testing sample and the testing sample, and replacing the training integrated test set according to the formula in the step b1.1) to obtain an adjacency matrix of the testing sample
Figure BDA0003467867200000055
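The sample-graph construction of step b1) could be sketched as follows; choosing ε from the average-degree parameter k by a top-k threshold, and zeroing the diagonal before adding the identity, are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_adjacency(x: torch.Tensor, k: float = 5.0) -> torch.Tensor:
    """Sample adjacency used by the supervised GCN: threshold cosine similarities
    so that each node keeps about k edges on average, then add self-connections."""
    x_norm = F.normalize(x, dim=1)
    sim = x_norm @ x_norm.t()                     # (n, n) cosine similarities
    n = sim.shape[0]
    sim.fill_diagonal_(0.0)                       # self-connections handled separately
    num_keep = min(int(k * n), n * n)
    eps = torch.topk(sim.flatten(), num_keep).values[-1]
    adj = sim * (sim >= eps).float()              # keep only sufficiently similar pairs
    return adj + torch.eye(n)                     # A_hat = A + I
```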
b2) constructing the graph convolutional network (GCN) of the supervised graph convolution module:
b2.1) in the training stage, the GCN of the supervised graph convolution module is constructed as follows:

Z^(u)_1 = σ( Â^(s)_tr H^(u)_tr W^(u)_1 )
Z^(u)_2 = σ( Â^(s)_tr Z^(u)_1 W^(u)_2 )

where H^(u)_tr is the characterization of omics u after the integrated gene network module, Â^(s)_tr is the adjacency matrix between samples obtained in step b1.1), W^(u)_1 and W^(u)_2 are the parameters to be learned during training of the supervised graph convolution module's GCN for omics u, and Z^(u)_1 and Z^(u)_2 are the hidden characterizations of omics u in the supervised graph convolution module;
b2.2) in the testing stage, the adjacency matrix of the test samples obtained in step b1.2) and the test set that has passed through the integrated gene network module are input into the constructed GCN for sample information aggregation, giving the omics characterization of the test set;
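A sketch of the two-layer sample-level GCN of step b2.1); the hidden widths and the nn.Linear realisation of W_1 and W_2 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SampleGCN(nn.Module):
    """Two GCN layers over the sample graph: Z1 = ReLU(A H W1), Z2 = ReLU(A Z1 W2)."""
    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 64):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        z1 = torch.relu(adj @ self.w1(h))   # aggregate neighbours, then transform
        z2 = torch.relu(adj @ self.w2(z1))
        return z2
```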
b3) obtaining the preliminary prediction classification result of each omics data set:

Ŷ^(u) = softmax( Z^(u)_2 W^(u)_c )

where Ŷ^(u) represents the predicted labels of the training set or test set after the supervised graph convolution module, n_tr and n_te are the numbers of samples in the training set and the test set respectively, c is the number of classes in the classification task, and W^(u)_c contains the parameters to be learned when building the softmax classifier; the softmax classifier is

softmax(h)_t = exp(h_t) / Σ_{m=1}^{c} exp(h_m)

where the classification task has classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;
the loss function of the supervised graph convolution module, L^(u)_pre, is

L^(u)_pre = Σ_j L_CE( ŷ^(u)_j, y_j ) = - Σ_j log ŷ^(u)_{j, y_j}

where L_CE(·) is the cross-entropy loss function, ŷ^(u)_j is the predicted class-probability vector of sample j for omics u, ŷ^(u)_{j,m} denotes its m-th element, and y_j is the true label of sample j in the data set.
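Combining step b3) with the loss above, a hedged sketch of a per-omics classification head follows; the small epsilon added inside the logarithm is a numerical-stability convenience, not part of the patent.

```python
import torch
import torch.nn as nn

class OmicsClassifierHead(nn.Module):
    """Linear + softmax head producing the preliminary per-omics prediction."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, num_classes)

    def forward(self, z2: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(z2), dim=1)   # class probabilities per sample

def pretrain_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the preliminary predictions: -sum_j log p_{j, y_j}."""
    return nn.functional.nll_loss(torch.log(probs + 1e-12), labels, reduction="sum")
```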
Further, the image processing module extracts the depth features of the pathology images with a convolutional neural network (CNN). The CNN consists of l convolutional layers, pooling layers and a fully connected layer, l ≥ 1, where the kernel size of the convolutional layers is s1 × s2, each convolutional layer has q feature maps, the pooling size is s3 × s4, and the last layer is a fully connected layer that outputs the preliminary classification result for the sample's image data; the module works as follows:
c1) in the training stage, the preprocessed pathology images His_tr of size r1 × r2 in the training set are input into the CNN; the convolutional layers extract features from His_tr layer by layer, the pooling layers reduce the dimensionality of the data, and the fully connected layer outputs the result Ŷ^(his)_tr; the network structure parameters are adjusted by back propagation, the optimal network parameters are obtained through continued training, and a dropout mechanism is adopted during training to avoid overfitting;
c2) in the testing stage, the preprocessed pathology images His_te of size r1 × r2 in the test set are input into the trained CNN, which outputs the preliminary prediction classification result Ŷ^(his)_te of the image processing module.
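A CNN of the kind described in steps c1)-c2) might look like the sketch below; the channel count, kernel size, pooling size and dropout rate are placeholders guided by the embodiment (6 layers, 3×3 kernels, 64 feature maps, 2×2 pooling), and the LazyLinear output layer is an implementation convenience.

```python
import torch
import torch.nn as nn

class PathologyCNN(nn.Module):
    """Conv/pool feature extractor plus a fully connected softmax head."""
    def __init__(self, num_classes: int, num_blocks: int = 6, channels: int = 64):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(num_blocks):
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=2)]
            in_ch = channels
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),              # dropout to limit overfitting
            nn.LazyLinear(num_classes),     # fully connected output layer
        )

    def forward(self, roi: torch.Tensor) -> torch.Tensor:   # roi: (n, 3, 224, 224)
        return torch.softmax(self.classifier(self.features(roi)), dim=1)
```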
Further, the fusion module comprises a preliminary cross-fusion module and a network fusion module. The preliminary cross-fusion module first constructs the multi-modal data cross-fusion vectors, then reshapes them to obtain the reconstructed vectors, and finally the network fusion module outputs the fused classification result.
The preliminary cross-fusion module specifically performs the following operations:
d1) constructing the multi-modal data cross-fusion vector for each sample from the preliminary predictions of all modalities:

V_j = Ŷ^(1)_j ⊗ Ŷ^(2)_j ⊗ ... ⊗ Ŷ^(U)_j ⊗ Ŷ^(his)_j

where V is the multi-modal data cross-fusion tensor of the training set or the test set, R is the set of real numbers, n_tr and n_te are the numbers of samples in the training set and the test set respectively, c is the number of classes in the classification task, U is the number of omics, Ŷ^(u)_j is the preliminary prediction classification result for sample j of the training set or test set obtained with omics u as input through the supervised graph convolution module, and Ŷ^(his)_j is the preliminary prediction classification result for sample j of the training set or test set obtained with the pathology image as input through the image processing module;
d2) reshaping the multi-modal data cross-fusion vector obtained in step d1) to obtain the reconstructed vector v of the training set or the test set.
The network fusion module consists of a fully connected layer and comprises the following steps:
e1) inputting the reconstructed vector v obtained in step d2) into the network fusion module and outputting the final classification result:

Ŷ^(final) = softmax( v W_f )

where W_f contains the network parameters to be trained in the training stage; inputting the reconstructed vectors corresponding to the training set and the test set gives the final classification results of the training set and the test set, Ŷ^(final)_tr and Ŷ^(final)_te respectively; the softmax classifier is

softmax(h)_t = exp(h_t) / Σ_{m=1}^{c} exp(h_m)

where the classification task has classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;
e2) calculating the loss function L of the network fusion module for back propagation:

L = Σ_j L_CE( ŷ^(final)_j, y_j ) = - Σ_j log ŷ^(final)_{j, y_j}

where L_CE(·) is the cross-entropy loss function, y_j is the true label of sample j, ŷ^(final)_j is the final prediction result of training sample j, and ŷ^(final)_{j,m} denotes its m-th element. In the training stage, the loss function after the network fusion module must be calculated, and the parameters of the network fusion module are then trained by back propagation; this step is not needed in the testing stage, where the final classification result is already output in step e1).
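A hedged sketch of steps d1)-e1), assuming, as in the reconstruction above, that the per-modality class-probability vectors are combined by an outer product that is flattened into the reconstructed vector and classified by a single fully connected layer; the einsum-based construction and the function names are assumptions, not taken verbatim from the patent.

```python
import torch
import torch.nn as nn

def cross_fusion(preds: list) -> torch.Tensor:
    """Combine per-modality class probabilities (each (n, c)) into a flattened
    cross-fusion vector of shape (n, c ** num_modalities)."""
    fused = preds[0]
    for p in preds[1:]:
        fused = torch.einsum("ni,nj->nij", fused, p).flatten(start_dim=1)
    return fused

class NetworkFusion(nn.Module):
    """Single fully connected layer mapping the reconstructed vector to classes."""
    def __init__(self, num_modalities: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(num_classes ** num_modalities, num_classes)

    def forward(self, preds: list) -> torch.Tensor:
        return torch.softmax(self.fc(cross_fusion(preds)), dim=1)
```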
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Additional gene-gene interaction information is introduced with the help of an external knowledge database, information aggregation between genes is realized with a graph convolutional network (GCN), and the implicit characteristics of the multi-omics data are fully mined.
2. By calculating the similarities between training samples and test samples and between pairs of test samples, both the supervised and the unsupervised information of the samples is fully utilized, and information aggregation at the sample level is then completed through a GCN, which helps improve the prediction accuracy of cancer patient classification.
3. The information of cancer patients at different levels is fused by a multi-modal cross-fusion method, and the predictive classification of cancer patients is then completed by fusion with a network method.
4. Information aggregation is carried out at both the feature level and the sample level, the related information among the multi-omics data, the image data and the different data sets is fully mined, and the prediction accuracy of cancer patient classification is improved.
Drawings
FIG. 1 is an architectural diagram of the system of the present invention.
FIG. 2 is a schematic diagram of the function of the integrated gene network module.
Fig. 3 is a functional diagram of the supervised graph convolution module.
Fig. 4 is a schematic structural diagram of the fusion module.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
This embodiment discloses a cancer patient classification system based on the fusion of multi-omics and image data, which aims to improve the classification prediction accuracy for cancer patients by fusing the multi-omics data and image data of cancer patients. The system performs end-to-end loading and preprocessing of multi-omics data and image data, introduces additional feature information from an external knowledge database to carry out feature dimension reduction and information aggregation on specific omics data, supplements additional sample information by computing the similarity between cancer patients (i.e. samples), and finally fuses the classification results of the multi-omics data and the image data through a multi-modal cross-fusion method. As shown in fig. 1, the system specifically comprises the following functional modules:
and the data loading and preprocessing module comprises a data loading module and a data preprocessing module.
The data loading module is used for loading a plurality of groups of chemical data and image data, wherein the plurality of groups of chemical data comprise genomics data, transcriptomics data and epigenomics data, each row of the plurality of groups of chemical data represents the expression value of each sample on the corresponding characteristic, each column represents the expression value of one sample in the corresponding characteristic, and the image data is cancer patient pathological diagram data; in this example, the effect of the inventive system was evaluated using the breast cancer item (BRCA) data in the public cancer data set TCGA, and mRNA expression data, DNA methylation data, miRNA expression data, and pathology picture data were loaded from a storage device into 578 breast cancer patients.
The preprocessing of the data preprocessing module comprises sample alignment, feature alignment, deleting features of which the null proportion of a plurality of groups of mathematical data exceeds a% of all samples, filling values of which the null proportion is lower than b% of all samples by IMPUTE2, removing features of which the variance is lower than a threshold value, performing feature alignment on specific omic data by TCGA-Assembler, analyzing a pathological diagram by HistomicsTK, cutting the pathological diagram by an OpenSlide tool, obtaining a z-block region of interest (ROI) of each sample, wherein z is greater than or equal to 1, and the pixel size of each ROI is r1×r2Wherein r is1And r2The method comprises the steps of respectively corresponding to the length and width pixel values of each region of interest (ROI), finally dividing omics and image data according to a specified proportion to obtain a training set and a testing set, wherein data output after passing through a data preprocessing module are composed of a plurality of samples, and each sample comprises a plurality of omics data and a plurality of pathological pictures. In fact, the above preprocessing can be summarized as: preprocessing of multinomial dataThe image processing and preprocessing module and the data set division are as follows:
the preprocessing of multigroup chemical data comprises:
sample alignment: only the sample containing four kinds of data is reserved, and other samples are deleted;
characteristic deletion: deleting the characteristics of which the median value of the multiple groups of mathematical data exceeds 20%, and simultaneously removing the characteristics of which the variance is lower than a threshold value;
data filling: null values with feature loss lower than 20% are filled in with the software IMPUTE 2;
gene mapping: in order that gene-gene interaction information can be used in DNA methylation data, features in DNA methylation are genetically mapped using a TCGA-Assembler, preserving the successfully mapped features in DNA methylation data.
The image data preprocessing comprises the following steps:
analysing whether the pathology images are abnormal using the software HistomicsTK;
cropping the pathology images using the OpenSlide tool, resulting in a number of ROIs per sample, each with a pixel size of 224 × 224.
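The ROI cropping step might be sketched as follows; the non-overlapping level-0 tiling and the brightness-based tissue filter are assumptions, since the embodiment only states that OpenSlide is used to cut 224 × 224 ROIs.

```python
import numpy as np
import openslide

def crop_rois(slide_path: str, tile_size: int = 224, max_rois: int = 64):
    """Cut non-overlapping tile_size x tile_size patches from a whole-slide image,
    keeping tiles that contain enough tissue (simple brightness heuristic)."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions
    rois = []
    for y in range(0, height - tile_size, tile_size):
        for x in range(0, width - tile_size, tile_size):
            tile = np.asarray(
                slide.read_region((x, y), 0, (tile_size, tile_size)).convert("RGB"))
            if tile.mean() < 220:            # skip mostly-background (near-white) tiles
                rois.append(tile)
            if len(rois) >= max_rois:
                return rois
    return rois
```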
The data set partitioning step comprises: 80% of the samples are used as the training set to train the model parameters, and 20% of the samples are used as the test set to evaluate the performance of the trained model. After the data set partitioning module, the three omics data sets are output with the training set (462 samples) and the test set (116 samples) in the same matrix, where the first 462 rows are the training samples and rows 463-578 are the test samples, the training and test sets together containing 578 samples; meanwhile, the training and test samples in the image data correspond to the respective training and test samples in the multi-omics data.
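A minimal sketch of the 80/20 split that keeps the omics rows and pathology images of the same patient on the same side of the split; the use of scikit-learn and the stratification by label are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

def split_dataset(sample_ids, labels, train_frac=0.8, seed=0):
    """Split aligned sample IDs 80/20 so omics rows and pathology images for a
    patient always fall on the same side of the split."""
    train_ids, test_ids = train_test_split(
        sample_ids, train_size=train_frac, stratify=labels, random_state=seed)
    return train_ids, test_ids
```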
The omics processing module comprises an integrated gene network module and a supervised graph convolution module.
The integrated gene network module uses gene-gene interaction information introduced from the external database HINT to realize information aggregation and feature screening at the feature level of each omics data set through a graph convolutional network (GCN). Its operating principle is shown in fig. 2, and the steps are as follows:
1) constructing an adjacency matrix between genes, A^(g) ∈ R^(p×p), using the binary physical interaction data set provided by the external database HINT, where R is the set of real numbers and p = 2000 is the number of genes;
2) using the adjacency matrix A^(g), constructing a GCN to obtain neighbour information in the feature space:

H^(u) = σ( X^(u) ( A^(g) ⊙ W^(u)_g ) )

where u ∈ {1, 2, 3}, and X^(1), X^(2) and X^(3) respectively represent the mRNA expression data, DNA methylation data and miRNA expression data; X^(u) is the preprocessed training data or test data of omics u, and in this module the adjacency matrices of the training data set and the test data set are identical; H^(u) is the characterization of omics u; σ(·) is the activation function ReLU(·) = max(0, ·), i.e. the larger of 0 and its argument; ⊙ is the Hadamard product; and W^(u)_g contains the parameters to be learned during GCN training for omics u, the integrated gene network module learning the GCN parameters only in the training stage. Since the miRNA expression data cannot be mapped to genes by TCGA-Assembler, the miRNA expression data pass through the integrated gene network module unchanged.
The supervised graph convolution module, whose operating principle is shown in fig. 3, uses the cosine similarity between samples to construct a sample adjacency matrix A^(s) and trains the parameters of the supervised graph convolution module to obtain the preliminary prediction classification result of each omics; the module performs information aggregation over the samples, and the steps are as follows:
1) constructing the adjacency matrix A^(s) from the similarity between samples:
1.1) in the training stage, computing the cosine similarity between the samples of the training set to obtain the adjacency matrix of the training samples, A^(s)_tr ∈ R^(n_tr×n_tr), where R is the set of real numbers and n_tr = 462 is the number of training samples:

A^(s)_ij = sim(x_i, x_j) · δ( sim(x_i, x_j) ≥ ε ),  sim(x_i, x_j) = (x_i · x_j) / ( ||x_i||_2 ||x_j||_2 )

where A^(s)_ij is the adjacency entry for samples i and j, sim(x_i, x_j) denotes the cosine similarity between sample i and sample j, x_i and x_j are the expression values of sample i and sample j in the omics, and ||·||_2 denotes the 2-norm; the adjacency matrix actually used is Â^(s)_tr = A^(s)_tr + I, where I denotes the identity matrix; ε is determined by a given parameter k, which denotes the average number of edges retained by each node, including the self-connection, and satisfies

k = ( Σ_{i,j} δ( sim(x_i, x_j) ≥ ε ) ) / n

where δ(·) is an indicator function, δ(·) = 1 when sim(x_i, x_j) ≥ ε and δ(·) = 0 otherwise, and n is the number of samples, i.e. the number of nodes; the same k value is used in all experiments on the same data set, and in this example k = 5 for the breast cancer data set;
1.2) in the testing stage, computing the cosine similarity between training samples and test samples and between pairs of test samples, and substituting the combined training-plus-test set into the formula of step 1.1) to obtain the adjacency matrix of the test set, Â^(s)_te ∈ R^(n×n), where R is the set of real numbers and n = 578 is the total number of samples;
2) constructing the GCN of the supervised graph convolution module:
2.1) in the training stage, the GCN of the supervised graph convolution module is constructed as follows:

Z^(u)_1 = σ( Â^(s)_tr H^(u)_tr W^(u)_1 )
Z^(u)_2 = σ( Â^(s)_tr Z^(u)_1 W^(u)_2 )

where H^(u)_tr is the characterization of omics u after the integrated gene network module, Â^(s)_tr is the adjacency matrix between samples obtained in step 1.1), W^(u)_1 and W^(u)_2 are the parameters to be learned during GCN training of the supervised graph convolution module for omics u, and Z^(u)_1 and Z^(u)_2 are the hidden characterizations of omics u in the supervised graph convolution module;
2.2) in the testing stage, the test-sample adjacency matrix obtained in step 1.2) and the test set that has passed through the integrated gene network module are input into the constructed GCN for sample information aggregation, giving the omics characterization of the test set;
3) obtaining the preliminary prediction classification result of each omics data set:

Ŷ^(u) = softmax( Z^(u)_2 W^(u)_c )

where Ŷ^(u) represents the predicted labels of the training set or test set after the supervised graph convolution module, n_tr = 462 and n_te = 116 are the numbers of samples in the training set and the test set respectively, c = 5 is the number of classes in the classification task, and W^(u)_c contains the parameters to be learned when building the softmax classifier; the softmax classifier is

softmax(h)_t = exp(h_t) / Σ_{m=1}^{c} exp(h_m)

where the classification task has classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;
the loss function of the supervised graph convolution module, L^(u)_pre, is

L^(u)_pre = Σ_j L_CE( ŷ^(u)_j, y_j ) = - Σ_j log ŷ^(u)_{j, y_j}

where L_CE(·) is the cross-entropy loss function, ŷ^(u)_j is the predicted class-probability vector of sample j for omics u, ŷ^(u)_{j,m} denotes its m-th element, and y_j is the true label of sample j in the data set.
The image processing module extracts the depth features of the pathology images with a CNN to obtain the preliminary classification result of the image data for each cancer patient. The CNN consists of 6 convolutional layers, pooling layers and a fully connected layer, where the kernel size of the convolutional layers is 3 × 3, each convolutional layer has 64 feature maps, the pooling size is 2 × 2, and the last layer is a fully connected layer that outputs the preliminary classification result for the sample's image data; the steps are as follows:
1) in the training stage, the preprocessed pathology images His_tr of size 224 × 224 in the training set are input into the CNN; the convolutional layers extract features from His_tr layer by layer, the pooling layers reduce the dimensionality of the data, and the fully connected layer outputs the result Ŷ^(his)_tr; the network structure parameters are adjusted by back propagation, the optimal network parameters are obtained through continued training, and a dropout mechanism is adopted during training to avoid overfitting;
2) in the testing stage, the preprocessed pathology images His_te of size 224 × 224 in the test set are input into the trained CNN model, which outputs the preliminary prediction classification result Ŷ^(his)_te of the image processing module.
The structure of the fusion module is shown in fig. 4; it comprises a preliminary cross-fusion module and a network fusion module. The preliminary cross-fusion module first constructs the multi-modal data cross-fusion vectors, then reshapes them to obtain the reconstructed vectors, and finally the network fusion module outputs the fused classification result.
The preliminary cross-fusion module specifically performs the following operations:
1) constructing the multi-modal data cross-fusion vector for each sample from the preliminary predictions of all modalities:

V_j = Ŷ^(1)_j ⊗ Ŷ^(2)_j ⊗ Ŷ^(3)_j ⊗ Ŷ^(his)_j

where V is the multi-modal data cross-fusion tensor of the training set or the test set, R is the set of real numbers, n_tr = 462 and n_te = 116 are the numbers of samples in the training set and the test set respectively, c = 5 is the number of classes in the classification task, U = 3 is the number of omics, Ŷ^(u)_j is the preliminary prediction classification result for sample j of the training set or test set obtained with omics u as input through the supervised graph convolution module, and Ŷ^(his)_j is the preliminary prediction classification result for sample j of the training set or test set obtained with the pathology image as input through the image processing module;
2) reshaping the multi-modal data cross-fusion vector obtained in step 1) to obtain the reconstructed vector v of the training set or the test set.
The network fusion module consists of a fully connected layer and comprises the following steps:
1) inputting the reconstructed vector v obtained by the preliminary cross-fusion module into the network fusion module and outputting the final classification result:

Ŷ^(final) = softmax( v W_f )

where W_f contains the network parameters to be trained in the training stage; inputting the reconstructed vectors v corresponding to the training set and the test set gives the final classification results of the training set and the test set, Ŷ^(final)_tr and Ŷ^(final)_te respectively; the softmax classifier is

softmax(h)_t = exp(h_t) / Σ_{m=1}^{c} exp(h_m)

where the classification task has classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;
2) calculating the loss function of the network fusion module for back propagation:

L = Σ_j L_CE( ŷ^(final)_j, y_j ) = - Σ_j log ŷ^(final)_{j, y_j}

where L_CE(·) is the cross-entropy loss function, y_j is the true label of sample j, ŷ^(final)_j is the final prediction result of training sample j, and ŷ^(final)_{j,m} denotes its m-th element. In the training stage, the loss function after the network fusion module must be calculated, and the parameters of the network fusion module are then trained by back propagation; this step is not needed in the testing stage, where the final classification result can already be output in step 1).
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (5)

1. A cancer patient classification system based on fusion of multi-omics and image data, characterized in that: the system performs end-to-end loading and preprocessing of multi-omics data and image data, introduces additional feature information from an external knowledge database to carry out feature dimension reduction and information aggregation on specific omics data, supplements additional sample information by computing the similarity between cancer patients (i.e. samples), and finally fuses the classification results of the multi-omics data and the image data through a multi-modal cross-fusion method, the system specifically comprising the following functional modules:
the data loading and preprocessing module, which is used for importing multi-omics data and image data and preprocessing the imported data;
the multi-omics processing module, which comprises an integrated gene network module and a supervised graph convolution module, wherein the integrated gene network module uses the gene-gene interaction information provided by the external database HINT to construct an adjacency matrix between genes and uses a graph convolutional network GCN to characterize genes, and the supervised graph convolution module uses cosine similarity to construct an adjacency matrix between samples and uses the graph convolutional network GCN together with the sample adjacency matrix to characterize samples, obtaining a preliminary prediction classification result with the omics data as input;
the image processing module, which performs representation learning on the image data with a convolutional neural network to obtain a preliminary prediction classification result with the image data as input;
and the fusion module, which comprises a preliminary cross-fusion module and a network fusion module, wherein the preliminary cross-fusion module constructs a multi-modal data cross-fusion vector and inputs a reconstructed vector of this vector into the network fusion module, realizing the fusion of the classification results of the omics and image data and producing the final classification result.
2. The cancer patient classification system based on fusion of multi-omics and image data of claim 1, characterized in that: the data loading and preprocessing module comprises a data loading module and a data preprocessing module; the data loading module is used for loading multi-omics data and image data, wherein the multi-omics data comprise genomics data, transcriptomics data and epigenomics data, each row of the multi-omics data represents the expression values of all samples on the corresponding feature, each column represents the expression values of one sample on the corresponding features, and the image data are cancer patient pathology image data; the preprocessing performed by the data preprocessing module comprises sample alignment and feature alignment, wherein features of the multi-omics data whose null-value proportion exceeds a% of all samples are deleted, null values whose proportion is lower than b% of all samples are filled with IMPUTE2, features whose variance is lower than a threshold are removed, feature alignment of specific omics data is carried out with TCGA-Assembler, the pathology images are analysed with HistomicsTK, and the pathology images are cropped with the OpenSlide tool, each sample obtaining z regions of interest ROI, z ≥ 1, each region of interest having a pixel size of r1 × r2, where r1 and r2 correspond to the length and width in pixels of each ROI; finally, the omics and image data are divided according to a specified proportion to obtain a training set and a test set; the data output after the data preprocessing module consist of a plurality of samples, each sample comprising multiple omics data and a number of pathology images.
3. The cancer patient classification system based on fusion of multi-omics and image data of claim 1, characterized in that: the multi-omics processing module comprises an integrated gene network module and a supervised graph convolution module;
the integrated gene network module uses gene-gene interaction information introduced from the external database HINT to realize information aggregation and feature screening at the feature level of each omics data set through a graph convolutional network GCN, and comprises the following steps:
a1) constructing an adjacency matrix between genes, A^(g) ∈ R^(p×p), using the binary physical interaction data set provided by the external database HINT, where R is the set of real numbers and p is the number of features;
a2) using the adjacency matrix A^(g) obtained in step a1), constructing a graph convolutional network GCN to obtain neighbour information in the feature space:

H^(u) = σ( X^(u) ( A^(g) ⊙ W^(u)_g ) )

where omics index u ∈ {1, 2, ..., U} and U is the number of omics; X^(u) is the preprocessed training set or test set of omics u, input into the formula of step a2) in the training phase and the test phase respectively; H^(u) is the hidden-layer characterization of omics u; σ(·) is the activation function ReLU(·) = max(0, ·), i.e. the larger of 0 and its argument; ⊙ is the Hadamard product; and W^(u)_g contains the parameters to be learned during GCN training for omics u, the integrated gene network module learning the parameters of the graph convolutional network GCN only in the training stage;
the supervised image convolution module constructs a sample adjacency matrix A according to cosine similarity between samples(s)The method for obtaining the preliminary prediction classification result of each omic through the graph convolution neural network GCN comprises the following steps:
b1) construction of adjacency matrix A according to similarity between samples(s)
b1.1) in the training stage, calculating the cosine similarity between the samples in the training set to obtain the adjacency matrix of the training samples
Figure FDA0003467867190000034
Figure FDA0003467867190000035
Wherein the content of the first and second substances,
Figure FDA0003467867190000036
an adjacency matrix representing samples i and j,
Figure FDA0003467867190000037
denotes the cosine similarity between sample i and sample j, xiAnd xiRespectively is the expression values of the sample i and the sample j in the omics, | · | | luminous flux2Representing a 2-norm operation on a,
Figure FDA0003467867190000038
is a contiguous matrix
Figure FDA0003467867190000039
I denotes the identity matrix, ∈ being determined by a given parameter k, k denotes the average number of edges retained by each node, including self-join, whose formula is as follows:
Figure FDA00034678671900000310
wherein, delta (·) is an indicator function, when sim (x)i,xj) When the k is equal to 1, each node is only connected, and the graph convolution neural network GCN is equal to a full connection layer at the moment;
b1.2) in the testing stage, calculating cosine similarity between the training sample and the testing sample and between the testing sample and the testing sample, and replacing the training integrated test set according to the formula in the step b1.1) to obtain an adjacency matrix of the testing sample
Figure FDA0003467867190000041
b2) constructing the graph convolutional network GCN of the supervised graph convolution module:
b2.1) in the training stage, the GCN of the supervised graph convolution module is constructed as follows:

Z^(u)_1 = σ( Â^(s)_tr H^(u)_tr W^(u)_1 )
Z^(u)_2 = σ( Â^(s)_tr Z^(u)_1 W^(u)_2 )

where H^(u)_tr is the characterization of omics u after the integrated gene network module, Â^(s)_tr is the adjacency matrix between samples obtained in step b1.1), W^(u)_1 and W^(u)_2 are the parameters to be learned during GCN training of the supervised graph convolution module for omics u, and Z^(u)_1 and Z^(u)_2 are the hidden characterizations of omics u in the supervised graph convolution module;
b2.2) in the testing stage, the adjacency matrix of the test samples obtained in step b1.2) and the test set that has passed through the integrated gene network module are input into the constructed graph convolutional network GCN for sample information aggregation, giving the omics characterization of the test set;
b3) obtaining the preliminary prediction classification result of each omics data set:

Ŷ^(u) = softmax( Z^(u)_2 W^(u)_c )

where Ŷ^(u) represents the predicted labels of the training set or test set after the supervised graph convolution module, n_tr and n_te are the numbers of samples in the training set and the test set respectively, c is the number of classes in the classification task, and W^(u)_c contains the parameters to be learned when building the softmax classifier; the softmax classifier is

softmax(h)_t = exp(h_t) / Σ_{m=1}^{c} exp(h_m)

where the classification task has classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;
the loss function of the supervised graph convolution module, L^(u)_pre, is

L^(u)_pre = Σ_j L_CE( ŷ^(u)_j, y_j ) = - Σ_j log ŷ^(u)_{j, y_j}

where L_CE(·) is the cross-entropy loss function, ŷ^(u)_j is the predicted class-probability vector of sample j for omics u, ŷ^(u)_{j,m} denotes its m-th element, and y_j is the true label in the data set.
4. The cancer patient classification system based on fusion of multi-omics and image data of claim 1, characterized in that: the image processing module extracts the depth features of the pathology images with a convolutional neural network CNN, the CNN consisting of l convolutional layers, pooling layers and a fully connected layer, l ≥ 1, where the kernel size of the convolutional layers is s1 × s2, each convolutional layer has q feature maps, the pooling size is s3 × s4, and the last layer is a fully connected layer that outputs the preliminary classification result for the sample's image data; the module comprises the following steps:
c1) in the training stage, the preprocessed pathology images His_tr of size r1 × r2 in the training set are input into the convolutional neural network CNN; the convolutional layers extract features from His_tr layer by layer, the pooling layers reduce the dimensionality of the data, and the fully connected layer outputs the result Ŷ^(his)_tr; the network structure parameters are adjusted by back propagation, the optimal network parameters are obtained through continued training, and a dropout mechanism is adopted during training to avoid overfitting;
c2) in the testing stage, the preprocessed pathology images His_te of size r1 × r2 in the test set are input into the trained convolutional neural network CNN, which outputs the preliminary prediction classification result Ŷ^(his)_te of the image processing module.
5. The system of claim 1 for classifying cancer patients based on fusion of omics and image data, wherein: the fusion module comprises a preliminary cross fusion module and a network fusion module, the preliminary cross fusion module firstly constructs multi-modal data cross fusion vectors, then reconstructs the multi-modal data cross fusion vectors to obtain reconstructed vectors, and finally the network fusion module outputs the classification results after fusion;
The preliminary cross-fusion module specifically performs the following operations:

d1) constructing the multi-modal data cross-fusion vector of the training set or the test set from the preliminary prediction classification results of the individual modalities, namely, for each omics u, the preliminary prediction classification result of the training set or the test set obtained by taking omics u as input and passing it through the supervised graph convolution module, and the preliminary prediction classification result of the training set or the test set obtained by taking the pathological picture as input and passing it through the image processing module; wherein R is the set of real numbers, n_tr and n_te are respectively the numbers of samples in the training set and the test set, c is the number of categories in the classification task, and U is the number of omics;

d2) reconstructing the multi-modal data cross-fusion vector obtained in step d1) to obtain the reconstructed vector of the training set or the test set.
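The exact construction of the cross-fusion vector is given only as an image in the claim, so the sketch below shows one plausible, VCDN-style construction as an assumption: the per-modality class-probability predictions (U omics plus the pathology image) are combined by an iterated outer product and flattened into a single vector per sample. Whether the patent uses this outer-product form or, for example, a simple concatenation cannot be determined from the text.

```python
import torch

def cross_fusion(pred_list):
    """Combine the preliminary class-probability predictions of all
    modalities (U omics + 1 image modality) into one cross-fusion
    vector per sample via an iterated outer product (an assumed,
    VCDN-style construction), then flatten it.

    pred_list: list of tensors, each of shape (n, c).
    Returns a tensor of shape (n, c ** len(pred_list))."""
    fused = pred_list[0]                  # (n, c)
    for p in pred_list[1:]:
        # Outer product between the running fusion and the next modality,
        # then flatten the trailing dimensions back into one.
        fused = torch.einsum('ni,nj->nij', fused, p).flatten(1)
    return fused

# Example: U = 3 omics + 1 pathology-image prediction, n = 4 samples, c = 5 classes.
preds = [torch.softmax(torch.randn(4, 5), dim=-1) for _ in range(4)]
x_fused = cross_fusion(preds)
print(x_fused.shape)  # (4, 625) = (n, c ** 4)
```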
The network fusion module is composed of a fully connected layer and comprises the following steps:

e1) inputting the reconstructed vector obtained in step d2) into the network fusion module and outputting the final classification result, wherein the weights of the fully connected layer are the network parameters to be trained in the training stage; inputting the reconstructed vectors corresponding to the training set and the test set yields the final classification results of the training set and of the test set respectively; the softmax classifier is given by

$$\mathrm{softmax}(h)_t = \frac{\exp(h_t)}{\sum_{m=1}^{c} \exp(h_m)}$$

wherein the classification task comprises the classes t ∈ {1, 2, ..., c} and m ∈ {1, 2, ..., c}, h = [h_1, h_2, ..., h_c]^T is the vector input to the softmax classifier, and h_t and h_m are the t-th and m-th elements of the input vector h;

e2) calculating the loss function L of the network fusion module and back-propagating it:

$$L = \sum_{j} L_{CE}(\hat{y}_j, y_j)$$

wherein L_CE(·) is the cross-entropy loss function, y_j is the true label of sample j, ŷ_j is the final prediction result for training sample j, and ŷ_{j,m} denotes the m-th element of ŷ_j; in the training stage, the loss function after the network fusion module needs to be calculated and the parameters of the network fusion module are trained through back propagation; in the testing stage this step is not required, and the final classification result is output in step e1).
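A hedged sketch of the network fusion module of steps e1) and e2): a single fully connected layer mapping the reconstructed vector to final class probabilities, trained by back-propagating the cross-entropy loss L. The input dimension (taken here as c^(U+1), matching the outer-product assumption in the previous sketch), the optimizer, the learning rate and the number of iterations are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class NetworkFusionModule(nn.Module):
    """Fully connected layer + softmax producing the final classification."""

    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(v), dim=-1)

# Training stage (step e2): back-propagate the cross-entropy loss L.
n, c, num_modalities = 16, 5, 4
in_dim = c ** num_modalities                  # assumed reconstructed-vector size
v_train = torch.randn(n, in_dim)              # reconstructed vectors (placeholder data)
y_train = torch.randint(0, c, (n,))           # true labels y_j

fusion = NetworkFusionModule(in_dim=in_dim, num_classes=c)
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)

for _ in range(5):                            # a few illustrative training iterations
    probs = fusion(v_train)
    # L = sum_j L_CE(y_hat_j, y_j); NLL on log-probabilities equals cross-entropy here.
    loss = nn.functional.nll_loss(torch.log(probs.clamp_min(1e-12)), y_train,
                                  reduction='sum')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Testing stage (step e1 only): the forward pass gives the final classification.
with torch.no_grad():
    final_probs = fusion(torch.randn(3, in_dim))
print(final_probs.argmax(dim=-1))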
CN202210034741.6A 2022-01-13 2022-01-13 Cancer patient classification system based on multiomics and image data fusion Pending CN114530222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210034741.6A CN114530222A (en) 2022-01-13 2022-01-13 Cancer patient classification system based on multiomics and image data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210034741.6A CN114530222A (en) 2022-01-13 2022-01-13 Cancer patient classification system based on multiomics and image data fusion

Publications (1)

Publication Number Publication Date
CN114530222A (en) 2022-05-24

Family

ID=81620847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210034741.6A Pending CN114530222A (en) 2022-01-13 2022-01-13 Cancer patient classification system based on multiomics and image data fusion

Country Status (1)

Country Link
CN (1) CN114530222A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113673A1 (en) * 2018-12-07 2020-06-11 深圳先进技术研究院 Cancer subtype classification method employing multiomics integration
CN112131402A (en) * 2020-09-14 2020-12-25 刘容恺 PPI knowledge graph representation learning method based on protein family clustering
CN112687327A (en) * 2020-12-28 2021-04-20 中山依数科技有限公司 Cancer survival analysis system based on multitask and multi-mode

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035988A (en) * 2022-08-15 2022-09-09 武汉明炀大数据科技有限公司 Medical image processing method, system, equipment and medium based on cloud computing
CN115631847A (en) * 2022-10-19 2023-01-20 哈尔滨工业大学 Early lung cancer diagnosis system based on multiple mathematical characteristics, storage medium and equipment
CN115985513A (en) * 2023-01-05 2023-04-18 徐州医科大学科技园发展有限公司 Data processing method, device and equipment based on multi-omics cancer typing
CN115985513B (en) * 2023-01-05 2023-11-03 徐州医科大学科技园发展有限公司 Data processing method, device and equipment based on multi-omics cancer typing

Similar Documents

Publication Publication Date Title
CN108446730B (en) CT pulmonary nodule detection device based on deep learning
CN108492297B (en) MRI brain tumor positioning and intratumoral segmentation method based on deep cascade convolution network
CN111191660B (en) Colon cancer pathology image classification method based on multi-channel collaborative capsule network
CN112116605B (en) Pancreas CT image segmentation method based on integrated depth convolution neural network
CN114530222A (en) Cancer patient classification system based on multiomics and image data fusion
CN114730463A (en) Multi-instance learner for tissue image classification
CN112687327B (en) Cancer survival analysis system based on multitasking and multi-mode
CN111369565B (en) Digital pathological image segmentation and classification method based on graph convolution network
CN114998220B (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN113378796B (en) Cervical cell full-section classification method based on context modeling
CN111276240A (en) Multi-label multi-mode holographic pulse condition identification method based on graph convolution network
CN113469958A (en) Method, system, equipment and storage medium for predicting development potential of embryo
CN111524140A (en) Medical image semantic segmentation method based on CNN and random forest method
CN114492620A (en) Credible multi-view classification method based on evidence deep learning
CN114037699B (en) Pathological image classification method, equipment, system and storage medium
CN116128855A (en) Algorithm for detecting tumor protein marker expression level based on pathological image characteristics
CN114580501A (en) Bone marrow cell classification method, system, computer device and storage medium
CN112733859B (en) Depth migration semi-supervised domain self-adaptive classification method for histopathological image
CN116486156A (en) Full-view digital slice image classification method integrating multi-scale feature context
CN116188428A (en) Bridging multi-source domain self-adaptive cross-domain histopathological image recognition method
Yan et al. Two and multiple categorization of breast pathological images by transfer learning
CN113762262B (en) Image data screening and image segmentation model training method, device and storage medium
CN114998647A (en) Breast cancer full-size pathological image classification method based on attention multi-instance learning
CN114496099A (en) Cell function annotation method, device, equipment and medium
Kong et al. Toward large-scale histopathological image analysis via deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination