CN115064266B - Cancer diagnosis system, device and medium based on incomplete multi-omics data
Cancer diagnosis system, device and medium based on incomplete multi-omics data
- Publication number: CN115064266B
- Application number: CN202210867454.3A
- Authority: CN (China)
- Prior art keywords: data, patient, omics, loss
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G16H50/20: Healthcare informatics; ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
- G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
- G16H50/70: Healthcare informatics; ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The present disclosure provides a cancer diagnosis system based on incomplete multi-omics data, belonging to the technical fields of artificial intelligence, data mining, classification, and bioinformatics. The system comprises: a data acquisition module configured to acquire all available omics data to be diagnosed for a single patient; an omics data feature extraction module configured to extract features from each type of acquired omics data; a missing omics data generation module configured to generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and to extract features from the generated data; and a multi-omics feature fusion and diagnosis module configured to fuse the features extracted from the patient's available omics data with the features of the generated omics data, and to input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Description
Technical Field
The disclosure belongs to the technical fields of artificial intelligence, data mining, classification, and bioinformatics, and particularly relates to a cancer diagnosis system, device, and medium based on incomplete multi-omics data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Given how widespread cancer has become, accurately classifying a patient's cancer type from the patient's omics data is particularly important, because different types of cancer often require different treatments. Compared with diagnosis from a single type of omics data, fusing multiple types of omics data enriches the patient's feature representation and can thereby improve diagnostic accuracy. However, because some detection methods are costly, invasive, or subject to legal and ethical constraints, incomplete multi-omics data are ubiquitous in practice. How to diagnose a patient's cancer more accurately from incomplete multi-omics data remains a difficulty that current machine learning techniques have yet to overcome.
The inventors found that current methods for diagnosis from incomplete multi-omics data include: training and diagnosing after excluding samples with missing omics data; training and diagnosing after ensuring that one type of the patient's omics data is complete; constructing separate models according to which omics data are available, and then training and diagnosing; and mapping each type of omics data into a feature space of the same dimension, and training and diagnosing after fusion. However, the preconditions on which these methods depend severely limit their clinical application. In addition, fusing features in a feature space of the same dimension pursues only the features common to all omics types and ignores the features specific to each, so machine learning techniques for accurate cancer diagnosis still leave considerable room for improvement.
Disclosure of Invention
To solve the above problems, the present disclosure provides a cancer diagnosis system, device, and medium based on incomplete multi-omics data. The scheme extracts omics features through an attention-based feature extraction network and, through joint optimization of a sharing loss and an individuality loss, gives the omics feature representations good diversity. The attention parameters in the feature extraction network both alleviate the overfitting caused by the high dimensionality of omics data and give the system good interpretability and reliability. The patient's missing omics data are generated using a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one type of omics data is available. Finally, the features of all omics types are fused and input into a diagnosis network based on the fused features, yielding a more accurate cancer diagnosis for the patient.
According to a first aspect of embodiments of the present disclosure, there is provided a cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to acquire all available omics data to be diagnosed for a single patient;
an omics data feature extraction module configured to extract features from each type of acquired omics data;
a missing omics data generation module configured to generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and to extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to fuse the features extracted from the patient's available omics data with the features of the generated omics data, and to input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Further, extracting features from each type of acquired omics data specifically comprises:
constructing an attention parameter layer for the features of each type of omics data;
constructing a feature extraction network for each type of omics data and performing feature extraction;
calculating a sharing loss from the extracted features;
and calculating an individuality loss from the extracted features.
Further, generating the patient's missing omics data from the patient's available omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics types;
integrating the generation loss and the adversarial loss of each omics type to obtain the generative adversarial loss.
Further, generating the missing omics data from the extracted omics data features comprises:
calculating, from the features $h_i^{(v)}$ extracted from the patient's available omics data, the latent feature $\tilde{h}_i^{(v)}$ of the patient for each omics type;
inputting the latent feature $\tilde{h}_i^{(v)}$ into the generation network $G_v(\cdot)$ of the corresponding omics type to generate the patient's missing omics data $\hat{x}_i^{(v)}$, expressed as:
$$\hat{x}_i^{(v)} = G_v\big(\tilde{h}_i^{(v)}\big)$$
where $G_v(\cdot)$ is the generator for the v-th omics data and $\hat{x}_i^{(v)}$ is the generated v-th omics data of the patient.
Further, calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics types comprises: inputting the real data $x_i^{(v)}$ and the generated data $\hat{x}_i^{(v)}$ of each available omics type into the discriminator network $D_v(\cdot)$ of that omics type, and calculating from the discrimination results the generation loss $\mathcal{L}_{gen}^{(v)}$ and the adversarial loss $\mathcal{L}_{adv}^{(v)}$ for that omics type.
Here $D_v(\cdot)$ is the discriminator for the v-th omics data, whose output lies between 0 and 1; a smaller value means that the discriminator considers the input more likely to be generated data. $M$ is an indicator matrix recording whether each type of omics data is missing.
Further, fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises concatenating the two sets of features.
Further, the generation loss and the adversarial loss of each omics type are integrated to obtain the generative adversarial loss GLoss.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data similar to the real data.
According to a second aspect of embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following process when executing the program:
acquiring all available omics data to be diagnosed for a single patient;
extracting features from each type of acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's available omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
According to a third aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the following process:
acquiring all available omics data to be diagnosed for a single patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available omics data to be diagnosed for a single patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The scheme first extracts omics features through an attention-based feature extraction network and, through joint optimization of the sharing loss and the individuality loss, gives the omics feature representations good diversity. The attention parameters in the feature extraction network both alleviate the overfitting caused by the high dimensionality of omics data and give the system good interpretability and reliability. The patient's missing omics data are generated using a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one type of omics data is available. Finally, the features of all omics types are fused and input into a diagnosis network based on the fused features, yielding a more accurate cancer diagnosis for the patient.
(2) The disclosed system takes only the patient's omics data as input and produces the diagnosis result without complicated operating steps, so it is easy to use.
Additional aspects of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a cancer diagnosis system based on incomplete multi-omics data according to an embodiment of the present disclosure;
FIG. 2 is a process flow diagram of a cancer diagnosis system based on incomplete multi-omics data according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
Embodiment one:
The object of this embodiment is to provide a cancer diagnosis system based on incomplete multi-omics data.
A cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to acquire all available omics data to be diagnosed for a single patient;
an omics data feature extraction module configured to extract features from each type of acquired omics data;
a missing omics data generation module configured to generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and to extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to fuse the features extracted from the patient's available omics data with the features of the generated omics data, and to input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Further, extracting features from each type of acquired omics data specifically comprises:
constructing an attention parameter layer for the features of each type of omics data;
constructing a feature extraction network for each type of omics data and performing feature extraction;
calculating a sharing loss from the extracted features;
and calculating an individuality loss from the extracted features.
Further, generating the patient's missing omics data from the patient's available omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics types;
integrating the generation loss and the adversarial loss of each omics type to obtain the generative adversarial loss.
Further, generating the missing omics data from the extracted omics data features comprises:
calculating, from the features $h_i^{(v)}$ extracted from the patient's available omics data, the latent feature $\tilde{h}_i^{(v)}$ of the patient for each omics type;
inputting the latent feature $\tilde{h}_i^{(v)}$ into the generation network $G_v(\cdot)$ of the corresponding omics type to generate the patient's missing omics data $\hat{x}_i^{(v)}$, expressed as:
$$\hat{x}_i^{(v)} = G_v\big(\tilde{h}_i^{(v)}\big)$$
where $G_v(\cdot)$ is the generator for the v-th omics data and $\hat{x}_i^{(v)}$ is the generated v-th omics data of the patient.
Further, calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics types comprises: inputting the real data $x_i^{(v)}$ and the generated data $\hat{x}_i^{(v)}$ of each available omics type into the discriminator network $D_v(\cdot)$ of that omics type, and calculating from the discrimination results the generation loss $\mathcal{L}_{gen}^{(v)}$ and the adversarial loss $\mathcal{L}_{adv}^{(v)}$ for that omics type.
Here $D_v(\cdot)$ is the discriminator for the v-th omics data, whose output lies between 0 and 1; a smaller value means that the discriminator considers the input more likely to be generated data. $M$ is an indicator matrix recording whether each type of omics data is missing: one value of its entry $m_i^{(v)}$ indicates that the v-th omics data of patient $i$ are available, and the other indicates that they are missing.
Further, fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises concatenating the two sets of features.
Further, the generation loss and the adversarial loss of each omics type are integrated to obtain the generative adversarial loss GLoss.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data similar to the real data.
For ease of understanding, the scheme of this embodiment is described in detail below with reference to the accompanying drawings:
As shown in FIG. 2, the processing flow of the cancer diagnosis system based on incomplete multi-omics data comprises:
S101: Acquire all available omics data to be diagnosed for a single patient. The omics data to be diagnosed may include, but are not limited to, copy number variation data, DNA methylation data, diagnostic pathological section imaging data, and gene expression data.
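To illustrate what "incomplete" means at this step, the following minimal Python sketch stores one patient's omics data together with an availability indicator. The type names, the dictionary layout, and the 1/0 encoding are assumptions made for this example; the text does not prescribe a storage format.

```python
from typing import Dict, List, Optional

# Hypothetical omics type names, loosely following the examples listed in S101.
OMICS_TYPES = ["copy_number", "dna_methylation", "gene_expression"]


def availability_row(record: Dict[str, Optional[List[float]]]) -> List[int]:
    """Return 1 for each omics type present in the record and 0 for each missing one.

    The 1/0 encoding is an assumption of this sketch.
    """
    return [1 if record.get(name) is not None else 0 for name in OMICS_TYPES]


# One patient whose gene expression data were never measured.
patient = {
    "copy_number": [0.0, 1.0, -1.0],
    "dna_methylation": [0.82, 0.10],
    "gene_expression": None,
}
mask = availability_row(patient)
```

Stacking one such row per patient gives a matrix that can play the role of the indicator matrix used later when the losses are restricted to available omics types.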
S102: Extract features from the available omics data.
Specifically, an attention parameter layer is first constructed for the features of each type of omics data to capture the important features. A feature extraction network is then constructed for each type of omics data, and the attention-weighted omics data are input into it for feature extraction. Finally, the sharing loss and the individuality loss are calculated from the extracted features and used to optimize each attention-based feature extraction network during training.
Step S102 is implemented as follows:
S1021: An attention parameter layer is constructed for the features of each type of omics data. The patient's omics data $x_i^{(v)}$ are input into the attention layer $\omega_v$ of the corresponding omics type to capture important feature sites, yielding the attention-weighted omics data $\omega_v \odot x_i^{(v)}$ and thereby reducing the overfitting caused by the high dimensionality of omics data.
S1022: A feature extraction network is constructed for each type of omics data. The attention-weighted omics data are input into the feature extraction network of the corresponding omics type to obtain feature vectors $h_i^{(v)}$ that have the same dimension for every omics type, defined as follows:
$$h_i^{(v)} = f_{\theta_v}\big(\omega_v \odot x_i^{(v)}\big)$$
where $x_i^{(v)}$ and $h_i^{(v)}$ are the raw data and the extracted features of the i-th patient under the v-th omics type, $\theta_v$ denotes the parameters of the feature extraction network $f_{\theta_v}$ for the v-th omics data, $\odot$ denotes element-wise multiplication, and $\omega_v$ is the attention weight vector for the v-th omics data, normalized by a Softmax function. Introducing the attention parameters lets the feature extraction network adaptively capture important mutation sites in the patient's omics data and assign them larger attention weights, which gives the network a degree of interpretability. Normalizing $\omega_v$ with the Softmax function also prevents the local optima that can arise when the weight of a single site becomes too large.
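A minimal sketch of the attention-weighted extraction in S1022, in plain Python. The single linear layer with ReLU is only a stand-in for the feature extraction network, whose architecture is not specified in the text; the function and parameter names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw attention scores so that the weights are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_features(
    x: List[float],
    attention_scores: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Apply Softmax attention element-wise to x, then one linear layer with ReLU.

    The linear layer is an assumed placeholder for the feature extraction network.
    """
    omega = softmax(attention_scores)
    weighted = [w * xi for w, xi in zip(omega, x)]  # element-wise product
    return [
        max(0.0, sum(wj * xj for wj, xj in zip(row, weighted)) + b)
        for row, b in zip(weight, bias)
    ]
```

Using the same output dimension (number of rows in `weight`) for every omics type gives per-omics feature vectors of equal length, as S1022 requires.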
S1023: The sharing loss is calculated from the extracted features. A feature evaluation network is constructed for the extracted features $h_i^{(v)}$ of each omics type, and the sharing loss SLoss is calculated from the cancer type output by each evaluation network and the patient's true cancer type.
Here $y_i$ is the cancer type of the i-th patient, and the loss function measures the predictive capability of the extracted features $h_i^{(v)}$ (the cross-entropy loss is used here). SLoss aims to make the omics features extracted by the feature extraction networks yield consistent predictions. By jointly optimizing the attention-based feature extraction networks and the feature evaluation networks with SLoss, informative features that help diagnose the cancer type can be extracted from each type of the patient's omics data, and the interpretability and reliability of the system are improved.
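A minimal sketch of a sharing loss for one patient, assuming one evaluation head per omics type that outputs class logits. Summing the per-omics cross-entropy terms is an assumption of this sketch, since the original formula is not reproduced in this text; the function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Cross-entropy between softmax(logits) and a one-hot label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def sharing_loss(logits_per_omics: List[List[float]], label: int) -> float:
    """Sum of cross-entropy terms, one per omics evaluation head (assumed aggregation)."""
    return sum(cross_entropy(logits, label) for logits in logits_per_omics)
```

Because every head is scored against the same label, minimizing this quantity pushes the features of all omics types toward the same prediction.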
S1024: The individuality loss is calculated from the extracted features. First, the patient's feature prototype $p_i$ is calculated from the features $h_i^{(v)}$ extracted for each omics type and taken as an approximation of the common features. Then, using an individuality factor $\lambda$, the similarity between the extracted features $h_i^{(v)}$ of each omics type and the feature prototype $p_i$ is converted into the individuality loss ILoss, which measures the balance between individuality and commonality among the extracted features.
Here $\cos\big(h_i^{(v)}, p_i\big)$ is the cosine similarity between the extracted features $h_i^{(v)}$ and the feature prototype $p_i$. When the similarity is greater than the individuality factor $\lambda$ ($-1 \le \lambda \le 1$), $h_i^{(v)}$ closely resembles the common features, and the ReLU(·) activation function sets the loss to 0; otherwise, $h_i^{(v)}$ contains too many individual features, and optimizing ILoss reduces this excessive individualization.
Through the joint optimization of the sharing loss SLoss and the individuality loss ILoss, the system can extract both common and individual features from the patient's multi-omics data, which preserves the diversity of patient features across cancer types and helps the system diagnose more accurately.
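A minimal sketch of the individuality loss for one patient. The text states that the loss is zero when the cosine similarity exceeds $\lambda$ and positive otherwise, which is consistent with a hinge of the form ReLU($\lambda$ minus similarity). Taking the prototype as the element-wise mean of the per-omics features and summing the terms are assumptions of this sketch; the names are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prototype(features: List[List[float]]) -> List[float]:
    """Element-wise mean of the per-omics feature vectors (assumed prototype)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def individuality_loss(features: List[List[float]], lam: float) -> float:
    """Sum of ReLU(lam - cos(h_v, prototype)) over the omics types of one patient."""
    p = prototype(features)
    return sum(max(0.0, lam - cosine(h, p)) for h in features)
```

With this form, a larger `lam` tolerates less deviation from the prototype, while `lam = -1` never penalizes any feature vector.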
S103: Generate the missing omics data.
Specifically, the latent features of the patient for each omics type are calculated from the features extracted from the patient's available omics data and input into the generation networks to generate data for each omics type. For the available omics types, the generated data and the real data are input into the discriminator networks to calculate the generation loss and the adversarial loss, and the losses calculated for each omics type are integrated to obtain the generative adversarial loss.
Step S103 is implemented as follows:
S1031: The missing omics data are generated from the extracted features. The latent feature $\tilde{h}_i^{(v)}$ of the patient for each omics type is calculated from the features $h_i^{(v)}$ extracted from the patient's available omics data. The latent feature is then input into the generation network $G_v(\cdot)$ of the corresponding omics type to generate the patient's missing omics data $\hat{x}_i^{(v)}$:
$$\hat{x}_i^{(v)} = G_v\big(\tilde{h}_i^{(v)}\big)$$
To preserve the generative capability of the generation network, the features extracted from the v-th omics data themselves are not used when calculating $\tilde{h}_i^{(v)}$. $G_v(\cdot)$ is the generator for the v-th omics data, and $\hat{x}_i^{(v)}$ is the generated v-th omics data of the patient.
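A minimal sketch of S1031. The only constraint stated in the text is that the features of the target omics type are excluded; averaging the features of the other available omics types is an assumption of this sketch, and the linear generator is a placeholder for $G_v$. All names are hypothetical.

```python
from typing import Dict, List, Optional


def latent_feature(
    features: Dict[str, Optional[List[float]]], target: str
) -> List[float]:
    """Average the extracted features of every available omics type except `target`."""
    others = [h for name, h in features.items() if name != target and h is not None]
    if not others:
        raise ValueError("at least one other omics type must be available")
    n = len(others)
    return [sum(column) / n for column in zip(*others)]


def generate(
    latent: List[float], weight: List[List[float]], bias: List[float]
) -> List[float]:
    """Placeholder generator: one linear map from latent space to omics data space."""
    return [
        sum(w * c for w, c in zip(row, latent)) + b for row, b in zip(weight, bias)
    ]
```

Excluding the target type also when it is available means the generator can be trained on patients who have that type, by comparing its output with their real data.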
S1032: The generation loss and the adversarial loss are calculated from the real data and the generated data of the available omics types. The patient's real data $x_i^{(v)}$ and generated data $\hat{x}_i^{(v)}$ under each available omics type are input into the discriminator network $D_v(\cdot)$ of that omics type, and the generation loss $\mathcal{L}_{gen}^{(v)}$ and the adversarial loss $\mathcal{L}_{adv}^{(v)}$ for that omics type are calculated from the discrimination results.
Here $D_v(\cdot)$ is the discriminator for the v-th omics data, whose output lies between 0 and 1; a smaller value means that the discriminator considers the input more likely to be generated data.
S1033: The generation loss and the adversarial loss of each omics type are integrated to obtain the generative adversarial loss GLoss of the generative adversarial network over all available omics data.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data that more closely resemble the real data.
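A minimal sketch of S1032 and S1033 for one patient. The standard binary cross-entropy GAN objectives are swapped in here as stand-ins, because the original loss formulas are not reproduced in this text; they match only the stated convention that a discriminator output near 0 means "generated". Summing over available omics types is likewise an assumption, and all names are hypothetical.

```python
import math
from typing import Dict, Tuple

EPS = 1e-12  # keeps log() finite when a discriminator output reaches 0 or 1


def generation_loss(d_fake: float) -> float:
    """Small when the discriminator scores generated data close to 1 (real)."""
    return -math.log(d_fake + EPS)


def adversarial_loss(d_real: float, d_fake: float) -> float:
    """Small when real data score close to 1 and generated data close to 0."""
    return -(math.log(d_real + EPS) + math.log(1.0 - d_fake + EPS))


def gloss(
    d_outputs: Dict[str, Tuple[float, float]], available: Dict[str, bool]
) -> float:
    """Sum both losses over available omics types only (assumed aggregation).

    d_outputs maps an omics type to (D_v(real data), D_v(generated data)).
    """
    total = 0.0
    for name, (d_real, d_fake) in d_outputs.items():
        if available[name]:
            total += generation_loss(d_fake) + adversarial_loss(d_real, d_fake)
    return total
```

In common GAN training practice the two terms are minimized in alternating steps, the generation term with respect to the generator and the adversarial term with respect to the discriminator; the single sum above is only a bookkeeping value.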
S104: omics feature fusion and diagnosis.
Specifically, the generated data for each missing omics is input into the attention-based feature extraction network of that omics for feature extraction, and the extracted features serve as the patient's approximate features for that omics. The features of all of the patient's omics are then fused, the fused features are input into the fusion-feature-based cancer diagnosis network to obtain the patient's diagnosis result, and the diagnosis loss is calculated from that result.
Specifically, step S104 is implemented as follows:
S1041: calculate the feature vector of each of the patient's omics. The patient's generated data is input into the attention-based feature extraction network of the corresponding omics for feature extraction, and the representative feature vector of the v-th omics to be used for cancer diagnosis is obtained according to whether that omics data is available, defined as follows:
Here the result is the representative feature vector of the v-th omics data that is used for cancer diagnosis. When the v-th omics data is available, its extracted feature representation is used directly in the subsequent feature fusion; otherwise, the attention-based feature extraction network of the v-th omics extracts features from the generated data, and those features are used in the subsequent feature fusion.
S1042: fuse the omics features. All of the patient's omics feature vectors used for diagnosis are fused by concatenation to obtain the fused feature z_i, defined as follows:
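Steps S1041 and S1042 amount to a per-omics selection followed by concatenation. The sketch below illustrates this with plain lists; `representative_features` and `fuse` are hypothetical helper names, and the feature values are made up for the example.

```python
def representative_features(real_features, generated_features, available):
    """Use the real feature when the omics is available, else the feature of generated data."""
    return [
        real if is_available else generated
        for real, generated, is_available in zip(real_features, generated_features, available)
    ]

def fuse(feature_vectors):
    """Concatenation-based fusion: join all per-omics vectors into one fused vector z_i."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused

# Three omics; the second is missing, so its generated-data feature is used instead.
real = [[1.0, 2.0], None, [5.0]]
generated = [[9.0, 9.0], [3.0, 4.0], [8.0]]
available = [True, False, True]

z_i = fuse(representative_features(real, generated, available))
```

Because every omics always contributes a vector, real or approximated, the fused vector has the same length for every patient, which lets a single diagnosis network handle any missingness pattern.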
S1043: perform cancer diagnosis from the fused feature and calculate the diagnosis loss. The fused feature z_i is input into the fusion-feature-based cancer diagnosis network, and the diagnosis loss DLoss is calculated from the diagnosis result, defined as follows:
where f_Ψ(·) is the cancer diagnosis network with parameters Ψ, and CE(f_Ψ(z_i), y_i) is the cross-entropy function.
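The cross-entropy diagnosis loss in S1043 can be illustrated as below. This sketch assumes the diagnosis network outputs one logit per class, that a softmax turns logits into probabilities, and that DLoss averages the per-patient cross-entropy; whether the source sums or averages over patients is not recoverable here, so the averaging is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one patient's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, label):
    """CE for one patient: negative log-probability of the true class."""
    return -math.log(probabilities[label])

def dloss(batch_logits, labels):
    """Assumed DLoss: mean cross-entropy over the patients in the batch."""
    losses = [cross_entropy(softmax(logits), y) for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Two-class toy example: an uninformative prediction versus a confident correct one.
uninformative = dloss([[0.0, 0.0]], [1])
confident = dloss([[0.0, 4.0]], [1])
```

The uninformative prediction gives a loss of ln 2, and the loss shrinks as the logit of the true class grows.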
Illustratively, the training of the incomplete multi-omics data cancer diagnosis system comprises:
Constructing a first training set. The first training set consists of all available omics data of patients whose diagnosis results are known to the incomplete multi-omics data cancer diagnosis system.
Specifically, each omics data in the first training set is input into the attention-based feature extraction network of the corresponding omics, which yields the sharing loss SLoss and the personality loss ILoss. The generated data under the available omics and the real data of the corresponding omics in the first training set are input into the omics-specific discriminator network, and the generative adversarial loss GLoss is calculated from the discriminator output. The fused features derived from the available data and the generated data are input into the fusion-feature-based cancer diagnosis network to obtain the patient's final diagnosis result, from which the diagnosis loss DLoss is calculated. Finally, these losses are combined into the final objective loss function L of the incomplete multi-omics data cancer diagnosis system, defined as follows:
$$L = \min_{\Phi,\,\Omega,\,G,\,\Psi} \; SLoss + ILoss + GLoss + DLoss \qquad (12)$$
By optimizing the objective loss function L, the network parameters of every module in the incomplete multi-omics data cancer diagnosis system are updated, which improves the system's diagnostic performance.
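As a small illustration of equation (12), the objective is the unweighted sum of the four loss terms. The function name `total_loss` and the numeric values are placeholders chosen for the example, not results from the patent.

```python
def total_loss(sloss, iloss, gloss, dloss):
    """Unweighted sum of the four loss terms, matching the form of equation (12)."""
    return sloss + iloss + gloss + dloss

# Placeholder values for one training step.
loss = total_loss(sloss=0.5, iloss=0.25, gloss=1.0, dloss=0.75)
```

In a training loop this scalar would be the quantity minimized with respect to the feature extraction, generation, and diagnosis network parameters.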
Further, a first validation set and a first test set are constructed. Both consist of all available omics data of patients whose diagnosis results are unknown to the incomplete multi-omics data cancer diagnosis system.
Specifically, all available omics data in the first validation set are input into the incomplete multi-omics data cancer diagnosis system to obtain the diagnosis results predicted by the system. The diagnostic accuracy of the system on the first validation set is calculated from the predicted results and the patients' true diagnosis results, and the system parameters with the highest validation accuracy during training are selected as the final parameters of the incomplete multi-omics data cancer diagnosis system.
Further, all available omics data in the first test set are input into the incomplete multi-omics data cancer diagnosis system with the final parameters to obtain the predicted diagnosis results. The diagnostic accuracy of the system on the first test set is calculated from the predicted results and the patients' true diagnosis results, and this accuracy is taken as an estimate of the system's diagnostic accuracy in future diagnosis tasks.
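The validation-based parameter selection and the final test estimate can be sketched as follows. The sketch uses a toy one-parameter threshold classifier in place of the diagnosis system; `accuracy`, `select_best_checkpoint`, and `predict` are hypothetical names, and the data are invented for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of patients whose predicted diagnosis matches the true diagnosis."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def select_best_checkpoint(checkpoints, predict, inputs, labels):
    """Return the saved parameters with the highest validation accuracy, and that accuracy."""
    best_params, best_accuracy = None, -1.0
    for params in checkpoints:
        acc = accuracy([predict(params, x) for x in inputs], labels)
        if acc > best_accuracy:
            best_params, best_accuracy = params, acc
    return best_params, best_accuracy

def predict(threshold, x):
    """Toy stand-in for the diagnosis system: class 1 if the input exceeds the threshold."""
    return int(x > threshold)

# Parameters saved during training, evaluated on the validation set.
checkpoints = [0.2, 0.5, 0.8]
val_inputs, val_labels = [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]
final_params, val_accuracy = select_best_checkpoint(checkpoints, predict, val_inputs, val_labels)

# The held-out test set is used once, with the selected parameters only.
test_inputs, test_labels = [0.3, 0.7, 0.55, 0.45], [0, 1, 0, 1]
test_accuracy = accuracy([predict(final_params, x) for x in test_inputs], test_labels)
```

Keeping the test set out of the selection step is what makes its accuracy usable as an estimate of performance on future patients.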
In summary, the embodiments of the present invention provide a cancer diagnosis system for incomplete multi-omics data. The system first extracts omics features through attention-based feature extraction networks, and the joint optimization of the sharing loss and the personality loss ensures that the omics feature representations are well differentiated. The attention parameters in the networks both alleviate the overfitting caused by the high dimensionality of omics data and give the system good interpretability and authenticity. The system then generates the patient's missing omics data through a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one kind of omics data is available. Finally, the features of all omics are fused and input into the fusion-feature-based diagnosis network to obtain a more accurate cancer diagnosis result for the patient. In addition, the system takes only the patient's omics data as input and produces the diagnosis result without complicated operating steps, so it is easy to use.
Embodiment two:
An object of the present embodiment is to provide an electronic device.
An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following process when executing the program:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from each of the acquired omics data separately;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
The detailed implementation steps of this embodiment are described in embodiment one and are not repeated here.
Embodiment three:
An object of the present embodiment is to provide a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
The detailed implementation steps of this embodiment are described in embodiment one and are not repeated here.
Embodiment four:
An object of the present embodiment is to provide a computer program product.
A computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
The detailed implementation steps of this embodiment are described in embodiment one and are not repeated here.
The cancer diagnosis system, device, and medium based on incomplete multi-omics data provided by the above embodiments can be implemented and have broad application prospects.
The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit the disclosure; those skilled in the art may make various modifications and changes to it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall fall within its scope of protection.
Claims (7)
1. A cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to: acquire all available omics data to be diagnosed for the same patient;
an omics data feature extraction module configured to: extract features from each of the acquired omics data separately;
a missing omics data generation module configured to: generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to: fuse the features extracted from the patient's omics data with the features of the generated omics data, and input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics data;
integrating the generation loss and the adversarial loss of each omics data to calculate the generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent feature for the corresponding omics data;
inputting the latent feature for the corresponding omics data into the generation network G_v(·) of that omics data to generate the patient's missing omics data, expressed as follows:
where the output is the generated data for the patient's missing v-th omics;
wherein calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics data specifically comprises: inputting the patient's real data under an available omics and the generated data under that omics into the discriminator network D_v(·) of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics from the discrimination result, expressed as follows:
where the output of D_v(·) lies between 0 and 1, a smaller value meaning the discriminator considers the omics data more likely to be generated data, and the indicator matrix indicates whether omics data is missing.
2. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein extracting features from each of the acquired omics data separately specifically comprises:
constructing an attention parameter layer for the features of each omics data;
constructing a feature extraction network for each omics data to perform feature extraction;
calculating the sharing loss from the extracted features;
calculating the personality loss from the extracted features.
3. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein the generation loss and the adversarial loss of each omics data are integrated and the generative adversarial loss is calculated as follows:
By optimizing GLoss, the performance of the generative adversarial network is continuously improved as the generator and the discriminator compete, so that missing omics data resembling real data is generated.
4. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises: concatenating the patient's omics data features and the generated omics data features.
5. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following process when executing the program:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from each of the acquired omics data separately;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics data;
integrating the generation loss and the adversarial loss of each omics data to calculate the generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent feature for the corresponding omics data;
inputting the latent feature for the corresponding omics data into the generation network G_v(·) of that omics data to generate the patient's missing omics data, expressed as follows:
where the output is the generated data for the patient's missing v-th omics;
wherein calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics data specifically comprises: inputting the patient's real data under an available omics and the generated data under that omics into the discriminator network D_v(·) of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics from the discrimination result, expressed as follows:
where the output of D_v(·) lies between 0 and 1, a smaller value meaning the discriminator considers the omics data more likely to be generated data, and the indicator matrix indicates whether omics data is missing.
6. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics data;
integrating the generation loss and the adversarial loss of each omics data to calculate the generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent feature for the corresponding omics data;
inputting the latent feature for the corresponding omics data into the generation network G_v(·) of that omics data to generate the patient's missing omics data, expressed as follows:
where the output is the generated data for the patient's missing v-th omics;
wherein calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics data specifically comprises: inputting the patient's real data under an available omics and the generated data under that omics into the discriminator network D_v(·) of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics from the discrimination result, expressed as follows:
where the output of D_v(·) lies between 0 and 1, a smaller value meaning the discriminator considers the omics data more likely to be generated data, and the indicator matrix indicates whether omics data is missing.
7. A computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data and the generated data of the available omics data;
integrating the generation loss and the adversarial loss of each omics data to calculate the generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent feature for the corresponding omics data;
inputting the latent feature for the corresponding omics data into the generation network G_v(·) of that omics data to generate the patient's missing omics data, expressed as follows:
where the output is the generated data for the patient's missing v-th omics;
wherein calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics data specifically comprises: inputting the patient's real data under an available omics and the generated data under that omics into the discriminator network D_v(·) of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics from the discrimination result, expressed as follows:
where the output of D_v(·) lies between 0 and 1, a smaller value meaning the discriminator considers the omics data more likely to be generated data, and the indicator matrix indicates whether omics data is missing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210867454.3A CN115064266B (en) | 2022-07-21 | 2022-07-21 | Incomplete multi-set data-based cancer diagnosis system, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210867454.3A CN115064266B (en) | 2022-07-21 | 2022-07-21 | Incomplete multi-set data-based cancer diagnosis system, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115064266A (en) | 2022-09-16 |
CN115064266B (en) | 2024-04-26 |
Family
ID=83205827
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210867454.3A (Active; CN115064266B (en)) | 2022-07-21 | 2022-07-21 | Incomplete multi-set data-based cancer diagnosis system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115064266B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115985513B (en) * | 2023-01-05 | 2023-11-03 | 徐州医科大学科技园发展有限公司 | Data processing method, device and equipment based on multiple groups of chemical cancer typing |
CN116741397B (en) * | 2023-08-15 | 2023-11-03 | 数据空间研究院 | Cancer typing method, system and storage medium based on multi-group data fusion |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028939A (en) * | 2019-11-15 | 2020-04-17 | 华南理工大学 | Multigroup intelligent diagnosis system based on deep learning |
CN111291777A (en) * | 2018-12-07 | 2020-06-16 | 深圳先进技术研究院 | Cancer subtype classification method based on multigroup chemical integration |
CN111816259A (en) * | 2020-07-07 | 2020-10-23 | 西安电子科技大学 | Incomplete omics data integration method based on network representation learning |
CN113228194A (en) * | 2018-10-12 | 2021-08-06 | 人类长寿公司 | Multigroup search engine for comprehensive analysis of cancer genome and clinical data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2020274091A1 (en) * | 2019-05-14 | 2021-12-09 | Tempus Ai, Inc. | Systems and methods for multi-label cancer classification |
Non-Patent Citations (2)
Title |
---|
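MOMA: a multi-task attention learning algorithm for multi-omics data interpretation and classification; Sehwan Moon, Hyunju Lee; Bioinformatics; 2022-02-14; Vol. 38 (No. 8); full text * |
Research progress on statistical methods for integrative analysis of multi-omics data; Shen Sipeng; Zhang Ruyang; Wei Yongyue; Chen Feng; Chinese Journal of Disease Control & Prevention; 2018-08-10 (No. 08); full text * |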
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||