CN115064266B - Incomplete multi-set data-based cancer diagnosis system, equipment and medium - Google Patents


Info

Publication number
CN115064266B
Authority
CN
China
Prior art keywords
data
patient
histology
omics
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210867454.3A
Other languages
Chinese (zh)
Other versions
CN115064266A (en
Inventor
余国先
王星泽
王峻
郭伟
崔立真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210867454.3A
Publication of CN115064266A
Application granted
Publication of CN115064266B


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present disclosure provides a cancer diagnosis system based on incomplete multi-omics data, belonging to the technical fields of artificial-intelligence data mining and classification and of bioinformatics. The system comprises: a data acquisition module configured to acquire all available omics data to be diagnosed for the same patient; an omics data feature extraction module configured to extract features from each type of acquired omics data separately; a missing omics data generation module configured to generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and to extract features from the generated data; and a multi-omics feature fusion and diagnosis module configured to fuse the features extracted from the patient's available omics data with the features of the generated omics data, and to input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.

Description

Cancer diagnosis system, device and medium based on incomplete multi-omics data
Technical Field
The disclosure belongs to the technical fields of artificial-intelligence data mining and classification and of bioinformatics, and in particular relates to a cancer diagnosis system, device and medium based on incomplete multi-omics data.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Given the current high incidence of cancer, it is particularly important to classify a patient's cancer type precisely from the patient's omics data, because different types of cancer often require different treatments. Compared with diagnosis from a single type of omics data, fusing multiple types of omics data gives a richer characterization of the patient and can thereby further improve diagnostic accuracy. However, because some detection methods are costly, invasive, or subject to legal and ethical constraints, incomplete multi-omics data are ubiquitous in practice. Achieving accurate cancer diagnosis from incomplete multi-omics data therefore remains a difficulty that current machine learning techniques for cancer diagnosis have yet to overcome.
The inventors found that current methods for diagnosis under incomplete multi-omics data include: training and diagnosing after excluding samples with missing omics data; training and diagnosing after ensuring that one type of the patient's omics data is complete; constructing separate models according to which omics data are available, and then training and diagnosing; and mapping each type of omics data into a feature space of the same dimension, then training and diagnosing after fusion. However, the preconditions these training methods rely on severely limit their clinical application. In addition, fusing in a feature space of the same dimension pursues only the features common to all omics data and ignores the features specific to each, so machine learning techniques for accurate cancer diagnosis still have considerable room for improvement.
Disclosure of Invention
To solve the above problems, the present disclosure provides a cancer diagnosis system, device and medium based on incomplete multi-omics data. The scheme extracts omics features through an attention-based feature extraction network and, through joint optimization of a sharing loss and an individuality loss, gives the omics feature representations good diversity. The attention parameters in the feature extraction network not only alleviate the overfitting caused by the high dimensionality of omics data but also give the system good interpretability and credibility. The patient's missing omics data are generated using a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one type of omics data is available. Finally, the features of all omics data are fused and input into a diagnosis network based on the fused features, yielding a more accurate cancer diagnosis for the patient.
According to a first aspect of embodiments of the present disclosure, there is provided a cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to: acquire all available omics data to be diagnosed for the same patient;
an omics data feature extraction module configured to: extract features from each type of acquired omics data separately;
a missing omics data generation module configured to: generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to: fuse the features extracted from the patient's omics data with the features of the generated omics data, and input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Further, extracting features from each type of acquired omics data separately specifically comprises:
constructing an attention parameter layer for the features of each type of omics data;
constructing a feature extraction network for each type of omics data to perform feature extraction;
calculating a sharing loss from the extracted features;
and calculating an individuality loss from the extracted features.
Further, generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data of the available omics types and the corresponding generated omics data;
combining the generation loss and the adversarial loss of each omics type to obtain the generative adversarial loss.
Further, generating the missing omics data from the extracted omics data features specifically comprises:
using the features extracted from the patient's available omics data to calculate the patient's latent feature $u_i^v$ for the corresponding omics data;
inputting the latent feature $u_i^v$ into the generator network $G_v(\cdot)$ of the corresponding omics data to generate the patient's missing omics data $\hat{x}_i^v$, expressed as:
$$\hat{x}_i^v = G_v(u_i^v)$$
where $G_v(\cdot)$ is the generator for the $v$-th omics data and $\hat{x}_i^v$ is the generated data of the patient for the $v$-th omics.
Further, calculating the generation loss and the adversarial loss from the real data of the available omics types and the generated omics data specifically comprises: inputting the patient's real data $x_i^v$ under an available omics type and the generated data $\hat{x}_i^v$ for that omics type into the discriminator network $D_v(\cdot)$ of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics type from the discriminator's outputs.
Here $D_v(\cdot)$ is the discriminator for the $v$-th omics data; its output lies between 0 and 1, and a smaller value means the discriminator considers the omics data more likely to be generated data. The losses also use an indicator matrix that records whether each type of omics data is missing.
Further, fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises: concatenating the patient's omics data features with the generated omics data features.
Further, the generation loss and the adversarial loss of each omics type are combined to obtain the generative adversarial loss GLoss.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data similar to real data.
According to a second aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following process when executing the program:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from each type of acquired omics data separately;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
According to a third aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available omics data to be diagnosed for the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The scheme first extracts omics features through an attention-based feature extraction network and, through joint optimization of the sharing loss and the individuality loss, gives the omics feature representations good diversity. The attention parameters in the feature extraction network not only alleviate the overfitting caused by the high dimensionality of omics data but also give the system good interpretability and credibility. The patient's missing omics data are generated using a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one type of omics data is available. Finally, the features of all omics data are fused and input into a diagnosis network based on the fused features, yielding a more accurate cancer diagnosis for the patient.
(2) The system of the present disclosure takes only the patient's omics data as input and yields the patient's diagnosis result without complicated operating steps, so it is easy to use.
Additional aspects of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a cancer diagnosis system based on incomplete multi-omics data according to an embodiment of the present disclosure;
FIG. 2 is a process flow diagram of the cancer diagnosis system based on incomplete multi-omics data in an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
Embodiment one:
It is an object of the present embodiment to provide a cancer diagnosis system based on incomplete multi-omics data.
A cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to: acquire all available omics data to be diagnosed for the same patient;
an omics data feature extraction module configured to: extract features from each type of acquired omics data separately;
a missing omics data generation module configured to: generate the patient's missing omics data from the patient's available omics data using a generative adversarial strategy, and extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to: fuse the features extracted from the patient's omics data with the features of the generated omics data, and input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
Further, extracting features from each type of acquired omics data separately specifically comprises:
constructing an attention parameter layer for the features of each type of omics data;
constructing a feature extraction network for each type of omics data to perform feature extraction;
calculating a sharing loss from the extracted features;
and calculating an individuality loss from the extracted features.
Further, generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss from the real data of the available omics types and the corresponding generated omics data;
combining the generation loss and the adversarial loss of each omics type to obtain the generative adversarial loss.
Further, generating the missing omics data from the extracted omics data features specifically comprises:
using the features extracted from the patient's available omics data to calculate the patient's latent feature $u_i^v$ for the corresponding omics data;
inputting the latent feature $u_i^v$ into the generator network $G_v(\cdot)$ of the corresponding omics data to generate the patient's missing omics data $\hat{x}_i^v$, expressed as:
$$\hat{x}_i^v = G_v(u_i^v)$$
where $G_v(\cdot)$ is the generator for the $v$-th omics data and $\hat{x}_i^v$ is the generated data of the patient for the $v$-th omics.
Further, calculating the generation loss and the adversarial loss from the real data of the available omics types and the generated omics data specifically comprises: inputting the patient's real data $x_i^v$ under an available omics type and the generated data $\hat{x}_i^v$ for that omics type into the discriminator network $D_v(\cdot)$ of the corresponding omics data, and calculating the generation loss and the adversarial loss for that omics type from the discriminator's outputs.
Here $D_v(\cdot)$ is the discriminator for the $v$-th omics data; its output lies between 0 and 1, and a smaller value means the discriminator considers the omics data more likely to be generated data. The losses also use an indicator matrix that records whether each type of omics data is missing: one value of an entry indicates that the corresponding type of omics data is available, and the other indicates that it is missing.
Further, fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises: concatenating the patient's omics data features with the generated omics data features.
Further, the generation loss and the adversarial loss of each omics type are combined to obtain the generative adversarial loss GLoss.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data similar to real data.
In particular, for easy understanding, the following detailed description of the embodiments will be given with reference to the accompanying drawings:
As shown in FIG. 2, there is provided a cancer diagnosis system based on incomplete multi-omics data, whose processing comprises:
S101, acquiring all available omics data to be diagnosed for the same patient, where the omics data may include, but are not limited to: copy number variation data, DNA methylation data, diagnostic pathology slide image data, gene expression data, and the like.
S102, extracting features from the available omics data.
Specifically, an attention parameter layer is first constructed for the features of each type of omics data to capture the important omics features. A feature extraction network is then constructed for each type of omics data, and the attention-weighted omics data are input into it for feature extraction. Finally, the sharing loss and the individuality loss are calculated from the extracted features and used to optimize each attention-based feature extraction network during training.
The specific implementation of step S102 is as follows:
S1021, constructing an attention parameter layer for the features of each type of omics data. The patient's omics data $x_i^v$ are input into the attention layer $\omega_v$ of the corresponding omics features to capture important feature sites, yielding the attention-weighted omics data $x_i^v \odot \omega_v$ and thereby reducing the overfitting caused by the high dimensionality of omics data.
S1022, constructing a feature extraction network for each type of omics data, and inputting the attention-weighted omics data into the feature extraction network of the corresponding omics type to obtain feature vectors $h_i^v$ that have the same dimension for every omics type, defined as follows:
$$h_i^v = F_v\left(x_i^v \odot \omega_v;\ \Phi_v\right)$$
where $x_i^v$ and $h_i^v$ are the raw data and the extracted features of the $i$-th patient under the $v$-th omics, $F_v(\cdot)$ is the feature extraction network of the $v$-th omics data with parameters $\Phi_v$, $\odot$ denotes element-wise multiplication, and $\omega_v$ is the attention weight of the $v$-th omics data obtained by Softmax normalization. Introducing the attention parameters lets the feature extraction network adaptively capture important mutation sites in the patient's omics data and assign them larger attention weights during feature extraction, which gives the network a degree of interpretability. Normalizing $\omega_v$ with the Softmax function also prevents the local-optimum problem caused by excessively large site weights in the omics data.
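The attention-weighted extraction in S1021 and S1022 can be illustrated with a minimal NumPy sketch. The text does not specify the architecture of the feature extraction network, so a single dense layer with ReLU stands in for it here; the names `attn_logits`, `W` and `b` are hypothetical.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def extract_features(x, attn_logits, W, b):
    """Attention-weighted feature extraction for one omics type.

    x           : (n_patients, d) raw omics matrix
    attn_logits : (d,) learnable attention parameters (hypothetical name)
    W, b        : weights of one dense layer, an assumed stand-in for the
                  feature extraction network
    """
    omega = softmax(attn_logits)            # attention weights over sites, sum to 1
    x_att = x * omega                       # element-wise product, broadcast per patient
    return np.maximum(x_att @ W + b, 0.0)   # ReLU features of dimension k
```

Because the weights come from a softmax, no single site can receive an unbounded weight, which is the property the text attributes to the normalization.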
S1023, calculating the sharing loss from the extracted features. A feature evaluation network is constructed for each extracted omics feature $h_i^v$, and the sharing loss SLoss is calculated from the cancer type output by the evaluation network and the patient's real cancer type.
Here $y_i$ is the patient's cancer type, and each term of SLoss is a loss function that measures the predictive capacity of the extracted feature $h_i^v$ (cross-entropy loss is used here). SLoss aims to induce consistent predictions from the omics features extracted by the feature extraction networks. By jointly optimizing the attention-based feature extraction networks and the feature evaluation networks through SLoss, informative features that help diagnose the cancer type can be extracted from the patient's various omics data, and the interpretability and credibility of the system are improved.
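A minimal sketch of the sharing loss follows, assuming each feature evaluation network is a single linear classification head and that the per-omics cross-entropy terms are averaged over patients and summed over omics types. Both choices are assumptions; the text only states that cross-entropy is used.

```python
import numpy as np

def cross_entropy(logits, y):
    """Mean cross-entropy between class logits (n, c) and integer labels (n,)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def sharing_loss(features, heads, y):
    """Sum of per-omics cross-entropy losses.

    features : list of (n, k) feature matrices, one per omics type
    heads    : list of (W, b) linear evaluation heads (hypothetical stand-ins
               for the feature evaluation networks)
    y        : (n,) true cancer types as integer class indices
    """
    return sum(cross_entropy(h @ W + b, y) for h, (W, b) in zip(features, heads))
```

Every omics type is scored against the same label `y`, which is what pushes the per-omics features toward consistent predictions.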
S1024, calculating the individuality loss from the extracted features. The features $h_i^v$ extracted from each omics type are first used to calculate the patient's feature prototype $p_i$, which serves as an approximation of the common features. Then, using an individuality factor $\lambda$, the similarity between each extracted feature $h_i^v$ and the prototype $p_i$ is converted into the individuality loss ILoss, which measures the balance between individuality and commonality among the extracted features.
The similarity used is the cosine similarity between the extracted feature $h_i^v$ and the feature prototype $p_i$. When this similarity is greater than the individuality factor $\lambda$ ($-1 \le \lambda \le 1$), $h_i^v$ is already close to the common features, and the $\mathrm{ReLU}(\cdot)$ activation function sets its loss to 0; otherwise $h_i^v$ contains too many individual features, and optimizing ILoss reduces this excessive individualization.
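The thresholding behaviour described above can be sketched as a hinge on the cosine similarity. Two details are assumptions: the text does not say how the prototype is computed (a plain mean over omics types is used here), and the per-term form `ReLU(lam - cos)` is one reading of the description, not a formula taken from the source.

```python
import numpy as np

def individuality_loss(features, lam, eps=1e-12):
    """Hinge-style individuality loss over all patients and omics types.

    features : list of (n, k) feature matrices, one per omics type
    lam      : individuality factor in [-1, 1]
    """
    H = np.stack(features)                    # (V, n, k)
    proto = H.mean(axis=0)                    # (n, k); ASSUMPTION: prototype = mean
    num = (H * proto).sum(axis=-1)            # (V, n) dot products
    den = np.linalg.norm(H, axis=-1) * np.linalg.norm(proto, axis=-1) + eps
    cos = num / den                           # cosine similarity to the prototype
    return np.maximum(lam - cos, 0.0).sum()   # zero once similarity exceeds lam
```

A larger `lam` tolerates less deviation from the prototype; a smaller one lets each omics type keep more of its own characteristics.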
Through joint optimization of the sharing loss SLoss and the individuality loss ILoss, the system can extract both common and individual features from the patient's multi-omics data, which preserves the diversity of patient features across cancer types and helps the system diagnose more accurately.
S103, generating the missing omics data.
Specifically, the patient's latent feature for each omics type is calculated from the features extracted from the patient's available omics data; the latent features are input into the generator networks to generate data for each omics type; the generated data and the real data of the available omics types are input into the discriminator networks to calculate the generation loss and the adversarial loss; and the losses calculated under each omics type are combined to obtain the generative adversarial loss.
The specific implementation of step S103 is as follows:
S1031, generating the missing omics data from the extracted features. The features extracted from the patient's available omics data are used to calculate the patient's latent feature $u_i^v$ for the corresponding omics data.
Then the latent feature $u_i^v$ is input into the generator network $G_v(\cdot)$ of the corresponding omics type to generate the patient's missing omics data $\hat{x}_i^v$ as follows:
$$\hat{x}_i^v = G_v(u_i^v)$$
To preserve the generative ability of the generator network, the features extracted from the $v$-th omics data themselves are not used in calculating $u_i^v$. $G_v(\cdot)$ is the generator for the $v$-th omics data, and $\hat{x}_i^v$ is the generated data of the patient for the $v$-th omics.
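The text states that the latent feature for the $v$-th omics is computed from the features of the other available omics types but does not reproduce the formula. The sketch below assumes a simple average over those features and uses a linear map as a hypothetical generator; both are illustrative choices only.

```python
import numpy as np

def latent_feature(features, mask, v):
    """Latent feature for omics type v from the OTHER available omics types.

    features : (V, n, k) extracted features for V omics types and n patients
    mask     : (n, V) availability indicators, 1 = available (assumed encoding)
    v        : index of the omics type to be generated
    """
    w = mask.T.astype(float)                  # (V, n) weights, a fresh copy
    w[v] = 0.0                                # exclude omics type v itself
    denom = np.maximum(w.sum(axis=0), 1.0)    # avoid division by zero
    return (w[:, :, None] * features).sum(axis=0) / denom[:, None]

def generate(u, Wg, bg):
    """Hypothetical linear generator mapping latent features to omics data."""
    return u @ Wg + bg
```

Zeroing the weight of omics type `v` mirrors the stated rule that a type's own extracted features are not used when generating that type.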
S1032, calculating the generation loss and the adversarial loss from the real data and the generated data of the available omics types. The patient's real data $x_i^v$ under an available omics type and the generated data $\hat{x}_i^v$ for that omics type are input into the discriminator network $D_v(\cdot)$ of the corresponding omics data, and the generation loss and the adversarial loss for that omics type are calculated from the discriminator's outputs.
Here $D_v(\cdot)$ is the discriminator for the $v$-th omics data; its output lies between 0 and 1, and a smaller value means the discriminator considers the omics data more likely to be generated data.
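The loss formulas themselves are not reproduced in the text. As an illustration only, the sketch below uses the standard non-saturating generator loss and the standard binary cross-entropy discriminator loss, restricted to patients for whom the omics type is available. This is a commonly used GAN formulation and may differ from the one in the source.

```python
import numpy as np

def gan_losses(d_real, d_fake, available, eps=1e-12):
    """Masked generation and adversarial losses for one omics type.

    d_real    : (n,) discriminator outputs on real data, in (0, 1)
    d_fake    : (n,) discriminator outputs on generated data, in (0, 1)
    available : (n,) 1 where the real omics data exist (assumed encoding)
    """
    m = available.astype(float)
    count = max(m.sum(), 1.0)
    # Generator wants generated data to be scored as real (output near 1).
    gen_loss = -(m * np.log(d_fake + eps)).sum() / count
    # Discriminator wants real near 1 and generated near 0.
    adv_loss = -(m * (np.log(d_real + eps) + np.log(1.0 - d_fake + eps))).sum() / count
    return gen_loss, adv_loss
```

Patients whose real data are missing contribute nothing to either loss, because there is no real sample to compare the generated one against.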
S1033, combining the generation loss and the adversarial loss of each omics type to obtain the generative adversarial loss GLoss of the generative adversarial network over all available omics data.
By optimizing GLoss, the generative adversarial network is progressively improved through the interplay of generation and discrimination, so that it generates missing omics data that more closely resemble real data.
S104, omics feature fusion and diagnosis.
Specifically, the generated data for each missing omics type are input into the attention-based feature extraction network of that omics type for feature extraction, and the extracted features serve as the patient's approximate features for that omics type. The features of all of the patient's omics data are then fused and input into the fused-feature-based cancer diagnosis network to obtain the patient's diagnosis result, from which the diagnosis loss is calculated.
The specific implementation of step S104 is as follows:
S1041, calculating the patient's feature vector for each omics type. The patient's generated data $\hat{x}_i^v$ are input into the attention-based feature extraction network of the corresponding omics type for feature extraction, and the representative feature vector $r_i^v$ of the $v$-th omics to be used for cancer diagnosis is chosen according to whether that omics data is available.
When the real data $x_i^v$ are available, their feature representation $h_i^v$ is used directly for the subsequent feature fusion; otherwise, the features extracted from the generated data $\hat{x}_i^v$ by the same network are used for the subsequent feature fusion.
S1042, fusion of the respective histology features, combining all the histology feature vectors of the patient for diagnosisFeature fusion based on splicing is carried out, and fusion features z i are obtained and defined as follows:
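The selection rule of S1041 and the concatenation of S1042 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the array names, the boolean `available` mask, and the feature sizes are hypothetical.

```python
import numpy as np

def fuse_features(real_feats, gen_feats, available):
    """Pick real or generated features per omics type, then concatenate.

    real_feats[v], gen_feats[v]: arrays of shape (patients, dim_v) holding features
        extracted from real data and from generated data for omics type v.
    available: bool array of shape (patients, omics); True where real data exists.
    Returns the fused features, shape (patients, sum of dim_v).
    """
    chosen = []
    for v, (real, gen) in enumerate(zip(real_feats, gen_feats)):
        use_real = available[:, v][:, None]           # broadcast over the feature axis
        chosen.append(np.where(use_real, real, gen))  # real if available, else generated
    return np.concatenate(chosen, axis=1)

# Toy example: 2 patients, 2 omics types with feature sizes 2 and 3.
real_feats = [np.ones((2, 2)), np.full((2, 3), 2.0)]
gen_feats = [np.zeros((2, 2)), np.full((2, 3), -1.0)]
available = np.array([[True, True], [True, False]])  # second patient lacks the second type
z = fuse_features(real_feats, gen_feats, available)
```

Because every patient ends up with one feature vector per omics type, the fused feature has the same length for all patients regardless of which omics types were missing.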
S1043: perform cancer diagnosis from the fused feature and calculate the diagnosis loss. The fused feature $z_i$ is input into the fused-feature-based cancer diagnosis network for diagnosis, and the diagnosis loss DLoss is calculated from the diagnosis result, defined as follows:
where $f_\Psi(\cdot)$ is the cancer diagnosis network parameterized by $\Psi$, and $CE(f_\Psi(z_i), y_i)$ is the cross-entropy function between the network output for the fused feature $z_i$ and the patient's true diagnosis label $y_i$.
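A cross-entropy diagnosis loss of this kind can be sketched as below. The sketch assumes that the diagnosis network outputs one raw score per class and that the loss is averaged over patients; the source may sum instead. The names `softmax`, `diagnosis_loss`, `logits`, and `labels` are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def diagnosis_loss(logits, labels):
    """Mean cross-entropy between predicted class probabilities and true labels.

    logits: (patients, classes) raw outputs of a diagnosis network (hypothetical).
    labels: (patients,) integer class indices.
    """
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))

# Toy example: 2 patients, 2 classes.
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1])
dloss = diagnosis_loss(logits, labels)
```

The loss shrinks as the network assigns higher probability to each patient's true diagnosis.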
Illustratively, the training of the incomplete multi-omics data cancer diagnosis system comprises:
Constructing a first training set, which consists of all available omics data of patients whose diagnosis results are known to the incomplete multi-omics data cancer diagnosis system.
Specifically, the data of each omics type in the first training set is input into the attention-based feature extraction network of the corresponding omics type to obtain the sharing loss SLoss and the individuality loss ILoss. The generated data for the available omics types and the corresponding real data in the first training set are input into the omics-specific discriminator networks, and the generative adversarial loss GLoss is calculated from the discriminator outputs. The fused features derived from the available data and the generated data are input into the fused-feature-based cancer diagnosis network to obtain the patient's final diagnosis result, and the diagnosis loss DLoss is calculated from that result. Finally, these losses are combined into the final objective loss function L of the incomplete multi-omics data cancer diagnosis system, defined as follows:
$$L = \min_{\Phi,\Omega,G,\Psi}\ \mathrm{SLoss} + \mathrm{ILoss} + \mathrm{GLoss} + \mathrm{DLoss} \tag{12}$$
Optimizing the objective loss function L updates the network parameters of every module in the incomplete multi-omics data cancer diagnosis system, thereby improving the system's diagnostic performance.
Further, a first validation set and a first test set are constructed. Both consist of all available omics data of patients whose diagnosis results are unknown to the incomplete multi-omics data cancer diagnosis system.
Specifically, all available omics data in the first validation set is input into the incomplete multi-omics data cancer diagnosis system to obtain the system's predicted diagnosis results. The diagnostic accuracy on the first validation set is calculated from the predicted results and the patients' true diagnosis results, and the system parameters that achieve the highest validation accuracy during training are selected as the final parameters of the system.
Further, all available omics data in the first test set is input into the incomplete multi-omics data cancer diagnosis system with the final parameters to obtain predicted diagnosis results. The diagnostic accuracy on the first test set is calculated from the predicted results and the patients' true diagnosis results, and this accuracy serves as an estimate of the system's diagnostic accuracy in future diagnosis tasks.
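The validation-based parameter selection and the final test evaluation can be sketched as follows. This is an illustrative sketch only: the "checkpoints" here are simple decision thresholds standing in for saved network parameters, and `select_best_checkpoint`, `threshold_predict`, and the toy scores are hypothetical.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of patients whose predicted diagnosis matches the true one."""
    return float(np.mean(np.asarray(predicted) == np.asarray(actual)))

def select_best_checkpoint(checkpoints, predict, val_inputs, val_labels):
    """Return (best_params, best_accuracy) over checkpoints saved during training.

    checkpoints: iterable of parameter objects (hypothetical).
    predict: function (params, inputs) -> predicted labels (hypothetical).
    """
    best_params, best_acc = None, -1.0
    for params in checkpoints:
        acc = accuracy(predict(params, val_inputs), val_labels)
        if acc > best_acc:  # keep the first checkpoint that reaches the highest accuracy
            best_params, best_acc = params, acc
    return best_params, best_acc

# Toy example: each "checkpoint" is just a decision threshold on a 1-D score.
def threshold_predict(threshold, scores):
    return (scores > threshold).astype(int)

val_scores, val_labels = np.array([0.1, 0.4, 0.6, 0.9]), np.array([0, 0, 1, 1])
test_scores, test_labels = np.array([0.2, 0.45, 0.7, 0.55]), np.array([0, 1, 1, 1])

best_threshold, val_acc = select_best_checkpoint(
    [0.05, 0.5, 0.8], threshold_predict, val_scores, val_labels)
test_acc = accuracy(threshold_predict(best_threshold, test_scores), test_labels)
```

The test set is touched only once, after the parameters are fixed, which is why its accuracy can be read as an estimate of performance on future patients.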
In summary, the embodiments of the present invention provide an incomplete multi-omics data cancer diagnosis system. The system first extracts omics features through attention-based feature extraction networks and jointly optimizes the sharing loss and the individuality loss so that the omics feature representations are well differentiated; the attention parameters in these networks not only alleviate the overfitting caused by the high dimensionality of omics data but also give the system good interpretability and authenticity. The system then generates the patient's missing omics data through a generative adversarial strategy, which enriches the patient's feature representation and allows flexible cancer diagnosis even when only one type of omics data is available. Finally, the features of all omics types are fused and input into the fused-feature diagnosis network to obtain a more accurate cancer diagnosis for the patient. In addition, the system takes only the patient's omics data as input and produces a diagnosis without complicated operating steps, so it is easy to use.
Embodiment two:
An object of this embodiment is to provide an electronic device.
An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the following process when executing the program:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from each of the acquired omics data types separately;
generating the patient's missing omics data from the patient's corresponding omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
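The order of the four steps above can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the omics type names, the feature size, the random linear layers standing in for trained networks, and in particular the use of the mean of the available features as the latent vector fed to the generator.

```python
import numpy as np

rng = np.random.default_rng(0)
DIMS = {"mrna": 6, "methylation": 5}  # hypothetical omics types and input sizes
FEAT = 4                              # hypothetical shared feature size

# Stand-ins for trained networks: single random linear layers.
extractors = {k: rng.normal(size=(d, FEAT)) for k, d in DIMS.items()}
generators = {k: rng.normal(size=(FEAT, d)) for k, d in DIMS.items()}
classifier = rng.normal(size=(FEAT * len(DIMS), 2))

def diagnose(patient):
    """patient: dict mapping each available omics type to a 1-D array."""
    # Steps 1-2: take the available omics data and extract features from each type.
    feats = {k: np.tanh(x @ extractors[k]) for k, x in patient.items()}
    # Step 3: generate each missing omics type from a latent vector (assumed here
    # to be the mean of the available features), then extract features from it.
    latent = np.mean(list(feats.values()), axis=0)
    for k in DIMS:
        if k not in feats:
            generated = latent @ generators[k]
            feats[k] = np.tanh(generated @ extractors[k])
    # Step 4: concatenate in a fixed order and classify.
    fused = np.concatenate([feats[k] for k in DIMS])
    return int(np.argmax(fused @ classifier)), fused

label, fused = diagnose({"mrna": rng.normal(size=6)})  # methylation is missing
```

Concatenating in the fixed order of `DIMS` keeps each omics type at the same position in the fused vector, whether its features came from real or generated data.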
The implementation details of these steps are described in Embodiment one and are not repeated here.
Embodiment three:
An object of this embodiment is to provide a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
The implementation details of these steps are described in Embodiment one and are not repeated here.
Embodiment four:
An object of this embodiment is to provide a computer program product.
A computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result.
The implementation details of these steps are described in Embodiment one and are not repeated here.
The cancer diagnosis system, device, and medium based on incomplete multi-omics data provided by the above embodiments are all implementable and have broad application prospects.
The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (7)

1. A cancer diagnosis system based on incomplete multi-omics data, comprising:
a data acquisition module configured to: acquire all available to-be-diagnosed omics data of the same patient;
an omics data feature extraction module configured to: extract features from each of the acquired omics data types separately;
a missing omics data generation module configured to: generate the patient's missing omics data from the patient's corresponding omics data using a generative adversarial strategy, and extract features from the generated data;
a multi-omics feature fusion and diagnosis module configured to: fuse the features extracted from the patient's omics data with the features of the generated omics data, and input the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's corresponding omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss based on the real data of the available omics data and the generated omics data;
integrating the generation loss and the adversarial loss of each omics type to calculate a generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent features corresponding to each omics type;
inputting the latent features corresponding to each omics type into the generation network $G_v(\cdot)$ of that omics type to generate the patient's missing omics data, expressed as follows:
wherein the output of $G_v(\cdot)$ is the generated missing data of the patient for the v-th omics type;
wherein calculating the generation loss and the adversarial loss based on the real data of the available omics data and the generated omics data specifically comprises: inputting the real data in the patient's available omics data and the data generated from the available omics data into the discriminator network $D_v(\cdot)$ of the corresponding omics type, and calculating, from the discrimination results, the generation loss and the adversarial loss for that omics type, expressed as follows:
wherein the output of $D_v(\cdot)$ lies between 0 and 1, a smaller value indicating that the discriminator considers the omics data more likely to be generated data; and an indicator matrix indicates whether each omics data is missing.
2. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein extracting features from each of the acquired omics data types separately specifically comprises:
constructing an attention parameter layer for the features of each omics data type;
constructing a feature extraction network for each omics data type to perform feature extraction;
calculating a sharing loss from the extracted features;
calculating an individuality loss from the extracted features.
3. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein the generation loss and the adversarial loss of each omics type are integrated and the generative adversarial loss is calculated as follows:
By optimizing GLoss, the generative adversarial network is continually improved through the interplay of generation and discrimination, so that missing omics data similar to real data is generated.
4. The cancer diagnosis system based on incomplete multi-omics data according to claim 1, wherein fusing the features extracted from the patient's omics data with the features of the generated omics data specifically comprises: concatenating the patient's omics data features and the generated omics data features.
5. An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the following process when executing the program:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from each of the acquired omics data types separately;
generating the patient's missing omics data from the patient's corresponding omics data using a generative adversarial strategy, and extracting features from the generated data;
fusing the features extracted from the patient's omics data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's corresponding omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss based on the real data of the available omics data and the generated omics data;
integrating the generation loss and the adversarial loss of each omics type to calculate a generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent features corresponding to each omics type;
inputting the latent features corresponding to each omics type into the generation network $G_v(\cdot)$ of that omics type to generate the patient's missing omics data, expressed as follows:
wherein the output of $G_v(\cdot)$ is the generated missing data of the patient for the v-th omics type;
wherein calculating the generation loss and the adversarial loss based on the real data of the available omics data and the generated omics data specifically comprises: inputting the real data in the patient's available omics data and the data generated from the available omics data into the discriminator network $D_v(\cdot)$ of the corresponding omics type, and calculating, from the discrimination results, the generation loss and the adversarial loss for that omics type, expressed as follows:
wherein the output of $D_v(\cdot)$ lies between 0 and 1, a smaller value indicating that the discriminator considers the omics data more likely to be generated data; and an indicator matrix indicates whether each omics data is missing.
6. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss based on the real data of the available omics data and the generated omics data;
integrating the generation loss and the adversarial loss of each omics type to calculate a generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent features corresponding to each omics type;
inputting the latent features corresponding to each omics type into the generation network $G_v(\cdot)$ of that omics type to generate the patient's missing omics data, expressed as follows:
wherein the output of $G_v(\cdot)$ is the generated missing data of the patient for the v-th omics type;
wherein calculating the generation loss and the adversarial loss based on the real data of the available omics data and the generated omics data specifically comprises: inputting the real data in the patient's available omics data and the data generated from the available omics data into the discriminator network $D_v(\cdot)$ of the corresponding omics type, and calculating, from the discrimination results, the generation loss and the adversarial loss for that omics type, expressed as follows:
wherein the output of $D_v(\cdot)$ lies between 0 and 1, a smaller value indicating that the discriminator considers the omics data more likely to be generated data; and an indicator matrix indicates whether each omics data is missing.
7. A computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring all available to-be-diagnosed omics data of the same patient;
extracting features from the acquired omics data;
generating the patient's missing omics data from the patient's available omics data using a generative adversarial strategy;
fusing the features extracted from the patient's available data with the features of the generated omics data, and inputting the fused features into a pre-trained diagnosis network model to obtain a diagnosis result;
wherein generating the patient's missing omics data from the patient's available omics data using the generative adversarial strategy specifically comprises:
generating the missing omics data from the extracted omics data features;
calculating a generation loss and an adversarial loss based on the real data of the available omics data and the generated omics data;
integrating the generation loss and the adversarial loss of each omics type to calculate a generative adversarial loss;
wherein generating the missing omics data from the extracted omics data features specifically comprises:
calculating, from the features extracted from the patient's available omics data, the patient's latent features corresponding to each omics type;
inputting the latent features corresponding to each omics type into the generation network $G_v(\cdot)$ of that omics type to generate the patient's missing omics data, expressed as follows:
wherein the output of $G_v(\cdot)$ is the generated missing data of the patient for the v-th omics type;
wherein calculating the generation loss and the adversarial loss based on the real data of the available omics data and the generated omics data specifically comprises: inputting the real data in the patient's available omics data and the data generated from the available omics data into the discriminator network $D_v(\cdot)$ of the corresponding omics type, and calculating, from the discrimination results, the generation loss and the adversarial loss for that omics type, expressed as follows:
wherein the output of $D_v(\cdot)$ lies between 0 and 1, a smaller value indicating that the discriminator considers the omics data more likely to be generated data; and an indicator matrix indicates whether each omics data is missing.
CN202210867454.3A 2022-07-21 2022-07-21 Incomplete multi-set data-based cancer diagnosis system, equipment and medium Active CN115064266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210867454.3A CN115064266B (en) 2022-07-21 2022-07-21 Incomplete multi-set data-based cancer diagnosis system, equipment and medium

Publications (2)

Publication Number Publication Date
CN115064266A (en) 2022-09-16
CN115064266B (en) 2024-04-26

Family

ID=83205827

Country Status (1)

Country Link
CN (1) CN115064266B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115985513B (en) * 2023-01-05 2023-11-03 徐州医科大学科技园发展有限公司 Data processing method, device and equipment based on multiple groups of chemical cancer typing
CN116741397B (en) * 2023-08-15 2023-11-03 数据空间研究院 Cancer typing method, system and storage medium based on multi-group data fusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028939A (en) * 2019-11-15 2020-04-17 华南理工大学 Multigroup intelligent diagnosis system based on deep learning
CN111291777A (en) * 2018-12-07 2020-06-16 深圳先进技术研究院 Cancer subtype classification method based on multigroup chemical integration
CN111816259A (en) * 2020-07-07 2020-10-23 西安电子科技大学 Incomplete omics data integration method based on network representation learning
CN113228194A (en) * 2018-10-12 2021-08-06 人类长寿公司 Multigroup search engine for comprehensive analysis of cancer genome and clinical data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020274091A1 (en) * 2019-05-14 2021-12-09 Tempus Ai, Inc. Systems and methods for multi-label cancer classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
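MOMA: a multi-task attention learning algorithm for multi-omics data interpretation and classification; Sehwan Moon, Hyunju Lee; Bioinformatics; 2022-02-14; Vol. 38 (No. 8); full text *
Research progress on statistical methods for integrative analysis of multi-omics data; 沈思鹏, 张汝阳, 魏永越, 陈峰; 中华疾病控制杂志 (Chinese Journal of Disease Control & Prevention); 2018-08-10 (No. 08); full text *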

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant