CN116259407B - Disease diagnosis method, device, equipment and medium based on multi-modal data - Google Patents


Info

Publication number
CN116259407B
CN116259407B (application CN202310550630.5A)
Authority
CN
China
Prior art keywords
feature
features
text
image
audio
Prior art date
Legal status
Active
Application number
CN202310550630.5A
Other languages
Chinese (zh)
Other versions
CN116259407A (en)
Inventor
李祯其
谢雄敦
胡尧
温志庆
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202310550630.5A priority Critical patent/CN116259407B/en
Publication of CN116259407A publication Critical patent/CN116259407A/en
Application granted granted Critical
Publication of CN116259407B publication Critical patent/CN116259407B/en


Classifications

    • G16H50/20 — Healthcare informatics: ICT specially adapted for computer-aided medical diagnosis, e.g. based on medical expert systems
    • G06V10/806 — Image/video recognition: fusion of extracted features (combining data at the sensor, preprocessing, feature extraction or classification level)
    • G06V10/811 — Image/video recognition: fusion of classification results from classifiers operating on different input data, e.g. multi-modal recognition
    • G06V10/82 — Image or video recognition or understanding using neural networks
    • G16H10/60 — Healthcare informatics: ICT for patient-specific data, e.g. electronic patient records
    • Y02A90/10 — ICT supporting adaptation to climate change, e.g. weather forecasting or climate simulation


Abstract

The invention relates to the technical field of medical diagnosis and treatment, and in particular discloses a disease diagnosis method, device, equipment and medium based on multi-modal data. The method comprises the following steps: extracting question text features, disorder text features, audio features and image features; mapping the disorder text features, the audio features and the image features to the same dimension and performing feature alignment processing; fusing the feature-aligned disorder text features, audio features and image features to obtain a fusion feature vector; splicing the question text features with the fusion feature vector to obtain a splice vector; and feeding the splice vector into a pre-trained language model to generate a diagnosis result. The method can fully mine the correlations and differences between data of different modalities, solves the problem that relationships among multi-modal data are hard to capture owing to inter-data differences, effectively fuses the various features, can effectively relieve physician fatigue, and improves disease diagnosis accuracy.

Description

Disease diagnosis method, device, equipment and medium based on multi-modal data
Technical Field
The present application relates to the technical field of medical diagnosis and treatment, and in particular to a disease diagnosis method, device, equipment and medium based on multi-modal data.
Background
Medicine, as an empirical discipline, requires clinicians to have extensive experience in disease prediction, disease analysis, prescription formulation, and the like. Deep learning technology can provide support in this respect, assisting doctors in diagnosing diseases, reducing the influence of subjective factors, and improving diagnostic accuracy.
Existing methods or systems that use deep learning for disease diagnosis typically take a specific type of modal data (such as text or images) as the input data. Existing models that analyze single-modality data cannot judge diseases accurately; meanwhile, because different types of modal data differ from one another, existing models that analyze multiple types of modal data struggle to capture the differences among the modalities, leading to low diagnostic accuracy.
In view of the above problems, no effective technical solution is currently available.
Disclosure of Invention
The application aims to provide a disease diagnosis method, device, equipment and medium based on multi-modal data, so as to achieve effective fusion of multi-modal data and improve the disease diagnosis accuracy of the model.
In a first aspect, the present application provides a disease diagnosis method based on multi-modal data, for performing disease diagnosis according to the multi-modal data, the multi-modal data including question text information, disorder text information, audio information and image information, the method comprising the steps of:
extracting question text features, disorder text features, audio features and image features from the question text information, the disorder text information, the audio information and the image information, respectively;
mapping the disorder text features, the audio features and the image features to the same dimension, and performing feature alignment processing;
fusing the feature-aligned disorder text features, audio features and image features to obtain a fusion feature vector;
splicing the question text features with the fusion feature vector to obtain a splice vector;
and feeding the splice vector into a pre-trained language model to generate a diagnosis result.
The disease diagnosis method based on multi-modal data can fully mine the correlations and differences between data of different modalities, solves the problem that relationships among multi-modal data are hard to capture owing to inter-data differences, effectively fuses the various features, and avoids the misjudgments and missed diagnoses to which models trained on single-modality data are prone.
In the disease diagnosis method based on multi-modal data, the feature alignment processing comprises:
aligning the disorder text features and the audio features mapped to the same dimension with the image features, respectively, based on a triplet ranking loss function.
In the disease diagnosis method based on multi-modal data, the triplet ranking loss function is a hinge-based triplet ranking loss function, whose expression is:

$L_{matching} = \left[\alpha - S(I, V) + S(\hat{I}_V, \hat{V})\right]_+ + \left[\beta - S(I, T) + S(\hat{I}_T, \hat{T})\right]_+$

wherein $L_{matching}$ is the hinge-based triplet ranking loss function; $\alpha$ and $\beta$ are margin parameters; $I$, $V$ and $T$ are the image features, audio features and disorder text features mapped to the same dimension, respectively; $\hat{V}$ and $\hat{I}_V$ are the local audio feature and local image feature with the lowest similarity within the feature pair $(I, V)$; $\hat{T}$ and $\hat{I}_T$ are the local text feature and local image feature with the lowest similarity within the feature pair $(I, T)$; $S(\cdot,\cdot)$ is a similarity function; and $[x]_+ = \max(x, 0)$ denotes taking the maximum of $x$ and 0.
The hinge-based triplet ranking loss aligns the features of the three modal data with the image features as the alignment reference, so as to fully define the feature relationships among the features of different modal data. The method can thus perform feature alignment while accurately capturing the differences and similarities among the different modal data, building a bridge for the subsequent feature fusion.
In the disease diagnosis method based on multi-modal data, the step of splicing the question text features with the fusion feature vector to obtain a splice vector comprises:
splicing the question text feature, as a token, with the fusion feature vector to obtain the splice vector $[T_m, Z]$, wherein $T_m$ is the question text feature and $Z$ is the fusion feature vector.
In the disease diagnosis method based on multi-modal data, the image features include a plurality of salient image region features.
In the disease diagnosis method based on multi-modal data, feature extraction is performed on the question text information and the disorder text information by the same text encoder to obtain the question text features and the disorder text features.
In the disease diagnosis method based on multi-modal data, the step of fusing the feature-aligned disorder text features, audio features and image features to obtain a fusion feature vector comprises the following steps:
obtaining, based on a gated expert neural network, the text feature fusion weight, audio feature fusion weight and image feature fusion weight corresponding to the disorder text features, the audio features and the image features, respectively;
and generating the fusion feature vector by fusing the feature-aligned disorder text features, audio features and image features according to the text feature fusion weight, the audio feature fusion weight and the image feature fusion weight.
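One plausible reading of the gated fusion step, sketched with random placeholder weights — the gate architecture, its parameters, and the pooling of region features are assumptions, since the patent does not specify them:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(T, V, I_pooled, W_g):
    """A gate network produces one fusion weight per modality from the
    concatenated aligned features; the fusion vector Z is their weighted sum."""
    gate_in = np.concatenate([T, V, I_pooled])   # (3*D,)
    weights = softmax(W_g @ gate_in)             # (3,) fusion weights, sum to 1
    Z = weights[0] * T + weights[1] * V + weights[2] * I_pooled
    return Z, weights

D = 4
rng = np.random.default_rng(0)
T = rng.normal(size=D)          # aligned disorder text feature (placeholder)
V = rng.normal(size=D)          # aligned audio feature (placeholder)
I_pooled = rng.normal(size=D)   # pooled aligned image region features (placeholder)
W_g = rng.normal(size=(3, 3 * D))   # gate parameters (untrained placeholder)

Z, w = gated_fusion(T, V, I_pooled, W_g)
print(Z.shape)                  # (4,)
```

In a trained system the gate parameters would be learned jointly with the rest of the model, letting the network emphasize whichever modality carries the strongest diagnostic signal for a given case.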
In a second aspect, the present application also provides a disease diagnosis apparatus based on multi-modal data, for performing disease diagnosis according to the multi-modal data, the multi-modal data including question text information, disorder text information, audio information and image information, the apparatus comprising:
a feature extraction module, configured to extract question text features, disorder text features, audio features and image features from the question text information, disorder text information, audio information and image information, respectively;
a feature mapping module, configured to map the disorder text features, the audio features and the image features to the same dimension and perform feature alignment processing;
a feature fusion module, configured to fuse the feature-aligned disorder text features, audio features and image features to obtain a fusion feature vector;
a splicing module, configured to splice the question text features with the fusion feature vector to obtain a splice vector;
and a diagnosis module, configured to feed the splice vector into a pre-trained language model to generate a diagnosis result.
The disease diagnosis apparatus based on multi-modal data can fully mine the correlations and differences between data of different modalities, solves the problem that relationships among multi-modal data are hard to capture owing to inter-data differences, effectively fuses the various features, and avoids the misjudgments and missed diagnoses to which models trained on single-modality data are prone.
In a third aspect, the present application also provides an electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method as provided in the first aspect above.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as provided in the first aspect above.
As can be seen from the above, the present application provides a disease diagnosis method, device, equipment and medium based on multi-modal data. In the method, after the corresponding features are extracted from the multi-modal data, a fusion feature vector suitable for the language model is obtained through mapping, feature alignment and fusion processing, and is combined with the question text feature, serving as a token, to form a splice vector that is fed into the language model to generate a diagnosis result. The method can fully mine the correlations and differences between data of different modalities, solves the problem that relationships among multi-modal data are hard to capture owing to inter-data differences, effectively fuses the various features, and avoids the misjudgments and missed diagnoses to which models trained on single-modality data are prone; the generated diagnosis result can provide doctors with a more accurate diagnostic basis, effectively relieving physician fatigue and improving disease diagnosis accuracy.
Drawings
Fig. 1 is a flowchart of a disease diagnosis method based on multi-modal data according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a disease diagnosis device based on multi-modal data according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 201. feature extraction module; 202. feature mapping module; 203. feature fusion module; 204. splicing module; 205. diagnosis module; 301. processor; 302. memory; 303. communication bus.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
In a first aspect, referring to fig. 1, some embodiments of the present application provide a disease diagnosis method based on multi-modal data, for performing disease diagnosis according to the multi-modal data, where the multi-modal data includes question text information, disorder text information, audio information, and image information, the method includes the following steps:
s1, extracting question text features, disorder text features, audio features and image features according to question text information, disorder text information, audio information and image information respectively;
s2, mapping the symptom text features, the audio features and the image features to the same dimension, and performing feature alignment processing;
s3, acquiring fusion feature vectors by using the disease text features, the audio features and the image features with the fusion features aligned;
s4, acquiring a splicing vector by splicing the text features and the fusion feature vectors of the problems;
s5, placing the spliced vector into a pre-trained language model to generate a diagnosis result.
Specifically, the multi-modal data are all the data information related to the physical state of the subject. The question text information is a question template related to the disease, such as the text of questions on whether the subject has a specific disease history, whether the disease has spread, whether there is a drug allergy, and the like. The disorder text information is the subject's own description of their illness, which may include text answers corresponding to the questions as well as the subject's description of their physical state. The audio information is disorder audio data about the subject, including audio data on the subject's physical condition acquired by medical instruments, such as heartbeat audio data, pulse-diagnosis audio data, bone sound data, etc. The image information is disorder image data about the subject, such as color ultrasound data, B-ultrasound data, and ordinary photographs of disease sites.
More specifically, in step S1, pre-trained encoders are used to extract the corresponding modal features from the corresponding types of modal data; for example, a text encoder is used to extract the question text features from the question text information.
More specifically, to solve the problem that, owing to the differences among the features of different modal data, the relationships between different features cannot be effectively captured and are difficult to exploit comprehensively, the method of the embodiment of the present application maps the features of the different modal data into a unified feature space and performs feature alignment processing, so that step S3 can effectively fuse the aligned features of the multiple modal data within the same feature space.
More specifically, the three types of feature-aligned features can be fused based on an existing feature fusion model, generating a feature vector as the input data of a language model, where the language model is a text-output model such as GPT-4, an n-gram model or GloVe. In this embodiment, through the mapping, feature alignment and fusion processing, the features of the multi-modal data are effectively fused, so that the language model can be trained on and analyze the features of the multi-modal data comprehensively, performing disease diagnosis according to both the individual and the correlated disease features of the different modal data, which effectively improves diagnostic accuracy.
More specifically, the splice vector input to the language model also includes the question text feature, which serves as a token for verifying access permission to the language model. This prevents non-compliant questions from affecting the diagnostic analysis results and prevents illegitimate use of the language model, while also guiding the direction of the disease analysis, so that the language model can perform access verification and consultation guidance according to the token, effectively improving the safety and accuracy of disease diagnosis.
According to the disease diagnosis method based on multi-modal data of the embodiments of the present application, after the corresponding features are extracted from the multi-modal data, a fusion feature vector suitable for the language model is obtained through mapping, feature alignment and fusion processing, and is combined with the question text feature, serving as a token, to form a splice vector that is fed into the language model to generate a diagnosis result. The method can fully mine the correlations and differences between data of different modalities, solves the problem that relationships among multi-modal data are hard to capture owing to inter-data differences, effectively fuses the various features, and avoids the misjudgments and missed diagnoses to which models trained on single-modality data are prone; the generated diagnosis result can provide doctors with a more accurate diagnostic basis, effectively relieving physician fatigue and improving disease diagnosis accuracy.
In some preferred embodiments, step S1 comprises:
the method comprises the steps of obtaining problem text features according to problem text information by using a text encoder, obtaining disorder text features according to disorder text information by using a text encoder, obtaining audio features according to audio information by using an audio encoder, and obtaining image features according to image information by using an image encoder.
The problem text information and the disorder text information can adopt the same or different text encoders to extract corresponding features; in the embodiment of the application, since the problem text information and the disorder text information are correspondingly related and belong to text contents, the problem text information and the disorder text information are preferably extracted based on the same text encoder to obtain the problem text characteristics and the disorder text characteristics, so that data analysis resources can be saved, and the problem text characteristics and the disorder text characteristics obtained in the step S1 are ensured to keep the relativity.
More specifically, the type of each encoder can be selected according to the characteristics of the input data. In the embodiment of the present application, the text encoder preferably adopts a doc2vec model; the doc2vec model can learn a fixed-length feature representation from text of indefinite length, i.e., it can generate question text features or disorder text features of a specific length from question text information or disorder text information of differing lengths, so as to represent specific document content with a feature vector of a single length. The network parameters of the doc2vec model are denoted $W_T$, the question text information is denoted $X_{Pr}$, the disorder text information is denoted $X_T$, the question text feature is denoted $T_m$, and the disorder text feature is denoted $T$.
More specifically, in the embodiment of the present application, the audio encoder preferably adopts a Transformer model, which uses a self-attention mechanism to capture long-range dependencies in the sequence, mitigating the problem that sound events in the audio information may be far apart. In addition, the Transformer model comprises multiple encoder and decoder layers and can learn feature representations of the audio data at different levels, which helps capture audio structures of different time scales and complexities so as to obtain audio features that accurately represent disease characteristics. The network parameters of the Transformer model are denoted $W_V$, the audio information is denoted $X_V$, and the audio feature is denoted $V$.
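A minimal single-head self-attention step, illustrating how a Transformer-style audio encoder lets every frame attend to all other frames, however far apart. The weights here are random placeholders, not a trained encoder:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a frame sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])                 # (L, L) frame-pair scores
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)                   # row-wise softmax
    return A @ V                                           # each output frame mixes all frames

rng = np.random.default_rng(3)
L, d = 6, 4                                   # 6 audio frames, feature dim 4 (assumed)
X = rng.normal(size=(L, d))                   # placeholder audio frame features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (6, 4)
```

Because the attention matrix connects every frame pair directly, distant sound events (e.g. two heartbeat anomalies seconds apart) contribute to each other's representation in one step.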
More specifically, in the embodiment of the present application, the image encoder preferably adopts a yolov5 model, which performs feature extraction based on predicting instance classes and attribute classes rather than object classes in the image, and can therefore learn feature representations containing richer semantic information. The network parameters of the yolov5 model are denoted $W_I$, the image information is denoted $X_I$, and the image feature is denoted $I$.
More specifically, the instance classes include objects and salient regions that are difficult to identify, such as attributes like "ground glass" and objects like "lung" and "heart", and one piece of image information may contain several such feature regions. Therefore, the method of the embodiment of the present application applies non-maximum suppression with a preset intersection-over-union (IoU) threshold (settable according to usage requirements, e.g. 0.7) to the final output of the yolov5 model, and sets a target-detection confidence threshold (settable according to usage requirements, e.g. 0.3), so that the yolov5 model can screen out the N image key regions with high class-detection confidence from the image information (N can be set according to usage requirements and is a positive integer), where $I_i$ denotes the image region feature of the i-th image key region among the image features.
Thus, in the embodiments of the present application, the image features include a plurality of salient image region features, denoted $I = \{I_1, \dots, I_i, \dots, I_N\}$; the image region features of all the key regions jointly identify the whole picture, and different image region features can represent disorders of different subjects in the image information (e.g., corresponding to different human organs).
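The region-screening step can be sketched as confidence filtering plus a top-N cut. The detector's internal IoU-based non-maximum suppression is omitted here, and the thresholds and detections are illustrative placeholders:

```python
def select_key_regions(detections, conf_thresh=0.3, top_n=3):
    """Keep detections at or above the confidence threshold,
    then retain the top-N by confidence."""
    kept = [d for d in detections if d["conf"] >= conf_thresh]
    kept.sort(key=lambda d: d["conf"], reverse=True)
    return kept[:top_n]

# Placeholder detector output: labels and confidences are invented examples.
dets = [
    {"label": "lung", "conf": 0.92},
    {"label": "ground glass", "conf": 0.81},
    {"label": "heart", "conf": 0.45},
    {"label": "noise", "conf": 0.12},   # falls below the 0.3 threshold
]
regions = select_key_regions(dets)
print([d["label"] for d in regions])    # ['lung', 'ground glass', 'heart']
```

Each surviving region would then contribute one region feature $I_i$ to the set $I = \{I_1, \dots, I_N\}$.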
In some preferred embodiments, the step of mapping the disorder text feature, the audio feature, and the image feature to the same dimension is mapping the disorder text feature, the audio feature, and the image feature to the same dimension based on respective mapping networks; wherein the disorder text features are mapped based on a text mapping network, the audio features are mapped based on an audio mapping network, and the image features are mapped based on an image mapping network.
Specifically, the parameters of the different mapping networks may be preset or obtained through training; the network parameters of the mapping networks corresponding to the disorder text features, the audio features and the image features are denoted $\theta_T$, $\theta_V$ and $\theta_I$, respectively. The purpose of the three mapping networks is to map the features of the three modal data into the same dimension, i.e., into the same feature space, so that the relationships among the features of the different modal data can be fully mined in the subsequent processing steps. The common dimension is denoted $\mathbb{R}^D$; since the image features contain a plurality of salient image region features, the dimension of the mapped image features can be denoted $\mathbb{R}^{N \times D}$.
More specifically, the mapped disorder text features, audio features and image features only change in dimension, not in feature content, so the method of the embodiment of the present application still uses $T$, $V$ and $I$ to denote the mapped disorder text features, audio features and image features, respectively (the disorder text features, audio features and image features in the subsequent processing steps refer to their mapped versions).
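A minimal sketch of the three mapping networks as linear projections into a shared D-dimensional space. Linear maps and all dimensions are assumptions; the patent does not fix the network architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d_T, d_V, d_I = 6, 5, 7        # raw per-modality feature dims (assumed)
D, N = 4, 3                    # common dimension D, N image key regions (assumed)

theta_T = rng.normal(size=(D, d_T))   # text mapping network parameters
theta_V = rng.normal(size=(D, d_V))   # audio mapping network parameters
theta_I = rng.normal(size=(D, d_I))   # image mapping network parameters

T = theta_T @ rng.normal(size=d_T)            # disorder text feature -> R^D
V = theta_V @ rng.normal(size=d_V)            # audio feature -> R^D
I = rng.normal(size=(N, d_I)) @ theta_I.T     # N region features -> R^(N x D)

print(T.shape, V.shape, I.shape)      # (4,) (4,) (3, 4)
```

Once all three modalities live in the same $\mathbb{R}^D$ space, a single similarity function can compare them during alignment.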
In order to mine the relationships among the features of different modal data more accurately, the method of the embodiment of the present application also needs to align the three mapped features. The feature alignment processing may take an external reference feature as the standard, or take one of the three features as the standard; since the image information contains the most extensive disease-related semantic information, in some preferred embodiments, the feature alignment processing comprises:
the disorder text features and audio features mapped to the same dimension are respectively aligned with the image features based on the ternary ordering loss function.
In some preferred embodiments, the method of the embodiments of the present application designs a hinge-based ternary ordering loss function on the basis of the existing ternary ordering loss function, whose expression is:
L_matching = [α + S(I, V) − S(Î_V, V̂)]_+ + [β + S(I, T) − S(Î_T, T̂)]_+    (1)
wherein L_matching is the hinge-based ternary ordering loss function; α and β are edge parameters; I, V and T are the image features, audio features and disorder text features mapped to the same dimension, respectively; V̂ and Î_V are respectively the local audio feature and the local image feature with the lowest similarity between the feature pair (I, V); T̂ and Î_T are respectively the local text feature and the local image feature with the lowest similarity between the feature pair (I, T); S(·,·) is a similarity function; and [x]_+ = max(x, 0) denotes taking the maximum of x and 0.
It should be noted that the similarity here is a distance-like measurement of element similarity: in the embodiment of the present application, the lower the similarity value, the more similar the elements.
Specifically, the hinge-based ternary ordering loss function can perform feature alignment processing on the features of the three modalities of data simultaneously; the feature alignment of different modal data is achieved by optimizing the feature pair (i.e., two types of features) with the lowest similarity, which promotes alignment among all the features. Based on formula (1), the feature alignment comprises two parts: the first part takes the image features and the audio features as training queries to align them with each other, and the second part takes the image features and the disorder text features as training queries to align them with each other.
More specifically, as can be seen from formula (1), α and β are edge parameters controlling the alignment of the image features with the audio features and with the disorder text features, respectively; both are super parameters greater than 0 and can be set according to the use requirements, provided the hinge-based ternary ordering loss function can still be minimized as a whole during adjustment. In the embodiment of the present application, α and β are preferably both 0.2, so that the hinge-based ternary ordering loss function can be optimized stably to achieve feature alignment of the three modalities of data.
More specifically, the similarity function S(·,·) measures the similarity between its two arguments and may be selected according to the use requirements; in the embodiment of the present application, the similarity function preferably uses a measure based on cosine similarity (the smaller its output value, the more similar the two elements).
More specifically, V̂, Î_V, T̂ and Î_T respectively satisfy the following:
V̂ = argmin_{V_j ∈ V} S(I, V_j)    (2)
Î_V = argmin_{I_d ∈ I} S(I_d, V)    (3)
T̂ = argmin_{T_r ∈ T} S(I, T_r)    (4)
Î_T = argmin_{I_e ∈ I} S(I_e, T)    (5)
wherein argmin is the objective-minimizing function, so that j in formula (2) is the j-th local audio feature found by traversing V; d in formula (3) is the d-th local image feature found by traversing I; r in formula (4) is the r-th local text feature found by traversing T; and e in formula (5) is the e-th local image feature found by traversing I. The indices j, d, r and e are determined by traversing the corresponding local features with the argmin function in formulas (2)-(5).
More specifically, minimizing L_matching determines the optimal V̂, Î_V, T̂ and Î_T, so that the audio features and image features can be aligned according to V̂ and Î_V, while the disorder text features and image features are aligned according to T̂ and Î_T. Feature alignment is thus performed with the image features as the alignment reference, the relations among the features of different modal data are fully clarified, and the method of the embodiment of the present application can perform feature alignment on the premise of accurately capturing the differences and similarities among different modal data, building a bridge for subsequent feature fusion.
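The alignment step can be sketched as follows. Since the original formulas (1)-(5) did not survive reproduction as images, the sign convention of the hinge terms and the pooling of local features into a whole-feature representation are assumptions; the code only illustrates the mechanism of argmin hard-pair selection plus a hinge over similarity differences:

```python
import numpy as np

def S(a, b):
    """Cosine-based similarity measure: the smaller the value, the more similar."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hardest_locals(I_loc, X_loc):
    """Formulas (2)-(5) (sketch): traverse each side's local features and pick,
    via argmin of S, the local pair with the lowest similarity value."""
    I_whole, X_whole = I_loc.mean(axis=0), X_loc.mean(axis=0)  # pooling assumed
    x_hat = X_loc[np.argmin([S(I_whole, x) for x in X_loc])]
    i_hat = I_loc[np.argmin([S(i, X_whole) for i in I_loc])]
    return x_hat, i_hat

def l_matching(I_loc, V_loc, T_loc, alpha=0.2, beta=0.2):
    """Formula (1) (sketch): hinge-based ternary ordering loss over both pairs."""
    hinge = lambda x: max(x, 0.0)  # [x]_+ = max(x, 0)
    v_hat, iv_hat = hardest_locals(I_loc, V_loc)
    t_hat, it_hat = hardest_locals(I_loc, T_loc)
    I_w, V_w, T_w = I_loc.mean(0), V_loc.mean(0), T_loc.mean(0)
    return (hinge(alpha + S(I_w, V_w) - S(iv_hat, v_hat))
          + hinge(beta  + S(I_w, T_w) - S(it_hat, t_hat)))

rng = np.random.default_rng(1)
loss = l_matching(rng.standard_normal((5, 8)),   # image region features
                  rng.standard_normal((3, 8)),   # local audio features
                  rng.standard_normal((4, 8)))   # local text features
print(loss)
```

The hinge keeps the loss non-negative and inactive once the margin (α or β) is satisfied, which is what lets the alignment term be optimized jointly with other losses without extra labels.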
After the relations among the features of different modal data have been clarified, the features of the multi-modal data mapped into the same dimension can be fused using existing feature fusion means. Because the disease characteristics reflected by different types of modal data differ to some extent, and exert different constraints and influences on different diagnosis results, the method of the embodiment of the present application preferably uses a gating expert neural network to fuse the features of the multi-modal data: it acquires the influence relationship between the features of different modal data and the task (disease diagnosis) so as to determine the fusion weights of the features of different modal data for feature fusion. Thus, in some preferred embodiments, the step of obtaining a fusion feature vector from the aligned disorder text feature, audio feature and image feature comprises:
acquiring text feature fusion weights, audio feature fusion weights and image feature fusion weights corresponding to symptom text features, audio features and image features respectively based on a gating expert neural network;
and generating a fusion feature vector by weighting the aligned disorder text feature, audio feature and image feature with the text feature fusion weight, the audio feature fusion weight and the image feature fusion weight, respectively.
Specifically, the gating expert neural network is a multi-layer perceptron network used to assign fusion weights of corresponding sizes to the features of different modal data according to the correlation between those features and the task, so that the features of different modal data are effectively fused. In the embodiment of the present application, the text feature fusion weight is denoted g_T, the audio feature fusion weight is denoted g_V, and the image feature fusion weight is denoted g_I, satisfying g_T + g_V + g_I = 1, and the three fusion weights are determined based on the following formula:
g_V, g_I, g_T = softmax(W_g·[V, I, T] + b_g)    (6)
wherein softmax is a normalization function, W_g is the network parameter of the gating expert neural network, b_g is the bias value of the gating expert neural network, and [·,·] denotes the vector splicing operator. After g_V, g_I and g_T are determined based on formula (6), the three feature-aligned modal features can be fused, and the fusion feature vector obtained after fusion, denoted Z, satisfies:
Z = g_T·T + g_V·V + g_I·I, Z ∈ R^D    (7)
the fusion feature vector Z obtained based on the fusion processing synthesizes the modal features of the disease text information, the audio information and the video information, fully considers the relevance of different modal features to disease diagnosis for weight distribution, and can effectively improve the accuracy of the subsequently generated diagnosis result.
In some preferred embodiments, the step of stitching the question text feature and the fused feature vector to obtain a stitched vector comprises:
splicing the question text feature, serving as a token, with the fusion feature vector to obtain the spliced vector [T_m, Z], wherein T_m is the question text feature and Z is the fusion feature vector.
Specifically, the question text feature used as a token can serve as an access request to the language model: the language model performs analysis only after detecting that a token meeting the verification requirement exists in the spliced vector, which improves the safety and stability of model analysis. In other embodiments, the method of the embodiment of the present application provides multiple kinds of question text information; the language model can identify and classify the question text features obtained from the different types of question text information in the spliced vector before performing disease analysis, i.e., the question text feature serving as a token also acts as a preliminary screening means for the disease type, further improving the accuracy of disease diagnosis.
In some preferred embodiments, the language model is a learning model that outputs diagnostic conclusions in the form of language (e.g., text, speech), and the network parameters of the language model are denoted θ_lan.
Specifically, the language model, the encoder, the mapping networks and the gating expert neural network are all pre-trained models; they can be trained and determined independently before formal use, or jointly learned and trained together, the latter being preferable in the embodiment of the present application. The loss function of the whole formed by the language model, the encoder, the mapping networks and the gating expert neural network is defined as L_ce, and:
L_ce = −Σ_t log P(y_t | [T_m, Z], y_<t)    (8)
wherein Y_T is the training label, i.e., the actually collected real diagnosis text, such as a disease description determined by a doctor for the symptoms of the inquiry object; updating the network parameters involved in L_ce using the gradient descent method completes the training of the language model, the encoder, the mapping networks and the gating expert neural network.
More specifically, since adjusting the network parameters of the encoder affects the output of the features of the different modal data and thus the feature alignment effect, in the embodiment of the present application the process of training the language model, the encoder, the mapping networks and the gating expert neural network also introduces L_matching, so an overall loss function L is defined that satisfies:
L = γ·L_ce + δ·L_matching    (9)
wherein γ and δ are super parameters controlling the influence of L_ce and L_matching, respectively, and may be set according to actual needs; in the embodiment of the present application, γ = δ = 0.5 is preferable.
More specifically, based on the foregoing, L_matching is a loss value determined by traversing features and therefore requires no additional labels for training; its value adapts to the dimension-adjusted features extracted by the encoder. The overall loss function trains the network parameters in the same manner as L_ce, i.e., by the gradient descent method, and the overall network parameters are denoted θ = (θ_lan, θ_T, θ_V, θ_I, W_T, W_V, W_I, W_g, b_g). Since the feature alignment relation is fully considered in the model training process, different modal features can be accurately mapped into the same feature space and comprehensively analyzed after alignment, which effectively improves the accuracy of language model analysis and thus the accuracy of disease diagnosis.
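The joint objective of formulas (8)-(9) amounts to a token-level cross entropy on the diagnosis text plus the weighted alignment term. The autoregressive cross-entropy form below is an assumed reading of the lost formula (8); the vocabulary size and sequence length are illustrative:

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Formula (8) (assumed form): mean negative log-likelihood of the
    ground-truth diagnosis tokens Y_T under the language model's logits."""
    logits = logits - logits.max(axis=1, keepdims=True)           # stabilize
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(log_probs[np.arange(len(target_ids)), target_ids]))

def total_loss(l_ce, l_match, gamma=0.5, delta=0.5):
    """Formula (9): L = gamma * L_ce + delta * L_matching."""
    return gamma * l_ce + delta * l_match

rng = np.random.default_rng(3)
l_ce = cross_entropy(rng.standard_normal((6, 100)),   # 6 tokens, 100-word vocab
                     rng.integers(0, 100, size=6))
L = total_loss(l_ce, 0.4)
print(l_ce > 0.0, L)
```

With γ = δ = 0.5 as in the embodiment, both terms contribute equally to the gradient-descent update of the overall parameters θ.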
In a second aspect, referring to fig. 2, some embodiments of the present application further provide a disease diagnosis device based on multi-modal data, for performing disease diagnosis according to the multi-modal data, where the multi-modal data includes problem text information, disorder text information, audio information, and image information, the device includes:
a feature extraction module 201, configured to extract a question text feature, a disorder text feature, an audio feature, and an image feature according to the question text information, the disorder text information, the audio information, and the image information, respectively;
the feature mapping module 202 is configured to map the symptom text feature, the audio feature and the image feature to the same dimension, and perform feature alignment processing;
the feature fusion module 203 is configured to fuse the feature-aligned symptom text feature, the audio feature and the image feature to obtain a fusion feature vector;
the splicing module 204 is used for splicing the text features and the fusion feature vectors of the problems to obtain splicing vectors;
the diagnosis module 205 is configured to put the stitching vector into a pre-trained language model to generate a diagnosis result.
According to the disease diagnosis device based on multi-modal data, after the corresponding features are extracted from the multi-modal data, a fusion feature vector that can be put into a language model is obtained through mapping, feature alignment and fusion processing, and a spliced vector is formed by combining it with the question text feature serving as a token, which is put into the language model to generate a diagnosis result. The disease diagnosis device can fully mine the relevance and the differences between different modal data, solving the problem that the relations between multi-modal data are difficult to capture due to the differences between the data; it effectively fuses the various features, alleviating the misjudgment and missed diagnosis easily caused by training a model on limited modal data, and can thus provide a more accurate diagnosis basis for doctors, effectively relieve doctor fatigue, and improve disease diagnosis accuracy.
In some preferred embodiments, the disease diagnosis apparatus based on multimodal data of the embodiments of the present application is used to perform the disease diagnosis method based on multimodal data provided in the first aspect described above.
In a third aspect, referring to fig. 3, some embodiments of the present application further provide a schematic structural diagram of an electronic device. The present application provides an electronic device, including: a processor 301 and a memory 302, the processor 301 and memory 302 being interconnected and communicating with each other through a communication bus 303 and/or another form of connection mechanism (not shown); the memory 302 stores computer readable instructions executable by the processor 301, and when the electronic device runs, the processor 301 executes the instructions to perform the method in any of the alternative implementations of the embodiments described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method in any of the alternative implementations of the above embodiments. The computer readable storage medium may be implemented by any type or combination of volatile or non-volatile Memory devices, such as static random access Memory (Static Random Access Memory, SRAM), electrically erasable Programmable Read-Only Memory (EEPROM), erasable Programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), programmable Read-Only Memory (PROM), read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (9)

1. A disease diagnosis method based on multi-modal data for performing disease diagnosis according to the multi-modal data, wherein the multi-modal data includes question text information, disorder text information, audio information, and image information, the method comprising the steps of:
extracting question text features, disorder text features, audio features and image features according to the question text information, the disorder text information, the audio information and the image information respectively;
mapping the disorder text features, the audio features and the image features to the same dimension, and performing feature alignment processing;
obtaining a fusion feature vector by fusing the feature-aligned disorder text features, audio features and image features;
splicing the problem text features and the fusion feature vectors to obtain splicing vectors;
placing the spliced vector into a pre-trained language model to generate a diagnosis result;
the step of splicing the problem text features and the fusion feature vector to obtain a spliced vector comprises the following steps:
splicing the question text feature, serving as a token, with the fusion feature vector to obtain the spliced vector [T_m, Z], wherein T_m is the question text feature and Z is the fusion feature vector.
2. The multi-modal data-based disease diagnosis method of claim 1, wherein the feature alignment process comprises:
the disorder text features and audio features mapped to the same dimension are respectively aligned with the image features based on the ternary ordering loss function.
3. The multi-modal data-based disease diagnosis method of claim 2, wherein the ternary ordering loss function is a hinge-based ternary ordering loss function, the expression of which is:
L_matching = [α + S(I, V) − S(Î_V, V̂)]_+ + [β + S(I, T) − S(Î_T, T̂)]_+
wherein L_matching is the hinge-based ternary ordering loss function; α and β are edge parameters; I, V and T are the image features, audio features and disorder text features mapped to the same dimension, respectively; V̂ and Î_V are respectively the local audio feature and the local image feature with the lowest similarity between the feature pair (I, V); T̂ and Î_T are respectively the local text feature and the local image feature with the lowest similarity between the feature pair (I, T); S(·,·) is a similarity function; and [x]_+ = max(x, 0) denotes taking the maximum of x and 0.
4. The multi-modality data based disease diagnostic method of claim 1, wherein the image features include a plurality of salient image area features.
5. The multi-modal data-based disease diagnosis method of claim 1, wherein the question text information and the disorder text information are feature extracted based on a same text encoder to obtain the question text feature and the disorder text feature.
6. The method of claim 1, wherein the step of obtaining a fusion feature vector by fusing the feature-aligned disorder text features, audio features and image features comprises:
acquiring text feature fusion weights, audio feature fusion weights and image feature fusion weights corresponding to symptom text features, audio features and image features respectively based on a gating expert neural network;
and generating the fusion feature vector by weighting the aligned disorder text features, audio features and image features with the text feature fusion weight, the audio feature fusion weight and the image feature fusion weight, respectively.
7. A disease diagnosis apparatus based on multi-modal data for performing disease diagnosis based on the multi-modal data, wherein the multi-modal data includes question text information, disorder text information, audio information, and image information, the apparatus comprising:
the feature extraction module is used for extracting problem text features, disorder text features, audio features and image features according to the problem text information, disorder text information, audio information and image information respectively;
the feature mapping module is used for mapping the symptom text features, the audio features and the image features to the same dimension and carrying out feature alignment processing;
the feature fusion module is used for fusing the symptom text features, the audio features and the image features after feature alignment to obtain fusion feature vectors;
the splicing module is used for splicing the text features of the problems and the fusion feature vectors to obtain splicing vectors;
the diagnosis module is used for placing the spliced vector into a pre-trained language model to generate a diagnosis result;
the step of splicing the problem text features and the fusion feature vector to obtain a spliced vector comprises the following steps:
splicing the question text feature, serving as a token, with the fusion feature vector to obtain the spliced vector [T_m, Z], wherein T_m is the question text feature and Z is the fusion feature vector.
8. An electronic device comprising a processor and a memory storing computer readable instructions that, when executed by the processor, perform the steps in the method of any of claims 1-7.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the method according to any of claims 1-7.
CN202310550630.5A 2023-05-16 2023-05-16 Disease diagnosis method, device, equipment and medium based on multi-mode data Active CN116259407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310550630.5A CN116259407B (en) 2023-05-16 2023-05-16 Disease diagnosis method, device, equipment and medium based on multi-mode data


Publications (2)

Publication Number Publication Date
CN116259407A CN116259407A (en) 2023-06-13
CN116259407B true CN116259407B (en) 2023-07-25

Family

ID=86682951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310550630.5A Active CN116259407B (en) 2023-05-16 2023-05-16 Disease diagnosis method, device, equipment and medium based on multi-mode data


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340778B (en) * 2023-05-25 2023-10-03 智慧眼科技股份有限公司 Medical large model construction method based on multiple modes and related equipment thereof
CN116631567B (en) * 2023-07-21 2023-10-13 紫东信息科技(苏州)有限公司 Gastroscopy report generation device, equipment and computer readable storage medium
CN117611845B (en) * 2024-01-24 2024-04-26 浪潮通信信息系统有限公司 Multi-mode data association identification method, device, equipment and storage medium
CN118155638A (en) * 2024-05-07 2024-06-07 武汉人工智能研究院 Speech generation and understanding system, method and electronic equipment based on large language model

Citations (3)

Publication number Priority date Publication date Assignee Title
CN113095081A (en) * 2021-06-11 2021-07-09 深圳市北科瑞声科技股份有限公司 Disease identification method and device, storage medium and electronic device
CN113780012A (en) * 2021-09-30 2021-12-10 东南大学 Depression interview conversation generation method based on pre-training language model
CN114579723A (en) * 2022-03-02 2022-06-03 平安科技(深圳)有限公司 Interrogation method and apparatus, electronic device, and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN115223678A (en) * 2022-07-27 2022-10-21 重庆师范大学 X-ray chest radiography diagnosis report generation method based on multi-task multi-mode deep learning
CN115732076A (en) * 2022-11-16 2023-03-03 四川大学华西医院 Fusion analysis method for multi-modal depression data




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant