CN114201592A - Visual question-answering method for medical image diagnosis - Google Patents

Visual question-answering method for medical image diagnosis

Info

Publication number
CN114201592A
CN114201592A (application CN202111461563.7A; granted publication CN114201592B)
Authority
CN
China
Prior art keywords
image
question
features
medical
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111461563.7A
Other languages
Chinese (zh)
Other versions
CN114201592B (en)
Inventor
蔡林沁
陈珂佳
方豪度
赖廷杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202111461563.7A
Publication of CN114201592A
Application granted
Publication of CN114201592B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention seeks to protect a visual question-answering method for medical image diagnosis, which belongs to the fields of medical image processing, natural language processing and multi-modal fusion, and comprises the following steps: acquiring medical images and the corresponding medical questions; extracting features from the lesion targets in the image and from the medical question text, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question; processing the same lesion target by interacting the image features with the position features, realizing relational association modeling to obtain the relative position relationships of different targets and improve the matching of the multi-modal features; introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities; and designing and selecting a fusion scheme and a classifier, applied to medical question answering, so as to realize visual question answering for medical image diagnosis.

Description

Visual question-answering method for medical image diagnosis
Technical Field
The invention belongs to the fields of medical image processing, natural language processing and multi-modal fusion, and particularly relates to a visual question-answering method for medical image diagnosis.
Background
Health has always been one of the issues of greatest concern to human beings. With the continuous development of deep learning, it becomes increasingly important to use different tools and techniques to help doctors make diagnoses and to help patients better understand their own physical condition. Medical imaging is an extremely important tool for physicians in clinical analysis and diagnosis. However, the information different doctors obtain from the same medical image may vary, and the number of doctors is far smaller than the number of patients, so doctors often suffer physical and mental fatigue and find it difficult to manually answer all of the patients' questions.
In Visual Question Answering (VQA), the system is given an image and a question about it, and selects an appropriate answer according to the feature information of the image, outputting the answer in natural language. A good visual question-answering model for medical image diagnosis can automatically extract the information contained in a medical image, capture the location of a lesion and the like; it can offer a radiologist a second opinion on image analysis, realize computer-aided diagnosis, and help strengthen the radiologist's confidence in interpreting complex medical images. At the same time, such a VQA model can help a patient gain a preliminary understanding of his or her own physical condition, which is helpful for choosing a more targeted treatment plan.
However, current mainstream visual question-answering models often ignore the fine-grained interaction between the image and the question. In fact, learning the keywords in a question and obtaining the position information of different image regions can provide useful clues for answer reasoning. Using a mainstream model directly for medical image diagnosis therefore still has several shortcomings. First, most existing methods only realize a coarse interaction between the image and the question and cannot capture the correlation between each image region and the question. Second, they cannot effectively capture the inherent dependencies between words at different positions in a sentence. Third, existing methods only extract visual features from the image and lack spatial features, so they cannot model the correlation between different objects in the image.
A search located publication CN113516182A, which provides a training method and apparatus for a visual question-answering model and a visual question-answering method. That method comprises: acquiring an image sample and a question sample for training the visual question-answering model; performing feature extraction on the image sample to obtain image sample features, and on the question sample to obtain question sample features; determining a latent relation variable between the image sample features and the question sample features, the latent relation variable representing whether the image sample and the question sample are related; and training the visual question-answering model according to the latent relation variable, the image sample features and the question sample features to obtain a target visual question-answering model used to perform visual question answering. With that method, answers of relatively high accuracy can still be given to ambiguous questions. Compared with the present invention, its graph convolution can better understand semantic information, but its complexity is higher. In addition, that technique obtains the image features and the text features separately with a first-weight and a second-weight scheme, and lacks the finer-grained interaction obtained by stacking layers. The present invention further introduces a position association module and pays attention to the position relationships between different objects while deeply interacting the image features and the question features.
CN110321946A discloses a multi-modal medical image recognition method and apparatus based on deep learning, which collects medical image data with medical imaging equipment; enhances the acquired images with an image enhancement algorithm; extracts features from the acquired images with an extraction program; recognizes the extracted features with a recognition program; converts medical images of different modalities with a conversion program; prints the acquired images with a printer; and displays the acquired medical image data with a display. That invention improves the image feature extraction effect through its image feature extraction module; its modality conversion module adopts three-dimensional reconstruction, registration and segmentation to ensure that the images of the first modality and the second modality are highly matched; and it divides the training image into several image blocks, reducing the hardware requirements of feeding in the whole training image. That technique recognizes the extracted features with a feature recognition program and improves recognition ability with an image enhancement algorithm, but it neglects modal interaction, i.e. it lacks the ability to interact with doctors or patients, and cannot intelligently answer patients' questions or efficiently assist doctors in diagnosis. The present invention improves image recognition ability while also considering interaction with the user, making it more intelligent and increasing user participation.
Therefore, in order to better assist doctors in auxiliary diagnosis, and to allow patients to obtain the basic information of an image without consulting a doctor, there is a need to design an explicit mechanism to learn the correlation between questions and images, to build a model that processes both image features and position features, and to apply it to the visual question-answering task for medical image diagnosis.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a visual question-answering method for medical image diagnosis. The technical scheme of the invention is as follows:
A visual question-answering method for medical image diagnosis comprises the following steps:
acquiring medical images and the corresponding medical questions;
extracting features from the lesion targets in the image and from the medical question text respectively, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing the same lesion target by interacting the image features with the position features, realizing relational association modeling to obtain the relative position relationships of different targets and match the multi-modal features;
introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, applied to medical question answering, so as to realize visual question answering for medical image diagnosis.
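As a rough illustration of how these steps fit together, the following is a minimal PyTorch sketch of the overall forward pass. It assumes 2048-dimensional region features and 512-dimensional word features as described later, uses nn.MultiheadAttention as a generic stand-in for the self-recognition and cross-guided units, and omits the position-relation module and the N-layer stacking; the module names and layer sizes are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class MedVQASketch(nn.Module):
    """Simplified stand-in for the pipeline described above: self-attention over each
    modality, cross-guided attention, linear fusion, and a sigmoid answer classifier."""
    def __init__(self, d_img=2048, d_txt=512, d_model=512, n_answers=1000, n_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)
        self.txt_proj = nn.Linear(d_txt, d_model)
        self.self_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_x = nn.Linear(d_model, d_model)
        self.fuse_y = nn.Linear(d_model, d_model)
        self.classifier = nn.Linear(d_model, n_answers)

    def forward(self, region_feats, question_feats):
        x = self.img_proj(region_feats)            # (B, K, d) lesion/object region features
        y = self.txt_proj(question_feats)          # (B, T, d) question word features
        x, _ = self.self_img(x, x, x)              # intra-modal self-recognition (image)
        y, _ = self.self_txt(y, y, y)              # intra-modal self-recognition (question)
        xg, _ = self.cross_img(x, y, y)            # question-guided image attention
        yg, _ = self.cross_txt(y, x, x)            # image-guided question attention
        f = self.fuse_x(xg.mean(1)) + self.fuse_y(yg.mean(1))  # linear multi-modal fusion
        return torch.sigmoid(self.classifier(f))   # per-answer probabilities

probs = MedVQASketch()(torch.randn(2, 36, 2048), torch.randn(2, 14, 512))
print(probs.shape)  # torch.Size([2, 1000])
```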
Further, acquiring the medical images and the corresponding medical questions specifically includes the following steps:
downloading medically related image data and question annotations from the network, where the images are mainly CT and MRI scans, together with the questions matched to each image and the ground-truth answers corresponding to those questions, forming groups of (image, question, answer) objects.
Further, extracting features from the image lesion targets and from the medical question text respectively specifically includes:
performing feature extraction on the images and the questions: a scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network.
Further, obtaining the image features specifically includes: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and then the local features of the image are identified and extracted with the object detection algorithm, namely Faster R-CNN, to obtain the corresponding lesion information; not only an object detector but also an attribute classifier is applied to each region of the image, and each object bounding box has a corresponding attribute class, so that a binary description of the object can be obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
Further, obtaining the question features specifically includes: the input medical question is first tokenized into individual words; questions longer than 14 words are truncated and the excess is discarded, while questions shorter than 14 words are zero-padded; the semantic features of the words are then captured with a 300-dimensional GloVe word-vector model and converted into vectors, and the text features are encoded with an LSTM network to extract the semantic feature information of the question as the input of the subsequent network.
Furthermore, a self-recognition module is provided to obtain the features among image regions and the features among question words; the self-recognition module is an attention model and obtains the inter-region features of the image and the inter-word features of the question through self-correlation learning; the core of the self-recognition module is the attention mechanism; the input consists of queries and keys of dimension d_key and values of dimension d_value; first, the dot products of the query with all keys are computed and each is divided by √d_key; then a softmax function is applied to obtain the weights on the values; in practice, to compute the attention weights of a set of queries simultaneously, the queries are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
Further, the attention model adopts an attention mechanism with H parallel heads, which allows the model to simultaneously attend to information from different representation subspaces at different positions, and the output feature matrix is computed as:

$$F = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}([\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H])\,W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$

The self-recognition module consists of the attention mechanism model and a feed-forward network, and is used to extract the fine-grained features of the image or of the medical question;
the question features are output after the learned attention weights are applied, and are then passed into a LayerNorm layer; the feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final self-attended feature $\hat{Y}$ is obtained through self-attention.
Further, processing the same lesion target by interacting the image features with the position features, realizing relational association modeling and obtaining the relative position relationships of different targets, specifically includes:

each input object consists of an image feature $\hat{X}$ and a position feature P, where $\hat{X}$ is the feature obtained by the self-recognition module and P is a four-dimensional object bounding box;

to calculate the position feature weights, the coordinates of an object are represented as $\{x_i, y_i, w_i, h_i\}$, where $x_i$ denotes the abscissa of the object center, $y_i$ the ordinate of the object center, $w_i$ the width of the object box, and $h_i$ the height of the object box; first, the coordinates of P are transformed as follows:

$$\varepsilon(P_m, P_n) = \left(\log\frac{|x_m - x_n|}{w_m},\; \log\frac{|y_m - y_n|}{h_m},\; \log\frac{w_n}{w_m},\; \log\frac{h_n}{h_m}\right)^{T}$$

where m and n denote the two object boxes; this performs scale normalization and a logarithm operation; the N input objects can then be expressed as $\{(\hat{X}_n, P_n)\}_{n=1}^{N}$;

next, the geometric features of the two objects are embedded into a high-dimensional feature denoted $\varepsilon_G$, which is multiplied by $W_G$ to obtain the weight, where $W_G$ is also realized by a fully connected layer; the final max operation is similar to a ReLU layer, and its main purpose is to impose a certain limit on the position feature weights:

$$w_{mn}^{G} = \max\{0,\; W_G \cdot \varepsilon_G(P_m, P_n)\}$$

where $w_{mn}^{G}$ denotes the position feature weight between the two objects, $\varepsilon_G$ denotes the embedding of the geometric features into the high-dimensional feature, and $P_m, P_n$ denote the geometric features of objects m and n.

The object relationship between the n-th object and the whole set is obtained by the following formula:

$$R(n) = \sum_{m} w_{mn} \cdot (W_V \hat{X}_m)$$

where R(n) denotes the object relationship between the n-th object and the entire set, $\hat{X}_m$ denotes the image feature of the m-th object, $w_{mn}$ is the relation weight between the different objects, and $W_V$ is a linear transformation, so the output is the weighted sum of the linearly transformed image features of the other objects;

the weights $w_{mn}$ and $w_{mn}^{A}$ are computed as:

$$w_{mn} = \frac{w_{mn}^{G}\,\exp(w_{mn}^{A})}{\sum_{k} w_{kn}^{G}\,\exp(w_{kn}^{A})}$$

$$w_{mn}^{A} = \frac{\langle W_K \hat{X}_m,\; W_Q \hat{X}_n \rangle}{\sqrt{d_k}}$$

where $w_{mn}^{A}$ denotes the image feature weight between objects m and n, $w_{mn}^{G}$ denotes the relative position feature weight between objects m and n, k indexes the objects, and $w_{kn}^{G}$ denotes the relative position feature weight between the k-th object and the n-th object;

after the relation features R(n) are obtained, the last step is to fuse the $N_r$ relation features and then fuse them with the image feature $\hat{X}_n$:

$$\hat{X}_n' = \hat{X}_n + \mathrm{Concat}[R^{1}(n), R^{2}(n), \ldots, R^{N_r}(n)]$$
Further, introducing the cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities specifically includes:

the cross-guide module consists of a question-guided image attention module and an image-guided question attention module; the image region features and the question text features are updated by establishing semantic associations between the two different modalities, so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information;

the core of the cross-guided attention module is also the attention mechanism, and its inputs are likewise denoted Q, K and V; taking the question-guided image attention model as an example, the self-recognized image feature $\hat{X}$ and the self-recognized question feature $\hat{Y}$ are mapped to form the inputs of the image cross-attention model and the question cross-attention model and to obtain their outputs;

after the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention-model layers and the output of the previous attention layer is used as the input of the next attention layer; connecting multiple attention-model layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model.
Further, designing and selecting the fusion scheme and the classifier, applied to medical question answering to realize visual question answering for medical image diagnosis, specifically includes:

after the effective features $\tilde{X}$ and $\tilde{Y}$ are obtained, they are sent into a linear multi-modal fusion network to obtain the fused feature f; the fused feature f is then mapped to a vector space $s \in \mathbb{R}^{L}$ and passed through a sigmoid function, where L is the number of the most frequent answers in the training set:

$$s = \mathrm{Linear}(f)$$

$$A = \mathrm{sigmoid}(s)$$

where A denotes the answer predicted by the model.

The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; the prediction is regressed with a binary cross-entropy loss; the loss value of the loss function is determined from the ground-truth answers and the predicted answers, and the model is updated according to the loss value:

$$\mathcal{L} = -\sum_{z=1}^{M}\sum_{k=1}^{N}\left[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\right]$$

where M denotes the number of training questions, N denotes the number of candidate answers, $\hat{s}_{zk}$ denotes the predicted answer output by the model, $s_{zk}$ denotes the ground-truth answer, and z and k index the training questions and the candidate answers, respectively.
The invention has the following advantages and beneficial effects:
in the present invention, we propose a visual question-answering method for medical image diagnosis. Given that many of the existing attention-based VQA methods can only learn the rough interaction of multimodal instances, our models can direct each other's attention and get a correlation between each image region and the problem. The other core idea of the invention is to increase the position attention, which can improve a judgment on the position relation of the object in the image and improve the counting performance of the object in the image. The invention can be used as an effective reference for assisting diagnosis of doctors, thereby greatly improving the diagnosis efficiency; the invention can help the patient to preliminarily know the self physical condition, thereby being beneficial to selecting a more targeted medical scheme.
The method of claims 6-7. The invention firstly considers that the extracted image features and the medical text features are mutually independent features, and firstly uses a self-identification module to emphasize respective emphasis of the image and the text in order to obtain the features with more fineness. The conventional model only considers picture recognition, namely a self-attention model and the like are only used on the picture, but the invention highlights that text features are also important and have key points and keywords in the problem, so that the self-recognition module is not only applied to picture processing but also applied to medical text problems. The finer single-mode model can better perform subsequent model fusion.
The method of claim 8. The common visual question-answering models are questions in some open domains, and it can be found in related data that answers of basic models are often bad and popular when answering related questions about positions. The obtained picture features not only contain original features but also contain rich inter-object position relations.
The method of claim 9. The common visual question-answering model usually adopts a text-guided picture mode to perform multi-mode fusion, and the text information can be guided by neglecting picture information. The cross-guided multi-modal feature fusion stacking mode designed by the invention can capture the complex interaction relationship among multiple modes. Updating image region characteristics and problem text characteristics by establishing semantic association between two different modes to obtain more detailed characteristics; and performing cross fusion characteristic extraction on the sample image characteristic information and the sample problem characteristic information to obtain an image characteristic vector carrying the sample problem information and a sample problem characteristic vector carrying the sample image information.
Drawings
FIG. 1 is a flow chart of a visual question-answering method for medical image diagnosis according to a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the first embodiment: as shown in fig. 1, the present invention provides a visual question-answering method for medical image diagnosis, which implements feature fusion for two cross-modal data, namely, picture and text, helps a doctor to perform an auxiliary diagnosis, and enables a patient to use it to obtain basic information of an image without consulting the doctor.
First, we need to download a data set related to medical imaging from the Internet and combine each image with its question and answer to generate groups of (image, question, answer) objects, which is convenient for subsequent learning and training.
Then, the image and the text are preprocessed, i.e. the sample image features and the question feature information are obtained and then input to our backbone network.
For the image, a Faster R-CNN + ResNet-101 network is used as the feature extraction network: the residual network ResNet-101 extracts the global image features, and the local features of the image are then identified and extracted with the object detection algorithm, namely Faster R-CNN. The input image features are described as $X = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{m \times 2048}$, where m denotes the number of objects in the image.
For the question, the text is preprocessed and each sentence is trimmed to at most 14 words. The words in the question are embedded with 300-dimensional GloVe word vectors, and the text features are encoded with an LSTM network to extract the semantic feature information of the question as the input of the subsequent network. The input question features are described as $Y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{n \times 512}$, where n is the number of words in the sentence.
For the backbone network in the figure: first, the self-recognition module is used to extract the features of the image targets and of the question text, which reduces the interference of redundant information in the image targets, effectively captures the dependencies between the question words for text representation learning, and facilitates subsequently obtaining the correlation between each image region and the question.
The self-recognition module is mainly realized by the attention mechanism: the attention mechanism computes the correlations between the inputs, performs a weighted summation over all input vectors, and produces the attended features as the output of the multi-head attention. This output is then fed into a feed-forward network consisting of fully connected layers, giving the output of the self-recognition module. The question features $Y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{n \times 512}$ yield $\hat{Y}$ after passing through the self-recognition module, and the image features $X = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{m \times 2048}$ yield $\hat{X}$ after passing through the self-recognition module.
Second, the image features are processed by the position association module: the same target is processed by interacting the image features with the position features, relational association modeling is realized, the relative position relationships of different targets are obtained, and the matching capability of the multi-modal features is enhanced.
First, the coordinate information of the objects is obtained, and scale normalization and a logarithm operation are applied to it. Through the relation formula $R(n) = \sum_m w_{mn}\cdot(W_V \hat{X}_m)$, the object relationships between different objects are obtained; after the object relationships are obtained, the relation features are fused with the image features $\hat{X}$ to obtain the final image features $\hat{X}'$.
Then, the cross-guided multi-modal feature fusion scheme is introduced, which can capture the complex interactions among the modalities. The cross-guide model is similar to the self-recognition model, except that its inputs are not from the same group but are the image features and the text features respectively, and the final image and question features are obtained by mutual guidance.
then, a plurality of attention model layers are connected with a model with a deeper layer through deepening the layer number of the main network, so that the embedding of the attention model can be guided, the image to be processed and the problem feature can be gradually refined, and the representation capability of the model can be enhanced. Through the fusion of N layers, the image features are represented as X(N)Problems ofThe characteristic is represented as Y(N)
Finally, the fusion scheme and the classifier are designed and selected to achieve a better effect, and the learned joint representation is used for answer prediction. Through $a_x = \mathrm{softmax}(\mathrm{MLP}(X^{(N)}))$ and $a_y = \mathrm{softmax}(\mathrm{MLP}(Y^{(N)}))$, the attention weights used in the weighted summation of the two feature sets are obtained. The attention weights are multiplied with the image features and summed, $\tilde{X} = \sum_{i=1}^{m} a_{x,i}\, x_i^{(N)}$, to obtain the final image feature $\tilde{X}$, and the final question feature $\tilde{Y}$ is obtained in the same way. We adopt a linear multi-modal fusion of $\tilde{X}$ and $\tilde{Y}$ to obtain the fused feature f; the fused feature f is then mapped to a vector space $s \in \mathbb{R}^{L}$, where L is the number of the most frequent answers in the training set. Finally, the predicted answer with the highest probability is output as the final prediction. The loss value of the loss function is determined from the ground-truth answers and the predicted answers, and the model is updated according to the loss value.
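A minimal sketch of this attention-weighted pooling, linear fusion and classification head is given below; the MLP widths, fused dimension and answer-set size are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

class AttFlatFusionHead(nn.Module):
    """Attention-weighted pooling of the two modalities, linear fusion, sigmoid classifier."""
    def __init__(self, d_model=512, d_fused=1024, n_answers=1000):
        super().__init__()
        self.att_x = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.att_y = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.proj_x = nn.Linear(d_model, d_fused)
        self.proj_y = nn.Linear(d_model, d_fused)
        self.classifier = nn.Linear(d_fused, n_answers)

    def forward(self, x_n, y_n):
        # x_n: (B, K, d) image features after N fusion layers; y_n: (B, T, d) question features
        a_x = torch.softmax(self.att_x(x_n), dim=1)       # a_x = softmax(MLP(X^(N)))
        a_y = torch.softmax(self.att_y(y_n), dim=1)       # a_y = softmax(MLP(Y^(N)))
        x_tilde = (a_x * x_n).sum(dim=1)                  # attention-weighted sum over regions
        y_tilde = (a_y * y_n).sum(dim=1)                  # attention-weighted sum over words
        f = self.proj_x(x_tilde) + self.proj_y(y_tilde)   # linear multi-modal fusion
        s = self.classifier(f)                            # s = Linear(f), s in R^L
        return torch.sigmoid(s)                           # A = sigmoid(s)

head = AttFlatFusionHead()
print(head(torch.randn(2, 36, 512), torch.randn(2, 14, 512)).shape)  # torch.Size([2, 1000])
```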
Second embodiment:
1 Obtaining the sample image and medical question feature information
First, we need to download medically related image data and question annotations, including the images (mainly CT and MRI scans), the questions matched to each image and the ground-truth answers corresponding to those questions, forming groups of (image, question, answer) objects.
Then the features of the images and of the questions are extracted. A scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network. The specific operations are as follows:
Image feature acquisition: in order to better extract the required image features, the image information is processed by combining Faster R-CNN with ResNet-101. First, the global image features are extracted with the residual network ResNet-101, and then the local features of the image are identified and extracted with the object detection algorithm, namely Faster R-CNN, to obtain the corresponding lesion information. Not only an object detector but also an attribute classifier is applied to each region of the image; each object bounding box has a corresponding attribute class, so a binary description of the object can be obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
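A minimal sketch of this bottom-up region-feature extraction follows. It is an assumption-laden stand-in: torchvision's Faster R-CNN with a ResNet-50 FPN backbone replaces the ResNet-101-based detector described here, 2048-dimensional region vectors are pooled from a separate ResNet-101 feature map with RoI-Align, the attribute classifier is omitted, and the exact weights/API arguments depend on the installed torchvision version.

```python
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import roi_align

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()          # stand-in object detector
resnet = torchvision.models.resnet101(weights="DEFAULT").eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])          # global conv features, 2048 channels

@torch.no_grad()
def region_features(image, k=36):
    """image: float tensor (3, H, W) in [0, 1]. Returns (<=k, 2048) region features and boxes."""
    det = detector([image])[0]                       # detections sorted by score
    boxes = det["boxes"][:k]                         # keep the top-K object regions
    fmap = backbone(image.unsqueeze(0))              # global image feature map (stride ~32)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)   # (batch_idx, x1, y1, x2, y2)
    pooled = roi_align(fmap, rois, output_size=1,
                       spatial_scale=fmap.shape[-1] / image.shape[-1])
    return pooled.flatten(1), boxes                  # one 2048-d vector per detected region

feats, boxes = region_features(torch.rand(3, 512, 512))
print(feats.shape, boxes.shape)
```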
Question feature acquisition: the input medical question is first tokenized into individual words; questions longer than 14 words are truncated and the excess is discarded, while questions shorter than 14 words are zero-padded. The semantic features of the words are then captured with a 300-dimensional GloVe word-vector model and converted into vectors, and the text features are encoded with an LSTM network to extract the semantic feature information of the question as the input of the subsequent network.
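The question branch can be sketched as follows; the vocabulary size and hidden size are illustrative assumptions, and the embedding is randomly initialised here where in practice it would be loaded from 300-dimensional GloVe vectors.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Pad/truncate to 14 tokens, 300-d word embeddings (GloVe in practice), LSTM encoder."""
    def __init__(self, vocab_size=10000, d_word=300, d_hidden=512, max_len=14):
        super().__init__()
        self.max_len = max_len
        self.embed = nn.Embedding(vocab_size, d_word, padding_idx=0)  # load GloVe weights here
        self.lstm = nn.LSTM(d_word, d_hidden, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (B, T) integer word indices, 0 = padding
        t = token_ids[:, :self.max_len]                  # truncate questions longer than 14 words
        if t.size(1) < self.max_len:                     # zero-pad shorter questions
            pad = torch.zeros(t.size(0), self.max_len - t.size(1), dtype=t.dtype)
            t = torch.cat([t, pad], dim=1)
        y, _ = self.lstm(self.embed(t))                  # (B, 14, 512) per-word question features
        return y

enc = QuestionEncoder()
print(enc(torch.randint(1, 10000, (2, 9))).shape)  # torch.Size([2, 14, 512])
```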
2 Image and medical question self-recognition
The self-recognition module is an attention model that obtains the features among image regions and the features among question words through self-correlation learning. The core of the self-recognition module is the attention mechanism. The input consists of queries and keys of dimension d_key and values of dimension d_value; typically d_key and d_value are both written as d. First, we compute the dot products of the query with all keys and divide each by √d. Then a softmax function is applied to obtain the weights on the values. In practice, we compute the attention for a set of queries simultaneously by packing them into a matrix Q; the keys and values are likewise packed into matrices K and V. We compute the output matrix as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
Still further, an attention model with H parallel heads is adopted, which allows the model to simultaneously attend to information from different representation subspaces at different positions, so that a wider area can be attended to at the same time. We compute the output feature matrix as:

$$F = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}([\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H])\,W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$
The intra-modal self-recognition units consist of the attention model and a feed-forward network, and are used to extract the fine-grained features of the image or of the medical question. Taking the question as an example, for the question features $Y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{n \times 512}$, the inputs of the self-recognition module are obtained by the following projections:

$$Q_Y = Y W^{Q}, \quad K_Y = Y W^{K}, \quad V_Y = Y W^{V}$$

The question features are output after the learned attention weights are applied, and are then fed into a LayerNorm layer:

$$L_Y = \mathrm{LayerNorm}(Y + \mathrm{MultiHead}(Q_Y, K_Y, V_Y))$$

The feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final self-attended feature $\hat{Y}$ is obtained:

$$L'_Y = \mathrm{FF}(L_Y) = \max(0, L_Y W_1 + b_1)W_2 + b_2$$

$$\hat{Y} = \mathrm{LayerNorm}(L_Y + L'_Y)$$
After the self-recognition module, the medical image and the medical text each focus on their own key content, redundant information is eliminated, and the subsequent modal interaction and feature fusion become more convenient.
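A minimal PyTorch sketch of one such self-recognition unit is given below; the number of heads, feed-forward width and dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfRecognition(nn.Module):
    """Multi-head self-attention followed by a feed-forward layer,
    each wrapped with a residual connection and LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y):
        # y: (B, T, d) word features of the question (or region features of the image)
        att, _ = self.mha(y, y, y)              # MultiHead(Q_Y, K_Y, V_Y)
        l_y = self.norm1(y + att)               # L_Y = LayerNorm(Y + MultiHead(...))
        return self.norm2(l_y + self.ff(l_y))   # final self-attended features

sr = SelfRecognition()
print(sr(torch.randn(2, 14, 512)).shape)  # torch.Size([2, 14, 512])
```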
3 Modeling the position relationships of image lesion targets
In order to better acquire the image features and the position relationships between different targets, after the self-recognition features of the image information are obtained, the output features are sent into a position association unit that works together with the self-recognition module, and the image features and the position relationships of the different objects are modeled simultaneously. This facilitates a better understanding of the image: questions about position relationships, such as front, back, left, right, foreground and background, can be handled effectively through position-relation modeling, which makes it easier to locate the lesion region and to provide doctors with effective auxiliary diagnosis.
Inputting object by image features
Figure BDA00033888823000001313
And a position feature P, and a position feature,
Figure BDA00033888823000001314
is the feature obtained by the self-recognition module, and P is a four-dimensional object box.
To calculate the position feature weight, first, the coordinates of P are transformed as follows,
Figure BDA0003388882300000131
the method mainly performs scale normalization and logarithm operation, and aims to increase scale invariance so that training divergence caused by overlarge change range of values is avoided. Thus the N objects entered can be represented as
Figure BDA0003388882300000132
Then, W is addedGAnd insertCharacteristic multiplication of W inGIs also realized by a full connection layer. The final max operation is similar to the relu layer, whose main purpose is to impose a certain limit on the location feature weights.
Figure BDA0003388882300000133
The object relationship between the nth object and the entire set can be obtained by the following formula.
Figure BDA0003388882300000134
Figure BDA0003388882300000135
Representing image features of the m-th object, wmnOutputting W for the weight of the relation between different objectsVIs the weighted sum of the image characteristics of other objects after linear change.
The following is wmnAnd
Figure BDA0003388882300000136
and (4) calculating a formula.
Figure BDA0003388882300000137
Figure BDA0003388882300000138
After obtaining the relation characteristic R (n), the last step is to fuse the Nr relation characteristic and then to the image characteristic
Figure BDA0003388882300000139
The fusion is carried out, and the fusion is carried out,
Figure BDA00033888823000001310
The main reason for using concatenation here is that the amount of computation stays small: the channel dimension of each R(n) is $1/N_r$ of that of $\hat{X}_n$, so the dimension after concatenation is the same as that of $\hat{X}_n$.
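A minimal sketch of this position-association unit follows; it implements the relative-geometry embedding, the geometry and appearance weights, and the concatenation of $N_r$ relation heads added back onto the region features. The geometry-embedding width, number of relation heads and clamping constants are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionRelation(nn.Module):
    """Relative geometry features between box pairs, learned geometry weights w^G,
    appearance weights w^A, and Nr relation heads concatenated and added to the features."""
    def __init__(self, d_model=512, n_rel=8, d_geo=64):
        super().__init__()
        self.n_rel = n_rel
        self.geo_embed = nn.Linear(4, d_geo)            # eps_G: embed geometry into a higher dim
        self.w_g = nn.Linear(d_geo, n_rel)              # W_G, one geometry weight per relation head
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model // n_rel) # each R(n) has 1/Nr of the channels

    def forward(self, x, boxes):
        # x: (N, d) region features; boxes: (N, 4) as (cx, cy, w, h)
        cx, cy, w, h = boxes.unbind(-1)
        dx = torch.log((cx[None, :] - cx[:, None]).abs().clamp(min=1e-3) / w[:, None])
        dy = torch.log((cy[None, :] - cy[:, None]).abs().clamp(min=1e-3) / h[:, None])
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        geo = torch.stack([dx, dy, dw, dh], dim=-1)               # (N, N, 4) relative geometry
        w_g = torch.relu(self.w_g(self.geo_embed(geo)))           # w^G_mn = max(0, W_G * eps_G)
        w_a = self.w_k(x) @ self.w_q(x).T / (x.size(-1) ** 0.5)   # w^A_mn = <W_K x_m, W_Q x_n>/sqrt(d)
        rel = []
        for r in range(self.n_rel):
            w_mn = w_g[..., r] * torch.exp(w_a)                   # unnormalised relation weights
            w_mn = w_mn / w_mn.sum(dim=0, keepdim=True).clamp(min=1e-6)
            rel.append(w_mn.T @ self.w_v(x))                      # R^r(n) = sum_m w_mn * (W_V x_m)
        return x + torch.cat(rel, dim=-1)                         # concat Nr relations, add to features

pr = PositionRelation()
print(pr(torch.randn(36, 512), torch.rand(36, 4) + 0.1).shape)  # torch.Size([36, 512])
```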
4 Image-question cross-guidance
The cross-guide module consists of a question-guided image attention module and an image-guided question attention module. The mutually guided attention units pay more attention to the interaction between the modalities: the image region features and the question text features are updated by establishing semantic associations between the two different modalities, so as to obtain more refined features. Cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Similar to the self-recognition module, the core of the cross-guided attention module is also the attention mechanism, with inputs likewise denoted Q, K and V. Taking the question-guided image attention model as an example, the self-recognized image feature $\hat{X}$ and the self-recognized question feature $\hat{Y}$ are mapped to form the inputs of the image cross-attention model and the question cross-attention model and to obtain their outputs.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, we stack the layers, where N is the number of attention-model layers and the output of the previous attention layer is used as the input of the next attention layer. Connecting multiple attention-model layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model.
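One cross-guided layer and its stacking can be sketched as follows; the head count, residual/LayerNorm wrapping and the depth N = 2 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossGuidedLayer(nn.Module):
    """One cross-guided fusion layer: the question guides attention over image regions
    and the image guides attention over question words."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.img_from_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_x = nn.LayerNorm(d_model)
        self.norm_y = nn.LayerNorm(d_model)

    def forward(self, x, y):
        # x: (B, K, d) image region features; y: (B, T, d) question word features
        x_g, _ = self.img_from_txt(x, y, y)   # question-guided image attention (Q=x, K=V=y)
        y_g, _ = self.txt_from_img(y, x, x)   # image-guided question attention (Q=y, K=V=x)
        return self.norm_x(x + x_g), self.norm_y(y + y_g)

layers = nn.ModuleList([CrossGuidedLayer() for _ in range(2)])  # N = 2 stacked layers
x, y = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
for layer in layers:
    x, y = layer(x, y)          # output of one attention layer feeds the next
print(x.shape, y.shape)
```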
5 Model fusion and classifier
Through the learning of the intra-modal self-attention and cross-guided attention mechanisms, features containing rich image and question information are obtained. The image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are simply fused and input into the model classifier, and the predicted answer is obtained through the classifier.
After the effective features $\tilde{X}$ and $\tilde{Y}$ are obtained, they are sent into the linear multi-modal fusion network to obtain the fused feature f. The fused feature f is then mapped to a vector space $s \in \mathbb{R}^{L}$ and passed through a sigmoid function, where L is the number of the most frequent answers in the training set:

$$s = \mathrm{Linear}(f)$$

$$A = \mathrm{sigmoid}(s)$$
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer. We select the answer with the highest probability among all predicted answers as the final prediction, and regress the prediction with a binary cross-entropy loss. The loss value of the loss function is determined from the ground-truth answers and the predicted answers, and the model is updated according to the loss value.
$$\mathcal{L} = -\sum_{z=1}^{M}\sum_{k=1}^{N}\left[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\right]$$

where M denotes the number of training questions, N the number of candidate answers, $\hat{s}_{zk}$ the predicted answer output by the model, $s_{zk}$ the ground-truth answer, and z and k index the training questions and the candidate answers, respectively.
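This training criterion can be sketched in a few lines; the batch size, answer-set size and the one-hot targets are illustrative assumptions (in practice the target vector may be soft when several annotators give different answers).

```python
import torch
import torch.nn as nn

n_answers = 1000
scores = torch.randn(8, n_answers, requires_grad=True)              # s = Linear(f) for a batch of questions
targets = torch.zeros(8, n_answers)
targets[torch.arange(8), torch.randint(0, n_answers, (8,))] = 1.0   # ground-truth answer per question

loss = nn.BCEWithLogitsLoss(reduction="sum")(scores, targets)       # binary cross-entropy over all candidates
loss.backward()                                                      # the loss value drives the model update

pred = torch.sigmoid(scores).argmax(dim=1)                           # highest-probability answer as the prediction
print(loss.item(), pred.shape)
```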
The visual question-answering method for medical image diagnosis disclosed by the invention has the capability of visual question answering; in particular for judging the position relationships of lesions it can better help doctors perform auxiliary diagnosis, and it enables patients to obtain the basic information of an image without consulting a doctor.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A visual question-answering method for medical image diagnosis, characterized by comprising the following steps:
acquiring medical images and the corresponding medical questions;
extracting features from the lesion targets in the image and from the medical question text respectively, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing the same lesion target by interacting the image features with the position features, realizing relational association modeling to obtain the relative position relationships of different targets and match the multi-modal features;
introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, applied to medical question answering, so as to realize visual question answering for medical image diagnosis.
2. The visual question-answering method for medical image diagnosis according to claim 1, wherein acquiring the medical images and the corresponding medical questions specifically comprises the following steps:
downloading medically related image data and question annotations from the network, where the images are mainly CT and MRI scans, together with the questions matched to each image and the ground-truth answers corresponding to those questions, forming groups of (image, question, answer) objects.
3. The visual question-answering method for medical image diagnosis according to claim 1 or 2, wherein extracting features from the image lesion targets and from the medical question text respectively specifically comprises:
performing feature extraction on the images and the questions: a scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network.
4. The visual question-answering method for medical image diagnosis according to claim 3, wherein obtaining the image features specifically comprises: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and then the local features of the image are identified and extracted with the object detection algorithm, namely Faster R-CNN, to obtain the corresponding lesion information; not only an object detector but also an attribute classifier is applied to each region of the image, and each object bounding box has a corresponding attribute class, so that a binary description of the object can be obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
5. The visual question-answering method for medical image diagnosis according to claim 3 or 4, wherein obtaining the question features specifically comprises: the input medical question is first tokenized into individual words; questions longer than 14 words are truncated and the excess is discarded, while questions shorter than 14 words are zero-padded; the semantic features of the words are then captured with a 300-dimensional GloVe word-vector model and converted into vectors, and the text features are encoded with an LSTM network to extract the semantic feature information of the question as the input of the subsequent network.
6. The visual question-answering method for medical image diagnosis according to claim 5, wherein a self-recognition module is further provided to obtain the features among image regions and the features among question words; the self-recognition module is an attention model and obtains the inter-region features of the image and the inter-word features of the question through self-correlation learning; the core of the self-recognition module is the attention mechanism; the input consists of queries and keys of dimension d_key and values of dimension d_value; first, the dot products of the query with all keys are computed and each is divided by √d_key; then a softmax function is applied to obtain the weights on the values; in practice, to compute the attention weights of a set of queries simultaneously, the queries are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
7. The visual question-answering method for medical image diagnosis according to claim 6, wherein the attention model adopts an attention mechanism with H parallel heads, which allows the model to simultaneously attend to information from different representation subspaces at different positions, and the output feature matrix is computed as:

$$F = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}([\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H])\,W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$

the self-recognition module consists of the attention mechanism model and a feed-forward network, and is used to extract the fine-grained features of the image or of the medical question;
the question features are output after the learned attention weights are applied, and are then passed into a LayerNorm layer; the feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final self-attended feature $\hat{Y}$ is obtained through self-attention.
8. The visual question-answering method for medical image diagnosis according to claim 7, wherein processing the same lesion target by interacting the image features with the position features, realizing relational association modeling and obtaining the relative position relationships of different targets, specifically comprises:

each input object consists of an image feature $\hat{X}$ and a position feature P, where $\hat{X}$ is the feature obtained by the self-recognition module and P is a four-dimensional object bounding box;

to calculate the position feature weights, the coordinates of an object are represented as $\{x_i, y_i, w_i, h_i\}$, where $x_i$ denotes the abscissa of the object center, $y_i$ the ordinate of the object center, $w_i$ the width of the object box, and $h_i$ the height of the object box; first, the coordinates of P are transformed as follows:

$$\varepsilon(P_m, P_n) = \left(\log\frac{|x_m - x_n|}{w_m},\; \log\frac{|y_m - y_n|}{h_m},\; \log\frac{w_n}{w_m},\; \log\frac{h_n}{h_m}\right)^{T}$$

where m and n denote the two object boxes; this performs scale normalization and a logarithm operation; the N input objects can then be expressed as $\{(\hat{X}_n, P_n)\}_{n=1}^{N}$;

next, the geometric features of the two objects are embedded into a high-dimensional feature denoted $\varepsilon_G$, which is multiplied by $W_G$ to obtain the weight, where $W_G$ is also realized by a fully connected layer; the final max operation is similar to a ReLU layer, and its main purpose is to impose a certain limit on the position feature weights:

$$w_{mn}^{G} = \max\{0,\; W_G \cdot \varepsilon_G(P_m, P_n)\}$$

where $w_{mn}^{G}$ denotes the position feature weight between the two objects, $\varepsilon_G$ denotes the embedding of the geometric features into the high-dimensional feature, and $P_m, P_n$ denote the geometric features of objects m and n;

the object relationship between the n-th object and the whole set is obtained by the following formula:

$$R(n) = \sum_{m} w_{mn} \cdot (W_V \hat{X}_m)$$

where R(n) denotes the object relationship between the n-th object and the entire set, $\hat{X}_m$ denotes the image feature of the m-th object, $w_{mn}$ is the relation weight between the different objects, and $W_V$ is a linear transformation, so the output is the weighted sum of the linearly transformed image features of the other objects;

the weights $w_{mn}$ and $w_{mn}^{A}$ are computed as:

$$w_{mn} = \frac{w_{mn}^{G}\,\exp(w_{mn}^{A})}{\sum_{k} w_{kn}^{G}\,\exp(w_{kn}^{A})}$$

$$w_{mn}^{A} = \frac{\langle W_K \hat{X}_m,\; W_Q \hat{X}_n \rangle}{\sqrt{d_k}}$$

where $w_{mn}^{A}$ denotes the image feature weight between objects m and n, $w_{mn}^{G}$ denotes the relative position feature weight between objects m and n, k indexes the objects, and $w_{kn}^{G}$ denotes the relative position feature weight between the k-th object and the n-th object;

after the relation features R(n) are obtained, the last step is to fuse the $N_r$ relation features and then fuse them with the image feature $\hat{X}_n$:

$$\hat{X}_n' = \hat{X}_n + \mathrm{Concat}[R^{1}(n), R^{2}(n), \ldots, R^{N_r}(n)]$$
9. The visual question-answering method for medical image diagnosis according to claim 8, wherein introducing the cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities specifically comprises:

the cross-guide module consists of a question-guided image attention module and an image-guided question attention module; the image region features and the question text features are updated by establishing semantic associations between the two different modalities, so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information;

the core of the cross-guided attention module is also the attention mechanism, and its inputs are likewise denoted Q, K and V; taking the question-guided image attention model as an example, the self-recognized image feature $\hat{X}$ and the self-recognized question feature $\hat{Y}$ are mapped to form the inputs of the image cross-attention model and the question cross-attention model and to obtain their outputs;

after the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention-model layers and the output of the previous attention layer is used as the input of the next attention layer; connecting multiple attention-model layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model.
10. The visual question-answering method for medical image diagnosis according to claim 8, wherein designing and selecting the fusion mode and the classifier, and applying the method to medical question answering to realize visual question answering for medical image diagnosis, specifically comprises:
after the effective features of the image and the question are obtained, they are sent to a linear multi-modal fusion network; the fused feature f is then mapped through a sigmoid function to a vector space s ∈ R^L, where L is the number of the most frequent answers in the training set;
s = Linear(f)
A = sigmoid(s)
where A represents the answer predicted by the model.
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; regression is performed with a binary cross-entropy function; the loss value is determined from the true answer and the predicted answer, and the model is updated according to the loss value.
\mathcal{L} = -\frac{1}{M} \sum_{z=1}^{M} \sum_{k=1}^{N} \left[ s_{zk} \log \hat{s}_{zk} + (1 - s_{zk}) \log(1 - \hat{s}_{zk}) \right]
where M represents the number of training questions, N represents the number of candidate answers, \hat{s}_{zk} represents the predicted answer output by the model, s_{zk} represents the true answer, and z, k index the training questions and candidate answers respectively.
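The answer-prediction stage can be sketched as follows, assuming the two modal feature vectors are fused by an element-wise product of linearly projected features (the claim states only that a linear multi-modal fusion network is used), followed by the sigmoid classifier over the L most frequent answers and a binary cross-entropy loss; layer sizes and the number of answers are placeholders.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Linear multi-modal fusion + sigmoid classifier over the L most frequent
    training answers; effectively one logistic regression per candidate answer."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=1024, num_answers=3000):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, hidden)
        self.proj_txt = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)   # s = Linear(f)

    def forward(self, img_vec, txt_vec):
        # f: fused multi-modal feature (element-wise product is assumed here)
        f = self.proj_img(img_vec) * self.proj_txt(txt_vec)
        s = self.classifier(f)        # scores in R^L
        return torch.sigmoid(s)       # A: per-answer probabilities

# Training uses binary cross-entropy between predicted scores and answer labels;
# at inference, the answer with the highest probability is returned.
model = AnswerClassifier()
criterion = nn.BCELoss()              # mean of s*log(s_hat) + (1-s)*log(1-s_hat) terms
img_vec, txt_vec = torch.randn(8, 512), torch.randn(8, 512)
targets = torch.randint(0, 2, (8, 3000)).float()   # ground-truth answer indicators
probs = model(img_vec, txt_vec)
loss = criterion(probs, targets)
pred = probs.argmax(dim=-1)           # index of the predicted answer
```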
CN202111461563.7A 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis Active CN114201592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111461563.7A CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Publications (2)

Publication Number Publication Date
CN114201592A true CN114201592A (en) 2022-03-18
CN114201592B CN114201592B (en) 2024-07-23

Family

ID=80650233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111461563.7A Active CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Country Status (1)

Country Link
CN (1) CN114201592B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019211250A1 (en) * 2018-04-30 2019-11-07 Koninklijke Philips N.V. Visual question answering using on-image annotations
WO2020263711A1 (en) * 2019-06-28 2020-12-30 Facebook Technologies, Llc Memory grounded conversational reasoning and question answering for assistant systems
CN112818889A (en) * 2021-02-09 2021-05-18 北京工业大学 Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113240046A (en) * 2021-06-02 2021-08-10 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering", 《THE JOURNAL OF SUPERCOMPUTING》, 29 March 2023 (2023-03-29), pages 13696 *
A LUBNA等: "MoBVQA: A Modality based Medical Image Visual Question Answering System", 《TENCON 2019 - 2019 IEEE REGION 10 CONFERENCE (TENCON)》, 20 October 2019 (2019-10-20), pages 727 - 732, XP033672617, DOI: 10.1109/TENCON.2019.8929456 *
张礼阳: "结合视觉内容理解与文本信息分析的视觉问答方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 07, 15 July 2020 (2020-07-15), pages 138 - 844 *
陈珂佳: "基于深度学习的视觉问答研究", 《重庆邮电大学硕士学位论文》, 16 April 2024 (2024-04-16), pages 1 - 86 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780701A (en) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780701B (en) * 2022-04-20 2024-07-02 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114821245A (en) * 2022-05-30 2022-07-29 大连大学 Medical visual question-answering method based on global visual information intervention
CN114821245B (en) * 2022-05-30 2024-03-26 大连大学 Medical visual question-answering method based on global visual information intervention
CN117648976A (en) * 2023-11-08 2024-03-05 北京医准医疗科技有限公司 Answer generation method, device, equipment and storage medium based on medical image
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention
CN117407541A (en) * 2023-12-15 2024-01-16 中国科学技术大学 Knowledge graph question-answering method based on knowledge enhancement
CN117407541B (en) * 2023-12-15 2024-03-29 中国科学技术大学 Knowledge graph question-answering method based on knowledge enhancement
CN118471487A (en) * 2024-07-12 2024-08-09 福建自贸试验区厦门片区Manteia数据科技有限公司 Diagnosis and treatment scheme generating device based on multi-source heterogeneous data and electronic equipment

Also Published As

Publication number Publication date
CN114201592B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN114201592B (en) Visual question-answering method for medical image diagnosis
Arevalo et al. Gated multimodal networks
CN110750959B (en) Text information processing method, model training method and related device
CN110491502A (en) Microscope video stream processing method, system, computer equipment and storage medium
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN113806609B (en) Multi-modal emotion analysis method based on MIT and FSM
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN116129141B (en) Medical data processing method, apparatus, device, medium and computer program product
KR20200010672A (en) Smart merchandise searching method and system using deep learning
Halvardsson et al. Interpretation of swedish sign language using convolutional neural networks and transfer learning
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN110705490A (en) Visual emotion recognition method
CN116821391A (en) Cross-modal image-text retrieval method based on multi-level semantic alignment
CN115410254A (en) Multi-feature expression recognition method based on deep learning
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
Thangavel et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models
Nahar et al. A robust model for translating arabic sign language into spoken arabic using deep learning
CN116578738B (en) Graph-text retrieval method and device based on graph attention and generating countermeasure network
Gasimova Automated enriched medical concept generation for chest X-ray images
Shahadat et al. Cross channel weight sharing for image classification
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
Abu-Jamie et al. Classification of Sign-Language Using Deep Learning-A Comparison between Inception and Xception models
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
Liu et al. Multi-type decision fusion network for visual Q&A

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant