CN114201592A - Visual Question Answering Method for Medical Image Diagnosis - Google Patents

Visual Question Answering Method for Medical Image Diagnosis

Info

Publication number
CN114201592A
Authority
CN
China
Prior art keywords
image
features
feature
question
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111461563.7A
Other languages
Chinese (zh)
Other versions
CN114201592B (en)
Inventor
蔡林沁
陈珂佳
方豪度
赖廷杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111461563.7A priority Critical patent/CN114201592B/en
Publication of CN114201592A publication Critical patent/CN114201592A/en
Application granted granted Critical
Publication of CN114201592B publication Critical patent/CN114201592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention claims a visual question-answering method for medical image diagnosis, which belongs to the fields of medical image processing, natural language processing and multi-modal fusion and comprises the following steps: acquiring medical images and the corresponding related medical questions; extracting features from the image lesion targets and the medical question text respectively, capturing the dependency relationships among the question words, and performing text representation learning to obtain the correlation between each image region and the question; processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets for matching the multi-modal features; introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities; and designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.

Description

Visual question-answering method for medical image diagnosis
Technical Field
The invention belongs to the fields of medical image processing, natural language processing and multi-modal fusion, and particularly relates to a visual question-answering method for medical image diagnosis.
Background
Health has always been one of the issues of greatest concern to people, and with the continuous development of deep learning it has become increasingly important to use different tools and techniques to help doctors diagnose and to help patients better understand their own physical condition. Medical imaging is an extremely important tool for physicians to understand a patient's condition in clinical analysis and diagnosis. However, the information different doctors obtain from the same medical image may vary, and the number of doctors is far smaller than the number of patients, so doctors often face physical and mental fatigue and find it difficult to answer all of their patients' questions manually.
In Visual Question Answering (VQA), a picture is given, a question is input, and the system selects an appropriate answer according to the feature information of the picture and outputs it in natural language. A good visual question-answering model for medical image diagnosis can automatically extract the information contained in a medical image and capture, for example, the location of a lesion; it can provide a radiologist with a second opinion on image analysis, realize auxiliary diagnosis, and help strengthen the radiologist's confidence in interpreting complex medical images. At the same time, a VQA model can help patients gain a preliminary understanding of their own physical condition, which helps them choose a more targeted medical plan.
However, current mainstream visual question-answering models often ignore the fine-grained interaction between images and questions. In fact, learning the keywords in a question and obtaining the position information of different image regions can provide useful clues for answer reasoning. Directly applying mainstream models to medical image diagnosis still has several shortcomings. First, most existing methods realize only a coarse interaction between the image and the question and cannot capture the correlation between each image region and the question. Second, the inherent dependencies between words at different positions in a sentence cannot be captured effectively. Third, existing methods only extract appearance features from the image and lack spatial features, so they cannot model the correlation between different objects in the image.
A search of prior art found publication CN113516182A, which provides a training method and apparatus for a visual question-answering model and a visual question-answering method. The method comprises: acquiring picture samples and question samples for training a visual question-answering model; extracting features from the picture samples to obtain picture sample features and from the question samples to obtain question sample features; determining a latent relation variable between the picture sample features and the question sample features, the latent relation variable representing whether the picture sample and the question sample are related; and training the visual question-answering model according to the latent relation variable, the picture sample features and the question sample features to obtain a target visual question-answering model used for visual question answering. With that method, answers of relatively high accuracy can still be given when fuzzy questions are answered. Compared with the present invention, its graph convolution can understand semantic information better but has higher complexity. In addition, that technique obtains picture features and text features separately with a first weighting scheme and a second weighting scheme, and lacks the finer-grained interaction provided by layer stacking. Furthermore, the present invention also introduces a position association module, which attends to the positional relationships between different objects while deeply interacting the image features and the question features.
CN110321946A discloses a multi-modal medical image recognition method and apparatus based on deep learning, which collects medical image data with medical imaging equipment; enhances the acquired images with an image enhancement algorithm; extracts features from the acquired images with an extraction program; recognizes the extracted features with a recognition program; converts medical images of different modalities with a conversion program; prints the acquired images with a printer; and displays the acquired medical image data with a display. That invention improves the image feature extraction effect through its image feature extraction module; its modality conversion module adopts three-dimensional reconstruction, registration and segmentation to ensure that the corresponding images of the first and second modalities are highly matched; and it divides the training image into multiple image blocks, reducing the hardware requirements of inputting the whole training image. That technique recognizes the extracted features with a feature recognition program and improves recognition capability with an image enhancement algorithm, but it neglects modal interaction, i.e., it lacks the ability to interact with doctors or patients and cannot intelligently answer patients' questions or efficiently assist doctors in diagnosis. The present invention improves image recognition capability while also considering interaction with the user, making it more intelligent and improving user engagement.
Therefore, to better assist doctors in auxiliary diagnosis and to allow patients to obtain the basic information of an image without consulting a doctor, there is a need to design an explicit mechanism for learning the correlation between questions and images, and to build a model that processes image features and position features and applies them to the visual question-answering task oriented to medical image diagnosis.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a visual question-answering method for medical image diagnosis. The technical scheme of the invention is as follows:
a visual question-answering method oriented to medical image diagnosis comprises the following steps:
acquiring medical images and the corresponding related medical questions;
extracting features from the image lesion targets and the medical question text respectively, capturing the dependency relationships among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets for matching the multi-modal features;
introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.
Further, acquiring the medical images and the corresponding related medical questions specifically comprises the following steps:
downloading medical image data and question labels from the Internet, including pictures (mainly CT and MRI scans) together with the questions matched to the pictures and the ground-truth answers corresponding to those questions, to form sets of (picture, question, answer) objects.
Further, extracting features from the image lesion targets and the medical question text respectively specifically comprises:
extracting features from the pictures and the questions: inputting a scan image and extracting the relevant regions in the image with a Faster R-CNN object detection algorithm based on ResNet-101; inputting an English sentence and obtaining the question features through word embedding and a recurrent neural network.
Further, obtaining the image features specifically includes: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and then the local image features are identified and extracted with the Faster R-CNN object detection algorithm to obtain the corresponding lesion information; not only an object detector but also an attribute classifier is used for each region in the image, so that each object bounding box has a corresponding attribute class and a binary description of the object is obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector, which serves as the input of the subsequent network.
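As a rough illustration of this region-feature interface, the sketch below uses stock torchvision components as stand-ins: the patent's detector (Faster R-CNN on ResNet-101 with an extra attribute head) is not an off-the-shelf model, so here a stock detector proposes boxes and a ResNet-101 trunk supplies the 2048-dimensional region descriptors via RoIAlign. The helper name `extract_region_features`, the value K = 36 and the use of `fasterrcnn_resnet50_fpn` are all assumptions made for the sketch, not details from the patent.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

K = 36  # assumed number of object regions kept per image

# Stand-in detector and a ResNet-101 trunk truncated before pooling (2048-channel feature map).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
trunk = nn.Sequential(*list(torchvision.models.resnet101(weights="DEFAULT").children())[:-2]).eval()

@torch.no_grad()
def extract_region_features(image: torch.Tensor):
    """image: float tensor [3, H, W] in [0, 1]; returns (region features [K, 2048], boxes [K, 4])."""
    boxes = detector([image])[0]["boxes"][:K]               # top-K detected boxes in pixel coords
    fmap = trunk(image.unsqueeze(0))                        # [1, 2048, H/32, W/32]
    pooled = roi_align(fmap, [boxes], output_size=(1, 1),   # pool one 2048-d vector per box
                       spatial_scale=1.0 / 32)
    return pooled.flatten(1), boxes                         # [K, 2048], [K, 4]
```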
Further, obtaining the question features specifically includes: the input medical question is first processed into single words, truncated to at most 14 words, with excess words discarded and questions of fewer than 14 words padded with zeros; then the semantic features of the words are captured with a 300-dimensional GloVe word vector model and converted into vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network.
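A minimal sketch of this question branch follows: tokens are truncated or zero-padded to 14, looked up in a 300-dimensional embedding (assumed to be initialized from GloVe), and encoded by an LSTM into 512-dimensional per-word features. The class name, vocabulary size and token ids are placeholders.

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, HID_DIM = 14, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM, padding_idx=0)
        if glove_weights is not None:                # rows aligned with the vocabulary
            self.embed.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

    def forward(self, token_ids):
        # token_ids: [batch, 14], already truncated / zero-padded
        out, _ = self.lstm(self.embed(token_ids))    # [batch, 14, 512] per-word question features
        return out

def pad_or_truncate(ids):
    ids = list(ids)[:MAX_LEN] + [0] * max(0, MAX_LEN - len(ids))
    return torch.tensor(ids)

# usage with a toy vocabulary and an arbitrary word-to-id mapping
encoder = QuestionEncoder(vocab_size=10000)
q = pad_or_truncate([12, 7, 431, 9])
Y = encoder(q.unsqueeze(0))                          # [1, 14, 512]
```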
Furthermore, a self-recognition module is provided to obtain inter-region image features and inter-word question features; the self-recognition module is an attention model that obtains these features through self-correlation learning; its core is an attention mechanism; the input consists of query keys of dimension d_key and values of dimension d_value; first, the dot product of a query key with all keys is computed and each result is divided by √d; then, a softmax function is applied to obtain the weights of the required values; in practice, to compute the attention weights for a set of query keys simultaneously, they are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
Further, the attention model adopts an attention mechanism with H parallel heads, which allows the model to jointly attend to information from different representation subspaces at different positions, and the output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_0
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The self-recognition module consists of the attention mechanism model and a feed-forward network and is used to extract the fine-grained features of the image or the medical question.
After the attention weights are learned, the question features are output and then fed into a LayerNorm layer; the feed-forward layer contains two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer; the final feature Ỹ is obtained through self-attention.
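The self-recognition unit just described could be sketched roughly as below: multi-head self-attention followed by a two-layer feed-forward block, each with a residual connection and LayerNorm. The hidden sizes (512, 8 heads, 2048-wide feed-forward) are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class SelfRecognition(nn.Module):
    def __init__(self, dim=512, heads=8, ff_dim=2048, p=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Dropout(p), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: [batch, seq, dim]; query, key and value all come from the same modality
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                 # L = LayerNorm(x + MultiHead(x, x, x))
        return self.norm2(x + self.ff(x))     # final feature after the feed-forward block

feats = SelfRecognition()(torch.randn(2, 14, 512))   # e.g. question features Y -> Ỹ
```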
Further, processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets specifically comprises:
the input objects consist of the image features X̃ and the position features P, where X̃ is the feature obtained from the self-recognition module and P is a four-dimensional object box;
to compute the position feature weights, an object's coordinates are denoted {x_i, y_i, h_i, w_i}, where x_i is the abscissa of the object center, y_i the ordinate of the object center, w_i the width of the object box and h_i its height. First, the coordinates P are transformed as follows:
ε(P_m, P_n) = ( log(|x_m − x_n| / w_m), log(|y_m − y_n| / h_m), log(w_n / w_m), log(h_n / h_m) )
where m and n denote two object boxes and the transform performs scale normalization and logarithm operations; the N input objects can then be expressed as {(X̃_n, P_n)}, n = 1, …, N.
Next, the geometric features of the two objects are embedded into a high-dimensional feature, denoted ε_G, and W_G is multiplied by the embedded feature to obtain a weight, W_G also being realized by a fully connected layer; the final max operation is similar to a ReLU layer and mainly imposes a limit on the position feature weights:
w_mn^G = max{0, W_G · ε_G(P_m, P_n)}
where w_mn^G is the position feature weight between the two objects, ε_G denotes the embedding of the geometric features into a high-dimensional feature, and P_m, P_n are the geometric features of objects m and n.
The object relation between the n-th object and the whole set is obtained by the following formula:
R(n) = Σ_m w_mn · (W_V · X̃_m)
where R(n) is the object relation between the n-th object and the whole set, X̃_m is the image feature of the m-th object, w_mn is the relation weight between different objects, and W_V performs a linear transformation so that the output is the weighted sum of the image features of the other objects.
The formulas for w_mn and w_mn^A are:
w_mn = w_mn^G · exp(w_mn^A) / Σ_k ( w_kn^G · exp(w_kn^A) )
w_mn^A = ⟨W_K X̃_m, W_Q X̃_n⟩ / √d_k
where w_mn^A is the image feature weight between objects m and n, w_mn^G is the relative position feature weight between objects m and n, k indexes the objects, and w_kn^G is the relative position feature weight between the k-th and the n-th object.
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image features X̃:
X̃_n^rel = X̃_n + Concat[R_1(n), R_2(n), …, R_{N_r}(n)]
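A hedged sketch of one such position-relation unit follows, in the spirit of the formulas above: pairwise box geometry is log-normalized and embedded, turned into a geometric weight w^G, combined with an appearance weight w^A, and used to aggregate linearly transformed region features. For brevity it uses a single relation head (the patent concatenates N_r of them), a plain linear layer as a stand-in for the geometric embedding ε_G, and assumed feature sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationUnit(nn.Module):
    def __init__(self, dim=512, geo_dim=64):
        super().__init__()
        self.geo_embed = nn.Linear(4, geo_dim)   # stand-in for the high-dimensional embedding eps_G
        self.W_G = nn.Linear(geo_dim, 1)         # geometric weight
        self.W_K = nn.Linear(dim, dim)
        self.W_Q = nn.Linear(dim, dim)
        self.W_V = nn.Linear(dim, dim)

    @staticmethod
    def relative_geometry(boxes):
        # boxes: [N, 4] as (x_center, y_center, w, h) -> pairwise log-normalized offsets [N, N, 4]
        x, y, w, h = boxes.unbind(-1)
        dx = torch.log((x[:, None] - x[None, :]).abs().clamp(min=1e-3) / w[:, None])
        dy = torch.log((y[:, None] - y[None, :]).abs().clamp(min=1e-3) / h[:, None])
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        return torch.stack([dx, dy, dw, dh], dim=-1)

    def forward(self, feats, boxes):
        # feats: [N, dim] self-recognized region features, boxes: [N, 4]
        geo = self.geo_embed(self.relative_geometry(boxes))              # [N, N, geo_dim]
        w_g = F.relu(self.W_G(geo)).squeeze(-1)                          # w^G_mn = max(0, W_G * eps_G)
        w_a = (self.W_K(feats) @ self.W_Q(feats).T) / feats.size(-1) ** 0.5
        w = w_g * torch.exp(w_a)
        w = w / w.sum(dim=0, keepdim=True).clamp(min=1e-6)               # normalize over m for each n
        relation = w.T @ self.W_V(feats)                                 # row n = R(n) = sum_m w_mn (W_V x_m)
        return feats + relation                                          # fuse relations with image features

out = RelationUnit()(torch.randn(36, 512), torch.rand(36, 4) + 0.1)
```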
further, the method for capturing the complex interaction relationship among multiple modalities by introducing the cross-guided multi-modality feature fusion stacking manner specifically includes:
the cross guide module consists of a problem guide picture attention module and a problem attention module; updating image region characteristics and problem text characteristics by establishing semantic association between two different modes to obtain more detailed characteristics; performing cross fusion characteristic extraction on the sample image characteristic information and the sample problem characteristic information to obtain an image characteristic vector carrying the sample problem information and a sample problem characteristic vector carrying the sample image information;
the core of the cross-guide attention module is also an attention mechanism, and the input is also represented as Q, K and V; taking the attention model of the problem guide image as an example, the self-recognition feature of the input image
Figure BDA00033888823000000610
Self-identification of questions
Figure BDA00033888823000000611
Mapping to obtain the input and output of the image interactive attention model and the problem interactive attention modelOutputting the model;
after obtaining the image characteristic vector carrying the sample problem information and the sample problem characteristic vector carrying the sample image information, stacking a layer number, wherein N is the layer number of the attention model, and the output of the previous attention layer is used as the input of the next attention layer; the multiple attention model layers are connected with the model of a deeper layer, so that the embedding of the attention model can be guided, the image to be processed and the problem feature can be gradually refined, and the representation capability of the model can be enhanced.
Further, designing and selecting the fusion scheme and the classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis, specifically comprises:
after the effective features X̂ and Ŷ are obtained, they are fed into a linear multi-modal fusion network; the fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of the most frequent answers in the training set:
f = W_x^T X̂ + W_y^T Ŷ
s = Linear(f)
A = sigmoid(s)
where A is the answer distribution predicted by the model.
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; training uses a binary cross-entropy function; the loss value is determined from the true answer and the predicted answer, and the model is updated according to the loss value.
L = − Σ_z^M Σ_k^N [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
where M is the number of training questions, N is the number of candidate answers, ŝ_zk is the predicted answer score output by the model, s_zk is the true answer label, and z and k are the question and answer indices used during training.
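The fusion, classification and loss described above could look roughly like the sketch below; the linear-fusion form, the LayerNorm, the class name and the answer-set size are assumptions made for illustration, with `BCEWithLogitsLoss` standing in for the binary cross-entropy over candidate answers.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    def __init__(self, dim=512, num_answers=1000):       # num_answers = L most frequent answers
        super().__init__()
        self.fuse_x = nn.Linear(dim, dim)
        self.fuse_y = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, num_answers)

    def forward(self, x_hat, y_hat):
        f = self.norm(self.fuse_x(x_hat) + self.fuse_y(y_hat))   # linear multi-modal fusion
        return self.proj(f)                                      # logits s in R^L

clf = AnswerClassifier()
logits = clf(torch.randn(8, 512), torch.randn(8, 512))
targets = torch.zeros(8, 1000); targets[:, 3] = 1.0              # toy ground-truth answer labels
loss = nn.BCEWithLogitsLoss()(logits, targets)                   # binary cross-entropy over answers
pred = torch.sigmoid(logits).argmax(dim=1)                       # A = sigmoid(s); highest score wins
```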
The invention has the following advantages and beneficial effects:
The present invention proposes a visual question-answering method for medical image diagnosis. Given that many existing attention-based VQA methods can only learn a coarse interaction between the modalities, the proposed model lets the two modalities guide each other's attention and obtains the correlation between each image region and the question. Another core idea of the invention is to add position attention, which improves the judgment of the positional relationships of objects in the image and improves the performance of counting objects in the image. The invention can serve as an effective reference for assisting doctors in diagnosis, thereby greatly improving diagnostic efficiency; it can also help patients gain a preliminary understanding of their own physical condition, which helps them choose a more targeted medical plan.
Regarding the method of claims 6-7: the invention first considers that the extracted image features and the medical text features are mutually independent, and uses a self-recognition module to emphasize the respective salient content of the image and the text in order to obtain finer features. Conventional models only consider picture recognition, i.e., a self-attention model is applied only to the picture; the invention emphasizes that text features are equally important and that a question has its own key points and keywords, so the self-recognition module is applied not only to picture processing but also to the medical text question. The finer single-modality models allow better subsequent model fusion.
Regarding the method of claim 8: common visual question-answering models address open-domain questions, and the related data show that the answers of baseline models are often poor and generic when answering questions about positions. The picture features obtained here contain not only the original features but also rich inter-object positional relationships.
Regarding the method of claim 9: common visual question-answering models usually perform multi-modal fusion in a text-guides-picture manner, neglecting that picture information can also guide the text information. The cross-guided multi-modal feature fusion stacking scheme designed by the invention can capture the complex interactions among the modalities. The image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Drawings
FIG. 1 is a flow chart of a visual question-answering method for medical image diagnosis according to a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the first embodiment: as shown in fig. 1, the present invention provides a visual question-answering method for medical image diagnosis, which implements feature fusion for two cross-modal data, namely, picture and text, helps a doctor to perform an auxiliary diagnosis, and enables a patient to use it to obtain basic information of an image without consulting the doctor.
Firstly, we need to download the data set related to the medical image from the internet, and combine the question and the answer of the object to generate a group of objects of pictures, questions and answers, which is convenient for the subsequent learning and training.
Then, the image and the text are preprocessed, namely sample pictures and problem characteristic information are obtained and then input to the main network of the user.
For pictures, a Faster R-CNN + ResNet-101 network is used as the feature extraction network: the residual network ResNet-101 extracts the global image features in the picture, and the Faster R-CNN object detection algorithm then identifies and extracts the local image features. The input picture features are described as X = [x_1, x_2, …, x_m] ∈ R^(m×2048), where m is the number of detected objects in the picture.
For the question, the text is preprocessed and the sentence is written as words with a length not exceeding 14. The words in the question are embedded with 300-dimensional GloVe word vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network. The input question features are described as Y = [y_1, y_2, …, y_n] ∈ R^(n×512), where n is the number of words in the sentence.
For the main network in the figure: first, the self-recognition module is used to extract the features of the image targets and the question text, which reduces the interference of redundant information in the image targets, effectively captures the dependency relationships between the question words for text representation learning, and facilitates the subsequent computation of the correlation between each image region and the question.
The self-recognition module is mainly realized by an attention mechanism: the attention mechanism computes the correlation between the inputs, then performs a weighted sum over all input vectors, and the attended features are computed as the output of multi-head attention. This output is then fed into a feed-forward neural network consisting of fully connected layers, yielding the output of the self-recognition module. The question features Y = [y_1, y_2, …, y_n] ∈ R^(n×512) become Ỹ after passing through a self-recognition module, and the picture features X = [x_1, x_2, …, x_m] ∈ R^(m×2048) become X̃ after passing through a self-recognition module.
Secondly, the image features are processed by the position association module: the same target is processed by interacting the image features with the position features, relational association modeling is realized, the relative positional relationships of different targets are obtained, and the matching capability of the multi-modal features is enhanced.
First, the coordinate information of the objects is obtained and scale normalization and logarithm operations are applied to it. Through R(n) = Σ_m w_mn · (W_V · X̃_m), the object relations between different objects are obtained; after the object relations are obtained, the relation features are fused with the picture features, X̃^rel = X̃ + Concat[R_1(n), …, R_{N_r}(n)], yielding the final picture features X̃^rel.
Then, the cross-guided multi-modal feature fusion scheme is introduced, which can capture the complex interactions among the modalities. The cross-guide model is similar to the self-recognition model, except that the input features do not come from the same modality but are the image features and the text features respectively, and the final features X̃ and Ỹ guide each other.
Then, by deepening the number of layers of the main network, multiple attention model layers are connected into a deeper model, which guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model. After N layers of fusion, the image features are denoted X^(N) and the question features Y^(N).
Finally, a fusion scheme and a classifier are designed and selected to achieve a better effect. The learned joint representation is used for answer prediction. Attention weights for the weighted sums of the two features are obtained by a_x = softmax(MLP(X^(N))) and a_y = softmax(MLP(Y^(N))). The attention weights are multiplied by the picture features, X̂ = Σ_i a_x^i · X_i^(N), to obtain the final picture feature X̂, and the final question feature Ŷ is obtained in the same way. A linear multi-modal fusion scheme f = W_x^T X̂ + W_y^T Ŷ is adopted; the fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of answers with the highest occurrence frequency in the training set. The predicted answer with the highest probability is output as the final predicted answer. The loss value of the loss function is determined from the true answer and the predicted answer, and the model is updated according to the loss value.
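The attention-weighted pooling step just described (a = softmax(MLP(·)) followed by a weighted sum over regions or words) could be sketched as follows; the MLP width and the class name are assumptions.

```python
import torch
import torch.nn as nn

class AttendedPool(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):
        # feats: [batch, seq, dim] -> one attended feature vector [batch, dim]
        a = torch.softmax(self.mlp(feats), dim=1)    # a = softmax(MLP(features))
        return (a * feats).sum(dim=1)                # weighted sum over regions / words

pool = AttendedPool()
x_hat = pool(torch.randn(2, 36, 512))   # attended image feature X̂
y_hat = pool(torch.randn(2, 14, 512))   # attended question feature Ŷ
```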
Second embodiment:
1 Obtaining sample image and medical question feature information
First, medical image data and question labels are downloaded, including pictures (mainly CT and MRI scans) together with the questions matched to the pictures and the ground-truth answers corresponding to those questions, to form sets of (picture, question, answer) objects.
Then, features are extracted from the pictures and the questions. A scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network. The specific operations are as follows:
Picture feature acquisition: to better extract the required picture features, the image information is processed by combining Faster R-CNN with ResNet-101. First, the global image features are extracted with the residual network ResNet-101, and then the local image features are identified and extracted with the Faster R-CNN object detection algorithm to obtain the corresponding lesion information. Not only an object detector but also an attribute classifier is used for each region in the image; each object bounding box has a corresponding attribute class, so a binary description of the object is obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector as the input of the subsequent network.
Question feature acquisition: the input medical question is first processed into single words, truncated to at most 14 words, with excess words discarded and questions of fewer than 14 words padded with zeros. Then the semantic features of the words are captured with a 300-dimensional GloVe word vector model and converted into vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network.
2 Image and medical question self-recognition
The self-recognition module is an attention model that obtains inter-region image features and inter-word question features through self-correlation learning. The core of the self-recognition module is an attention mechanism. The input consists of query keys of dimension d_key and values of dimension d_value; typically d_key and d_value are both written as d. First, the dot product of the query key with all keys is computed and each result is divided by √d. Then, the softmax function is applied to obtain the weights of these values. In practice, the attention for a set of query keys is computed simultaneously by packing them into a matrix Q; the keys and values are likewise packed into matrices K and V. The output matrix is computed as:
Attention(Q, K, V) = softmax(Q K^T / √d) V
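As a minimal illustration, the formula above transcribes directly into a few lines; the function name and tensor shapes are illustrative.

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d = Q.size(-1)
    weights = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # softmax(QK^T / sqrt(d))
    return weights @ V

out = attention(torch.randn(14, 64), torch.randn(14, 64), torch.randn(14, 64))
```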
Still further, an attention model with H parallel heads is employed, which allows the model to jointly attend to information from different representation subspaces at different positions, so that a wider area can be attended to at the same time. The output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_0
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The intra-modal self-recognition units consist of an attention model and a feed-forward network and are used to extract the fine-grained features of an image or a medical question. Taking the question features Y = [y_1, y_2, …, y_n] ∈ R^(n×512) as an example, the inputs of the self-recognition module are obtained by:
Q_Y = Y W^Q, K_Y = Y W^K, V_Y = Y W^V
After the attention weights are learned, the question features are output and then fed into the LayerNorm layer:
L_Y = LayerNorm(Y + MultiHead(Q_Y, K_Y, V_Y))
The feed-forward layer contains two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer; the final feature Ỹ is obtained through self-attention:
L'_Y = FF(L_Y) = max(0, L_Y W_1 + b_1) W_2 + b_2
Ỹ = LayerNorm(L_Y + L'_Y)
After the self-recognition module, the medical image and the medical text each focus on their own salient content, redundant information is removed, and subsequent modal interaction and feature fusion are facilitated.
3 Modeling the positional relations of image lesion objects
To better acquire the image features and the positional relations between different targets, after the self-recognition features of the image information are obtained, the output features are fed into a position association unit that works together with the self-recognition module, modeling the image features and the positional relations of different objects simultaneously. This facilitates a better understanding of the image, and positional questions, such as front, back, left, right, foreground and background, can be handled effectively through the positional relation modeling, which makes it easier to locate the lesion area and to provide doctors with effective auxiliary diagnosis.
The input objects consist of the image features X̃ and the position features P, where X̃ is the feature obtained from the self-recognition module and P is a four-dimensional object box.
To compute the position feature weights, the coordinates of P are first transformed as follows:
ε(P_m, P_n) = ( log(|x_m − x_n| / w_m), log(|y_m − y_n| / h_m), log(w_n / w_m), log(h_n / h_m) )
This mainly performs scale normalization and logarithm operations, with the aim of increasing scale invariance so that training does not diverge because of values with an excessively large range. The N input objects can thus be expressed as {(X̃_n, P_n)}, n = 1, …, N.
Then, W_G is multiplied by the embedded feature, where W_G is also realized by a fully connected layer. The final max operation is similar to a ReLU layer and mainly imposes a limit on the position feature weights:
w_mn^G = max{0, W_G · ε_G(P_m, P_n)}
The object relation between the n-th object and the whole set is obtained by the following formula:
R(n) = Σ_m w_mn · (W_V · X̃_m)
where X̃_m is the image feature of the m-th object, w_mn is the relation weight between different objects, and W_V performs a linear transformation so that the output is the weighted sum of the image features of the other objects.
The formulas for w_mn and w_mn^A are:
w_mn = w_mn^G · exp(w_mn^A) / Σ_k ( w_kn^G · exp(w_kn^A) )
w_mn^A = ⟨W_K X̃_m, W_Q X̃_n⟩ / √d_k
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image features X̃:
X̃_n^rel = X̃_n + Concat[R_1(n), R_2(n), …, R_{N_r}(n)]
The main reason for using concatenation here is the small amount of computation: the channel dimension of each R(n) is 1/N_r of that of X̃, so the dimension after concatenation is the same as that of X̃.
4 Image-question cross-guidance
The cross-guide module consists of a question-guided picture attention module and a picture-guided question attention module. The mutually guided attention units pay more attention to the interaction between the modalities; the image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features. Cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Similar to the self-recognition module, the core of the cross-guided attention module is also the attention mechanism, and its inputs are likewise denoted Q, K and V. Taking the question-guided image attention model as an example, the self-recognition feature X̃ of the input image is mapped together with the self-recognition feature Ỹ of the question to obtain the outputs of the image interactive attention model and the question interactive attention model.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention model layers and the output of one attention layer serves as the input of the next. Connecting multiple attention model layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model.
5 Model fusion and classifier
Through the learning of the intra-modal self-attention and cross-guided attention mechanisms, features containing rich image and question information are obtained. The image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are simply fused and input into the model classifier, and the predicted answer is obtained through the classifier.
After the effective features X̂ and Ŷ are obtained, they are fed into the linear multi-modal fusion network. The fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of the most frequent answers in the training set:
f = W_x^T X̂ + W_y^T Ŷ
s = Linear(f)
A = sigmoid(s)
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer. The answer with the highest probability among all predicted answers is selected as the final prediction. Training therefore uses a binary cross-entropy function; the loss value of the loss function is determined from the true answer and the predicted answer, and the model is updated according to the loss value:
L = − Σ_z^M Σ_k^N [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
The visual question-answering method for medical image diagnosis disclosed by the invention has the capability of visual question answering, can better help doctors perform auxiliary diagnosis, in particular when judging the positional relations of a lesion, and enables patients to use it to obtain the basic information of an image without consulting a doctor.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1.一种面向医学图像诊断的视觉问答方法,其特征在于,包括以下步骤:1. a visual question answering method for medical image diagnosis, is characterized in that, comprises the following steps: 获取医学影像和对应相关医学问题;Obtaining medical images and corresponding medical problems; 对图像病灶目标和医学问题文本分别进行特征提取,捕捉问题词之间的依赖关系进行文本表示学习,得到每个图像区域和问题的相关性;Perform feature extraction on image lesion targets and medical question texts, capture the dependencies between question words, and perform text representation learning to obtain the correlation between each image area and the question; 通过与影像特征和位置特征交互,对同一病灶目标进行处理,实现关系关联建模,获得不同目标的相对位置关系,用于多模态特征的匹配;By interacting with image features and location features, the same lesion target is processed to achieve relationship association modeling, and the relative positional relationship of different targets can be obtained for multi-modal feature matching; 引入交叉引导的多模态特征融合堆叠方式,捕捉多模态之间的复杂交互关系;The cross-guided multi-modal feature fusion stacking method is introduced to capture the complex interaction between multi-modalities; 设计选取融合方式和分类器,运用到医学问答中,实现面向医学图像诊断的视觉问答研究。Design and select fusion methods and classifiers, and apply them to medical question answering to realize visual question answering research for medical image diagnosis. 2.根据权利要求1所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述医学影像和对应相关医学问题,具体包括以下步骤:2. A kind of visual question answering method oriented to medical image diagnosis according to claim 1, is characterized in that, described medical image and corresponding relevant medical question, specifically comprise the following steps: 在网上下载医学相关影像资料和问题标签,其中包括图片,主要是CT、MRI在内的扫描图像,以及与图片相匹配的问题和问题对应的真实答案,形成图片、问题、答案的一组对象。Download medical-related imaging materials and question labels online, including pictures, mainly scanned images including CT and MRI, as well as real answers to questions and questions that match the pictures, forming a group of objects for pictures, questions, and answers . 3.根据权利要求1或2所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述对图像病灶目标和医学问题文本分别进行特征提取,具体包括:3. The visual question answering method for medical image diagnosis according to claim 1 or 2, wherein the feature extraction is performed on the image focus target and the medical question text respectively, specifically comprising: 对图片和问题进行特征提取:输入一幅扫描图像,使用基于ResNet-101的Faster R-CNN的目标检测算法提取图像中的相关区域;输入一个英文句子,通过词嵌入和循环神经网络后得到问题特征。Feature extraction for pictures and questions: input a scanned image, use the ResNet-101-based Faster R-CNN target detection algorithm to extract relevant areas in the image; input an English sentence, get the question after word embedding and recurrent neural network feature. 4.根据权利要求3所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述图片特征获取具体包括:采用Faster-RCNN与Resnet101相结合的方式处理图像信息:首先利用残差网络Resnet101提取影像中的全局图像特征,然后根据目标检测算法Faster-RCNN来识别抽取图像的局部特征,获得相应病灶信息;对图像中的每一个区域不仅使用对象检测器,还使用属性分类器,每一个对象包围框都有一个对应的属性类,这样可以获得对象的二元描述,每幅图像提取K个对象区域,每个对象区域用一个2048维的向量表示,作为后续网络的输入。4. a kind of visual question answering method oriented to medical image diagnosis according to claim 3, it is characterized in that, described picture characteristic acquisition specifically comprises: adopt the mode of Faster-RCNN and Resnet101 combination to process image information: first utilize residual error The network Resnet101 extracts the global image features in the image, and then identifies the local features of the extracted image according to the target detection algorithm Faster-RCNN to obtain the corresponding lesion information; for each area in the image, not only the object detector but also the attribute classifier is used. 
Each object bounding box has a corresponding attribute class, so that the binary description of the object can be obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector as the input of the subsequent network. 5.根据权利要求3或4所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述问题特征获取具体包括:输入的医学问题首先会被处理为单个单词,最长截取为14个单词,多余的丢弃,少于14个的用零填充;然后结合300维的GloVe词向量模型捕捉单词的语义特征,转化为向量模式,再利用LSTM网络对文本特征进行编码从而抽取问题语义特征信息,作为后续网络的输入。5. A kind of visual question answering method oriented to medical image diagnosis according to claim 3 or 4, it is characterized in that, described question characteristic acquisition specifically comprises: the medical question of input can be processed as a single word at first, and the longest interception is 14 words, discard the redundant ones, fill with zeros for less than 14 words; then combine the 300-dimensional GloVe word vector model to capture the semantic features of the words, convert them into vector patterns, and then use the LSTM network to encode the text features to extract the semantics of the problem feature information, as the input of the subsequent network. 6.根据权利要求5所述的一种面向医学图像诊断的视觉问答方法,其特征在于,还通过设置一个自我识别模块来获取影像区域间特征和问题词间特征,自我识别模块对是一种注意模型,通过自相关学习获取影像区域间特征和问题词间特征;自我识别模块的核心是注意力机制;该输入由一个维度为d_key的查询键和一个维度为d_value的值组成;首先,计算查询键与所有键的点积,并将每个键除以√d;然后,应用softmax函数获得需要的值的权重;实际上,为了同步计算一组查询键的注意权重,将它们打包到矩阵Q中;键和值也被打包到矩阵K和V中。6. a kind of visual question answering method for medical image diagnosis according to claim 5, is characterized in that, also by setting a self-identification module to obtain the feature between image regions and the feature between question words, the self-identification module is a kind of The attention model obtains the features between image regions and between words in the question through autocorrelation learning; the core of the self-recognition module is the attention mechanism; the input consists of a query key with dimension d_key and a value with dimension d_value; first, calculate Do the dot product of the query key with all keys, and divide each key by √d; then, apply the softmax function to get the weights of the desired values; in fact, to simultaneously compute the attention weights for a set of query keys, pack them into a matrix into Q; keys and values are also packed into matrices K and V. 7.根据权利要求6所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述注意模型采用H个并行头的注意力机制模型,它允许模型同时关注来自不同位置的不同表示子空间的信息,将输出特征矩阵计算为:7. A visual question answering method for medical image diagnosis according to claim 6, wherein the attention model adopts the attention mechanism model of H parallel heads, which allows the model to simultaneously pay attention to different representations from different positions information of the subspace, the output feature matrix is computed as: F=MultiHead(Q,K,V)=Concat([head1,head2,…headH])W0 F=MultiHead(Q,K,V)=Concat([head 1 ,head 2 ,...head H ])W 0
Figure FDA0003388882290000021
Figure FDA0003388882290000021
自我识别模块由注意力机制模型和前馈网络组成,用于提取影像或医学问题的细微特征;The self-recognition module consists of an attention mechanism model and a feed-forward network to extract subtle features of imaging or medical problems; 学习注意力特征得到权重后,输出问题特征;然后将它们输入LayerNorm层;前馈层包含两个全连接层以及ReLu函数和Dropout函数,最后是LayerNorm层,经过自我关注得到最终的特征
Figure FDA0003388882290000022
After learning the attention features and getting the weights, the problem features are output; then they are input into the LayerNorm layer; the feedforward layer contains two fully connected layers and the ReLu function and the Dropout function, and finally the LayerNorm layer, after self-attention to get the final features
Figure FDA0003388882290000022
8.根据权利要求7所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述通过与影像特征和位置特征交互,对同一病灶目标进行处理,实现关系关联建模,获得不同目标的相对位置关系,具体包括:8. A visual question answering method for medical image diagnosis according to claim 7, characterized in that, by interacting with image features and location features, the same lesion target is processed to realize relational association modeling and obtain different The relative positional relationship of the target, including: 输入对象由图像特征
Figure FDA0003388882290000038
和位置特征P组成,
Figure FDA0003388882290000039
是经过自我识别模块得到的特征,P是一个四维对象框;
The input object consists of image features
Figure FDA0003388882290000038
and the position feature P,
Figure FDA0003388882290000039
is the feature obtained by the self-identification module, and P is a four-dimensional object frame;
为了计算位置特征权重,将一个对象坐标表示为{xi,yi,hi,wi},其中xi表示对象中心的横坐标位置,yi表示对象中心的纵坐标位置,wi表示对象框的宽度,hi表示对象框的高度。首先,对P的坐标进行如下变换,
Figure FDA0003388882290000031
m、n分别表示两个对象框,进行尺度归一化和对数运算;输入的N个对象可以表示为
Figure FDA0003388882290000032
In order to calculate the position feature weight, the coordinates of an object are expressed as {x i , y i , h i , w i }, where x i represents the abscissa position of the object center, y i represents the ordinate position of the object center, and wi represents the The width of the object box, hi represents the height of the object box. First, the coordinates of P are transformed as follows,
Figure FDA0003388882290000031
m and n represent two object boxes, respectively, for scale normalization and logarithmic operations; the input N objects can be expressed as
Figure FDA0003388882290000032
接着,将两个物体的几何特征嵌入到高维特征中表示为εG,将WG与嵌入特征相乘,得到一个权重,其中的WG也是由一个全连接层实现的;最后的max操作类似于relu层,其主要目的是对位置特征权重施加一定的限制;Next, the geometric features of the two objects are embedded into the high-dimensional features and expressed as ε G , and WG is multiplied by the embedded feature to obtain a weight, where WG is also implemented by a fully connected layer; the final max operation Similar to the relu layer, its main purpose is to impose certain restrictions on the position feature weights;
Figure FDA0003388882290000033
Figure FDA0003388882290000033
Figure FDA0003388882290000034
表示两个对象之间的位置特征权重。
Figure FDA0003388882290000034
Represents the positional feature weight between two objects.
εG表示将几何特征嵌入到高维特征。 εG represents the embedding of geometric features into high-dimensional features. Pm,Pn表示m、n两个对象的几何特征。P m , P n represent the geometric features of m and n objects. 通过下列公式可以得到第n个对象与整个集合之间的对象关系;The object relationship between the nth object and the entire collection can be obtained by the following formula;
Figure FDA0003388882290000035
Figure FDA0003388882290000035
R(n)表示第n个对象与整个集合之间的对象关系。R(n) represents the object relationship between the nth object and the entire collection.
Figure FDA0003388882290000036
表示第m个物体的图像特征,wmn为不同物体之间关系的权重,WV用于线性变化,最终得到其他物体图像特征的加权和;
Figure FDA0003388882290000036
Represents the image feature of the mth object, w mn is the weight of the relationship between different objects, W V is used for linear change, and finally the weighted sum of the image features of other objects is obtained;
The weights w_mn and w_A^{mn} are computed as follows:
w_mn = w_G^{mn} · exp(w_A^{mn}) / Σ_k w_G^{kn} · exp(w_A^{kn})
w_A^{mn} = (W_K f_A^m)ᵀ (W_Q f_A^n) / √d_k
where w_A^{mn} denotes the image feature weight between objects m and n, w_G^{mn} denotes the relative position feature weight between objects m and n, k ranges over the objects in the set, w_G^{kn} denotes the relative position feature weight between the k-th object and the n-th object, W_K and W_Q are linear transformations, and d_k is the dimension of the transformed features.
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image feature f_A^n:
f_A^n = f_A^n + Concat[ R_1(n), …, R_{N_r}(n) ]
that is, the N_r relation features are concatenated and added to the image feature of the n-th object.
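As an illustration of the relation computation and the fusion step, here is a compact sketch of one relation head and of the concatenation of N_r heads. The scaled dot-product form of w_A^{mn}, the per-head dimensions, and the projection applied to the concatenated heads are assumptions; the softmax-style normalisation implements w_mn = w_G^{mn}·exp(w_A^{mn}) / Σ_k w_G^{kn}·exp(w_A^{kn}).

```python
import torch
import torch.nn as nn

class ObjectRelationHead(nn.Module):
    """One relation head: combines appearance weights w_A and position weights w_G
    into w_mn and returns R(n), the weighted sum of the other objects' features."""
    def __init__(self, dim=512, key_dim=64):
        super().__init__()
        self.w_q = nn.Linear(dim, key_dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, key_dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)       # W_V
        self.scale = key_dim ** 0.5

    def forward(self, f_a, w_g):
        # f_a: (N, dim) object features; w_g: (N, N) with w_g[m, n] = w_G^{mn}
        q, k, v = self.w_q(f_a), self.w_k(f_a), self.w_v(f_a)
        w_a = (k @ q.t()) / self.scale                    # w_A^{mn} at position [m, n]
        logits = torch.log(w_g.clamp(min=1e-6)) + w_a     # log(w_G) + w_A
        w_mn = torch.softmax(logits, dim=0)               # normalise over m for each n
        return w_mn.t() @ v                               # R(n) = sum_m w_mn * (W_V f_A^m)

class RelationFusion(nn.Module):
    """Concatenates N_r relation heads and adds them back to the image features."""
    def __init__(self, dim=512, n_r=8):
        super().__init__()
        self.heads = nn.ModuleList([ObjectRelationHead(dim) for _ in range(n_r)])
        self.proj = nn.Linear(dim * n_r, dim, bias=False)  # assumed projection back to dim

    def forward(self, f_a, w_g):
        r = torch.cat([head(f_a, w_g) for head in self.heads], dim=-1)
        return f_a + self.proj(r)   # f_A^n + Concat[R_1(n), ..., R_Nr(n)]
```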
9. The visual question answering method for medical image diagnosis according to claim 8, characterized in that introducing a cross-guided multimodal feature fusion stacking scheme to capture the complex interactions between the modalities specifically comprises:
The cross-guidance module consists of a question-guided image attention module and an image-guided question attention module. The image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features; cross-fusion feature extraction is performed on the sample image feature information and the sample question feature information, yielding an image feature vector that carries the sample question information and a sample question feature vector that carries the sample image information.
The core of the cross-guided attention module is likewise an attention mechanism, and its input is likewise expressed as Q, K, V. Taking the question-guided image attention model as an example, the self-recognition features of the input image are mapped with the self-recognition features of the question to obtain the output of the image cross-attention model and the output of the question cross-attention model, as sketched after this claim.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the attention layers are stacked, where N is the number of attention model layers and the output of the previous attention layer serves as the input of the next attention layer. Connecting multiple attention model layers into a deeper model guides the embedding of the attention model, progressively refines the image and question features to be processed, and enhances the representational ability of the model.
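A minimal sketch of one cross-guided (guided-attention) layer and of the layer stacking follows. Which modality supplies the queries and which supplies the keys/values, as well as the alternating update order in the stacking loop, are assumptions made for illustration; the claim only states that one modality guides the attention over the other and that each layer's output feeds the next layer.

```python
import torch
import torch.nn as nn

class GuidedAttentionLayer(nn.Module):
    """Cross-guided attention: `guide` (e.g. the question) steers the attention
    over `target` (e.g. the image regions), followed by a feed-forward network."""
    def __init__(self, dim=512, num_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target, guide):
        # Queries come from `target`, keys/values from `guide` (assumed assignment).
        attn_out, _ = self.attn(target, guide, guide)
        x = self.norm1(target + attn_out)
        return self.norm2(x + self.ffn(x))

def stack_cross_guided(img, qst, img_layers, qst_layers):
    """Stack N cross-guided layers; each layer's output feeds the next layer."""
    for img_layer, qst_layer in zip(img_layers, qst_layers):
        img = img_layer(img, qst)   # question-guided image attention
        qst = qst_layer(qst, img)   # image-guided question attention
    return img, qst
```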
10. The visual question answering method for medical image diagnosis according to claim 8, characterized in that designing and selecting the fusion scheme and the classifier and applying them to medical question answering, so as to realize visual question answering for medical image diagnosis, specifically comprises:
After the effective (refined) image and question features are obtained, they are fed into a linear multimodal fusion network to obtain the fused feature f; the fused feature f is then mapped to a vector s in the space R^L, where L is the number of the most frequent answers in the training set, and a sigmoid function converts s into the answer predictions:
s = Linear(f)
A = sigmoid(s)
where A denotes the answers predicted by the model.
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction. A binary cross-entropy function is used for the regression: the loss value of the loss function is determined from the ground-truth answers and the predicted answers, and the model is updated according to the loss value,
Loss = − Σ_{z=1}^{M} Σ_{k=1}^{N} [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
where M denotes the number of training questions, N denotes the number of candidate answers, ŝ_zk denotes the predicted answer output by the model, s_zk denotes the ground-truth answer, and z, k are respectively the question and answer indices during training.
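To make the classifier and training objective concrete, here is a small sketch. The additive form of the linear fusion (projecting each modality and summing, followed by LayerNorm) is an assumption, since the claim only states that a linear multimodal fusion network is used; the Linear → sigmoid classifier and the binary cross-entropy objective follow the formulas above.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Linear multimodal fusion followed by a sigmoid answer classifier.
    num_answers corresponds to L, the most frequent answers in the training set."""
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.proj_img = nn.Linear(dim, dim)   # projection of the image feature (assumed)
        self.proj_qst = nn.Linear(dim, dim)   # projection of the question feature (assumed)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, f_img, f_qst):
        f = self.norm(self.proj_img(f_img) + self.proj_qst(f_qst))  # fused feature f
        s = self.classifier(f)          # s = Linear(f), s in R^L
        return torch.sigmoid(s)         # A = sigmoid(s)

# Binary cross-entropy over the candidate answers; targets are soft scores in [0, 1].
bce = nn.BCELoss(reduction='sum')
# predicted = model(f_img, f_qst); loss = bce(predicted, target_scores); loss.backward()
```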
CN202111461563.7A 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis Active CN114201592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111461563.7A CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111461563.7A CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Publications (2)

Publication Number Publication Date
CN114201592A true CN114201592A (en) 2022-03-18
CN114201592B CN114201592B (en) 2024-07-23

Family

ID=80650233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111461563.7A Active CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Country Status (1)

Country Link
CN (1) CN114201592B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019211250A1 (en) * 2018-04-30 2019-11-07 Koninklijke Philips N.V. Visual question answering using on-image annotations
WO2020263711A1 (en) * 2019-06-28 2020-12-30 Facebook Technologies, Llc Memory grounded conversational reasoning and question answering for assistant systems
CN112818889A (en) * 2021-02-09 2021-05-18 北京工业大学 Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113240046A (en) * 2021-06-02 2021-08-10 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering", 《THE JOURNAL OF SUPERCOMPUTING》, 29 March 2023 (2023-03-29), pages 13696 *
A LUBNA et al.: "MoBVQA: A Modality based Medical Image Visual Question Answering System", 《TENCON 2019 - 2019 IEEE REGION 10 CONFERENCE (TENCON)》, 20 October 2019 (2019-10-20), pages 727 - 732, XP033672617, DOI: 10.1109/TENCON.2019.8929456 *
张礼阳: "Research on Visual Question Answering Combining Visual Content Understanding and Textual Information Analysis", 《China Master's Theses Full-text Database, Information Science and Technology》, no. 07, 15 July 2020 (2020-07-15), pages 138 - 844 *
陈珂佳: "Research on Visual Question Answering Based on Deep Learning", 《Master's thesis, Chongqing University of Posts and Telecommunications》, 16 April 2024 (2024-04-16), pages 1 - 86 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780701A (en) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780701B (en) * 2022-04-20 2024-07-02 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114821245A (en) * 2022-05-30 2022-07-29 大连大学 Medical visual question-answering method based on global visual information intervention
CN114821245B (en) * 2022-05-30 2024-03-26 大连大学 Medical visual question-answering method based on global visual information intervention
CN117648976A (en) * 2023-11-08 2024-03-05 北京医准医疗科技有限公司 Answer generation method, device, equipment and storage medium based on medical image
CN117648976B (en) * 2023-11-08 2025-02-21 北京医准医疗科技有限公司 Medical image-based answer generation method, device, equipment and storage medium
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Visual solution method for medical imaging problems based on fine-grained cross-attention
CN117407541A (en) * 2023-12-15 2024-01-16 中国科学技术大学 Knowledge graph question-answering method based on knowledge enhancement
CN117407541B (en) * 2023-12-15 2024-03-29 中国科学技术大学 A knowledge graph question answering method based on knowledge enhancement
CN118471487A (en) * 2024-07-12 2024-08-09 福建自贸试验区厦门片区Manteia数据科技有限公司 Diagnosis and treatment plan generation device and electronic device based on multi-source heterogeneous data
CN119090895A (en) * 2024-11-11 2024-12-06 浙江杜比医疗科技有限公司 A method and device for processing mammary gland optical images, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN114201592B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN114201592A (en) Visual Question Answering Method for Medical Image Diagnosis
Arevalo et al. Gated multimodal networks
Sharma et al. MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain
CN110491502A (en) Microscope video stream processing method, system, computer equipment and storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Bodapati et al. Msenet: Multi-modal squeeze-and-excitation network for brain tumor severity prediction
Halvardsson et al. Interpretation of swedish sign language using convolutional neural networks and transfer learning
CN114612902B (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN113806609A (en) A Multimodal Sentiment Analysis Method Based on MIT and FSM
CN116129141B (en) Medical data processing method, apparatus, device, medium and computer program product
CN110705490A (en) Visual Emotion Recognition Methods
CN119088945A (en) A large-scale language model question answering system based on health care knowledge
Thangavel et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models
Arun Prasath et al. Prediction of sign language recognition based on multi layered CNN
CN116484042A (en) Visual question-answering method combining autocorrelation and interactive guided attention mechanism
Chen et al. Breast cancer classification with electronic medical records using hierarchical attention bidirectional networks
Nahar et al. A robust model for translating arabic sign language into spoken arabic using deep learning
Renjith et al. Sign language recognition by using spatio-temporal features
CN113779298A (en) A compound loss-based method for medical visual question answering
Prusty et al. Enhancing medical image classification with generative AI using latent denoising diffusion probabilistic model and wiener filtering approach
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
CN118334549A (en) Short video label prediction method and system for multi-mode collaborative interaction
CN117035019A (en) Data processing method and related equipment
Cai et al. Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
Habib et al. Exploring Progress in Text-to-Image Synthesis: An In-Depth Survey on the Evolution of Generative Adversarial Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant