CN114201592A - Visual Question Answering Method for Medical Image Diagnosis - Google Patents

Visual Question Answering Method for Medical Image Diagnosis

Info

Publication number
CN114201592A
Authority
CN
China
Prior art keywords
image
features
feature
question
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111461563.7A
Other languages
Chinese (zh)
Other versions
CN114201592B (en)
Inventor
蔡林沁
陈珂佳
方豪度
赖廷杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111461563.7A priority Critical patent/CN114201592B/en
Publication of CN114201592A publication Critical patent/CN114201592A/en
Application granted granted Critical
Publication of CN114201592B publication Critical patent/CN114201592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention claims a visual question-answering method for medical image diagnosis, which belongs to the fields of medical image processing, natural language processing and multi-modal fusion and comprises the following steps: acquiring medical images and the corresponding related medical questions; extracting features from the image lesion targets and the medical question text respectively, capturing the dependency relationships among the question words, and performing text representation learning to obtain the correlation between each image region and the question; processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets for matching the multi-modal features; introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities; and designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.

Description

Visual question-answering method for medical image diagnosis
Technical Field
The invention belongs to the fields of medical image processing, natural language processing and multi-modal fusion, and particularly relates to a visual question-answering method for medical image diagnosis.
Background
Health has always been one of the issues of greatest concern to people, and with the continuous development of deep learning it has become increasingly important to use different tools and techniques to help doctors diagnose and to help patients better understand their own physical condition. Medical imaging is an extremely important tool for physicians to understand a patient's condition in clinical analysis and diagnosis. However, the information different doctors obtain from the same medical image may vary, and the number of doctors is far smaller than the number of patients, so doctors often face physical and mental fatigue and find it difficult to answer all of their patients' questions manually.
In Visual Question Answering (VQA), a picture is given, a question is input, and the system selects an appropriate answer according to the feature information of the picture and outputs it in natural language. A good visual question-answering model for medical image diagnosis can automatically extract the information contained in a medical image and capture, for example, the location of a lesion; it can provide a radiologist with a second opinion on image analysis, realize auxiliary diagnosis, and help strengthen the radiologist's confidence in interpreting complex medical images. At the same time, a VQA model can help patients gain a preliminary understanding of their own physical condition, which helps them choose a more targeted medical plan.
However, current mainstream visual question-answering models often ignore the fine-grained interaction between images and questions. In fact, learning the keywords in a question and obtaining the position information of different image regions can provide useful clues for answer reasoning. Directly applying mainstream models to medical image diagnosis still has several shortcomings. First, most existing methods realize only a coarse interaction between the image and the question and cannot capture the correlation between each image region and the question. Second, the inherent dependencies between words at different positions in a sentence cannot be captured effectively. Third, existing methods only extract appearance features from the image and lack spatial features, so they cannot model the correlation between different objects in the image.
A search of prior art found publication CN113516182A, which provides a training method and apparatus for a visual question-answering model and a visual question-answering method. The method comprises: acquiring picture samples and question samples for training a visual question-answering model; extracting features from the picture samples to obtain picture sample features and from the question samples to obtain question sample features; determining a latent relation variable between the picture sample features and the question sample features, the latent relation variable representing whether the picture sample and the question sample are related; and training the visual question-answering model according to the latent relation variable, the picture sample features and the question sample features to obtain a target visual question-answering model used for visual question answering. With that method, answers of relatively high accuracy can still be given when fuzzy questions are answered. Compared with the present invention, its graph convolution can understand semantic information better but has higher complexity. In addition, that technique obtains picture features and text features separately with a first weighting scheme and a second weighting scheme, and lacks the finer-grained interaction provided by layer stacking. Furthermore, the present invention also introduces a position association module, which attends to the positional relationships between different objects while deeply interacting the image features and the question features.
CN110321946A discloses a multi-modal medical image recognition method and apparatus based on deep learning, which collects medical image data with medical imaging equipment; enhances the acquired images with an image enhancement algorithm; extracts features from the acquired images with an extraction program; recognizes the extracted features with a recognition program; converts medical images of different modalities with a conversion program; prints the acquired images with a printer; and displays the acquired medical image data with a display. That invention improves the image feature extraction effect through its image feature extraction module; its modality conversion module adopts three-dimensional reconstruction, registration and segmentation to ensure that the corresponding images of the first and second modalities are highly matched; and it divides the training image into multiple image blocks, reducing the hardware requirements of inputting the whole training image. That technique recognizes the extracted features with a feature recognition program and improves recognition capability with an image enhancement algorithm, but it neglects modal interaction, i.e., it lacks the ability to interact with doctors or patients and cannot intelligently answer patients' questions or efficiently assist doctors in diagnosis. The present invention improves image recognition capability while also considering interaction with the user, making it more intelligent and improving user engagement.
Therefore, to better assist doctors in auxiliary diagnosis and to allow patients to obtain the basic information of an image without consulting a doctor, there is a need to design an explicit mechanism for learning the correlation between questions and images, and to build a model that processes image features and position features and applies them to the visual question-answering task oriented to medical image diagnosis.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a visual question-answering method for medical image diagnosis. The technical scheme of the invention is as follows:
a visual question-answering method oriented to medical image diagnosis comprises the following steps:
acquiring medical images and the corresponding related medical questions;
extracting features from the image lesion targets and the medical question text respectively, capturing the dependency relationships among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets for matching the multi-modal features;
introducing a cross-guided multi-modal feature fusion stacking scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.
Further, acquiring the medical images and the corresponding related medical questions specifically comprises the following steps:
downloading medical image data and question labels from the Internet, including pictures (mainly CT and MRI scans) together with the questions matched to the pictures and the ground-truth answers corresponding to those questions, to form sets of (picture, question, answer) objects.
Further, extracting features from the image lesion targets and the medical question text respectively specifically comprises:
extracting features from the pictures and the questions: inputting a scan image and extracting the relevant regions in the image with a Faster R-CNN object detection algorithm based on ResNet-101; inputting an English sentence and obtaining the question features through word embedding and a recurrent neural network.
Further, obtaining the image features specifically includes: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and then the local image features are identified and extracted with the Faster R-CNN object detection algorithm to obtain the corresponding lesion information; not only an object detector but also an attribute classifier is used for each region in the image, so that each object bounding box has a corresponding attribute class and a binary description of the object is obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector, which serves as the input of the subsequent network.
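As a rough illustration of this region-feature interface, the sketch below uses stock torchvision components as stand-ins: the patent's detector (Faster R-CNN on ResNet-101 with an extra attribute head) is not an off-the-shelf model, so here a stock detector proposes boxes and a ResNet-101 trunk supplies the 2048-dimensional region descriptors via RoIAlign. The helper name `extract_region_features`, the value K = 36 and the use of `fasterrcnn_resnet50_fpn` are all assumptions made for the sketch, not details from the patent.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

K = 36  # assumed number of object regions kept per image

# Stand-in detector and a ResNet-101 trunk truncated before pooling (2048-channel feature map).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
trunk = nn.Sequential(*list(torchvision.models.resnet101(weights="DEFAULT").children())[:-2]).eval()

@torch.no_grad()
def extract_region_features(image: torch.Tensor):
    """image: float tensor [3, H, W] in [0, 1]; returns (region features [K, 2048], boxes [K, 4])."""
    boxes = detector([image])[0]["boxes"][:K]               # top-K detected boxes in pixel coords
    fmap = trunk(image.unsqueeze(0))                        # [1, 2048, H/32, W/32]
    pooled = roi_align(fmap, [boxes], output_size=(1, 1),   # pool one 2048-d vector per box
                       spatial_scale=1.0 / 32)
    return pooled.flatten(1), boxes                         # [K, 2048], [K, 4]
```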
Further, obtaining the question features specifically includes: the input medical question is first processed into single words, truncated to at most 14 words, with excess words discarded and questions of fewer than 14 words padded with zeros; then the semantic features of the words are captured with a 300-dimensional GloVe word vector model and converted into vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network.
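A minimal sketch of this question branch follows: tokens are truncated or zero-padded to 14, looked up in a 300-dimensional embedding (assumed to be initialized from GloVe), and encoded by an LSTM into 512-dimensional per-word features. The class name, vocabulary size and token ids are placeholders.

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, HID_DIM = 14, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM, padding_idx=0)
        if glove_weights is not None:                # rows aligned with the vocabulary
            self.embed.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

    def forward(self, token_ids):
        # token_ids: [batch, 14], already truncated / zero-padded
        out, _ = self.lstm(self.embed(token_ids))    # [batch, 14, 512] per-word question features
        return out

def pad_or_truncate(ids):
    ids = list(ids)[:MAX_LEN] + [0] * max(0, MAX_LEN - len(ids))
    return torch.tensor(ids)

# usage with a toy vocabulary and an arbitrary word-to-id mapping
encoder = QuestionEncoder(vocab_size=10000)
q = pad_or_truncate([12, 7, 431, 9])
Y = encoder(q.unsqueeze(0))                          # [1, 14, 512]
```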
Furthermore, a self-recognition module is provided to obtain inter-region image features and inter-word question features; the self-recognition module is an attention model that obtains these features through self-correlation learning; its core is an attention mechanism; the input consists of query keys of dimension d_key and values of dimension d_value; first, the dot product of a query key with all keys is computed and each result is divided by √d; then, a softmax function is applied to obtain the weights of the required values; in practice, to compute the attention weights for a set of query keys simultaneously, they are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
Further, the attention model adopts an attention mechanism with H parallel heads, which allows the model to jointly attend to information from different representation subspaces at different positions, and the output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_0
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The self-recognition module consists of the attention mechanism model and a feed-forward network and is used to extract the fine-grained features of the image or the medical question.
After the attention weights are learned, the question features are output and then fed into a LayerNorm layer; the feed-forward layer contains two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer; the final feature Ỹ is obtained through self-attention.
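The self-recognition unit just described could be sketched roughly as below: multi-head self-attention followed by a two-layer feed-forward block, each with a residual connection and LayerNorm. The hidden sizes (512, 8 heads, 2048-wide feed-forward) are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class SelfRecognition(nn.Module):
    def __init__(self, dim=512, heads=8, ff_dim=2048, p=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=p, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Dropout(p), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: [batch, seq, dim]; query, key and value all come from the same modality
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                 # L = LayerNorm(x + MultiHead(x, x, x))
        return self.norm2(x + self.ff(x))     # final feature after the feed-forward block

feats = SelfRecognition()(torch.randn(2, 14, 512))   # e.g. question features Y -> Ỹ
```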
Further, processing the same lesion target by interacting the image features with the position features to realize relational association modeling and obtain the relative positional relationships of different targets specifically comprises:
the input objects consist of the image features X̃ and the position features P, where X̃ is the feature obtained from the self-recognition module and P is a four-dimensional object box;
to compute the position feature weights, an object's coordinates are denoted {x_i, y_i, h_i, w_i}, where x_i is the abscissa of the object center, y_i the ordinate of the object center, w_i the width of the object box and h_i its height. First, the coordinates P are transformed as follows:
ε(P_m, P_n) = ( log(|x_m − x_n| / w_m), log(|y_m − y_n| / h_m), log(w_n / w_m), log(h_n / h_m) )
where m and n denote two object boxes and the transform performs scale normalization and logarithm operations; the N input objects can then be expressed as {(X̃_n, P_n)}, n = 1, …, N.
Next, the geometric features of the two objects are embedded into a high-dimensional feature, denoted ε_G, and W_G is multiplied by the embedded feature to obtain a weight, W_G also being realized by a fully connected layer; the final max operation is similar to a ReLU layer and mainly imposes a limit on the position feature weights:
w_mn^G = max{0, W_G · ε_G(P_m, P_n)}
where w_mn^G is the position feature weight between the two objects, ε_G denotes the embedding of the geometric features into a high-dimensional feature, and P_m, P_n are the geometric features of objects m and n.
The object relation between the n-th object and the whole set is obtained by the following formula:
R(n) = Σ_m w_mn · (W_V · X̃_m)
where R(n) is the object relation between the n-th object and the whole set, X̃_m is the image feature of the m-th object, w_mn is the relation weight between different objects, and W_V performs a linear transformation so that the output is the weighted sum of the image features of the other objects.
The formulas for w_mn and w_mn^A are:
w_mn = w_mn^G · exp(w_mn^A) / Σ_k ( w_kn^G · exp(w_kn^A) )
w_mn^A = ⟨W_K X̃_m, W_Q X̃_n⟩ / √d_k
where w_mn^A is the image feature weight between objects m and n, w_mn^G is the relative position feature weight between objects m and n, k indexes the objects, and w_kn^G is the relative position feature weight between the k-th and the n-th object.
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image features X̃:
X̃_n^rel = X̃_n + Concat[R_1(n), R_2(n), …, R_{N_r}(n)]
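A hedged sketch of one such position-relation unit follows, in the spirit of the formulas above: pairwise box geometry is log-normalized and embedded, turned into a geometric weight w^G, combined with an appearance weight w^A, and used to aggregate linearly transformed region features. For brevity it uses a single relation head (the patent concatenates N_r of them), a plain linear layer as a stand-in for the geometric embedding ε_G, and assumed feature sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationUnit(nn.Module):
    def __init__(self, dim=512, geo_dim=64):
        super().__init__()
        self.geo_embed = nn.Linear(4, geo_dim)   # stand-in for the high-dimensional embedding eps_G
        self.W_G = nn.Linear(geo_dim, 1)         # geometric weight
        self.W_K = nn.Linear(dim, dim)
        self.W_Q = nn.Linear(dim, dim)
        self.W_V = nn.Linear(dim, dim)

    @staticmethod
    def relative_geometry(boxes):
        # boxes: [N, 4] as (x_center, y_center, w, h) -> pairwise log-normalized offsets [N, N, 4]
        x, y, w, h = boxes.unbind(-1)
        dx = torch.log((x[:, None] - x[None, :]).abs().clamp(min=1e-3) / w[:, None])
        dy = torch.log((y[:, None] - y[None, :]).abs().clamp(min=1e-3) / h[:, None])
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        return torch.stack([dx, dy, dw, dh], dim=-1)

    def forward(self, feats, boxes):
        # feats: [N, dim] self-recognized region features, boxes: [N, 4]
        geo = self.geo_embed(self.relative_geometry(boxes))              # [N, N, geo_dim]
        w_g = F.relu(self.W_G(geo)).squeeze(-1)                          # w^G_mn = max(0, W_G * eps_G)
        w_a = (self.W_K(feats) @ self.W_Q(feats).T) / feats.size(-1) ** 0.5
        w = w_g * torch.exp(w_a)
        w = w / w.sum(dim=0, keepdim=True).clamp(min=1e-6)               # normalize over m for each n
        relation = w.T @ self.W_V(feats)                                 # row n = R(n) = sum_m w_mn (W_V x_m)
        return feats + relation                                          # fuse relations with image features

out = RelationUnit()(torch.randn(36, 512), torch.rand(36, 4) + 0.1)
```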
further, the method for capturing the complex interaction relationship among multiple modalities by introducing the cross-guided multi-modality feature fusion stacking manner specifically includes:
the cross guide module consists of a problem guide picture attention module and a problem attention module; updating image region characteristics and problem text characteristics by establishing semantic association between two different modes to obtain more detailed characteristics; performing cross fusion characteristic extraction on the sample image characteristic information and the sample problem characteristic information to obtain an image characteristic vector carrying the sample problem information and a sample problem characteristic vector carrying the sample image information;
the core of the cross-guide attention module is also an attention mechanism, and the input is also represented as Q, K and V; taking the attention model of the problem guide image as an example, the self-recognition feature of the input image
Figure BDA00033888823000000610
Self-identification of questions
Figure BDA00033888823000000611
Mapping to obtain the input and output of the image interactive attention model and the problem interactive attention modelOutputting the model;
after obtaining the image characteristic vector carrying the sample problem information and the sample problem characteristic vector carrying the sample image information, stacking a layer number, wherein N is the layer number of the attention model, and the output of the previous attention layer is used as the input of the next attention layer; the multiple attention model layers are connected with the model of a deeper layer, so that the embedding of the attention model can be guided, the image to be processed and the problem feature can be gradually refined, and the representation capability of the model can be enhanced.
Further, designing and selecting the fusion scheme and the classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis, specifically comprises:
after the effective features X̂ and Ŷ are obtained, they are fed into a linear multi-modal fusion network; the fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of the most frequent answers in the training set:
f = W_x^T X̂ + W_y^T Ŷ
s = Linear(f)
A = sigmoid(s)
where A is the answer distribution predicted by the model.
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; training uses a binary cross-entropy function; the loss value is determined from the true answer and the predicted answer, and the model is updated according to the loss value.
L = − Σ_z^M Σ_k^N [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
where M is the number of training questions, N is the number of candidate answers, ŝ_zk is the predicted answer score output by the model, s_zk is the true answer label, and z and k are the question and answer indices used during training.
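The fusion, classification and loss described above could look roughly like the sketch below; the linear-fusion form, the LayerNorm, the class name and the answer-set size are assumptions made for illustration, with `BCEWithLogitsLoss` standing in for the binary cross-entropy over candidate answers.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    def __init__(self, dim=512, num_answers=1000):       # num_answers = L most frequent answers
        super().__init__()
        self.fuse_x = nn.Linear(dim, dim)
        self.fuse_y = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, num_answers)

    def forward(self, x_hat, y_hat):
        f = self.norm(self.fuse_x(x_hat) + self.fuse_y(y_hat))   # linear multi-modal fusion
        return self.proj(f)                                      # logits s in R^L

clf = AnswerClassifier()
logits = clf(torch.randn(8, 512), torch.randn(8, 512))
targets = torch.zeros(8, 1000); targets[:, 3] = 1.0              # toy ground-truth answer labels
loss = nn.BCEWithLogitsLoss()(logits, targets)                   # binary cross-entropy over answers
pred = torch.sigmoid(logits).argmax(dim=1)                       # A = sigmoid(s); highest score wins
```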
The invention has the following advantages and beneficial effects:
The present invention proposes a visual question-answering method for medical image diagnosis. Given that many existing attention-based VQA methods can only learn a coarse interaction between the modalities, the proposed model lets the two modalities guide each other's attention and obtains the correlation between each image region and the question. Another core idea of the invention is to add position attention, which improves the judgment of the positional relationships of objects in the image and improves the performance of counting objects in the image. The invention can serve as an effective reference for assisting doctors in diagnosis, thereby greatly improving diagnostic efficiency; it can also help patients gain a preliminary understanding of their own physical condition, which helps them choose a more targeted medical plan.
Regarding the method of claims 6-7: the invention first considers that the extracted image features and the medical text features are mutually independent, and uses a self-recognition module to emphasize the respective salient content of the image and the text in order to obtain finer features. Conventional models only consider picture recognition, i.e., a self-attention model is applied only to the picture; the invention emphasizes that text features are equally important and that a question has its own key points and keywords, so the self-recognition module is applied not only to picture processing but also to the medical text question. The finer single-modality models allow better subsequent model fusion.
Regarding the method of claim 8: common visual question-answering models address open-domain questions, and the related data show that the answers of baseline models are often poor and generic when answering questions about positions. The picture features obtained here contain not only the original features but also rich inter-object positional relationships.
Regarding the method of claim 9: common visual question-answering models usually perform multi-modal fusion in a text-guides-picture manner, neglecting that picture information can also guide the text information. The cross-guided multi-modal feature fusion stacking scheme designed by the invention can capture the complex interactions among the modalities. The image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Drawings
FIG. 1 is a flow chart of a visual question-answering method for medical image diagnosis according to a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the first embodiment: as shown in fig. 1, the present invention provides a visual question-answering method for medical image diagnosis, which implements feature fusion for two cross-modal data, namely, picture and text, helps a doctor to perform an auxiliary diagnosis, and enables a patient to use it to obtain basic information of an image without consulting the doctor.
Firstly, we need to download the data set related to the medical image from the internet, and combine the question and the answer of the object to generate a group of objects of pictures, questions and answers, which is convenient for the subsequent learning and training.
Then, the image and the text are preprocessed, namely sample pictures and problem characteristic information are obtained and then input to the main network of the user.
For pictures, a Faster R-CNN + ResNet-101 network is used as the feature extraction network: the residual network ResNet-101 extracts the global image features in the picture, and the Faster R-CNN object detection algorithm then identifies and extracts the local image features. The input picture features are described as X = [x_1, x_2, …, x_m] ∈ R^(m×2048), where m is the number of detected objects in the picture.
For the question, the text is preprocessed and the sentence is written as words with a length not exceeding 14. The words in the question are embedded with 300-dimensional GloVe word vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network. The input question features are described as Y = [y_1, y_2, …, y_n] ∈ R^(n×512), where n is the number of words in the sentence.
For the main network in the figure: first, the self-recognition module is used to extract the features of the image targets and the question text, which reduces the interference of redundant information in the image targets, effectively captures the dependency relationships between the question words for text representation learning, and facilitates the subsequent computation of the correlation between each image region and the question.
The self-recognition module is mainly realized by an attention mechanism: the attention mechanism computes the correlation between the inputs, then performs a weighted sum over all input vectors, and the attended features are computed as the output of multi-head attention. This output is then fed into a feed-forward neural network consisting of fully connected layers, yielding the output of the self-recognition module. The question features Y = [y_1, y_2, …, y_n] ∈ R^(n×512) become Ỹ after passing through a self-recognition module, and the picture features X = [x_1, x_2, …, x_m] ∈ R^(m×2048) become X̃ after passing through a self-recognition module.
Secondly, the image features are processed by the position association module: the same target is processed by interacting the image features with the position features, relational association modeling is realized, the relative positional relationships of different targets are obtained, and the matching capability of the multi-modal features is enhanced.
First, the coordinate information of the objects is obtained and scale normalization and logarithm operations are applied to it. Through R(n) = Σ_m w_mn · (W_V · X̃_m), the object relations between different objects are obtained; after the object relations are obtained, the relation features are fused with the picture features, X̃^rel = X̃ + Concat[R_1(n), …, R_{N_r}(n)], yielding the final picture features X̃^rel.
Then, the cross-guided multi-modal feature fusion scheme is introduced, which can capture the complex interactions among the modalities. The cross-guide model is similar to the self-recognition model, except that the input features do not come from the same modality but are the image features and the text features respectively, and the final features X̃ and Ỹ guide each other.
Then, by deepening the number of layers of the main network, multiple attention model layers are connected into a deeper model, which guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model. After N layers of fusion, the image features are denoted X^(N) and the question features Y^(N).
Finally, a fusion scheme and a classifier are designed and selected to achieve a better effect. The learned joint representation is used for answer prediction. Attention weights for the weighted sums of the two features are obtained by a_x = softmax(MLP(X^(N))) and a_y = softmax(MLP(Y^(N))). The attention weights are multiplied by the picture features, X̂ = Σ_i a_x^i · X_i^(N), to obtain the final picture feature X̂, and the final question feature Ŷ is obtained in the same way. A linear multi-modal fusion scheme f = W_x^T X̂ + W_y^T Ŷ is adopted; the fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of answers with the highest occurrence frequency in the training set. The predicted answer with the highest probability is output as the final predicted answer. The loss value of the loss function is determined from the true answer and the predicted answer, and the model is updated according to the loss value.
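The attention-weighted pooling step just described (a = softmax(MLP(·)) followed by a weighted sum over regions or words) could be sketched as follows; the MLP width and the class name are assumptions.

```python
import torch
import torch.nn as nn

class AttendedPool(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):
        # feats: [batch, seq, dim] -> one attended feature vector [batch, dim]
        a = torch.softmax(self.mlp(feats), dim=1)    # a = softmax(MLP(features))
        return (a * feats).sum(dim=1)                # weighted sum over regions / words

pool = AttendedPool()
x_hat = pool(torch.randn(2, 36, 512))   # attended image feature X̂
y_hat = pool(torch.randn(2, 14, 512))   # attended question feature Ŷ
```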
Second embodiment:
1 Obtaining sample image and medical question feature information
First, medical image data and question labels are downloaded, including pictures (mainly CT and MRI scans) together with the questions matched to the pictures and the ground-truth answers corresponding to those questions, to form sets of (picture, question, answer) objects.
Then, features are extracted from the pictures and the questions. A scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network. The specific operations are as follows:
Picture feature acquisition: to better extract the required picture features, the image information is processed by combining Faster R-CNN with ResNet-101. First, the global image features are extracted with the residual network ResNet-101, and then the local image features are identified and extracted with the Faster R-CNN object detection algorithm to obtain the corresponding lesion information. Not only an object detector but also an attribute classifier is used for each region in the image; each object bounding box has a corresponding attribute class, so a binary description of the object is obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector as the input of the subsequent network.
Question feature acquisition: the input medical question is first processed into single words, truncated to at most 14 words, with excess words discarded and questions of fewer than 14 words padded with zeros. Then the semantic features of the words are captured with a 300-dimensional GloVe word vector model and converted into vectors, and an LSTM network encodes the text features to extract the question semantic feature information as the input of the subsequent network.
2 Image and medical question self-recognition
The self-recognition module is an attention model that obtains inter-region image features and inter-word question features through self-correlation learning. The core of the self-recognition module is an attention mechanism. The input consists of query keys of dimension d_key and values of dimension d_value; typically d_key and d_value are both written as d. First, the dot product of the query key with all keys is computed and each result is divided by √d. Then, the softmax function is applied to obtain the weights of these values. In practice, the attention for a set of query keys is computed simultaneously by packing them into a matrix Q; the keys and values are likewise packed into matrices K and V. The output matrix is computed as:
Attention(Q, K, V) = softmax(Q K^T / √d) V
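As a minimal illustration, the formula above transcribes directly into a few lines; the function name and tensor shapes are illustrative.

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d = Q.size(-1)
    weights = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # softmax(QK^T / sqrt(d))
    return weights @ V

out = attention(torch.randn(14, 64), torch.randn(14, 64), torch.randn(14, 64))
```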
Still further, an attention model with H parallel heads is employed, which allows the model to jointly attend to information from different representation subspaces at different positions, so that a wider area can be attended to at the same time. The output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_0
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The intra-modal self-recognition units consist of an attention model and a feed-forward network and are used to extract the fine-grained features of an image or a medical question. Taking the question features Y = [y_1, y_2, …, y_n] ∈ R^(n×512) as an example, the inputs of the self-recognition module are obtained by:
Q_Y = Y W^Q, K_Y = Y W^K, V_Y = Y W^V
After the attention weights are learned, the question features are output and then fed into the LayerNorm layer:
L_Y = LayerNorm(Y + MultiHead(Q_Y, K_Y, V_Y))
The feed-forward layer contains two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer; the final feature Ỹ is obtained through self-attention:
L'_Y = FF(L_Y) = max(0, L_Y W_1 + b_1) W_2 + b_2
Ỹ = LayerNorm(L_Y + L'_Y)
After the self-recognition module, the medical image and the medical text each focus on their own salient content, redundant information is removed, and subsequent modal interaction and feature fusion are facilitated.
3 Modeling the positional relations of image lesion objects
To better acquire the image features and the positional relations between different targets, after the self-recognition features of the image information are obtained, the output features are fed into a position association unit that works together with the self-recognition module, modeling the image features and the positional relations of different objects simultaneously. This facilitates a better understanding of the image, and positional questions, such as front, back, left, right, foreground and background, can be handled effectively through the positional relation modeling, which makes it easier to locate the lesion area and to provide doctors with effective auxiliary diagnosis.
The input objects consist of the image features X̃ and the position features P, where X̃ is the feature obtained from the self-recognition module and P is a four-dimensional object box.
To compute the position feature weights, the coordinates of P are first transformed as follows:
ε(P_m, P_n) = ( log(|x_m − x_n| / w_m), log(|y_m − y_n| / h_m), log(w_n / w_m), log(h_n / h_m) )
This mainly performs scale normalization and logarithm operations, with the aim of increasing scale invariance so that training does not diverge because of values with an excessively large range. The N input objects can thus be expressed as {(X̃_n, P_n)}, n = 1, …, N.
Then, W_G is multiplied by the embedded feature, where W_G is also realized by a fully connected layer. The final max operation is similar to a ReLU layer and mainly imposes a limit on the position feature weights:
w_mn^G = max{0, W_G · ε_G(P_m, P_n)}
The object relation between the n-th object and the whole set is obtained by the following formula:
R(n) = Σ_m w_mn · (W_V · X̃_m)
where X̃_m is the image feature of the m-th object, w_mn is the relation weight between different objects, and W_V performs a linear transformation so that the output is the weighted sum of the image features of the other objects.
The formulas for w_mn and w_mn^A are:
w_mn = w_mn^G · exp(w_mn^A) / Σ_k ( w_kn^G · exp(w_kn^A) )
w_mn^A = ⟨W_K X̃_m, W_Q X̃_n⟩ / √d_k
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image features X̃:
X̃_n^rel = X̃_n + Concat[R_1(n), R_2(n), …, R_{N_r}(n)]
The main reason for using concatenation here is the small amount of computation: the channel dimension of each R(n) is 1/N_r of that of X̃, so the dimension after concatenation is the same as that of X̃.
4 Image-question cross-guidance
The cross-guide module consists of a question-guided picture attention module and a picture-guided question attention module. The mutually guided attention units pay more attention to the interaction between the modalities; the image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features. Cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Similar to the self-recognition module, the core of the cross-guided attention module is also the attention mechanism, and its inputs are likewise denoted Q, K and V. Taking the question-guided image attention model as an example, the self-recognition feature X̃ of the input image is mapped together with the self-recognition feature Ỹ of the question to obtain the outputs of the image interactive attention model and the question interactive attention model.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention model layers and the output of one attention layer serves as the input of the next. Connecting multiple attention model layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation capability of the model.
5 Model fusion and classifier
Through the learning of the intra-modal self-attention and cross-guided attention mechanisms, features containing rich image and question information are obtained. The image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are simply fused and input into the model classifier, and the predicted answer is obtained through the classifier.
After the effective features X̂ and Ŷ are obtained, they are fed into the linear multi-modal fusion network. The fused feature f is then mapped, by a linear layer followed by a sigmoid function, to a score vector s ∈ R^L, where L is the number of the most frequent answers in the training set:
f = W_x^T X̂ + W_y^T Ŷ
s = Linear(f)
A = sigmoid(s)
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer. The answer with the highest probability among all predicted answers is selected as the final prediction. Training therefore uses a binary cross-entropy function; the loss value of the loss function is determined from the true answer and the predicted answer, and the model is updated according to the loss value:
L = − Σ_z^M Σ_k^N [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
The visual question-answering method for medical image diagnosis disclosed by the invention has the capability of visual question answering, can better help doctors perform auxiliary diagnosis, in particular when judging the positional relations of a lesion, and enables patients to use it to obtain the basic information of an image without consulting a doctor.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1.一种面向医学图像诊断的视觉问答方法,其特征在于,包括以下步骤:1. a visual question answering method for medical image diagnosis, is characterized in that, comprises the following steps: 获取医学影像和对应相关医学问题;Obtaining medical images and corresponding medical problems; 对图像病灶目标和医学问题文本分别进行特征提取,捕捉问题词之间的依赖关系进行文本表示学习,得到每个图像区域和问题的相关性;Perform feature extraction on image lesion targets and medical question texts, capture the dependencies between question words, and perform text representation learning to obtain the correlation between each image area and the question; 通过与影像特征和位置特征交互,对同一病灶目标进行处理,实现关系关联建模,获得不同目标的相对位置关系,用于多模态特征的匹配;By interacting with image features and location features, the same lesion target is processed to achieve relationship association modeling, and the relative positional relationship of different targets can be obtained for multi-modal feature matching; 引入交叉引导的多模态特征融合堆叠方式,捕捉多模态之间的复杂交互关系;The cross-guided multi-modal feature fusion stacking method is introduced to capture the complex interaction between multi-modalities; 设计选取融合方式和分类器,运用到医学问答中,实现面向医学图像诊断的视觉问答研究。Design and select fusion methods and classifiers, and apply them to medical question answering to realize visual question answering research for medical image diagnosis. 2.根据权利要求1所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述医学影像和对应相关医学问题,具体包括以下步骤:2. A kind of visual question answering method oriented to medical image diagnosis according to claim 1, is characterized in that, described medical image and corresponding relevant medical question, specifically comprise the following steps: 在网上下载医学相关影像资料和问题标签,其中包括图片,主要是CT、MRI在内的扫描图像,以及与图片相匹配的问题和问题对应的真实答案,形成图片、问题、答案的一组对象。Download medical-related imaging materials and question labels online, including pictures, mainly scanned images including CT and MRI, as well as real answers to questions and questions that match the pictures, forming a group of objects for pictures, questions, and answers . 3.根据权利要求1或2所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述对图像病灶目标和医学问题文本分别进行特征提取,具体包括:3. The visual question answering method for medical image diagnosis according to claim 1 or 2, wherein the feature extraction is performed on the image focus target and the medical question text respectively, specifically comprising: 对图片和问题进行特征提取:输入一幅扫描图像,使用基于ResNet-101的Faster R-CNN的目标检测算法提取图像中的相关区域;输入一个英文句子,通过词嵌入和循环神经网络后得到问题特征。Feature extraction for pictures and questions: input a scanned image, use the ResNet-101-based Faster R-CNN target detection algorithm to extract relevant areas in the image; input an English sentence, get the question after word embedding and recurrent neural network feature. 4.根据权利要求3所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述图片特征获取具体包括:采用Faster-RCNN与Resnet101相结合的方式处理图像信息:首先利用残差网络Resnet101提取影像中的全局图像特征,然后根据目标检测算法Faster-RCNN来识别抽取图像的局部特征,获得相应病灶信息;对图像中的每一个区域不仅使用对象检测器,还使用属性分类器,每一个对象包围框都有一个对应的属性类,这样可以获得对象的二元描述,每幅图像提取K个对象区域,每个对象区域用一个2048维的向量表示,作为后续网络的输入。4. a kind of visual question answering method oriented to medical image diagnosis according to claim 3, it is characterized in that, described picture characteristic acquisition specifically comprises: adopt the mode of Faster-RCNN and Resnet101 combination to process image information: first utilize residual error The network Resnet101 extracts the global image features in the image, and then identifies the local features of the extracted image according to the target detection algorithm Faster-RCNN to obtain the corresponding lesion information; for each area in the image, not only the object detector but also the attribute classifier is used. 
Each object bounding box has a corresponding attribute class, so that the binary description of the object can be obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector as the input of the subsequent network. 5.根据权利要求3或4所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述问题特征获取具体包括:输入的医学问题首先会被处理为单个单词,最长截取为14个单词,多余的丢弃,少于14个的用零填充;然后结合300维的GloVe词向量模型捕捉单词的语义特征,转化为向量模式,再利用LSTM网络对文本特征进行编码从而抽取问题语义特征信息,作为后续网络的输入。5. A kind of visual question answering method oriented to medical image diagnosis according to claim 3 or 4, it is characterized in that, described question characteristic acquisition specifically comprises: the medical question of input can be processed as a single word at first, and the longest interception is 14 words, discard the redundant ones, fill with zeros for less than 14 words; then combine the 300-dimensional GloVe word vector model to capture the semantic features of the words, convert them into vector patterns, and then use the LSTM network to encode the text features to extract the semantics of the problem feature information, as the input of the subsequent network. 6.根据权利要求5所述的一种面向医学图像诊断的视觉问答方法,其特征在于,还通过设置一个自我识别模块来获取影像区域间特征和问题词间特征,自我识别模块对是一种注意模型,通过自相关学习获取影像区域间特征和问题词间特征;自我识别模块的核心是注意力机制;该输入由一个维度为d_key的查询键和一个维度为d_value的值组成;首先,计算查询键与所有键的点积,并将每个键除以√d;然后,应用softmax函数获得需要的值的权重;实际上,为了同步计算一组查询键的注意权重,将它们打包到矩阵Q中;键和值也被打包到矩阵K和V中。6. a kind of visual question answering method for medical image diagnosis according to claim 5, is characterized in that, also by setting a self-identification module to obtain the feature between image regions and the feature between question words, the self-identification module is a kind of The attention model obtains the features between image regions and between words in the question through autocorrelation learning; the core of the self-recognition module is the attention mechanism; the input consists of a query key with dimension d_key and a value with dimension d_value; first, calculate Do the dot product of the query key with all keys, and divide each key by √d; then, apply the softmax function to get the weights of the desired values; in fact, to simultaneously compute the attention weights for a set of query keys, pack them into a matrix into Q; keys and values are also packed into matrices K and V. 7.根据权利要求6所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述注意模型采用H个并行头的注意力机制模型,它允许模型同时关注来自不同位置的不同表示子空间的信息,将输出特征矩阵计算为:7. A visual question answering method for medical image diagnosis according to claim 6, wherein the attention model adopts the attention mechanism model of H parallel heads, which allows the model to simultaneously pay attention to different representations from different positions information of the subspace, the output feature matrix is computed as: F=MultiHead(Q,K,V)=Concat([head1,head2,…headH])W0 F=MultiHead(Q,K,V)=Concat([head 1 ,head 2 ,...head H ])W 0
Figure FDA0003388882290000021
Figure FDA0003388882290000021
自我识别模块由注意力机制模型和前馈网络组成,用于提取影像或医学问题的细微特征;The self-recognition module consists of an attention mechanism model and a feed-forward network to extract subtle features of imaging or medical problems; 学习注意力特征得到权重后,输出问题特征;然后将它们输入LayerNorm层;前馈层包含两个全连接层以及ReLu函数和Dropout函数,最后是LayerNorm层,经过自我关注得到最终的特征
Figure FDA0003388882290000022
After learning the attention features and getting the weights, the problem features are output; then they are input into the LayerNorm layer; the feedforward layer contains two fully connected layers and the ReLu function and the Dropout function, and finally the LayerNorm layer, after self-attention to get the final features
Figure FDA0003388882290000022
8.根据权利要求7所述的一种面向医学图像诊断的视觉问答方法,其特征在于,所述通过与影像特征和位置特征交互,对同一病灶目标进行处理,实现关系关联建模,获得不同目标的相对位置关系,具体包括:8. A visual question answering method for medical image diagnosis according to claim 7, characterized in that, by interacting with image features and location features, the same lesion target is processed to realize relational association modeling and obtain different The relative positional relationship of the target, including: 输入对象由图像特征
Figure FDA0003388882290000038
和位置特征P组成,
Figure FDA0003388882290000039
是经过自我识别模块得到的特征,P是一个四维对象框;
The input object consists of image features
Figure FDA0003388882290000038
and the position feature P,
Figure FDA0003388882290000039
is the feature obtained by the self-identification module, and P is a four-dimensional object frame;
为了计算位置特征权重,将一个对象坐标表示为{xi,yi,hi,wi},其中xi表示对象中心的横坐标位置,yi表示对象中心的纵坐标位置,wi表示对象框的宽度,hi表示对象框的高度。首先,对P的坐标进行如下变换,
Figure FDA0003388882290000031
m、n分别表示两个对象框,进行尺度归一化和对数运算;输入的N个对象可以表示为
Figure FDA0003388882290000032
In order to calculate the position feature weight, the coordinates of an object are expressed as {x i , y i , h i , w i }, where x i represents the abscissa position of the object center, y i represents the ordinate position of the object center, and wi represents the The width of the object box, hi represents the height of the object box. First, the coordinates of P are transformed as follows,
Figure FDA0003388882290000031
m and n represent two object boxes, respectively, for scale normalization and logarithmic operations; the input N objects can be expressed as
Figure FDA0003388882290000032
接着,将两个物体的几何特征嵌入到高维特征中表示为εG,将WG与嵌入特征相乘,得到一个权重,其中的WG也是由一个全连接层实现的;最后的max操作类似于relu层,其主要目的是对位置特征权重施加一定的限制;Next, the geometric features of the two objects are embedded into the high-dimensional features and expressed as ε G , and WG is multiplied by the embedded feature to obtain a weight, where WG is also implemented by a fully connected layer; the final max operation Similar to the relu layer, its main purpose is to impose certain restrictions on the position feature weights;
Figure FDA0003388882290000033
Figure FDA0003388882290000033
Figure FDA0003388882290000034
表示两个对象之间的位置特征权重。
Figure FDA0003388882290000034
Represents the positional feature weight between two objects.
εG表示将几何特征嵌入到高维特征。 εG represents the embedding of geometric features into high-dimensional features. Pm,Pn表示m、n两个对象的几何特征。P m , P n represent the geometric features of m and n objects. 通过下列公式可以得到第n个对象与整个集合之间的对象关系;The object relationship between the nth object and the entire collection can be obtained by the following formula;
Figure FDA0003388882290000035
Figure FDA0003388882290000035
R(n)表示第n个对象与整个集合之间的对象关系。R(n) represents the object relationship between the nth object and the entire collection.
Figure FDA0003388882290000036
表示第m个物体的图像特征,wmn为不同物体之间关系的权重,WV用于线性变化,最终得到其他物体图像特征的加权和;
Figure FDA0003388882290000036
Represents the image feature of the mth object, w mn is the weight of the relationship between different objects, W V is used for linear change, and finally the weighted sum of the image features of other objects is obtained;
The weights w_mn and w_A^{mn} are computed as follows:
w_mn = w_G^{mn} · exp(w_A^{mn}) / Σ_k w_G^{kn} · exp(w_A^{kn})
w_A^{mn} = (W_K f_A^m)ᵀ (W_Q f_A^n) / √d_k
where w_A^{mn} denotes the image feature weight between objects m and n, w_G^{mn} denotes the relative position feature weight between objects m and n, k ranges over the objects in the set, w_G^{kn} denotes the relative position feature weight between the k-th object and the n-th object, W_K and W_Q are linear transformations, and d_k is the dimension of the transformed features.
After the relation features R(n) are obtained, the last step is to fuse the N_r relation features and then fuse them with the image feature f_A^n:
f_A^n = f_A^n + Concat[ R_1(n), …, R_{N_r}(n) ]
that is, the N_r relation features are concatenated and added to the image feature of the n-th object.
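As an illustration of the relation computation and the fusion step, here is a compact sketch of one relation head and of the concatenation of N_r heads. The scaled dot-product form of w_A^{mn}, the per-head dimensions, and the projection applied to the concatenated heads are assumptions; the softmax-style normalisation implements w_mn = w_G^{mn}·exp(w_A^{mn}) / Σ_k w_G^{kn}·exp(w_A^{kn}).

```python
import torch
import torch.nn as nn

class ObjectRelationHead(nn.Module):
    """One relation head: combines appearance weights w_A and position weights w_G
    into w_mn and returns R(n), the weighted sum of the other objects' features."""
    def __init__(self, dim=512, key_dim=64):
        super().__init__()
        self.w_q = nn.Linear(dim, key_dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, key_dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)       # W_V
        self.scale = key_dim ** 0.5

    def forward(self, f_a, w_g):
        # f_a: (N, dim) object features; w_g: (N, N) with w_g[m, n] = w_G^{mn}
        q, k, v = self.w_q(f_a), self.w_k(f_a), self.w_v(f_a)
        w_a = (k @ q.t()) / self.scale                    # w_A^{mn} at position [m, n]
        logits = torch.log(w_g.clamp(min=1e-6)) + w_a     # log(w_G) + w_A
        w_mn = torch.softmax(logits, dim=0)               # normalise over m for each n
        return w_mn.t() @ v                               # R(n) = sum_m w_mn * (W_V f_A^m)

class RelationFusion(nn.Module):
    """Concatenates N_r relation heads and adds them back to the image features."""
    def __init__(self, dim=512, n_r=8):
        super().__init__()
        self.heads = nn.ModuleList([ObjectRelationHead(dim) for _ in range(n_r)])
        self.proj = nn.Linear(dim * n_r, dim, bias=False)  # assumed projection back to dim

    def forward(self, f_a, w_g):
        r = torch.cat([head(f_a, w_g) for head in self.heads], dim=-1)
        return f_a + self.proj(r)   # f_A^n + Concat[R_1(n), ..., R_Nr(n)]
```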
9. The visual question answering method for medical image diagnosis according to claim 8, characterized in that introducing a cross-guided multimodal feature fusion stacking scheme to capture the complex interactions between the modalities specifically comprises:
The cross-guidance module consists of a question-guided image attention module and an image-guided question attention module. The image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features; cross-fusion feature extraction is performed on the sample image feature information and the sample question feature information, yielding an image feature vector that carries the sample question information and a sample question feature vector that carries the sample image information.
The core of the cross-guided attention module is likewise an attention mechanism, and its input is likewise expressed as Q, K, V. Taking the question-guided image attention model as an example, the self-recognition features of the input image are mapped with the self-recognition features of the question to obtain the output of the image cross-attention model and the output of the question cross-attention model, as sketched after this claim.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the attention layers are stacked, where N is the number of attention model layers and the output of the previous attention layer serves as the input of the next attention layer. Connecting multiple attention model layers into a deeper model guides the embedding of the attention model, progressively refines the image and question features to be processed, and enhances the representational ability of the model.
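A minimal sketch of one cross-guided (guided-attention) layer and of the layer stacking follows. Which modality supplies the queries and which supplies the keys/values, as well as the alternating update order in the stacking loop, are assumptions made for illustration; the claim only states that one modality guides the attention over the other and that each layer's output feeds the next layer.

```python
import torch
import torch.nn as nn

class GuidedAttentionLayer(nn.Module):
    """Cross-guided attention: `guide` (e.g. the question) steers the attention
    over `target` (e.g. the image regions), followed by a feed-forward network."""
    def __init__(self, dim=512, num_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target, guide):
        # Queries come from `target`, keys/values from `guide` (assumed assignment).
        attn_out, _ = self.attn(target, guide, guide)
        x = self.norm1(target + attn_out)
        return self.norm2(x + self.ffn(x))

def stack_cross_guided(img, qst, img_layers, qst_layers):
    """Stack N cross-guided layers; each layer's output feeds the next layer."""
    for img_layer, qst_layer in zip(img_layers, qst_layers):
        img = img_layer(img, qst)   # question-guided image attention
        qst = qst_layer(qst, img)   # image-guided question attention
    return img, qst
```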
10. The visual question answering method for medical image diagnosis according to claim 8, characterized in that designing and selecting the fusion scheme and the classifier and applying them to medical question answering, so as to realize visual question answering for medical image diagnosis, specifically comprises:
After the effective (refined) image and question features are obtained, they are fed into a linear multimodal fusion network to obtain the fused feature f; the fused feature f is then mapped to a vector s in the space R^L, where L is the number of the most frequent answers in the training set, and a sigmoid function converts s into the answer predictions:
s = Linear(f)
A = sigmoid(s)
where A denotes the answers predicted by the model.
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction. A binary cross-entropy function is used for the regression: the loss value of the loss function is determined from the ground-truth answers and the predicted answers, and the model is updated according to the loss value,
Loss = − Σ_{z=1}^{M} Σ_{k=1}^{N} [ s_zk · log(ŝ_zk) + (1 − s_zk) · log(1 − ŝ_zk) ]
where M denotes the number of training questions, N denotes the number of candidate answers, ŝ_zk denotes the predicted answer output by the model, s_zk denotes the ground-truth answer, and z, k are respectively the question and answer indices during training.
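To make the classifier and training objective concrete, here is a small sketch. The additive form of the linear fusion (projecting each modality and summing, followed by LayerNorm) is an assumption, since the claim only states that a linear multimodal fusion network is used; the Linear → sigmoid classifier and the binary cross-entropy objective follow the formulas above.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Linear multimodal fusion followed by a sigmoid answer classifier.
    num_answers corresponds to L, the most frequent answers in the training set."""
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.proj_img = nn.Linear(dim, dim)   # projection of the image feature (assumed)
        self.proj_qst = nn.Linear(dim, dim)   # projection of the question feature (assumed)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, f_img, f_qst):
        f = self.norm(self.proj_img(f_img) + self.proj_qst(f_qst))  # fused feature f
        s = self.classifier(f)          # s = Linear(f), s in R^L
        return torch.sigmoid(s)         # A = sigmoid(s)

# Binary cross-entropy over the candidate answers; targets are soft scores in [0, 1].
bce = nn.BCELoss(reduction='sum')
# predicted = model(f_img, f_qst); loss = bce(predicted, target_scores); loss.backward()
```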
CN202111461563.7A 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis Active CN114201592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111461563.7A CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111461563.7A CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Publications (2)

Publication Number Publication Date
CN114201592A true CN114201592A (en) 2022-03-18
CN114201592B CN114201592B (en) 2024-07-23

Family

ID=80650233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111461563.7A Active CN114201592B (en) 2021-12-02 2021-12-02 Visual question-answering method for medical image diagnosis

Country Status (1)

Country Link
CN (1) CN114201592B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019211250A1 (en) * 2018-04-30 2019-11-07 Koninklijke Philips N.V. Visual question answering using on-image annotations
WO2020263711A1 (en) * 2019-06-28 2020-12-30 Facebook Technologies, Llc Memory grounded conversational reasoning and question answering for assistant systems
CN112818889A (en) * 2021-02-09 2021-05-18 北京工业大学 Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal
CN113240046A (en) * 2021-06-02 2021-08-10 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering", 《THE JOURNAL OF SUPERCOMPUTING》, 29 March 2023 (2023-03-29), pages 13696 *
A LUBNA et al.: "MoBVQA: A Modality based Medical Image Visual Question Answering System", 《TENCON 2019 - 2019 IEEE REGION 10 CONFERENCE (TENCON)》, 20 October 2019 (2019-10-20), pages 727 - 732, XP033672617, DOI: 10.1109/TENCON.2019.8929456 *
张礼阳: "Research on Visual Question Answering Combining Visual Content Understanding and Textual Information Analysis", 《China Master's Theses Full-text Database, Information Science and Technology》, no. 07, 15 July 2020 (2020-07-15), pages 138 - 844 *
陈珂佳: "Research on Visual Question Answering Based on Deep Learning", 《Master's thesis, Chongqing University of Posts and Telecommunications》, 16 April 2024 (2024-04-16), pages 1 - 86 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780701A (en) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780701B (en) * 2022-04-20 2024-07-02 平安科技(深圳)有限公司 Automatic question-answer matching method, device, computer equipment and storage medium
CN114780775A (en) * 2022-04-24 2022-07-22 西安交通大学 Image description text generation method based on content selection and guide mechanism
CN114821245A (en) * 2022-05-30 2022-07-29 大连大学 Medical visual question-answering method based on global visual information intervention
CN114821245B (en) * 2022-05-30 2024-03-26 大连大学 Medical visual question-answering method based on global visual information intervention
CN117648976A (en) * 2023-11-08 2024-03-05 北京医准医疗科技有限公司 Answer generation method, device, equipment and storage medium based on medical image
CN117648976B (en) * 2023-11-08 2025-02-21 北京医准医疗科技有限公司 Medical image-based answer generation method, device, equipment and storage medium
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Visual solution method for medical imaging problems based on fine-grained cross-attention
CN117407541A (en) * 2023-12-15 2024-01-16 中国科学技术大学 Knowledge graph question-answering method based on knowledge enhancement
CN117407541B (en) * 2023-12-15 2024-03-29 中国科学技术大学 A knowledge graph question answering method based on knowledge enhancement
CN118471487A (en) * 2024-07-12 2024-08-09 福建自贸试验区厦门片区Manteia数据科技有限公司 Diagnosis and treatment plan generation device and electronic device based on multi-source heterogeneous data
CN119090895A (en) * 2024-11-11 2024-12-06 浙江杜比医疗科技有限公司 A method and device for processing mammary gland optical images, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN114201592B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN114201592A (en) Visual Question Answering Method for Medical Image Diagnosis
Arevalo et al. Gated multimodal networks
Sharma et al. MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain
CN110491502A (en) Microscope video stream processing method, system, computer equipment and storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Bodapati et al. Msenet: Multi-modal squeeze-and-excitation network for brain tumor severity prediction
Halvardsson et al. Interpretation of swedish sign language using convolutional neural networks and transfer learning
CN114612902B (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN113806609A (en) A Multimodal Sentiment Analysis Method Based on MIT and FSM
CN116129141B (en) Medical data processing method, apparatus, device, medium and computer program product
CN110705490A (en) Visual Emotion Recognition Methods
CN119088945A (en) A large-scale language model question answering system based on health care knowledge
Thangavel et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models
Arun Prasath et al. Prediction of sign language recognition based on multi layered CNN
CN116484042A (en) Visual question-answering method combining autocorrelation and interactive guided attention mechanism
Chen et al. Breast cancer classification with electronic medical records using hierarchical attention bidirectional networks
Nahar et al. A robust model for translating arabic sign language into spoken arabic using deep learning
Renjith et al. Sign language recognition by using spatio-temporal features
CN113779298A (en) A compound loss-based method for medical visual question answering
Prusty et al. Enhancing medical image classification with generative AI using latent denoising diffusion probabilistic model and wiener filtering approach
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
CN118334549A (en) Short video label prediction method and system for multi-mode collaborative interaction
CN117035019A (en) Data processing method and related equipment
Cai et al. Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
Habib et al. Exploring Progress in Text-to-Image Synthesis: An In-Depth Survey on the Evolution of Generative Adversarial Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant