CN114201592A - Visual question-answering method for medical image diagnosis - Google Patents
- Publication number
- CN114201592A (application CN202111461563.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- question
- features
- medical
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Probability & Statistics with Applications (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention claims protection for a visual question-answering method for medical image diagnosis, which belongs to the fields of medical image processing, natural language processing and multi-modal fusion and comprises the following steps: acquiring medical images and the corresponding medical questions; extracting features from the image lesion targets and the medical question text respectively, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question; processing each lesion target through the interaction of its image features and position features, realizing relational modeling and obtaining the relative positional relationships of different targets, so as to match the multi-modal features; introducing a cross-guided, stacked multi-modal feature fusion scheme to capture the complex interactions among the modalities; and designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.
Description
Technical Field
The invention belongs to the fields of medical image processing, natural language processing and multi-modal fusion, and particularly relates to a visual question-answering method for medical image diagnosis.
Background
Health has always been one of humanity's foremost concerns, and with the continuous development of deep learning it has become increasingly important to use different tools and techniques to help doctors make diagnoses and to help patients better understand their own physical condition. Medical imaging is an extremely important tool for physicians in clinical analysis and diagnosis. However, the information that different doctors obtain from a medical image may vary, and the number of doctors is far smaller than the number of patients, so doctors often suffer physical and mental fatigue and find it difficult to answer all of the patients' questions manually.
In Visual Question Answering (VQA), a picture is given, the question content is input, and the system selects an appropriate answer according to the feature information of the picture and outputs it in natural language. A good visual question-answering model oriented to medical image diagnosis can automatically extract the information contained in a medical image, capture the location of a lesion and so on; it can provide a radiologist with a second opinion on image analysis, realize auxiliary diagnosis, and help strengthen the radiologist's confidence in interpreting complex medical images. Meanwhile, such a VQA model can help patients gain a preliminary understanding of their own physical condition, which helps in choosing a more targeted medical plan.
However, current mainstream visual question-answering models often ignore the fine-grained interactions between the image and the question. In fact, learning the keywords in the question and obtaining the position information of different regions of the image can provide useful clues for answer reasoning. Directly applying a mainstream model to medical image diagnosis therefore still has several shortcomings. First, most existing methods only realize a coarse interaction between the image and the question and cannot capture the correlation between each image region and the question. Second, the inherent dependencies between words at different positions in a sentence cannot be captured effectively. Third, existing methods only extract the appearance features of the image and lack spatial features, so they cannot resolve the relations between different objects in the image.
Through retrieval, publication No. CN113516182A provides a training method and apparatus for a visual question-answering model, and a visual question-answering method and apparatus. The method comprises: acquiring picture samples and question samples for training a visual question-answering model; performing feature extraction on the picture samples to obtain picture sample features, and on the question samples to obtain question sample features; determining a latent relation variable between the picture sample features and the question sample features, the latent relation variable being used to represent whether the picture sample and the question sample are related; and training the visual question-answering model according to the latent relation variable, the picture sample features and the question sample features to obtain a target visual question-answering model used for visual question answering. With this method, answers of relatively high accuracy can still be given to ambiguous questions. Compared with the present invention, its graph convolution can better understand semantic information, but its complexity is higher. In addition, that technique uses a first-weight mode and a second-weight mode to obtain the picture features and the text features separately, and lacks the finer interaction obtained by stacking layers. Furthermore, the present invention also introduces a position association module, which attends to the positional relations between different objects while the image features and the question features interact deeply.
CN110321946A discloses a multimodal medical image recognition method and apparatus based on deep learning, which collects medical image data with medical imaging equipment; enhances the acquired images with an image enhancement algorithm; extracts the features of the acquired images with an extraction program; identifies the extracted features with a recognition program; converts medical images of different modalities with a conversion program; prints the acquired images with a printer; and displays the acquired medical image data with a display. That invention improves the image feature extraction effect through its image feature extraction module; its modality conversion module adopts three-dimensional reconstruction, registration and segmentation to ensure that the images of the first modality and the second modality are highly matched; in addition, it divides the training image into several image blocks, reducing the hardware requirements of inputting the whole training image. That technique uses a feature recognition program to identify the extracted features and an image enhancement algorithm to improve recognition ability, but it neglects modal interaction, i.e., it lacks the ability to interact with doctors or patients and cannot intelligently answer patients' questions or efficiently assist doctors in diagnosis. The present invention improves the image recognition ability while also considering interaction with the user, making the method more intelligent and improving user participation.
Therefore, in order to better assist doctors in auxiliary diagnosis and to allow patients to obtain the basic information of an image without consulting a doctor, it is necessary to design an explicit mechanism to learn the correlation between questions and images, and to build a model that processes image features and position features and to apply it to the visual question-answering task oriented to medical image diagnosis.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art by providing a visual question-answering method for medical image diagnosis. The technical scheme of the invention is as follows:
A visual question-answering method oriented to medical image diagnosis comprises the following steps:
acquiring medical images and the corresponding medical questions;
extracting features from the image lesion targets and the medical question text respectively, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing each lesion target through the interaction of its image features and position features, realizing relational modeling, obtaining the relative positional relationships of different targets, and matching the multi-modal features;
introducing a cross-guided, stacked multi-modal feature fusion scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.
Further, acquiring the medical images and the corresponding medical questions specifically comprises the following steps:
downloading medically related image data and question labels from the network, the images comprising pictures, mainly CT and MRI scans, together with the questions matched to the pictures and the real answers corresponding to the questions, and forming groups of (picture, question, answer) objects.
Further, extracting features from the image lesion targets and the medical question text respectively specifically comprises:
performing feature extraction on the pictures and the questions: a scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network.
Further, the image feature acquisition specifically comprises: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and the local features of the image are then identified and extracted with the object detection algorithm Faster R-CNN to obtain the corresponding lesion information; for each region in the image, not only an object detector but also an attribute classifier is used, so that each object bounding box has a corresponding attribute class and a binary description of the object can be obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
Further, the question feature acquisition specifically comprises: the input medical question is first split into single words and truncated to at most 14 words, the excess being discarded and questions shorter than 14 words being padded with zeros; the semantic features of the words are then captured with a 300-dimensional GloVe word vector model and converted into vectors, and the text features are encoded with an LSTM network so as to extract the semantic feature information of the question as the input of the subsequent network.
Furthermore, a self-recognition module is provided to obtain the features among image regions and the features among question words; the self-recognition module is an attention model that obtains these features through self-correlation learning, and its core is the attention mechanism; the input consists of queries and keys of dimension d_key and values of dimension d_value; first, the dot products of the query with all keys are computed and each is divided by √d_key; a softmax function is then applied to obtain the weights of the values; in practice, to compute the attention weights of a set of queries simultaneously, they are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
Further, the attention model adopts an attention mechanism with H parallel heads, which allows the model to attend simultaneously to information from different representation subspaces at different positions, and the output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
the self-recognition module consists of the attention mechanism and a feed-forward network and is used to extract the fine-grained features of the image or the medical question;
the question features are output after the learned attention features have been weighted, and are then fed into a LayerNorm layer; the feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final features are obtained through self-attention.
Further, processing each lesion target through the interaction of its image features and position features, so as to realize relational modeling and obtain the relative positional relationships of different targets, specifically comprises:
the input consists of the image features X′ of the objects and the position features P, where X′ is the feature obtained by the self-recognition module and P is the four-dimensional object box;
to compute the position feature weights, the coordinates of one object are represented as {x_i, y_i, w_i, h_i}, where x_i denotes the abscissa of the object center, y_i the ordinate of the object center, w_i the width of the object box and h_i its height; first, the coordinates P of each pair of object boxes m and n are transformed by scale normalization and a logarithm operation, so that the N input objects can be expressed together with their relative geometric features;
next, the geometric features of the two objects are embedded into a high-dimensional feature ε_G, which is multiplied by W_G to obtain the weight, where W_G is also realized by a fully connected layer; the final max operation, similar to a ReLU layer, mainly imposes a restriction on the position feature weights:
w_G^{mn} = max(0, W_G · ε_G(P_m, P_n))
where ε_G denotes the embedding of the geometric features into high-dimensional features, and P_m, P_n denote the geometric features of objects m and n;
the object relation between the n-th object and the entire set is obtained from the following formula:
R(n) = Σ_m w_mn · (W_V X′_m)
where R(n) denotes the object relation between the n-th object and the entire set, X′_m denotes the image features of the m-th object, w_mn is the relation weight between different objects, the output is the weighted sum of the image features of the other objects after the linear transformation W_V, and K denotes the number of objects;
after the relation features R(n) are obtained, the last step is to concatenate the Nr relation features and fuse them with the image features X′.
Further, introducing the cross-guided, stacked multi-modal feature fusion scheme to capture the complex interactions among the modalities specifically comprises:
the cross-guidance module consists of a question-guided picture attention module and a picture-guided question attention module; the image region features and the question text features are updated by establishing semantic associations between the two different modalities, so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information;
the core of the cross-guided attention module is likewise the attention mechanism, and its inputs are again denoted Q, K and V; taking the question-guided image attention model as an example, the self-recognition features of the input image and the self-recognition features of the question are mapped to obtain the inputs, and the outputs of the image interactive attention model and the question interactive attention model are produced;
after the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention layers and the output of the previous attention layer serves as the input of the next; connecting multiple attention layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation ability of the model.
Further, designing and selecting the fusion scheme and the classifier, applied to medical question answering to realize visual question answering oriented to medical image diagnosis, specifically comprises:
after the effective image feature and question feature are obtained, they are fed into a linear multi-modal fusion network; the fused feature f is then mapped through a sigmoid function into a vector space s ∈ R^L, where L is the number of the most frequent answers in the training set;
s = Linear(f)
A = sigmoid(s)
where A denotes the answer predicted by the model;
the final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; regression and prediction use a binary cross-entropy loss; the loss value of the loss function is determined from the real answer and the predicted answer, and the model is updated according to the loss value:
Loss = −Σ_{z=1}^{M} Σ_{k=1}^{N} [ s_zk log(A_zk) + (1 − s_zk) log(1 − A_zk) ]
where M denotes the number of training questions, N denotes the number of candidate answers, s_zk denotes the true answer label, A_zk denotes the predicted probability, and z and k are the corresponding indices during training.
The invention has the following advantages and beneficial effects:
in the present invention, we propose a visual question-answering method for medical image diagnosis. Given that many of the existing attention-based VQA methods can only learn the rough interaction of multimodal instances, our models can direct each other's attention and get a correlation between each image region and the problem. The other core idea of the invention is to increase the position attention, which can improve a judgment on the position relation of the object in the image and improve the counting performance of the object in the image. The invention can be used as an effective reference for assisting diagnosis of doctors, thereby greatly improving the diagnosis efficiency; the invention can help the patient to preliminarily know the self physical condition, thereby being beneficial to selecting a more targeted medical scheme.
Regarding the method of claims 6-7: the invention first considers that the extracted image features and the medical text features are mutually independent, and uses the self-recognition module to emphasize the respective key content of the image and the text in order to obtain finer features. Conventional models only consider picture recognition, i.e., self-attention models and the like are applied only to the picture; the invention stresses that text features are equally important and that the question contains key points and keywords, so the self-recognition module is applied not only to picture processing but also to the medical text question. The finer single-modality representations allow better subsequent model fusion.
Regarding the method of claim 8: common visual question-answering models target open-domain questions, and the related data show that when answering questions about positions the answers of baseline models are often poor and generic. The picture features obtained here contain not only the original appearance features but also rich positional relations between objects.
Regarding the method of claim 9: common visual question-answering models usually perform multi-modal fusion only in a text-guided-picture manner, neglecting that the picture information can also guide the text information. The cross-guided, stacked multi-modal feature fusion scheme designed by the invention can capture the complex interactions among the modalities. The image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features, and cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Drawings
FIG. 1 is a flow chart of a visual question-answering method for medical image diagnosis according to a preferred embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the first embodiment: as shown in fig. 1, the present invention provides a visual question-answering method for medical image diagnosis, which implements feature fusion for two cross-modal data, namely, picture and text, helps a doctor to perform an auxiliary diagnosis, and enables a patient to use it to obtain basic information of an image without consulting the doctor.
Firstly, we need to download the data set related to the medical image from the internet, and combine the question and the answer of the object to generate a group of objects of pictures, questions and answers, which is convenient for the subsequent learning and training.
Then, the image and the text are preprocessed, namely sample pictures and problem characteristic information are obtained and then input to the main network of the user.
For the pictures, a Faster R-CNN + ResNet-101 network is used as the feature extraction network: the residual network ResNet-101 extracts the global image features in the picture, and the object detection algorithm Faster R-CNN then identifies and extracts the local features of the image. The input picture features are described as X = [x_1, x_2, …, x_m] ∈ R^{m×2048}, where m represents the number of objects in the picture.
For the question, the text is preprocessed and the sentence is split into words with a length of at most 14. The words in the question are embedded with 300-dimensional GloVe word vectors, and the text features are encoded with an LSTM network so as to extract the semantic feature information of the question as the input of the subsequent network. The input question features are described as Y = [y_1, y_2, …, y_n] ∈ R^{n×512}, where n is the number of words in the sentence.
For the backbone network in the figure: first, the self-recognition module is used to extract the features of the image targets and the question text, which reduces the interference of redundant information in the image targets, effectively captures the dependencies between the question words for text representation learning, and facilitates the subsequent acquisition of the correlation between each image region and the question.
The self-recognition module is mainly realized by an attention mechanism: the attention mechanism computes the correlations between the inputs, performs a weighted sum over all input vectors, and takes the attended features as the output of the multi-head attention. This output is then fed into a feed-forward neural network consisting of fully connected layers, yielding the output of the self-recognition module. The question features Y = [y_1, y_2, …, y_n] ∈ R^{n×512} give Y′ after passing through a self-recognition module, and the picture features X = [x_1, x_2, …, x_m] ∈ R^{m×2048} give X′ after passing through a self-recognition module.
Second, the image features are processed by the position association module: each target is processed through the interaction of its image features and position features, realizing relational modeling, obtaining the relative positional relationships of different targets, and enhancing the matching ability of the multi-modal features.
First, the coordinate information of the objects is obtained, and scale normalization and a logarithm operation are applied to it. The object relations between the different objects are then computed, and once the object relations are obtained, the relation features are fused with the picture features X′ to obtain the final picture features.
Then, a cross-guided multi-modal feature fusion scheme is introduced, which can capture the complex interactions among the modalities. The cross-guidance model is similar to the self-recognition model, except that the input features are not from the same group but are the image features and the text features respectively, and the final features are determined by mutual guidance.
Then, by deepening the number of layers of the backbone network, multiple attention layers are connected into a deeper model, which guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation ability of the model. After the fusion of N layers, the image features are represented as X^(N) and the question features as Y^(N).
Finally, a fusion scheme and a classifier are designed and selected to achieve a better effect. The learned joint representation is used for answer prediction. Attention weights a_x = softmax(MLP(X^(N))) and a_y = softmax(MLP(Y^(N))) are obtained for the weighted summation of the two features. Multiplying the picture features by the attention weights gives the final picture feature, and the final question feature is obtained in the same way. A linear multi-modal fusion is adopted, and the fused feature f is then mapped through a sigmoid function into a vector space s ∈ R^L, where L is the number of the most frequent answers in the training set. The predicted answer with the highest probability is output as the final prediction. The loss value of the loss function is determined from the real answer and the predicted answer, and the model is updated according to the loss value.
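The attentional reduction just described can be sketched as follows; this is a minimal sketch in PyTorch, and the MLP width, the sequence lengths (36 regions, 14 words) and all variable names are assumptions for illustration rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class AttendedReduce(nn.Module):
    """a = softmax(MLP(X^(N))); the output is the attention-weighted sum of the sequence features."""
    def __init__(self, d_model=512, d_hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, x):                        # x: (batch, length, d_model)
        a = self.mlp(x).softmax(dim=1)           # one attention weight per image region / question word
        return (a * x).sum(dim=1)                # weighted sum -> (batch, d_model)

reduce_x, reduce_y = AttendedReduce(), AttendedReduce()
X_N = torch.randn(2, 36, 512)                    # image features after the N fusion layers
Y_N = torch.randn(2, 14, 512)                    # question features after the N fusion layers
x_final, y_final = reduce_x(X_N), reduce_y(Y_N)  # final picture / question features
print(x_final.shape, y_final.shape)              # torch.Size([2, 512]) torch.Size([2, 512])
```

The two reduced vectors are then combined by the linear multi-modal fusion and classifier sketched in section 5 of the second embodiment below.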
Second embodiment:
1 Obtaining sample image and medical question feature information
First, we need to download medically related image data and question labels, including pictures, mainly CT and MRI scans, together with the questions matched to the pictures and the real answers corresponding to the questions, forming groups of (picture, question, answer) objects.
Then features are extracted from the pictures and the questions. A scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network. The specific operations are as follows:
Acquiring picture features: in order to better extract the required picture features, the image information is processed by combining Faster R-CNN with ResNet-101. First, the global image features are extracted with the residual network ResNet-101, and the local features of the image are then identified and extracted with the object detection algorithm Faster R-CNN, yielding the corresponding lesion information. For each region in the image, not only an object detector but also an attribute classifier is used; each object bounding box has a corresponding attribute class, so a binary description of the object can be obtained. K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
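A minimal sketch of this region-feature extraction is given below; it is not the exact pipeline of the patent: the object boxes are assumed to have been produced by a separately trained Faster R-CNN detector, the RoI features are pooled with torchvision's roi_align on the ResNet-101 feature map, and the box values are purely illustrative.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# ResNet-101 backbone up to the last convolutional block (2048-channel feature map, stride 32)
backbone = torchvision.models.resnet101(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

image = torch.randn(1, 3, 800, 800)              # one preprocessed CT/MRI slice encoded as a 3-channel image
boxes = torch.tensor([[ 50.,  60., 300., 400.],  # K = 2 hypothetical lesion boxes, (x1, y1, x2, y2) in pixels
                      [350., 100., 700., 500.]])

with torch.no_grad():
    fmap = feature_extractor(image)              # (1, 2048, 25, 25)
    pooled = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1.0 / 32)  # (K, 2048, 7, 7)
    region_feats = pooled.mean(dim=(2, 3))       # (K, 2048): one 2048-dimensional vector per object region

print(region_feats.shape)                        # torch.Size([2, 2048])
```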
Question feature acquisition: the entered medical question is first split into single words and truncated to at most 14 words, the excess being discarded and questions shorter than 14 words being padded with zeros. The semantic features of the words are then captured with a 300-dimensional GloVe word vector model and converted into vectors, and the text features are encoded with an LSTM network so as to extract the semantic feature information of the question as the input of the subsequent network.
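The question encoding can be sketched as follows; the vocabulary size is an assumed placeholder, the 300-dimensional embedding stands in for a pretrained GloVe table, the hidden size of 512 follows the question feature dimension used elsewhere in this description, and all names are illustrative.

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, HID_DIM, VOCAB = 14, 300, 512, 10000   # 14 words, 300-d GloVe, 512-d LSTM; vocab assumed

def pad_or_truncate(token_ids):
    """Truncate the question to 14 tokens and pad shorter questions with zeros, as described above."""
    token_ids = token_ids[:MAX_LEN]
    return token_ids + [0] * (MAX_LEN - len(token_ids))

embedding = nn.Embedding(VOCAB, EMB_DIM, padding_idx=0)  # would be initialised from GloVe vectors in practice
lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

question = torch.tensor([pad_or_truncate([12, 87, 5, 431, 9])])  # hypothetical word indices for one question
with torch.no_grad():
    emb = embedding(question)        # (1, 14, 300)
    Y, _ = lstm(emb)                 # (1, 14, 512): per-word question features for the subsequent network

print(Y.shape)                       # torch.Size([1, 14, 512])
```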
2 Image and medical question self-recognition
The self-recognition module is an attention model that obtains the features among image regions and the features among question words through self-correlation learning. Its core is the attention mechanism. The input consists of queries and keys of dimension d_key and values of dimension d_value; typically d_key and d_value are both written as d. First, we compute the dot products of the query with all keys and divide each by √d. Then the softmax function is applied to obtain the weights of the values. In practice, we compute the attention for a set of queries simultaneously by packing them into a matrix Q; the keys and values are likewise packed into matrices K and V. We compute the output matrix as:
Attention(Q, K, V) = softmax(QK^T / √d) V
Still further, an attention model with H parallel heads is employed, which allows the model to attend simultaneously to information from different representation subspaces at different positions, so that a wider area can be attended to at the same time. We compute the output feature matrix as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
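The two formulas above can be written out as the following sketch of multi-head scaled dot-product attention; d = 512 and H = 8 are assumed example values, and tensors follow the (batch, length, dimension) convention.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); F = Concat([head_1, ..., head_H]) W_O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.h, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, Lq, _ = q.shape
        split = lambda x: x.view(B, -1, self.h, self.d_head).transpose(1, 2)   # (B, H, L, d_head)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # dot products scaled by sqrt(d)
        weights = scores.softmax(dim=-1)                            # softmax attention weights
        out = (weights @ v).transpose(1, 2).reshape(B, Lq, -1)      # concatenate the H heads
        return self.w_o(out)                                        # final projection W_O

attn = MultiHeadAttention()
Y = torch.randn(2, 14, 512)          # question features from the LSTM encoder
print(attn(Y, Y, Y).shape)           # self-attention over the question: torch.Size([2, 14, 512])
```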
The intra-modal self-recognition unit consists of the attention model and a feed-forward network and is used to extract the subtle features of the image or the medical question. Taking the question features Y = [y_1, y_2, …, y_n] ∈ R^{n×512} as an example, Y is mapped to the queries, keys and values Q_Y, K_Y and V_Y that form the input of the self-recognition module.
The question features are output after the learned attention features have been weighted, and are then fed into a LayerNorm layer:
L_Y = LayerNorm(Y + MultiHead(Q_Y, K_Y, V_Y))
The feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final features are obtained through self-attention:
L′_Y = FF(L_Y) = max(0, L_Y W_1 + b_1) W_2 + b_2
After the self-recognition module, the medical image and the medical text each focus on their own key content, redundant information can be eliminated, and the subsequent modal interaction and feature fusion are facilitated.
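Putting the pieces together, one self-recognition unit (multi-head self-attention followed by the two-layer feed-forward network, each with a residual connection and LayerNorm, as in the formulas for L_Y and L′_Y above) can be sketched as below; it reuses PyTorch's built-in nn.MultiheadAttention, and the feed-forward width and dropout rate are assumed values.

```python
import torch
import torch.nn as nn

class SelfRecognitionBlock(nn.Module):
    """L = LayerNorm(X + MultiHead(X, X, X)); output = LayerNorm(L + FF(L)) with a two-layer ReLU FF."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attended, _ = self.attn(x, x, x)      # self-attention within a single modality
        x = self.norm1(x + attended)          # residual connection + LayerNorm
        return self.norm2(x + self.ff(x))     # feed-forward + residual + LayerNorm

block = SelfRecognitionBlock()
Y = torch.randn(2, 14, 512)                   # question features
X = torch.randn(2, 36, 512)                   # region features, assumed projected from 2048 to 512 dimensions
print(block(Y).shape, block(X).shape)         # torch.Size([2, 14, 512]) torch.Size([2, 36, 512])
```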
3 Modeling the positional relations of image lesion objects
In order to better acquire the image features and the positional relations between different targets, after the self-recognition features of the image information are obtained, the output features are sent to a position association unit that works together with the self-recognition module, and the image features and the positional relations of the different objects are modeled simultaneously. This facilitates a better understanding of the image: questions about positional relations, such as front, back, left, right, foreground and background, can be handled effectively through position relation modeling, so that the lesion region can be located and effective auxiliary diagnosis can be provided to doctors.
The inputs are the image features X′ of the objects and the position features P, where X′ is the feature obtained by the self-recognition module and P is the four-dimensional object box.
To compute the position feature weights, the coordinates P of each pair of object boxes m and n are first transformed by scale normalization and a logarithm operation; the purpose is to increase scale invariance so that training does not diverge because the range of the values is too large. The N input objects can thus be expressed together with their relative geometric features.
Then the embedded geometric feature ε_G(P_m, P_n) is multiplied by W_G, where W_G is also realized by a fully connected layer. The final max operation, similar to a ReLU layer, mainly imposes a restriction on the position feature weights:
w_G^{mn} = max(0, W_G · ε_G(P_m, P_n))
The object relation between the n-th object and the entire set can be obtained from the following formula:
R(n) = Σ_m w_mn · (W_V X′_m)
where X′_m denotes the image features of the m-th object, w_mn is the relation weight between different objects, and the output is the weighted sum of the image features of the other objects after the linear transformation W_V.
After the relation features R(n) are obtained, the last step is to concatenate the Nr relation features and fuse them with the image features X′. Concatenation is used here mainly because of its small computational cost: the channel dimension of each R(n) is 1/Nr of the original, so the dimension after concatenation is the same as that of X′.
4 Image-question cross-guidance
The cross-guidance module is composed of a question-guided picture attention module and a picture-guided question attention module. The mutually guided attention units pay more attention to the interaction between the modalities: the image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features. Cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information.
Similar to the self-recognition module, the core of the cross-guided attention module is also the attention mechanism, with inputs again denoted Q, K and V. Taking the question-guided image attention model as an example, the self-recognition features of the input image and the self-recognition features of the question are mapped to obtain the input of the image interactive attention model and the output of the question interactive attention model.
After the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention layers and the output of the previous attention layer serves as the input of the next. Connecting multiple attention layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation ability of the model.
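The cross-guidance unit and the stacking of N layers can be sketched as follows; each layer first applies self-recognition to each modality and then lets each modality guide the other (image queries attend over question keys/values and vice versa), with N = 6 layers chosen only as an example.

```python
import torch
import torch.nn as nn

class CrossGuidedBlock(nn.Module):
    """One cross-guidance layer: per-modality self-attention, then mutually guided attention."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        make = lambda: nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.self_img, self.self_txt = make(), make()
        self.txt_guides_img, self.img_guides_txt = make(), make()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, img, txt):
        img = self.norms[0](img + self.self_img(img, img, img)[0])        # image self-recognition
        txt = self.norms[1](txt + self.self_txt(txt, txt, txt)[0])        # question self-recognition
        img = self.norms[2](img + self.txt_guides_img(img, txt, txt)[0])  # question-guided image attention
        txt = self.norms[3](txt + self.img_guides_txt(txt, img, img)[0])  # image-guided question attention
        return img, txt

layers = nn.ModuleList(CrossGuidedBlock() for _ in range(6))   # stack of N = 6 layers (assumed)
img, txt = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
for layer in layers:                                           # the output of one layer feeds the next
    img, txt = layer(img, txt)
print(img.shape, txt.shape)                                    # torch.Size([2, 36, 512]) torch.Size([2, 14, 512])
```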
5 Model fusion and classifier
Through the learning of the intra-modal self-attention and cross-guided attention mechanisms, features containing rich image and question information are obtained. The image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are simply fused and input into the model classifier, and the predicted answer is obtained through the classifier.
After the effective image feature and question feature are obtained, they are fed into a linear multi-modal fusion network. The fused feature f is then mapped through a sigmoid function into a vector space s ∈ R^L, where L is the number of the most frequent answers in the training set.
s=Linear(f)
A=sigmoid(s)
The final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer. We select the answer with the highest probability among all predicted answers as the final prediction, and regression and prediction therefore use a binary cross-entropy loss. The loss value of the loss function is determined from the real answer and the predicted answer, and the model is updated according to the loss value.
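A sketch of the fusion, classifier and binary cross-entropy training objective follows; the 512-dimensional attended features, the answer-set size L = 1000 and the projected-sum form of the linear fusion are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerHead(nn.Module):
    """Linear multi-modal fusion followed by an L-way sigmoid classifier (one logistic regression per answer)."""
    def __init__(self, d_model=512, num_answers=1000):
        super().__init__()
        self.proj_img = nn.Linear(d_model, d_model)
        self.proj_txt = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, num_answers)       # s = Linear(f)

    def forward(self, img_feat, txt_feat):
        f = self.norm(self.proj_img(img_feat) + self.proj_txt(txt_feat))   # linear multi-modal fusion
        return self.classifier(f)                                          # logits s in R^L

head = AnswerHead()
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)   # attended image / question features
logits = head(img_feat, txt_feat)

targets = torch.zeros(8, 1000)
targets[torch.arange(8), torch.randint(0, 1000, (8,))] = 1.0    # one ground-truth answer per question
loss = F.binary_cross_entropy_with_logits(logits, targets)      # binary cross-entropy over candidate answers
prediction = torch.sigmoid(logits).argmax(dim=1)                # A = sigmoid(s); pick the most probable answer
print(loss.item(), prediction.shape)
```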
The visual question-answering method for medical image diagnosis disclosed by the invention has visual question-answering capability, can better help doctors perform auxiliary diagnosis, particularly in judging the positional relations of lesions, and enables patients to use it to obtain the basic information of an image without consulting a doctor.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (10)
1. A visual question-answering method for medical image diagnosis, characterized by comprising the following steps:
acquiring medical images and the corresponding medical questions;
extracting features from the image lesion targets and the medical question text respectively, capturing the dependencies among the question words, and performing text representation learning to obtain the correlation between each image region and the question;
processing each lesion target through the interaction of its image features and position features, realizing relational modeling, obtaining the relative positional relationships of different targets, and matching the multi-modal features;
introducing a cross-guided, stacked multi-modal feature fusion scheme to capture the complex interactions among the modalities;
designing and selecting a fusion scheme and a classifier, which are applied to medical question answering to realize visual question answering oriented to medical image diagnosis.
2. The visual question-answering method oriented to medical image diagnosis according to claim 1, wherein acquiring the medical images and the corresponding medical questions specifically comprises the following steps:
downloading medically related image data and question labels from the network, the images comprising pictures, mainly CT and MRI scans, together with the questions matched to the pictures and the real answers corresponding to the questions, and forming groups of (picture, question, answer) objects.
3. The visual question-answering method for medical image diagnosis according to claim 1 or 2, wherein extracting features from the image lesion targets and the medical question text respectively specifically comprises:
performing feature extraction on the pictures and the questions: a scan image is input, and the relevant regions in the image are extracted with a Faster R-CNN object detection algorithm based on ResNet-101; an English sentence is input, and the question features are obtained through word embedding and a recurrent neural network.
4. The visual question-answering method for medical image diagnosis according to claim 3, wherein the picture feature acquisition specifically comprises: the image information is processed by combining Faster R-CNN with ResNet-101: first, the global image features are extracted with the residual network ResNet-101, and the local features of the image are then identified and extracted with the object detection algorithm Faster R-CNN to obtain the corresponding lesion information; for each region in the image, not only an object detector but also an attribute classifier is used, so that each object bounding box has a corresponding attribute class and a binary description of the object can be obtained; K object regions are extracted from each image, and each object region is represented by a 2048-dimensional vector used as the input of the subsequent network.
5. The visual question-answering method for medical image diagnosis according to claim 3 or 4, wherein the question feature acquisition specifically comprises: the input medical question is first split into single words and truncated to at most 14 words, the excess being discarded and questions shorter than 14 words being padded with zeros; the semantic features of the words are then captured with a 300-dimensional GloVe word vector model and converted into vectors, and the text features are encoded with an LSTM network so as to extract the semantic feature information of the question as the input of the subsequent network.
6. The visual question-answering method oriented to medical image diagnosis according to claim 5, wherein a self-recognition module is further provided to obtain the features among image regions and the features among question words, the self-recognition module being an attention model that obtains these features through self-correlation learning; the core of the self-recognition module is the attention mechanism; the input consists of queries and keys of dimension d_key and values of dimension d_value; first, the dot products of the query with all keys are computed and each is divided by √d_key; a softmax function is then applied to obtain the weights of the values; in practice, to compute the attention weights of a set of queries simultaneously, they are packed into a matrix Q, and the keys and values are likewise packed into matrices K and V.
7. The visual question-answering method oriented to medical image diagnosis according to claim 6, wherein the attention model adopts an attention mechanism with H parallel heads, which allows the model to attend simultaneously to information from different representation subspaces at different positions, and the output feature matrix is computed as:
F = MultiHead(Q, K, V) = Concat([head_1, head_2, …, head_H]) W_O
the self-recognition module consists of the attention mechanism and a feed-forward network and is used to extract the fine-grained features of the image or the medical question;
the question features are output after the learned attention features have been weighted, and are then fed into a LayerNorm layer; the feed-forward layer comprises two fully connected layers with a ReLU function and a Dropout function, followed by a final LayerNorm layer, and the final features are obtained through self-attention.
8. The visual question-answering method for medical image diagnosis according to claim 7, wherein processing each lesion target through the interaction of its image features and position features, so as to realize relational modeling and obtain the relative positional relationships of different targets, specifically comprises:
the input consists of the image features X′ of the objects and the position features P, where X′ is the feature obtained by the self-recognition module and P is the four-dimensional object box;
to compute the position feature weights, the coordinates of one object are represented as {x_i, y_i, w_i, h_i}, where x_i denotes the abscissa of the object center, y_i the ordinate of the object center, w_i the width of the object box and h_i its height; first, the coordinates P of each pair of object boxes m and n are transformed by scale normalization and a logarithm operation, so that the N input objects can be expressed together with their relative geometric features;
next, the geometric features of the two objects are embedded into a high-dimensional feature ε_G, which is multiplied by W_G to obtain the weight, where W_G is also realized by a fully connected layer; the final max operation, similar to a ReLU layer, mainly imposes a restriction on the position feature weights:
w_G^{mn} = max(0, W_G · ε_G(P_m, P_n))
where ε_G denotes the embedding of the geometric features into high-dimensional features, and P_m, P_n denote the geometric features of objects m and n;
the object relation between the n-th object and the entire set is obtained from the following formula:
R(n) = Σ_m w_mn · (W_V X′_m)
where R(n) denotes the object relation between the n-th object and the entire set, X′_m denotes the image features of the m-th object, w_mn is the relation weight between different objects, the output is the weighted sum of the image features of the other objects after the linear transformation W_V, and K denotes the number of objects.
9. The visual question-answering method for medical image diagnosis according to claim 8, wherein introducing the cross-guided, stacked multi-modal feature fusion scheme to capture the complex interactions among the modalities specifically comprises:
the cross-guidance module consists of a question-guided picture attention module and a picture-guided question attention module; the image region features and the question text features are updated by establishing semantic associations between the two different modalities so as to obtain more refined features; cross-fused feature extraction is performed on the sample image feature information and the sample question feature information to obtain an image feature vector carrying the sample question information and a sample question feature vector carrying the sample image information;
the core of the cross-guided attention module is likewise the attention mechanism, with inputs again denoted Q, K and V; taking the question-guided image attention model as an example, the self-recognition features of the input image and the self-recognition features of the question are mapped to obtain the outputs of the image interactive attention model and the question interactive attention model;
after the image feature vector carrying the sample question information and the sample question feature vector carrying the sample image information are obtained, the layers are stacked, where N is the number of attention layers and the output of the previous attention layer serves as the input of the next; connecting multiple attention layers into a deeper model guides the embedding of the attention model, gradually refines the image and question features to be processed, and enhances the representation ability of the model.
10. The visual question-answering method for medical image diagnosis according to claim 8, wherein designing and selecting the fusion scheme and the classifier, applied to medical question answering to realize visual question answering oriented to medical image diagnosis, specifically comprises:
after the effective image feature and question feature are obtained, they are fed into a linear multi-modal fusion network; the fused feature f is then mapped through a sigmoid function into a vector space s ∈ R^L, where L is the number of the most frequent answers in the training set;
s = Linear(f)
A = sigmoid(s)
where A denotes the answer predicted by the model;
the final prediction stage can be viewed as a logistic regression that predicts the correctness of each candidate answer; the answer with the highest probability among all predicted answers is selected as the final prediction; regression and prediction use a binary cross-entropy loss; the loss value of the loss function is determined from the real answer and the predicted answer, and the model is updated according to the loss value:
Loss = −Σ_{z=1}^{M} Σ_{k=1}^{N} [ s_zk log(A_zk) + (1 − s_zk) log(1 − A_zk) ]
where M denotes the number of training questions, N denotes the number of candidate answers, s_zk denotes the true answer label, A_zk denotes the predicted probability, and z and k are the corresponding indices during training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111461563.7A CN114201592B (en) | 2021-12-02 | 2021-12-02 | Visual question-answering method for medical image diagnosis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114201592A true CN114201592A (en) | 2022-03-18 |
CN114201592B CN114201592B (en) | 2024-07-23 |
Family
ID=80650233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111461563.7A Active CN114201592B (en) | 2021-12-02 | 2021-12-02 | Visual question-answering method for medical image diagnosis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114201592B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019211250A1 (en) * | 2018-04-30 | 2019-11-07 | Koninklijke Philips N.V. | Visual question answering using on-image annotations |
WO2020263711A1 (en) * | 2019-06-28 | 2020-12-30 | Facebook Technologies, Llc | Memory grounded conversational reasoning and question answering for assistant systems |
CN112818889A (en) * | 2021-02-09 | 2021-05-18 | 北京工业大学 | Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network |
CN112926655A (en) * | 2021-02-25 | 2021-06-08 | 电子科技大学 | Image content understanding and visual question and answer VQA method, storage medium and terminal |
CN113240046A (en) * | 2021-06-02 | 2021-08-10 | 哈尔滨工程大学 | Knowledge-based multi-mode information fusion method under visual question-answering task |
CN113392253A (en) * | 2021-06-28 | 2021-09-14 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
Non-Patent Citations (4)
Title |
---|
"Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering", 《THE JOURNAL OF SUPERCOMPUTING》, 29 March 2023 (2023-03-29), pages 13696 * |
A LUBNA等: "MoBVQA: A Modality based Medical Image Visual Question Answering System", 《TENCON 2019 - 2019 IEEE REGION 10 CONFERENCE (TENCON)》, 20 October 2019 (2019-10-20), pages 727 - 732, XP033672617, DOI: 10.1109/TENCON.2019.8929456 * |
张礼阳: "结合视觉内容理解与文本信息分析的视觉问答方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 07, 15 July 2020 (2020-07-15), pages 138 - 844 * |
陈珂佳: "基于深度学习的视觉问答研究", 《重庆邮电大学硕士学位论文》, 16 April 2024 (2024-04-16), pages 1 - 86 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114780701A (en) * | 2022-04-20 | 2022-07-22 | 平安科技(深圳)有限公司 | Automatic question-answer matching method, device, computer equipment and storage medium |
CN114780701B (en) * | 2022-04-20 | 2024-07-02 | 平安科技(深圳)有限公司 | Automatic question-answer matching method, device, computer equipment and storage medium |
CN114780775A (en) * | 2022-04-24 | 2022-07-22 | 西安交通大学 | Image description text generation method based on content selection and guide mechanism |
CN114821245A (en) * | 2022-05-30 | 2022-07-29 | 大连大学 | Medical visual question-answering method based on global visual information intervention |
CN114821245B (en) * | 2022-05-30 | 2024-03-26 | 大连大学 | Medical visual question-answering method based on global visual information intervention |
CN117648976A (en) * | 2023-11-08 | 2024-03-05 | 北京医准医疗科技有限公司 | Answer generation method, device, equipment and storage medium based on medical image |
CN117235670A (en) * | 2023-11-10 | 2023-12-15 | 南京信息工程大学 | Medical image problem vision solving method based on fine granularity cross attention |
CN117407541A (en) * | 2023-12-15 | 2024-01-16 | 中国科学技术大学 | Knowledge graph question-answering method based on knowledge enhancement |
CN117407541B (en) * | 2023-12-15 | 2024-03-29 | 中国科学技术大学 | Knowledge graph question-answering method based on knowledge enhancement |
CN118471487A (en) * | 2024-07-12 | 2024-08-09 | 福建自贸试验区厦门片区Manteia数据科技有限公司 | Diagnosis and treatment scheme generating device based on multi-source heterogeneous data and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114201592B (en) | 2024-07-23 |
Similar Documents
Publication | Title |
---|---|
CN114201592B (en) | Visual question-answering method for medical image diagnosis |
Arevalo et al. | Gated multimodal networks | |
CN110750959B (en) | Text information processing method, model training method and related device | |
CN110491502A (en) | Microscope video stream processing method, system, computer equipment and storage medium | |
CN114936623B (en) | Aspect-level emotion analysis method integrating multi-mode data | |
CN113806609B (en) | Multi-modal emotion analysis method based on MIT and FSM | |
CN112651940B (en) | Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network | |
CN113886626B (en) | Visual question-answering method of dynamic memory network model based on multi-attention mechanism | |
CN116129141B (en) | Medical data processing method, apparatus, device, medium and computer program product | |
KR20200010672A (en) | Smart merchandise searching method and system using deep learning | |
Halvardsson et al. | Interpretation of swedish sign language using convolutional neural networks and transfer learning | |
CN115761757A (en) | Multi-mode text page classification method based on decoupling feature guidance | |
CN110705490A (en) | Visual emotion recognition method | |
CN116821391A (en) | Cross-modal image-text retrieval method based on multi-level semantic alignment | |
CN115410254A (en) | Multi-feature expression recognition method based on deep learning | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
Thangavel et al. | A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models | |
Nahar et al. | A robust model for translating arabic sign language into spoken arabic using deep learning | |
CN116578738B (en) | Graph-text retrieval method and device based on graph attention and generating countermeasure network | |
Gasimova | Automated enriched medical concept generation for chest X-ray images | |
Shahadat et al. | Cross channel weight sharing for image classification | |
CN116311518A (en) | Hierarchical character interaction detection method based on human interaction intention information | |
Abu-Jamie et al. | Classification of Sign-Language Using Deep Learning-A Comparison between Inception and Xception models | |
Wu et al. | Question-driven multiple attention (dqma) model for visual question answer | |
Liu et al. | Multi-type decision fusion network for visual Q&A |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||