CN112100346B - Visual question-answering method based on fusion of fine-grained image features and external knowledge - Google Patents


Info

Publication number
CN112100346B
Authority
CN
China
Prior art keywords
vector
image
feature
knowledge
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010883275.XA
Other languages
Chinese (zh)
Other versions
CN112100346A (en)
Inventor
宋凌云
李建鳌
尚学群
俞梦真
彭杨柳
李伟
李战怀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010883275.XA priority Critical patent/CN112100346B/en
Publication of CN112100346A publication Critical patent/CN112100346A/en
Application granted granted Critical
Publication of CN112100346B publication Critical patent/CN112100346B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/367: Ontology
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual question-answering method based on the fusion of fine-grained image features and external knowledge, which comprises four stages: fine-grained image feature extraction; text processing and feature extraction; question knowledge retrieval based on an external knowledge base; and multi-modal feature fusion with answer prediction. Fine-grained image feature extraction obtains the regional visual features of the image. Text processing and feature extraction processes the question sentence of the visual question and produces its overall feature. Question knowledge retrieval introduces the Freebase knowledge graph as the model's external knowledge base, supplying the common-sense or domain-specific knowledge needed to predict the answers to visual questions. Multi-modal feature fusion and answer prediction fuses the multi-modal features with a similarity-based fusion method and uses the fused visual-question feature to predict the answer. The method performs well and achieves high prediction accuracy on visual question answers.

Description

Visual question-answering method based on fusion of fine-grained image features and external knowledge
Technical Field
The invention belongs to the field of intelligent information processing, and particularly relates to a visual question answering method.
Background
Visual Question Answering (VQA) is an interdisciplinary research direction combining computer vision and natural language processing, whose goal is to enable computers to predict answers to visual questions. Specifically, an image and an open question related to that image are input to the computer; the visual question-answering system must first understand the semantics of the question text and then combine them with the visual information of the related image to predict the answer. The task requires the computer to deeply understand both the content of the image and the semantics of the question, and answering some questions further requires mastery of related common-sense or domain-specific knowledge. Visual question-answering research therefore involves many artificial intelligence techniques, including fine-grained recognition, object recognition, behavior recognition and natural language processing, and it places higher demands on image semantic understanding, and poses greater challenges, than traditional computer vision research.
Prior research on visual question answering exists, but it relies on global image features, so fine-grained visual features highly correlated with the question text cannot be obtained, and applicability to fine-grained visual questions is poor. Most methods also focus only on the content of the visual question itself, which greatly limits their application scenarios; they answer fine-grained image questions poorly and cannot reason beyond the visual question itself.
Disclosure of Invention
To overcome these defects of the prior art, the invention provides a visual question-answering method based on the fusion of fine-grained image features and external knowledge, comprising four stages: fine-grained image feature extraction; text processing and feature extraction; question knowledge retrieval based on an external knowledge base; and multi-modal feature fusion with answer prediction. Fine-grained image feature extraction obtains the regional visual features of the image. Text processing and feature extraction processes the question sentence of the visual question and produces its overall feature. Question knowledge retrieval introduces the Freebase knowledge graph as the model's external knowledge base, supplying the common-sense or domain-specific knowledge needed to predict the answers to visual questions. Multi-modal feature fusion and answer prediction fuses the multi-modal features with a similarity-based fusion method and uses the fused visual-question feature to predict the answer. The method performs well and, measured by the standard visual question-answering accuracy index, achieves high prediction accuracy on visual question answers.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: take the original image as input and segment it with an unsupervised image segmentation algorithm, marking each segmented region with a different RGB color value; then resize the image to d₁×d₁×3;
Step 1-2: take a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as the image feature extractor; input the original image into the extractor and take the output of its last convolutional layer as the feature map of the original image, whose size is d₂×d₂×512;
Step 1-3: map the segmented regions of the original image from step 1-1 onto the feature map with the ROI projection method, partition the feature map according to the mapping result, and establish a one-to-one correspondence between the segmented regions of the original image and those of the feature map; max-pool the feature map within each region to obtain a 512-dimensional image feature vector per segmented region, where each dimension takes the maximum value of the corresponding feature-map channel inside that region;
Step 2: text processing and feature extraction;
Step 2-1: tokenize the question sentence of the visual question with the NLTK toolkit and convert each token into a one-hot word vector;
Step 2-2: embed the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encode the word vectors with an LSTM network and take the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenize the question sentence and perform part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatize the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: use the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encode the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each of dimension H₁; splice each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: compute the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: use each region's cosine similarity as a weighting coefficient and multiply it by that region's image feature vector; sum the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fuse the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB; the fusion process is expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: construct a multi-class classifier from a multilayer perceptron and a Sigmoid function and input the vector f into the classifier; the classifier output is:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
Preferably, the value of d₁ is 224 and the value of d₂ is 14.
Preferably, the value of H₁ is 100 and the value of H₂ is 300.
Preferably, the ROI projection method is as follows: divide the horizontal and vertical coordinates of each pixel of the original image by the convolution stride of the image feature extractor and round the results up; after merging duplicate results, the obtained coordinates are the coordinates of the corresponding points in the feature map; the mapping between original-image pixels and feature-map points is established through this coordinate correspondence.
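As an illustration of this projection rule, the following sketch maps the pixel coordinates of one segmented region onto the feature-map grid. The function name, the boolean-mask input format and the stride of 16 (consistent with VGG-16's cumulative stride, 224/14 = 16) are assumptions of this sketch rather than details fixed by the method.

```python
import numpy as np

def roi_projection(region_mask, total_stride=16):
    """Project one segmented region of the original image onto the feature map:
    divide each pixel coordinate by the convolution stride of the feature
    extractor, round up, and merge duplicate coordinates, per the rule above.
    region_mask: (H, W) boolean array marking the region's pixels."""
    ys, xs = np.nonzero(region_mask)
    fy = np.ceil(ys / total_stride).astype(int)
    fx = np.ceil(xs / total_stride).astype(int)
    # Clip to the 14x14 feature-map grid, then merge repeated coordinates.
    fy = np.clip(fy, 0, 13)
    fx = np.clip(fx, 0, 13)
    return set(zip(fy.tolist(), fx.tolist()))
```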
Preferably, the knowledge graph is Freebase.
The visual question-answering method based on the fusion of fine-grained image features and external knowledge has the following beneficial effects:
1. Compared with the global image features used by traditional methods, the proposed fine-grained image feature extraction obtains visual features more highly correlated with the question text and improves applicability to fine-grained visual questions.
2. Knowledge related to the visual question is retrieved from an external knowledge base and introduced into answer prediction; fusing this external knowledge with the feature vectors of the visual question improves the ability to answer questions that require common-sense or domain-specific knowledge.
3. Most conventional methods focus only on the content of the visual question itself, which greatly limits their application scenarios, whereas the proposed method does not. Experimental results on a standard dataset show that it achieves higher accuracy than existing methods.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Fig. 2 is a schematic diagram of fine-grained image feature extraction according to the present invention.
FIG. 3 is a schematic diagram of text processing and feature extraction according to the present invention.
FIG. 4 is a diagram of problem knowledge retrieval based on an external knowledge base according to the present invention.
FIG. 5 is a schematic diagram of multi-modal feature fusion and answer prediction according to the present invention.
FIG. 6 is a graph of the loss value as a function of the number of iterations during training according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, a visual question-answering method based on fusion of fine-grained image features and external knowledge includes the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: take the original image as input and segment it with an unsupervised image segmentation algorithm, marking each segmented region with a different RGB color value; then resize the image to d₁×d₁×3;
Step 1-2: take a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as the image feature extractor; input the original image into the extractor and take the output of its last convolutional layer as the feature map of the original image, whose size is d₂×d₂×512;
Step 1-3: map the segmented regions of the original image from step 1-1 onto the feature map with the ROI projection method, partition the feature map according to the mapping result, and establish a one-to-one correspondence between the segmented regions of the original image and those of the feature map; max-pool the feature map within each region to obtain a 512-dimensional image feature vector per segmented region, where each dimension takes the maximum value of the corresponding feature-map channel inside that region;
Step 2: text processing and feature extraction;
Step 2-1: tokenize the question sentence of the visual question with the NLTK toolkit and convert each token into a one-hot word vector;
Step 2-2: embed the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encode the word vectors with an LSTM network and take the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenize the question sentence and perform part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatize the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: use the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encode the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each of dimension H₁; splice each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: compute the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: use each region's cosine similarity as a weighting coefficient and multiply it by that region's image feature vector; sum the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fuse the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB; the fusion process is expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: construct a multi-class classifier from a multilayer perceptron and a Sigmoid function and input the vector f into the classifier; the classifier output is:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
The specific embodiment is as follows:
1. Tokenize the visual question sentences in the sample library with the NLTK toolkit and build a dictionary from the tokenization results, where each word in the dictionary corresponds to a unique index.
2. As shown in fig. 2, an unsupervised image segmentation algorithm first performs region segmentation on the original image, outputting an image in which each segmented region is marked with a different RGB color value. The pixel coordinates of each region are obtained from these color values and determine the part of the image feature map corresponding to each segmented region of the original image. The segmented images are then uniformly resized to 224×224×3.
A VGG-16 network pre-trained on ImageNet, with its fully connected layers and Softmax layer removed, serves as the image feature extractor. The original image corresponding to the visual question is forward-propagated through the network, and the output of the last convolutional layer is taken as the extracted image feature; the output feature map has size 14×14×512.
According to the result of the unsupervised image segmentation, the ROI-to-feature-map mapping method of SPP-Net (from object detection) partitions the feature map output by the image feature extractor. Following the receptive-field and coordinate-transformation principles of convolutional neural networks, the ROI mapping works as follows: the horizontal and vertical coordinates of every pixel of the original image are divided by the convolution stride of the feature extractor and rounded up, giving the corresponding coordinates of that pixel on the feature map. The segmented regions on the feature map are obtained through this mapping, max pooling is performed within each region, and the maximum value of each feature-map channel inside a region becomes the value of that region's feature vector in that dimension, yielding a 512-dimensional image feature vector for each segmented region of the original image.
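A minimal sketch of this per-region max pooling step, under the same assumptions as the projection sketch above (NumPy arrays, one coordinate set per region):

```python
import numpy as np

def region_features(feature_map, region_coords):
    """Max-pool the feature map over each projected region, as described above.
    feature_map: (14, 14, 512) output of VGG-16's last conv layer.
    region_coords: one coordinate set per region, e.g. from roi_projection().
    Returns a (num_regions, 512) array; each dimension of a region's vector is
    the maximum activation of that channel inside the region."""
    feats = []
    for coords in region_coords:
        idx = np.array(sorted(coords))               # (n, 2) (row, col) pairs
        vals = feature_map[idx[:, 0], idx[:, 1], :]  # (n, 512) activations
        feats.append(vals.max(axis=0))               # channel-wise max pooling
    return np.stack(feats)
```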
3. Part-of-speech tagging determines the part of speech of each English word. The visual question sentence is tokenized with the NLTK toolkit and the tokens are tagged, which lets the knowledge-retrieval step search the external knowledge base for knowledge triples related to the nouns and verbs in the question. After tagging, the words are lemmatized: singular and plural nouns are unified into the singular form, and tense variants of verbs are unified into the verb's base form.
As shown in fig. 3, the input question sentence is tokenized on punctuation marks and spaces according to the characteristics of English. Each token is converted into a one-hot word vector whose value is 1 in the dimension corresponding to the word and 0 elsewhere. The word embedding technique GloVe (Global Vectors for Word Representation) then embeds the one-hot word vectors into the word vector space. Finally, the embedded sequence is input to an LSTM network that encodes the question text, and the hidden state vector of the LSTM unit at the last time step is taken as the question text feature vector.
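The text branch can be sketched as follows in PyTorch. The hidden dimension, the trainability of the embedding layer and the class name are assumptions of this sketch; the GloVe-embedding-plus-LSTM structure and the use of the last hidden state follow the description above.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe embedding followed by an LSTM; the final hidden state is the
    question text feature vector q."""
    def __init__(self, glove_weights, hidden_dim=512):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pre-trained GloVe vectors,
        # indexed by the dictionary built from the sample library.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(glove_weights.size(1), hidden_dim, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)          # (batch, seq_len, 300)
        _, (h_n, _) = self.lstm(emb)
        return h_n[-1]                       # hidden state at the last step = q

# Usage sketch: tokenize with NLTK, map tokens to dictionary indices, encode.
# ids = torch.tensor([[vocab[w] for w in nltk.word_tokenize(question.lower())]])
# q = QuestionEncoder(glove_weights)(ids)
```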
4. As shown in fig. 4, the main task of question knowledge retrieval based on the external knowledge base is to find related knowledge triples; specifically, the verbs and nouns obtained by part-of-speech tagging of the question sentence are looked up in the external knowledge graph Freebase. According to the search results, the feature vectors of the knowledge triples matched to each visual question are stored locally and are fused with the image and text features in the multi-modal feature fusion step.
Specifically, the knowledge representation learning method TransE learns vector representations of the entities and relations in the knowledge graph. TransE treats the relation in a triple-form knowledge item as a translation from one entity to another, representing and learning entities and relations in the same semantic space so that the sum of the head entity vector and the relation vector is as close as possible to the tail entity vector, which can be expressed as:
head entity + relationship = tail entity
where head entity denotes the head entity vector, relationship the relation vector, and tail entity the tail entity vector.
The algorithm parameters are set so that the vectors of both entities and relations are 100-dimensional; splicing the head entity vector, relation vector and tail entity vector of each triple in order yields a 300-dimensional knowledge triple feature vector.
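A sketch of how a matched triple becomes the 300-dimensional feature k; the lookup-table names are assumptions, and transe_score only illustrates the translation constraint stated above.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE treats the relation as a translation: h + r should be close to t,
    so a smaller ||h + r - t|| means a more plausible triple."""
    return -np.linalg.norm(h + r - t)

def triple_feature(head, relation, tail, entity_emb, relation_emb):
    """Splice the 100-d TransE vectors of a matched triple in the order head
    entity, relation, tail entity, giving the 300-d feature vector k described
    above. entity_emb / relation_emb are dict-like lookup tables learned with
    TransE; their names are assumptions of this sketch."""
    return np.concatenate(
        [entity_emb[head], relation_emb[relation], entity_emb[tail]]
    )  # shape (300,) when each component is 100-d
```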
5. Fine-grained image feature extraction yields an image feature vector for each segmented region of the visual question's image. For a specific visual question, different segmented regions of the original image are correlated with the question to different degrees. Therefore, before the image features are fused, the similarity between the visual question and the n feature vectors of the image's n segmented regions is computed, and each region's features are weighted by this similarity, so that regions highly relevant to the question receive more attention.
As shown in fig. 5, cosine similarity measures the similarity between the image features of the different regions and the visual question. The closer the cosine value is to 1, the more similar the two vectors, i.e. the more similar the image feature and the visual question. The cosine similarity is computed as:
cos(A, B) = ( Σ_{i=1}^{O} A_i·B_i ) / ( √(Σ_{i=1}^{O} A_i²) · √(Σ_{i=1}^{O} B_i²) )
where A and B are the two vectors participating in the calculation and O is their dimension. Computing the cosine similarity determines how similar each segmented region's image feature vector is to the question text feature vector. Each region's cosine similarity is used as a weighting coefficient and multiplied by that region's image feature vector; summing the weighted vectors over all regions gives the overall image feature vector.
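The similarity weighting can be sketched as follows. Projecting q to the regions' 512 dimensions beforehand is an assumption needed for the cosine to be defined; the document does not state q's dimension explicitly.

```python
import torch
import torch.nn.functional as F

def fuse_regions(region_feats, q):
    """Weight each region's 512-d feature by its cosine similarity to the
    question feature q and sum, yielding the overall image vector v.
    region_feats: (num_regions, 512); q: (512,), assumed already projected."""
    sims = F.cosine_similarity(region_feats, q.unsqueeze(0), dim=1)  # (num_regions,)
    v = (sims.unsqueeze(1) * region_feats).sum(dim=0)                # (512,)
    return v
```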
the method comprises the steps of performing feature fusion on an image overall feature vector, a problem text feature vector and a knowledge triple feature vector through a feature fusion method based on a Bilinear pooling Model (MLB), wherein the MLB performs feature fusion on the two feature vectors based on a Hadamard product and matrix decomposition, in the specific implementation process, firstly, the features of two different modes are projected to the same dimension by using a multilayer perceptron, then, the projected vectors are multiplied by the Hadamard product, and the calculation result is the fusion result. In the knowledge-based visual question-answering method, the specific fusion steps are that firstly, MLB is adopted to fuse the image overall characteristic vector and the knowledge triple characteristic vector to obtain an intermediate result image-knowledge characteristic, then MLB is adopted to fuse the question text characteristic vector and the knowledge triple characteristic vector to obtain an intermediate result text-knowledge characteristic, finally MLB is adopted to fuse two intermediate results to obtain a characteristic representation of a visual question, and finally a fusion characteristic vector f is obtained, wherein the vector can be used for predicting the answer of the visual question.
The fused feature vector f serves as the input of the answer prediction step, from which the answer to the visual question is predicted. Following mainstream visual question-answering research, which models the task as a classification problem, this embodiment constructs a multi-class classifier: the vector f is input to the classifier, each dimension of the output vector is the probability that the corresponding answer in the candidate answer set is correct, and according to this probability distribution the answer with the highest probability is the predicted answer to the question.
The classifier consists of a multilayer perceptron MLP followed by a Sigmoid function, refining the multi-class problem into many binary classification problems. This places no restriction on the type of visual question: it handles not only dataset questions with exactly one correct answer in the candidate set, but also questions with several answers and questions whose answer is absent from the candidate answer set. The output of the classifier is:
α=Sigmoid(MLP(f))
the loss function of the classifier adopts a cross entropy loss function, and comprises the following steps:
Loss = −(1/N) Σ_{i=1}^{N} [ y_i·log p(y_i) + (1 − y_i)·log(1 − p(y_i)) ]
where y_i is the soft label produced by the answer-labeling mechanism, which marks each candidate answer with a fractional value in [0, 1]; p(y_i) is the predicted probability; N is the number of training samples; and i indexes the i-th training sample.
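The classifier and its loss can be sketched as follows; the hidden width and the candidate-answer count are assumptions, while the MLP+Sigmoid structure and the soft-label cross-entropy match the formulas above.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """MLP + Sigmoid head over the fused feature f. Each output dimension is
    the probability that the corresponding candidate answer is correct; the
    soft targets y_i in [0, 1] make this a set of per-answer binary problems."""
    def __init__(self, joint_dim=1200, num_answers=3000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(joint_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, f):
        return torch.sigmoid(self.mlp(f))

# Cross-entropy training loss over the soft labels, matching the formula above:
# loss = nn.BCELoss()(classifier(f), soft_labels)
```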
Before training, fine-grained image feature extraction and knowledge representation learning over the knowledge base are completed, and the image features and knowledge item features corresponding to each visual question are stored locally in HDF file format.
Parameters are randomly initialized; a random seed is set so that the random numbers used for initialization are fixed across training runs. During training, the model is validated on the validation set after every epoch; if the current epoch's validation result is better than the previous one's, the parameters from the current epoch are saved, replacing the previous parameters. AdaMax is adopted as the optimization algorithm, with all of its parameters at their default values.
To prevent overfitting during training, Dropout layers are added to the network, and gradients are clipped during backpropagation to prevent vanishing or exploding gradients.
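A sketch of one training pass under these settings; `model` (the fusion network plus classifier), `loader` and the clipping norm of 0.25 are assumptions of this sketch (the document does not state a clipping threshold), while AdaMax with default parameters follows the text.

```python
import torch

def train_one_epoch(model, loader, optimizer):
    """One training pass: BCE loss over soft labels, with gradient clipping
    applied in backpropagation as described above."""
    criterion = torch.nn.BCELoss()
    for fused_features, soft_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(fused_features), soft_labels)
        loss.backward()  # clipping below guards against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
        optimizer.step()

# optimizer = torch.optim.Adamax(model.parameters())  # AdaMax, default parameters
```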
FIG. 6 plots the loss value against the number of training iterations. Training begins to converge around the 48th epoch, and the loss function finally converges to 1.95.
The prediction quality is evaluated with the following accuracy measure: if the predicted answer to a given question is the same as more than two of the ten labeled answers provided by the dataset, the prediction is considered correct. This evaluation index, called answer accuracy, is calculated as:
Accuracy = min(T / 3, 1)
where T is the number of the ten labeled answers that are the same as the model's predicted answer.
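This metric reduces to the standard VQA accuracy min(T/3, 1), sketched below: a prediction matching at least three of the ten labeled answers scores 1.

```python
def vqa_accuracy(predicted, labeled_answers):
    """Answer accuracy as defined above: a prediction matching more than two
    of the ten human-labeled answers counts as fully correct."""
    t = sum(1 for a in labeled_answers if a == predicted)
    return min(t / 3.0, 1.0)
```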
Table 1 shows the evaluation results of the knowledge-graph-based visual question-answering model. The model's accuracy on the three question types "yes/no", "number" and "other" is 73.35%, 47.87% and 34.01% respectively, and its overall accuracy is 52.14%.
Table 1: model accuracy
Question type:  yes/no   number   other    overall
Accuracy:       73.35%   47.87%   34.01%   52.14%
Table 2 compares the final accuracy of the proposed method with methods from other papers. CcVMS, CcVMS+Clustering and LcVMS+Clustering are the experimental results of different methods on the same dataset, and Ours denotes the proposed method. The proposed method shows a considerable performance improvement over the other methods.
Table 2: comparison of accuracy
(The original table is provided as an image, listing the accuracy of CcVMS, CcVMS+Clustering, LcVMS+Clustering and Ours on the same dataset.)

Claims (3)

1. A visual question-answering method based on fusion of fine-grained image features and external knowledge is characterized by comprising the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: taking an original image as input and segmenting it with an unsupervised image segmentation algorithm, each segmented region being marked with a different RGB color value; then resizing the image to 224×224×3;
Step 1-2: taking a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as an image feature extractor; inputting the original image into the image feature extractor and taking the output of its last convolutional layer as the feature map of the original image, the output feature map having size 14×14×512;
Step 1-3: mapping the segmented regions of the original image from step 1-1 onto the feature map with an ROI projection method, partitioning the feature map according to the mapping result, and establishing a one-to-one correspondence between the segmented regions of the original image and those of the feature map; performing max pooling on the feature map to obtain a 512-dimensional image feature vector for each segmented region, each dimension taking the maximum value of the corresponding feature-map channel inside that region;
the ROI projection method being: dividing the horizontal and vertical coordinates of each pixel of the original image by the convolution stride of the image feature extractor, rounding the results up and merging duplicates, the obtained coordinates being the coordinates of the corresponding points in the feature map; and establishing the mapping between original-image pixels and feature-map points through this coordinate correspondence;
Step 2: text processing and feature extraction;
Step 2-1: tokenizing the question sentence of the visual question with the NLTK toolkit and converting each token into a one-hot word vector;
Step 2-2: embedding the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encoding the word vectors with an LSTM network and taking the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenizing the question sentence and performing part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatizing the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: using the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encoding the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each encoding vector having dimension H₁; and splicing each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: computing the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: using each region's cosine similarity as a weighting coefficient and multiplying it by that region's image feature vector; summing the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fusing the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB, the fusion process being expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: constructing a multi-class classifier from a multilayer perceptron and a Sigmoid function and inputting the vector f into the classifier, the classifier output being:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
2. The visual question-answering method based on fusion of fine-grained image features and external knowledge according to claim 1, wherein the value of H₁ is 100 and the value of H₂ is 300.
3. The visual question-answering method based on fusion of fine-grained image features and external knowledge according to claim 1, wherein the knowledge graph is Freebase.
CN202010883275.XA 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge Active CN112100346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010883275.XA CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010883275.XA CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Publications (2)

Publication Number Publication Date
CN112100346A CN112100346A (en) 2020-12-18
CN112100346B true CN112100346B (en) 2021-07-20

Family

ID=73758142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010883275.XA Active CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Country Status (1)

Country Link
CN (1) CN112100346B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800191B (en) * 2020-12-31 2023-01-17 科大讯飞股份有限公司 Question and answer method and device based on picture and computer readable storage medium
CN112714032B (en) * 2021-03-29 2021-07-02 网络通信与安全紫金山实验室 Wireless network protocol knowledge graph construction analysis method, system, equipment and medium
CN113240046B (en) * 2021-06-02 2023-01-03 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253B (en) * 2021-06-28 2023-09-29 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113420833B (en) * 2021-07-21 2023-12-26 南京大学 Visual question answering method and device based on semantic mapping of questions
CN113626662A (en) * 2021-07-29 2021-11-09 山东新一代信息产业技术研究院有限公司 Method for realizing post-disaster image visual question answering
CN114282531A (en) * 2021-08-24 2022-04-05 腾讯科技(深圳)有限公司 Question detection method and device, electronic equipment and storage medium
CN114842368B (en) * 2022-05-07 2023-10-03 中国电信股份有限公司 Scene-based visual auxiliary information determination method, system, equipment and storage medium
CN117271818B (en) * 2023-11-22 2024-03-01 鹏城实验室 Visual question-answering method, system, electronic equipment and storage medium
CN117648429B (en) * 2024-01-30 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817749B2 (en) * 2018-01-18 2020-10-27 Accenture Global Solutions Limited Dynamically identifying object attributes via image analysis

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154235A (en) * 2017-12-04 2018-06-12 盈盛资讯科技有限公司 A kind of image question and answer inference method, system and device
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN110134774A (en) * 2019-04-29 2019-08-16 华中科技大学 It is a kind of based on the image vision Question-Answering Model of attention decision, method and system
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
CN110532433A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 Entity recognition method, device, electronic equipment and the medium of video scene
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111291556A (en) * 2019-12-17 2020-06-16 东华大学 Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN111506722A (en) * 2020-06-16 2020-08-07 平安科技(深圳)有限公司 Knowledge graph question-answering method, device and equipment based on deep learning technology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An improved TransE-based knowledge graph representation method; Chen Wenjie et al.; Computer Engineering; 2020-05-31; vol. 46, no. 5; pp. 63-69, 77 *
Fine-grained knowledge graph question answering based on BiLSTM_CRF; Zhang Chuting et al.; Computer Engineering; 2020-02-29; vol. 46, no. 2; pp. 41-47 *
Fine-grained image classification with a deep region network method; Weng Yuchen et al.; Journal of Image and Graphics; 2017-11-30; vol. 22, no. 11; pp. 1521-1530 *

Also Published As

Publication number Publication date
CN112100346A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN108804530B (en) Subtitling areas of an image
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN107679580B (en) Heterogeneous migration image emotion polarity analysis method based on multi-mode depth potential correlation
CN110334705B (en) Language identification method of scene text image combining global and local information
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
Sumbul et al. SD-RSIC: Summarization-driven deep remote sensing image captioning
CN108009148B (en) Text emotion classification representation method based on deep learning
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
Peng et al. Research on image feature extraction and retrieval algorithms based on convolutional neural network
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111695052A (en) Label classification method, data processing device and readable storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN115115969A (en) Video detection method, apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant