CN111860653A - Visual question answering method and device, electronic equipment and storage medium

Visual question answering method and device, electronic equipment and storage medium

Info

Publication number
CN111860653A
Authority
CN
China
Prior art keywords
features
image
feature
attribute
entity
Prior art date
Legal status
Pending
Application number
CN202010711706.4A
Other languages
Chinese (zh)
Inventor
李晓川
张润泽
范宝余
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010711706.4A
Publication of CN111860653A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a visual question answering method and device, an electronic device and a computer-readable storage medium, wherein the method comprises the following steps: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework; determining a position corresponding to the image features, and determining an entity type and an attribute type corresponding to the image features; performing feature fusion on the text features, the image features, the position, the entity type and the attribute type to obtain a fusion feature; and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question. The visual question answering method provided by the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, thereby improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features.

Description

Visual question answering method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a visual question answering method and apparatus, an electronic device, and a computer-readable storage medium.
Background
VQA (Visual Question Answering) aims to give a computer the ability to answer human questions according to image content, and is a cross-modal AI (Artificial Intelligence) processing technology. The VQA task fuses images and text, and is therefore a cross-modal AI task.
Because of the characteristics of the VQA task, image features and text features need to be extracted separately; the two are then fused and input into an encoder network to extract encoding features, and finally the encoding features are input into a classifier to predict the final answer. At present, an object detection framework is generally adopted as the image feature extraction network. As shown in fig. 1, the input question text is converted into an L × M text feature through text feature mapping, where L denotes the length of the question (i.e., the question includes L words) and M denotes the dimension of the feature each word is converted into. The text feature is then encoded by a text feature encoder. Similarly, the detected features of the image are fused with their corresponding detected positions and then encoded. The two encoded features are then fused and encoded further, and the VQA classifier performs classification to output the final predicted answer.
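For concreteness, the related-art pipeline of fig. 1 can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed module choices (an embedding plus LSTM text encoder, a linear image encoder, mean pooling over boxes) and assumed dimensions; the patent does not prescribe a concrete implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the related-art VQA pipeline (fig. 1). All module choices
# and dimensions (L, M, N, K, D) are illustrative assumptions.
class BaselineVQA(nn.Module):
    def __init__(self, vocab_size=10000, m_dim=300, k_dim=2048, d_model=512, num_answers=3000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, m_dim)   # question -> L x M text feature
        self.text_encoder = nn.LSTM(m_dim, d_model, batch_first=True)
        self.box_proj = nn.Linear(4, k_dim)                 # detected positions -> K dims
        self.image_encoder = nn.Linear(k_dim, d_model)      # N x K detection features -> N x D
        self.vqa_classifier = nn.Sequential(                # fused encoding -> answer logits
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_answers))

    def forward(self, question_ids, det_feats, det_boxes):
        # question_ids: (B, L); det_feats: (B, N, K); det_boxes: (B, N, 4)
        _, (h, _) = self.text_encoder(self.word_embed(question_ids))
        text_code = h[-1]                                   # (B, D) text encoding feature
        img = det_feats + self.box_proj(det_boxes)          # fuse features with positions
        img_code = self.image_encoder(img).mean(dim=1)      # (B, D), pooled over N boxes
        fused = torch.cat([text_code, img_code], dim=-1)    # cross-modal feature fusion
        return self.vqa_classifier(fused)                   # predicted answer distribution
```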
Limited by the performance bottleneck of the target detection framework, it is difficult for the VQA model to greatly improve its prediction accuracy merely by modifying the network structure of the model. Therefore, how to sufficiently exploit the detection features to help improve the classification accuracy of the VQA task is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a visual question answering method and device, an electronic device and a computer readable storage medium, which realize sufficient extraction of detection features and assist in improving the classification accuracy of VQA tasks.
In order to achieve the above object, the present application provides a visual question answering method, including:
acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
Wherein the determining the entity type and the attribute type corresponding to the image feature comprises:
inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
Wherein, the method further comprises:
determining all of the entity types and all of the attribute types from a candidate set of answers;
acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set;
training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
Wherein extracting text features from the target question comprises:
and performing text feature mapping on the target question so as to extract text features from the target question.
Wherein performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature comprises:
coding the text features to obtain text coding features, and coding the positions to obtain position codes;
performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features;
fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
Wherein fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature comprises:
fusing the position codes and the image features and then coding to obtain image coding features;
coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics;
and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
Wherein fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature comprises:
fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics;
and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
To achieve the above object, the present application provides a visual question answering device, comprising:
the extraction module is used for acquiring a target question and a target image, extracting text features from the target question and extracting image features from the target image by using a target detection framework;
the first determining module is used for determining the position corresponding to the image feature and determining the entity type and the attribute type corresponding to the image feature;
the fusion module is used for carrying out feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain fusion features;
and the input module is used for inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of the visual question-answering method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the above-mentioned visual question-answering method.
According to the scheme, the visual question answering method comprises the following steps: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework; determining a position corresponding to the image features, and determining an entity type and an attribute type corresponding to the image features; performing feature fusion on the text features, the image features, the position, the entity type and the attribute type to obtain a fusion feature; and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
The visual question-answering method provided by the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier; it does not change the target detection framework, but improves the prediction accuracy of the VQA model by improving the utilization rate of the detection features. Therefore, the visual question-answering method provided by the application expands the entities and attributes in the image into the detection features, enriches the image features of the conventional VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the conventional detection framework, and helps improve the classification accuracy of the VQA task. The application also discloses a visual question-answering device, an electronic device and a computer-readable storage medium, which can achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of VQA provided in the related art;
FIG. 2 is a flow diagram illustrating a method of visual question answering in accordance with one exemplary embodiment;
FIG. 3 is a VQA flow diagram according to an exemplary embodiment;
FIG. 4 is another VQA flow diagram according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a visual question answering device in accordance with one exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a visual question-answering method, which realizes full extraction of detection features and assists in improving the classification precision of VQA tasks.
Referring to fig. 2, a flow diagram of a visual question answering method according to an exemplary embodiment is shown; as shown in fig. 2, the method includes:
S101: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
In this step, the input question text, i.e., the target question, is converted into an L × M text feature, where L denotes the length of the question (i.e., the target question includes L words) and M denotes the dimension of the feature each word is converted into. Preferably, the step of extracting text features from the target question may include: performing text feature mapping on the target question so as to extract text features from it. Meanwhile, image features are extracted from the target image by using the target detection framework.
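As an illustration of the L × M text feature mapping, the sketch below tokenizes the question by whitespace and looks each word up in an embedding table; the toy vocabulary, the tokenizer, and M = 300 are assumptions for illustration, not part of the patent.

```python
import torch
import torch.nn as nn

# Sketch of the L x M text feature mapping in step S101. The vocabulary,
# tokenizer, and embedding dimension M are illustrative assumptions.
vocab = {"<unk>": 0, "what": 1, "color": 2, "is": 3, "the": 4, "cat": 5}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)  # M = 300

question = "what color is the cat"
ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in question.split()])
text_features = embed(ids)   # shape (L, M) = (5, 300), one feature row per word
print(text_features.shape)   # torch.Size([5, 300])
```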
S102: determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
In this step, the position corresponding to the image features is determined, and entity and attribute classification is performed on the image features. The entity in this embodiment is specifically a noun-type entity, such as cat, dog, or tree, and the attribute is specifically an adjective-type attribute, such as white.
Preferably, in this embodiment, the entity type and the attribute type corresponding to the image feature may be determined by using an entity classifier and an attribute classifier, that is, the step of determining the entity type and the attribute type corresponding to the image feature may include: inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features; and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
In a specific implementation, image features are first extracted through the object detection framework; for example, for an image of arbitrary size, a detection feature of size N × K is extracted, where N represents the number of preset candidate frames in the detection framework, and K represents the feature dimension of each preset candidate frame. Then, the N K-dimensional vectors are passed through the entity classifier and the attribute classifier respectively, and entity and attribute assignment is performed on the target frame features to obtain the entity and attribute text of each target frame.
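A minimal sketch of the two classifier heads over the N K-dimensional detection vectors follows; the linear-head architecture, N, K, and the class counts are assumptions, since the patent does not fix the classifier structure.

```python
import torch
import torch.nn as nn

# Sketch of the entity and attribute classifier heads. The patent does not fix
# their architecture; linear heads and these sizes are assumptions.
K, N = 2048, 36                           # feature dim / number of candidate frames
num_entities, num_attributes = 1600, 400  # enumerated from the answer candidate set

entity_head = nn.Linear(K, num_entities)
attribute_head = nn.Linear(K, num_attributes)

det_feats = torch.randn(N, K)             # N x K detection features from the framework
entity_ids = entity_head(det_feats).argmax(dim=-1)        # entity type per frame
attribute_ids = attribute_head(det_feats).argmax(dim=-1)  # attribute type per frame
# entity_ids / attribute_ids index into the enumerated entity / attribute
# vocabularies (e.g. "cat", "dog", "tree" / "white"), yielding text labels
# for each target frame.
```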
The step of training the entity classifier and the attribute classifier comprises: determining all of the entity types and all of the attribute types from a candidate set of answers; acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set; training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier; and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier. In a specific implementation, the entity categories and attribute categories are enumerated separately by analyzing the candidate set of answers. And training the classifier by taking the training image features in the image feature training set, the corresponding entities and the attribute labels as training samples.
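The classifier training described above could be realized as a standard supervised loop; the synthetic data, cross-entropy loss, and optimizer settings below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed training loop for the entity classifier; the attribute classifier is
# trained identically against attribute labels. Data here is synthetic.
K, num_entities = 2048, 1600
train_feats = torch.randn(10000, K)                       # image feature training set
train_labels = torch.randint(0, num_entities, (10000,))   # labeled entity types

entity_head = nn.Linear(K, num_entities)
optimizer = torch.optim.Adam(entity_head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

loader = DataLoader(TensorDataset(train_feats, train_labels), batch_size=256, shuffle=True)
for feats, labels in loader:
    optimizer.zero_grad()
    loss = criterion(entity_head(feats), labels)  # cross-entropy over entity types
    loss.backward()
    optimizer.step()
```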
S103: performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
S104: and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
In specific implementation, the text features, the image features, the positions corresponding to the image features, the entity types and the attribute types are fused into a fusion feature, and the fusion feature is encoded and input into the VQA classifier to obtain the answer corresponding to the target question.
Preferably, the performing feature fusion on the text feature, the image feature, the position, the entity type, and the attribute type to obtain a fusion feature includes: coding the text features to obtain text coding features, and coding the positions to obtain position codes; performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features; fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
In specific implementation, a text feature encoder is used to encode the text features; similarly, the positions corresponding to the image features are encoded to obtain position codes. Text feature mapping is performed on the entity types to obtain N × M entity features, and the attribute types are likewise mapped to obtain N × M attribute features. The text encoding features, the image features, the position codes, the entity features and the attribute features are then fused into the fusion feature.
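A sketch of mapping the per-frame entity and attribute labels into N × M features via an embedding-style text feature mapping; reusing the question's feature dimension M and the table sizes below are assumptions.

```python
import torch
import torch.nn as nn

# Sketch: map per-frame entity / attribute type indices into N x M features,
# matching the word feature dimension M of the question. Sizes are assumptions.
N, M = 36, 300
num_entities, num_attributes = 1600, 400

entity_embed = nn.Embedding(num_entities, M)
attribute_embed = nn.Embedding(num_attributes, M)

entity_ids = torch.randint(0, num_entities, (N,))        # from the entity classifier
attribute_ids = torch.randint(0, num_attributes, (N,))   # from the attribute classifier

entity_features = entity_embed(entity_ids)               # N x M entity features
attribute_features = attribute_embed(attribute_ids)      # N x M attribute features
```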
As a possible implementation, the step of fusing the text encoding feature, the image feature, the position code, the entity feature and the attribute feature into the fused feature comprises: fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics; and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
In this embodiment, as shown in fig. 3, the attribute features, the entity features and the image features are spliced together to generate an N × (2M + K) image detection feature, which is input into the subsequent encoder to obtain the image encoding feature. In this way, the detection features of the related art can be fused with their entities and attributes to generate new features with more comprehensive information, assisting the subsequent VQA classification task.
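The N × (2M + K) splice in fig. 3 amounts to a concatenation along the feature axis, sketched below with dimensions consistent with the earlier N × K and N × M shapes (all assumed).

```python
import torch

# Sketch of the fig. 3 fusion: concatenate attribute (N x M), entity (N x M)
# and image (N x K) features into one N x (2M + K) image detection feature.
N, M, K = 36, 300, 2048
attribute_features = torch.randn(N, M)
entity_features = torch.randn(N, M)
image_features = torch.randn(N, K)

image_detection_features = torch.cat(
    [attribute_features, entity_features, image_features], dim=-1)
print(image_detection_features.shape)  # torch.Size([36, 2648]) = N x (2M + K)
# The richer feature is then fed to the subsequent encoder to obtain the
# image encoding feature consumed by the VQA classifier.
```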
As another possible implementation, the step of fusing the text encoding feature, the image feature, the position code, the entity feature and the attribute feature into the fused feature includes: fusing the position codes and the image features and then coding to obtain image coding features; coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics; and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
In this embodiment, as shown in fig. 4, the image features and their corresponding positions are fused and then feature-encoded; meanwhile, the text features, the entity features and the attribute features are each encoded separately. The resulting encodings are then fused, encoded further, and classified by the VQA classifier, which can reduce the difficulty of network learning.
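The fig. 4 variant encodes each stream separately before fusing, as in the sketch below; the linear encoders and widths are assumptions standing in for whatever encoder the implementation actually uses.

```python
import torch
import torch.nn as nn

# Sketch of the fig. 4 variant: encode each feature stream separately, then
# fuse the encodings for VQA classification. Encoder widths are assumptions.
D, N, M, K, L = 512, 36, 300, 2048, 14

text_encoder = nn.Linear(M, D)       # text features -> text encoding features
image_encoder = nn.Linear(K, D)      # (image + position) -> image encoding features
entity_encoder = nn.Linear(M, D)     # entity features -> entity encoding features
attribute_encoder = nn.Linear(M, D)  # attribute features -> attribute encoding features

text_code = text_encoder(torch.randn(L, M))
image_code = image_encoder(torch.randn(N, K))   # position code assumed already fused in
entity_code = entity_encoder(torch.randn(N, M))
attribute_code = attribute_encoder(torch.randn(N, M))

# Fuse the four encodings; learning each stream separately before fusion is
# what the embodiment credits with reducing the difficulty of network learning.
fused = torch.cat([text_code, image_code, entity_code, attribute_code], dim=0)
```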
The visual question-answering method provided by the embodiment of the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features without changing the target detection framework. Therefore, the method expands the entities and attributes in the image into the detection features, enriches the image features of the existing VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the existing detection framework, and helps improve the classification accuracy of the VQA task.
In the following, a visual question-answering device provided by an embodiment of the present application is introduced; the visual question-answering device described below and the visual question-answering method described above may be cross-referenced.
Referring to fig. 5, a block diagram of a visual question answering device according to an exemplary embodiment is shown, as shown in fig. 5, including:
an extracting module 501, configured to obtain a target question sentence and a target image, extract text features from the target question sentence, and extract image features from the target image by using a target detection framework;
a first determining module 502, configured to determine a position corresponding to the image feature, and determine an entity type and an attribute type corresponding to the image feature;
a fusion module 503, configured to perform feature fusion on the text feature, the image feature, the position, the entity type, and the attribute type to obtain a fusion feature;
an input module 504, configured to input the fused feature into a VQA classifier to obtain an answer corresponding to the target question.
The visual question-answering device provided by the embodiment of the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features without changing the target detection framework. Therefore, the device expands the entities and attributes in the image into the detection features, enriches the image features of the conventional VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the conventional detection framework, and helps improve the classification accuracy of the VQA task.
On the basis of the foregoing embodiment, as a preferred implementation, the first determining module 502 includes:
the determining unit is used for determining the position corresponding to the image feature;
the first input unit is used for inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and the second input unit is used for inputting the image features into the trained attribute classifier to obtain the attribute types corresponding to the image features.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
a second determining module for determining all of the entity types and all of the attribute types from an answer candidate set;
the system comprises an annotation module, a processing module and a processing module, wherein the annotation module is used for acquiring an image feature training set and annotating an entity type and an attribute type corresponding to each training image feature in the image feature training set;
the first training module is used for training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and the second training module is used for training the attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
On the basis of the foregoing embodiment, as a preferred implementation manner, the extracting module 501 is specifically a module that acquires a target question and a target image, performs text feature mapping on the target question so as to extract text features from the target question, and extracts image features from the target image by using a target detection framework.
On the basis of the above embodiment, as a preferred implementation, the fusion module 503 includes:
the coding unit is used for coding the text characteristics to obtain text coding characteristics and coding the position to obtain position codes;
the mapping unit is used for performing text feature mapping on the entity type so as to obtain entity features and encoding the attribute type so as to obtain attribute features;
and the fusion unit is used for fusing the text coding feature, the image feature, the position code, the entity feature and the attribute feature into the fusion feature.
On the basis of the above embodiment, as a preferred implementation, the fusion unit includes:
the first coding subunit is used for fusing the position codes and the image characteristics and then coding to obtain image coding characteristics;
the second coding subunit is used for coding the entity characteristics to obtain entity coding characteristics and coding the attribute characteristics to obtain attribute coding characteristics;
and the first fusion subunit is used for performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
On the basis of the above embodiment, as a preferred implementation, the fusion unit includes:
the third coding subunit is used for fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics and coding the image detection characteristics to obtain image coding characteristics;
and the second fusion subunit is used for performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present application further provides an electronic device. Referring to fig. 6, which shows a structure diagram of an electronic device 600 provided in an embodiment of the present application, the electronic device 600 may include a processor 11 and a memory 12. The electronic device 600 may also include one or more of a multimedia component 13, an input/output (I/O) interface 14, and a communication component 15.
The processor 11 is configured to control the overall operation of the electronic device 600 so as to complete all or part of the steps in the above-mentioned visual question answering method. The memory 12 is used to store various types of data to support operation of the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data. The memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 13 may comprise a screen, which may be, for example, a touch screen. The I/O interface 14 provides an interface between the processor 11 and other interface modules, such as a keyboard, a mouse, or buttons; these buttons may be virtual buttons or physical buttons. The communication component 15 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 15 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic Device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described visual question answering method.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described visual question-answering method is also provided. For example, the computer readable storage medium may be the memory 12 described above including program instructions that are executable by the processor 11 of the electronic device 600 to perform the visual question-answering method described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method of visual question answering, comprising:
acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
2. The visual question answering method according to claim 1, wherein the determining of the entity type and the attribute type corresponding to the image feature comprises:
inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
3. The visual question-answering method according to claim 2, further comprising:
determining all of the entity types and all of the attribute types from a candidate set of answers;
acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set;
training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
4. The visual question-answering method according to claim 1, wherein extracting text features from the target question comprises:
and performing text feature mapping on the target question so as to extract text features from the target question.
5. The visual question-answering method according to claim 1, wherein performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature comprises:
coding the text features to obtain text coding features, and coding the positions to obtain position codes;
performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features;
fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
6. The visual question-answering method according to claim 5, wherein fusing the text-coding features, the image features, the position codes, the entity features and the attribute features into the fused features comprises:
fusing the position codes and the image features and then coding to obtain image coding features;
coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics;
and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
7. The visual question-answering method according to claim 5, wherein fusing the text-coding features, the image features, the position codes, the entity features and the attribute features into the fused features comprises:
fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics;
and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
8. A visual question answering apparatus, comprising:
the extraction module is used for acquiring a target question and a target image, extracting text features from the target question and extracting image features from the target image by using a target detection framework;
the first determining module is used for determining the position corresponding to the image feature and determining the entity type and the attribute type corresponding to the image feature;
the fusion module is used for carrying out feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain fusion features;
and the input module is used for inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the visual question answering method according to any one of claims 1 to 7 when executing said computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the visual question-answering method according to any one of claims 1 to 7.
CN202010711706.4A 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium Pending CN111860653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010711706.4A CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010711706.4A CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111860653A (en) 2020-10-30

Family ID: 72949423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010711706.4A Pending CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111860653A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659398A (en) * 2019-07-11 2020-01-07 电子科技大学 Visual question-answering method based on mathematical chart data set
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN110717024A (en) * 2019-10-08 2020-01-21 苏州派维斯信息科技有限公司 Visual question-answering problem solving method based on image visual to text conversion

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780486A (en) * 2021-11-15 2021-12-10 湖南师范大学 Visual question answering method, device and medium
CN114443822A (en) * 2021-12-24 2022-05-06 科大讯飞(苏州)科技有限公司 Method, system and computing device for multi-modal question answering in the field of construction
CN114443822B (en) * 2021-12-24 2023-05-26 科大讯飞(苏州)科技有限公司 Method, system and computing device for multimodal question-answering in the building field
CN115129848A (en) * 2022-09-02 2022-09-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for processing visual question-answering task
CN115129848B (en) * 2022-09-02 2023-02-28 苏州浪潮智能科技有限公司 Method, device, equipment and medium for processing visual question-answering task
WO2024045444A1 (en) * 2022-09-02 2024-03-07 苏州浪潮智能科技有限公司 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
CN115393854A (en) * 2022-10-27 2022-11-25 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115393854B (en) * 2022-10-27 2023-02-21 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115905591A (en) * 2023-02-22 2023-04-04 浪潮电子信息产业股份有限公司 Visual question answering method, system, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111860653A (en) Visual question answering method and device, electronic equipment and storage medium
WO2022095682A1 (en) Text classification model training method, text classification method and apparatus, device, storage medium, and computer program product
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN110704576B (en) Text-based entity relationship extraction method and device
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN110717325A (en) Text emotion analysis method and device, electronic equipment and storage medium
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN108776677B (en) Parallel sentence library creating method and device and computer readable storage medium
CN110991175A (en) Text generation method, system, device and storage medium under multiple modes
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN110377910B (en) Processing method, device, equipment and storage medium for table description
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN113076720A (en) Long text segmentation method and device, storage medium and electronic device
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
CN112559725A (en) Text matching method, device, terminal and storage medium
CN111552819A (en) Entity extraction method and device and readable storage medium
CN115017271B (en) Method and system for intelligently generating RPA flow component block
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN116483314A (en) Automatic intelligent activity diagram generation method
CN114298032A (en) Text punctuation detection method, computer device and storage medium
CN114020907A (en) Information extraction method and device, storage medium and electronic equipment
CN114154497A (en) Language disease identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201030)