CN111860653A - Visual question answering method and device, electronic equipment and storage medium

Visual question answering method and device, electronic equipment and storage medium

Info

Publication number
CN111860653A
Authority
CN
China
Prior art keywords
features
image
feature
attribute
entity
Prior art date
Legal status
Pending
Application number
CN202010711706.4A
Other languages
Chinese (zh)
Inventor
李晓川
张润泽
范宝余
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010711706.4A
Publication of CN111860653A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a visual question answering method and device, an electronic device and a computer-readable storage medium, wherein the method comprises the following steps: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework; determining a position corresponding to the image features, and determining an entity type and an attribute type corresponding to the image features; performing feature fusion on the text features, the image features, the position, the entity type and the attribute type to obtain a fusion feature; and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question. The visual question answering method provided by the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, thereby improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features.

Description

Visual question answering method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a visual question answering method and apparatus, an electronic device, and a computer-readable storage medium.
Background
VQA (Visual Question Answering) aims to give a computer the ability to answer human questions according to image content, and is a cross-modal AI (Artificial Intelligence) processing technology. The VQA task fuses images and text, and is therefore a cross-modal AI task.
Because of the characteristics of the VQA task, image features and text features need to be extracted separately; the two are then fused and input into an encoder network to extract encoding features, and finally the encoding features are input into a classifier to predict the final answer. At present, an object detection framework is generally adopted as the image feature extraction network. As shown in fig. 1, the input question text is converted into an L × M text feature through text feature mapping, where L denotes the length of the question (i.e., the question includes L words) and M denotes the dimension of the feature each word is converted into. The text feature is then encoded by a text feature encoder. Similarly, the detected features of the image are fused with their corresponding detected positions and then encoded. The two encoded features are then fused and encoded further, and the VQA classifier performs classification to output the final predicted answer.
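For concreteness, the related-art pipeline of fig. 1 can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed module choices (an embedding plus LSTM text encoder, a linear image encoder, mean pooling over boxes) and assumed dimensions; the patent does not prescribe a concrete implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the related-art VQA pipeline (fig. 1). All module choices
# and dimensions (L, M, N, K, D) are illustrative assumptions.
class BaselineVQA(nn.Module):
    def __init__(self, vocab_size=10000, m_dim=300, k_dim=2048, d_model=512, num_answers=3000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, m_dim)   # question -> L x M text feature
        self.text_encoder = nn.LSTM(m_dim, d_model, batch_first=True)
        self.box_proj = nn.Linear(4, k_dim)                 # detected positions -> K dims
        self.image_encoder = nn.Linear(k_dim, d_model)      # N x K detection features -> N x D
        self.vqa_classifier = nn.Sequential(                # fused encoding -> answer logits
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_answers))

    def forward(self, question_ids, det_feats, det_boxes):
        # question_ids: (B, L); det_feats: (B, N, K); det_boxes: (B, N, 4)
        _, (h, _) = self.text_encoder(self.word_embed(question_ids))
        text_code = h[-1]                                   # (B, D) text encoding feature
        img = det_feats + self.box_proj(det_boxes)          # fuse features with positions
        img_code = self.image_encoder(img).mean(dim=1)      # (B, D), pooled over N boxes
        fused = torch.cat([text_code, img_code], dim=-1)    # cross-modal feature fusion
        return self.vqa_classifier(fused)                   # predicted answer distribution
```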
Limited by the performance bottleneck of the target detection framework, it is difficult for the VQA model to greatly improve its prediction accuracy merely by modifying the network structure of the model. Therefore, how to sufficiently exploit the detection features to help improve the classification accuracy of the VQA task is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a visual question answering method and device, an electronic device and a computer readable storage medium, which realize sufficient extraction of detection features and assist in improving the classification accuracy of VQA tasks.
In order to achieve the above object, the present application provides a visual question answering method, including:
acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
Wherein the determining the entity type and the attribute type corresponding to the image feature comprises:
inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
Wherein, the method further comprises:
determining all of the entity types and all of the attribute types from a candidate set of answers;
acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set;
training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
Wherein extracting text features from the target question comprises:
and performing text feature mapping on the target question so as to extract text features from the target question.
Wherein performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature comprises:
coding the text features to obtain text coding features, and coding the positions to obtain position codes;
performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features;
fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
Wherein fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature comprises:
fusing the position codes and the image features and then coding to obtain image coding features;
coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics;
and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
Wherein fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature comprises:
fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics;
and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
To achieve the above object, the present application provides a visual question answering device, comprising:
the extraction module is used for acquiring a target question and a target image, extracting text features from the target question and extracting image features from the target image by using a target detection framework;
the first determining module is used for determining the position corresponding to the image feature and determining the entity type and the attribute type corresponding to the image feature;
the fusion module is used for carrying out feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain fusion features;
and the input module is used for inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of the visual question-answering method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the above-mentioned visual question-answering method.
According to the scheme, the visual question answering method comprises the following steps: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework; determining a position corresponding to the image features, and determining an entity type and an attribute type corresponding to the image features; performing feature fusion on the text features, the image features, the position, the entity type and the attribute type to obtain a fusion feature; and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
The visual question-answering method provided by the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier; it does not change the target detection framework, but improves the prediction accuracy of the VQA model by improving the utilization rate of the detection features. Therefore, the visual question-answering method provided by the application expands the entities and attributes in the image into the detection features, enriches the image features of the conventional VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the conventional detection framework, and helps improve the classification accuracy of the VQA task. The application also discloses a visual question-answering device, an electronic device and a computer-readable storage medium, which can achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of VQA provided in the related art;
FIG. 2 is a flow diagram illustrating a method of visual question answering in accordance with one exemplary embodiment;
FIG. 3 is a VQA flow diagram according to an exemplary embodiment;
FIG. 4 is another VQA flow diagram according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a visual question answering device in accordance with one exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a visual question-answering method, which realizes full extraction of detection features and assists in improving the classification precision of VQA tasks.
Referring to fig. 2, a flow diagram of a visual question answering method according to an exemplary embodiment is shown; as shown in fig. 2, the method includes:
S101: acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
In this step, the input question text, i.e., the target question, is converted into an L × M text feature, where L denotes the length of the question (i.e., the target question includes L words) and M denotes the dimension of the feature each word is converted into. Preferably, the step of extracting text features from the target question may include: performing text feature mapping on the target question so as to extract text features from it. Meanwhile, image features are extracted from the target image by using the target detection framework.
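As an illustration of the L × M text feature mapping, the sketch below tokenizes the question by whitespace and looks each word up in an embedding table; the toy vocabulary, the tokenizer, and M = 300 are assumptions for illustration, not part of the patent.

```python
import torch
import torch.nn as nn

# Sketch of the L x M text feature mapping in step S101. The vocabulary,
# tokenizer, and embedding dimension M are illustrative assumptions.
vocab = {"<unk>": 0, "what": 1, "color": 2, "is": 3, "the": 4, "cat": 5}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)  # M = 300

question = "what color is the cat"
ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in question.split()])
text_features = embed(ids)   # shape (L, M) = (5, 300), one feature row per word
print(text_features.shape)   # torch.Size([5, 300])
```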
S102: determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
In this step, the position corresponding to the image features is determined, and entity and attribute classification is performed on the image features. The entity in this embodiment is specifically a noun-type entity, such as cat, dog, or tree, and the attribute is specifically an adjective-type attribute, such as white.
Preferably, in this embodiment, the entity type and the attribute type corresponding to the image feature may be determined by using an entity classifier and an attribute classifier, that is, the step of determining the entity type and the attribute type corresponding to the image feature may include: inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features; and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
In a specific implementation, image features are first extracted through the object detection framework; for example, for an image of arbitrary size, a detection feature of size N × K is extracted, where N represents the number of preset candidate frames in the detection framework, and K represents the feature dimension of each preset candidate frame. Then, the N K-dimensional vectors are passed through the entity classifier and the attribute classifier respectively, and entity and attribute assignment is performed on the target frame features to obtain the entity and attribute text of each target frame.
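A minimal sketch of the two classifier heads over the N K-dimensional detection vectors follows; the linear-head architecture, N, K, and the class counts are assumptions, since the patent does not fix the classifier structure.

```python
import torch
import torch.nn as nn

# Sketch of the entity and attribute classifier heads. The patent does not fix
# their architecture; linear heads and these sizes are assumptions.
K, N = 2048, 36                           # feature dim / number of candidate frames
num_entities, num_attributes = 1600, 400  # enumerated from the answer candidate set

entity_head = nn.Linear(K, num_entities)
attribute_head = nn.Linear(K, num_attributes)

det_feats = torch.randn(N, K)             # N x K detection features from the framework
entity_ids = entity_head(det_feats).argmax(dim=-1)        # entity type per frame
attribute_ids = attribute_head(det_feats).argmax(dim=-1)  # attribute type per frame
# entity_ids / attribute_ids index into the enumerated entity / attribute
# vocabularies (e.g. "cat", "dog", "tree" / "white"), yielding text labels
# for each target frame.
```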
The step of training the entity classifier and the attribute classifier comprises: determining all of the entity types and all of the attribute types from a candidate set of answers; acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set; training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier; and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier. In a specific implementation, the entity categories and attribute categories are enumerated separately by analyzing the candidate set of answers. And training the classifier by taking the training image features in the image feature training set, the corresponding entities and the attribute labels as training samples.
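The classifier training described above could be realized as a standard supervised loop; the synthetic data, cross-entropy loss, and optimizer settings below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed training loop for the entity classifier; the attribute classifier is
# trained identically against attribute labels. Data here is synthetic.
K, num_entities = 2048, 1600
train_feats = torch.randn(10000, K)                       # image feature training set
train_labels = torch.randint(0, num_entities, (10000,))   # labeled entity types

entity_head = nn.Linear(K, num_entities)
optimizer = torch.optim.Adam(entity_head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

loader = DataLoader(TensorDataset(train_feats, train_labels), batch_size=256, shuffle=True)
for feats, labels in loader:
    optimizer.zero_grad()
    loss = criterion(entity_head(feats), labels)  # cross-entropy over entity types
    loss.backward()
    optimizer.step()
```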
S103: performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
S104: and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
In specific implementation, the text features, the image features, the positions corresponding to the image features, the entity types and the attribute types are fused into a fusion feature, and the fusion feature is encoded and input into the VQA classifier to obtain the answer corresponding to the target question.
Preferably, the performing feature fusion on the text feature, the image feature, the position, the entity type, and the attribute type to obtain a fusion feature includes: coding the text features to obtain text coding features, and coding the positions to obtain position codes; performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features; fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
In specific implementation, a text feature encoder is used to encode the text features; similarly, the positions corresponding to the image features are encoded to obtain position codes. Text feature mapping is performed on the entity types to obtain N × M entity features, and the attribute types are likewise mapped to obtain N × M attribute features. The text encoding features, the image features, the position codes, the entity features and the attribute features are then fused into the fusion feature.
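A sketch of mapping the per-frame entity and attribute labels into N × M features via an embedding-style text feature mapping; reusing the question's feature dimension M and the table sizes below are assumptions.

```python
import torch
import torch.nn as nn

# Sketch: map per-frame entity / attribute type indices into N x M features,
# matching the word feature dimension M of the question. Sizes are assumptions.
N, M = 36, 300
num_entities, num_attributes = 1600, 400

entity_embed = nn.Embedding(num_entities, M)
attribute_embed = nn.Embedding(num_attributes, M)

entity_ids = torch.randint(0, num_entities, (N,))        # from the entity classifier
attribute_ids = torch.randint(0, num_attributes, (N,))   # from the attribute classifier

entity_features = entity_embed(entity_ids)               # N x M entity features
attribute_features = attribute_embed(attribute_ids)      # N x M attribute features
```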
As a possible implementation, the step of fusing the text encoding feature, the image feature, the position code, the entity feature and the attribute feature into the fused feature comprises: fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics; and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
In this embodiment, as shown in fig. 3, the attribute features, the entity features and the image features are spliced together to generate an N × (2M + K) image detection feature, which is input into the subsequent encoder to obtain the image encoding feature. In this way, the detection features of the related art can be fused with their entities and attributes to generate new features with more comprehensive information, assisting the subsequent VQA classification task.
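The N × (2M + K) splice in fig. 3 amounts to a concatenation along the feature axis, sketched below with dimensions consistent with the earlier N × K and N × M shapes (all assumed).

```python
import torch

# Sketch of the fig. 3 fusion: concatenate attribute (N x M), entity (N x M)
# and image (N x K) features into one N x (2M + K) image detection feature.
N, M, K = 36, 300, 2048
attribute_features = torch.randn(N, M)
entity_features = torch.randn(N, M)
image_features = torch.randn(N, K)

image_detection_features = torch.cat(
    [attribute_features, entity_features, image_features], dim=-1)
print(image_detection_features.shape)  # torch.Size([36, 2648]) = N x (2M + K)
# The richer feature is then fed to the subsequent encoder to obtain the
# image encoding feature consumed by the VQA classifier.
```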
As another possible implementation, the step of fusing the text encoding feature, the image feature, the position code, the entity feature and the attribute feature into the fused feature includes: fusing the position codes and the image features and then coding to obtain image coding features; coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics; and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
In this embodiment, as shown in fig. 4, the image features and their corresponding positions are fused and then feature-encoded; meanwhile, the text features, the entity features and the attribute features are each encoded separately. The resulting encodings are then fused, encoded further, and classified by the VQA classifier, which can reduce the difficulty of network learning.
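The fig. 4 variant encodes each stream separately before fusing, as in the sketch below; the linear encoders and widths are assumptions standing in for whatever encoder the implementation actually uses.

```python
import torch
import torch.nn as nn

# Sketch of the fig. 4 variant: encode each feature stream separately, then
# fuse the encodings for VQA classification. Encoder widths are assumptions.
D, N, M, K, L = 512, 36, 300, 2048, 14

text_encoder = nn.Linear(M, D)       # text features -> text encoding features
image_encoder = nn.Linear(K, D)      # (image + position) -> image encoding features
entity_encoder = nn.Linear(M, D)     # entity features -> entity encoding features
attribute_encoder = nn.Linear(M, D)  # attribute features -> attribute encoding features

text_code = text_encoder(torch.randn(L, M))
image_code = image_encoder(torch.randn(N, K))   # position code assumed already fused in
entity_code = entity_encoder(torch.randn(N, M))
attribute_code = attribute_encoder(torch.randn(N, M))

# Fuse the four encodings; learning each stream separately before fusion is
# what the embodiment credits with reducing the difficulty of network learning.
fused = torch.cat([text_code, image_code, entity_code, attribute_code], dim=0)
```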
The visual question-answering method provided by the embodiment of the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features without changing the target detection framework. Therefore, the method expands the entities and attributes in the image into the detection features, enriches the image features of the existing VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the existing detection framework, and helps improve the classification accuracy of the VQA task.
In the following, a visual question-answering device provided by an embodiment of the present application is introduced; the visual question-answering device described below and the visual question-answering method described above may be cross-referenced.
Referring to fig. 5, a block diagram of a visual question answering device according to an exemplary embodiment is shown, as shown in fig. 5, including:
an extracting module 501, configured to obtain a target question sentence and a target image, extract text features from the target question sentence, and extract image features from the target image by using a target detection framework;
a first determining module 502, configured to determine a position corresponding to the image feature, and determine an entity type and an attribute type corresponding to the image feature;
a fusion module 503, configured to perform feature fusion on the text feature, the image feature, the position, the entity type, and the attribute type to obtain a fusion feature;
an input module 504, configured to input the fused feature into a VQA classifier to obtain an answer corresponding to the target question.
The visual question-answering device provided by the embodiment of the application performs entity and attribute classification on the image features output by the target detection framework, fuses the classifier outputs with the text features and the image features, and inputs the result into the VQA classifier, improving the prediction accuracy of the VQA model by improving the utilization rate of the detection features without changing the target detection framework. Therefore, the device expands the entities and attributes in the image into the detection features, enriches the image features of the conventional VQA task, extracts the detection features as fully as possible, improves the completeness of the input features, breaks the bottleneck of the conventional detection framework, and helps improve the classification accuracy of the VQA task.
On the basis of the foregoing embodiment, as a preferred implementation, the first determining module 502 includes:
the determining unit is used for determining the position corresponding to the image feature;
the first input unit is used for inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and the second input unit is used for inputting the image features into the trained attribute classifier to obtain the attribute types corresponding to the image features.
On the basis of the above embodiment, as a preferred implementation, the method further includes:
a second determining module for determining all of the entity types and all of the attribute types from an answer candidate set;
the system comprises an annotation module, a processing module and a processing module, wherein the annotation module is used for acquiring an image feature training set and annotating an entity type and an attribute type corresponding to each training image feature in the image feature training set;
the first training module is used for training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and the second training module is used for training the attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
On the basis of the foregoing embodiment, as a preferred implementation manner, the extracting module 501 is specifically a module that acquires a target question and a target image, performs text feature mapping on the target question so as to extract text features from the target question, and extracts image features from the target image by using a target detection framework.
On the basis of the above embodiment, as a preferred implementation, the fusion module 503 includes:
the coding unit is used for coding the text characteristics to obtain text coding characteristics and coding the position to obtain position codes;
the mapping unit is used for performing text feature mapping on the entity type so as to obtain entity features and encoding the attribute type so as to obtain attribute features;
and the fusion unit is used for fusing the text coding feature, the image feature, the position code, the entity feature and the attribute feature into the fusion feature.
On the basis of the above embodiment, as a preferred implementation, the fusion unit includes:
the first coding subunit is used for fusing the position codes and the image characteristics and then coding to obtain image coding characteristics;
the second coding subunit is used for coding the entity characteristics to obtain entity coding characteristics and coding the attribute characteristics to obtain attribute coding characteristics;
and the first fusion subunit is used for performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
On the basis of the above embodiment, as a preferred implementation, the fusion unit includes:
the third coding subunit is used for fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics and coding the image detection characteristics to obtain image coding characteristics;
and the second fusion subunit is used for performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present application further provides an electronic device. Referring to fig. 6, which shows a structure diagram of an electronic device 600 provided in an embodiment of the present application, the electronic device 600 may include a processor 11 and a memory 12. The electronic device 600 may also include one or more of a multimedia component 13, an input/output (I/O) interface 14, and a communication component 15.
The processor 11 is configured to control the overall operation of the electronic device 600 so as to complete all or part of the steps in the above-mentioned visual question answering method. The memory 12 is used to store various types of data to support operation of the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data. The memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 13 may comprise a screen, which may be, for example, a touch screen. The I/O interface 14 provides an interface between the processor 11 and other interface modules, such as a keyboard, a mouse, or buttons; these buttons may be virtual buttons or physical buttons. The communication component 15 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 15 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic Device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described visual question answering method.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described visual question-answering method is also provided. For example, the computer readable storage medium may be the memory 12 described above including program instructions that are executable by the processor 11 of the electronic device 600 to perform the visual question-answering method described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method of visual question answering, comprising:
acquiring a target question and a target image, extracting text features from the target question, and extracting image features from the target image by using a target detection framework;
determining a position corresponding to the image feature, and determining an entity type and an attribute type corresponding to the image feature;
performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature;
and inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
2. The visual question answering method according to claim 1, wherein the determining of the entity type and the attribute type corresponding to the image feature comprises:
inputting the image features into a trained entity classifier to obtain entity types corresponding to the image features;
and inputting the image features into a trained attribute classifier to obtain attribute types corresponding to the image features.
3. The visual question-answering method according to claim 2, further comprising:
determining all of the entity types and all of the attribute types from a candidate set of answers;
acquiring an image feature training set, and labeling an entity type and an attribute type corresponding to each training image feature in the image feature training set;
training an entity classifier by using the image feature training set and the entity type corresponding to each training image feature so as to obtain the trained entity classifier;
and training an attribute classifier by using the image feature training set and the attribute type corresponding to each training image feature so as to obtain the trained attribute classifier.
4. The visual question-answering method according to claim 1, wherein extracting text features from the target question comprises:
and performing text feature mapping on the target question so as to extract text features from the target question.
5. The visual question-answering method according to claim 1, wherein performing feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain a fusion feature comprises:
coding the text features to obtain text coding features, and coding the positions to obtain position codes;
performing text feature mapping on the entity type to obtain entity features, and encoding the attribute type to obtain attribute features;
fusing the text-coding feature, the image feature, the position code, the entity feature, and the attribute feature into the fused feature.
6. The visual question-answering method according to claim 5, wherein fusing the text-coding features, the image features, the position codes, the entity features and the attribute features into the fused features comprises:
fusing the position codes and the image features and then coding to obtain image coding features;
coding the entity characteristics to obtain entity coding characteristics, and coding the attribute characteristics to obtain attribute coding characteristics;
and performing feature fusion on the text coding features, the image coding features, the entity coding features and the attribute coding features to obtain fusion features.
7. The visual question-answering method according to claim 5, wherein fusing the text-coding features, the image features, the position codes, the entity features and the attribute features into the fused features comprises:
fusing the entity characteristics, the attribute characteristics and the position codes into image detection characteristics, and coding the image detection characteristics to obtain image coding characteristics;
and performing feature fusion on the text coding features and the image coding features to obtain the fusion features.
8. A visual question answering apparatus, comprising:
the extraction module is used for acquiring a target question and a target image, extracting text features from the target question and extracting image features from the target image by using a target detection framework;
the first determining module is used for determining the position corresponding to the image feature and determining the entity type and the attribute type corresponding to the image feature;
the fusion module is used for carrying out feature fusion on the text feature, the image feature, the position, the entity type and the attribute type to obtain fusion features;
and the input module is used for inputting the fusion feature into a VQA classifier to obtain an answer corresponding to the target question.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the visual question answering method according to any one of claims 1 to 7 when executing said computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the visual question-answering method according to any one of claims 1 to 7.
CN202010711706.4A 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium Pending CN111860653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010711706.4A CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010711706.4A CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111860653A (en) 2020-10-30

Family ID: 72949423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010711706.4A Pending CN111860653A (en) 2020-07-22 2020-07-22 Visual question answering method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111860653A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659398A (en) * 2019-07-11 2020-01-07 电子科技大学 Visual question-answering method based on mathematical chart data set
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN110717024A (en) * 2019-10-08 2020-01-21 苏州派维斯信息科技有限公司 Visual question-answering problem solving method based on image visual to text conversion

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780486A (en) * 2021-11-15 2021-12-10 湖南师范大学 Visual question answering method, device and medium
CN114443822A (en) * 2021-12-24 2022-05-06 科大讯飞(苏州)科技有限公司 Method, system and computing device for multi-modal question answering in the field of construction
CN114443822B (en) * 2021-12-24 2023-05-26 科大讯飞(苏州)科技有限公司 Method, system and computing device for multimodal question-answering in the building field
CN115129848A (en) * 2022-09-02 2022-09-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for processing visual question-answering task
CN115129848B (en) * 2022-09-02 2023-02-28 苏州浪潮智能科技有限公司 Method, device, equipment and medium for processing visual question-answering task
WO2024045444A1 (en) * 2022-09-02 2024-03-07 苏州浪潮智能科技有限公司 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
CN115393854A (en) * 2022-10-27 2022-11-25 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115393854B (en) * 2022-10-27 2023-02-21 粤港澳大湾区数字经济研究院(福田) Visual alignment processing method, terminal and storage medium
CN115905591A (en) * 2023-02-22 2023-04-04 浪潮电子信息产业股份有限公司 Visual question answering method, system, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111860653A (en) Visual question answering method and device, electronic equipment and storage medium
WO2022095682A1 (en) Text classification model training method, text classification method and apparatus, device, storage medium, and computer program product
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN110704576B (en) Text-based entity relationship extraction method and device
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN110717325A (en) Text emotion analysis method and device, electronic equipment and storage medium
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN108776677B (en) Parallel sentence library creating method and device and computer readable storage medium
CN110991175A (en) Text generation method, system, device and storage medium under multiple modes
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN110377910B (en) Processing method, device, equipment and storage medium for table description
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN113076720A (en) Long text segmentation method and device, storage medium and electronic device
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
CN112559725A (en) Text matching method, device, terminal and storage medium
CN111552819A (en) Entity extraction method and device and readable storage medium
CN115017271B (en) Method and system for intelligently generating RPA flow component block
CN115115432B (en) Product information recommendation method and device based on artificial intelligence
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN116483314A (en) Automatic intelligent activity diagram generation method
CN114298032A (en) Text punctuation detection method, computer device and storage medium
CN114020907A (en) Information extraction method and device, storage medium and electronic equipment
CN114154497A (en) Language disease identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201030)