CN115346095A - Visual question answering method, device, equipment and storage medium - Google Patents


Info

Publication number: CN115346095A
Application number: CN202211080483.1A
Authority: CN (China)
Legal status: Pending
Prior art keywords: features, image, question, predicted, fusion
Other languages: Chinese (zh)
Inventors: 唐小初, 黎铭, 舒畅, 陈又新
Current and original assignee: Ping An Technology Shenzhen Co Ltd
Events: application filed by Ping An Technology Shenzhen Co Ltd; priority to CN202211080483.1A; publication of CN115346095A; legal status pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757: Matching configurations of points or features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a visual question answering method comprising the following steps: extracting reference image features of a reference image; constructing a standard reference data set based on the reference image features and reference questions and answers; acquiring visual data to be predicted, which includes an image to be predicted and a question to be predicted; extracting sample image features of the image to be predicted; matching target reference questions and answers in the standard reference data set based on the sample image features; performing feature fusion on the question to be predicted, the sample image features and the target reference questions and answers by using a pre-constructed multi-modal feature fusion network to obtain fusion features; and performing feature prediction on the fusion features to obtain a prediction result. Furthermore, the invention relates to blockchain technology, and the prediction result may be stored in a node of the blockchain. The invention also provides a visual question answering device, an electronic device and a readable storage medium. The invention can improve the accuracy of the prediction result in visual question answering.

Description

Visual question-answering method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a visual question answering method and device, electronic equipment and a readable storage medium.
Background
Visual Question Answering (VQA) is a multimodal learning task involving computer vision and natural language processing. A VQA system takes an image and a question as input, extracts the features of the image and of the question text together with the associations between them, and outputs a reasonable predicted answer.
An intelligent visual question answering system must be able to extract the respective features of the image and question-text modalities, and the associations between them, before it can output a relatively accurate answer. In the prior art, after a model has been trained on a collected visual question-answer data set, the model with the highest accuracy on the validation set is used for prediction; however, the reference information of similar samples in the data set is not reused during prediction, so the accuracy of the prediction result is low.
Disclosure of Invention
The invention provides a visual question answering method, a visual question answering device, electronic equipment and a readable storage medium, and mainly aims to improve the accuracy of a prediction result in a visual question answering.
In order to achieve the above object, the present invention provides a visual question answering method, which comprises:
acquiring an original reference data set containing a reference image and a reference question and answer, extracting reference image characteristics of the reference image, and constructing a standard reference data set based on the reference image characteristics and the reference question and answer;
acquiring visual data to be predicted, which comprises an image to be predicted and a question to be predicted, extracting sample image features of the image to be predicted, and matching target reference questions and answers in the standard reference data set based on the sample image features;
performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and performing feature prediction on the fusion features to obtain a prediction result.
Optionally, the extracting reference image features of the reference image, and constructing a standard reference data set based on the reference image features and the reference question and answer includes:
extracting reference image features of a reference image in the original reference data set by using a preset image encoder, and associating the reference image features with reference questions and answers corresponding to the reference image;
and summarizing all the associated reference image features and the reference questions and answers to construct the standard reference data set.
Optionally, the extracting sample image features of the image to be predicted, and matching the target reference question answer in the standard reference data set based on the sample image features includes:
extracting sample image features of the image to be predicted by using the image encoder;
calculating the similarity of the sample image features and the reference image features in the standard reference data set one by one;
and determining the reference image features with the similarity greater than a preset similarity threshold as matched target image features, and taking the reference question answers corresponding to the target image features as the target reference question answers.
Optionally, the similarity between the sample image feature and the reference image feature in the standard reference data set is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

where cos(A, B) represents the similarity between the sample image feature A and the reference image feature B, n represents the feature dimension, a_i represents the i-th component of the feature vector of the sample image feature A, and b_i represents the i-th component of the feature vector of the reference image feature B.
Optionally, before the feature fusion is performed on the question to be predicted, the sample image features and the target reference question and answer by using the pre-constructed multi-modal feature fusion network, the method further includes:
connecting a preset first self-attention module, a preset first cross-attention module and a preset first forward propagation module in series to obtain an image processing sub-network;
connecting a preset second self-attention module, a preset second cross-attention module and a preset second forward propagation module in series to obtain a text processing sub-network;
performing a tandem process on the first self-attention module and the second cross-attention module, and performing a tandem process on the second self-attention module and the first cross-attention module;
and taking the serially connected image processing sub-network and text processing sub-network as a modal fusion sub-network, and stacking the modal fusion sub-network a preset number of times to obtain the multi-modal feature fusion network.
Optionally, the performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features includes:
performing word segmentation processing and vectorization processing on the question to be predicted and the target reference question and answer respectively to obtain vectorization information;
extracting text features in the vectorization information by using a preset text encoder;
performing feature interaction on the text features and the sample image features by using the multi-modal feature fusion network to obtain standard text features and standard image features;
and performing feature fusion on the standard text features and the standard image features according to a preset channel dimension to obtain the fusion features.
Optionally, the performing feature prediction on the fusion feature to obtain a prediction result includes:
performing feature prediction on the fusion features by using a preset number of fully connected layers to obtain a prediction label;
and taking the prediction answer corresponding to the prediction label as the prediction result.
In order to solve the above problems, the present invention also provides a visual question answering apparatus, comprising:
the reference data construction module is used for acquiring an original reference data set containing a reference image and a reference question and answer, extracting the reference image characteristics of the reference image, and constructing a standard reference data set based on the reference image characteristics and the reference question and answer;
the feature matching module is used for acquiring visual data to be predicted, which comprises an image to be predicted and a question to be predicted, extracting sample image features of the image to be predicted, and matching a target reference question and answer in the standard reference data set based on the sample image features;
the feature fusion module is used for performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and the feature prediction module is used for performing feature prediction on the fusion features to obtain a prediction result.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
and the processor executes the computer program stored in the memory to realize the visual question answering method.
In order to solve the above problem, the present invention also provides a computer-readable storage medium, in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the visual question-answering method described above.
According to the invention, the reference image features of the reference image are extracted and the standard reference data set is constructed based on the reference image features and the reference questions and answers, so that the information of similar samples in the standard reference data set can be fully utilized in the prediction process; the target reference question and answer is feature-fused and merged into the prediction process by the pre-constructed multi-modal feature fusion network, which improves the precision and accuracy of prediction. Moreover, introducing the multi-modal feature fusion network enhances the interaction between the two different modalities of text features and image features, so that both kinds of modal information are fully utilized for prediction, further improving the precision and accuracy. Therefore, the visual question answering method, the visual question answering device, the electronic equipment and the computer-readable storage medium of the invention can improve the accuracy of the prediction result in visual question answering.
Drawings
FIG. 1 is a schematic flow chart of a visual question answering method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a visual question answering device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device for implementing the visual question answering method according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a visual question answering method. The execution subject of the visual question answering method includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiment of the invention. In other words, the visual question answering method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to: a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a visual question answering method according to an embodiment of the present invention is shown.
In this embodiment, the visual question answering method includes the following steps S1 to S4:
s1, obtaining an original reference data set containing a reference image and a reference question and answer, extracting reference image features of the reference image, and constructing a standard reference data set based on the reference image features and the reference question and answer.
In the embodiment of the present invention, the original reference data set is the reference data used to assist question answering in visual question answering. For example, in the financial field, the original reference data set includes product images of different funds, insurance products and the like, together with the question-and-answer texts corresponding to each product image, such as the product image of product XX1 and its corresponding questions and answers (product introduction, usage, etc.).
In detail, the extracting reference image features of the reference image, and constructing a standard reference data set based on the reference image features and the reference question and answer include:
extracting reference image characteristics of a reference image in the original reference data set by using a preset image encoder, and associating the reference image characteristics with reference questions and answers corresponding to the reference image;
and summarizing all the associated reference image features and the reference questions and answers to construct the standard reference data set.
In an alternative embodiment of the present invention, the image encoder may be a deep learning network, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO or SSD. Meanwhile, key-value pairs (Key-Value) or similar methods are used to associate the reference image features with the reference questions and answers corresponding to the reference images; for example, the Key is the image features and the Value is the reference questions and answers.
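As a concrete illustration of the above, the following is a minimal Python sketch, assuming a torchvision ResNet-50 backbone stands in for the preset image encoder (the patent equally allows R-CNN-family detectors, YOLO or SSD) and an in-memory list stands in for the Key-Value store; the function names build_reference_set and extract_features are illustrative, not taken from the patent:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in for the "preset image encoder": ResNet-50 with its classification
# head removed, so it outputs one feature vector per image.
encoder = torch.nn.Sequential(
    *list(models.resnet50(weights="DEFAULT").children())[:-1]
)
encoder.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path: str) -> torch.Tensor:
    """Encode one image into a flat feature vector (used as the Key)."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return encoder(image).flatten()

def build_reference_set(original_reference_set):
    """Associate each reference image feature (Key) with its reference
    question-and-answer text (Value) to form the standard reference data set."""
    return [
        {"key": extract_features(image_path), "value": qa_text}
        for image_path, qa_text in original_reference_set
    ]
```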
S2, acquiring visual data to be predicted containing an image to be predicted and a question to be predicted, extracting sample image features of the image to be predicted, and matching target reference questions and answers in the standard reference data set based on the sample image features.
In the embodiment of the invention, the visual data to be predicted refers to the data to be queried that a user inputs for visual question answering, for example, a product image of a certain product uploaded by the user on an intelligent customer service page in the financial field, together with a question entered about that product image.
In detail, the extracting sample image features of the image to be predicted, and matching the target reference question answer in the standard reference data set based on the sample image features includes:
extracting sample image features of the image to be predicted by using the image encoder;
calculating the similarity of the sample image features and the reference image features in the standard reference data set one by one;
and determining the reference image features with the similarity greater than a preset similarity threshold as matched target image features, and taking the reference question answers corresponding to the target image features as the target reference question answers.
In an optional embodiment of the invention, during visual question answering prediction, feature extraction is first performed on the image to be predicted, and the extracted features are used to search the standard reference data set; that is, the sample image features are compared with the reference image features in the data set and the similarity is calculated, and the question-and-answer texts corresponding to the reference images whose similarity is greater than a preset similarity threshold (e.g., 0.7) are taken as the target reference questions and answers.
In an optional embodiment of the present invention, the similarity between the sample image feature and the reference image feature in the standard reference data set is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

where cos(A, B) represents the similarity between the sample image feature A and the reference image feature B, n represents the feature dimension, a_i represents the i-th component of the feature vector of the sample image feature A, and b_i represents the i-th component of the feature vector of the reference image feature B.
In the embodiment of the invention, the sample image features are matched with the target reference question answers in the standard reference data set, so that the similar data in the standard reference data set can be fully utilized, and the accuracy of the visual question answers is improved.
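The matching step described above can be sketched as follows, reusing the standard reference set structure from the previous sketch; the 0.7 threshold is the example value given in this embodiment, and torch.nn.functional.cosine_similarity implements the cosine formula:

```python
import torch
import torch.nn.functional as F

SIMILARITY_THRESHOLD = 0.7  # preset similarity threshold from this embodiment

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """cos(A, B) = sum(a_i * b_i) / (||A|| * ||B||), as in the formula above."""
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

def match_target_reference_qa(sample_features, standard_reference_set):
    """Compare the sample image features with every reference image feature
    one by one and collect the Q&A texts whose similarity exceeds the threshold."""
    return [
        entry["value"]
        for entry in standard_reference_set
        if cosine_similarity(sample_features, entry["key"]) > SIMILARITY_THRESHOLD
    ]
```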
And S3, performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features.
In the embodiment of the invention, the pre-constructed multi-modal feature fusion network is used for feature fusion of the image features and the text features, so that the hidden relation between the image features and the text features is better mined.
In detail, before the feature fusion is performed on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network, the method further includes:
connecting a preset first self-attention module, a preset first cross-attention module and a preset first forward propagation module in series to obtain an image processing sub-network;
connecting a preset second self-attention module, a preset second cross-attention module and a preset second forward propagation module in series to obtain a text processing sub-network;
performing a tandem process on the first self-attention module and the second cross-attention module, and performing a tandem process on the second self-attention module and the first cross-attention module;
and taking the serially connected image processing sub-network and text processing sub-network as a modal fusion sub-network, and stacking the modal fusion sub-network a preset number of times to obtain the multi-modal feature fusion network.
In an optional embodiment of the invention, the image features and the text features each first pass through a self-attention module; a cross-attention mechanism is then used to realize the interaction between the two different modal features of image and text; finally, a forward propagation module produces the output. The modal fusion sub-networks are stacked M times (M = 6) to form the final multi-modal feature fusion network, so that feature interaction and fusion are performed better.
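One plausible PyTorch reading of this construction is sketched below; the embedding width, the head count, and the omission of residual connections and layer normalization (which a production implementation would likely add) are simplifying assumptions:

```python
import torch
import torch.nn as nn

class ModalFusionBlock(nn.Module):
    """One modal fusion sub-network: parallel image and text branches, each
    built as self-attention -> cross-attention -> forward propagation, where
    each branch's cross-attention attends to the other branch's
    self-attention output."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img, txt):
        img_s, _ = self.img_self(img, img, img)  # first self-attention module
        txt_s, _ = self.txt_self(txt, txt, txt)  # second self-attention module
        # Each cross-attention takes the other modality's self-attention
        # output as key/value, realising the cross-modal interaction.
        img_c, _ = self.img_cross(img_s, txt_s, txt_s)
        txt_c, _ = self.txt_cross(txt_s, img_s, img_s)
        return self.img_ffn(img_c), self.txt_ffn(txt_c)

class MultiModalFusionNetwork(nn.Module):
    """Stack the modal fusion sub-network a preset number of times
    (M = 6 in this embodiment)."""

    def __init__(self, dim: int = 512, heads: int = 8, depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(ModalFusionBlock(dim, heads) for _ in range(depth))

    def forward(self, img, txt):
        for block in self.blocks:
            img, txt = block(img, txt)
        return img, txt  # standard image features, standard text features
```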
In detail, the performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features includes:
performing word segmentation processing and vectorization processing on the question to be predicted and the target reference question and answer respectively to obtain vectorization information;
extracting text features in the vectorization information by using a preset text encoder;
performing feature interaction on the text features and the sample image features by using the multi-modal feature fusion network to obtain standard text features and standard image features;
and performing feature fusion on the standard text features and the standard image features according to a preset channel dimension to obtain fusion features.
In an optional embodiment of the present invention, the preset text encoder may be a Bert model. The target reference question and answer and the question to be predicted are each segmented into words; the words of the text are then embedded, i.e., mapped into a word-vector form that is easy for a computer to process, with each word represented by a 512-dimensional word vector, to obtain the word embedding of the question text. The text encoder then extracts features to obtain the text features, the multi-modal feature fusion network performs feature interaction on the text features and the sample image features, and the two features output by the multi-modal feature fusion network are spliced along the channel dimension to obtain the fusion features.
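A minimal sketch of the text branch and the channel-dimension fusion, assuming the Hugging Face transformers implementation of BERT as the preset text encoder; the bert-base-chinese checkpoint and the mean-pooling before concatenation are assumptions, and bert-base uses a 768-dimensional hidden size where this embodiment specifies 512-dimensional word vectors:

```python
import torch
from transformers import BertModel, BertTokenizer

# Stand-in for the preset text encoder; the checkpoint name is illustrative.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_question_text(question, target_reference_qa):
    """Word-segment and vectorise the question to be predicted together with
    the matched target reference Q&A, then extract the text features."""
    combined = question + " " + " ".join(target_reference_qa)
    tokens = tokenizer(combined, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state  # (1, seq_len, 768)

def fuse_features(standard_text, standard_image):
    """Splice the two feature maps output by the multi-modal feature fusion
    network along the channel dimension; each map is pooled over its token
    dimension first so the two sequences need not be the same length."""
    text_vec = standard_text.mean(dim=1)    # (1, text_channels)
    image_vec = standard_image.mean(dim=1)  # (1, image_channels)
    return torch.cat([text_vec, image_vec], dim=-1)
```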
In the embodiment of the invention, the interaction between two different modal information of text characteristics and image characteristics is enhanced by introducing the multi-modal characteristic fusion network, so that the two modal information is fully utilized for prediction, and the prediction precision and accuracy are improved.
And S4, performing feature prediction on the fusion features to obtain a prediction result.
In detail, the performing feature prediction on the fusion features to obtain a prediction result includes:
performing feature prediction on the fusion features by using a preset number of fully connected layers to obtain a prediction label;
and taking the prediction answer corresponding to the prediction label as the prediction result.
In an optional embodiment of the present invention, the fusion features may be predicted and output through two fully connected layers to obtain a final prediction label, and the prediction answer text corresponding to the prediction label is used as a final prediction result, for example, if the prediction label is "1", the answer text corresponding to "1" is used as the prediction result.
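A sketch of this prediction head under stated assumptions: the hidden width, the number of answer labels, and the label-to-answer lookup table are illustrative, and the fusion dimension of 2816 simply continues the 768 + 2048 example from the earlier sketches:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Two fully connected layers mapping fusion features to answer labels."""

    def __init__(self, fusion_dim: int = 2816, hidden_dim: int = 512, num_answers: int = 1000):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(fusion_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, fusion_features: torch.Tensor) -> torch.Tensor:
        return self.classifier(fusion_features)

# Usage: the highest-scoring label indexes a label-to-answer table, e.g.
# label 1 maps to the answer text registered for "1", as in the example above.
answer_table = {0: "...", 1: "..."}  # hypothetical label-to-answer lookup
head = PredictionHead()
logits = head(torch.randn(1, 2816))  # fusion features from the previous step
predicted_label = logits.argmax(dim=-1).item()
prediction_result = answer_table.get(predicted_label, "unknown")
```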
According to the invention, the reference image features of the reference image are extracted and the standard reference data set is constructed based on the reference image features and the reference questions and answers, so that the information of similar samples in the standard reference data set can be fully utilized in the prediction process; the target reference question and answer is feature-fused and merged into the prediction process by the pre-constructed multi-modal feature fusion network, which improves the precision and accuracy of prediction. Moreover, introducing the multi-modal feature fusion network enhances the interaction between the two different modalities of text features and image features, so that both kinds of modal information are fully utilized for prediction, further improving the precision and accuracy. Therefore, the visual question answering method provided by the invention can improve the accuracy of the prediction result in visual question answering.
Fig. 2 is a functional block diagram of a visual question answering device according to an embodiment of the present invention.
The visual question-answering apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the visual question answering device 100 can include a reference data construction module 101, a feature matching module 102, a feature fusion module 103 and a feature prediction module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the reference data construction module 101 is configured to obtain an original reference data set including a reference image and a reference question and answer, extract a reference image feature of the reference image, and construct a standard reference data set based on the reference image feature and the reference question and answer;
the feature matching module 102 is configured to acquire visual data to be predicted, which includes an image to be predicted and a question to be predicted, extract sample image features of the image to be predicted, and match target reference questions and answers in the standard reference data set based on the sample image features;
the feature fusion module 103 is configured to perform feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
the feature prediction module 104 is configured to perform feature prediction on the fusion features to obtain a prediction result.
In detail, the specific implementation of the modules of the visual question answering device 100 is as follows:
the method comprises the steps of firstly, obtaining an original reference data set containing reference images and reference questions and answers, extracting reference image features of the reference images, and constructing a standard reference data set based on the reference image features and the reference questions and answers.
In the embodiment of the present invention, the original reference data set is the reference data used to assist question answering in visual question answering. For example, in the financial field, the original reference data set includes product images of different funds, insurance products and the like, together with the question-and-answer texts corresponding to each product image, such as the product image of product XX1 and its corresponding questions and answers (product introduction, usage, etc.).
In detail, the extracting reference image features of the reference image, and constructing a standard reference data set based on the reference image features and the reference question and answer include:
extracting reference image characteristics of a reference image in the original reference data set by using a preset image encoder, and associating the reference image characteristics with reference questions and answers corresponding to the reference image;
and summarizing all the associated reference image features and the reference questions and answers to construct the standard reference data set.
In an alternative embodiment of the present invention, the image encoder may be a deep learning network, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO or SSD. Meanwhile, key-value pairs (Key-Value) or similar methods are used to associate the reference image features with the reference questions and answers corresponding to the reference images; for example, the Key is the image features and the Value is the reference questions and answers.
And secondly, acquiring visual data to be predicted, which comprises an image to be predicted and a question to be predicted, extracting sample image features of the image to be predicted, and matching a target reference question and answer in the standard reference data set based on the sample image features.
In the embodiment of the invention, the visual data to be predicted refers to the data to be queried that a user inputs for visual question answering, for example, a product image of a certain product uploaded by the user on an intelligent customer service page in the financial field, together with a question entered about that product image.
In detail, the extracting sample image features of the image to be predicted, and matching the target reference question and answer in the standard reference data set based on the sample image features, includes:
extracting sample image features of the image to be predicted by using the image encoder;
calculating the similarity of the sample image features and the reference image features in the standard reference data set one by one;
and determining the reference image features with the similarity larger than a preset similarity threshold as matched target image features, and taking the reference question answers corresponding to the target image features as the target reference question answers.
In an optional embodiment of the invention, during visual question answering prediction, feature extraction is first performed on the image to be predicted, and the extracted features are used to search the standard reference data set; that is, the sample image features are compared with the reference image features in the data set and the similarity is calculated, and the question-and-answer texts corresponding to the reference images whose similarity is greater than a preset similarity threshold (e.g., 0.7) are taken as the target reference questions and answers.
In an optional embodiment of the present invention, the similarity between the sample image feature and the reference image feature in the standard reference data set is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

where cos(A, B) represents the similarity between the sample image feature A and the reference image feature B, n represents the feature dimension, a_i represents the i-th component of the feature vector of the sample image feature A, and b_i represents the i-th component of the feature vector of the reference image feature B.
In the embodiment of the invention, the sample image features are matched with the target reference question answers in the standard reference data set, so that the similar data in the standard reference data set can be fully utilized, and the accuracy of the visual question answers is improved.
And thirdly, performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features.
In the embodiment of the invention, the pre-constructed multi-modal feature fusion network is used for carrying out feature fusion on the image features and the text features, so that the hidden relation between the image features and the text features is better mined.
In detail, before the feature fusion is performed on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network, the method further includes:
connecting a preset first self-attention module, a preset first cross-attention module and a preset first forward propagation module in series to obtain an image processing sub-network;
connecting a preset second self-attention module, a preset second cross-attention module and a preset second forward propagation module in series to obtain a text processing sub-network;
performing a tandem process on the first self-attention module and the second cross-attention module, and performing a tandem process on the second self-attention module and the first cross-attention module;
and taking the serially connected image processing sub-network and text processing sub-network as a modal fusion sub-network, and stacking the modal fusion sub-network a preset number of times to obtain the multi-modal feature fusion network.
In an optional embodiment of the invention, the image features and the text features each first pass through a self-attention module; a cross-attention mechanism is then used to realize the interaction between the image features and the text features; finally, a forward propagation module produces the output. The modal fusion sub-networks are stacked M times (M = 6) to form the final multi-modal feature fusion network, so that feature interaction and fusion are performed better.
In detail, the performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features includes:
performing word segmentation processing and vectorization processing on the question to be predicted and the target reference question and answer respectively to obtain vectorization information;
extracting text features in the vectorization information by using a preset text encoder;
performing feature interaction on the text features and the sample image features by using the multi-modal feature fusion network to obtain standard text features and standard image features;
and performing feature fusion on the standard text features and the standard image features according to a preset channel dimension to obtain the fusion features.
In an optional embodiment of the present invention, the preset text encoder may be a Bert model. The target reference question and answer and the question to be predicted are each segmented into words; the words of the text are then embedded, i.e., mapped into a word-vector form that is easy for a computer to process, with each word represented by a 512-dimensional word vector, to obtain the word embedding of the question text. The text encoder then extracts features to obtain the text features, the multi-modal feature fusion network performs feature interaction on the text features and the sample image features, and the two features output by the multi-modal feature fusion network are spliced along the channel dimension to obtain the fusion features.
In the embodiment of the invention, the interaction between two different modal information of text characteristics and image characteristics is enhanced by introducing the multi-modal characteristic fusion network, so that the two modal information is fully utilized for prediction, and the prediction precision and accuracy are improved.
And fourthly, performing feature prediction on the fusion features to obtain a prediction result.
In detail, the performing feature prediction on the fusion features to obtain a prediction result includes:
performing feature prediction on the fusion features by using a preset number of fully connected layers to obtain a prediction label;
and taking the prediction answer corresponding to the prediction label as the prediction result.
In an optional embodiment of the present invention, the fusion features may be predicted and output through two fully connected layers to obtain a final prediction label, and the prediction answer text corresponding to the prediction label is used as a final prediction result, for example, if the prediction label is "1", the answer text corresponding to "1" is used as the prediction result.
According to the invention, the reference image features of the reference image are extracted and the standard reference data set is constructed based on the reference image features and the reference questions and answers, so that the information of similar samples in the standard reference data set can be fully utilized in the prediction process; the target reference question and answer is feature-fused and merged into the prediction process by the pre-constructed multi-modal feature fusion network, which improves the precision and accuracy of prediction. Moreover, introducing the multi-modal feature fusion network enhances the interaction between the two different modalities of text features and image features, so that both kinds of modal information are fully utilized for prediction, further improving the precision and accuracy. Therefore, the visual question answering device provided by the invention can improve the accuracy of the prediction result in visual question answering.
Fig. 3 is a schematic structural diagram of an electronic device implementing the visual question-answering method according to an embodiment of the present invention.
The electronic device may include a processor 10, a memory 11, a communication interface 12 and a bus 13, and may further include a computer program, such as a visual question and answer program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as the code of a visual question answering program, but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules (e.g., a visual question and answer program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The communication interface 12 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 13 may be divided into an address bus, a data bus, a control bus, etc. The bus 13 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 3 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device stores a visual question and answer program that is a combination of instructions that, when executed in the processor 10, may implement:
acquiring an original reference data set containing reference images and reference questions and answers, extracting reference image features of the reference images, and constructing a standard reference data set based on the reference image features and the reference questions and answers;
acquiring visual data to be predicted, which comprises a to-be-predicted image and a to-be-predicted problem, extracting sample image features of the to-be-predicted image, and matching target reference questions and answers in the standard reference data set based on the sample image features;
performing feature fusion on the problem to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and performing feature prediction on the fusion features to obtain a prediction result.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to the drawing, and is not repeated here.
Further, if the integrated module/unit of the electronic device is implemented in the form of a software functional unit and sold or used as a separate product, it may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, implements:
acquiring an original reference data set containing a reference image and a reference question and answer, extracting reference image characteristics of the reference image, and constructing a standard reference data set based on the reference image characteristics and the reference question and answer;
acquiring visual data to be predicted, which comprises a to-be-predicted image and a to-be-predicted problem, extracting sample image features of the to-be-predicted image, and matching target reference questions and answers in the standard reference data set based on the sample image features;
performing feature fusion on the problem to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and performing feature prediction on the fusion features to obtain a prediction result.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the invention can acquire and process related data based on an artificial intelligence technology. Among them, artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the same, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A visual question-answering method, comprising:
acquiring an original reference data set containing reference images and reference questions and answers, extracting reference image features of the reference images, and constructing a standard reference data set based on the reference image features and the reference questions and answers;
acquiring visual data to be predicted, which comprises an image to be predicted and a question to be predicted, extracting sample image features of the image to be predicted, and matching target reference questions and answers in the standard reference data set based on the sample image features;
performing feature fusion on the question to be predicted, the sample image features and the target reference question and answer by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and performing feature prediction on the fusion features to obtain a prediction result.
2. The visual question-answering method of claim 1, wherein the extracting of the reference image features of the reference image and the constructing of the standard reference data set based on the reference image features and the reference question-answering comprises:
extracting reference image characteristics of a reference image in the original reference data set by using a preset image encoder, and associating the reference image characteristics with reference questions and answers corresponding to the reference image;
and summarizing all the associated reference image features and the reference questions and answers to construct the standard reference data set.
3. The visual question-answering method according to claim 1, wherein the extracting sample image features of the image to be predicted, and matching target reference questions and answers in the standard reference data set based on the sample image features, comprises:
extracting sample image features of the image to be predicted by using the image encoder;
calculating the similarity of the sample image features and the reference image features in the standard reference data set one by one;
and determining the reference image features with the similarity greater than a preset similarity threshold as matched target image features, and taking the reference question answers corresponding to the target image features as the target reference question answers.
4. The visual question-answering method of claim 3, wherein the similarity of the sample image features to the reference image features in the standard reference data set is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

wherein cos(A, B) represents the similarity of the sample image feature A to the reference image feature B, n represents the feature dimension, a_i represents the i-th component of the feature vector of the sample image feature A, and b_i represents the i-th component of the feature vector of the reference image feature B.
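For concreteness, a quick numerical check of this formula with arbitrarily chosen three-dimensional features:

```python
import numpy as np

A = np.array([1.0, 2.0, 2.0])   # sample image feature
B = np.array([2.0, 1.0, 2.0])   # reference image feature
# numerator: 1*2 + 2*1 + 2*2 = 8; both norms equal 3
cos_ab = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_ab)                   # 0.888... = 8/9
```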
5. The visual question-answering method of claim 1, wherein before the feature fusion is performed on the question to be predicted, the sample image features, and the target reference question-answer pairs by using the pre-constructed multi-modal feature fusion network, the method further comprises:
connecting a preset first self-attention module, a preset first cross-attention module, and a preset first forward propagation module in series to obtain an image processing sub-network;
connecting a preset second self-attention module, a preset second cross-attention module, and a preset second forward propagation module in series to obtain a text processing sub-network;
connecting the first self-attention module in series with the second cross-attention module, and connecting the second self-attention module in series with the first cross-attention module;
and taking the image processing sub-network and the text processing sub-network connected in this way as a modality fusion sub-network, and stacking the modality fusion sub-network a preset number of times to obtain the multi-modal feature fusion network.
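A minimal PyTorch sketch of one such modality fusion sub-network, under the interpretation that each branch's self-attention output serves as key and value for the other branch's cross-attention; the embedding size, head count, stack depth, and the omission of residual connections are illustrative choices, not prescribed by the claim:

```python
import torch
import torch.nn as nn

class ModalityFusionBlock(nn.Module):
    """One modality fusion sub-network: each branch applies self-attention,
    then cross-attention over the other branch's self-attended features,
    then a forward propagation (feed-forward) module."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Linear(dim * 4, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Linear(dim * 4, dim))

    def forward(self, img, txt):
        img_sa, _ = self.img_self(img, img, img)   # first self-attention
        txt_sa, _ = self.txt_self(txt, txt, txt)   # second self-attention
        # First cross-attention attends to the second self-attention's
        # output, and vice versa, realizing the series connections above.
        img_ca, _ = self.img_cross(img_sa, txt_sa, txt_sa)
        txt_ca, _ = self.txt_cross(txt_sa, img_sa, img_sa)
        return self.img_ffn(img_ca), self.txt_ffn(txt_ca)

# Stacking the block a preset number of times yields the fusion network.
fusion_net = nn.ModuleList(ModalityFusionBlock() for _ in range(6))
```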
6. The visual question-answering method of claim 1, wherein the performing feature fusion on the question to be predicted, the sample image features, and the target reference question-answer pairs by using the pre-constructed multi-modal feature fusion network to obtain the fusion features comprises:
performing word segmentation (tokenization) and vectorization on the question to be predicted and the target reference question-answer pairs, respectively, to obtain vectorized information;
extracting text features from the vectorized information by using a preset text encoder;
performing feature interaction between the text features and the sample image features by using the multi-modal feature fusion network to obtain standard text features and standard image features;
and performing feature fusion on the standard text features and the standard image features along a preset channel dimension to obtain the fusion features.
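Continuing the sketch (and reusing fusion_net from the claim 5 sketch), the fusion step might look as follows; the toy tokenizer, embedding table, and the mean-pooling before the channel-dimension concatenation are all illustrative assumptions rather than the patent's prescribed encoder:

```python
import torch
import torch.nn as nn

# Toy whitespace tokenizer and embedding table standing in for the
# preset text encoder; a real system would use a trained tokenizer/encoder.
vocab = {"<unk>": 0, "what": 1, "color": 2, "is": 3, "the": 4,
         "cat": 5, "black": 6}
embed = nn.Embedding(len(vocab), 256)

def vectorize(text):
    ids = [vocab.get(tok, 0) for tok in text.lower().split()]
    return embed(torch.tensor(ids)).unsqueeze(0)   # (1, seq_len, 256)

question = vectorize("what color is the cat")
reference_qa = vectorize("the cat is black")
text_feats = torch.cat([question, reference_qa], dim=1)  # one text stream

image_feats = torch.randn(1, 49, 256)   # stand-in sample image features
for block in fusion_net:                # feature interaction (claim 5 sketch)
    image_feats, text_feats = block(image_feats, text_feats)

# Pool each modality and concatenate along the channel dimension.
fused = torch.cat([image_feats.mean(dim=1), text_feats.mean(dim=1)], dim=-1)
print(fused.shape)                      # torch.Size([1, 512])
```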
7. The visual question-answering method of claim 1, wherein the performing feature prediction on the fusion features to obtain the prediction result comprises:
performing feature prediction on the fusion features by using a preset number of fully-connected layers to obtain a prediction label;
and taking the predicted answer corresponding to the prediction label as the prediction result.
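A sketch of the prediction head, treating answer prediction as classification over a fixed answer vocabulary; the layer sizes, depth, and answer table are assumptions, and `fused` is the tensor from the claim 6 sketch:

```python
import torch
import torch.nn as nn

answers = ["black", "white", "two", "yes"]   # hypothetical answer table
prediction_head = nn.Sequential(             # preset number of FC layers
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, len(answers)),
)

logits = prediction_head(fused)              # fused features from claim 6
label = logits.argmax(dim=-1).item()         # prediction label
prediction_result = answers[label]           # answer taken as the result
```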
8. A visual question-answering apparatus, wherein the apparatus comprises:
a reference data construction module, configured to acquire an original reference data set containing reference images and reference question-answer pairs, extract reference image features of the reference images, and construct a standard reference data set based on the reference image features and the reference question-answer pairs;
a feature matching module, configured to acquire visual data to be predicted, which comprises an image to be predicted and a question to be predicted, extract sample image features of the image to be predicted, and match target reference question-answer pairs in the standard reference data set based on the sample image features;
a feature fusion module, configured to perform feature fusion on the question to be predicted, the sample image features, and the target reference question-answer pairs by using a pre-constructed multi-modal feature fusion network to obtain fusion features;
and a feature prediction module, configured to perform feature prediction on the fusion features to obtain a prediction result.
9. An electronic device, wherein the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the visual question-answering method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the visual question-answering method according to any one of claims 1 to 7.
CN202211080483.1A 2022-09-05 2022-09-05 Visual question answering method, device, equipment and storage medium Pending CN115346095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211080483.1A CN115346095A (en) 2022-09-05 2022-09-05 Visual question answering method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115346095A true CN115346095A (en) 2022-11-15

Family

ID=83956853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211080483.1A Pending CN115346095A (en) 2022-09-05 2022-09-05 Visual question answering method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115346095A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994212A (en) * 2023-03-15 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Visual question-answering processing method, visual question-answering model training method and device
CN115994212B (en) * 2023-03-15 2023-08-01 阿里巴巴达摩院(杭州)科技有限公司 Visual question-answering processing method, visual question-answering model training method and device

Similar Documents

Publication Publication Date Title
CN112860848B (en) Information retrieval method, device, equipment and medium
CN112396005A (en) Biological characteristic image recognition method and device, electronic equipment and readable storage medium
CN113821622B (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN112988963A (en) User intention prediction method, device, equipment and medium based on multi-process node
CN114781402A (en) Method and device for identifying inquiry intention, electronic equipment and readable storage medium
CN114511038A (en) False news detection method and device, electronic equipment and readable storage medium
CN113887941A (en) Business process generation method and device, electronic equipment and medium
CN114461777A (en) Intelligent question and answer method, device, equipment and storage medium
CN113378970A (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN114416939A (en) Intelligent question and answer method, device, equipment and storage medium
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN112269875A (en) Text classification method and device, electronic equipment and storage medium
CN115238670A (en) Information text extraction method, device, equipment and storage medium
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN114595321A (en) Question marking method and device, electronic equipment and storage medium
CN115205225A (en) Training method, device and equipment of medical image recognition model and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN113344125A (en) Long text matching identification method and device, electronic equipment and storage medium
CN113658002A (en) Decision tree-based transaction result generation method and device, electronic equipment and medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN112560855A (en) Image information extraction method and device, electronic equipment and storage medium
CN116628162A (en) Semantic question-answering method, device, equipment and storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination