CN113761153A - Question and answer processing method and device based on picture, readable medium and electronic equipment - Google Patents

Question and answer processing method and device based on picture, readable medium and electronic equipment

Info

Publication number
CN113761153A
CN113761153A (application CN202110548159.7A)
Authority
CN
China
Prior art keywords: feature, sample, question, text, matrix
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110548159.7A
Other languages
Chinese (zh)
Other versions
CN113761153B (en)
Inventor
彭博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110548159.7A
Publication of CN113761153A
Application granted
Publication of CN113761153B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 16/35: Clustering; Classification
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a question and answer processing method and device based on pictures, a readable medium and electronic equipment. The question and answer processing method based on the picture comprises the following steps: acquiring a target picture and a question sentence corresponding to the target picture; performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence; generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and combining the second text feature and the second image feature to obtain a joint feature, and generating an answer to the question sentence according to the joint feature. The technical scheme can improve the accuracy of picture-based question answering.

Description

Question and answer processing method and device based on picture, readable medium and electronic equipment
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a question and answer processing method and device based on pictures, a readable medium, and an electronic device.
Background
Visual Question Answering (VQA) refers to designing a model that, given a picture, can automatically answer questions related to the picture content, typically by using an attention mechanism.
However, in related visual question-answering techniques, the attention mechanism is learned entirely by the model itself and lacks proper guidance. Such an unsupervised attention mechanism is limited by data set bias, so the picture regions the model attends to differ greatly from the regions a person would attend to; in other words, the model attends to the wrong picture regions. This misplaced attention leads to poor generalization ability, poor interpretability, and low question-answering accuracy.
Disclosure of Invention
Embodiments of the present application provide a question and answer processing method and apparatus based on an image, a readable medium, and an electronic device, so that the accuracy of picture-based question answering can be improved at least to a certain extent.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the embodiments of the present application, there is provided a question-answering processing method based on pictures, including: acquiring a target picture and a question sentence corresponding to the target picture; performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence; generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and combining the second text characteristic and the second image characteristic to obtain a joint characteristic, and generating an answer of the question sentence according to the joint characteristic.
According to an aspect of the embodiments of the present application, there is provided a question-answering processing device based on pictures, including: the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a target picture and question sentences corresponding to the target picture; the extraction unit is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the question sentence to obtain a first text feature of the question sentence; the generating unit is configured to generate a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generate a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and the merging unit is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer of the question sentence according to the joint feature.
In some embodiments of the present application, based on the foregoing scheme, the generating unit includes: the first linear transformation subunit is configured to perform linear transformation on the first text feature by using a plurality of distribution weights respectively to obtain a plurality of first feature matrices, wherein one first feature matrix corresponds to one distribution weight; the first generation subunit is configured to generate a second feature matrix corresponding to each first feature matrix based on the attention mechanism of each first feature matrix to obtain a plurality of second feature matrices; and the first splicing subunit is configured to splice the plurality of second feature matrices to obtain a spliced feature matrix, and map the spliced feature matrix into a dimension which is the same as that of the first text feature to obtain a second text feature corresponding to the first text feature.
In some embodiments of the present application, based on the foregoing scheme, the first generating subunit is configured to: similarity calculation is carried out on each first feature matrix and a transposed matrix of each first feature matrix, and attention weight factors of each first feature matrix corresponding to the transposed matrix are obtained; normalizing the attention weight factor to obtain a corresponding attention weight; and performing weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
In some embodiments of the present application, based on the foregoing scheme, the generating unit includes: a second linear transformation subunit, configured to perform linear transformation on the second text feature and the first image feature respectively by using a plurality of distribution weights to obtain a plurality of third feature matrices and a plurality of fourth feature matrices, where one of the distribution weights corresponds to one of the third feature matrices and one of the fourth feature matrices; a second generating subunit, configured to generate, based on an attention mechanism of each third feature matrix, a fifth feature matrix corresponding to a fourth feature matrix associated with the each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with the each third feature matrix is a fourth feature matrix with the same assigned weight as the fourth feature matrix corresponding to the each third feature matrix; and the second splicing subunit is configured to splice the fifth feature matrices to obtain spliced feature matrices, and map the spliced feature matrices to a dimension same as that of the first image features to obtain second image features corresponding to the first image features.
In some embodiments of the present application, based on the foregoing scheme, the second generating subunit is configured to: similarity calculation is carried out on the third feature matrixes and the transpose matrix of the associated fourth feature matrix, and attention weight factors of the third feature matrixes corresponding to the transpose matrix are obtained; normalizing the attention weight factor to obtain a corresponding attention weight; and performing weighted summation calculation on the feature points contained in the associated fourth feature matrixes by using the attention weight to obtain fifth feature matrixes corresponding to the fourth feature matrixes associated with the third feature matrixes.
In some embodiments of the present application, based on the foregoing solution, the merging unit includes: an input subunit, configured to input the joint features into a classification model, where the classification model is obtained by training according to a joint loss function, the joint loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target region in a sample picture, and the target region is a region determined in the sample picture according to a sample problem corresponding to the sample picture; and the determining subunit is configured to acquire the prediction probability of the question statement output by the classification model for each answer, and determine the answer of the question statement according to the prediction probability.
In some embodiments of the present application, based on the foregoing scheme, the determining subunit is configured to: obtaining an answer corresponding to the maximum prediction probability in the prediction probabilities; and taking the answer corresponding to the maximum prediction probability as the answer of the question statement.
In some embodiments of the present application, based on the foregoing solution, the apparatus further includes: the input unit is configured to input the sample picture and the sample question into the classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample question for each answer; the construction unit is configured to construct a first loss function according to the labeled answer of the sample question and the output probability, and construct a second loss function according to the attention weight corresponding to the target area; and the training unit is configured to construct the joint loss function according to the first loss function and the second loss function, and train the classification model based on the joint loss function to obtain a trained classification model.
In some embodiments of the present application, based on the foregoing solution, the construction unit is configured to: generating labeling probability aiming at each answer according to the labeling answers of the sample questions; carrying out logarithmic operation on the output probability of the sample question aiming at each answer to obtain an operation result aiming at each answer; and determining the first loss function according to the sum of products of the labeling probability aiming at each answer and the operation result aiming at each answer.
In some embodiments of the present application, based on the foregoing solution, the construction unit is configured to: calculating a difference value between the attention weight corresponding to the target area and a preset threshold value; and constructing the second loss function according to the calculated difference.
In some embodiments of the present application, based on the foregoing solution, the training unit is configured to: calculating the product of a preset adjustment factor and the second loss function to obtain an operation result; and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
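For illustration only, the joint loss described in the preceding paragraphs can be sketched in PyTorch roughly as follows. The framework, the function name joint_loss, the threshold tau, the adjustment factor lambda_factor, and the hinge-style handling of the attention-weight difference are assumptions made for the sake of the example, not details prescribed by the application.

```python
import torch
import torch.nn.functional as F

def joint_loss(answer_logits, label_probs, target_attention, tau=0.2, lambda_factor=1.0):
    """Illustrative sketch of the joint loss (all names are assumptions).

    answer_logits:    (batch, num_answers) raw scores output by the classification model
    label_probs:      (batch, num_answers) labeling probability for each answer
    target_attention: (batch,) attention weight assigned to the annotated target region
    tau:              preset threshold the target-region attention should reach
    lambda_factor:    preset adjustment factor weighting the second loss
    """
    # First loss: sum of (labeling probability x log of output probability) per answer,
    # negated so it can be minimized, i.e. a cross-entropy against the labeled answers.
    log_probs = F.log_softmax(answer_logits, dim=-1)
    first_loss = -(label_probs * log_probs).sum(dim=-1).mean()

    # Second loss: built from the difference between the target-region attention weight
    # and the preset threshold; a hinge on that difference is one possible construction.
    second_loss = F.relu(tau - target_attention).mean()

    # Joint loss: first loss plus the preset adjustment factor times the second loss.
    return first_loss + lambda_factor * second_loss
```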
In some embodiments of the present application, based on the foregoing solution, the input unit is configured to: performing feature extraction on the sample picture to obtain a first sample image feature of the sample picture, and performing feature extraction on the sample question to obtain a first sample text feature of the sample question; generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature; and combining the second sample text feature and the second sample image feature to obtain a combined sample feature, and inputting the combined sample feature into the classification model.
In the technical solutions provided in some embodiments of the present application, a target picture and a question sentence corresponding to the target picture may be obtained first; then, feature extraction is performed on the target picture to obtain a first image feature of the target picture, and feature extraction is performed on the question sentence to obtain a first text feature of the question sentence; further, a second text feature corresponding to the first text feature is generated based on the attention mechanism of the first text feature, and a second image feature corresponding to the first image feature is generated based on the attention mechanism of the second text feature; finally, the second text feature and the second image feature are combined to obtain a joint feature, and an answer to the question sentence is generated according to the joint feature. In this technical scheme, the internal relevance between word vectors in the first text feature is captured based on the attention mechanism to obtain the second text feature; meanwhile, the relevance between the first image feature and the second text feature is captured based on the attention mechanism to obtain the second image feature, so that the second image feature contains the relevance between the target picture and the question sentence. Consequently, in the subsequent process of generating an answer to the question sentence, a more accurate result can be obtained from the second text feature and the second image feature, which improves the accuracy of picture-based question answering.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a diagram illustrating an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 illustrates a flow diagram of a method of picture-based question-answering processing according to one embodiment of the present application;
FIG. 3 illustrates a flow diagram for generating a second text feature according to an embodiment of the present application;
FIG. 4 illustrates a flow diagram for generating a second feature matrix according to an embodiment of the present application;
FIG. 5 shows a flow diagram for generating a second image feature according to an embodiment of the application;
FIG. 6 shows a flow diagram for generating a fifth feature matrix according to an embodiment of the present application;
FIG. 7 illustrates a flow diagram for generating answers to question statements according to one embodiment of the present application;
FIG. 8 shows a schematic view of a target area according to an embodiment of the present application;
FIG. 9 shows a flow diagram of classification model training according to an embodiment of the present application;
FIG. 10 shows a flow diagram for constructing a first loss function according to an embodiment of the present application;
FIG. 11 illustrates a flow diagram for constructing a second loss function according to an embodiment of the present application;
FIG. 12 shows a flow diagram of classification model training according to an embodiment of the present application;
FIG. 13 illustrates a logic diagram of a picture-based question-answering processing method according to one embodiment of the present application;
fig. 14 is a diagram illustrating an effect comparison of a picture-based question answering processing method according to an embodiment of the present application;
FIG. 15 shows a block diagram of a picture-based question answering processing apparatus according to one embodiment of the present application;
FIG. 16 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
It is to be noted that the terms used in the specification and claims of the present application and the above-described drawings are only for describing the embodiments and are not intended to limit the scope of the present application. It will be understood that the terms "comprises," "comprising," "includes," "including," "has," "having," and the like, when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element without departing from the scope of the present invention. Similarly, a second element may be termed a first element. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should be noted that reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
With the research and development of Artificial Intelligence (AI) technology, AI is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, intelligent medical services, and intelligent customer service. It is believed that, as technology develops, artificial intelligence will be applied in ever more fields and deliver increasingly important value.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning and other directions.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specially studies how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
It should be understood that the technical scheme provided by the application can be applied to artificial-intelligence-based visual question answering scenarios. Visual question answering is an active interdisciplinary research field spanning computer vision, natural language processing, and machine learning. Given an image and a natural language question associated with the image, a visual question-answering system answers the question in natural language. Visual question answering is not only a fundamental step toward building artificial intelligence, but is also extremely important for many applications such as image retrieval, navigation for the blind, and early childhood education. It is a challenging task because it requires sophisticated computer vision techniques to understand images deeply, advanced natural language processing techniques to extract the meaning of the question, and a unified framework to integrate visual and semantic information efficiently.
Current visual question answering solutions include Deep Modular Co-Attention Networks for Visual Question Answering (MCAN) and MOVIE (Modulated Convolutions for Visual Counting and Beyond).
MCAN is a visual question answering solution proposed in 2019. First, the model processes the question and the picture separately and extracts the corresponding text features and visual features; second, the text features and visual features pass through an attention module, where the text uses a self-attention mechanism and the image uses a guided attention mechanism; finally, the text features and the image features are combined to generate an answer to the question. MOVIE is a visual question answering solution proposed in 2020. This model also extracts features for the picture and the text separately, except that the extracted text features are densely stitched directly onto each image feature before entering the attention module. The advantage of this method is that the question features can be fused with the features corresponding to every image region, which yields a clear metric improvement on counting questions.
However, current visual question answering methods have the following problems: (1) the attention mechanism is learned entirely by the model and lacks proper guidance; (2) the unsupervised attention mechanism is limited by data set bias, so the regions the model attends to differ greatly from the regions a person would attend to, i.e., the model attends to the wrong picture regions; (3) misplaced attention results in poor generalization ability and poor interpretability of the model.
Based on this, an embodiment of the present application provides a question and answer processing method based on a picture, which includes obtaining a target picture and a question sentence corresponding to the target picture, performing feature extraction on the target picture to obtain a first image feature of the target picture, performing feature extraction on the question sentence to obtain a first text feature of the question sentence, further generating a second text feature corresponding to the first text feature based on an attention mechanism of the first text feature, generating a second image feature corresponding to the first image feature based on an attention mechanism of the second text feature, and finally performing feature merging on the second text feature and the second image feature to obtain a joint feature and generating an answer to the question sentence according to the joint feature. In this technical scheme, the internal relevance between word vectors in the first text feature is captured based on the attention mechanism to obtain the second text feature; meanwhile, the relevance between the first image feature and the second text feature is captured based on the attention mechanism to obtain the second image feature, so that the second image feature contains the relevance between the target picture and the question sentence. Consequently, in the subsequent process of generating an answer to the question sentence, a more accurate result can be obtained from the second text feature and the second image feature, which improves the accuracy of picture-based question answering.
To facilitate understanding, an embodiment of the present application provides a question-answering processing method based on pictures, which is applied to the system architecture shown in fig. 1, please refer to fig. 1, where the system architecture 100 may include a terminal device 101, a network 102, a server 103, a target picture 104, and a question sentence 105 corresponding to the target picture. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include, but is not limited to: a wireless network, a wired network, including but not limited to at least one of: wide area networks, metropolitan area networks, and local area networks. The wireless network includes, but is not limited to, at least one of: bluetooth, WI-FI, Near Field Communication (NFC for short), cellular mobile Communication networks. A user may use the terminal 101 to interact with the server 103 via the network 102 to receive or send messages or the like.
The terminal device 101 may be any electronic product capable of human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or handwriting equipment, for example a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a pocket PC, a tablet computer, a smart car, a smart television, a smart speaker, and the like.
The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Those skilled in the art should understand that the terminal device 101 and the server 103 are only examples, and other existing or future terminal devices or servers may also be included in the scope of the present application, and are also included herein by reference.
It should be understood that the numbers of terminal devices 101, networks 102, and servers 103 in fig. 1 are merely illustrative. There may be any number of terminal devices 101, networks 102, and servers 103, as required by the implementation. For example, the server 103 may be a server cluster composed of a plurality of servers.
In one embodiment of the present application, a user may upload the target picture 104 and the question sentence 105 through an application on the terminal device 101, and send the target picture 104 and the question sentence 105 to the server 103 through the network 102 between the terminal device 101 and the server 103. Correspondingly, after receiving the target picture 104 and the question sentence 105, the server 103 performs feature extraction on the target picture 104 to obtain a first image feature of the target picture 104, performs feature extraction on the question sentence 105 to obtain a first text feature of the question sentence 105, then the server 103 generates a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, generates a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature, and finally the server 103 performs feature combination on the second text feature and the second image feature to obtain a combined feature, and generates an answer to the question sentence 105 according to the combined feature.
The image-based question-answering processing method provided by the embodiment of the application is generally executed by the server 103, the server 103 is used for receiving the target image 104 and the question-answering sentence 105 uploaded by the terminal device 101, and generating the answer of the question-answering sentence 105 based on the target image 104, and accordingly, an image-based question-answering processing device is generally arranged in the server 103. However, it is easily understood by those skilled in the art that the question and answer processing method based on the picture provided in the embodiment of the present application may also be executed by the terminal device 101, and accordingly, the question and answer processing device based on the picture may also be disposed in the terminal device 101, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, the terminal device 101 is configured to receive a target picture 104 and a question-answer sentence 105 uploaded by a user, and further generate an answer to the question-answer sentence 105 based on the target picture 104.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a flowchart of a picture-based question-answering processing method according to an embodiment of the present application, which may be performed by a server, which may be the server 103 shown in fig. 1. Referring to fig. 2, the question-answering processing method based on pictures at least includes the following steps:
step S210, obtaining a target picture and a question sentence corresponding to the target picture;
step S220, performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence;
step S230, generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature;
and S240, combining the second text characteristic and the second image characteristic to obtain a joint characteristic, and generating an answer of the question sentence according to the joint characteristic.
These steps are described in detail below.
Step S210, a target picture and a question sentence corresponding to the target picture are obtained.
Specifically, the server may obtain the target picture, where the target picture may be obtained through a large-scale knowledge base or a picture database, or may be obtained directly from the Internet, a blockchain, or a distributed file system, which is not limited in this embodiment of the application. Here, the target picture may be a single picture or a frame of a video.
In the embodiment of the application, the server not only obtains the target picture, but also obtains the question sentence corresponding to the target picture. In the application scenario, the server obtains the target picture and simultaneously obtains the question sentence, and the question sentence and the target picture have strong relevance.
For example, if the content of the target picture is a table with three apples on it, the question sentence corresponding to the target picture may be "How many apples are on the table?".
Step S220, feature extraction is carried out on the target picture to obtain a first image feature of the target picture, and feature extraction is carried out on the question sentence to obtain a first text feature of the question sentence.
A feature is data that can be extracted through measurement or processing. The main purpose of feature extraction is dimensionality reduction, and the main idea is to project the original image sample into a low-dimensional feature space to obtain low-dimensional image sample features that reflect the essence of the image sample or distinguish it from other samples.
Specifically, for pictures, each picture has its own features that distinguish it from other pictures. Some of these are natural features that can be perceived intuitively, such as brightness, edges, textures, and colors; others are obtained by transformation or processing, such as moments, histograms, and principal components. In the embodiment of the present application, the first image feature may be expressed as a feature vector, for example f = {x1, x2, ..., xn}. Common methods for extracting image features include: (1) the geometric method, a texture feature analysis method based on the picture texture element theory; (2) the model method, which is based on a structural model of the picture and uses the parameters of the model as texture features, such as a convolutional neural network model; (3) the signal processing method, which extracts and matches texture features mainly using the gray level co-occurrence matrix, autoregressive texture models, wavelet transforms, and the like.
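As a purely illustrative sketch of the model-based approach mentioned above (a convolutional neural network), the following PyTorch snippet extracts a grid of regional features from a picture using a pretrained ResNet-50 backbone; the specific backbone, input size, and output shapes are assumptions, not details prescribed by the application.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained CNN backbone with the classification head removed, so the output is
# a spatial grid of regional feature vectors rather than class scores.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_first_image_feature(path: str) -> torch.Tensor:
    """Return the first image feature as (num_regions, channels)."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = feature_extractor(image)                 # (1, 2048, 14, 14)
    return fmap.flatten(2).squeeze(0).transpose(0, 1)   # (196, 2048): one vector per region
```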
For the question sentence, the first text feature aims to express the text in a form that can be understood by a computer, that is, to vectorize the text. Extraction of the first text feature can likewise be realized by a corresponding text extraction algorithm model, such as an embedding network model.
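For illustration, the following is a minimal sketch of such an embedding network model in PyTorch: question tokens are embedded and encoded into one vector per word, forming the first text feature. The vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embeds question tokens and encodes them into the first text feature."""
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                # (batch, num_words)
        vectors = self.embedding(token_ids)      # (batch, num_words, embed_dim)
        states, _ = self.gru(vectors)            # (batch, num_words, hidden_dim)
        return states                            # one vector per word: the first text feature

# Usage: token_ids come from any tokenizer mapping words to vocabulary indices.
encoder = QuestionEncoder()
first_text_feature = encoder(torch.randint(1, 20000, (1, 14)))   # (1, 14, 512)
```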
Step S230, generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature.
The essence of the attention mechanism comes from the human visual attention mechanism, a brain signal processing mechanism specific to human vision: by rapidly scanning the global image, human vision identifies a target area that deserves focus, the so-called focus of attention, then devotes more attention resources to this area to obtain more detailed information about the target while suppressing other useless information. The attention mechanism may include a self-attention mechanism, which is characterized by directly calculating dependency relationships regardless of the distance between words, enabling the internal structure of a sentence to be learned.
In this embodiment of the present application, a self-attention mechanism may be applied to the first text feature, and an internal relevance of each word vector in the first text feature is captured through the self-attention mechanism, where the relevance may be an attention weight, and the attention weight is assigned to each word vector in the first text feature to generate a second text feature corresponding to the first text feature, where the second text feature includes an internal relevance between each word vector in the question sentence.
Meanwhile, after the second text feature is generated, based on the attention mechanism, an association between the second text feature and the first image feature is captured, wherein the association can be an attention weight, and the second image feature corresponding to the first image feature is obtained by distributing the attention weight to the corresponding first image feature, so that the second image feature comprises the association between the target picture and the question sentence.
And S240, combining the second text characteristic and the second image characteristic to obtain a joint characteristic, and generating an answer of the question sentence according to the joint characteristic.
In this embodiment, after the attention mechanism is applied to obtain the second text feature and the second image feature, feature splicing may be performed on the second text feature and the second image feature to obtain a joint feature. For example, if the second text feature and the second image feature are each 256-dimensional feature vectors, splicing the features end to end yields a 512-dimensional joint feature. The feature splicing may be implemented by a concat function, which is used to connect two or more arrays.
After the features are spliced, a joint feature can be obtained, so that the number of subsequent inputs is reduced. For example, when the answer to the question sentence is obtained subsequently, a joint feature may be input, which reduces the input amount compared with inputting 2 features separately.
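A minimal sketch of this splicing, assuming PyTorch tensors and the 256-dimensional example above (torch.cat plays the role of the concat function):

```python
import torch

second_text_feature = torch.randn(1, 256)    # e.g. pooled 256-dimensional text feature
second_image_feature = torch.randn(1, 256)   # e.g. pooled 256-dimensional image feature

# Splice the two features end to end into a single 512-dimensional joint feature,
# so only one tensor needs to be fed to the subsequent answer classifier.
joint_feature = torch.cat([second_text_feature, second_image_feature], dim=-1)
print(joint_feature.shape)                    # torch.Size([1, 512])
```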
After the feature combination is performed to obtain the joint feature, the question and sentence may be classified based on the joint feature to obtain an answer to the question and sentence.
In one possible implementation, the classifying the question statement based on the joint feature may be to search a preset category label library for a category label matching the joint feature based on the joint feature, and then use the matching category label as an answer to the question statement, where the matching category label may be a category label corresponding to a similarity value of the joint feature being greater than or equal to the preset similarity value.
Based on the technical scheme of this embodiment, the internal relevance between word vectors in the first text feature is captured based on the attention mechanism to obtain the second text feature; meanwhile, the relevance between the first image feature and the second text feature is captured based on the attention mechanism to obtain the second image feature, so that the second image feature contains the relevance between the target picture and the question sentence. Consequently, in the subsequent process of generating an answer to the question sentence, a more accurate result can be obtained according to the second text feature and the second image feature.
In an embodiment of the present application, as shown in fig. 3, generating the second text feature corresponding to the first text feature based on the attention mechanism of the first text feature may specifically include steps S310 to S330, which are described in detail as follows:
step S310, performing linear transformation on the first text feature by using a plurality of distribution weights, respectively, to obtain a plurality of first feature matrices, where one first feature matrix corresponds to one distribution weight.
In this embodiment, a self-attention mechanism may be applied to the first text feature, and specifically, the first text feature may be first linearly transformed by using a plurality of assignment weights, respectively, to obtain a plurality of first feature matrices, where one first feature matrix corresponds to one assignment weight.
It should be noted that the first feature matrix is a triplet (Q, K, V), where Q, K, and V represent the query, key, and value, respectively, and have the same dimension. Since the first feature matrix is a triplet, the assigned weight is also a triplet (W^Q, W^K, W^V).
For example, assuming the first text feature is X, the j-th assigned weight (W_j^Q, W_j^K, W_j^V) is used to linearly transform the first text feature X, and the resulting j-th first feature matrix (Q_j, K_j, V_j) can be expressed as:

Q_j = X · W_j^Q,  K_j = X · W_j^K,  V_j = X · W_j^V
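For illustration, assuming PyTorch, the j-th assigned weight can be realized as three linear layers applied to the first text feature; the dimensions below are arbitrary examples.

```python
import torch
import torch.nn as nn

d_model, d_head = 512, 64
X = torch.randn(1, 14, d_model)                # first text feature: 14 word vectors

# The j-th assigned weight (W_j^Q, W_j^K, W_j^V) as three linear transformations.
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)

# The j-th first feature matrix is the triplet (Q_j, K_j, V_j).
Q_j, K_j, V_j = W_q(X), W_k(X), W_v(X)         # each (1, 14, 64)
```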
step S320, generating a second feature matrix corresponding to each first feature matrix based on the attention mechanism of each first feature matrix to obtain a plurality of second feature matrices.
After the first text feature is linearly transformed to obtain each first feature matrix, a second feature matrix corresponding to each first feature matrix may further be generated based on the attention mechanism of each first feature matrix, so as to obtain a plurality of second feature matrices. The implementation of this step is similar to that of step S230 described above, and details are not repeated here.
And S330, splicing the plurality of second feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes to have the same dimensionality as the first text features to obtain second text features corresponding to the first text features.
Finally, the plurality of second feature matrices can be spliced together, and then the spliced feature matrices are mapped to the same dimensionality as the first text feature, so that the second text feature corresponding to the first text feature can be obtained.
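Continuing the illustrative sketch, the plurality of second feature matrices (one per assigned weight, produced as in steps S410 to S430 below) can be spliced and mapped back to the dimension of the first text feature:

```python
import torch
import torch.nn as nn

num_heads, d_model, d_head, seq_len = 8, 512, 64, 14

# Suppose each assigned weight has already produced its second feature matrix
# (see the attention steps S410 to S430 sketched below).
second_feature_matrices = [torch.randn(1, seq_len, d_head) for _ in range(num_heads)]

# Splice along the feature dimension, then map back to the dimension of the first text feature.
spliced = torch.cat(second_feature_matrices, dim=-1)       # (1, 14, 512)
output_projection = nn.Linear(num_heads * d_head, d_model)
second_text_feature = output_projection(spliced)           # (1, 14, 512), same dims as X
```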
Optionally, in an implementation manner, as shown in fig. 4, step S320 may specifically include step S410 to step S430, which is specifically described as follows:
in step S410, similarity calculation is performed on each first feature matrix and the transpose matrix of each first feature matrix to obtain an attention weight factor of the transpose matrix corresponding to each first feature matrix.
In the embodiment of the present application, the attention mechanism based on each first feature matrix may be understood as that each first feature matrix corresponds to its own attention weight factor, and the nature of the attention mechanism function may be described as a mapping of a query (query) to a series of (key-value) pairs.
The calculation of the second feature matrix based on the attention of each first feature matrix can be divided into three main steps. In the first step, the similarity between the query (Q in the first feature matrix) and the transpose of the key (K in the first feature matrix) is calculated to obtain an attention weight factor; common similarity functions include the dot product, splicing, a perceptron, and the like. In the second step, the attention weight factors can be normalized using a softmax function to obtain the attention weights. Finally, the attention weights are used to compute a weighted sum over the corresponding values (V in the first feature matrix) to obtain the second feature matrix.
In step S420, the attention weight factor is normalized to obtain a corresponding attention weight.
As described above, the calculation of the second feature matrix based on the attention of each first feature matrix mainly includes three steps, and after the attention weight factor is obtained in step S410, in this step, the attention weight factor normalization process may be performed using a Sigmoid function or a softmax function, which is a function of mapping a variable between [0, 1 ].
Step S430, performing weighted summation calculation on the feature points included in each first feature matrix by using the attention weight, and obtaining a second feature matrix corresponding to each first feature matrix.
And finally, carrying out weighted summation on the attention weight after the normalization processing and the feature points contained in the corresponding key value (V in the first feature matrix) to obtain a second feature matrix, wherein the second feature matrix combines the relevance among the features of the problem statement, so that a more accurate result can be obtained in the subsequent prediction or classification operation.
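The three steps S410 to S430 correspond to the following illustrative function. The dot product is used as the similarity function, and the scaling by the square root of the dimension is a common-practice assumption rather than something required by the description.

```python
import math
import torch

def single_head_attention(Q_j, K_j, V_j):
    """Steps S410 to S430 for one first feature matrix (Q_j, K_j, V_j)."""
    # S410: similarity between the query and the transpose of the key (dot product).
    weight_factors = Q_j @ K_j.transpose(-2, -1) / math.sqrt(Q_j.size(-1))
    # S420: normalize the attention weight factors into attention weights.
    weights = torch.softmax(weight_factors, dim=-1)
    # S430: weighted summation over the feature points contained in V_j.
    return weights @ V_j                        # the second feature matrix

second_feature_matrix = single_head_attention(
    torch.randn(1, 14, 64), torch.randn(1, 14, 64), torch.randn(1, 14, 64))
```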
In an embodiment of the present application, as shown in fig. 5, generating the second image feature corresponding to the first image feature based on the attention mechanism of the second text feature may specifically include steps S510 to S530, which are described in detail as follows:
step S510, performing linear transformation on the second text feature and the first image feature by using a plurality of distribution weights, respectively, to obtain a plurality of third feature matrices and a plurality of fourth feature matrices, where one distribution weight corresponds to one third feature matrix and one fourth feature matrix.
In addition to applying the self-attention mechanism to the first text feature, at the same time, the second image feature corresponding to the first image feature may be generated based on the attention mechanism of the second text feature, so that the second image feature includes the association between the target picture and the question sentence.
Specifically, the generating process of the second image feature may be to perform linear transformation on the second text feature and the first image feature by using a plurality of distribution weights, respectively, to obtain a plurality of third feature matrices and a plurality of fourth feature matrices, where one distribution weight corresponds to one third feature matrix and one fourth feature matrix.
It should be noted that the third feature matrix may be represented by Q (the query), and the fourth feature matrix by K and V (the key and the value, respectively); Q, K, and V have the same dimension, and the assigned weight may be a triplet (W^Q, W^K, W^V).
For example, assuming the second text feature is X and the first image feature is I, the j-th assigned weight (W_j^Q, W_j^K, W_j^V) is used to linearly transform the second text feature X and the first image feature I respectively. The j-th third feature matrix can be expressed as

Q_j = X · W_j^Q

and the j-th fourth feature matrix can be expressed as:

(K_j, V_j) = (I · W_j^K, I · W_j^V)
step S520, generating a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix based on the attention mechanism of each third feature matrix to obtain a plurality of fifth feature matrices, wherein the fourth feature matrix associated with each third feature matrix is the fourth feature matrix with the same distribution weight corresponding to each third feature matrix.
Further, a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix may be generated based on the attention mechanism of each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with each third feature matrix is the fourth feature matrix with the same assigned weight corresponding to each third feature matrix.
For example, based on the jth third feature matrix, a fifth feature matrix corresponding to the jth fourth feature matrix is generated. The implementation manner of this step is similar to the implementation process of step 230 described above, and the detailed description of this embodiment is omitted here.
And S530, splicing the plurality of fifth feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes to have the same dimensionality as the first image features to obtain second image features corresponding to the first image features.
Finally, the fifth feature matrices may be stitched together, and then the stitched feature matrices are mapped to the same dimension as the first image features, so that the second image features corresponding to the first image features may be obtained.
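By analogy with the self-attention sketches above, the guided attention of steps S510 to S530 can be sketched as follows. Here all heads' assigned weights are packed into single linear layers, which is equivalent to the per-head projections shown earlier; the dimensions and the number of heads are assumptions. As described above, queries come from the second text feature while keys and values come from the first image feature.

```python
import math
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Queries from the second text feature, keys/values from the first image feature."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)   # produces the third feature matrices
        self.W_k = nn.Linear(d_model, d_model)   # produces K of the fourth feature matrices
        self.W_v = nn.Linear(d_model, d_model)   # produces V of the fourth feature matrices
        self.out = nn.Linear(d_model, d_model)   # maps the spliced result back to the image-feature dimension

    def forward(self, text_feature, image_feature):
        b, n_img, d = image_feature.shape
        n_txt = text_feature.size(1)
        split = lambda t, n: t.view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        Q = split(self.W_q(text_feature), n_txt)            # (b, heads, n_txt, d_head)
        K = split(self.W_k(image_feature), n_img)
        V = split(self.W_v(image_feature), n_img)
        # Similarity, normalization, and weighted sum over the image values (fifth feature matrices).
        weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        fused = (weights @ V).transpose(1, 2).reshape(b, n_txt, d)
        return self.out(fused)                              # second image feature

second_image_feature = GuidedAttention()(torch.randn(1, 14, 512), torch.randn(1, 196, 512))
```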
Optionally, in an implementation manner, as shown in fig. 6, step S520 may specifically include step S610 to step S630, which are specifically described as follows:
in step S610, similarity calculation is performed on each third feature matrix and the transpose matrix of the associated fourth feature matrix, so as to obtain an attention weight factor of the transpose matrix corresponding to each third feature matrix.
In the embodiment of the present application, the attention mechanism based on each third feature matrix may be understood as each third feature matrix having its own attention weight factor with respect to its associated fourth feature matrix, and the nature of the attention function can be described as a mapping from a query to a series of key-value pairs.
The calculation of a fifth feature matrix based on the attention of a third feature matrix mainly includes three steps. In the first step, the similarity between the query (the third feature matrix Q) and the transpose of the key (K in the associated fourth feature matrix) is calculated to obtain an attention weight factor; common similarity functions include the dot product, splicing, a perceptron, and the like. In the second step, the attention weight factors can be normalized using a softmax function to obtain the attention weights. In the third step, the attention weights and the corresponding values (V in the associated fourth feature matrix) are combined by weighted summation to obtain the fifth feature matrix.
In step S620, the attention weight factor is normalized to obtain a corresponding attention weight.
In this step, the attention weight factor may be normalized using a Sigmoid function or a softmax function, both of which map a variable into the interval [0, 1].
Step S630, performing weighted summation calculation on the feature points included in the associated fourth feature matrices by using the attention weights, and obtaining fifth feature matrices corresponding to the fourth feature matrices associated with the respective third feature matrices.
Finally, the normalized attention weights and the feature points contained in the corresponding values (V in the fourth feature matrix) are subjected to weighted summation to obtain the fifth feature matrix. Since the fifth feature matrix incorporates the relevance between the target picture and the question sentence, a more accurate result can be obtained in the subsequent prediction or classification operation.
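For illustration, the following is a minimal NumPy sketch of the computation described in steps S510 to S630; it is not a reference implementation from this application, and the head count, feature dimensions, random projection weights and the scaling by the square root of the head dimension are assumptions made for the example.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_attention(X, I, num_heads=8, d_head=64, seed=0):
    # X: second text feature (num_words x d_x); I: first image feature (num_regions x d_i)
    rng = np.random.default_rng(seed)
    d_x, d_i = X.shape[-1], I.shape[-1]
    heads = []
    for _ in range(num_heads):                             # one set of assigned weights per head
        Wq = rng.standard_normal((d_x, d_head)) / np.sqrt(d_x)
        Wk = rng.standard_normal((d_i, d_head)) / np.sqrt(d_i)
        Wv = rng.standard_normal((d_i, d_head)) / np.sqrt(d_i)
        Q = X @ Wq                                         # j-th third feature matrix (query)
        K, V = I @ Wk, I @ Wv                              # j-th fourth feature matrix (key, value)
        scores = Q @ K.T / np.sqrt(d_head)                 # step 1: similarity (scaled dot product)
        attn = softmax(scores, axis=-1)                    # step 2: normalization -> attention weights
        heads.append(attn @ V)                             # step 3: weighted sum -> fifth feature matrix
    stitched = np.concatenate(heads, axis=-1)              # splice the fifth feature matrices
    Wo = rng.standard_normal((num_heads * d_head, d_i)) / np.sqrt(num_heads * d_head)
    return stitched @ Wo                                   # map to the dimension of I -> second image feature

X = np.random.default_rng(1).standard_normal((12, 512))    # 12 word vectors of the question
I = np.random.default_rng(2).standard_normal((36, 2048))   # 36 image region features
Y = text_guided_attention(X, I)
print(Y.shape)                                              # (12, 2048)

In this sketch each head corresponds to one set of assigned weights, the per-head outputs play the role of the fifth feature matrices, and the final projection maps the spliced matrix to the feature dimensionality of the first image feature.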
In an embodiment of the present application, the question sentence may be classified based on the joint feature by means of a classification model. As shown in fig. 7, in this embodiment, the step of generating the answer to the question sentence according to the joint feature may specifically include steps S710 to S720, which are described as follows:
step S710, inputting the combined features into a classification model, wherein the classification model is obtained by training according to a combined loss function, the combined loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target region in a sample picture, and the target region is a region determined in the sample picture according to a sample problem corresponding to the sample picture.
Specifically, after the joint features are obtained, the joint features may be input to a classification model, and classification processing may be performed by the classification model. The classification model is obtained by training according to a joint loss function, and can output prediction probabilities of question sentences for all answers according to input joint features, wherein all answers can be preset during model training, and each answer can be regarded as a category, such as red, two, men, hat, sky, animals, dancing and the like.
Before classification processing is performed by the classification model, the classification model needs to be trained according to a joint loss function. The joint loss function is constructed from two parts: first, the loss value between the output result of the classification model and the expected output result; second, the attention weight corresponding to a target region in a sample picture, where the sample picture is used for training the classification model and the target region is a region determined in the sample picture according to the sample question corresponding to the sample picture.
Optionally, in an embodiment, when the target area is determined, a keyword may be extracted from the sample question, an area matching the keyword is determined in the sample picture according to the keyword, and the matching area is used as the target area. The matching meaning may be that the similarity between the image feature information in the sample picture and the keyword is greater than a preset threshold, that is, if the similarity between the image feature information in a certain region and the keyword in the sample picture is greater than the preset threshold, the region is considered to be a region matched with the keyword.
Fig. 8 schematically shows a sample picture, where the sample picture corresponds to the sample question "whose is the scarf". According to the keywords "who" and "scarf" in the sample question, a target area corresponding to the keyword "scarf" and a target area corresponding to the keyword "who" can be determined in the sample picture.
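As a minimal sketch of this matching criterion, assume that each candidate region of the sample picture is represented by a feature vector, that the keyword is represented by an embedding in the same space, and that cosine similarity against the preset threshold is used for matching; the vectors, threshold and helper name below are hypothetical.

import numpy as np

def match_target_regions(region_feats, keyword_embedding, threshold=0.8):
    """Return indices of regions whose similarity to the keyword exceeds the preset threshold."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    k = keyword_embedding / np.linalg.norm(keyword_embedding)
    sims = r @ k                                   # cosine similarity per region
    return np.where(sims > threshold)[0]

regions = np.array([
    [1.0, 0.0, 0.0],                               # e.g. background region
    [0.0, 1.0, 0.0],                               # e.g. face region
    [0.0, 0.0, 1.0],                               # e.g. scarf region
])
keyword = np.array([0.10, 0.05, 0.99])             # hypothetical embedding of the keyword "scarf"
print(match_target_regions(regions, keyword))      # [2]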
Continuing to refer to fig. 7, in step S720, the prediction probabilities of the question sentences output by the classification model for the respective answers are obtained, and the answers to the question sentences are determined according to the prediction probabilities.
Specifically, after the joint features are input into the classification model, the classification model may output a prediction probability of the question sentence for each answer, and further, the answer of the question sentence may be determined according to the output prediction probability.
In a possible implementation manner, the answer to the question statement may be determined from the output prediction probabilities by calculating, for each answer, a prediction value from the prediction probability and a corresponding weight, and then taking the answer with the largest prediction value as the answer to the question statement. The weight corresponding to each prediction probability can be determined from practical experience.
In another possible implementation manner, the manner of determining the answer to the question statement according to the output prediction probability may also be to obtain an answer corresponding to the maximum prediction probability in the prediction probabilities, and then, take the answer corresponding to the maximum prediction probability as the answer to the question statement.
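The two answer-selection strategies described above can be sketched as follows; the answer set, probabilities and weights are illustrative values, not taken from this application.

import numpy as np

answers = ["red", "two", "men", "hat"]            # assumed preset answer set
probs   = np.array([0.10, 0.25, 0.60, 0.05])      # prediction probabilities for one question

# Strategy 1: weight each probability, then take the answer with the largest prediction value
weights = np.array([1.0, 1.0, 1.2, 1.0])          # weights chosen from practical experience
answer_weighted = answers[int(np.argmax(probs * weights))]

# Strategy 2: take the answer with the maximum prediction probability directly
answer_argmax = answers[int(np.argmax(probs))]

print(answer_weighted, answer_argmax)             # men men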
In an embodiment of the present application, fig. 9 shows a flowchart of a training method of a classification model, and as shown in fig. 9, the training method of a classification model may specifically include steps S910 to S930, which are described in detail as follows:
step S910, inputting the sample picture and the sample question into a classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample question for all answers.
In specific implementation, the classification model can be trained by acquiring sample pictures and sample problems. The sample question is a question sentence corresponding to the sample picture, and the sample picture and the sample question have strong relevance.
After the sample picture and the sample question are acquired, the sample picture and the sample question can be input into the classification model, and the classification model can output the output probability of the sample question for each answer according to the input sample picture and the input sample question.
Step S920, constructing a first loss function according to the labeled answers and the output probabilities of the sample questions, and constructing a second loss function according to the attention weights corresponding to the target regions.
After the output probabilities of the sample questions for the answers are obtained, a loss function, namely a first loss function, can be constructed according to the labeled answers and the output probabilities of the sample questions, and a second loss function can be constructed according to the attention weight corresponding to the target area.
It should be noted that the calculation method of the attention weight corresponding to the target area is similar to the calculation method of the attention weight mentioned in the above steps S510 to S530, and therefore, the detailed description is omitted.
In an embodiment of the present application, as shown in fig. 10, the step of constructing the first loss function according to the labeled answer and the output probability of the sample question may specifically include steps S1010 to S1030, which are described in detail as follows:
in step S1010, labeling probabilities for the respective answers are generated according to the labeled answers to the sample questions.
It is to be understood that after the sample picture and the sample question are input into the classification model, the classification model may output the output probabilities of the sample question for the respective answers. For example, if four answers are preset, the classification model may output the output probability that the sample question belongs to each of the four answers.
Similarly, because the sample question corresponds to the labeled answer, and the labeled answer is also one of the preset answers, the labeling probability of the sample question for each answer can be generated according to the labeled answer of the sample question.
For example, if the sample question is represented as S, the preset answers include A1, B1, C1 and D1, and the labeled answer of the sample question is B1, then the labeling probabilities of the sample question for the respective answers can be generated as 0, 1, 0 and 0, respectively.
In step S1020, a logarithm operation is performed on the output probability of the sample question for each answer, so as to obtain an operation result for each answer.
When the first loss function is constructed, in addition to generating the labeling probability of the sample question for each answer, the output probability of the sample question for each answer can be subjected to logarithm operation to obtain an operation result for each answer.
In step S1030, a first loss function is determined according to a sum of products of the labeling probabilities for the respective answers and the operation results for the respective answers.
After the labeling probabilities and the operation results are obtained, the first loss function may be determined according to the sum of the products of the labeling probability for each answer and the operation result for each answer. The first loss function Loss1 can be expressed as shown in equation (1):

Loss1 = -∑_i x_i · y_i    equation (1)

where x_i is the ith labeling probability and y_i is the ith operation result (the logarithm of the output probability for the ith answer).
For convenience of understanding, suppose the sample question is represented as S, the preset answers include A1, B1, C1 and D1, the labeled answer of the sample question is B1, so the generated labeling probabilities of the sample question for the answers are 0, 1, 0 and 0 respectively, and the output probabilities of the sample question S for the four answers output by the classification model are 0.2, 0.1, 0.5 and 0.2 respectively. The first loss can then be calculated as Loss1 = -(0×log 0.2 + 1×log 0.1 + 0×log 0.5 + 0×log 0.2) = -log 0.1, which equals 1 when the logarithm is taken to base 10.
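A sketch of equation (1) for the worked example above, using one-hot labeling probabilities x_i and the logarithm of the output probabilities as y_i; the base-10 logarithm here is an assumption matching the value -log 0.1 = 1 discussed above, and a natural logarithm would be equally common.

import numpy as np

label_probs  = np.array([0.0, 1.0, 0.0, 0.0])     # labeled answer is B1
output_probs = np.array([0.2, 0.1, 0.5, 0.2])     # output of the classification model

y = np.log10(output_probs)                        # operation result for each answer
loss1 = -np.sum(label_probs * y)                  # equation (1)
print(loss1)                                      # ≈ 1.0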
In an embodiment of the present application, as shown in fig. 11, the step of constructing the second loss function according to the attention weight corresponding to the target area may specifically include steps S1110 to S1120, which are specifically described as follows:
in step S1110, a difference between the attention weight corresponding to the target region and a preset threshold is calculated.
In this embodiment, when the second loss function is constructed according to the attention weight corresponding to the target area, a difference between the attention weight corresponding to the target area and a preset threshold may be calculated in advance, where the preset threshold may be obtained through experimental data, for example, the preset threshold is 0.8.
In step S1120, a second loss function is constructed according to the calculated difference.
Further, a second loss function may be constructed based on the calculated difference. The second loss function Loss2 can be expressed as shown in equation (2):

Loss2 = max{(p - a), 0}    equation (2)
Wherein p is a preset threshold, and a is the attention weight corresponding to the target area.
Referring to fig. 9, in step S930, a joint loss function is constructed according to the first loss function and the second loss function, and a classification model is trained based on the joint loss function, so as to obtain a trained classification model.
Further, after the first loss function and the second loss function are constructed, the computer device can construct a joint loss function from the first loss function and the second loss function, and adjust the model parameters of the classification model in the direction that minimizes the joint loss function. The joint loss function is reduced by updating the model parameters, the model parameters of the classification model are continuously optimized in this way, and the model parameters that minimize the joint loss function are determined according to the minimization principle, so that the trained classification model is obtained.
In some embodiments, the step of constructing the joint loss function according to the first loss function and the second loss function may specifically include: firstly, calculating the product of a preset adjustment factor and a second loss function to obtain an operation result; and then, adding the operation result and the first loss function to obtain an addition result, and taking the addition result as a joint loss function.
Illustratively, the expression of the joint loss function Loss may be as shown in equation (3):

Loss = Loss1 + α · Loss2    equation (3)

where α is a preset adjustment factor used to adjust the proportion of Loss2 in the overall Loss; the value of α can usually be taken as 1.
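The following sketch puts equations (1) to (3) together for a single training sample; the attention weight on the target region, the preset threshold and the adjustment factor are illustrative values.

import numpy as np

def first_loss(label_probs, output_probs):
    # equation (1): negative sum of labeling probability times log output probability
    return -np.sum(label_probs * np.log10(output_probs))

def second_loss(target_attention, p=0.8):
    # equation (2): penalize only when the attention on the target region falls below p
    return max(p - target_attention, 0.0)

def joint_loss(label_probs, output_probs, target_attention, alpha=1.0, p=0.8):
    # equation (3): Loss = Loss1 + alpha * Loss2
    return first_loss(label_probs, output_probs) + alpha * second_loss(target_attention, p)

label_probs  = np.array([0.0, 1.0, 0.0, 0.0])
output_probs = np.array([0.2, 0.1, 0.5, 0.2])
attention_on_target = 0.3                          # attention weight currently on the target region
print(joint_loss(label_probs, output_probs, attention_on_target))   # ≈ 1.0 + 1.0 * 0.5 = 1.5

In training, the model parameters would be updated in the direction that minimizes this joint loss, typically with a gradient-based optimizer in a framework that provides automatic differentiation.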
In an embodiment of the present application, the sample picture and the sample question may be processed before being input into the classification model. As shown in fig. 12, in this embodiment, the step of inputting the sample picture and the sample question into the classification model may specifically include steps S1210 to S1230, which are described in detail as follows:
step 1210, performing feature extraction on the sample picture to obtain a first sample image feature of the sample picture, and performing feature extraction on the sample question to obtain a first sample image feature of the sample question.
For pictures, each picture has its own characteristics that distinguish it from other pictures. Some of these are natural features that can be perceived intuitively, such as brightness, edges, texture and color; others are obtained by transformation or processing, such as moments, histograms and principal components. In this embodiment, the first sample image feature can be expressed as a feature vector. Methods for extracting features from the sample picture may include geometric-method feature extraction, model-method feature extraction, signal-processing-method feature extraction, and the like.
For the sample question, text vectorization may be performed on the sample question to obtain the first sample text feature of the sample question, and the extraction of the first sample text feature may be implemented by a corresponding text extraction algorithm model, for example, an embedding network model.
Step S1220, generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature.
In this embodiment, a self-attention mechanism may be applied to the first sample text feature, and an internal association of each word vector in the first sample text feature is captured by the self-attention mechanism, where the association may be an attention weight, and the attention weight is assigned to each word vector in the first sample text feature to generate a second sample text feature corresponding to the first sample text feature, where the second sample text feature includes an internal association degree between each word vector in the sample question.
Meanwhile, after generating the second sample text feature, based on the attention mechanism, capturing an association between the second sample text feature and the first sample image feature, wherein the association may be an attention weight, and obtaining the second sample image feature corresponding to the first sample image feature by assigning the attention weight to the corresponding first sample image feature, so that the second sample image feature includes the association between the sample picture and the sample question.
And step S1230, combining the second sample text characteristic and the second sample image characteristic to obtain a combined sample characteristic, and inputting the combined sample characteristic into a classification model.
After the attention mechanism is applied to obtain the second sample text feature and the second sample image feature, the second sample text feature and the second sample image feature may be subjected to feature merging at this time to obtain a combined sample feature, and then the combined sample feature is input into the classification model.
Fig. 13 is a logic diagram of a picture-based question-answering processing method according to an embodiment of the present application, in which an executing subject may receive a picture and a question and generate an answer to the question about the picture, thereby completing the picture question answering. As shown in fig. 13, in this embodiment, the picture-based question-answering processing method may specifically include the following steps:
and S1, acquiring the target picture and the question sentence.
In this step, the data are imported, namely the target picture and the question sentence. As shown in fig. 13, the question sentence corresponding to the target picture is "What is in the box?"
And S2, generating a second text feature based on the attention mechanism of the first text feature, and generating a second image feature based on the attention mechanism of the second text feature.
After the target picture and the question sentence are imported through step S1, the features of the question sentence and the target picture may be extracted to obtain the first text feature T and the first image feature I, respectively.
Further, a self-attention mechanism can be applied to the first text feature T to generate a second text feature X, and after the second text feature X is generated, an association between the second text feature X and the first image feature I is captured based on the attention mechanism, wherein the association can be an attention weight, and the attention weight is assigned to the corresponding first image feature I to obtain a second image feature Y corresponding to the first image feature I, so that the second image feature Y includes the association between the target picture and the question sentence.
And S3, combining to obtain joint features, and classifying based on the joint features to obtain answers.
After applying the attention mechanism to obtain the second text feature and the second image feature, the second text feature and the second image feature may be feature merged.
To facilitate feature merging, feature dimension compression may be performed before merging, for example through a Multi-Layer Perceptron (MLP), so as to obtain a compressed second text feature X' and a compressed second image feature Y'. The compressed second text feature X' and the compressed second image feature Y' are then feature-merged to obtain the joint feature.
After the joint feature is obtained, classification can be performed based on the joint feature, so that the answer to the question sentence "What is in the box?" is obtained and the picture question answering is completed.
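A minimal sketch of the compression-and-merging step, assuming a single-layer projection stands in for the MLP, that each feature is pooled to a fixed-length vector before merging, and that merging is done by concatenation; the dimensions, pooling and merge choice are illustrative assumptions.

import numpy as np

def mlp_compress(feat, w):
    """Single-layer stand-in for the MLP used for feature dimension compression."""
    return np.maximum(feat @ w, 0.0)              # linear projection + ReLU

rng = np.random.default_rng(0)
X2 = rng.standard_normal((12, 512))               # second text feature (12 word vectors)
Y2 = rng.standard_normal((36, 2048))              # second image feature (36 region vectors)

Wx = rng.standard_normal((512, 256)) * 0.02       # compression weights (illustrative)
Wy = rng.standard_normal((2048, 256)) * 0.02

x_c = mlp_compress(X2, Wx).mean(axis=0)           # pool word vectors -> (256,)
y_c = mlp_compress(Y2, Wy).mean(axis=0)           # pool region vectors -> (256,)

joint = np.concatenate([x_c, y_c])                # joint feature fed to the classifier
print(joint.shape)                                # (512,)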
Three sets of experiments were performed based on the picture-based question-answering processing method provided by the embodiment of the present application. For ease of presentation, please refer to fig. 14, which compares the method provided by the present application with the prior-art MCAN method in terms of the attended regions and the given answers; the rectangular boxes in the pictures indicate the attended regions.
As can be seen from fig. 14, in the first picture, for the corresponding question "what material is the sidewalk near the street lamp made of", the prior art focuses on an incorrect region (the region where the street lamp is located) and gives the incorrect answer "metal", whereas the present application focuses on the correct region (the sidewalk near the street lamp) and gives the correct answer "concrete". In the second picture, for the corresponding question "what animal is eating the leaves", the prior art again focuses on an incorrect region (a region containing a zebra) and gives the incorrect answer "zebra", whereas the present application focuses on the correct region (the region where the giraffe is located) and gives the correct answer "giraffe". In the third picture, for the corresponding question "what company made the hat", the prior art focuses on multiple regions and gives the incorrect answer "nike", whereas the present application focuses on the correct region (the region where the hat is located) and gives the correct answer "adidas".
It can be seen that the method provided by the present application gives correct answers to all three questions and that its attention mechanism focuses on the correct regions, whereas the MCAN method attends to the wrong regions and gives wrong answers. The comparison shows that the picture-based question-answering processing method provided by the present application can improve the accuracy of picture question answering.
The following describes embodiments of an apparatus of the present application, which may be used to execute the image-based question answering processing method in the above embodiments of the present application. For details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the question-answering processing method based on pictures described above in the present application.
Fig. 15 is a block diagram of a picture-based question-answering processing apparatus according to an embodiment of the present application, and referring to fig. 15, a picture-based question-answering processing apparatus 1500 according to an embodiment of the present application includes: an acquisition unit 1502, an extraction unit 1504, a generation unit 1506, and a merging unit 1508.
The obtaining unit 1502 is configured to obtain a target picture and a question sentence corresponding to the target picture; the extracting unit 1504 is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the question statement to obtain a first text feature of the question statement; the generating unit 1506 is configured to generate a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generate a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; the merging unit 1508 is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer to the question sentence according to the joint feature.
In some embodiments of the present application, the generating unit 1506 includes: the first linear transformation subunit is configured to perform linear transformation on the first text feature by using a plurality of distribution weights respectively to obtain a plurality of first feature matrices, wherein one first feature matrix corresponds to one distribution weight; the first generation subunit is configured to generate a second feature matrix corresponding to each first feature matrix based on the attention mechanism of each first feature matrix to obtain a plurality of second feature matrices; and the first splicing subunit is configured to splice the plurality of second feature matrices to obtain a spliced feature matrix, and map the spliced feature matrix into a dimension which is the same as that of the first text feature to obtain a second text feature corresponding to the first text feature.
In some embodiments of the present application, the first generating subunit is configured to: similarity calculation is carried out on each first feature matrix and a transposed matrix of each first feature matrix, and attention weight factors of each first feature matrix corresponding to the transposed matrix are obtained; normalizing the attention weight factor to obtain a corresponding attention weight; and performing weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
In some embodiments of the present application, the generating unit 1506 includes: a second linear transformation subunit, configured to perform linear transformation on the second text feature and the first image feature respectively by using a plurality of distribution weights to obtain a plurality of third feature matrices and a plurality of fourth feature matrices, where one of the distribution weights corresponds to one of the third feature matrices and one of the fourth feature matrices; a second generating subunit, configured to generate, based on an attention mechanism of each third feature matrix, a fifth feature matrix corresponding to a fourth feature matrix associated with the each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with the each third feature matrix is a fourth feature matrix with the same assigned weight as the fourth feature matrix corresponding to the each third feature matrix; and the second splicing subunit is configured to splice the fifth feature matrices to obtain spliced feature matrices, and map the spliced feature matrices to a dimension same as that of the first image features to obtain second image features corresponding to the first image features.
In some embodiments of the present application, the second generating subunit is configured to: similarity calculation is carried out on the third feature matrixes and the transpose matrix of the associated fourth feature matrix, and attention weight factors of the third feature matrixes corresponding to the transpose matrix are obtained; normalizing the attention weight factor to obtain a corresponding attention weight; and performing weighted summation calculation on the feature points contained in the associated fourth feature matrixes by using the attention weight to obtain fifth feature matrixes corresponding to the fourth feature matrixes associated with the third feature matrixes.
In some embodiments of the present application, the merging unit 1508 includes: an input subunit, configured to input the joint features into a classification model, where the classification model is obtained by training according to a joint loss function, the joint loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target region in a sample picture, and the target region is a region determined in the sample picture according to a sample problem corresponding to the sample picture; and the determining subunit is configured to acquire the prediction probability of the question statement output by the classification model for each answer, and determine the answer of the question statement according to the prediction probability.
In some embodiments of the present application, the determining subunit is configured to: obtaining an answer corresponding to the maximum prediction probability in the prediction probabilities; and taking the answer corresponding to the maximum prediction probability as the answer of the question statement.
In some embodiments of the present application, the apparatus further comprises: the input unit is configured to input the sample picture and the sample question into the classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample question for each answer; the construction unit is configured to construct a first loss function according to the labeled answer of the sample question and the output probability, and construct a second loss function according to the attention weight corresponding to the target area; and the training unit is configured to construct the joint loss function according to the first loss function and the second loss function, and train the classification model based on the joint loss function to obtain a trained classification model.
In some embodiments of the present application, the construction unit is configured to: generating labeling probability aiming at each answer according to the labeling answers of the sample questions; carrying out logarithmic operation on the output probability of the sample question aiming at each answer to obtain an operation result aiming at each answer; and determining the first loss function according to the sum of products of the labeling probability aiming at each answer and the operation result aiming at each answer.
In some embodiments of the present application, the construction unit is configured to: calculating a difference value between the attention weight corresponding to the target area and a preset threshold value; and constructing the second loss function according to the calculated difference.
In some embodiments of the present application, the training unit is configured to: calculating the product of a preset adjustment factor and the second loss function to obtain an operation result; and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
In some embodiments of the present application, the input unit is configured to: performing feature extraction on the sample picture to obtain a first sample image feature of the sample picture, and performing feature extraction on the sample question to obtain a first sample image feature of the sample question; generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature; and combining the second sample text characteristic and the second sample image characteristic to obtain a combined sample characteristic, and inputting the combined sample characteristic into the classification model.
FIG. 16 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 1600 of the electronic device shown in fig. 16 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 16, computer system 1600 includes a Central Processing Unit (CPU)1601, which can perform various appropriate actions and processes, such as executing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1602 or a program loaded from a storage portion 1608 into a Random Access Memory (RAM) 1603. In the RAM 1603, various programs and data necessary for system operation are also stored. The CPU 1601, ROM 1602, and RAM 1603 are connected to each other via a bus 1604. An Input/Output (I/O) interface 1605 is also connected to the bus 1604.
The following components are connected to the I/O interface 1605: an input portion 1606 including a keyboard, a mouse, and the like; an output section 1607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 1608 including a hard disk and the like; and a communication section 1609 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1609 performs communication processing via a network such as the internet. The driver 1610 is also connected to the I/O interface 1605 as needed. A removable medium 1611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1610 as necessary, so that a computer program read out therefrom is mounted in the storage portion 1608 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1609, and/or installed from the removable media 1611. When the computer program is executed by a Central Processing Unit (CPU)1601, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A question-answer processing method based on pictures is characterized by comprising the following steps:
acquiring a target picture and a question sentence corresponding to the target picture;
performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence;
generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature;
and combining the second text characteristic and the second image characteristic to obtain a joint characteristic, and generating an answer of the question sentence according to the joint characteristic.
2. The method of claim 1, wherein generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature comprises:
performing linear transformation on the first text feature by using a plurality of distribution weights to obtain a plurality of first feature matrixes, wherein one first feature matrix corresponds to one distribution weight;
generating a second feature matrix corresponding to each first feature matrix based on the attention mechanism of each first feature matrix to obtain a plurality of second feature matrices;
and splicing the plurality of second feature matrixes to obtain a spliced feature matrix, and mapping the spliced feature matrix to a dimension which is the same as that of the first text feature to obtain a second text feature corresponding to the first text feature.
3. The method of claim 2, wherein generating the second feature matrix corresponding to each first feature matrix based on the attention mechanism of the each first feature matrix comprises:
similarity calculation is carried out on each first feature matrix and a transposed matrix of each first feature matrix, and attention weight factors of each first feature matrix corresponding to the transposed matrix are obtained;
normalizing the attention weight factor to obtain a corresponding attention weight;
and performing weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
4. The method of claim 1, wherein generating a second image feature corresponding to the first image feature based on an attention mechanism of the second text feature comprises:
performing linear transformation on the second text characteristic and the first image characteristic by using a plurality of distribution weights respectively to obtain a plurality of third characteristic matrixes and a plurality of fourth characteristic matrixes, wherein one distribution weight corresponds to one third characteristic matrix and one fourth characteristic matrix;
generating a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix based on the attention mechanism of each third feature matrix to obtain a plurality of fifth feature matrices, wherein the fourth feature matrix associated with each third feature matrix is the fourth feature matrix with the same distribution weight corresponding to each third feature matrix;
and splicing the fifth feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes to have the same dimensionality as the first image features to obtain second image features corresponding to the first image features.
5. The method of claim 4, wherein generating a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix based on the attention mechanism of the each third feature matrix comprises:
similarity calculation is carried out on the third feature matrixes and the transpose matrix of the associated fourth feature matrix, and attention weight factors of the third feature matrixes corresponding to the transpose matrix are obtained;
normalizing the attention weight factor to obtain a corresponding attention weight;
and performing weighted summation calculation on the feature points contained in the associated fourth feature matrixes by using the attention weight to obtain fifth feature matrixes corresponding to the fourth feature matrixes associated with the third feature matrixes.
6. The method of any one of claims 1-5, wherein generating an answer to the question statement based on the joint feature comprises:
inputting the combined features into a classification model, wherein the classification model is obtained by training according to a combined loss function, the combined loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target region in a sample picture, and the target region is a region determined in the sample picture according to a sample problem corresponding to the sample picture;
and acquiring the prediction probability of the question sentences output by the classification model for each answer, and determining the answers of the question sentences according to the prediction probability.
7. The method of claim 6, wherein determining the answer to the question statement based on the prediction probability comprises:
obtaining an answer corresponding to the maximum prediction probability in the prediction probabilities;
and taking the answer corresponding to the maximum prediction probability as the answer of the question statement.
8. The method of claim 6, further comprising:
inputting the sample picture and the sample question into the classification model to obtain an output result of the classification model, wherein the output result comprises the output probability of the sample question for each answer;
constructing a first loss function according to the labeled answer of the sample question and the output probability, and constructing a second loss function according to the attention weight corresponding to the target area;
and constructing the joint loss function according to the first loss function and the second loss function, and training the classification model based on the joint loss function to obtain a trained classification model.
9. The method of claim 8, wherein constructing a first loss function from the labeled answers to the sample questions and the output probabilities comprises:
generating labeling probability aiming at each answer according to the labeling answers of the sample questions;
carrying out logarithmic operation on the output probability of the sample question aiming at each answer to obtain an operation result aiming at each answer;
and determining the first loss function according to the sum of products of the labeling probability aiming at each answer and the operation result aiming at each answer.
10. The method of claim 8, wherein constructing a second loss function based on the attention weight corresponding to the target region comprises:
calculating a difference value between the attention weight corresponding to the target area and a preset threshold value;
and constructing the second loss function according to the calculated difference.
11. The method of claim 8, wherein constructing the joint loss function from the first loss function and the second loss function comprises:
calculating the product of a preset adjustment factor and the second loss function to obtain an operation result;
and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
12. The method of claim 8, wherein inputting the sample picture and the sample question into the classification model comprises:
performing feature extraction on the sample picture to obtain a first sample image feature of the sample picture, and performing feature extraction on the sample question to obtain a first sample text feature of the sample question;
generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature;
and combining the second sample text characteristic and the second sample image characteristic to obtain a combined sample characteristic, and inputting the combined sample characteristic into the classification model.
13. A picture-based question-answering processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a target picture and question sentences corresponding to the target picture;
the extraction unit is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the question sentence to obtain a first text feature of the question sentence;
the generating unit is configured to generate a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generate a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature;
and the merging unit is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer of the question sentence according to the joint feature.
14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, implements the picture-based question-answer processing method according to any one of claims 1 to 12.
15. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the picture-based question-and-answer processing method according to any one of claims 1 to 12.
CN202110548159.7A 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment Active CN113761153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110548159.7A CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110548159.7A CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113761153A true CN113761153A (en) 2021-12-07
CN113761153B CN113761153B (en) 2023-10-24

Family

ID=78787131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110548159.7A Active CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113761153B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114547273A (en) * 2022-03-18 2022-05-27 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium
CN114821605A (en) * 2022-06-30 2022-07-29 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium
CN114882504A (en) * 2022-07-06 2022-08-09 成都西交智汇大数据科技有限公司 Grading method, device, equipment and readable storage medium
CN115761273A (en) * 2023-01-10 2023-03-07 苏州浪潮智能科技有限公司 Visual common sense reasoning method and device, storage medium and electronic equipment
CN116843030A (en) * 2023-09-01 2023-10-03 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
CN110263912A (en) * 2019-05-14 2019-09-20 杭州电子科技大学 A kind of image answering method based on multiple target association depth reasoning
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111680484A (en) * 2020-05-29 2020-09-18 北京理工大学 Answer model generation method and system for visual general knowledge reasoning question and answer
CN111767379A (en) * 2020-06-29 2020-10-13 北京百度网讯科技有限公司 Image question-answering method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
CN110263912A (en) * 2019-05-14 2019-09-20 杭州电子科技大学 A kind of image answering method based on multiple target association depth reasoning
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111680484A (en) * 2020-05-29 2020-09-18 北京理工大学 Answer model generation method and system for visual general knowledge reasoning question and answer
CN111767379A (en) * 2020-06-29 2020-10-13 北京百度网讯科技有限公司 Image question-answering method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIMANSHU SHARMA et al., Image and Vision Computing, vol. 110, 2021, pages 1-11 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547273A (en) * 2022-03-18 2022-05-27 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium
CN114547273B (en) * 2022-03-18 2022-08-16 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium
CN114416914A (en) * 2022-03-30 2022-04-29 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114821605A (en) * 2022-06-30 2022-07-29 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium
CN114821605B (en) * 2022-06-30 2022-11-25 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium
CN114882504A (en) * 2022-07-06 2022-08-09 成都西交智汇大数据科技有限公司 Grading method, device, equipment and readable storage medium
CN114882504B (en) * 2022-07-06 2022-11-11 成都西交智汇大数据科技有限公司 Grading method, grading device, grading equipment and readable storage medium
CN115761273A (en) * 2023-01-10 2023-03-07 苏州浪潮智能科技有限公司 Visual common sense reasoning method and device, storage medium and electronic equipment
CN116843030A (en) * 2023-09-01 2023-10-03 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model

Also Published As

Publication number Publication date
CN113761153B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
WO2020228376A1 (en) Text processing method and model training method and apparatus
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
CN111046275B (en) User label determining method and device based on artificial intelligence and storage medium
CN114818691A (en) Article content evaluation method, device, equipment and medium
CN111259647A (en) Question and answer text matching method, device, medium and electronic equipment based on artificial intelligence
CN111666416A (en) Method and apparatus for generating semantic matching model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN118229844B (en) Image generation data processing method, image generation method and device
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN114611672A (en) Model training method, face recognition method and device
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN118035945B (en) Label recognition model processing method and related device
CN112132075B (en) Method and medium for processing image-text content
CN117711001A (en) Image processing method, device, equipment and medium
CN116975347A (en) Image generation model training method and related device
CN117034133A (en) Data processing method, device, equipment and medium
CN114692715A (en) Sample labeling method and device
CN118155214B (en) Prompt learning method, image classification method and related devices
CN113626564B (en) Concept label generation method and device, electronic equipment and storage medium
CN115455306B (en) Push model training method, information push device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant