CN113761153B - Picture-based question-answering processing method and device, readable medium and electronic equipment

Info

Publication number: CN113761153B (granted); earlier published as CN113761153A
Application number: CN202110548159.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 彭博 (Peng Bo)
Original and current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Active (granted)
Prior art keywords: feature, sample, text, picture, attention

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (information retrieval; querying of unstructured textual data)
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/35: Clustering; classification of unstructured textual data
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06N20/00: Machine learning


Abstract

The embodiments of the application provide a picture-based question-answering processing method and device, a readable medium and electronic equipment. The picture-based question-answering processing method comprises the following steps: acquiring a target picture and a question sentence corresponding to the target picture; performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence; generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and combining the second text feature and the second image feature to obtain a joint feature, and generating an answer to the question sentence according to the joint feature. The technical scheme of the embodiments of the application can improve the accuracy of picture-based question answering.

Description

Picture-based question-answering processing method and device, readable medium and electronic equipment
Technical Field
The application relates to the technical field of information processing, in particular to a question-answering processing method and device based on pictures, a readable medium and electronic equipment.
Background
Visual question answering (Visual Question Answering, VQA) refers to designing a model that, given a picture, automatically answers questions related to the content of the picture, typically by using an attention mechanism.
However, in related visual question-answering technology, the attention mechanism is learned entirely by the model itself and lacks proper guidance. Such an unsupervised attention mechanism is limited by the bias of the data set, so the region the model attends to often differs greatly from the region a person would attend to; that is, the wrong picture region is focused on. This attention error leads to poor generalization capability, poor interpretability, and low question-answering accuracy.
Disclosure of Invention
The embodiments of the application provide a picture-based question-answering processing method and device, a readable medium and electronic equipment, which can improve the accuracy of picture-based question answering at least to a certain extent.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of the embodiment of the present application, there is provided a question-answering processing method based on a picture, including: acquiring a target picture and a problem statement corresponding to the target picture; extracting features of the target picture to obtain a first image feature of the target picture, and extracting features of the problem statement to obtain a first text feature of the problem statement; generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and combining the second text features and the second image features to obtain joint features, and generating answers of the question sentences according to the joint features.
According to an aspect of an embodiment of the present application, there is provided a picture-based question-answering processing apparatus, including: the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a target picture and a problem statement corresponding to the target picture; the extraction unit is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the problem statement to obtain a first text feature of the problem statement; a generating unit configured to generate a second text feature corresponding to the first text feature based on an attention mechanism of the first text feature, and generate a second image feature corresponding to the first image feature based on an attention mechanism of the second text feature; and the merging unit is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer of the question sentence according to the joint feature.
In some embodiments of the application, based on the foregoing scheme, the generating unit includes: the first linear transformation subunit is configured to respectively perform linear transformation on the first text features by using a plurality of distribution weights to obtain a plurality of first feature matrixes, wherein one first feature matrix corresponds to one distribution weight; the first generation subunit is configured to generate second feature matrixes corresponding to the first feature matrixes based on the attention mechanisms of the first feature matrixes so as to obtain a plurality of second feature matrixes; and the first splicing subunit is configured to splice the plurality of second feature matrixes to obtain a spliced feature matrix, map the spliced feature matrix to the same dimension as the first text feature, and obtain a second text feature corresponding to the first text feature.
In some embodiments of the application, based on the foregoing scheme, the first generating subunit is configured to: performing similarity calculation on each first feature matrix and a transposed matrix of each first feature matrix to obtain attention weight factors of the transposed matrix corresponding to each first feature matrix; normalizing the attention weight factors to obtain corresponding attention weights; and carrying out weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
In some embodiments of the application, based on the foregoing scheme, the generating unit includes: the second linear transformation subunit is configured to respectively perform linear transformation on the second text feature and the first image feature by using a plurality of distribution weights to obtain a plurality of third feature matrixes and a plurality of fourth feature matrixes, wherein one distribution weight corresponds to one third feature matrix and one fourth feature matrix; a second generating subunit configured to generate, based on an attention mechanism of each third feature matrix, a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with each third feature matrix is a fourth feature matrix with the same allocation weight as that corresponding to each third feature matrix; and the second splicing subunit is configured to splice the plurality of fifth feature matrixes to obtain a spliced feature matrix, map the spliced feature matrix to the same dimension as the first image feature, and obtain a second image feature corresponding to the first image feature.
In some embodiments of the application, based on the foregoing scheme, the second generating subunit is configured to: performing similarity calculation on the third feature matrices and the transposed matrices of the fourth feature matrices to obtain attention weight factors of the third feature matrices corresponding to the transposed matrices; normalizing the attention weight factors to obtain corresponding attention weights; and carrying out weighted summation calculation on the feature points contained in the associated fourth feature matrix by using the attention weight to obtain a fifth feature matrix corresponding to the fourth feature matrix associated with each third feature matrix.
In some embodiments of the present application, based on the foregoing scheme, the merging unit includes: the input subunit is configured to input the joint characteristics into a classification model, the classification model is obtained by training according to a joint loss function, the joint loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target area in a sample picture, and the target area is an area determined in the sample picture according to a sample problem corresponding to the sample picture; and the determining subunit is configured to acquire the prediction probability of the question sentences output by the classification model for each answer and determine the answers of the question sentences according to the prediction probability.
In some embodiments of the application, based on the foregoing scheme, the determining subunit is configured to: acquiring an answer corresponding to the maximum prediction probability in the prediction probabilities; and taking the answer corresponding to the maximum prediction probability as the answer of the question sentence.
In some embodiments of the application, based on the foregoing, the apparatus further comprises: the input unit is configured to input the sample picture and the sample questions into the classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample questions for each answer; the construction unit is configured to construct a first loss function according to the labeling answers of the sample questions and the output probabilities, and construct a second loss function according to the attention weights corresponding to the target areas; the training unit is configured to construct the joint loss function according to the first loss function and the second loss function, train the classification model based on the joint loss function and obtain a trained classification model.
In some embodiments of the application, based on the foregoing, the building unit is configured to: generating labeling probability for each answer according to the labeling answers of the sample questions; carrying out logarithmic operation on the output probability of each answer of the sample questions to obtain operation results of each answer; and determining the first loss function according to the sum of products of labeling probabilities for the answers and operation results for the answers.
In some embodiments of the application, based on the foregoing, the building unit is configured to: calculating a difference value between the attention weight corresponding to the target area and a preset threshold value; and constructing the second loss function according to the calculated difference value.
In some embodiments of the application, based on the foregoing, the training unit is configured to: calculating the product of a preset adjustment factor and the second loss function to obtain an operation result; and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
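By way of illustration only, a minimal sketch of such a joint loss is given below; the cross-entropy form of the first loss function and the hinge-style form of the second loss function are assumed concrete choices, since the embodiments describe these functions only in terms of the quantities they are built from.

```python
import torch

def joint_loss(output_probs, label_probs, target_area_attention, threshold=0.5, adjust_factor=1.0):
    """Illustrative sketch of the joint loss function.

    output_probs:          (B, num_answers) probabilities output by the classification model
    label_probs:           (B, num_answers) labeling probabilities built from the annotated answers
    target_area_attention: (B,) attention weight assigned to the target area of each sample picture
    threshold, adjust_factor: the preset threshold and adjustment factor (values are assumptions)
    """
    # First loss: sum of products of labeling probabilities and log output probabilities,
    # negated so that it is minimized (a cross-entropy form).
    first_loss = -(label_probs * torch.log(output_probs + 1e-12)).sum(dim=-1).mean()
    # Second loss: built from the difference between the target-area attention weight and the
    # preset threshold; here it penalizes attention weights below the threshold (assumed form).
    second_loss = torch.relu(threshold - target_area_attention).mean()
    # Joint loss: first loss plus the adjustment factor times the second loss.
    return first_loss + adjust_factor * second_loss
```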
In some embodiments of the application, based on the foregoing, the input unit is configured to: extracting features of the sample picture to obtain first sample image features of the sample picture, and extracting features of the sample question to obtain first sample text features of the sample question; generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature; and carrying out feature combination on the second sample text features and the second sample image features to obtain combined sample features, and inputting the combined sample features into the classification model.
In the technical solutions provided in some embodiments of the present application, a target picture and a problem statement corresponding to the target picture may be obtained first; then, extracting features of the target picture to obtain a first image feature of the target picture, and extracting features of the problem statement to obtain a first text feature of the problem statement; further, generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; and finally, combining the second text features and the second image features to obtain joint features, and generating answers of the question sentences according to the joint features. According to the technical scheme, based on the attention mechanism, the internal relevance among the word vectors in the first text feature is captured, the second text feature is obtained, meanwhile, based on the attention mechanism, the relevance between the first image feature and the second text feature is captured, the second image feature is obtained, the second image feature comprises the relevance between the target picture and the question sentence, further, in the process of generating the answer of the question sentence later, more accurate results can be obtained according to the second text feature and the second image feature, and the accuracy of picture questions and answers is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present application may be applied;
FIG. 2 illustrates a flow chart of a picture-based question-answering processing method according to one embodiment of the present application;
FIG. 3 illustrates a flow diagram for generating a second text feature according to one embodiment of the application;
FIG. 4 illustrates a flow diagram for generating a second feature matrix according to one embodiment of the application;
FIG. 5 illustrates a flow chart of generating a second image feature according to one embodiment of the application;
FIG. 6 shows a flow chart of generating a fifth feature matrix according to one embodiment of the application;
FIG. 7 illustrates a flowchart of generating answers to a question sentence according to one embodiment of the application;
FIG. 8 shows a schematic diagram of a target area according to one embodiment of the application;
FIG. 9 illustrates a flow diagram of classification model training according to an embodiment of the application;
FIG. 10 shows a flow chart of constructing a first loss function according to an embodiment of the application;
FIG. 11 shows a flow chart of constructing a second loss function according to an embodiment of the application;
FIG. 12 illustrates a flow chart of classification model training according to an embodiment of the application;
FIG. 13 illustrates a logic diagram of a picture-based question-answering processing method according to one embodiment of the present application;
FIG. 14 shows an effect contrast graph of a picture-based question-answer processing method according to one embodiment of the application;
FIG. 15 shows a block diagram of a picture-based question-answering processing device according to one embodiment of the present application;
fig. 16 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
It should be noted that the terms used in the description of the present application and the claims and the above-mentioned drawings are only used for describing the embodiments, and are not intended to limit the scope of the present application. It will be understood that the terms "comprises," "comprising," "includes," "including" and/or "having," when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be further understood that, although the terms "first," "second," "third," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element without departing from the scope of the present application. Similarly, the second element may be referred to as a first element. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It should be noted that: references herein to "a plurality" means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., a and/or B may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
With the research and advancement of artificial intelligence (Artificial Intelligence, AI) technology, artificial intelligence has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart medicine, smart customer service, and so on. It is believed that with the development of technology, artificial intelligence will be applied in more fields and will have increasingly important value.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. Artificial intelligence techniques mainly include computer vision techniques, natural language processing techniques, machine learning/deep learning, and other major directions.
Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, and the like.
It should be understood that the technical scheme provided by the application can be applied to artificial-intelligence-based visual question-answering scenarios. Visual question answering is an active interdisciplinary research field spanning computer vision, natural language processing and machine learning. Given an image and a natural language question associated with the image, visual question answering answers the question with a natural language statement. Visual question answering is not only a basic step toward building artificial intelligence, but is also of great importance for many applications such as image retrieval, navigation for the blind and early childhood education. Visual question answering is a challenging task because it requires sophisticated computer vision techniques to understand images deeply, advanced natural language processing techniques to extract the meaning of the question, and a unified framework to efficiently integrate visual and semantic information.
Currently, solutions for visual question answering include Deep Modular Co-Attention Networks for Visual Question Answering (MCAN) and Modulated Convolutions for Visual Counting and Beyond (MOVIE).
MCAN is a visual question-answering solution proposed in 2019. First, the model processes the question and the picture separately and extracts the corresponding text features and visual features; second, the text features and visual features pass through an attention module, where self-attention is applied to the text and guided attention is applied to the image; finally, the text features and image features are combined to generate the answer to the question. MOVIE is a visual question-answering solution proposed in 2020. The model likewise extracts features for the picture and the text separately, except that the extracted text features are densely concatenated onto each image feature before entering the attention module. The advantage of this method is that the feature of each image region can be fused with the question feature, which yields a clear metric improvement on counting questions.
However, current visual question-answering methods have the following problems: (1) the attention mechanism is learned entirely by the model itself and lacks proper guidance; (2) the unsupervised attention mechanism is limited by the bias of the data set, and the region the model attends to differs greatly from the region a person would attend to, that is, the wrong picture region is focused on; (3) this attention error can lead to poor generalization ability and poor interpretability of the model.
Based on the above, the embodiment of the application provides a question-answering processing method based on a picture, which comprises the steps of firstly obtaining a target picture and a question sentence corresponding to the target picture, then carrying out feature extraction on the target picture to obtain a first image feature of the target picture, carrying out feature extraction on the question sentence to obtain a first text feature of the question sentence, further generating a second text feature corresponding to the first text feature based on an attention mechanism of the first text feature, generating a second image feature corresponding to the first image feature based on an attention mechanism of the second text feature, finally carrying out feature combination on the second text feature and the second image feature to obtain a joint feature, and generating an answer of the question sentence according to the joint feature. According to the technical scheme, based on the attention mechanism, the internal relevance among the word vectors in the first text feature is captured, the second text feature is obtained, meanwhile, based on the attention mechanism, the relevance between the first image feature and the second text feature is captured, the second image feature is obtained, the second image feature comprises the relevance between the target picture and the question sentence, further, in the process of generating the answer of the question sentence later, more accurate results can be obtained according to the second text feature and the second image feature, and the accuracy of the picture question and answer is improved.
In order to facilitate understanding, an embodiment of the present application proposes a question-answer processing method based on a picture, where the method is applied to a system architecture shown in fig. 1, referring to fig. 1, a system architecture 100 may include a terminal device 101, a network 102, a server 103, a target picture 104, and a question sentence 105 corresponding to the target picture. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include, but is not limited to: wireless network, wired network including, but not limited to, at least one of: wide area network, metropolitan area network, local area network. The wireless network includes, but is not limited to, at least one of: bluetooth, WI-FI, near field communication (Near Field Communication, NFC for short), cellular mobile communication network. A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like.
The terminal device 101 may be any electronic product that can perform man-machine interaction with a user through one or more modes of a keyboard, a touch pad, a touch screen, a remote controller, a voice interaction or handwriting device, such as a PC (Personal Computer ), a mobile phone, a smart phone, a PDA (Personal Digital Assistant ), a wearable device, a palm top PPC (Pocket PC), a tablet computer, a smart car machine, a smart television, a smart speaker, etc.
The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform.
It will be appreciated by those skilled in the art that the terminal device 101 and server 103 described above are only examples, and that other terminal devices or servers, whether existing now or developed hereafter, fall within the scope of the present application and are incorporated herein by reference.
It should be understood that the number of terminal devices 101, networks 102 and servers 103 in fig. 1 is illustrative. There may be any number of terminal devices 101, networks 102, and servers 103, as desired for implementation. For example, the server 103 may be a server cluster formed by a plurality of servers.
In one embodiment of the application, the user may upload the target picture 104 and the question sentence 105 through an application on the terminal device 101 and send the target picture 104 and the question sentence 105 to the server 103 through the network 102 between the terminal device 101 and the server 103. Correspondingly, after receiving the target picture 104 and the question sentence 105, the server 103 performs feature extraction on the target picture 104 to obtain a first image feature of the target picture 104, performs feature extraction on the question sentence 105 to obtain a first text feature of the question sentence 105, then the server 103 generates a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, generates a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature, and finally the server 103 performs feature combination on the second text feature and the second image feature to obtain a joint feature, and generates an answer of the question sentence 105 according to the joint feature.
The picture-based question-answering processing method provided by the embodiment of the application is generally executed by the server 103, the server 103 is used for receiving the target picture 104 and the question sentence 105 uploaded by the terminal equipment 101, and generating the answer of the question sentence 105 based on the target picture 104, and accordingly, the picture-based question-answering processing device is generally arranged in the server 103. However, it is easily understood by those skilled in the art that the picture-based question-answering processing method provided in the embodiment of the present application may be performed by the terminal device 101, and accordingly, the picture-based question-answering processing apparatus may be provided in the terminal device 101, which is not particularly limited in the present exemplary embodiment. For example, in an exemplary embodiment, the terminal device 101 is configured to receive the target picture 104 and the question sentence 105 uploaded by the user, and generate an answer to the question sentence 105 based on the target picture 104.
The implementation details of the technical scheme of the embodiment of the application are described in detail below:
fig. 2 illustrates a flowchart of a picture-based question-answer processing method according to an embodiment of the present application, which may be performed by a server, which may be the server 103 illustrated in fig. 1. Referring to fig. 2, the picture-based question-answering processing method at least includes the following steps:
Step S210, obtaining a target picture and a question sentence corresponding to the target picture;
Step S220, performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence;
Step S230, generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature;
Step S240, combining the second text feature and the second image feature to obtain a joint feature, and generating an answer to the question sentence according to the joint feature.
These steps are described in detail below.
Step S210, obtaining a target picture and a question sentence corresponding to the target picture.
Specifically, the server may obtain the target picture, which may come from a large-scale knowledge base or a picture database, or may be obtained directly from the internet, a blockchain, or a distributed file system; the embodiment of the present application does not limit this. The target picture may be a single picture or a frame of a video.
In the embodiment of the application, the server acquires not only the target picture but also the question sentence corresponding to the target picture, and the question sentence has a strong relevance to the target picture.
For example, assuming that the content of the target picture is three apples placed on a table, the question sentence corresponding to the target picture may be "how many apples are on the table".
Step S220, performing feature extraction on the target picture to obtain a first image feature of the target picture, and performing feature extraction on the question sentence to obtain a first text feature of the question sentence.
A feature is a characteristic, or a set of characteristics, that distinguishes one class of objects from other classes and that can be extracted through measurement or processing. The main purpose of feature extraction is dimensionality reduction: an original image sample is projected into a low-dimensional feature space to obtain a low-dimensional representation that best reflects the nature of the sample or best distinguishes it from other samples.
Specifically, for pictures, each picture has characteristics that distinguish it from other pictures. Some are intuitively perceived natural characteristics such as brightness, edges, texture and color; others are obtained through transformation or processing, such as moments, histograms and principal components. In the embodiment of the present application, the first image feature may be expressed as a feature vector, for example f = {x1, x2, ..., xn}. Common image feature extraction methods include: (1) geometric methods, which are texture feature analysis methods based on the theory of image texture primitives; (2) model-based methods, which rely on a structural model of the picture and use the model parameters as texture features, for example a convolutional neural network model; (3) signal processing methods, in which the extraction and matching of texture features mainly uses gray-level co-occurrence matrices, autoregressive texture models, wavelet transforms, and so on.
For the question sentence, the first text feature is intended to express the text in a form that a computer can understand, i.e. to vectorize the text. The extraction of the first text feature may likewise be implemented by a corresponding text feature extraction model, for example an embedding network model.
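By way of illustration only, a minimal PyTorch-style sketch of this extraction step is given below; the embedding-plus-LSTM text encoder, the use of pre-extracted region features for the picture, and the specific dimensions are assumptions made for the example rather than requirements of the embodiment.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative sketch: produces the first text feature and the first image feature."""

    def __init__(self, vocab_size=20000, emb_dim=300, hidden=512, img_dim=2048):
        super().__init__()
        # Text branch: word embedding followed by an LSTM encoder (assumed choice).
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        # Image branch: region features from a pretrained detector are assumed to be
        # given; a linear layer maps them into the same working dimension.
        self.img_proj = nn.Linear(img_dim, hidden)

    def forward(self, question_tokens, region_feats):
        # question_tokens: (batch, num_words) integer word ids of the question sentence
        # region_feats:    (batch, num_regions, img_dim) pre-extracted picture region features
        first_text_feature, _ = self.lstm(self.embed(question_tokens))  # (B, num_words, hidden)
        first_image_feature = self.img_proj(region_feats)               # (B, num_regions, hidden)
        return first_text_feature, first_image_feature
```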
Step S230, generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generating a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature.
The attention mechanism is inspired by the human visual attention mechanism, a brain signal processing mechanism specific to human vision: by rapidly scanning the whole image, human vision obtains the target area that needs to be focused on, i.e. the focus of attention, and then devotes more attention resources to this area to acquire more detailed information about the target while suppressing other useless information. The attention mechanism may include a self-attention mechanism (Self-Attention Mechanism), which directly computes dependency relationships regardless of the distance between words and can learn the internal structure of a sentence.
In the embodiment of the application, a self-attention mechanism can be applied to the first text feature, and the internal relevance of each word vector in the first text feature is captured through the self-attention mechanism, wherein the relevance can be attention weight, the attention weight is distributed to each word vector in the first text feature, a second text feature corresponding to the first text feature is generated, and the second text feature comprises the internal relevance among each word vector in the question sentence.
Meanwhile, after the second text feature is generated, based on an attention mechanism, capturing the relevance between the second text feature and the first image feature, wherein the relevance can be attention weight, and the second image feature corresponding to the first image feature is obtained by distributing the attention weight to the corresponding first image feature, so that the second image feature comprises the relevance between the target picture and the problem statement.
And step S240, combining the second text features and the second image features to obtain joint features, and generating answers of the question sentences according to the joint features.
In this embodiment, after the attention mechanism is applied to obtain the second text feature and the second image feature, the two features may be spliced to obtain a joint feature. For example, if the second text feature and the second image feature are each 256-dimensional feature vectors, splicing them end to end yields a 512-dimensional joint feature. Feature splicing may be implemented by a concat operation, which connects two or more arrays.
After feature stitching, a joint feature can be obtained, so that the number of subsequent inputs is reduced. For example, when an answer to a question sentence is subsequently acquired, a joint feature may be input, reducing the input amount compared to 2 features being input separately.
After the feature combination is carried out to obtain the joint feature, the question sentences can be classified based on the joint feature to obtain answers of the question sentences.
In one possible implementation, classifying the question sentence based on the joint feature may consist of searching a preset class label library for a class label that matches the joint feature and then taking the matched class label as the answer to the question sentence, where a matched class label may be one whose similarity to the joint feature is greater than or equal to a preset similarity value.
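By way of illustration only, step S240 may be sketched roughly as follows; the pooling of the second features into single vectors and the linear classification head over a fixed answer set are assumptions corresponding to the classification model described later.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Illustrative sketch of step S240: feature merging followed by answer classification."""

    def __init__(self, feat_dim=256, num_answers=3000):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_answers)

    def forward(self, second_text_feature, second_image_feature):
        # Both inputs are assumed to be pooled feature vectors of shape (batch, feat_dim).
        joint_feature = torch.cat([second_text_feature, second_image_feature], dim=-1)
        logits = self.classifier(joint_feature)
        probs = torch.softmax(logits, dim=-1)   # prediction probability for each candidate answer
        answer_index = probs.argmax(dim=-1)     # answer with the maximum prediction probability
        return probs, answer_index
```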
Based on the technical scheme of the embodiment, based on the attention mechanism, the internal relevance among the word vectors in the first text feature is captured, so that the second text feature is obtained, meanwhile, based on the attention mechanism, the relevance between the first image feature and the second text feature is captured, so that the second image feature comprises the relevance between the target picture and the question sentence, and further, in the process of generating the answer of the question sentence later, a more accurate result can be obtained according to the second text feature and the second image feature.
In one embodiment of the present application, as shown in fig. 3, based on the attention mechanism of the first text feature, generating the second text feature corresponding to the first text feature may specifically include steps S310 to S330, which are described in detail below:
step S310, respectively performing linear transformation on the first text features by using a plurality of assigned weights to obtain a plurality of first feature matrices, wherein one first feature matrix corresponds to one assigned weight.
In this embodiment, a self-attention mechanism may be applied to the first text feature, and specifically, first, a plurality of assigned weights may be used to perform linear transformation on the first text feature respectively, so as to obtain a plurality of first feature matrices, where one first feature matrix corresponds to one assigned weight.
It should be noted that the first feature matrix is a triplet (Q, K, V), where Q, K and V represent the query, key and value, respectively, and have the same dimensions. Since the first feature matrix is a triplet, the assigned weight is also a triplet (W^Q, W^K, W^V).
For example, assuming that the first text feature is T, the j-th assigned weight (W_j^Q, W_j^K, W_j^V) is used to linearly transform T, and the j-th first feature matrix (Q_j, K_j, V_j) can be expressed as: Q_j = T·W_j^Q, K_j = T·W_j^K, V_j = T·W_j^V.
Step S320, based on the attention mechanisms of the first feature matrices, generating second feature matrices corresponding to the first feature matrices to obtain a plurality of second feature matrices.
After the first text feature is linearly changed to obtain each first feature matrix, further, second feature matrices corresponding to each first feature matrix can be generated based on the attention mechanism of each first feature matrix, so that a plurality of second feature matrices are obtained. The implementation of this step is similar to the implementation of step S230, and the embodiment of the present application will not be described in detail here.
And step S330, splicing the plurality of second feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes into the same dimension as the first text features to obtain second text features corresponding to the first text features.
Finally, a plurality of second feature matrices can be spliced together, and then the spliced feature matrices are mapped into the same dimension as the first text feature, so that the second text feature corresponding to the first text feature can be obtained.
Alternatively, in one implementation, as shown in fig. 4, step S320 may specifically include steps S410-S430, which are specifically described as follows:
In step S410, similarity calculation is performed on each first feature matrix and the transposed matrix of each first feature matrix, so as to obtain attention weight factors of the transposed matrix corresponding to each first feature matrix.
In the embodiment of the present application, the attention mechanism based on each first feature matrix may be understood as that each first feature matrix corresponds to its own attention weight factor, and the essence of the attention mechanism function may be described as a mapping from a query to a series of key-value pairs.
Calculating the second feature matrix based on the attention of each first feature matrix mainly comprises three steps. The first step is to calculate the similarity between the query (Q in the first feature matrix) and the key (K in the first feature matrix) to obtain an attention weight factor; common similarity functions include the dot product, concatenation, perceptrons and the like. The second step normalizes the attention weight factor, for example with a softmax function, to obtain an attention weight. The third step performs a weighted summation of the attention weight with the corresponding value (V in the first feature matrix) to obtain the second feature matrix.
In step S420, the attention weighting factor is normalized to obtain a corresponding attention weighting.
As described above, calculating the second feature matrix based on the attention of each first feature matrix mainly includes three steps. After the attention weight factor is obtained in step S410, in this step the attention weight factor may be normalized using a sigmoid function or a softmax function, which maps the variables into the interval [0, 1].
And S430, carrying out weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
Finally, the normalized attention weight is used to perform a weighted summation over the feature points contained in the corresponding value (V in the first feature matrix) to obtain the second feature matrix. At this point the second feature matrix incorporates the relevance among the features of the question sentence, so a more accurate result can be obtained in subsequent prediction or classification operations.
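A minimal sketch of steps S310 to S330 together with steps S410 to S430 is given below for illustration; the scaled dot-product similarity and the number of assigned weights (heads) are assumptions, since the embodiment leaves the similarity function and the number of assigned weights open.

```python
import torch
import torch.nn as nn

class TextSelfAttention(nn.Module):
    """Illustrative sketch of steps S310-S330 and S410-S430: self-attention over the first text feature."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dk = num_heads, dim // num_heads
        # One assigned weight triplet (W^Q, W^K, W^V) per head, realized as shared linear layers.
        self.wq, self.wk, self.wv = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)  # maps the spliced matrix back to the input dimension

    def _split(self, x):
        # (B, L, dim) -> (B, h, L, dk): one first feature matrix per assigned weight.
        B, L, _ = x.shape
        return x.view(B, L, self.h, self.dk).transpose(1, 2)

    def forward(self, first_text_feature):                       # (B, L, dim)
        Q = self._split(self.wq(first_text_feature))             # S310: linear transformations
        K = self._split(self.wk(first_text_feature))
        V = self._split(self.wv(first_text_feature))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.dk ** 0.5  # S410: similarity -> weight factors
        attn = torch.softmax(scores, dim=-1)                      # S420: normalization -> attention weights
        heads = torch.matmul(attn, V)                             # S430: weighted summation -> second feature matrices
        B, L, _ = first_text_feature.shape
        spliced = heads.transpose(1, 2).reshape(B, L, self.h * self.dk)  # S330: splice the heads
        return self.out(spliced)                                  # second text feature, same dimension as input
```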
In one embodiment of the present application, as shown in fig. 5, the generating the second image feature corresponding to the first image feature based on the attention mechanism of the second text feature may specifically include step S510 to step S530, which are described in detail as follows:
Step S510, performing linear transformation on the second text feature and the first image feature by using a plurality of distribution weights to obtain a plurality of third feature matrixes and a plurality of fourth feature matrixes, wherein one distribution weight corresponds to one third feature matrix and one fourth feature matrix.
In addition to applying the self-attention mechanism to the first text feature, a second image feature corresponding to the first image feature may be generated based on the attention mechanism of the second text feature, so that the second image feature includes a correlation between the target picture and the question sentence.
Specifically, the generating process of the second image feature may be that first, a plurality of assigned weights are used to perform linear transformation on the second text feature and the first image feature respectively, so as to obtain a plurality of third feature matrices and a plurality of fourth feature matrices, where one assigned weight corresponds to one third feature matrix and one fourth feature matrix.
It should be noted that the third feature matrix may be represented by Q, where Q denotes the query; the fourth feature matrix is represented by K and V, where K and V denote the key and the value, respectively. The dimensions of Q, K and V are the same, and the assigned weight may be a triplet (W^Q, W^K, W^V).
For example, assuming that the second text feature is X and the first image feature is I, the j-th assigned weight (W_j^Q, W_j^K, W_j^V) is used to linearly transform the second text feature X and the first image feature I respectively, so that the j-th third feature matrix can be expressed as Q_j = X·W_j^Q, and the j-th fourth feature matrix can be expressed as K_j = I·W_j^K, V_j = I·W_j^V.
Step S520, generating a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix based on the attention mechanism of each third feature matrix, so as to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with each third feature matrix is a fourth feature matrix with the same allocation weight as that corresponding to each third feature matrix.
Further, a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix may be generated based on the attention mechanism of each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with each third feature matrix is a fourth feature matrix with the same assigned weight as the corresponding third feature matrix.
For example, based on the j-th third feature matrix, a fifth feature matrix corresponding to the j-th fourth feature matrix is generated. The implementation of this step is similar to the implementation of step 230 described above, and embodiments of the present application are not described in detail herein.
And step S530, splicing the plurality of fifth feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes into the same dimension as the first image features to obtain second image features corresponding to the first image features.
Finally, a plurality of fifth feature matrices can be spliced together, and then the spliced feature matrices are mapped into dimensions identical to those of the first image features, so that second image features corresponding to the first image features can be obtained.
Alternatively, in one implementation, as shown in fig. 6, step S520 may specifically include steps S610-S630, which are specifically described as follows:
in step S610, similarity calculation is performed on each third feature matrix and the transposed matrix of the fourth feature matrix, to obtain attention weighting factors of the transposed matrix corresponding to each third feature matrix.
In the embodiment of the present application, the attention mechanism based on each third feature matrix can be understood as the third feature matrix producing the attention weight factors for its associated fourth feature matrix, and the essence of the attention function can be described as a mapping from a query to a series of key-value pairs.
Calculating the fifth feature matrix based on the attention of each third feature matrix mainly comprises three steps. The first step is to calculate the similarity between the query (the third feature matrix Q) and the transpose of the key (K in the associated fourth feature matrix) to obtain an attention weight factor; common similarity functions include the dot product, concatenation, perceptrons and the like. The second step normalizes the attention weight factor, for example with a softmax function, to obtain an attention weight. The third step performs a weighted summation of the attention weight with the corresponding value (V in the associated fourth feature matrix) to obtain the fifth feature matrix.
In step S620, the attention weighting factor is normalized to obtain a corresponding attention weighting.
In this step, the attention weight factor may be normalized using a sigmoid function or a softmax function, which maps the variables into the interval [0, 1].
And step 630, performing weighted summation calculation on the feature points contained in the associated fourth feature matrix by using the attention weight to obtain a fifth feature matrix corresponding to the fourth feature matrix associated with each third feature matrix.
Finally, the normalized attention weight is used to perform a weighted summation over the feature points contained in the corresponding value (V in the associated fourth feature matrix) to obtain the fifth feature matrix. At this point the fifth feature matrix incorporates the relevance between the target picture and the question sentence, so a more accurate result can be obtained in subsequent prediction or classification operations.
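A corresponding illustrative sketch of steps S510 to S530 together with steps S610 to S630 is given below, under the same assumptions as the text self-attention sketch above, with the second text feature providing the queries and the first image feature providing the keys and values.

```python
import torch
import torch.nn as nn

class GuidedImageAttention(nn.Module):
    """Illustrative sketch of steps S510-S530 and S610-S630: image attention guided by the second text feature."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dk = num_heads, dim // num_heads
        self.wq = nn.Linear(dim, dim)   # third feature matrices: from the second text feature
        self.wk = nn.Linear(dim, dim)   # fourth feature matrices: from the first image feature
        self.wv = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)  # maps the spliced matrix back to the image feature dimension

    def _split(self, x):
        B, n, _ = x.shape
        return x.view(B, n, self.h, self.dk).transpose(1, 2)      # (B, h, n, dk)

    def forward(self, second_text_feature, first_image_feature):  # (B, L, dim), (B, R, dim)
        Q = self._split(self.wq(second_text_feature))             # S510: linear transformations
        K = self._split(self.wk(first_image_feature))
        V = self._split(self.wv(first_image_feature))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.dk ** 0.5  # S610: similarity with transposed keys
        attn = torch.softmax(scores, dim=-1)                      # S620: normalization -> weights over image regions
        heads = torch.matmul(attn, V)                              # S630: weighted summation -> fifth feature matrices
        B, L, _ = second_text_feature.shape
        spliced = heads.transpose(1, 2).reshape(B, L, self.h * self.dk)  # S530: splice the heads
        return self.out(spliced)                                   # second image feature
```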
In one embodiment of the present application, the classification processing of the question sentence based on the joint feature may be performed by a classification model, as shown in fig. 7, in which the step of generating the answer of the question sentence according to the joint feature may specifically include steps S710 to S720, which are described as follows:
Step S710, inputting the joint features into a classification model, wherein the classification model is obtained by training according to a joint loss function, the joint loss function is constructed according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target area in a sample picture, and the target area is an area determined in the sample picture according to a sample question corresponding to the sample picture.
Specifically, after the joint features are obtained, the joint features may be input into the classification model, and classification processing may be performed by the classification model. The classification model is obtained by training according to a joint loss function and can output the prediction probability of the question sentence for each answer according to the input joint features. Each answer can be preset during model training, and each answer can be regarded as a category, such as red, two, men, hats, sky, animals, dancing and the like.
Before the classification model is adopted for classification processing, the classification model needs to be trained according to the joint loss function. The construction of the joint loss function includes two parts: first, a loss value between the output result and the expected output result of the classification model; second, an attention weight corresponding to a target area in a sample picture, where the sample picture is a picture used for training the classification model and the target area is an area determined in the sample picture according to the sample question corresponding to the sample picture.
Alternatively, in one embodiment, when determining the target area, the keyword may be first extracted from the sample question, and then an area matching the keyword is determined in the sample picture according to the keyword, and the matched area is taken as the target area. The meaning of matching may be that the similarity between the image feature information and the keyword in the sample picture is greater than a preset threshold, that is, if the similarity between the image feature information and the keyword in a certain region in the sample picture is greater than the preset threshold, the region is considered to be a region matched with the keyword.
Fig. 8 schematically shows a sample picture whose corresponding sample question is "who is wearing a scarf". According to the keywords "who" and "scarf" in the sample question, a target area 807177 corresponding to the keyword "scarf" and a target area 807173 corresponding to the keyword "who" may be determined in the sample picture.
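The patent only requires that the similarity between a region's image feature information and the keyword exceed a preset threshold; one of many possible realizations of that test is sketched below, where cosine similarity, the threshold value and the function name find_target_regions are assumptions made for illustration.

    import numpy as np

    def find_target_regions(region_feats, keyword_feat, threshold=0.8):
        """region_feats: (num_regions, d) image feature of each candidate region
           keyword_feat: (d,) embedding of a keyword extracted from the sample question"""
        sims = region_feats @ keyword_feat / (
            np.linalg.norm(region_feats, axis=1) * np.linalg.norm(keyword_feat) + 1e-8)
        # every region whose similarity exceeds the preset threshold is treated as a target area
        return [i for i, s in enumerate(sims) if s > threshold]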
Referring to fig. 7, in step S720, the prediction probabilities of the question sentences output by the classification model for the respective answers are obtained, and the answers of the question sentences are determined according to the prediction probabilities.
Specifically, after the joint features are input into the classification model, the classification model may output the prediction probabilities of the question sentences for the respective answers, and further, the answers of the question sentences may be determined according to the output prediction probabilities.
In one possible implementation manner, the answer of the question sentence is determined according to the output prediction probability, that is, the prediction value of the question sentence for each answer is calculated according to the prediction probability and the corresponding weight, and then the answer corresponding to the maximum prediction value is used as the answer of the question sentence. The weight corresponding to the prediction probability can be determined according to actual experience.
In another possible implementation manner, the answer of the question sentence may be determined according to the output prediction probabilities, where the answer corresponding to the maximum prediction probability in the prediction probabilities is obtained, and then the answer corresponding to the maximum prediction probability is used as the answer of the question sentence.
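Both selection strategies reduce to a few lines of code; the sketch below uses made-up probabilities, weights and answer labels purely for illustration.

    answers = ["A1", "B1", "C1", "D1"]
    pred_probs = [0.2, 0.1, 0.5, 0.2]   # prediction probabilities from the classification model

    # Strategy 1: weight the probabilities (weights chosen from experience), then take the maximum.
    weights = [1.0, 1.2, 0.8, 1.0]
    pred_values = [p * w for p, w in zip(pred_probs, weights)]
    answer_1 = answers[pred_values.index(max(pred_values))]

    # Strategy 2: simply take the answer with the maximum prediction probability.
    answer_2 = answers[pred_probs.index(max(pred_probs))]
    print(answer_1, answer_2)   # both "C1" with these made-up numbers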
In one embodiment of the present application, fig. 9 shows a flowchart of a training method of a classification model, and as shown in fig. 9, the training method of a classification model may specifically include steps S910 to S930, which are described in detail below:
step S910, inputting the sample picture and the sample question into the classification model to obtain an output result of the classification model, where the output result includes output probabilities of the sample question for each answer.
In practice, the classification model can be trained by acquiring sample pictures and sample questions. The sample questions are question sentences corresponding to the sample pictures, and the sample pictures and the sample questions have strong relevance.
After the sample picture and the sample question are acquired, the sample picture and the sample question can be input into a classification model, and the classification model can output the output probability of the sample question for each answer according to the input sample picture and sample question.
And S920, constructing a first loss function according to the labeling answers and the output probabilities of the sample questions, and constructing a second loss function according to the attention weights corresponding to the target areas.
After obtaining the output probabilities of the sample questions for the respective answers, a loss function, i.e., a first loss function, may be constructed according to the labeled answers and the output probabilities of the sample questions, and a second loss function may be constructed according to the attention weights corresponding to the target areas.
It should be noted that, the calculation method of the attention weight corresponding to the target area is similar to the calculation method of the attention weight mentioned in the above steps S510-S530, and thus will not be repeated.
In one embodiment of the present application, as shown in fig. 10, the step of constructing the first loss function according to the labeling answer and the output probability of the sample question may specifically include steps S1010 to S1030, which are described in detail as follows:
in step S1010, labeling probabilities for respective answers are generated from labeling answers of the sample questions.
It will be appreciated that after the sample picture and sample question are input into the classification model, the classification model may output the output probabilities of the sample question for the respective answers. The answers are preset, for example, assuming that the preset answers include four answers, the output probabilities that the sample questions respectively belong to the four answers can be output through the classification model.
Similarly, since the sample questions correspond to the labeling answers, the labeling answers are also one of the preset answers, and the labeling probability of the sample questions for each answer can be generated according to the labeling answers of the sample questions.
For example, the sample question is denoted as S, the preset answers include A1, B1, C1 and D1, and the labeled answer of the sample question is B1, so that the labeling probabilities of the sample question for the respective answers may be generated to be 0,1,0 and 0, respectively.
In step S1020, the output probabilities of the sample questions for the respective answers are logarithmically calculated, so as to obtain calculation results for the respective answers.
When constructing the first loss function, in addition to generating the labeling probability of the sample question for each answer, the output probability of the sample question for each answer can be subjected to a logarithmic operation at the same time, so as to obtain an operation result for each answer.
In step S1030, a first loss function is determined from the sum of products of the labeling probabilities for the respective answers and the operation results for the respective answers.
After obtaining the labeling probability and the operation result, the first loss function may be determined according to the sum of products of the labeling probability for each answer and the operation result for each answer. The expression of the first loss function Loss_1 may be as shown in formula (1):
Loss_1 = -Σ_i (x_i × y_i)    formula (1)
where x_i is the labeling probability for the i-th answer and y_i is the i-th operation result (the logarithm of the output probability for the i-th answer).
For ease of understanding, assume that the sample question is denoted as S, the preset answers include A1, B1, C1 and D1, and the labeled answer of the sample question is B1, so that the generated labeling probabilities of the sample question for the respective answers are 0, 1, 0 and 0; if the output probabilities of the sample question S for the four answers output by the classification model are 0.2, 0.1, 0.5 and 0.2, respectively, the first loss function may be calculated as Loss_1 = -(0×log 0.2 + 1×log 0.1 + 0×log 0.5 + 0×log 0.2) = -log 0.1, which equals 1 with base-10 logarithms (about 2.3 with natural logarithms).
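The computation of the first loss function can be reproduced with a few lines of Python; since the patent does not fix the base of the logarithm, the sketch below shows both the base-10 and the natural-logarithm variants of the worked example above.

    import math

    label_probs = [0, 1, 0, 0]            # labeling probabilities (labeled answer B1)
    output_probs = [0.2, 0.1, 0.5, 0.2]   # output probabilities of the classification model

    def loss1(xs, ys, log=math.log10):
        # negative sum of products of labeling probabilities and the log of the output probabilities
        return -sum(x * log(y) for x, y in zip(xs, ys))

    print(loss1(label_probs, output_probs))                 # 1.0 with base-10 logarithms
    print(loss1(label_probs, output_probs, log=math.log))   # ~2.303 with natural logarithms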
In one embodiment of the present application, as shown in fig. 11, the step of constructing the second loss function according to the attention weight corresponding to the target area may specifically include steps S1110 to S1120, which are specifically described as follows:
in step S1110, a difference between the attention weight corresponding to the target area and a preset threshold is calculated.
In this embodiment, when the second loss function is constructed according to the attention weight corresponding to the target area, a difference between the attention weight corresponding to the target area and a preset threshold may be calculated in advance, where the preset threshold may be obtained through experimental data, for example, the preset threshold is 0.8.
In step S1120, a second loss function is constructed according to the calculated difference.
Further, a second loss function may be constructed based on the calculated difference. The expression of the second loss function Loss_2 may be as shown in formula (2):
Loss_2 = max{(p - A), 0}    formula (2)
where p is the preset threshold and A is the attention weight corresponding to the target area.
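A minimal sketch of the second loss function is given below; it illustrates that the loss is zero once the target area already receives at least the threshold amount of attention, and grows as the attention on the target area falls short. The example values are made up.

    def loss2(attention_weight: float, p: float = 0.8) -> float:
        # hinge on the attention weight assigned to the target area
        return max(p - attention_weight, 0.0)

    print(loss2(0.9))   # 0.0 -- the target area already receives enough attention
    print(loss2(0.5))   # 0.3 -- attention on the target area is too low, so a penalty is added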
Referring to fig. 9, in step S930, a joint loss function is constructed according to the first loss function and the second loss function, and a classification model is trained based on the joint loss function, so as to obtain a trained classification model.
Further, after the first loss function and the second loss function are constructed, the computer device may construct the joint loss function according to the first loss function and the second loss function, and adjust the model parameters of the classification model in the direction that minimizes the joint loss function. By continuously updating the model parameters so as to reduce the joint loss function, the model parameters that minimize the joint loss function are determined according to the minimization principle, thereby obtaining the trained classification model.
In some embodiments, the step of constructing a joint loss function from the first loss function and the second loss function may specifically include: firstly, calculating the product of a preset adjustment factor and a second loss function to obtain an operation result; then, the operation result and the first loss function are added to obtain an addition result, and the addition result is used as a joint loss function.
Illustratively, the expression of the joint loss function Loss may be as shown in formula (3):
Loss = Loss_1 + α × Loss_2    formula (3)
where α is a preset adjustment factor used to adjust the proportion of Loss_2 in Loss; in general, the value of α may be 1.
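Putting the two parts together, one possible training step with the joint loss might look like the PyTorch sketch below. The use of F.cross_entropy for the first loss (which applies natural logarithms and an internal softmax), the way the target-area attention weight is read out of the model, and the names model, optimizer and joint_loss are all assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def joint_loss(logits, label_index, target_attention, p=0.8, alpha=1.0):
        # Loss_1: cross-entropy between the output probabilities and the labeled answer
        loss_1 = F.cross_entropy(logits, label_index)
        # Loss_2: hinge on the attention weight assigned to the target area
        loss_2 = torch.clamp(p - target_attention, min=0.0).mean()
        return loss_1 + alpha * loss_2

    # inside a training loop (model and optimizer assumed to exist):
    # logits, target_attention = model(sample_picture, sample_question)
    # loss = joint_loss(logits, label_index, target_attention)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()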
In one embodiment of the present application, the sample picture and the sample question may be input to the classification model after being processed, as shown in fig. 12, and in this embodiment, the step of inputting the sample picture and the sample question to the classification model may specifically include steps S1210-S1230, which are described in detail below:
step S1210, performing feature extraction on the sample picture to obtain a first sample image feature of the sample picture, and performing feature extraction on the sample question to obtain a first sample feature of the sample question.
For pictures, each picture has its own characteristics that can distinguish it from other pictures; some are intuitively perceivable natural characteristics such as brightness, edges, texture and color, while others are obtained by transformation or processing, such as moments, histograms and principal components. In this embodiment, the first sample image feature may be expressed as a feature vector. The method for extracting the features of the sample picture may include: geometric feature extraction, model-based feature extraction, signal-processing-based feature extraction, and the like.
For the sample question, text vectorization may be performed on the sample question to obtain the first sample text feature of the sample question; the extraction of the first sample text feature may be accomplished by a corresponding text extraction algorithm model, for example an embedding network model.
Step S1220, based on the attention mechanism of the first sample text feature, generates a second sample text feature corresponding to the first sample text feature, and based on the attention mechanism of the second sample text feature, generates a second sample image feature corresponding to the first sample image feature.
In this embodiment, a self-attention mechanism may be applied to the first sample text feature, and the internal relevance of each word vector in the first sample text feature is captured through the self-attention mechanism. The relevance may be an attention weight, and by assigning the attention weight to each word vector in the first sample text feature, a second sample text feature corresponding to the first sample text feature is generated, so that the second sample text feature contains the internal relevance between the word vectors in the sample question.
Meanwhile, after the second sample text feature is generated, based on an attention mechanism, capturing the relevance between the second sample text feature and the first sample image feature, wherein the relevance can be attention weight, and the attention weight is distributed to the corresponding first sample image feature to obtain the second sample image feature corresponding to the first sample image feature, so that the second sample image feature comprises the relevance between the sample picture and the sample problem.
And step S1230, carrying out feature combination on the second sample text features and the second sample image features to obtain joint sample features, and inputting the joint sample features into the classification model.
After the attention mechanism is applied to obtain the second sample text feature and the second sample image feature, at this time, feature combination may be performed on the second sample text feature and the second sample image feature to obtain a joint sample feature, and then the joint sample feature is input into the classification model.
Fig. 13 illustrates a logic diagram of a picture-based question-answering processing method according to one embodiment of the present application, in which an executing body can receive a picture and a question, generate an answer to the question for the picture, and thus complete the picture question-answering. As shown in fig. 13, in the present embodiment, the picture-based question-answering processing method may specifically include the steps of:
S1, acquiring a target picture and a problem statement.
This step mainly imports the data, namely the target picture and the question sentence. As shown in fig. 13, the question sentence corresponding to the target picture shown in fig. 13 is "What animal is in the box?".
S2, generating a second text feature based on the attention mechanism of the first text feature, and generating a second image feature based on the attention mechanism of the second text feature.
After the target picture and the question sentence are imported through the step S1, the characteristics of the question sentence and the target picture may be extracted, respectively, to obtain the first text characteristic T and the first image characteristic I, respectively.
Further, a self-attention mechanism can be applied to the first text feature T to generate a second text feature X, and meanwhile, after the second text feature X is generated, based on the attention mechanism, the relevance between the second text feature X and the first image feature I is captured, the relevance can be attention weight, and the second image feature Y corresponding to the first image feature I is obtained by distributing the attention weight to the corresponding first image feature I, so that the second image feature Y comprises the relevance between the target picture and the problem statement.
And S3, merging to obtain joint features, and classifying based on the joint features to obtain an answer.
After applying the attention mechanism to obtain the second text feature and the second image feature, the second text feature and the second image feature may be feature-combined.
To facilitate feature merging, feature dimension compression may be performed before merging, for example by a Multi-Layer Perceptron (MLP), to obtain a compressed second text feature and a compressed second image feature, respectively. Subsequently, the compressed second text feature and the compressed second image feature can be feature-merged to obtain the joint feature.
After the joint feature is obtained, the question sentence "What animal is in the box?" can be classified based on the joint feature to obtain its answer, and the picture question answering is thereby completed.
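The merging and classification stage described above can be sketched as follows; pooling over words and regions, the hidden sizes of the compression MLPs and the single-layer classifier head are assumptions, since the patent does not fix them.

    import torch
    import torch.nn as nn

    class MergeAndClassify(nn.Module):
        """Illustrative sketch: compress X and Y with small MLPs, splice them into
        the joint feature, and classify over the preset answers."""
        def __init__(self, dim: int, compressed_dim: int, num_answers: int):
            super().__init__()
            self.compress_text = nn.Sequential(nn.Linear(dim, compressed_dim), nn.ReLU())
            self.compress_image = nn.Sequential(nn.Linear(dim, compressed_dim), nn.ReLU())
            self.classifier = nn.Linear(2 * compressed_dim, num_answers)

        def forward(self, second_text_feat, second_image_feat):
            # pool over words / regions, then compress each modality with an MLP
            x = self.compress_text(second_text_feat.mean(dim=1))
            y = self.compress_image(second_image_feat.mean(dim=1))
            joint = torch.cat([x, y], dim=-1)   # joint feature
            return self.classifier(joint)       # per-answer scores (logits)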
Three groups of experiments were carried out based on the picture-based question-answering processing method provided by the embodiment of the present application. For convenience of description, refer to fig. 14, which compares the method provided by the present application with the prior-art MCAN method in terms of the area attended to by the attention mechanism and the answer given; a rectangular box in each picture represents the area attended to by the attention mechanism.
As can be seen from fig. 14, in the first picture, for the question "what material the sidewalk close to the street lamp is built from", the prior art attends to the wrong area (the area where the street lamp is located) and therefore gives the wrong answer "metal", whereas the present application attends to the correct area (the sidewalk close to the street lamp) and gives the right answer "concrete"; in the second picture, for the question "what is the animal eating leaves", the prior art again focuses on the wrong region (the region containing the zebra) and gives the wrong answer "zebra", while the present application focuses on the correct region (the region where the giraffe is located) and gives the correct answer "giraffe"; in the third picture, for the question "which company made the cap", the prior art attends to multiple regions and gives the wrong answer "Nike", whereas the present application attends to the correct region (the region where the cap is located) and gives the right answer "Adidas".
It can be seen from this comparison that the method provided by the present application gives correct answers to all three questions because its attention mechanism focuses on the correct areas, whereas the MCAN method focuses on the wrong areas and therefore gives wrong answers.
The following describes an embodiment of the apparatus of the present application, which may be used to perform the picture-based question-answering processing method in the above embodiment of the present application. For details not disclosed in the embodiment of the apparatus of the present application, please refer to the embodiment of the above-mentioned picture-based question-answering processing method of the present application.
Fig. 15 shows a block diagram of a picture-based question-answering processing apparatus according to one embodiment of the present application, and referring to fig. 15, a picture-based question-answering processing apparatus 1500 according to one embodiment of the present application includes: an acquisition unit 1502, an extraction unit 1504, a generation unit 1506, and a merging unit 1508.
The acquiring unit 1502 is configured to acquire a target picture and a problem statement corresponding to the target picture; the extracting unit 1504 is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the question sentence to obtain a first text feature of the question sentence; the generating unit 1506 is configured to generate a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, and generate a second image feature corresponding to the first image feature based on the attention mechanism of the second text feature; the merging unit 1508 is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer of the question sentence according to the joint feature.
In some embodiments of the present application, the generating unit 1506 includes: the first linear transformation subunit is configured to respectively perform linear transformation on the first text features by using a plurality of distribution weights to obtain a plurality of first feature matrixes, wherein one first feature matrix corresponds to one distribution weight; the first generation subunit is configured to generate second feature matrixes corresponding to the first feature matrixes based on the attention mechanisms of the first feature matrixes so as to obtain a plurality of second feature matrixes; and the first splicing subunit is configured to splice the plurality of second feature matrixes to obtain a spliced feature matrix, map the spliced feature matrix to the same dimension as the first text feature, and obtain a second text feature corresponding to the first text feature.
In some embodiments of the application, the first generation subunit is configured to: performing similarity calculation on each first feature matrix and a transposed matrix of each first feature matrix to obtain attention weight factors of the transposed matrix corresponding to each first feature matrix; normalizing the attention weight factors to obtain corresponding attention weights; and carrying out weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
In some embodiments of the present application, the generating unit 1506 includes: the second linear transformation subunit is configured to respectively perform linear transformation on the second text feature and the first image feature by using a plurality of distribution weights to obtain a plurality of third feature matrixes and a plurality of fourth feature matrixes, wherein one distribution weight corresponds to one third feature matrix and one fourth feature matrix; a second generating subunit configured to generate, based on an attention mechanism of each third feature matrix, a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix to obtain a plurality of fifth feature matrices, where the fourth feature matrix associated with each third feature matrix is a fourth feature matrix with the same allocation weight as that corresponding to each third feature matrix; and the second splicing subunit is configured to splice the plurality of fifth feature matrixes to obtain a spliced feature matrix, map the spliced feature matrix to the same dimension as the first image feature, and obtain a second image feature corresponding to the first image feature.
In some embodiments of the application, the second generation subunit is configured to: performing similarity calculation on the third feature matrices and the transposed matrices of the fourth feature matrices to obtain attention weight factors of the third feature matrices corresponding to the transposed matrices; normalizing the attention weight factors to obtain corresponding attention weights; and carrying out weighted summation calculation on the feature points contained in the associated fourth feature matrix by using the attention weight to obtain a fifth feature matrix corresponding to the fourth feature matrix associated with each third feature matrix.
In some embodiments of the present application, the merging unit 1508 includes: the input subunit is configured to input the joint characteristics into a classification model, the classification model is obtained by training according to a joint loss function, the joint loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target area in a sample picture, and the target area is an area determined in the sample picture according to a sample problem corresponding to the sample picture; and the determining subunit is configured to acquire the prediction probability of the question sentences output by the classification model for each answer and determine the answers of the question sentences according to the prediction probability.
In some embodiments of the application, the determining subunit is configured to: acquiring an answer corresponding to the maximum prediction probability in the prediction probabilities; and taking the answer corresponding to the maximum prediction probability as the answer of the question sentence.
In some embodiments of the application, the apparatus further comprises: the input unit is configured to input the sample picture and the sample questions into the classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample questions for each answer; the construction unit is configured to construct a first loss function according to the labeling answers of the sample questions and the output probabilities, and construct a second loss function according to the attention weights corresponding to the target areas; the training unit is configured to construct the joint loss function according to the first loss function and the second loss function, train the classification model based on the joint loss function and obtain a trained classification model.
In some embodiments of the application, the building unit is configured to: generating labeling probability for each answer according to the labeling answers of the sample questions; carrying out logarithmic operation on the output probability of each answer of the sample questions to obtain operation results of each answer; and determining the first loss function according to the sum of products of labeling probabilities for the answers and operation results for the answers.
In some embodiments of the application, the building unit is configured to: calculating a difference value between the attention weight corresponding to the target area and a preset threshold value; and constructing the second loss function according to the calculated difference value.
In some embodiments of the application, the training unit is configured to: calculating the product of a preset adjustment factor and the second loss function to obtain an operation result; and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
In some embodiments of the application, the input unit is configured to: extracting features of the sample picture to obtain first sample image features of the sample picture, and extracting features of the sample question to obtain first sample text features of the sample question; generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature; and carrying out feature combination on the second sample text features and the second sample image features to obtain combined sample features, and inputting the combined sample features into the classification model.
Fig. 16 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1600 of the electronic device shown in fig. 16 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 16, the computer system 1600 includes a central processing unit (Central Processing Unit, CPU) 1601 that can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1602 or a program loaded from a storage section 1608 into a random access Memory (Random Access Memory, RAM) 1603. In the RAM 1603, various programs and data required for system operation are also stored. The CPU 1601, ROM 1602, and RAM 1603 are connected to each other by a bus 1604. An Input/Output (I/O) interface 1605 is also connected to bus 1604.
The following components are connected to the I/O interface 1605: an input portion 1606 including a keyboard, a mouse, and the like; an output portion 1607 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and the like, a speaker, and the like; a storage section 1608 including a hard disk or the like; and a communication section 1609 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1609 performs communication processing via a network such as the internet. The drive 1610 is also connected to the I/O interface 1605 as needed. A removable medium 1611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1610 so that a computer program read out therefrom is installed into the storage section 1608 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1609, and/or installed from the removable media 1611. When executed by a Central Processing Unit (CPU) 1601, performs various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A picture-based question-answering processing method, the method comprising:
acquiring a target picture and a problem statement corresponding to the target picture;
extracting features of the target picture to obtain a first image feature of the target picture, and extracting features of the problem statement to obtain a first text feature of the problem statement;
generating a second text feature corresponding to the first text feature based on the attention mechanism of the first text feature, capturing attention weights between the second text feature and the first image feature based on the attention mechanism of the second text feature, and generating a second image feature corresponding to the first image feature by distributing the attention weights to the corresponding first image feature;
And combining the second text features and the second image features to obtain joint features, and generating answers of the question sentences according to the joint features.
2. The method of claim 1, wherein generating a second text feature corresponding to the first text feature based on an attention mechanism of the first text feature comprises:
respectively carrying out linear transformation on the first text features by using a plurality of distribution weights to obtain a plurality of first feature matrixes, wherein one first feature matrix corresponds to one distribution weight;
generating second feature matrixes corresponding to the first feature matrixes based on the attention mechanisms of the first feature matrixes so as to obtain a plurality of second feature matrixes;
and splicing the plurality of second feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes into the same dimension as the first text features to obtain second text features corresponding to the first text features.
3. The method of claim 2, wherein generating a second feature matrix corresponding to each first feature matrix based on an attention mechanism of the first feature matrix comprises:
Performing similarity calculation on each first feature matrix and a transposed matrix of each first feature matrix to obtain attention weight factors of the transposed matrix corresponding to each first feature matrix;
normalizing the attention weight factors to obtain corresponding attention weights;
and carrying out weighted summation calculation on the feature points contained in each first feature matrix by using the attention weight to obtain a second feature matrix corresponding to each first feature matrix.
4. The method of claim 1, wherein generating a second image feature corresponding to the first image feature based on an attention mechanism of the second text feature comprises:
respectively carrying out linear transformation on the second text feature and the first image feature by using a plurality of distribution weights to obtain a plurality of third feature matrixes and a plurality of fourth feature matrixes, wherein one distribution weight corresponds to one third feature matrix and one fourth feature matrix;
generating fifth feature matrixes corresponding to fourth feature matrixes associated with the third feature matrixes based on the attention mechanism of the third feature matrixes to obtain a plurality of fifth feature matrixes, wherein the fourth feature matrixes associated with the third feature matrixes are fourth feature matrixes with the same distribution weights as the fourth feature matrixes corresponding to the third feature matrixes;
And splicing the plurality of fifth feature matrixes to obtain spliced feature matrixes, and mapping the spliced feature matrixes into dimensions identical to the dimensions of the first image features to obtain second image features corresponding to the first image features.
5. The method of claim 4, wherein generating a fifth feature matrix corresponding to a fourth feature matrix associated with each third feature matrix based on an attention mechanism of the respective third feature matrix comprises:
performing similarity calculation on the third feature matrices and the transposed matrices of the fourth feature matrices to obtain attention weight factors of the third feature matrices corresponding to the transposed matrices;
normalizing the attention weight factors to obtain corresponding attention weights;
and carrying out weighted summation calculation on the feature points contained in the associated fourth feature matrix by using the attention weight to obtain a fifth feature matrix corresponding to the fourth feature matrix associated with each third feature matrix.
6. The method of any of claims 1-5, wherein generating an answer to the question sentence from the joint feature comprises:
Inputting the joint characteristics into a classification model, wherein the classification model is obtained by training according to a joint loss function, the joint loss function is obtained by constructing according to a loss value between an output result and an expected output result of the classification model and an attention weight corresponding to a target area in a sample picture, and the target area is an area determined in the sample picture according to a sample problem corresponding to the sample picture;
and obtaining the prediction probability of the question sentences output by the classification model for each answer, and determining the answers of the question sentences according to the prediction probability.
7. The method of claim 6, wherein determining an answer to the question sentence based on the predictive probability comprises:
acquiring an answer corresponding to the maximum prediction probability in the prediction probabilities;
and taking the answer corresponding to the maximum prediction probability as the answer of the question sentence.
8. The method of claim 6, wherein the method further comprises:
inputting the sample picture and the sample questions into the classification model to obtain an output result of the classification model, wherein the output result comprises output probabilities of the sample questions for each answer;
Constructing a first loss function according to the labeling answers of the sample questions and the output probabilities, and constructing a second loss function according to the attention weights corresponding to the target areas;
and constructing the joint loss function according to the first loss function and the second loss function, and training the classification model based on the joint loss function to obtain a trained classification model.
9. The method of claim 8, wherein constructing a first loss function from the labeled answers to the sample questions and the output probabilities comprises:
generating labeling probability for each answer according to the labeling answers of the sample questions;
carrying out logarithmic operation on the output probability of each answer of the sample questions to obtain operation results of each answer;
and determining the first loss function according to the sum of products of labeling probabilities for the answers and operation results for the answers.
10. The method of claim 8, wherein constructing a second loss function from the attention weights corresponding to the target region comprises:
calculating a difference value between the attention weight corresponding to the target area and a preset threshold value;
And constructing the second loss function according to the calculated difference value.
11. The method of claim 8, wherein constructing the joint loss function from the first loss function and the second loss function comprises:
calculating the product of a preset adjustment factor and the second loss function to obtain an operation result;
and adding the operation result and the first loss function to obtain an addition result, and taking the addition result as the joint loss function.
12. The method of claim 8, wherein inputting the sample picture and the sample question into the classification model comprises:
extracting features of the sample picture to obtain first sample image features of the sample picture, and extracting features of the sample question to obtain first sample text features of the sample question;
generating a second sample text feature corresponding to the first sample text feature based on the attention mechanism of the first sample text feature, and generating a second sample image feature corresponding to the first sample image feature based on the attention mechanism of the second sample text feature;
And carrying out feature combination on the second sample text features and the second sample image features to obtain combined sample features, and inputting the combined sample features into the classification model.
13. A picture-based question-answering apparatus, the apparatus comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a target picture and a problem statement corresponding to the target picture;
the extraction unit is configured to perform feature extraction on the target picture to obtain a first image feature of the target picture, and perform feature extraction on the problem statement to obtain a first text feature of the problem statement;
a generating unit configured to generate a second text feature corresponding to the first text feature based on an attention mechanism of the first text feature, capture an attention weight between the second text feature and the first image feature based on an attention mechanism of the second text feature, and generate a second image feature corresponding to the first image feature by assigning the attention weight to the corresponding first image feature;
and the merging unit is configured to perform feature merging on the second text feature and the second image feature to obtain a joint feature, and generate an answer of the question sentence according to the joint feature.
14. A computer readable medium on which a computer program is stored, which when executed by a processor implements the picture-based question-answering processing method according to any one of claims 1 to 12.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the picture-based question-answering method according to any one of claims 1 to 12.
CN202110548159.7A 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment Active CN113761153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110548159.7A CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110548159.7A CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113761153A CN113761153A (en) 2021-12-07
CN113761153B true CN113761153B (en) 2023-10-24

Family

ID=78787131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110548159.7A Active CN113761153B (en) 2021-05-19 2021-05-19 Picture-based question-answering processing method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113761153B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547273B (en) * 2022-03-18 2022-08-16 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer
CN114821605B (en) * 2022-06-30 2022-11-25 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium
CN114882504B (en) * 2022-07-06 2022-11-11 成都西交智汇大数据科技有限公司 Grading method, grading device, grading equipment and readable storage medium
CN115761273B (en) * 2023-01-10 2023-04-25 苏州浪潮智能科技有限公司 Visual sense common sense reasoning method and device, storage medium and electronic equipment
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
CN110263912A (en) * 2019-05-14 2019-09-20 杭州电子科技大学 A kind of image answering method based on multiple target association depth reasoning
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111680484A (en) * 2020-05-29 2020-09-18 北京理工大学 Answer model generation method and system for visual general knowledge reasoning question and answer
CN111767379A (en) * 2020-06-29 2020-10-13 北京百度网讯科技有限公司 Image question-answering method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
CN110263912A (en) * 2019-05-14 2019-09-20 杭州电子科技大学 A kind of image answering method based on multiple target association depth reasoning
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111680484A (en) * 2020-05-29 2020-09-18 北京理工大学 Answer model generation method and system for visual general knowledge reasoning question and answer
CN111767379A (en) * 2020-06-29 2020-10-13 北京百度网讯科技有限公司 Image question-answering method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Himanshu Sharma et al. Image and Vision Computing. 2021, Vol. 2021 (No. 110), 1-11. *

Also Published As

Publication number Publication date
CN113761153A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
WO2020228376A1 (en) Text processing method and model training method and apparatus
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US20210224601A1 (en) Video sequence selection method, computer device, and storage medium
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN111259647A (en) Question and answer text matching method, device, medium and electronic equipment based on artificial intelligence
CN111666416A (en) Method and apparatus for generating semantic matching model
CN114818691A (en) Article content evaluation method, device, equipment and medium
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN114611672A (en) Model training method, face recognition method and device
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN116431827A (en) Information processing method, information processing device, storage medium and computer equipment
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant