CN117312508A - Image-based question answering method, apparatus, device, storage medium, and program product


Info

Publication number
CN117312508A
Authority
CN
China
Prior art keywords
question
features
text
network
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311206363.6A
Other languages
Chinese (zh)
Inventor
陈亨达 (Chen Hengda)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202311206363.6A
Publication of CN117312508A
Legal status: Pending

Classifications

    • G06F 16/3329 - Information retrieval; querying of unstructured textual data; natural language query formulation or dialogue systems
    • G06F 16/38 - Information retrieval of unstructured textual data; retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5846 - Information retrieval of still image data using metadata automatically derived from the content, using extracted text
    • G06F 18/213 - Pattern recognition; feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F 18/253 - Pattern recognition; fusion techniques of extracted features
    • G06N 5/02 - Computing arrangements using knowledge-based models; knowledge representation; symbolic representation
    • G06N 5/04 - Computing arrangements using knowledge-based models; inference or reasoning models
    • G06V 10/40 - Image or video recognition or understanding; extraction of image or video features
    • G06V 10/806 - Image or video recognition or understanding; fusion of extracted features at the sensor, preprocessing, feature extraction or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to an image-based question answering method, apparatus, device, storage medium, and program product. By fusing visual object features, scene text features, and question features, the method can comprehensively consider the multimodal information of both the image and the question, so that the question is better understood and answered more accurately, thereby improving the accuracy of replies.

Description

Image-based question answering method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of visual question answering technology, and in particular, to an image-based question answering method, apparatus, device, storage medium, and program product.
Background
With the rapid development of financial technology, intelligent customer service systems are increasingly widely used by banks to provide online help and support for users. For example, a bank's intelligent customer service can automatically process user queries, provide financial product information, answer common questions, and the like.
Currently, when consulting about a financial service, a user typically sends related pictures to the bank's intelligent customer service and raises related questions in order to obtain personalized help and guidance. The intelligent customer service combines the pictures and the questions and, by means of image recognition and natural language processing technology, provides targeted answers and solutions for the user.
However, the answers that the bank's intelligent customer service gives to user questions are often inaccurate.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image-based question-answering method, apparatus, device, storage medium, and program product that can improve the accuracy of replies.
In a first aspect, the present application provides an image-based question-answering method, the method comprising:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
feeding back the answer to the user terminal.
In one embodiment, the target question-answering model includes a first fusion network and a second fusion network, and performing fusion processing on the visual object features, the scene text features, and the question features includes:
inputting the visual object features and the scene text features into the first fusion network for fusion processing to obtain initial fusion features; and
inputting the initial fusion features and the question features into the second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the target question-answering model further includes a first feature extraction network and a second feature extraction network, and performing feature extraction on the target image and the question text through the target question-answering model to obtain the visual object features, the scene text features, and the question features includes:
inputting the target image into the first feature extraction network for feature extraction to obtain the visual object features and the scene text features; and
inputting the question text into the second feature extraction network for feature extraction to obtain the question features.
In one embodiment, the target question-answering model further includes a dynamic pointer network, and obtaining the answer corresponding to the question text based on the fusion processing result includes:
inputting the fusion processing result into the dynamic pointer network to dynamically generate a pointer position, where the pointer position is used to indicate the position range, in the output sequence, of the answer corresponding to the question text; and
determining the answer corresponding to the question text according to the pointer position.
In one embodiment, the target question-answering model further includes a third feature extraction network, and the method further includes:
inputting the target image into the third feature extraction network for feature extraction to obtain global grid features.
Correspondingly, inputting the initial fusion features and the question features into the second fusion network for fusion processing to obtain the fusion processing result includes:
inputting the global grid features, the initial fusion features, and the question features into the second fusion network for fusion processing to obtain the fusion processing result.
In one embodiment, the method further comprises:
acquiring a training sample set, the training sample set including a plurality of training samples and labels of the training samples, where each training sample includes a sample image and a sample question corresponding to the sample image, the sample image includes a sample object and sample scene text, and the label includes a sample answer corresponding to the sample question; and
performing joint training on an initial feature extraction network, an initial fusion network, and an initial question-answering network in an initial question-answering model based on the training sample set to obtain the target question-answering model.
In a second aspect, the present application further provides an image-based question answering apparatus, the apparatus including:
the acquisition module is used for acquiring a target image and a question text corresponding to the target image from a user terminal;
the question-answering module is used for inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on these features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
the feedback module is used for feeding back the answer to the user terminal.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
feeding back the answer to the user terminal.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
feeding back the answer to the user terminal.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
feeding back the answer to the user terminal.
In the above image-based question answering method, apparatus, device, storage medium, and program product, a target image and a question text corresponding to the target image are acquired from a user terminal; the target image and the question text are then input into a pre-trained target question-answering model, which performs feature extraction on the target image and the question text respectively to obtain visual object features, scene text features, and question features, performs fusion processing on these features, and obtains an answer corresponding to the question text based on the fusion processing result; finally, the answer is fed back to the user terminal. By fusing the visual object features, the scene text features, and the question features, the method can comprehensively consider the multimodal information of both the image and the question, so that the question is better understood and answered more accurately, thereby improving the accuracy of replies.
Drawings
FIG. 1 is an application diagram of an image-based question-answering method in one embodiment;
FIG. 2 is a flow diagram of an image-based question-answering method in one embodiment;
FIG. 3 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 4 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 5 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 6 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 7 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 8 is a flow chart of an image-based question-answering method according to another embodiment;
FIG. 9 is a block diagram of the structure of an image-based question-answering apparatus in one embodiment;
FIG. 10 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
With the rapid development of financial technology, intelligent customer service systems are increasingly widely used by banks to provide online help and support for users; for example, a bank's intelligent customer service can automatically process user queries, provide financial product information, answer common questions, and the like. Currently, when consulting about a financial service, a user typically sends related pictures to the bank's intelligent customer service and raises related questions in order to obtain personalized help and guidance. The intelligent customer service combines the pictures and the questions and, by means of image recognition and natural language processing technology, provides targeted answers and solutions for the user. However, the answers that the bank's intelligent customer service gives to user questions are often inaccurate. The present application provides an image-based question answering method that aims to solve this technical problem; the following embodiments describe the method in detail.
The image-based question answering method provided by the embodiments of the present application can be applied to the application system shown in fig. 1. The computer device 01 communicates with the user terminal 02 via a network; the computer device 01 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. The user terminal 02 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices; the Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart in-vehicle devices, and the like, and the portable wearable devices may be smart watches, smart bracelets, head-mounted devices, and the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 1 is a block diagram of only part of the architecture relevant to the present application and does not limit the application system to which the present application may be applied; a particular application may include more or fewer components than those shown, combine some of the components, or have a different arrangement of components.
In one embodiment, as shown in fig. 2, an image-based question answering method is provided. The method is described here as applied to the computer device 01 in fig. 1, and includes the following steps:
s201, acquiring a target image and a question text corresponding to the target image from a user terminal.
The target image is an image that includes a visual object and scene text. A visual object refers to an object, person, animal, or other identifiable entity in the image. Scene text refers to textual information in the image, such as slogans, trademarks, placards, and subtitles.
In the embodiment of the present application, the user can select or photograph, on the user terminal, a target image related to the desired operation, such as a check, a draft, an identity card, or another financial document to be processed. The user also inputs a question text related to the target image, such as the amount of the check, the payee account number for a transfer, or the purpose of a loan application. The target image and the question text are then transmitted to the computer device over the network, and the computer device receives the target image and the corresponding question text from the user terminal.
S202, inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result.
The target question-answering model may be a neural network model for visual question answering. It can be applied in the financial field, for example as a bank's intelligent customer service, or in other fields. The visual object features characterize the visual objects in the target image, and the scene text features characterize the scene text in the target image.
In this embodiment of the present application, after obtaining the target image and the corresponding question text based on the above steps, the computer device may input them into the pre-trained target question-answering model, optionally after preprocessing. The target question-answering model performs feature extraction on the visual objects in the target image to obtain the visual object features, performs feature extraction on the scene text in the target image to obtain the scene text features, and analyzes the semantics and context of the question text to obtain the question features. The three kinds of features can then be fused directly to obtain a fusion processing result, and an answer matching the question is generated based on this result.
Optionally, the visual object features and the scene text features may first be fused, and the preliminary fusion result may then be fused again with the question features to obtain the fusion processing result, based on which an answer matching the question is generated.
Optionally, the scene text features and the question features may first undergo a preliminary fusion, and the preliminary fusion result may then be fused again with the question features to obtain the fusion processing result, based on which an answer matching the question is generated. Further features may also be fused on top of the visual object features, the scene text features, and the question features to obtain the fusion processing result, based on which an answer matching the question is generated.
For example, suppose the user provides an investment report image containing a stock chart and a company name, and asks "How is this company's stock performing?". The target question-answering model extracts features from the image to obtain visual object features such as the stock chart and the company logo, and processes the question text to understand the keywords "company" and "stock performance" together with other relevant information in the question. It can then analyze the company's stock performance based on the stock chart and the question text, and generate an answer similar to "Based on historical data, the company's stock has recently performed well."
S203, feeding back the answer to the user terminal.
In this embodiment of the present application, after obtaining the answer corresponding to the question text based on the above steps, the computer device may return the generated answer to the user terminal through a network connection, message transmission, or another communication channel.
In the above image-based question answering method, a target image and a question text corresponding to the target image are acquired from the user terminal; the target image and the question text are input into a pre-trained target question-answering model, which performs feature extraction on each of them to obtain visual object features, scene text features, and question features; these features are fused, an answer corresponding to the question text is obtained based on the fusion processing result, and the answer is finally fed back to the user terminal. By fusing the visual object features, the scene text features, and the question features, the method comprehensively considers the multimodal information of both the image and the question, so that the question is better understood and answered more accurately, improving the accuracy of replies.
In an embodiment, a specific implementation of the fusion processing of the visual object features, the scene text features, and the question features is further provided. The target question-answering model includes a first fusion network and a second fusion network, and as shown in fig. 3, the fusion processing in step S202 includes:
S301, inputting the visual object features and the scene text features into the first fusion network for fusion processing to obtain initial fusion features.
The first fusion network may be an attention-based fusion network, a multi-layer perceptron network, or a convolutional fusion network. The initial fusion features are features that combine the visual object features and the scene text features.
In this embodiment of the present application, after obtaining the visual object features, the scene text features, and the question features based on the above steps, the computer device may input the visual object features and the scene text features into the first fusion network for fusion processing. Specifically, the two feature vectors may be concatenated, or specific fusion operations such as element-wise addition or multiplication may be used, to generate the initial fusion features.
S302, inputting the initial fusion features and the question features into the second fusion network for fusion processing to obtain a fusion processing result.
The second fusion network may be a Transformer-based feature fusion network, or a network with the same architecture as the first fusion network.
In this embodiment of the present application, after obtaining the initial fusion features based on the above steps, the computer device may input the initial fusion features and the question features into the second fusion network for fusion processing. Specifically, the two feature vectors may be concatenated, or specific fusion operations such as element-wise addition or multiplication may be used, to finally generate the fusion processing result.
In the above embodiment, fusing the visual object features and the scene text features through the first fusion network allows the relationship between the scene text and the visual objects to be understood better, and further fusing the initial fusion features with the question features through the second fusion network allows the semantics and context of the question to be understood better. By fusing features twice, information from different modalities is integrated, improving the accuracy and efficiency of question answering.
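As an illustration of the two fusion stages, the following is a minimal sketch in PyTorch. It is an assumption for exposition only: the module name ConcatFusion, the concatenation-plus-linear-projection design, and the 512-dimensional features are not fixed by the present disclosure, which equally allows element-wise addition or multiplication or attention-based fusion.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuses two feature vectors by concatenation followed by a linear projection."""
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(torch.cat([a, b], dim=-1)))

# First fusion network: visual object features + scene text features -> initial fusion features
first_fusion = ConcatFusion(512, 512, 512)
# Second fusion network: initial fusion features + question features -> fusion processing result
second_fusion = ConcatFusion(512, 512, 512)

visual_obj = torch.randn(1, 512)  # placeholder visual object features
scene_text = torch.randn(1, 512)  # placeholder scene text features
question = torch.randn(1, 512)    # placeholder question features

initial = first_fusion(visual_obj, scene_text)  # S301
result = second_fusion(initial, question)       # S302
print(result.shape)  # torch.Size([1, 512])
```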
In an embodiment, a specific implementation for obtaining the visual object features, the scene text features, and the question features is further provided. The target question-answering model further includes a first feature extraction network and a second feature extraction network, and as shown in fig. 4, the feature extraction in step S202 includes:
S401, inputting the target image into the first feature extraction network for feature extraction to obtain the visual object features and the scene text features.
The first feature extraction network may be a convolutional neural network.
In this embodiment of the present application, after obtaining the target image based on the above steps, the computer device may preprocess it, for example by resizing, cropping, normalization, or color space conversion, to meet the input requirements of the first feature extraction network. The preprocessed target image is input into the first feature extraction network, which obtains the visual object features through multiple convolution and pooling operations. For example, low-level visual features such as edges, textures, and colors may be extracted first, followed by increasingly high-level semantic features such as object shapes, object parts, and object categories. The first feature extraction network can also extract text regions in the image through a text detection algorithm or OCR technology, and then convert the extracted text regions into feature vectors using word vectorization, a bag-of-words model, and the like, thereby obtaining the scene text features.
S402, inputting the question text into the second feature extraction network for feature extraction to obtain the question features.
The second feature extraction network may be a natural language processing network or a recurrent neural network.
In this embodiment of the present application, after obtaining the question text corresponding to the target image based on the above steps, the computer device may preprocess the input question text by word segmentation, stop-word removal, stemming, or other text cleaning, to meet the input requirements of the second feature extraction network. The preprocessed question text is input into the second feature extraction network, which models the question text and captures its semantic information, contextual associations, and important characteristics to obtain a feature representation of the question, from which the question features are extracted.
It should be noted that executing S401 before S402 is only one possible order; in practical applications, S402 may be executed first and S401 afterwards, or S401 and S402 may be executed in parallel.
In the above embodiment, performing feature extraction on the target image makes it possible to better capture the object, scene, and text information in the image, and performing feature extraction on the question text makes it possible to better understand the semantics and context of the question, improving the accuracy of feature extraction.
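The two extraction networks can be sketched as follows, again only as an assumed concrete instance: a torchvision ResNet-18 backbone stands in for the first feature extraction network (the text-detection/OCR branch for scene text features is reduced to a comment), and an embedding-plus-GRU encoder stands in for the second.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ImageFeatureExtractor(nn.Module):
    """First feature extraction network: CNN backbone yielding visual object features.
    A real system would add a text-detection/OCR branch for scene text features."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(512, out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.proj(self.cnn(image).flatten(1))

class QuestionFeatureExtractor(nn.Module):
    """Second feature extraction network: token embedding + GRU over the question."""
    def __init__(self, vocab_size: int = 30000, out_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.gru = nn.GRU(256, out_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, hidden = self.gru(self.embed(token_ids))
        return hidden[-1]  # final hidden state as the question features

image = torch.randn(1, 3, 224, 224)        # preprocessed target image
tokens = torch.randint(0, 30000, (1, 12))  # tokenised question text
visual_obj = ImageFeatureExtractor()(image)    # S401
question = QuestionFeatureExtractor()(tokens)  # S402
print(visual_obj.shape, question.shape)
```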
In an embodiment, a specific implementation for obtaining the answer corresponding to the question text based on the fusion processing result is further provided. The target question-answering model further includes a dynamic pointer network, and as shown in fig. 5, obtaining the answer in step S202 includes:
S501, inputting the fusion processing result into the dynamic pointer network to dynamically generate a pointer position, where the pointer position is used to indicate the position range, in the output sequence, of the answer corresponding to the question text.
The dynamic pointer network is an attention-based network used to determine the position range of the answer in the output sequence (typically the answer sequence).
In this embodiment of the present application, after obtaining the fusion processing result based on the above steps, the computer device may input it into the dynamic pointer network. The dynamic pointer network performs a series of attention computations over the fusion processing result, finds the positions in the output sequence where the answer corresponding to the question text starts and ends, and thereby determines the position range of the answer, dynamically generating the pointer position.
S502, determining the answer corresponding to the question text according to the pointer position.
In this embodiment of the present application, after obtaining the pointer position based on the above steps, the computer device may extract, according to the start and end positions, the corresponding part of the output sequence as the answer. If the output is a sequence generation task, such as generating a text sequence, a decoder can further process and decode the extracted partial answer; based on the decoder output, the token with the highest probability can be selected, or other post-processing techniques (such as beam search and a length penalty) can be used, to produce the final answer corresponding to the question text.
In the above embodiment, since the dynamic pointer network can adaptively generate the pointer position according to the input information, using it to generate the answer corresponding to the question text makes it possible to accurately determine the position range of the answer in the output sequence, improving the accuracy of the answer position. At the same time, by analyzing and processing the fusion processing result, the multimodal information of the image and the question is considered comprehensively, so that the question is better understood and a more complete and accurate answer can be provided.
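A minimal span-prediction reading of the dynamic pointer network is sketched below; the separate start/end scoring heads are an assumption, since the disclosure only requires that attention over the fusion processing result yield the start and end positions of the answer.

```python
import torch
import torch.nn as nn

class DynamicPointerNetwork(nn.Module):
    """Scores every position of the output sequence and points at a start/end pair,
    i.e. the position range of the answer corresponding to the question text."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.start_scorer = nn.Linear(dim, 1)
        self.end_scorer = nn.Linear(dim, 1)

    def forward(self, fused_seq: torch.Tensor):
        # fused_seq: (batch, seq_len, dim) per-position fusion processing result
        start_logits = self.start_scorer(fused_seq).squeeze(-1)  # (batch, seq_len)
        end_logits = self.end_scorer(fused_seq).squeeze(-1)
        return start_logits.argmax(dim=-1), end_logits.argmax(dim=-1)

pointer = DynamicPointerNetwork()
fused = torch.randn(1, 20, 512)  # fusion processing result over 20 positions
start, end = pointer(fused)
# the answer is the sub-sequence output[start : end + 1] of the output sequence
```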
In an embodiment, a specific implementation of inputting the initial fusion features and the question features into the second fusion network for fusion processing to obtain the fusion processing result is further provided. The target question-answering model further includes a third feature extraction network, and as shown in fig. 6, step S302 includes:
S601, inputting the target image into the third feature extraction network for feature extraction to obtain global grid features.
The third feature extraction network may be a convolutional neural network or another structure. The global grid features characterize the overall structure and key information of the target image.
In this embodiment of the present application, after obtaining the target image based on the above steps, the computer device may preprocess it, for example by resizing, cropping, normalization, or color space conversion, to meet the input requirements of the third feature extraction network. The preprocessed target image is then input into the third feature extraction network, which gradually extracts feature representations at different levels by stacking multiple convolution and pooling layers, finally obtaining global grid features that describe the structure and key information of the entire target image.
Correspondingly, when executing step S302, step S602 is executed instead: inputting the global grid features, the initial fusion features, and the question features into the second fusion network for fusion processing to obtain the fusion processing result.
In this embodiment of the present application, after obtaining the initial fusion features, the question features, and the global grid features based on the above steps, the computer device may input all three into the second fusion network for fusion processing. Specifically, the three feature vectors may be concatenated, or specific fusion operations such as element-wise addition or multiplication may be used, to finally generate the fusion processing result.
In the above embodiment, extracting global grid features of the target image captures the overall characteristics and structural information of the image, helping the network better understand and process questions related to the target image. Combining the global grid features, the initial fusion features, and the question features allows the different features to complement one another, yielding a more accurate and complete fusion processing result and thereby improving the performance of the system on the task.
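A sketch of the third feature extraction network and the three-way second fusion, under the same assumptions as above (a small CNN with global average pooling stands in for the grid extractor; concatenation plus projection stands in for the fusion operation):

```python
import torch
import torch.nn as nn

class GridFeatureExtractor(nn.Module):
    """Third feature extraction network: stacked convolutions whose spatial grid is
    pooled into one global grid feature vector for the whole target image."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial grid to a global descriptor
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(image).flatten(1))

# Second fusion network over three inputs: grid + initial fusion + question features
second_fusion = nn.Sequential(nn.Linear(512 * 3, 512), nn.ReLU())

image = torch.randn(1, 3, 224, 224)
grid = GridFeatureExtractor()(image)  # S601
initial = torch.randn(1, 512)         # from the first fusion network
question = torch.randn(1, 512)        # from the question encoder
result = second_fusion(torch.cat([grid, initial, question], dim=-1))  # S602
```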
In one embodiment, a specific implementation for training the initial question-answering model to obtain the target question-answering model is further provided. As shown in fig. 7, the training method includes:
S701, acquiring a training sample set.
The training sample set includes a plurality of training samples and labels of the training samples. Each training sample includes a sample image and a sample question corresponding to the sample image, the sample image includes a sample object and sample scene text, and the label includes a sample answer corresponding to the sample question.
In the embodiment of the present application, the computer device can collect a large number of sample images and corresponding sample questions, for example by searching the Internet or drawing on related databases and resources, pair the sample questions with the sample images, and generate a label for each pair by manually writing answers or by obtaining sample answers from an open question-answering dataset.
Optionally, the sample images may be preprocessed by image enhancement, scaling, denoising, and the like, and the sample questions may undergo text processing such as word segmentation, stemming, and stop-word removal. The preprocessed sample images, the corresponding sample questions, and the labeled sample answers are combined to construct the training sample set, where each training sample contains one sample image, one sample question, and the corresponding sample answer.
For example, a sample image may show financial data, charts, or a financial transaction scene, and a sample question may concern the financial field, such as risk management, investment strategy, or market trend analysis. A corresponding sample answer is then provided for each sample question, obtained through annotation by professional annotators with the participation of financial experts, or extracted from an existing financial database.
S702, performing joint training on the initial feature extraction network, the initial fusion network, and the initial question-answering network in the initial question-answering model based on the training sample set to obtain the target question-answering model.
In this embodiment of the present application, after obtaining the training sample set based on the above steps, the computer device may input it into the initial feature extraction network to extract the sample visual object features, sample scene text features, sample global grid features, and sample question features. The sample visual object features and sample scene text features are input into the first-stage fusion network of the initial fusion network for preliminary fusion; the preliminary fusion result, the sample global grid features, and the sample question features are then input into the second-stage fusion network for a second fusion; and the second fusion result is finally input into the initial question-answering network to generate a predicted answer. The gradient of the loss function with respect to the network parameters is computed through backpropagation, and the weights and parameters of the networks are updated with an optimization algorithm such as gradient descent so that the loss gradually decreases. The parameters of the initial feature extraction network, the initial fusion network, and the initial question-answering network are updated in turn until a preset number of training rounds is reached or the loss function converges, at which point training ends and the target question-answering model is obtained.
In this embodiment, joint training allows the feature extraction network, the fusion network, and the question-answering network to make full use of the multimodal information in the training sample set, improving the expressive power and generalization of the model and strengthening its understanding of complex questions and images. Moreover, joint training on the training sample set gradually optimizes the initial question-answering model, yielding a stronger and more accurate target question-answering model that can better understand and answer questions.
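One joint training step over all three sub-networks might look as follows. This is a sketch under assumed interfaces (the batch keys and the image encoder returning all three image-side features are hypothetical); it shows a cross-entropy loss on the labelled answer span being backpropagated through the question-answering, fusion, and feature extraction networks at once.

```python
import torch
import torch.nn as nn

def joint_training_step(image_enc, question_enc, first_fusion, second_fusion,
                        qa_net, optimizer, batch):
    """One joint update: forward through extraction, fusion, and QA networks,
    then backpropagate a single loss so all parameters are trained together."""
    criterion = nn.CrossEntropyLoss()

    # assumed interface: image encoder returns visual object, scene text, and grid features
    visual, scene_text, grid = image_enc(batch["image"])
    q_feat = question_enc(batch["question"])

    initial = first_fusion(visual, scene_text)    # preliminary fusion
    fused = second_fusion(grid, initial, q_feat)  # second fusion
    start_logits, end_logits = qa_net(fused)      # pointer predictions (logits)

    # supervise both pointer positions with the labelled sample answer span
    loss = criterion(start_logits, batch["start"]) + criterion(end_logits, batch["end"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```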
Based on all of the above embodiments, an image-based question answering method is also provided, as shown in fig. 8, including:
S801, acquiring a training sample set. The training sample set includes a plurality of training samples and labels of the training samples; each training sample includes a sample image and a sample question corresponding to the sample image, the sample image includes a sample object and sample scene text, and the label includes a sample answer corresponding to the sample question.
S802, performing joint training on the initial feature extraction network, the initial fusion network, and the initial question-answering network in the initial question-answering model based on the training sample set to obtain the target question-answering model.
S803, a target image and a question text corresponding to the target image are acquired from the user terminal.
S804, inputting the target image into a first feature extraction network to perform feature extraction, and obtaining the visual object features and the scene text features.
S805, inputting the question text into a second feature extraction network to perform feature extraction, and obtaining the question feature.
S806, inputting the target image into a third feature extraction network to perform feature extraction, and obtaining global grid features.
S807, inputting the visual object features and the scene text features into a first fusion network for fusion processing, and obtaining initial fusion features.
S808, inputting the global grid features, the initial fusion features and the problem features into a second fusion network for fusion processing, and obtaining a fusion processing result.
S809, inputting the fusion processing result into a dynamic pointer network, and dynamically generating a pointer position; the pointer position is used to indicate the range of positions of the answers corresponding to the question text in the output sequence.
S810, determining an answer corresponding to the question text according to the pointer position.
S811, feeding back the answer to the user terminal.
The steps of this method are described in the foregoing embodiments; for details, reference is made to the description above, which is not repeated here.
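Putting S803 to S810 together, the inference path can be summarised by the following sketch; `model` is a hypothetical container bundling the networks illustrated above, not an interface fixed by this disclosure.

```python
def answer_question(model, target_image, question_text):
    """End-to-end flow of S803-S811 under the assumed module interfaces above."""
    visual, scene_text = model.first_extractor(target_image)  # S804
    q_feat = model.second_extractor(question_text)            # S805
    grid = model.third_extractor(target_image)                # S806
    initial = model.first_fusion(visual, scene_text)          # S807
    fused = model.second_fusion(grid, initial, q_feat)        # S808
    start, end = model.pointer_network(fused)                 # S809
    answer = model.decode_span(fused, start, end)             # S810
    return answer  # fed back to the user terminal (S811)
```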
It should be understood that although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and whose order is not necessarily sequential; they may be executed in turn or alternately with at least part of the other steps or stages.
Based on the same inventive concept, an embodiment of the present application also provides an image-based question answering apparatus for implementing the above image-based question answering method. The implementation of the solution provided by the apparatus is similar to that described in the above method, so for the specific limitations of the image-based question answering apparatus embodiments below, reference may be made to the limitations of the image-based question answering method above; details are not repeated here.
In one embodiment, as shown in fig. 9, there is provided an image-based question answering apparatus, including:
the target data acquisition module 10 is configured to acquire a target image and a question text corresponding to the target image from the user terminal.
The question and answer module 11 is configured to input a target image and a question text into a pre-trained target question and answer model, perform feature extraction on the target image and the question text through the target question and answer model, obtain a visual object feature, a scene text feature and a question feature, perform fusion processing on the visual object feature, the scene text feature and the question feature, and obtain an answer corresponding to the question text based on a fusion processing result.
The feedback module 12 is used for feeding back the answer to the user terminal.
In one embodiment, the question and answer module 11 includes:
the first fusion unit is used for inputting the visual object features and the scene text features into the first fusion network to be subjected to fusion processing, so as to obtain initial fusion features.
And the second fusion unit is used for inputting the initial fusion characteristics and the problem characteristics into a second fusion network to carry out fusion processing, so as to obtain a fusion processing result.
In one embodiment, the question and answer module 11 includes:
the first feature extraction unit is used for inputting the target image into the first feature extraction network to perform feature extraction, and obtaining the visual object features and the scene text features.
And the second feature extraction unit is used for inputting the problem text into a second feature extraction network to perform feature extraction so as to obtain the problem feature.
In one embodiment, the question and answer module 11 includes:
The generating unit is used for inputting the fusion processing result into the dynamic pointer network to dynamically generate a pointer position, where the pointer position is used to indicate the position range, in the output sequence, of the answer corresponding to the question text.
The determining unit is used for determining the answer corresponding to the question text according to the pointer position.
In one embodiment, the question and answer module 11 includes:
The third feature extraction unit is used for inputting the target image into the third feature extraction network for feature extraction to obtain global grid features.
The third fusion unit is used for inputting the global grid features, the initial fusion features, and the question features into the second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the image-based question answering apparatus includes:
The sample acquisition module is used for acquiring a training sample set. The training sample set includes a plurality of training samples and labels of the training samples; each training sample includes a sample image and a sample question corresponding to the sample image, the sample image includes a sample object and sample scene text, and the label includes a sample answer corresponding to the sample question.
The training module is used for performing joint training on the initial feature extraction network, the initial fusion network, and the initial question-answering network in the initial question-answering model based on the training sample set to obtain the target question-answering model.
Each module in the above image-based question answering apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input means are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; wireless communication can be realized through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements an image-based question answering method. The display unit of the computer device is used to present a visible picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display or an electronic ink display. The input means of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of part of the structure relevant to the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, where the processor, when executing the computer program, implements the following steps:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, performing feature extraction on the target image and the question text respectively through the target question-answering model to obtain visual object features, scene text features, and question features, performing fusion processing on the visual object features, the scene text features, and the question features, and obtaining an answer corresponding to the question text based on the fusion processing result; and
feeding back the answer to the user terminal.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
inputting the visual object features and the scene text features into a first fusion network for fusion processing to obtain initial fusion features; and
inputting the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
inputting the target image into a first feature extraction network for feature extraction to obtain visual object features and scene text features; and
inputting the question text into a second feature extraction network for feature extraction to obtain question features.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
inputting the fusion processing result into a dynamic pointer network to dynamically generate a pointer position, where the pointer position is used to indicate the position range, in the output sequence, of the answer corresponding to the question text; and
determining the answer corresponding to the question text according to the pointer position.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
inputting the target image into a third feature extraction network to perform feature extraction to obtain global grid features;
and inputting the global grid features, the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
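For illustration, the global grid features could be taken from the final feature map of a convolutional backbone and concatenated with the initial fusion features to widen the context available to the second fusion network; the 2048-channel input, the concatenation strategy, and the class name are assumptions of this sketch.

import torch
import torch.nn as nn

class GridAugmentedFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Assumption: grid features come from a 2048-channel backbone map.
        self.grid_proj = nn.Linear(2048, dim)
        self.second_fusion = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, grid_feats, initial_fusion, question):
        grid = self.grid_proj(grid_feats)                   # global grid features
        context = torch.cat([grid, initial_fusion], dim=1)  # widen the key/value set
        fused, _ = self.second_fusion(question, context, context)
        return fused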
In one embodiment, the processor when executing the computer program further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of training samples and labels of the training samples, wherein the training samples comprise sample images and sample questions corresponding to the sample images, and the sample images comprise sample objects and sample scene texts; the labels comprise sample answers corresponding to the sample questions;
and carrying out joint training on the initial feature extraction network, the initial fusion network and the initial question-answering network in the initial question-answering model based on the training sample set to obtain a target question-answering model.
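By way of illustration only, joint training can be sketched as a single optimizer updating the feature extraction, fusion, and question-answering networks together against the labeled sample answers; the optimizer, the cross-entropy loss, the model interface, and the hyperparameters below are assumptions, not part of the embodiments.

import torch

def train_jointly(model, train_loader, epochs: int = 10, lr: float = 1e-4):
    # One optimizer over model.parameters() covers the feature extraction,
    # fusion, and question-answering sub-networks at once.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for sample_image, sample_question, answer_label in train_loader:
            logits = model(sample_image, sample_question)
            loss = loss_fn(logits, answer_label)
            optimizer.zero_grad()
            loss.backward()   # gradients flow through all three sub-networks
            optimizer.step()
    return model  # the trained target question-answering model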
The computer device provided in the foregoing embodiments has similar implementation principles and technical effects to those of the foregoing method embodiments, and will not be described herein in detail.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the following steps:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, respectively extracting features of the target image and the question text through the target question-answering model to obtain visual object features, scene text features and question features, carrying out fusion processing on the visual object features, the scene text features and the question features, and obtaining an answer corresponding to the question text based on a fusion processing result;
and feeding back the answer to the user terminal.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the visual object features and the scene text features into a first fusion network for fusion processing to obtain initial fusion features;
and inputting the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the target image into a first feature extraction network to perform feature extraction to obtain visual object features and scene text features;
and inputting the question text into a second feature extraction network to perform feature extraction to obtain the question features.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the fusion processing result into a dynamic pointer network, and dynamically generating a pointer position; the pointer position is used for indicating the position range of the answer corresponding to the question text in the output sequence;
and determining an answer corresponding to the question text according to the pointer position.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the target image into a third feature extraction network to perform feature extraction to obtain global grid features;
and inputting the global grid features, the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of training samples and labels of the training samples, wherein the training samples comprise sample images and sample questions corresponding to the sample images, and the sample images comprise sample objects and sample scene texts; the labels comprise sample answers corresponding to the sample questions;
and carrying out joint training on the initial feature extraction network, the initial fusion network and the initial question-answering network in the initial question-answering model based on the training sample set to obtain a target question-answering model.
The foregoing embodiment provides a computer readable storage medium, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, respectively extracting features of the target image and the question text through the target question-answering model to obtain visual object features, scene text features and question features, carrying out fusion processing on the visual object features, the scene text features and the question features, and obtaining an answer corresponding to the question text based on a fusion processing result;
and feeding back the answer to the user terminal.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the visual object features and the scene text features into a first fusion network for fusion processing to obtain initial fusion features;
and inputting the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the target image into a first feature extraction network to perform feature extraction to obtain visual object features and scene text features;
and inputting the question text into a second feature extraction network to perform feature extraction to obtain the question features.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the fusion processing result into a dynamic pointer network, and dynamically generating a pointer position; the pointer position is used for indicating the position range of the answer corresponding to the question text in the output sequence;
and determining an answer corresponding to the question text according to the pointer position.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the target image into a third feature extraction network to perform feature extraction to obtain global grid features;
and inputting the global grid features, the initial fusion features and the question features into a second fusion network for fusion processing to obtain a fusion processing result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample set; the training sample set comprises a plurality of training samples and labels of the training samples, wherein the training samples comprise sample images and sample questions corresponding to the sample images, and the sample images comprise sample objects and sample scene texts; the labels comprise sample answers corresponding to the sample questions;
and carrying out joint training on the initial feature extraction network, the initial fusion network and the initial question-answering network in the initial question-answering model based on the training sample set to obtain a target question-answering model.
The foregoing embodiment provides a computer program product, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.
Those skilled in the art will appreciate that all or part of the flows of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and the computer program, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. An image-based question-answering method, the method comprising:
acquiring a target image and a question text corresponding to the target image from a user terminal;
inputting the target image and the question text into a pre-trained target question-answering model, respectively extracting features of the target image and the question text through the target question-answering model to obtain visual object features, scene text features and question features, carrying out fusion processing on the visual object features, the scene text features and the question features, and obtaining an answer corresponding to the question text based on a fusion processing result;
and feeding back the answer to the user terminal.
2. The method of claim 1, wherein the target question-answering model comprises a first fusion network and a second fusion network, and the carrying out fusion processing on the visual object features, the scene text features and the question features comprises:
inputting the visual object features and the scene text features into the first fusion network for fusion processing to obtain initial fusion features;
and inputting the initial fusion features and the question features into the second fusion network to perform fusion processing to obtain a fusion processing result.
3. The method according to claim 2, wherein the target question-answering model further comprises a first feature extraction network and a second feature extraction network, and the respectively extracting features of the target image and the question text through the target question-answering model to obtain visual object features, scene text features and question features comprises:
inputting the target image into the first feature extraction network to perform feature extraction to obtain the visual object features and the scene text features;
and inputting the question text into the second feature extraction network to perform feature extraction to obtain the question features.
4. The method of claim 3, wherein the target question-answering model further comprises a dynamic pointer network, and the obtaining an answer corresponding to the question text based on the fusion processing result comprises:
inputting the fusion processing result into the dynamic pointer network, and dynamically generating a pointer position; the pointer position is used for indicating a position range of an answer corresponding to the question text in the output sequence;
and determining an answer corresponding to the question text according to the pointer position.
5. The method according to claim 3, wherein the target question-answering model further comprises a third feature extraction network, and the method further comprises:
inputting the target image into the third feature extraction network to perform feature extraction to obtain global grid features;
correspondingly, the inputting the initial fusion features and the question features into the second fusion network to perform fusion processing to obtain a fusion processing result comprises:
inputting the global grid features, the initial fusion features and the question features into the second fusion network to perform fusion processing to obtain a fusion processing result.
6. The method of claim 5, wherein the method further comprises:
acquiring a training sample set; the training sample set comprises a plurality of training samples and labels of the training samples, wherein the training samples comprise sample images and sample questions corresponding to the sample images, and the sample images comprise sample objects and sample scene texts; the labels comprise sample answers corresponding to the sample questions;
and carrying out joint training on the initial feature extraction network, the initial fusion network and the initial question-answering network in the initial question-answering model based on the training sample set to obtain the target question-answering model.
7. An image-based question-answering apparatus, the apparatus comprising:
the target data acquisition module is used for acquiring a target image and a question text corresponding to the target image from the user terminal;
the question-answering module is used for inputting the target image and the question text into a pre-trained target question-answering model, respectively extracting features of the target image and the question text through the target question-answering model to obtain visual object features, scene text features and question features, carrying out fusion processing on the visual object features, the scene text features and the question features, and obtaining an answer corresponding to the question text based on a fusion processing result;
and the feedback module is used for feeding back the answer to the user terminal.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202311206363.6A 2023-09-18 2023-09-18 Image-based question answering method, apparatus, device, storage medium, and program product Pending CN117312508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311206363.6A CN117312508A (en) 2023-09-18 2023-09-18 Image-based question answering method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311206363.6A CN117312508A (en) 2023-09-18 2023-09-18 Image-based question answering method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN117312508A 2023-12-29

Family

ID=89236433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311206363.6A Pending CN117312508A (en) 2023-09-18 2023-09-18 Image-based question answering method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN117312508A (en)

Similar Documents

Publication Publication Date Title
RU2699687C1 (en) Detecting text fields using neural networks
CN111324774B (en) Video duplicate removal method and device
US20230077849A1 (en) Content recognition method and apparatus, computer device, and storage medium
Bai et al. Boosting convolutional image captioning with semantic content and visual relationship
Ji et al. Image-attribute reciprocally guided attention network for pedestrian attribute recognition
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
WO2024083121A1 (en) Data processing method and apparatus
Hong et al. Selective residual learning for visual question answering
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114332484A (en) Key point detection method and device, computer equipment and storage medium
CN117149967A (en) Response generation method, device, server and computer readable storage medium
CN116756281A (en) Knowledge question-answering method, device, equipment and medium
CN110851629A (en) Image retrieval method
US20230154221A1 (en) Unified pretraining framework for document understanding
CN117312508A (en) Image-based question answering method, apparatus, device, storage medium, and program product
Shi et al. Face-based age estimation using improved Swin Transformer with attention-based convolution
CN114692715A (en) Sample labeling method and device
Zhuang et al. Pose prediction of textureless objects for robot bin picking with deep learning approach
US20240037939A1 (en) Contrastive captioning for image groups
CN117392260B (en) Image generation method and device
CN111563159B (en) Text sorting method and device
Ye et al. An improved boundary-aware face alignment using stacked dense U-Nets
CN115984426B (en) Method, device, terminal and storage medium for generating hairstyle demonstration image
CN116860972A (en) Interactive information classification method, device, apparatus, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination