CN116662497A - Visual question-answer data processing method, device and computer equipment - Google Patents

Visual question-answer data processing method, device and computer equipment

Info

Publication number
CN116662497A
CN116662497A (application CN202310492337.8A)
Authority
CN
China
Prior art keywords
image
information
answer
question
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310492337.8A
Other languages
Chinese (zh)
Inventor
张海轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310492337.8A
Publication of CN116662497A
Legal status: Pending

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/332 Query formulation
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/40 Extraction of image or video features
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/82 Image or video recognition or understanding using neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a visual question-answering data processing method, apparatus, and computer device in the field of financial technology and related fields. The method comprises: acquiring a visual question-answering image to be predicted and query information for that image, and inputting both into a pre-trained visual question-answering prediction model; processing the query information through a text feature extraction network in a fusion network to obtain text feature information, and processing the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information, where the text feature extraction network is adjusted based on a hierarchical-structure attention mechanism; and performing pixel-level classification on the image feature information through a multi-layer perceptron network, conditioned on the text feature information, to obtain the model's output prediction as the predicted question-answer information corresponding to the query. This approach enriches the visual representation, provides a comprehensive visual-language view of the input image, and improves the diversity and accuracy of the generated answers.

Description

Visual question-answer data processing method, device and computer equipment
Technical Field
The present application relates to the field of financial technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for processing visual question-answering data.
Background
Visual question answering is a challenging task: given an image and a natural language question about it, a natural language answer must be provided.
At present, visual question-answering models in the related art involve representation learning for different modalities and cross-modal fusion, which is very difficult; they struggle to extract visual information from a given image and have a high failure rate. For the complex financial information in financial business scenarios, they also often fail to produce accurate answer predictions, so the visual question-answering effect is poor.
Disclosure of Invention
Based on this, it is necessary to provide a visual question-answer data processing method, apparatus, computer device, storage medium and computer program product capable of solving the above-mentioned problems.
In a first aspect, the present application provides a method for processing visual question-answer data, the method comprising:
acquiring a visual question-answering image to be predicted and query information for the visual question-answering image, and inputting the visual question-answering image and the query information into a pre-trained visual question-answering prediction model; the pre-trained visual question-answering prediction model comprises a fusion network with a dual-branch structure and a multi-layer perceptron network, and the visual question-answering image and the query information are both generated in a financial business scenario;
processing the query information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information, where the text feature extraction network is adjusted based on a hierarchical-structure attention mechanism;
and performing pixel-level classification on the image feature information through the multi-layer perceptron network, based on the text feature information, to obtain a model output prediction result as the predicted question-answer information corresponding to the query information.
In one embodiment, the processing the visual question-answer image through the image feature extraction network in the fusion network to obtain image feature information includes:
extracting the image information of the visual question-answering image through the image feature extraction network in the fusion network;
combining the text-associated region features and the text recognition features corresponding to the visual question-answering image with the image information to obtain visual representation information, where the text-associated region features and the text recognition features are used to adjust image understanding so as to provide a comprehensive visual-language view;
and obtaining a target feature image from the visual representation information and the visual question-answering image, and taking the target feature image as the image feature information.
In one embodiment, performing pixel-level classification on the image feature information through the multi-layer perceptron network based on the text feature information to obtain a model output prediction result includes:
performing image restoration processing on the target feature image through the multi-layer perceptron network to obtain a processed feature image;
and using a decoder of the multi-layer perceptron network to perform pixel-level classification on the processed feature image according to the text feature information, obtaining the model output prediction result.
In one embodiment, the pre-trained visual question-answer prediction model is trained by the following method:
acquiring a training sample set, in which each training sample consists of a sample image and a plurality of question-answer information pairs contained in that sample image, the sample images and question-answer information pairs being collected in a financial business scenario;
constructing the visual question-answering prediction model to be trained from the fusion model with the dual-branch structure and the multi-layer perceptron network, where the first branch of the fusion model is the text feature extraction network adjusted based on a hierarchical-structure attention mechanism and the second branch is the image feature extraction network;
and training the visual question-answering prediction model to be trained with the training sample set to obtain the pre-trained visual question-answering prediction model.
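As a rough illustration of the final training step, a full-batch gradient-descent loop of the following shape can be used. The toy least-squares objective, the sample sizes, and the learning rate below are illustrative assumptions; the patent does not specify the model's loss function or optimiser.

```python
import numpy as np

def train_step(w, x, y, lr=0.1):
    # One gradient-descent step on a toy least-squares loss, standing in
    # for "training the model with the training sample set". The real
    # model's architecture, loss, and optimiser are not given here.
    pred = x @ w
    grad = x.T @ (pred - y) / len(x)
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))            # 32 toy training samples
true_w = np.array([1.0, -2.0, 0.5, 3.0])    # hypothetical target weights
y = x @ true_w
w = np.zeros(4)
for _ in range(500):                        # simple full-batch training loop
    w = train_step(w, x, y)
print(np.round(w, 2))                       # converges towards true_w
```

The same loop shape applies whatever the actual model: iterate over the training sample set, compute a loss between predicted and reference answers, and update the parameters.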
In one embodiment, before the step of obtaining the training sample set, the method further comprises:
acquiring an initial sample set collected in the financial business scenario, in which the different question-answer information pairs within each initial sample have different query object types;
and performing data processing on the initial sample set according to preset processing information, then obtaining the training sample set and a test sample set from the processed initial sample set, where the preset processing information indicates the data screening operation and the image size adjustment operation to be applied to the initial sample set.
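The screening and size-adjustment operations could be sketched as below. The field names, the minimum-pair filter, and the 224x224 target size are illustrative assumptions, since the patent leaves the concrete "preset processing information" unspecified.

```python
def preprocess_samples(samples, min_pairs=1, target_size=(224, 224)):
    # Data screening + size adjustment sketch: drop samples with too few
    # question-answer pairs, and record the target resize for each image.
    # The keys 'image' / 'qa_pairs' and the 224x224 target are assumptions.
    kept = []
    for s in samples:
        if len(s["qa_pairs"]) < min_pairs:       # screening operation
            continue
        s = dict(s, image_size=target_size)      # size-adjustment operation
        kept.append(s)
    return kept

raw = [
    {"image": "receipt_001.png", "qa_pairs": [("amount?", "120.00")]},
    {"image": "blank_002.png", "qa_pairs": []},  # screened out
]
train = preprocess_samples(raw)
print(len(train))  # 1
```

After filtering and resizing, the processed set would then be split into the training sample set and the test sample set.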
In one embodiment, after the step of obtaining the pre-trained visual question-answer prediction model, the method further comprises:
acquiring preset evaluation information, which is used to measure the accuracy of the predicted question-answer results during model testing;
and testing the pre-trained visual question-answering prediction model with the test sample set, combining the preset evaluation information with the predicted question-answer results output by the model to obtain a model test result for the pre-trained visual question-answering prediction model.
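If the preset evaluation information is taken to be exact-match accuracy (an assumption; the patent does not fix the metric), the test-time evaluation can be sketched as:

```python
def evaluate(predictions, ground_truth):
    # Accuracy-style evaluation sketch: fraction of predicted answers
    # that exactly match the reference answers. Exact-match accuracy is
    # an assumed stand-in for the "preset evaluation information".
    assert len(predictions) == len(ground_truth)
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

acc = evaluate(["cat", "120.00", "yes"], ["cat", "125.00", "yes"])
print(acc)  # 2 of 3 answers match
```

In practice the predictions would come from running the pre-trained model over the test sample set, and the resulting score forms the model test result.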
In one embodiment, before the step of constructing the visual question-answering prediction model to be trained from the fusion model with the dual-branch structure and the multi-layer perceptron network, the method further includes:
respectively constructing a text feature extraction network for processing a text feature extraction task as a first branch and an image feature extraction network for processing an image feature extraction task as a second branch;
and fusing the first branch and the second branch to obtain the fusion model with the dual-branch structure.
In one embodiment, the building a text feature extraction network for processing text feature extraction tasks includes:
acquiring an initial text feature extraction network, and adjusting it by combining a weighted accumulation scheme with a hierarchical-structure attention mechanism to obtain the text feature extraction network;
where the hierarchical-structure attention mechanism combines features of different text levels in the question-answer information with the self-attention mechanism, using the degree of association among the different hierarchical levels to strengthen the network's ability to capture structural information.
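A minimal sketch of such a hierarchical-structure attention mechanism might look as follows: each text level (for example character, word, and sentence features) is passed through plain self-attention, and the attended levels are then merged by weighted accumulation. The level names, dimensions, and fixed accumulation weights are assumptions for illustration; the patent does not give the exact formulation.

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, d) token features; plain scaled dot-product self-attention
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def hierarchical_attention(level_feats, level_weights):
    # level_feats: list of (seq_len, d) arrays, one per text level
    # (e.g. character, word, sentence). Each level is self-attended,
    # then the levels are merged by weighted accumulation so that
    # inter-level associations inform the combined representation.
    w = np.array(level_weights, dtype=float)
    w = w / w.sum()                       # normalise the accumulation weights
    attended = [self_attention(f) for f in level_feats]
    return sum(wi * a for wi, a in zip(w, attended))

# toy example: 3 text levels, 4 tokens, 8-dim features
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8)) for _ in range(3)]
out = hierarchical_attention(feats, [1.0, 2.0, 1.0])
print(out.shape)  # (4, 8)
```

In the patent's setting the accumulation weights would be learned jointly with the LSTM-based text branch rather than fixed as here.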
In a second aspect, the present application also provides a visual question-answering data processing apparatus, the apparatus comprising:
a to-be-predicted data acquisition module, configured to acquire a visual question-answering image to be predicted and query information for the visual question-answering image, and to input them into the pre-trained visual question-answering prediction model, where the pre-trained visual question-answering prediction model comprises a fusion network with a dual-branch structure and a multi-layer perceptron network, and the visual question-answering image and the query information are both generated in a financial business scenario;
the visual question-answering prediction model processing module is used for processing the query information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
and a predicted question-answer information obtaining module, configured to perform pixel-level classification on the image feature information through the multi-layer perceptron network based on the text feature information, to obtain a model output prediction result as the predicted question-answer information corresponding to the query information.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the visual question-answer data processing method as described above when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the visual question-answer data processing method as described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the visual question-answer data processing method as described above.
According to the visual question-answering data processing method, apparatus, computer device, storage medium, and computer program product, the visual question-answering image to be predicted and the query information for that image are acquired and input into the pre-trained visual question-answering prediction model, which comprises a fusion network with a dual-branch structure and a multi-layer perceptron network; both the image and the query information are generated in a financial business scenario. The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-answering image is processed by the image feature extraction network to obtain image feature information, the text feature extraction network being adjusted based on a hierarchical-structure attention mechanism. The image feature information then undergoes pixel-level classification in the multi-layer perceptron network, conditioned on the text feature information, and the resulting prediction is output as the predicted question-answer information corresponding to the query. Because the dual-branch fusion network lets each branch focus on its own feature extraction task and the multi-layer perceptron network performs the pixel-level classification, the model can provide a comprehensive visual-language view of the input image and improve the diversity and accuracy of the generated answers.
Drawings
FIG. 1 is a flow chart of a method for processing visual question-answering data according to one embodiment;
FIG. 2 is a schematic diagram of a model training and testing process in one embodiment;
FIG. 3 is a flow chart of a model training step in one embodiment;
FIG. 4 is a flow chart of another method of processing visual question-answering data in one embodiment;
FIG. 5 is a block diagram of a visual question-answering data processing apparatus in one embodiment;
FIG. 6 is an internal block diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for presentation, analyzed data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party; correspondingly, the application also provides a corresponding user authorization entry for the user to select authorization or select rejection.
In one embodiment, as shown in FIG. 1, a visual question-answering data processing method is provided. The method is described as applied to a terminal by way of illustration; it may also be applied to a server, or to a system comprising the terminal and the server and implemented through interaction between them. In this embodiment, the method includes the following steps:
step 101, acquiring a visual question-answer image to be predicted and query information aiming at the visual question-answer image, and inputting the visual question-answer image and the query information into a pre-trained visual question-answer prediction model;
the pre-trained visual question-answer prediction model can comprise a fusion network with a double-branch structure and a multi-layer perception network, wherein a first branch in the fusion network can be a text feature extraction network, and a second branch can be an image feature extraction network, so that different feature extraction tasks can be respectively focused on by using the double-branch structure.
As an example, the visual question-answering image and the query information are both generated in the financial business scenario; the visual question-answering task to be processed may come from an intelligent customer service system or similar service in that scenario, and includes the visual question-answering image to be predicted and the query information for the image.
In practical application, a visual question-answer task to be predicted can be obtained based on intelligent robot services such as intelligent customer service, intelligent consultation service and the like in a financial service scene, then a visual question-answer image to be predicted and query information aiming at the visual question-answer image can be obtained according to the visual question-answer task, and the obtained visual question-answer image and query information can be input into a pre-trained visual question-answer prediction model to generate predicted question-answer information based on the visual question-answer image and the query information to serve as a processing result corresponding to the visual question-answer task.
In an alternative embodiment, as shown in FIG. 2, in the pre-trained visual question-answering prediction model the text feature extraction network of the first branch in the fusion network may be an improved LSTM (Long Short-Term Memory) model, and the image feature extraction network of the second branch may be an EAST model (Efficient and Accurate Scene Text detector), which can detect text in images and videos, for example performing optical character recognition on natural scene images; the multi-layer perceptron network may be an MLP (Multi-Layer Perceptron) module.
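To make the dual-branch layout concrete, the following minimal NumPy sketch wires together a recurrent text branch (a stand-in for the improved LSTM), a pooled linear projection of the image (a stand-in for the EAST backbone), and an MLP-style head over the fused features. All layer sizes, the random weights, and the answer-classification framing are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

def text_branch(tokens, d=16):
    # Stand-in for the improved LSTM: one recurrent pass with tanh cells.
    rng = np.random.default_rng(1)
    Wx = rng.standard_normal((tokens.shape[1], d))
    Wh = rng.standard_normal((d, d))
    h = np.zeros(d)
    for t in tokens:
        h = np.tanh(t @ Wx + h @ Wh)
    return h                                   # (d,) text feature vector

def image_branch(img, d=16):
    # Stand-in for EAST: global average pooling of a linear projection.
    rng = np.random.default_rng(2)
    W = rng.standard_normal((img.shape[-1], d))
    return (img.reshape(-1, img.shape[-1]) @ W).mean(axis=0)   # (d,)

def mlp_head(text_feat, img_feat, n_classes=10):
    # Fuse the two branches and score a small answer vocabulary.
    rng = np.random.default_rng(3)
    fused = np.concatenate([text_feat, img_feat])
    W = rng.standard_normal((fused.shape[0], n_classes))
    return int(np.argmax(fused @ W))           # predicted answer index

query = np.random.default_rng(4).standard_normal((5, 8))   # 5 query tokens
image = np.random.default_rng(5).standard_normal((6, 6, 3))
answer_id = mlp_head(text_branch(query), image_branch(image))
print(answer_id)
```

The point of the sketch is the routing, not the layers: each branch sees only its own modality, and fusion happens once, before the head, as in the dual-branch fusion network described above.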
Step 102, processing the query information through the text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answering image through the image feature extraction network in the fusion network to obtain image feature information;
As an example, the text feature extraction network may be adjusted based on a hierarchical-structure attention mechanism; for instance, the mechanism may be used to improve the LSTM model serving as the text feature extraction network of the first branch in the fusion network.
In a specific implementation, after the visual question-answer image and the query information are input into the pre-trained visual question-answer prediction model, the query information can be processed through a text feature extraction network in the fusion network based on a fusion network in the pre-trained visual question-answer prediction model to obtain text feature information, and the visual question-answer image is processed through an image feature extraction network in the fusion network to obtain image feature information, so that different feature extraction tasks can be respectively processed based on a double-branch structure.
In an example, the image information of the visual question-answering image can be extracted by the image feature extraction network in the fusion network. The text-associated region features and text recognition features corresponding to the image, such as the rich text-region features and detailed OCR (optical character recognition) features extracted by the model, can then be combined with the image information to obtain visual representation information, and a target feature image can be obtained from the visual representation information and the visual question-answering image as the image feature information. In this way, extracting image information with the EAST model and exploiting the richer text-region features and detailed OCR features improves the visual representation for image understanding, provides a comprehensive visual-language view of the input image, and improves the diversity and accuracy of the generated answers.
Step 103, performing pixel-level classification on the image feature information through the multi-layer perceptron network based on the text feature information, to obtain a model output prediction result as the predicted question-answer information corresponding to the query information.
After the text feature information and the image feature information are obtained, image restoration processing can be performed on the target feature image through the multi-layer perceptron network to obtain a processed feature image; a decoder of the multi-layer perceptron network can then perform pixel-level classification on the processed feature image according to the text feature information, producing a model output prediction result that serves as the predicted question-answer information corresponding to the query information.
For example, the decoder with the MLP structure can complete the pixel-level classification on the feature map output by the EAST model (i.e. the target feature image) and output the prediction result, i.e. the model output prediction result.
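The pixel-level classification step can be illustrated as a shared fully-connected layer applied at every spatial position of the feature map; the class count, feature sizes, and random weights below are placeholders, not values from the patent.

```python
import numpy as np

def pixel_classify(feature_map, n_classes=5, seed=0):
    # Apply one shared fully-connected layer to every spatial position
    # of the (H, W, C) feature map, producing a per-pixel class label:
    # the pixel-level classification role of the MLP decoder.
    rng = np.random.default_rng(seed)
    H, W, C = feature_map.shape
    Wfc = rng.standard_normal((C, n_classes))
    logits = feature_map.reshape(-1, C) @ Wfc    # (H*W, n_classes)
    return logits.argmax(axis=-1).reshape(H, W)  # (H, W) label map

fmap = np.random.default_rng(1).standard_normal((4, 4, 12))
labels = pixel_classify(fmap)
print(labels.shape)  # (4, 4)
```

In the full model the text feature information would also condition this step, for example by being concatenated into the per-pixel features before the fully-connected layer.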
Traditional visual question-answering models involve representation learning for different modalities and cross-modal fusion, which is difficult, and they struggle to extract visual information from a given image. By providing a dual-path routing network model, i.e. the pre-trained visual question-answering prediction model, the technical solution of this embodiment can use rich text-region features and OCR features to enrich the visual representation, provide a comprehensive visual-language view of the input image to be predicted, and improve the diversity and accuracy of the model's predicted question-answer information.
According to the visual question-answering data processing method above, the visual question-answering image to be predicted and the query information for it are acquired and input into the pre-trained visual question-answering prediction model; the query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the image is processed by the image feature extraction network to obtain image feature information; the image feature information then undergoes pixel-level classification in the multi-layer perceptron network based on the text feature information, and the model output prediction result serves as the predicted question-answer information corresponding to the query. By optimizing the visual question-answering prediction model so that the dual-branch fusion network lets each branch focus on a different feature extraction task and the multi-layer perceptron network performs the pixel-level classification, a comprehensive visual-language view can be provided for the input image to be predicted, and the diversity and accuracy of the predicted question-answer information are improved.
In one embodiment, the processing the visual question-answer image through the image feature extraction network in the fusion network to obtain image feature information may include the following steps:
extracting the image information of the visual question-answering image through the image feature extraction network in the fusion network; combining the text-associated region features and the text recognition features corresponding to the visual question-answering image with the image information to obtain visual representation information, where the text-associated region features and the text recognition features are used to adjust image understanding so as to provide a comprehensive visual-language view; and obtaining a target feature image from the visual representation information and the visual question-answering image, taking the target feature image as the image feature information.
Specifically, the EAST model (the image feature extraction network) can be used to extract image information; the richer text-region features (the text-associated region features) and detailed OCR features (the text recognition features) can then be used to improve the visual representation for image understanding, yielding the visual representation information; and the feature map output by the EAST model (the target feature image) can be obtained from the visual representation information and the visual question-answering image. A comprehensive visual-language view can thus be provided for the input image, improving the diversity and accuracy of answer generation.
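One simple way to realise the combination step is channel-wise concatenation of the three feature maps on a shared spatial grid, as in this hedged sketch; the channel counts and the concatenation choice are assumptions, since the patent does not specify the fusion operator.

```python
import numpy as np

def build_visual_representation(img_feat, region_feat, ocr_feat):
    # Combine the backbone image features with text-region features and
    # OCR features by concatenating along the channel axis. All three
    # inputs are assumed to share the same spatial grid (H, W).
    assert img_feat.shape[:2] == region_feat.shape[:2] == ocr_feat.shape[:2]
    return np.concatenate([img_feat, region_feat, ocr_feat], axis=-1)

H, W = 8, 8
img_feat = np.zeros((H, W, 32))      # backbone (EAST-style) features
region_feat = np.zeros((H, W, 8))    # e.g. text-box geometry maps
ocr_feat = np.zeros((H, W, 16))      # e.g. projected OCR embeddings
vis = build_visual_representation(img_feat, region_feat, ocr_feat)
print(vis.shape)  # (8, 8, 56)
```

Downstream layers then see text-region and OCR evidence at every spatial location, which is what gives the decoder its comprehensive visual-language view.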
In one embodiment, performing, based on the text feature information, pixel-level classification processing on the image feature information through the multi-layer perception network to obtain a model output prediction result may include the following steps:
Performing image restoration processing on the target feature image through the multi-layer perception network to obtain a processed feature image; and adopting a decoder of the multi-layer perception network to carry out pixel level classification processing on the processed characteristic image according to the text characteristic information, so as to obtain the model output prediction result.
In an example, the MLP structure (i.e., the multi-layer perception network) may be used to restore the low-resolution feature map (i.e., the target feature image) output by the encoder to the original image size, i.e., to obtain the processed feature image, and the feature map output by the last layer in the MLP structure may be input to a fully-connected layer to obtain the pixel-level classification result, i.e., the model output prediction result. In this way, global information aggregation can be achieved based on the comprehensive visual-language view provided by the rich visual representation, improving the diversity and accuracy of the answers generated by the model.
Specifically, within the MLP structure, in order to recover the low-level feature information lost in downsampling, the input feature map may be concatenated with the feature map passed from the encoder through a skip connection, followed by batch normalization. To reduce the number of parameters, the feature map can be fed into a depthwise separable convolution layer that reduces the number of channels. The processed feature map can then be sent through two consecutive crossed MLP modules, whose purpose is to fuse features along the height and width directions with the channel features so as to achieve global information aggregation, and the final-layer feature map can be processed by two linear layers to produce the output feature map.
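A minimal numpy sketch of the decoder stage just described: skip-connection concatenation, a stand-in for batch normalization, channel reduction, and a "crossed MLP" that mixes features along the height, width and channel axes. The shapes, the residual form, and the normalization stand-in are assumptions, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def axis_mlp(x, w_h, w_w, w_c):
    """Crossed-MLP block: fuse features along height, width and channel axes
    (residual form is an assumption)."""
    x = x + np.einsum('hwc,hk->kwc', x, w_h)   # mix along the height axis
    x = x + np.einsum('hwc,wk->hkc', x, w_w)   # mix along the width axis
    x = x + np.einsum('hwc,ck->hwk', x, w_c)   # mix along the channel axis
    return x

def decoder_stage(x, skip, w_reduce, w_h, w_w, w_c):
    """One decoder stage: concat skip features, normalize, reduce channels,
    then fuse spatial and channel information."""
    fused = np.concatenate([x, skip], axis=-1)              # skip-connection concat
    fused = (fused - fused.mean()) / (fused.std() + 1e-5)   # stand-in for batch norm
    fused = np.einsum('hwc,ck->hwk', fused, w_reduce)       # channel reduction (1x1-conv analogue)
    return axis_mlp(fused, w_h, w_w, w_c)

H, W, C = 8, 8, 4
x, skip = rng.normal(size=(H, W, C)), rng.normal(size=(H, W, C))
w_reduce = rng.normal(size=(2 * C, C))
w_h, w_w, w_c = rng.normal(size=(H, H)), rng.normal(size=(W, W)), rng.normal(size=(C, C))
out = decoder_stage(x, skip, w_reduce, w_h, w_w, w_c)
print(out.shape)  # (8, 8, 4)
```

In a trained model these weight matrices would be learned; here they are random placeholders that only demonstrate the data flow.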
In one embodiment, as shown in fig. 3, the pre-trained visual question-answer prediction model is trained by the following method, and may include the following steps:
step 301, acquiring a training sample set;
each training sample in the training sample set may be composed of a sample image and a plurality of question-answer information pairs contained in the sample image.
As an example, the sample image and the plurality of question-answering information pairs are acquired based on a financial service scene, for example, a sample visual question-answering task can be acquired based on an intelligent customer service, an intelligent consultation service and the like in the financial service scene, and further a training sample can be obtained based on the sample image and the plurality of question-answering information pairs contained in the sample image in the sample visual question-answering task.
In an example, an initial sample set (such as an image question-answer data set collected in fig. 2) collected in a financial service scenario may be obtained, and then, according to preset processing information, a data screening operation and an image size adjustment operation may be performed on the initial sample set, so as to obtain a training sample set and a test sample set, so that model training and model verification may be further performed by using the training sample set and the test sample set, respectively.
Step 302, constructing and obtaining a visual question-answer prediction model to be trained according to the fusion model with the double-branch structure and the multi-layer perception network; the first branch in the fusion model is the text feature extraction network which is adjusted based on a hierarchical structure attention mechanism, and the second branch in the fusion model is the image feature extraction network;
in a specific implementation, a text feature extraction network (such as an improved LSTM model in fig. 2) for processing a text feature extraction task may be separately constructed, and used as a first branch, and an image feature extraction network (such as an EAST model in fig. 2) for processing an image feature extraction task may be separately constructed, and used as a second branch, and then the first branch and the second branch may be fused to obtain a fusion model with a dual-branch structure, and further, a visual question-answer prediction model to be trained may be constructed according to the fusion model and the multi-layer perception network.
In an alternative embodiment, the initial text feature extraction network, such as an LSTM basic model, may be adjusted by combining a weighted accumulation mode and a hierarchical structure attention mechanism, so as to obtain a text feature extraction network, such as an improved LSTM model, then a fusion network model (i.e., a fusion model with a dual-branch structure) based on the EAST model and the improved LSTM model may be constructed, and further the constructed fusion network model may be improved, and a visual question-answer prediction model to be trained may be constructed by combining a multi-layer perception network (such as an MLP multi-layer perceptron module in fig. 2).
And 303, performing model training on the visual question-answer prediction model to be trained by adopting the training sample set to obtain the pre-trained visual question-answer prediction model.
In practical application, as shown in fig. 2, in the model training process, parameters of the visual question-answer prediction model to be trained may be updated according to the loss value output by the fusion network model, for example, the loss value of model training may be obtained by using a cross entropy loss function, and the following manner may be adopted to perform calculation:
J(θ) = −(1/m) · Σ_{i=1..m} [ y(i) · log h(x(i)) + (1 − y(i)) · log(1 − h(x(i))) ]
wherein J(θ) is the cross-entropy loss with respect to the parameters θ; y(i) is the ground-truth label, taking 0 or 1; m represents the number of categories (if a question has 3 answers, then m = 3); i indicates the i-th pixel; and h(x(i)) is the model's predicted probability for that pixel.
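A small numpy sketch of a cross-entropy loss of this kind; the averaging convention follows the surrounding description and is an assumed reading, since the patent renders its formula only as an image:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Average cross-entropy over m classes/pixels:
    J = -(1/m) * sum_i [ y_i*log h(x_i) + (1-y_i)*log(1-h(x_i)) ].
    y_true: binary / one-hot labels; y_pred: predicted probabilities h(x)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred)) / m

# Example: a question with m = 3 candidate answers, correct class = 0
y = np.array([1.0, 0.0, 0.0])
h = np.array([0.7, 0.2, 0.1])
loss = cross_entropy_loss(y, h)
print(round(float(loss), 4))  # 0.2284
```

The loss shrinks as the predicted probability mass moves onto the correct class, which is what drives the parameter updates described above.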
In the embodiment, each training sample in the training sample set is composed of a sample image and a plurality of question-answer information pairs contained in the sample image, where the sample image and the question-answer information pairs are acquired based on financial business scenarios. A visual question-answer prediction model to be trained is then constructed according to the fusion model with the dual-branch structure and the multi-layer perception network, where the first branch in the fusion model is the text feature extraction network adjusted based on a hierarchical structure attention mechanism and the second branch is the image feature extraction network. The training sample set is further used to perform model training on the visual question-answer prediction model to be trained, obtaining the pre-trained visual question-answer prediction model. Optimization of the visual question-answer prediction model is thus achieved: the fusion network with the dual-branch structure handles the different feature extraction tasks in a focused manner, and the multi-layer perception network performs pixel-level classification processing, providing a full visual representation for the input image to be predicted and improving the diversity and accuracy of the generated predictive question-answer information.
In one embodiment, before the step of obtaining the training sample set, the method may further include the following steps:
acquiring an initial sample set acquired under the financial business scene; different question-answer information pairs in each initial sample of the initial sample set have different query object types; performing data processing on the initial sample set according to preset processing information, and obtaining the training sample set and the test sample set according to the processed initial sample set; the preset processing information is used for indicating the data screening operation and the image size adjustment operation to the initial sample set.
In a specific implementation, in a financial business scenario, a plurality of different types of image question-answer data sets (i.e., initial sample sets) can be collected and obtained, for example, different objects such as animals, plants, people and the like in an image are queried, i.e., different question-answer information pairs have different query object types, and the image question-answer data sets can comprise a plurality of pictures, and each picture can correspond to a plurality of question-answer data (i.e., question-answer information pairs). In order to ensure that the acquired questions and answers in the data set are correct, the questions and answers of each type in the collected image question and answer data set can be checked, so that the image question and answer data set can be used for supporting model training and model testing.
For example, to obtain a data set containing a plurality of different types of image questions and answers, sample visual question-answering tasks can be obtained based on intelligent customer service and the like in a financial business scenario, or common public VQA (Visual Question Answering) data sets such as DAQUAR, COCO-QA, FM-IQA and Visual7W can be collected and downloaded; these data sets can be used to support model training and model testing.
In an example, to improve the training accuracy and speed of the model, the collected image question-answer data set may be collated and part of the data discarded (i.e., a data screening operation), so as to ensure that the answers in the retained data are relevant to the corresponding images; meanwhile, to ensure the size consistency of the images input to the model, the images in the image question-answer data set can be scaled or padded (i.e., an image resizing operation), so that the images input to the model have a consistent size and proportion.
In yet another example, for the collated image question-answer data set, the collated image question-answer data set may be divided into a training set and a test set according to a preset ratio (e.g. 8:2), that is, a training sample set and a test sample set are obtained, for example, 80% of samples randomly selected from the constructed image question-answer data set may be used as the training set, and the remaining 20% of samples may be used as the test set.
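The 8:2 split described above can be sketched as follows (the sample dictionary layout is hypothetical):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Randomly split a collated image question-answer data set into a
    training set and a test set at the preset ratio (8:2 in the example)."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original order is preserved
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical collated data set: one record per image with its QA pairs
samples = [{"image": f"img_{i}.png", "qa_pairs": []} for i in range(100)]
train_set, test_set = split_dataset(samples)
print(len(train_set), len(test_set))  # 80 20
```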
In one embodiment, after the step of obtaining the pre-trained visual question-answer prediction model, the method may further include the steps of:
acquiring preset evaluation information; the preset evaluation information is used for counting and predicting the accuracy degree of the question and answer result in the model test; and testing the pre-trained visual question-answer prediction model by adopting the test sample set, and combining the preset evaluation information and the predicted question-answer result output by the pre-trained visual question-answer prediction model to obtain a model test result of the pre-trained visual question-answer prediction model.
In practical application, as shown in fig. 2, a test set (i.e., the test sample set) may be used to evaluate the trained improved network model (i.e., the pre-trained visual question-answer prediction model) in combination with an evaluation index; that is, the model test result of the pre-trained visual question-answer prediction model is obtained by combining the preset evaluation information with the predicted question-answer results output by the model.
In one example, accuracy (accuracy) may be used as an evaluation index for a pre-trained visual question-answer prediction model for model evaluation, which may be expressed by the following formula:
Accuracy(ans) = min( n(ans) / 3, 1 )
wherein, for the same query information in any sample image, n(ans) is the number of different annotators who gave the answer ans; if at least three different annotators give the same answer, that answer can be regarded as 100% correct for the query information.
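The criterion above matches the min(n/3, 1) consensus rule commonly used for VQA accuracy; whether the patent uses exactly this formula is an assumption, so the sketch below should be read as one plausible realization:

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy: an answer counts as 100% correct when at least
    three annotators gave it, and is credited proportionally otherwise
    (the standard min(n/3, 1) VQA rule, assumed here)."""
    n = sum(1 for a in human_answers if a == predicted)
    return min(n / 3.0, 1.0)

print(vqa_accuracy("cat", ["cat", "cat", "cat", "dog"]))  # 1.0
print(vqa_accuracy("cat", ["cat", "dog", "bird"]))        # 0.333...
```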
In one embodiment, before the step of constructing the visual question-answer prediction model to be trained according to the fusion model with the dual-branch structure and the multi-layer perception network, the method may further include the following steps:
respectively constructing a text feature extraction network for processing a text feature extraction task as a first branch and an image feature extraction network for processing an image feature extraction task as a second branch; and fusing the first branch and the second branch to obtain the fused model with the double-branch structure.
In a specific implementation, the LSTM base model may be adjusted by combining a weighted accumulation manner and a hierarchical structure attention mechanism, so as to obtain a text feature extraction network for processing a text feature extraction task, and the text feature extraction network may be used as a first branch, for example, an improved LSTM model, so as to construct a fusion network model based on an EAST model and an improved LSTM model (i.e., a fusion model with a dual-branch structure).
In one example, the EAST model is a fully convolutional network and may have three parts: a feature extraction layer, a feature fusion layer and an output layer. To address the problem that text in a picture appears at different sizes, feature maps of different levels can be fused: low-level semantic information is adopted for predicting small text, and high-level semantic information is adopted for predicting large text. Skip connections can further be used to fuse the feature maps of different levels, which avoids the feature loss caused by drastic scale variation of text lines.
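The multi-level merge can be sketched as follows; nearest-neighbour upsampling and channel concatenation are assumptions standing in for the EAST merge branch:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge_levels(high, low):
    """Upsample the coarse, semantically rich map and concatenate it with the
    finer map, so small text benefits from low-level detail while large text
    benefits from high-level context (a sketch of the merge step above)."""
    return np.concatenate([upsample2x(high), low], axis=-1)

low = np.zeros((16, 16, 32))   # fine-grained features (small text)
high = np.zeros((8, 8, 64))    # coarse semantic features (large text)
merged = merge_levels(high, low)
print(merged.shape)  # (16, 16, 96)
```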
In one embodiment, the constructing a text feature extraction network for processing text feature extraction tasks may include the steps of:
and acquiring an initial text feature extraction network, and adjusting the initial text feature extraction network by combining a weighted accumulation mode and a hierarchical structure attention mechanism to obtain the text feature extraction network.
The hierarchical structure attention mechanism can be used for combining the characteristics of different text levels in the question-answer information with the self-attention mechanism so as to strengthen the capability of the network to acquire the structural information by utilizing the association degree among different hierarchical structures.
In one example, the LSTM model may be improved with a hierarchical attention mechanism: for example, based on a designed weighted-accumulation method, an ordinary sentence vector may be converted into a structurally weakly-associated sentence vector, and a word-, sentence- and text-level hierarchical attention mechanism may be constructed to improve the model's learning of structure.
For example, the LSTM model may use various gates to process data selectively, such as a forget gate, an input gate, an output gate, and a candidate "new value". The forget gate may take the previous state and the current word vector as inputs and record information using a neural network parameter matrix; its output vector can then be multiplied element-wise with the cell-state vector (the "conveyor belt") to implement selective memory and forgetting. The input gate can concatenate the previous state with the current input word vector and then process the result through a sigmoid activation function. The new value can process the output vector using the tanh function after concatenating the previous state and the current data.
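One step of a plain LSTM cell implementing the gates described above (a generic textbook formulation, not code from the patent; dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the forget
    gate f, input gate i, candidate ("new value") g and output gate o;
    c is the cell state acting as the 'conveyor belt'."""
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                 # candidate new value
    c = f * c_prev + i * g         # selective forgetting + selective memory
    h = o * np.tanh(c)             # gated output
    return h, c

d_in, d_h = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * d_h, d_in))
U = rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```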
In yet another example, based on the hierarchical structure attention mechanism, the characteristics of words, phrases, sentences and texts in a Chinese text can be combined with the self-attention mechanism, and the associations between different layers can be fully utilized to strengthen the ability to capture the structural information contained in the overall features. Similarity can be calculated between the text vector and each structurally weakly-associated sentence vector using a dot-product operation, where a sentence vector may correspond to the i-th sentence sequence in a training sample. The score can be divided by a scale factor, namely the square root of the word-vector dimension, so that the inner product does not become too large and the gradient stays more stable. The scores of all words can then be normalized through a Softmax function into positive values that sum to 1, yielding the weight distribution of the word vectors within the sentence vector; a weighted summation then produces the output correlation vector of the attention at the current position.
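The similarity, scaling, Softmax and weighted-sum pipeline described above is essentially scaled dot-product attention; a minimal numpy sketch follows, in which the vector dimensions and example data are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(query, sentences, d_k):
    """Dot-product similarity between the text vector (query) and each
    sentence vector, scaled by the square root of the word-vector dimension,
    normalized with Softmax, then used to weight-sum the sentence vectors."""
    scores = sentences @ query / np.sqrt(d_k)   # scaled dot products
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # Softmax: positive, sums to 1
    return weights @ sentences, weights

d = 4
text_vec = np.array([1.0, 0.0, 1.0, 0.0])
sent_vecs = np.array([[1.0, 0.0, 1.0, 0.0],
                      [0.0, 1.0, 0.0, 1.0],
                      [0.5, 0.5, 0.5, 0.5]])
context, w = scaled_dot_product_attention(text_vec, sent_vecs, d)
print(np.round(w, 3))  # the matching sentence gets the largest weight
```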
In one embodiment, as shown in FIG. 4, a flow diagram of another visual question-answer data processing method is provided. In this embodiment, the method includes the steps of:
in step 401, an initial sample set acquired in a financial business scenario is acquired; the different question-answer information pairs in each initial sample of the initial sample set have different query object types. In step 402, performing data processing on the initial sample set according to preset processing information, and obtaining a training sample set and a test sample set according to the processed initial sample set; the preset processing information is used for indicating the data screening operation and the image resizing operation to the initial sample set. In step 403, a visual question-answer prediction model to be trained is constructed according to the fusion model with the double-branch structure and the multi-layer perception network. In step 404, a training sample set is used to perform model training on the visual question-answer prediction model to be trained, so as to obtain a pre-trained visual question-answer prediction model. In step 405, preset evaluation information is obtained; the preset evaluation information is used for counting and predicting the accuracy degree of the question and answer result in the model test. In step 406, a test sample set is used to test the pre-trained visual question-answer prediction model, and a model test result of the pre-trained visual question-answer prediction model is obtained by combining the preset evaluation information and the predicted question-answer result output by the pre-trained visual question-answer prediction model. It should be noted that, the specific limitation of the above steps may be referred to the specific limitation of a visual question-answer data processing method, which is not described herein.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a visual question-answer data processing device for realizing the above-mentioned visual question-answer data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the visual question-answering data processing device or devices provided below may refer to the limitation of the visual question-answering data processing method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 5, there is provided a visual question-answer data processing apparatus including:
the to-be-predicted data acquisition module 501 is used for acquiring a to-be-predicted visual question-answer image and query information aiming at the visual question-answer image, and inputting the to-be-predicted visual question-answer image and the query information into the pre-trained visual question-answer prediction model; the pre-trained visual question-answering prediction model comprises a fusion network with a double-branch structure and a multi-layer perception network, wherein the visual question-answering image and the query information are generated in a financial business scene;
the visual question-answering prediction model processing module 502 is configured to process the query information through a text feature extraction network in the fusion network to obtain text feature information, and process the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
and a predicted question-answer information obtaining module 503, configured to perform pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information, to obtain a model output prediction result, where the model output prediction result is used as predicted question-answer information corresponding to the query information.
In one embodiment, the visual question-answer prediction model processing module 502 includes:
the image information extraction sub-module is used for extracting the image information of the visual question-answer image through an image feature extraction network in the fusion network;
the visual representation information obtaining sub-module is used for combining the text association region features and the text recognition features corresponding to the visual question-answering images and the image information to obtain visual representation information; the text associated region features and the text recognition features are used to adjust image understanding to provide a comprehensive visual language view;
and the target feature image obtaining sub-module is used for obtaining a target feature image as the image feature information according to the visual representation information and the visual question-answer image.
In one embodiment, the predictive question and answer information obtaining module 503 includes:
the image restoration sub-module is used for performing image restoration processing on the target feature image through the multi-layer perception network to obtain a processed feature image;
and the pixel-level classification processing sub-module is used for carrying out pixel-level classification processing on the processed characteristic image according to the text characteristic information by adopting a decoder of the multi-layer perception network to obtain the model output prediction result.
In one embodiment, the apparatus further comprises:
the training sample set acquisition module is used for acquiring a training sample set; each training sample in the training sample set consists of a sample image and a plurality of question-answer information pairs contained in the sample image; the sample image and the question-answer information pairs are acquired based on a financial service scene;
the model to be trained building module is used for building a visual question-answer prediction model to be trained according to the fusion model with the double-branch structure and the multi-layer perception network; the first branch in the fusion model is the text feature extraction network which is adjusted based on a hierarchical structure attention mechanism, and the second branch in the fusion model is the image feature extraction network;
and the model training module is used for carrying out model training on the visual question-answer prediction model to be trained by adopting the training sample set to obtain the pre-trained visual question-answer prediction model.
In one embodiment, the apparatus further comprises:
the initial sample acquisition module is used for acquiring an initial sample set acquired under the financial business scene; different question-answer information pairs in each initial sample of the initial sample set have different query object types;
The initial sample processing module is used for carrying out data processing on the initial sample set according to preset processing information, and obtaining the training sample set and the test sample set according to the processed initial sample set; the preset processing information is used for indicating the data screening operation and the image size adjustment operation to the initial sample set.
In one embodiment, the apparatus further comprises:
the evaluation information acquisition module is used for acquiring preset evaluation information; the preset evaluation information is used for counting and predicting the accuracy degree of the question and answer result in the model test;
and the model test module is used for testing the pre-trained visual question-answer prediction model by adopting the test sample set, and obtaining a model test result of the pre-trained visual question-answer prediction model by combining the preset evaluation information and a predicted question-answer result output by the pre-trained visual question-answer prediction model.
In one embodiment, the apparatus further comprises:
the branch construction module is used for respectively constructing a text feature extraction network for processing a text feature extraction task as a first branch and an image feature extraction network for processing an image feature extraction task as a second branch;
And the fusion model obtaining module is used for fusing the first branch and the second branch to obtain the fusion model with the double-branch structure.
In one embodiment, the branch construction module comprises:
the text feature extraction network obtaining sub-module is used for obtaining an initial text feature extraction network, and adjusting the initial text feature extraction network by combining a weighted accumulation mode and a hierarchical structure attention mechanism to obtain the text feature extraction network; the hierarchical structure attention mechanism is used for combining characteristics of different text levels in the question-answer information with the self-attention mechanism so as to strengthen the capability of the network to acquire the structural information by utilizing the association degree among different hierarchical structures.
The respective modules in the above-described visual question-answering data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a visual question-answer data processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring a visual question-answer image to be predicted and query information aiming at the visual question-answer image, and inputting the visual question-answer image and the query information into a pre-trained visual question-answer prediction model; the pre-trained visual question-answering prediction model comprises a fusion network with a double-branch structure and a multi-layer perception network, wherein the visual question-answering image and the query information are generated in a financial business scene;
processing the inquiry information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
And carrying out pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information to obtain a model output prediction result, which is used as predicted question-answer information corresponding to the query information.
In one embodiment, the processor, when executing the computer program, also implements the steps of the visual question-answer data processing method in the other embodiments described above.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a visual question-answer image to be predicted and query information aiming at the visual question-answer image, and inputting the visual question-answer image and the query information into a pre-trained visual question-answer prediction model; the pre-trained visual question-answering prediction model comprises a fusion network with a double-branch structure and a multi-layer perception network, wherein the visual question-answering image and the query information are generated in a financial business scene;
processing the inquiry information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answering image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
And carrying out pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information to obtain a model output prediction result, which is used as predicted question-answer information corresponding to the query information.
In one embodiment, the computer program when executed by the processor also implements the steps of the visual question-answer data processing method in the other embodiments described above.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps:
acquiring a visual question-answer image to be predicted and query information for the visual question-answer image, and inputting the visual question-answer image and the query information into a pre-trained visual question-answer prediction model; the pre-trained visual question-answer prediction model comprises a fusion network with a dual-branch structure and a multi-layer perception network, and the visual question-answer image and the query information are generated in a financial business scene;
processing the query information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answer image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
and performing pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information to obtain a model output prediction result as predicted question-answer information corresponding to the query information.
In one embodiment, the computer program when executed by the processor also implements the steps of the visual question-answer data processing method in the other embodiments described above.
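Taken together, the three repeated steps describe one forward pass: a text branch and an image branch of the fusion network produce features, and a multi-layer perception network classifies every pixel of the image against the text to yield the predicted answer. As a rough, non-authoritative sketch (random placeholder weights, a mean-pooled text branch standing in for the hierarchical-attention network, an identity image extractor, and arbitrary shapes, none of which are specified in the application):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DualBranchVQA:
    """Toy dual-branch model: text branch + image branch feed an MLP that
    classifies every pixel. All weights are random placeholders."""

    def __init__(self, txt_dim=8, img_channels=4, hidden=16, n_classes=3, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = txt_dim + img_channels
        self.w1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, n_classes)) * 0.1

    def text_branch(self, token_feats):
        # Pool token features into one query vector (stand-in for the
        # hierarchical-attention text network in the source).
        return token_feats.mean(axis=0)

    def image_branch(self, image):
        # Identity "feature extractor": returns an (H, W, C) feature map.
        return image

    def predict(self, image, token_feats):
        txt = self.text_branch(token_feats)               # (txt_dim,)
        img = self.image_branch(image)                    # (H, W, C)
        h, w, _ = img.shape
        txt_map = np.broadcast_to(txt, (h, w, txt.size))  # tile text over pixels
        fused = np.concatenate([img, txt_map], axis=-1)   # per-pixel fusion
        hidden = np.maximum(fused @ self.w1, 0.0)         # MLP with ReLU
        return softmax(hidden @ self.w2).argmax(axis=-1)  # per-pixel class map

rng = np.random.default_rng(1)
model = DualBranchVQA()
answer_map = model.predict(rng.standard_normal((5, 5, 4)),
                           rng.standard_normal((7, 8)))
print(answer_map.shape)  # (5, 5)
```

The per-pixel class map here corresponds loosely to the "model output prediction result" of the embodiments; mapping classes back to answer text would require the answer vocabulary, which this section does not disclose.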
Those skilled in the art will appreciate that all or part of the processes in the above-described method embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in various forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The foregoing embodiments illustrate only a few implementations of the application; their description is specific and detailed, but should not be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (12)

1. A method of visual question-answering data processing, the method comprising:
acquiring a visual question-answer image to be predicted and query information for the visual question-answer image, and inputting the visual question-answer image and the query information into a pre-trained visual question-answer prediction model; the pre-trained visual question-answer prediction model comprises a fusion network with a dual-branch structure and a multi-layer perception network, wherein the visual question-answer image and the query information are generated in a financial business scene;
processing the query information through a text feature extraction network in the fusion network to obtain text feature information, and processing the visual question-answer image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
and performing pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information to obtain a model output prediction result as predicted question-answer information corresponding to the query information.
2. The method according to claim 1, wherein the processing the visual question-answer image through the image feature extraction network in the fusion network to obtain image feature information includes:
extracting image information of the visual question-answer image through the image feature extraction network in the fusion network;
combining text-associated region features and text recognition features corresponding to the visual question-answer image with the image information to obtain visual representation information; the text-associated region features and the text recognition features are used to adjust image understanding so as to provide a comprehensive visual-language view;
and obtaining a target feature image according to the visual representation information and the visual question-answer image, and taking the target feature image as the image feature information.
3. The method according to claim 2, wherein the performing pixel-level classification processing on the image feature information through the multi-layer perceptual network based on the text feature information to obtain a model output prediction result comprises:
performing image restoration processing on the target feature image through the multi-layer perception network to obtain a processed feature image;
and performing, by a decoder of the multi-layer perception network, pixel-level classification processing on the processed feature image according to the text feature information, to obtain the model output prediction result.
4. The method of claim 1, wherein the pre-trained visual question-answer prediction model is trained by:
acquiring a training sample set; each training sample in the training sample set consists of a sample image and a plurality of question-answer information pairs contained in the sample image; the sample image and the question-answer information pairs are acquired based on a financial business scene;
constructing a visual question-answer prediction model to be trained according to the fusion model with the dual-branch structure and the multi-layer perception network; the first branch in the fusion model is the text feature extraction network adjusted based on a hierarchical structure attention mechanism, and the second branch in the fusion model is the image feature extraction network;
and performing model training on the visual question-answer prediction model to be trained by using the training sample set, to obtain the pre-trained visual question-answer prediction model.
5. The method of claim 4, wherein prior to the step of obtaining a set of training samples, the method further comprises:
acquiring an initial sample set acquired under the financial business scene; different question-answer information pairs in each initial sample of the initial sample set have different query object types;
performing data processing on the initial sample set according to preset processing information, and obtaining the training sample set and a test sample set from the processed initial sample set; the preset processing information is used for indicating a data screening operation and an image size adjustment operation on the initial sample set.
6. The method of claim 5, wherein after the step of obtaining the pre-trained visual question-answer prediction model, the method further comprises:
acquiring preset evaluation information; the preset evaluation information is used for evaluating the accuracy of predicted question-answer results in the model test;
and testing the pre-trained visual question-answer prediction model by using the test sample set, and obtaining a model test result of the pre-trained visual question-answer prediction model by combining the preset evaluation information with the predicted question-answer results output by the pre-trained visual question-answer prediction model.
7. The method according to claim 4, wherein before the step of constructing a visual question-answer prediction model to be trained from the fusion model with a dual-branch structure and the multi-layer perceptual network, the method further comprises:
respectively constructing a text feature extraction network for processing a text feature extraction task as a first branch and an image feature extraction network for processing an image feature extraction task as a second branch;
and fusing the first branch and the second branch to obtain the fusion model with the dual-branch structure.
8. The method of claim 7, wherein said constructing a text feature extraction network for processing text feature extraction tasks comprises:
acquiring an initial text feature extraction network, and adjusting the initial text feature extraction network by combining a weighted accumulation mode with a hierarchical structure attention mechanism, to obtain the text feature extraction network;
wherein the hierarchical structure attention mechanism is used for combining features of different text levels in the question-answer information with the self-attention mechanism, so as to strengthen the capability of the network to acquire structural information by utilizing the degree of association among different hierarchical structures.
9. A visual question-answering data processing apparatus, the apparatus comprising:
a to-be-predicted data acquisition module, configured to acquire a visual question-answer image to be predicted and query information for the visual question-answer image, and to input the visual question-answer image and the query information into a pre-trained visual question-answer prediction model; the pre-trained visual question-answer prediction model comprises a fusion network with a dual-branch structure and a multi-layer perception network, wherein the visual question-answer image and the query information are generated in a financial business scene;
a visual question-answer prediction model processing module, configured to process the query information through a text feature extraction network in the fusion network to obtain text feature information, and to process the visual question-answer image through an image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical structure attention mechanism;
and a predicted question-answer information obtaining module, configured to perform pixel-level classification processing on the image feature information through the multi-layer perception network based on the text feature information, to obtain a model output prediction result as predicted question-answer information corresponding to the query information.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202310492337.8A 2023-05-04 2023-05-04 Visual question-answer data processing method, device and computer equipment Pending CN116662497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310492337.8A CN116662497A (en) 2023-05-04 2023-05-04 Visual question-answer data processing method, device and computer equipment

Publications (1)

Publication Number Publication Date
CN116662497A true CN116662497A (en) 2023-08-29

Family

ID=87714408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310492337.8A Pending CN116662497A (en) 2023-05-04 2023-05-04 Visual question-answer data processing method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN116662497A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843030A (en) * 2023-09-01 2023-10-03 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model

Similar Documents

Publication Publication Date Title
Williams et al. Images as data for social science research: An introduction to convolutional neural nets for image classification
KR101865102B1 (en) Systems and methods for visual question answering
CN113240580A (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CA3069365A1 (en) Generation of point of interest copy
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
EP3885966A1 (en) Method and device for generating natural language description information
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN115830392A (en) Student behavior identification method based on improved YOLOv5
CN114419351A (en) Image-text pre-training model training method and device and image-text prediction model training method and device
Hong et al. Selective residual learning for visual question answering
CN116662497A (en) Visual question-answer data processing method, device and computer equipment
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN115204301A (en) Video text matching model training method and device and video text matching method and device
CN115310520A (en) Multi-feature-fused depth knowledge tracking method and exercise recommendation method
Wu [Retracted] Real Time Facial Expression Recognition for Online Lecture
CN116758450B (en) Video question-answering method based on collaborative attention reasoning of space-time diagram and similarity diagram
Gao et al. Remember and forget: video and text fusion for video question answering
Wang et al. TransMI: a transfer-learning method for generalized map information evaluation
CN116383426B (en) Visual emotion recognition method, device, equipment and storage medium based on attribute
Chappuis et al. The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation
CN114898339B (en) Training method, device, equipment and storage medium of driving behavior prediction model
Li et al. A method design of English teaching system based on video feedback method
US20240202281A1 (en) Drop-layer based vector augmentation
CN117407518B (en) Information screening display method and system based on big data analysis
CN117056770A (en) Questionnaire generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination