CN115221369A - Visual question answering implementation method and method for inspecting a visual question-answering model


Info

Publication number
CN115221369A
Authority
CN
China
Prior art keywords
expert
question
visual
feature
target
Prior art date
Legal status
Pending
Application number
CN202210664672.7A
Other languages
Chinese (zh)
Inventor
田俊峰
严明
徐海洋
李晨亮
王玮
闭彬
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210664672.7A
Publication of CN115221369A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/9032 Query formulation
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

A method for implementing visual question answering is provided. The method includes: receiving, with a hybrid expert model, a target question and a rich text picture corresponding to the target question, and outputting an answer to the target question, wherein the hybrid expert model includes a gating network and a plurality of expert models. The gating network determines the question type of the target question, confirms, based on the question type, that the target question belongs to a first question type among a plurality of question types, and provides the target question to a first expert model among the plurality of expert models; the first expert model provides the answer to the target question. Instead of using one general-purpose expert to process all questions, the method uses different expert models to process different questions in a targeted manner. This design allows multiple expert models to cooperate effectively, widens the performance boundary of the model, and improves the accuracy of the answers.

Description

Visual question answering implementation method and method for inspecting a visual question-answering model
Technical Field
The present disclosure relates to the field of neural network model applications, and in particular to a method for implementing visual question answering and a method for inspecting a visual question-answering model.
Background
Visual Question Answering (VQA) takes an image and a free-form, open-ended natural language question about the image as input and generates a natural language answer as output. For example, given an image and a series of questions, a machine is asked to reason about the image content, combined with some common knowledge, to arrive at the answers. Completing a VQA task requires an expert model with artificial intelligence capability, but an expert model tends to perform well on some questions and poorly on others.
Disclosure of Invention
In view of the above, the present disclosure is directed to a method for implementing visual question answering and a method for inspecting a visual question-answering model, so as to solve the existing technical problems.
According to a first aspect of the present disclosure, there is provided a method for implementing visual question answering, including: receiving a target question and a rich text picture corresponding to the target question and outputting an answer to the target question using a hybrid expert model, wherein the hybrid expert model includes a gating network and a plurality of expert models;
the gating network is configured to determine the question type of the target question, determine, based on the question type, that the target question belongs to a first question type among a plurality of question types, and provide the target question to a first expert model among the plurality of expert models;
the first expert model is configured to provide the answer to the target question.
Optionally, the expert model comprises: a word embedding expression module configured to encode the target question into a word embedding sequence, a visual encoder configured to encode the rich text picture into a visual feature sequence, and a transformer configured to apply attention weights to the word embedding sequence and the visual feature sequence to obtain a score matrix and to determine the answer to the target question from the score matrix.
Optionally, each word vector included in the word embedding sequence is obtained based on a modality type, position information of a corresponding word, and word embedding of the corresponding word, and each item included in the visual feature sequence is also obtained based on the modality type, the position information of a corresponding component of the rich text picture, and a visual feature component of the corresponding component of the rich text picture.
Optionally, in the transformer, inter-modality and intra-modality interactions are controlled by different attention weights.
Optionally, the visual feature of the corresponding component of the rich text picture is at least one of a region feature, a grid feature and a patch feature.
Optionally, the plurality of expert models include:
a text reading expert for answering questions related to textual information in the rich text image;
a counting expert for answering questions related to the number of objects in the rich-text picture;
a clock reading expert for answering questions related to a clock time in the rich text picture.
Optionally, the counting expert and the clock reading expert respectively extract region features, grid features and patch features from the rich text picture, fuse the region features, the grid features and the patch features, and match the fusion result with the text features extracted from the target question.
Optionally, in the fusion result, the region feature, the grid feature and the patch feature respectively adopt different attention weights.
Optionally, in the clock reading expert and the counting expert, the region feature and the grid feature each obtain a higher attention weight than the patch feature.
Optionally, the text reading expert acquires text information from the rich text picture by using OCR, and accordingly obtains a first word embedding sequence, serializes the rich text picture by using a cell to obtain a cell sequence, each item of the cell sequence includes at least one word in the text information, splices a second word embedding sequence corresponding to the target question with the first word embedding sequence to obtain a third word embedding sequence, and then uses the cell sequence and the third word embedding sequence as input of a word span prediction classifier, and provides an answer to the target question according to a prediction result.
Optionally, the method further comprises: dividing the received visual language question-answering task into a plurality of subtasks, each subtask including the rich text picture and a target question for the rich text picture.
According to a second aspect of the present disclosure, there is provided a method for knowledge mining based on a visual question and answer system, comprising:
collecting a plurality of samples with prediction scores lower than a set threshold value in the visual question answering system to form a sample set, wherein the samples comprise rich text pictures and target questions corresponding to the rich text pictures;
clustering the plurality of samples to form a plurality of subsample sets;
determining an expert model missing in the visual question-answering system from the subsample set.
Optionally, the method further comprises: constructing an expert model lacking in the visual question-answering system and training it with a corresponding subsample set of the plurality of subsample sets.
According to a third aspect of the present disclosure, there is provided an electronic device, comprising a memory and a processor, the memory storing a computer program operable on the processor, and the processor implementing the method of the first aspect or the method of the second aspect when executing the program.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect or the method of the second aspect described above.
For the VQA task, the various embodiments of the present disclosure do not use one general-purpose expert to process all questions, but use different expert models to process different questions in a targeted manner. For example, for a rich text picture containing a large amount of text information, a dedicated text understanding expert extracts the text information in the rich text picture to answer the posed question, while skill-specific scenarios such as clock reading and counting are handled by a clock reading expert and a counting expert, respectively. This design allows multiple expert models to cooperate effectively, widens the performance boundary of the model, and improves the accuracy of the answers. In addition, the encoding layers of each expert model use attention weights to dynamically control inter-modal and intra-modal interactions, improving the performance of cross-modal fusion.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which refers to the accompanying drawings in which:
FIG. 1 illustrates a block diagram of a hybrid expert model provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram showing a basic structure of an expert model proposed by an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of an exemplary hybrid expert model;
FIG. 4 is a flow chart illustrating a method for implementing a visual question answering provided by an embodiment of the present disclosure;
FIG. 5 illustrates a method for inspecting a visual question-answering model provided by an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of an application of an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an electronic device in which an embodiment of the disclosure is deployed.
Detailed Description
The present disclosure is described below based on examples, but it is not limited to these examples. In the following detailed description, some specific details are set forth; it will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, and components have not been described in detail so as not to obscure the essence of the present disclosure. The figures are not necessarily drawn to scale.
It should be appreciated that the expert models for answering VQA are built on pre-trained vision-language models. Currently, there are two main architectures for vision-language models: the single-stream architecture and the dual-stream architecture. The former assumes that the fundamental semantics behind the two modalities are simple and clear, and simply concatenates image and text features as the input to a single transformer for early fusion in a straightforward manner. This paradigm uses the self-attention mechanism to learn cross-modal semantic alignment from the underlying feature level. However, the single-stream design treats both modality inputs equally, so the different characteristics inherent to each modality are not fully exploited. In contrast, the dual-stream architecture first learns high-level abstractions of the image and sentence representations separately with separate transformer encoders, and then combines the two modalities with a cross-modal transformer. Such designs explicitly distinguish the different modality inputs and align the cross-modal representations at a higher semantic level, but they are less parameter-efficient and may ignore associations at the base feature level. In short, the single-stream architecture treats the two input modalities essentially equally and therefore does not fully utilize the signals of each modality, while the dual-stream architecture is not sufficient to capture fine-grained interactions between visual and textual cues.
To overcome the limitations of these architectures, the expert model used in the embodiments of the present disclosure is based on a single-stream architecture, but replaces the original self-attention weights in the transformer with weighted attention, dynamically adjusting intra-modal and inter-modal attention to achieve effective cross-modal semantic alignment. Meanwhile, since a single expert model does not answer all questions well, the embodiments of the present disclosure use a hybrid expert model composed of multiple single expert models to handle various questions.
Referring to FIG. 1, the hybrid expert model 100 receives a target question and a rich text picture corresponding to the target question, and outputs the answer to the target question. The target question is a free-form, open-ended natural language question, and the rich text picture may include natural language text information, image information, or both. The hybrid expert model 100 includes a plurality of expert models 102 and a gating network 101.
The plurality of expert models 102 correspond to a plurality of question types, with each expert model dedicated to one question type. The gating network 101, in conjunction with the knowledge graph 103, determines the question type to which each question belongs and delegates the question to the appropriate expert model, as sketched below. The knowledge graph 103 is a large-scale semantic network that can be built and integrated into the system through an artificial intelligence process.
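For illustration only, the routing behavior of the hybrid expert model can be sketched as follows; the class and method names are assumptions made for this example and are not part of the disclosure:

```python
class HybridExpertModel:
    """Minimal sketch of hybrid expert model 100: a gating network picks one expert
    per question. The interfaces (classify/answer, the "other" fallback key) are
    assumptions for the example, not the patent's implementation."""
    def __init__(self, gating_network, experts):
        self.gating_network = gating_network   # maps (question, picture) -> question type
        self.experts = experts                 # dict: question type -> expert model

    def answer(self, question, rich_text_picture):
        question_type = self.gating_network.classify(question, rich_text_picture)
        expert = self.experts.get(question_type, self.experts["other"])
        return expert.answer(question, rich_text_picture)
```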
The basic structure of each expert model 102 is shown in FIG. 2 and includes a word embedding expression module 201, a visual encoder 202, and a transformer 203. Referring to FIG. 2, the word embedding expression module 201 converts the input question into a word embedding sequence. Specifically, the input question can be tokenized, using the tokenizer and word markers of BERT, into a sequence of tokens (w_1, ..., w_m), where each token w_i carries three learnable embeddings: a modality-type embedding, a position embedding, and a word embedding. To provide better text features, the text-stream parameters may be initialized from three different pre-trained language models, BERT, RoBERTa, and StructBERT, so that the word embedding of each word in the input question benefits from a larger corpus and from additional word-ordering and sentence-ordering information. In an alternative embodiment, before the word embedding expression module 201 outputs the word vectors, the three embeddings (modality type, position, and word embedding) are summed and normalized, so that the input question is represented as a word embedding sequence E_emb = {e_CLS, e_1, ..., e_m, e_SEP}, where [CLS] and [SEP] are the special tokens in BERT and each e_i is obtained by summing and normalizing the modality-type, position, and word embeddings of the corresponding token.
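As a hedged illustration of the summation-and-normalization just described, a minimal PyTorch sketch could look as follows; the class name, vocabulary size, and dimensions are assumptions chosen for the example:

```python
import torch
import torch.nn as nn

class WordEmbeddingExpression(nn.Module):
    """Sketch of word embedding expression module 201: sum the word, position, and
    modality-type embeddings of each token and normalize the result."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_modalities=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)       # word embedding
        self.pos_emb = nn.Embedding(max_len, hidden)            # position embedding
        self.type_emb = nn.Embedding(num_modalities, hidden)    # modality-type embedding
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids):                               # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        text_type = torch.zeros_like(token_ids)                 # modality id 0 = text
        e = self.word_emb(token_ids) + self.pos_emb(positions) + self.type_emb(text_type)
        return self.norm(e)                                     # E_emb = {e_CLS, e_1, ..., e_SEP}
```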
The visual encoder 202 is used to obtain and encode visual features. To fully understand the visual content, three kinds of visual features need to be considered: region features, grid features, and patch features. Region features can locate salient objects in the image well, which makes them more suitable for tasks such as object counting, while grid features are better at capturing global or background information in the image.
Region features: since the discovery of "bottom-up" attention, region-based visual features have become the de facto standard for vision-and-language tasks. Unlike ordinary "top-down" attention, which may focus directly on semantically irrelevant parts of the visual input, bottom-up attention uses a pre-trained object detector to identify salient regions in the visual input. The image is thus represented by a set of region-based features that can better locate individual objects and capture detailed semantics in the image content. Typically, region-based visual encoders are pre-trained using detection data such as Visual Genome. Recently, VinVL was built on top of a large-scale pre-trained object-attribute detection model trained on a large amount of data from four common object detection datasets, which helps to better capture coarse-grained and fine-grained visual semantic information in images. In this work, an object-level region feature set with more detailed visual semantics is extracted by the object detector from VinVL, where each object o_j is represented as a 2048-dimensional feature vector r_j. To capture the spatial information of the targets, the box-level position feature of each target is also encoded as a four-dimensional vector built from the corner coordinates normalized by the image size, i.e.,
l_j = (x_1/W, y_1/H, x_2/W, y_2/H),
where (x_1, y_1) and (x_2, y_2) are the coordinates of the upper-left and lower-right corners of the region containing the target object, and W and H are the width and height of the image. r_j and l_j are concatenated to form a position-sensitive target feature vector, which is then converted to the lower dimension D with a linear projection to ensure that it has the same dimension as the word embedding sequence output by the word embedding expression module 201. To distinguish the different modalities, a learnable modality-type embedding is added to the output of the linear projection layer. Finally, the visual feature sequence based on region features is fed as input to the transformer 203.
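A minimal sketch of this region-feature encoding (concatenating the 2048-dimensional appearance vector with the normalized box, projecting to D, and adding a modality-type embedding) is shown below; the class name and normalization details are assumptions for the example:

```python
import torch
import torch.nn as nn

class RegionFeatureEncoder(nn.Module):
    """Sketch: turn detector regions into a D-dimensional visual token sequence."""
    def __init__(self, region_dim=2048, hidden=768, num_modalities=2):
        super().__init__()
        self.proj = nn.Linear(region_dim + 4, hidden)   # r_j concatenated with l_j -> D
        self.type_emb = nn.Embedding(num_modalities, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, region_feats, boxes, image_wh):
        # region_feats: (batch, num_regions, 2048); boxes: (batch, num_regions, 4) as x1, y1, x2, y2
        w, h = image_wh
        scale = torch.tensor([w, h, w, h], dtype=boxes.dtype, device=boxes.device)
        pos = boxes / scale                              # l_j = (x1/W, y1/H, x2/W, y2/H)
        x = self.proj(torch.cat([region_feats, pos], dim=-1))
        vis_type = torch.ones(x.shape[:2], dtype=torch.long, device=x.device)  # modality id 1 = vision
        return self.norm(x + self.type_emb(vis_type))
```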
Grid features: to address the limitations of region-based features (such as their locality), it has been proposed to revisit grid-based convolutional features for multi-modal learning, thereby skipping the expensive region-related steps.
The advantages of grid features are: 1) grid-based features allow more flexible architectural designs for vision-and-language tasks, which makes end-to-end training and efficient online inference possible; 2) they operate on the complete image rather than on a collection of semantic regions, so they can better capture global information of the image, such as background information; 3) they do not rely on a pre-trained object detector with a limited visual vocabulary. Specifically, starting from an original image with 3 color channels, I ∈ R^(3×H_0×W_0), a fixed CNN-based image encoder (such as ResNet) generates a low-resolution activation map F_img ∈ R^(C×H×W), where C is the channel dimension and H×W is the down-sampled spatial resolution. Since the cross-modal fusion network requires a sequence as input, the spatial dimensions of F_img are flattened, producing an HW×C feature map. A linear projection layer then maps the channel dimension from C down to a smaller D to match the dimension of the word embedding sequence output by the word embedding expression module 201. To distinguish the different modalities, a learnable modality-type embedding is added to the output of the linear projection layer. Finally, the visual feature sequence based on grid features is fed as input to the transformer 203.
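Analogously, flattening a CNN activation map into a grid-feature sequence can be sketched as follows (the backbone channel width and names are assumptions):

```python
import torch
import torch.nn as nn

class GridFeatureEncoder(nn.Module):
    """Sketch: flatten a C x H x W activation map into an HW x D visual sequence."""
    def __init__(self, cnn_channels=2048, hidden=768, num_modalities=2):
        super().__init__()
        self.proj = nn.Linear(cnn_channels, hidden)            # channel C -> smaller D
        self.type_emb = nn.Embedding(num_modalities, hidden)

    def forward(self, feature_map):                            # (batch, C, H, W) from e.g. ResNet
        b, c, h, w = feature_map.shape
        x = feature_map.flatten(2).transpose(1, 2)             # (batch, H*W, C)
        x = self.proj(x)
        vis_type = torch.ones(b, h * w, dtype=torch.long, device=x.device)
        return x + self.type_emb(vis_type)
```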
To generate good grid features, a powerful CNN-based image encoder must be pre-trained so that visual semantic information is incorporated into it. Regarding the data used to pre-train the image encoder, there are two main routes: 1) supervised pre-training: the image encoder is pre-trained using image classification data (e.g., ImageNet) or detection data (e.g., Visual Genome); the large-scale object and attribute annotations collected in Visual Genome are very helpful for providing grid features that contain visual semantics; 2) unsupervised pre-training: the image encoder is pre-trained using a large number of unlabeled image-text pairs without manual supervision, where approximately 400M aligned image-text pairs are used. This route belongs to the line of learning visual representations from natural language supervision; in this way, the image encoder is naturally semantically aligned with text, which facilitates cross-modal fusion. It is well known that a fully supervised pre-trained CNN model performs well on in-domain or near-domain datasets, whereas it may not yield optimal performance on out-of-domain datasets.
Patch features: the Vision Transformer (ViT) achieves excellent performance in various visual tasks. It first divides the image into patches of fixed size and then applies a simple linear projection to the patches; to distinguish the different modalities, a learnable modality-type embedding is added to the output of the linear projection layer before the sequence is fed into the transformer. ViLT was the first work to explore patch-based features for multimodal learning, with inference speeds tens of times faster than previous region-based VLP methods. Patch-based features have the following advantages: 1) in the online inference stage, this simple framework is more efficient than grid-based convolutional features; 2) by capturing the global structure of the complete image more effectively through the self-attention mechanism, they can provide complementary visual features that differ from region-based and grid-based visual features. Specifically, a 2D image I_img ∈ R^(3×H_0×W_0) is reshaped into a sequence of flattened 2D patches I_p ∈ R^(N×(P²·C)), where (H_0, W_0) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = H_0·W_0/P² is the number of patches produced, which also serves as the input sequence length of the encoder transformer. The patches are then flattened and embedded into D dimensions with a trainable linear projection to match the dimension of the word embedding sequence output by the word embedding expression module 201, and the coordinate information of each patch is used as a position code to capture the geometric relationships between patches. Finally, the visual feature sequence based on patch features is fed as input to the transformer 203.
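A ViT-style patchify step can be sketched as follows; the patch size, hidden size, and class name are assumptions chosen for the example:

```python
import torch
import torch.nn as nn

class PatchFeatureEncoder(nn.Module):
    """Sketch: split the image into P x P patches and project each to D dimensions."""
    def __init__(self, patch=16, channels=3, hidden=768, num_modalities=2):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * channels, hidden)   # trainable linear projection
        self.type_emb = nn.Embedding(num_modalities, hidden)

    def forward(self, images):                     # (batch, 3, H0, W0), H0 and W0 divisible by P
        b, c, h, w = images.shape
        p = self.patch
        patches = images.unfold(2, p, p).unfold(3, p, p)          # (b, c, H0/P, W0/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)  # N = H0*W0/P^2
        x = self.proj(patches)
        vis_type = torch.ones(x.shape[:2], dtype=torch.long, device=x.device)
        return x + self.type_emb(vis_type)
```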
The visual encoder 202 may employ at least one of region features, grid features, and patch features to generate the visual feature sequence. In some embodiments, the visual encoder 202 obtains all three kinds of visual features, encodes each of them, generates a visual feature sequence in which the three kinds of features are fused, and outputs the visual feature sequence to the transformer 203.
Although the existing vision-language model SemVLP proposes a joint representation of vision and language to align cross-modal semantics at multiple levels, it is based on a shared transformer encoder with a specific self-attention mask for cross-modal interaction. The interaction between the two modalities is therefore controlled by a fixed self-attention mask with only two modes: interact or do not interact. In contrast, the transformer 203 in the expert model of the disclosed embodiments uses different learnable attention weights at each layer to dynamically control inter-modal and/or intra-modal interactions. In some embodiments, the transformer 203 is built on a transformer whose encoder may include multiple encoding layers, each of which may use different learnable attention weights to dynamically control inter-modal and intra-modal interactions, and a transformer decoder is used to construct a classifier that produces prediction scores from the encoding results.
In the single-stream model, the input to the transformer layer is the concatenation of the two modalities, X = [X_L | X_V]. Thus, in each single-stream attention head, the query representation is as follows:
Q = X·W^Q = [Q_L ; Q_V],    (1)
where X_L and X_V are the text and visual sub-matrices of the input, Q_L and Q_V are the corresponding outputs, and W^Q is a weight matrix (the keys and values are obtained analogously). As shown in FIG. 2, the attention score matrix S can be divided into four sub-matrices:
S = [ S_LL  S_LV ; S_VL  S_VV ].    (2)
Two learnable attention weights, ε_1 and ε_2, are then introduced for the intra-modal attention score sub-matrices (the main diagonal of S) and the inter-modal attention score sub-matrices (the anti-diagonal of S), respectively. In the encoder of each transformer 203, the two learnable weights are multiplied with the attention score matrix to obtain a new attention score matrix:
S' = [ ε_1·S_LL  ε_2·S_LV ; ε_2·S_VL  ε_1·S_VV ].    (3)
The attention weights ε_1 and ε_2 are learned in one of two ways:
● The weights come from a single-layer feed-forward neural network (FFN) with a sigmoid activation function. V_CLS (the representation of [CLS]) is used as the input feature to reflect how well the image matches the text, which provides a useful signal for weighing inter-modal and intra-modal interactions:
(ε_1, ε_2) = FFN(V_CLS)    (4)
● The self-attention weights are learned directly as two parameters with specified initial values:
(ε_1, ε_2) = nn.Parameter(init_value_1, init_value_2)    (5)
in summary, for three types of visual features: the system comprises region features, grid features and facial features, and each type of visual features and text features can be fused through a learning participation mechanism respectively to realize cross-modal interaction.
The above describes the representations of the different types of image features. To capture both the local and global semantics of the image and obtain different types of visual feature representations, the different types of visual features are concatenated together for fusion. In some embodiments, the region features, grid features, and patch features use different attention weights when the different types of visual features are fused. The fusion result is then combined with the text embeddings and used as the input of the transformer, so that the attention weights of the modalities, and of the different types of image features within the visual modality, are obtained through a learn-to-attend mechanism. This approach is called Fusion VLP (Fusion-VLP).
In the pre-training phase, there are three types (linguistic, visual, and cross-modal) of pre-training tasks.
1. Masked LM Prediction. The task setting is substantially the same as in BERT. The masked words are predicted with the help of the visual modality, which helps to resolve ambiguity.
2. Masked Object Prediction. Similarly, the visual side is pre-trained with randomly masked targets. Specifically, 15% of the image objects are randomly masked, and the model is then asked to predict the attributes of these masked objects from the output object representations O_L.
3. Image-Text Matching. The task randomly samples 50% unmatched image-text pairs and 50% matched image-text pairs and trains a classifier to predict whether an image and a sentence match each other.
4. Image Question Answering (QA). The image question answering task is converted into a classification problem, and the model is pre-trained with QA data. A classifier is then built on top of the corresponding output representation of the model.
During pre-training, Region-VLP (VLP with region features) is pre-trained with all four pre-training tasks, with the four losses added with equal weight. Grid-VLP (VLP with grid features) is pre-trained with the pre-training tasks other than masked object prediction, since grid features do not capture explicit object semantics. For Patch-VLP (VLP with patch features) and Fusion-VLP, the masked object prediction task is likewise removed during pre-training. During fine-tuning after pre-training, the complete region/grid/patch features can be used to retain all the extracted visual information, and the hidden states of the last layer are used for the cross-modal computation.
Fig. 3 is a block diagram of an exemplary hybrid expert. As shown in fig. 3, the plurality of experts includes a text reading expert 303, a counting expert 304, a clock reading expert 305, and other experts 306. The text reading expert 303 is used to answer questions related to textual information in the rich text image. The counting expert 304 is used to answer questions related to the number of objects in the rich text picture. The clock reading expert 305 is used to answer questions related to the clock time in the rich text picture. Other experts 306 are used to answer the remaining questions.
The text reading expert 303 reads the image text (text information contained in the image) in the rich text picture to obtain the answer to the target question. Specifically, the text reading expert includes a word embedding expression module and a visual encoder as shown in FIG. 2, together with a word span prediction classifier. The word embedding expression module is again used to convert the target question into a word embedding sequence. The visual encoder is built on StructuralLM; to adapt StructuralLM to image text, the pre-trained StructuralLM is fine-tuned. First, an OCR tool recognizes the image text in the picture and obtains the character information, which is converted into a word embedding sequence; the image is serialized with cells (bounding boxes) ordered from the upper-left corner to the lower-right corner of the image; the word embedding sequence corresponding to the target question is concatenated with the word embedding sequence generated from the OCR output to obtain a new word embedding sequence; and the concatenated word embedding sequence and the cell sequence are then organized together. For example, each image is represented as a sequence of cells {c_1, ..., c_n}, where each c_i contains a series of words {w_1^i, ..., w_k^i}. A separator [SEP] is added between every two bounding boxes to separate them, which gives the input sequence {q_1, ..., q_e, [SEP], c_1, [SEP], c_2, ..., [SEP], c_n}, where {q_1, ..., q_e} is the word embedding sequence obtained by concatenating the word embedding sequence of the target question with the OCR output. StructuralLM is then pre-trained in the same way as for document image pre-training.
Finally, a token-level word span prediction classifier is built on the word representations to perform the QA task. The embedding vector E_emb = {e_CLS, e_1, ..., e_m, e_SEP} and the StructuralLM encoding of the input sequence {q_1, ..., q_e, [SEP], c_1, [SEP], c_2, ..., [SEP], c_n} are provided as input to the word span prediction classifier to produce a prediction, and the answer to the question is then given according to the prediction result, with the added separators removed from the predicted answer span.
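Purely as an illustration of the cell serialization described above, the input sequence can be assembled as follows (the function name and token format are assumptions for the example):

```python
def build_structurallm_input(question_tokens, ocr_cells, sep="[SEP]"):
    """Sketch: arrange {q_1, ..., q_e, [SEP], c_1, [SEP], c_2, ..., [SEP], c_n}.
    `ocr_cells` is assumed to be a list of word lists, one per bounding box, already
    ordered from the upper-left to the lower-right of the image."""
    sequence = list(question_tokens)
    for cell_words in ocr_cells:
        sequence.append(sep)            # a separator between every two bounding boxes
        sequence.extend(cell_words)
    return sequence

# Hypothetical usage:
# build_structurallm_input(["what", "brand", "is", "shown"], [["COCA"], ["COLA", "2L"]])
# -> ['what', 'brand', 'is', 'shown', '[SEP]', 'COCA', '[SEP]', 'COLA', '2L']
```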
Since clock reading requires specific prior knowledge, it is still difficult to read the exact time from a clock, so a clock reading expert 305 is introduced to solve such problems. The clock reading expert 305 consists of a clock detector and a clock reader. The clock detector detects the clock in the image, which is essentially an object detection task; a cascaded RCNN may serve as the backbone network of the clock detector, and a binary classification loss and a bounding box regression loss are used for training, as in a standard detection framework. The bounding box detected by the clock detector is fed into the clock reader, which reads the exact time on the clock. Clock reading is modeled as both a classification task and a regression task.
In the clock reader, ResNet50-IBN can be used as the backbone, and two dedicated branches are introduced for hour prediction and minute prediction, respectively. Furthermore, since the hour and minute hands of a clock are the key to predicting the time, an attention module is introduced that forces the model to focus on the hands: channel-level attention is performed with an SE layer after the backbone, and spatial-level attention is performed with a spatial attention module consisting of a convolutional layer and a ReLU activation at the beginning of the hour branch and the minute branch, respectively. This combination of channel-level and spatial-level attention can accommodate the respective biases of the hour and minute predictions. The feature outputs of the two branches are:
f_m = E_m(Attn_sp(F)·F),  f_h = E_h(Attn_sp(F)·F),    (6)
where I is the input image, E, E_h, and E_m are the backbone, the hour branch, and the minute branch, respectively, and F = Attn_se(E(I)) is the backbone feature map after the SE layer.
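A minimal sketch of the two-branch design in Eq. (6) is given below; the backbone, layer sizes, and classification heads are placeholders chosen for the example, not the disclosed network:

```python
import torch
import torch.nn as nn

class ClockReader(nn.Module):
    """Sketch: shared backbone with SE channel attention, then spatial attention and
    separate hour/minute heads, following f = Attn_sp(F) * F with F = Attn_se(E(I))."""
    def __init__(self, channels=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU())
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())       # channel attention
        self.attn_sp = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.ReLU()) # spatial attention
        self.hour_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(channels, 12))    # 12-way hour logits
        self.minute_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                           nn.Linear(channels, 12))  # 12-way minute logits

    def forward(self, clock_crop):                 # (batch, 3, H, W), the box from the clock detector
        feat = self.backbone(clock_crop)
        f = feat * self.se(feat)                   # F = Attn_se(E(I))
        f = self.attn_sp(f) * f                    # Attn_sp(F) * F
        return self.hour_branch(f), self.minute_branch(f)
```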
Since the clock reader is formulated as both a classification task and a regression task, losses are introduced from both perspectives. From the classification perspective, a 12-class classification loss is used for the hour and minute predictions, taking the cross-entropy loss:
L_cls = −Σ_{c=1}^{N} y_c·log(p_c),    (7)
where y_c is the ground-truth label, p_c is the predicted probability of class c, and N is the number of categories, set to 12.
From the regression perspective, since the clock dial is periodic, a regression loss with a periodicity constraint is used:
L_reg = (1/B)·Σ_i (1 − cos(2π·(p_i − g_i)/C)),    (8)
where B is the batch size, the cosine term imposes the periodicity constraint on the clock prediction, C is the hour or minute period, set to 60 steps, and p_i and g_i are the prediction and the ground truth.
Since one full turn of the minute hand corresponds to the hour hand advancing by 5 minute marks, a self-supervision loss L_self relating the hour and minute predictions is further introduced; this hour-minute self-supervision is treated as a regularization term to improve the generalization ability of the clock reader.
Finally, the total loss is:
L = L_cls + L_self + λ·L_reg,    (10)
where λ is a weighting coefficient set to 0.01.
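For illustration, the three terms can be combined as sketched below; the argument layout is an assumption, and the self-supervision term is passed in precomputed because its exact form is not spelled out above:

```python
import math
import torch
import torch.nn.functional as F

def clock_reader_loss(hour_logits, minute_logits, hour_label, minute_label,
                      minute_pred, minute_gt, l_self, lam=0.01, period=60.0):
    """Sketch of L = L_cls + L_self + lambda * L_reg (Eqs. (7), (8), (10)).
    hour_logits/minute_logits: (B, 12) classification outputs;
    minute_pred/minute_gt: (B,) continuous readings in [0, 60) steps;
    l_self: precomputed self-supervision term (assumed given)."""
    # classification view: 12-way cross-entropy for the hour and minute heads (Eq. (7))
    l_cls = F.cross_entropy(hour_logits, hour_label) + F.cross_entropy(minute_logits, minute_label)
    # regression view: cosine periodicity constraint with period C = 60 steps (Eq. (8))
    l_reg = (1 - torch.cos(2 * math.pi * (minute_pred - minute_gt) / period)).mean()
    return l_cls + l_self + lam * l_reg            # Eq. (10) with lambda = 0.01
```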
The counting expert 304 may obtain the region features among the visual features and identify the targets to be counted based on the region features; for example, the above-mentioned object detector and object-attribute detector may be used to detect objects and identify the targets. It should be understood that the clock reading expert 305 and the counting expert 304 are both experts that answer based on image information. The other experts 306 are assigned the remaining question types that do not belong to the aforementioned experts; they can be regarded as a general-purpose expert model, and when a question cannot be classified to a specific expert, it is routed to the other experts.
It should be understood that a VQA task may require both a text understanding expert and other experts to respond; thus, in some embodiments, the task may be divided into multiple subtasks, each of which is answered by one expert, and the answer results are then combined and output.
Fig. 4 is a flowchart of a method for implementing visual question answering according to an embodiment of the present disclosure. The method includes receiving, with a hybrid expert model, a target question and a rich text picture corresponding to the target question and outputting the answer to the target question, wherein the hybrid expert model includes a gating network and a plurality of expert models. The method specifically includes the following steps.
In step S401, the hybrid expert model receives a rich text picture and a target question for the rich text picture.
In step S402, the gating network determines that the target question is of a first question type among a plurality of question types.
In step S403, the gating network inputs the rich text picture and the target question for the rich text picture to a first expert model of the plurality of expert models according to the first question type.
In step S404, the first expert model acquires an answer to the target question.
In the present embodiment, for the VQA task, one general-purpose expert is no longer used to process all questions; instead, different expert models process different questions in a targeted manner. For example, for a rich text picture containing a large amount of text information, a dedicated text understanding expert extracts the text information in the rich text picture to answer the posed questions, while skill-specific scenarios such as clock reading and counting are handled by expert models such as the clock reading expert and the counting expert. This design allows multiple expert models to cooperate effectively, widens the performance boundary of the model, and improves the accuracy of the answers.
Further, since VQA is constantly evolving, even though a VQA system has integrated several expert models, there are still questions that are difficult to solve by combining different feature representations with language-model pre-training. To address these questions and allow the model to evolve, the present disclosure proposes a hybrid-expert framework to handle these VQA tasks. Briefly, the framework covers three aspects: 1) discovering the artificial intelligence capability that is currently missing; 2) constructing an expert model with the corresponding skill to make up for the missing capability; and 3) adaptively matching the expert models, that is, training the gating network so that it autonomously selects the corresponding expert model to answer each question in a targeted manner.
In a specific implementation, a number of low-confidence samples are collected from, for example, a VQA system, and clustered to form new subtasks. For example, given a base model M (e.g., a generic vision-language model), samples for which model M has difficulty giving the correct answer with high confidence are first collected. The model cannot handle these samples well with its existing knowledge, so an expert model with additional knowledge is needed to handle them. Given a sample t, the base model M gives a prediction with a confidence score s, computed from the output score on the predicted label of the base model. A sample whose confidence score s falls below a threshold is therefore treated as a low-confidence sample, and a large number of such samples are collected in this way. These low-confidence samples are then divided into sub-sample sets with a typical clustering algorithm such as K-Means, and a dedicated expert model is constructed for one or more of the sub-sample sets. Finally, the newly constructed expert model and the original base model are combined into a complete hybrid expert model through the gating network of the hybrid expert model, and the hybrid expert model is then trained. For example, in one practice, the VQA system initially includes only one generic model for processing all VQA tasks; over time the system captures questions whose answer confidence scores are low and determines that these questions concern the time on a clock in the picture and the text information in the picture, for which a clock reading expert and a text understanding expert are then trained.
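A minimal sketch of this mining-and-clustering procedure is given below; the confidence threshold, the cluster count, and the `confidence`/`embed` interfaces on the base model are hypothetical and introduced only for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_missing_experts(samples, base_model, threshold=0.5, num_clusters=5):
    """Sketch: collect samples the base model answers with low confidence, then cluster
    them into candidate subtasks, each suggesting a missing expert skill."""
    low_confidence = [s for s in samples if base_model.confidence(s) < threshold]

    # cluster the hard samples (e.g. via an embedding of question + picture) with K-Means
    features = np.stack([base_model.embed(s) for s in low_confidence])
    labels = KMeans(n_clusters=num_clusters).fit_predict(features)

    return [[s for s, k in zip(low_confidence, labels) if k == cluster]
            for cluster in range(num_clusters)]
```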
Returning to the specific implementation: suppose that expert M_t is designated for subtask t. The expert gives an answer, and the reward score S_t between the predicted answer and the human annotation labels is calculated using equation (13). The reward score S_t is used for supervised training, so that the gating network learns to route each instance to its best-matching expert. During training, the following binary cross-entropy (BCE) objective L is maximized:
L = Σ_t [ S_t·log Ŝ_t + (1 − S_t)·log(1 − Ŝ_t) ],    (11)
where S_t denotes the ground-truth reward score of subtask t and Ŝ_t denotes the prediction score of the MoE (mixture-of-experts) gating network. At test time, the expert with the largest prediction score Ŝ_t is chosen. The prediction score Ŝ_t is computed with a multi-layer perceptron (MLP):
Ŝ_t = MLP(x),    (12)
where x is the input feature and W_i, b_i are the learnable parameters of the MLP.
Acc(ans) = min{ (# humans that said ans) / 3, 1 }    (13)
To be consistent with human accuracy, the machine accuracy is averaged over 3 sets of tests.
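For illustration, the reward score of Eq. (13) and the BCE objective around Eq. (11) might be computed as follows; the function names are assumptions, and minimizing the BCE here corresponds to maximizing the objective written above:

```python
import torch
import torch.nn.functional as F

def vqa_accuracy(answer, human_answers):
    """Eq. (13)-style soft accuracy used as the reward score S_t."""
    return min(human_answers.count(answer) / 3.0, 1.0)

def gating_bce_loss(pred_scores, reward_scores):
    """Binary cross-entropy between the MoE prediction scores and the reward scores."""
    return F.binary_cross_entropy(pred_scores, reward_scores)

# Hypothetical usage:
# s_t = vqa_accuracy("red", ["red", "red", "maroon", "red"])                 # -> 1.0
# loss = gating_bce_loss(torch.tensor([0.8, 0.2]), torch.tensor([1.0, 0.0]))
```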
The following features are used to train the gating network:
1) Confidence of each expert: for the visual understanding expert, the maximum prediction score is used as the confidence score. For the text reading and clock reading experts, their output score is used as the confidence score; if the image contains no text or no clock, the score is set to 1.
2) The question type: a three-class classifier is trained to predict whether a question concerns text reading, clock reading, or visual understanding. To train the classifier, OCR label data are collected from TextVQA and STVQA, and clock label data are collected from the VQA dataset by retrieving the keywords "clock" and "what time". Other cases in the VQA data are sampled as negative samples at a fixed ratio. The prediction scores of these three categories are used as input features. Table 1 lists statistics of the data used in one implementation.
Table 1

Split         Images    Questions    Yes/No    Number    Other    Answers
Training      80K       443K         169K      58K       219K     4.4M
Validation    40K       214K         81K       28K       106K     2.1M
Test          80K       447K
The visual understanding expert integrates 46 models in total, including 14 Region-VLP models, 21 Grid-VLP models, 4 Patch-VLP models, and 7 Fusion-VLP models. All models are ensembled by a simple maximum-voting scheme over the prediction scores. The Region-VLP models detect objects and extract region features with a detector built on ResNeXt-152.
The hybrid expert model uses a multi-layer perceptron (MLP) as the gating network to determine the expert for a given question. The MLP has two hidden layers of 100 and 50 neurons, respectively, uses tanh as the activation function, and is optimized with Adam at an initial learning rate of 1e-3. The network is trained for 5 epochs with a batch size of 256.
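A sketch of such a gating MLP with the hyperparameters stated above is shown below; the input feature size, the per-expert sigmoid output, and the commented training setup are assumptions made for the example:

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """Sketch: two hidden layers (100 and 50 units) with tanh activations and a sigmoid
    score per expert, so the output can be trained with the BCE objective above."""
    def __init__(self, in_features, num_experts):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, num_experts), nn.Sigmoid(),
        )

    def forward(self, x):          # x: expert confidences + question-type scores, etc.
        return self.net(x)

# Assumed training setup matching the description: Adam, lr 1e-3, batch size 256, 5 epochs.
# gate = GatingNetwork(in_features=7, num_experts=4)
# optimizer = torch.optim.Adam(gate.parameters(), lr=1e-3)
```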
Experiments show that VLP methods based on region and grid features achieve better performance than VLP methods based on patch features. Examined by individual question type, Region-VLP performs better on the "Number" type, while Grid-VLP performs better on the "Yes/No" and "Other" types. This difference can be attributed to the fact that region features capture more local, object-level information of the image, so that visual counting questions can be answered more effectively by identifying local objects, whereas grid features capture the global visual context of the image, which helps answer "Yes/No" and "Other" questions. Furthermore, by combining the three kinds of features through early fusion, Fusion-VLP performs best among all single models, which shows that the different types of features complement each other well.
Accordingly, an embodiment of the present disclosure provides a method for inspecting a visual question-answering model, as shown in FIG. 5, which specifically includes the following steps.
In step S501, a plurality of samples in the visual question-answering system whose prediction score is lower than a set threshold are collected to form a sample set.
In step S502, a plurality of samples are clustered to form a plurality of subsample sets.
In step S503, the expert models missing from the visual question-answering system are determined from the subsample set.
In step S504, an expert model missing in the visual question-answering system is constructed and trained using a corresponding one of the plurality of subsample sets.
Based on this embodiment, a plurality of samples whose prediction scores are lower than a set threshold are collected from the visual question-answering system to form a sample set; the sample set is clustered to obtain a plurality of sub-sample sets; the types of missing experts are determined from the sub-sample sets, and the corresponding experts are trained accordingly; the trained experts are then integrated into the hybrid expert model to expand its performance boundary. In addition, fine-tuning is required when integrating the newly trained experts into the hybrid expert model.
Application scenarios and architectures
FIG. 6 is a schematic diagram of the deployment of an application service. As shown, deployment diagram 600 includes a terminal 603 and a cluster of servers 602 communicating via a network 601.
Network 601 is a combination of one or more of a variety of communication technologies implemented based on exchanging signals, including but not limited to wired technologies employing electrically and/or optically conductive cables, and wireless technologies employing infrared, radio frequency, and/or other forms. In different application scenarios, the network 601 may be the internet, a wide area network, or a local area network, and may be a wired network or a wireless network. For example, network 601 is a local area network within a company.
The server cluster 602 is made up of a plurality of physical servers. The terminal 603 may be an electronic device such as a smartphone, a tablet computer, a laptop computer, a desktop computer, and the like. Various application systems are deployed on the server cluster 602, and the terminal 603 can obtain services provided by the application systems via the network 601.
As cloud computing evolves, server cluster 602 may deploy cloud service systems. The cloud service system can aggregate software and hardware resources in the cloud server cluster and provide software and hardware services according to a request from the terminal 603. For example, the cloud service system may provide a server (possibly a virtual machine) with a specified configuration to a user, wherein the specified configuration includes information such as a processor model, a memory size, a hard disk size, an operating system type, various software (e.g., text processing software, video playing software) types deployed on the operating system, and the like, and the user remotely accesses and uses the server to complete various tasks through the terminal 603. As another example, a specific application system is deployed on a server (which may be a virtual machine), and a service portal is provided for a user, so that the user can obtain corresponding functions provided by the application system through the service portal. In the present disclosure, a VQA system integrating a hybrid expert model may be deployed on the server cluster 602, and each terminal 603 acquires a VQA service through the network 601.
Fig. 7 is a schematic diagram of an exemplary server. The server cluster 602 may be constructed from servers as shown in FIG. 7. As shown in FIG. 7, the server 700 may include, but is not limited to: a scheduler 701, a memory unit 703, an I/O interface 704, a field programmable gate array (FPGA) 708, a graphics processor (GPU) 706, a neural network acceleration unit (NPU) 702, and a data transmission unit (DTU) 707 coupled via a bus 705.
The memory unit 703 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) and/or a cache memory unit. The storage unit 703 may also include readable media in the form of nonvolatile storage units, such as read only memory units (ROM), flash memory, and various disk memories.
The storage unit 703 may store various program modules including an operating system, application programs that provide functions such as text processing, video playback, software editing and compilation, and data. The executable codes of these application programs are read out from the storage unit 703 by the scheduler 701 and executed to achieve the functions that these program modules are intended to provide. Scheduler 701 is typically a CPU.
The bus 705 may be any one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The server 700 may communicate with one or more external devices (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the server 700, and/or with any device (e.g., a router, a modem, etc.) that enables the server 700 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 704. Further, the server 700 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter (not shown). The terminal 603 in FIG. 6 may, for example, access the server 700 through such a network adapter. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used with the server 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The neural network acceleration unit (NPU) 702 adopts a data-driven parallel computing architecture and is a processing unit for the large number of operations (such as convolution and pooling) of each neural network node. Because the data and intermediate results of these operations are closely related throughout the computation and are reused frequently, an existing CPU architecture, whose in-core memory capacity is small, has to access external memory frequently, leading to low processing efficiency. With an NPU, each core has on-chip memory with a storage capacity suited to neural network computation, which avoids frequent access to memory outside the core, greatly improves processing efficiency, and increases computing performance.
The data transmission unit (DTU) 707 is a wireless terminal device dedicated to converting serial data into IP data, or IP data into serial data, for transmission over a wireless communication network. The main function of the DTU is to wirelessly transmit data from the remote device back to the back-office center. At the front end, the DTU interfaces with the customer's equipment. After the DTU is powered on, it first registers with the mobile GPRS network and then establishes a socket connection with the background center configured in it. The background center serves as the server side of the socket connection, and the DTU is the client side. The DTU is therefore used together with the background software; once the connection is established, the front-end device and the background center can transmit data wirelessly through the DTU.
The graphics processor (GPU) 706 is a microprocessor dedicated to image- and graphics-related arithmetic operations. The GPU makes up for the shortage of compute units in the CPU by employing a large number of units dedicated to graphics computation, so that the graphics card reduces its dependence on the CPU and takes over some of the computation-intensive graphics and image processing work originally handled by the CPU.
The field programmable gate array (FPGA) 708 is a product of further development on the basis of programmable devices such as PAL and GAL. It is a semi-custom circuit in the field of application-specific integrated circuits (ASIC), which not only overcomes the shortcomings of fully custom circuits but also overcomes the limitation on the number of gate circuits of earlier programmable devices.
It should be noted that the above and fig. 7 are only used for exemplary description of the server in the system, and are not used to limit the specific implementation manner of the server. The server may further include other components, and each of the above-described servers may also be appropriately omitted in practical applications.
Furthermore, in some embodiments, various embodiments of the present disclosure may also be implemented in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
It should be understood that the above-described are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure, since many variations of the embodiments described herein will occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
It should be understood that the embodiments in this specification are described in a progressive manner, and that the same or similar parts in the various embodiments may be referred to one another, with each embodiment being described with emphasis instead of the other embodiments.
It should be understood that the above description describes particular embodiments of the present specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It should be understood that an element described herein in the singular or shown only once in the figures does not mean that the number of such elements is limited to one. Further, modules or elements described or illustrated herein as separate may be combined into a single module or element, and a module or element described or illustrated herein as single may be split into multiple modules or elements.
It is also to be understood that the terms and expressions employed herein are used as terms of description and not of limitation, and that the embodiment or embodiments of the specification are not limited to those terms and expressions. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims are to be regarded as covering all such equivalents.

Claims (16)

1. A method for implementing visual question answering, comprising: receiving, by a hybrid expert model, a target question and a rich text picture corresponding to the target question, and outputting an answer to the target question, wherein the hybrid expert model comprises a gating network and a plurality of expert models,
the gating network is configured to determine a question type of the target question, determine, based on the question type, that the target question belongs to a first question type of a plurality of question types, and provide the target question to a first expert model of the plurality of expert models; and
the first expert model is configured to provide the answer to the target question.
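By way of a non-limiting illustration of the routing described in claim 1, the following Python sketch shows one way a gating network could dispatch a question to one of several expert models; the class names GatingNetwork and HybridExpertModel, and the use of a simple linear classifier as the gate, are assumptions for illustration rather than the claimed implementation.

import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """Predicts a question type index from a pooled question embedding."""
    def __init__(self, embed_dim: int, num_question_types: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_question_types)

    def forward(self, question_embedding: torch.Tensor) -> int:
        # one question at a time, for clarity of the sketch
        logits = self.classifier(question_embedding)
        return int(logits.argmax(dim=-1).item())

class HybridExpertModel(nn.Module):
    """Routes each question to exactly one expert according to its question type."""
    def __init__(self, gate: GatingNetwork, experts: dict[int, nn.Module]):
        super().__init__()
        self.gate = gate
        self.experts = nn.ModuleDict({str(k): v for k, v in experts.items()})

    def forward(self, question_embedding, picture_features, question_tokens):
        question_type = self.gate(question_embedding)
        expert = self.experts[str(question_type)]          # e.g. first expert for the first type
        return expert(question_tokens, picture_features)   # the chosen expert produces the answer

In this sketch the gate conditions only on a pooled question embedding; nothing in the claim forbids a gate that also looks at visual features.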
2. The implementation method of claim 1, wherein the expert model comprises: a word embedding representation module configured to encode the target question into a word embedding sequence, a visual encoder configured to encode the rich text picture into a visual feature sequence, and a transformer configured to multiply the word embedding sequence and the visual feature sequence by respective attention weights to obtain a score matrix and to determine the answer to the target question from the score matrix.
3. The implementation method of claim 2, wherein each word vector in the word embedding sequence is obtained based on a modality type, position information of the corresponding word, and the word embedding of the corresponding word, and each item in the visual feature sequence is likewise obtained based on a modality type, position information of the corresponding component of the rich text picture, and the visual feature of that component.
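As a hedged illustration of claim 3, one common way to realize such inputs is to sum a content embedding, a position embedding, and a modality-type embedding for every item of both sequences; the module below and its layer names are assumptions, not the claimed design.

import torch
import torch.nn as nn

class MultimodalInputEmbedding(nn.Module):
    """Builds one input vector per item: content + position + modality type (0 = text, 1 = vision)."""
    def __init__(self, vocab_size, visual_dim, embed_dim, max_positions, num_modalities=2):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)        # projects raw visual features
        self.position_embed = nn.Embedding(max_positions, embed_dim)
        self.modality_embed = nn.Embedding(num_modalities, embed_dim)

    def embed_text(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, num_words)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.word_embed(token_ids)
                + self.position_embed(positions)
                + self.modality_embed(torch.zeros_like(token_ids)))

    def embed_vision(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_components, visual_dim)
        positions = torch.arange(visual_feats.size(1), device=visual_feats.device)
        modality = torch.ones(visual_feats.shape[:2], dtype=torch.long, device=visual_feats.device)
        return (self.visual_proj(visual_feats)
                + self.position_embed(positions)
                + self.modality_embed(modality))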
4. The implementation method of claim 3, wherein inter-modality and intra-modality interactions in the transformer are controlled by different attention weights.
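One plausible realization of claim 4, offered only as a sketch, is to rescale the attention logits with a learnable factor looked up by the (query modality, key modality) pair, so that text-text, vision-vision, and cross-modal interactions receive different weights; the ModalityAwareAttention module below is an assumption, not the claimed mechanism.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareAttention(nn.Module):
    """Scaled dot-product attention whose logits are re-weighted per modality pair."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # one learnable weight per (query modality, key modality) combination
        self.pair_weight = nn.Parameter(torch.ones(2, 2))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) with 0 = text, 1 = vision
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-1, -2) / (x.size(-1) ** 0.5)        # (batch, seq, seq)
        pair = self.pair_weight[modality_ids.unsqueeze(-1), modality_ids.unsqueeze(-2)]
        attn = F.softmax(logits * pair, dim=-1)                       # intra vs. inter weighted differently
        return attn @ v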
5. The implementation method of claim 2, wherein the visual feature of the corresponding component of the rich text picture is at least one of a region feature, a grid feature, and a patch feature.
6. The implementation method of any one of claims 1 to 5, wherein the plurality of expert models comprise:
a text reading expert for answering questions related to text information in the rich text picture;
a counting expert for answering questions related to the number of objects in the rich text picture; and
a clock reading expert for answering questions related to a clock time in the rich text picture.
7. The implementation method of claim 6, wherein the counting expert and the clock reading expert each extract region, grid, and patch features from the rich text picture, fuse them, and match the fusion result with the text features extracted from the target question.
8. The implementation method of claim 7, wherein the region feature, the grid feature, and the patch feature are given different attention weights in the fusion result.
9. The implementation method of claim 8, wherein, in the clock reading expert and the counting expert, the region feature and the grid feature each receive a higher attention weight than the patch feature.
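For claims 7 to 9, the sketch below fuses the region, grid, and patch streams with learnable per-stream attention weights and matches the result against a text feature; the initialization that favors the region and grid streams merely mirrors the preference stated in claim 9, and the module and function names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedVisualFusion(nn.Module):
    """Fuses region, grid, and patch features with learnable per-stream attention weights."""
    def __init__(self, region_dim, grid_dim, patch_dim, fused_dim):
        super().__init__()
        self.proj = nn.ModuleDict({
            "region": nn.Linear(region_dim, fused_dim),
            "grid": nn.Linear(grid_dim, fused_dim),
            "patch": nn.Linear(patch_dim, fused_dim),
        })
        # start with more weight on region/grid than patch (illustrative assumption)
        self.stream_logits = nn.Parameter(torch.tensor([1.0, 1.0, 0.0]))

    def forward(self, region, grid, patch):
        # each input: (batch, num_items, stream_dim); pool each stream to one vector
        streams = [self.proj["region"](region).mean(dim=1),
                   self.proj["grid"](grid).mean(dim=1),
                   self.proj["patch"](patch).mean(dim=1)]
        weights = F.softmax(self.stream_logits, dim=0)           # one weight per stream
        return sum(w * s for w, s in zip(weights, streams))      # weighted sum of the three streams

def match_score(fused_visual, text_feature):
    """Cosine similarity between the fused visual feature and the question's text feature."""
    return F.cosine_similarity(fused_visual, text_feature, dim=-1)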
10. The implementation method of claim 6, wherein the text reading expert obtains text information from the rich text picture by OCR and derives a first word embedding sequence from it, serializes the rich text picture into a cell sequence in which each item contains at least one word of the text information, concatenates a second word embedding sequence corresponding to the target question with the first word embedding sequence to obtain a third word embedding sequence, and then takes the cell sequence and the third word embedding sequence as input to a word span prediction classifier and gives the answer to the target question according to the prediction result.
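For claim 10, the following minimal sketch shows the shape of a word span prediction head and the mapping from a predicted span back to OCR words; the OCR engine, the cell serialization, and the multimodal encoder that produces the hidden states are deliberately omitted and would be supplied by the concrete system, so this is an assumption-laden illustration rather than the claimed classifier.

import torch
import torch.nn as nn

class WordSpanClassifier(nn.Module):
    """Predicts the start and end of the answer span over a concatenated
    [question tokens ; OCR tokens] sequence."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.span_head = nn.Linear(hidden_dim, 2)     # one start logit and one end logit per position

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from some multimodal encoder
        logits = self.span_head(hidden_states)
        start_logits, end_logits = logits.unbind(dim=-1)
        return start_logits.argmax(dim=-1), end_logits.argmax(dim=-1)

def span_to_answer(ocr_words: list[str], start: int, end: int, question_len: int) -> str:
    """Maps a predicted span (indices into the concatenated sequence) back to OCR words."""
    begin = max(start - question_len, 0)
    finish = max(end - question_len, begin)
    return " ".join(ocr_words[begin:finish + 1])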
11. The implementation method of claim 1, further comprising: dividing the received visual task into a plurality of subtasks, each subtask including the rich text picture and a target question for the rich text picture.
12. The implementation method of claim 1, wherein a knowledge graph is generated in the gating network by training the hybrid expert model, so that the gating network establishes correspondences between the plurality of question types and the plurality of expert models.
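Claim 12 does not specify the form of the knowledge graph; the sketch below merely records, after training, which expert the gating network routes each question type to most often, yielding question-type-to-expert edges. The helper names and the (question type, question) sample format are assumptions.

from collections import Counter, defaultdict

def build_type_to_expert_graph(gate, expert_names, validation_samples, embed_question):
    """Derives question-type -> expert correspondences from the trained gate's routing decisions."""
    votes = defaultdict(Counter)
    for question_type, question in validation_samples:        # (type label, question text) pairs
        chosen = gate(embed_question(question))                # index of the routed expert
        votes[question_type][expert_names[chosen]] += 1
    # keep, for each question type, the expert the gate routes to most often
    return {qtype: counts.most_common(1)[0][0] for qtype, counts in votes.items()}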
13. A method for inspecting a visual question-answering model, comprising:
collecting a plurality of samples with prediction scores lower than a set threshold value in a visual question-answering system to form a sample set, wherein the samples comprise rich text pictures and target questions corresponding to the rich text pictures;
clustering the plurality of samples to form a plurality of subsample sets;
determining, from the subsample sets, an expert model missing from the visual question-answering system.
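A hedged sketch of the inspection flow of claim 13: collect samples whose prediction score falls below a threshold, cluster their question embeddings, and return the resulting subsample sets for review; the embedding function, the threshold value, and the use of k-means are illustrative choices, not requirements of the claim.

from dataclasses import dataclass
import numpy as np
from sklearn.cluster import KMeans

@dataclass
class Sample:
    question: str
    picture_id: str
    prediction_score: float

def inspect_for_missing_experts(samples, embed_question, score_threshold=0.5, num_clusters=5):
    """Collects low-score samples, clusters them, and returns the subsample sets;
    each set hints at a question type the current experts handle poorly."""
    low_score = [s for s in samples if s.prediction_score < score_threshold]
    if len(low_score) < num_clusters:
        return []                                              # not enough evidence to cluster
    vectors = np.stack([embed_question(s.question) for s in low_score])
    labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(vectors)
    clusters = {}
    for sample, label in zip(low_score, labels):
        clusters.setdefault(label, []).append(sample)
    return list(clusters.values())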
14. The method of claim 13, further comprising: constructing an expert model missing from the visual question-answering system and training it with a corresponding subsample set of the plurality of subsample sets.
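For claim 14, a newly constructed expert could be fitted on its corresponding subsample set with an ordinary supervised loop such as the one below; the loss, optimizer, and batch format are assumptions made only to keep the sketch runnable.

import torch
import torch.nn as nn

def train_new_expert(expert: nn.Module, subsample_loader, epochs: int = 3, lr: float = 1e-4):
    """Fits a newly constructed expert on the subsample set that exposed the gap."""
    optimizer = torch.optim.AdamW(expert.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    expert.train()
    for _ in range(epochs):
        for question_feats, picture_feats, answer_ids in subsample_loader:
            logits = expert(question_feats, picture_feats)     # expert predicts answer logits
            loss = loss_fn(logits, answer_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return expert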
15. An electronic device comprising a memory and a processor, the memory storing a computer program operable on the processor, the processor implementing the method of any of claims 1 to 12 or the method of any of claims 13 to 14 when executing the program.
16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 12 or the method of any one of claims 13 to 14.
CN202210664672.7A 2022-06-13 2022-06-13 Visual question-answer implementation method and visual question-answer inspection model-based method Pending CN115221369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210664672.7A CN115221369A (en) 2022-06-13 2022-06-13 Visual question-answer implementation method and visual question-answer inspection model-based method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210664672.7A CN115221369A (en) 2022-06-13 2022-06-13 Visual question-answer implementation method and visual question-answer inspection model-based method

Publications (1)

Publication Number Publication Date
CN115221369A true CN115221369A (en) 2022-10-21

Family

ID=83608290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210664672.7A Pending CN115221369A (en) 2022-06-13 2022-06-13 Visual question-answer implementation method and visual question-answer inspection model-based method

Country Status (1)

Country Link
CN (1) CN115221369A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115904810A (en) * 2022-12-02 2023-04-04 菏泽盛哲网络科技有限公司 Data replication disaster tolerance method and disaster tolerance system based on artificial intelligence
CN115904810B (en) * 2022-12-02 2024-02-06 四川星环纪元科技发展有限公司 Data replication disaster recovery method and disaster recovery system based on artificial intelligence
CN117541894A (en) * 2024-01-04 2024-02-09 支付宝(杭州)信息技术有限公司 Training method and device for multi-mode model
CN117541894B (en) * 2024-01-04 2024-04-16 支付宝(杭州)信息技术有限公司 Training method and device for multi-mode model

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
US10691899B2 (en) Captioning a region of an image
CN109359564B (en) Image scene graph generation method and device
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
Magassouba et al. Understanding natural language instructions for fetching daily objects using gan-based multimodal target–source classification
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
Li et al. Multi-modal gated recurrent units for image description
CN113704460A (en) Text classification method and device, electronic equipment and storage medium
CN115455171B Text-video mutual retrieval and model training method, device, equipment and medium
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN114443899A (en) Video classification method, device, equipment and medium
CN115221846A (en) Data processing method and related equipment
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116541492A (en) Data processing method and related equipment
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination