CN117892140A - Visual question answering and model training method and apparatus thereof, electronic device, and storage medium


Info

Publication number
CN117892140A
Authority
CN
China
Prior art keywords
image
answer
knowledge
event
text
Prior art date
Legal status
Granted
Application number
CN202410295706.9A
Other languages
Chinese (zh)
Other versions
CN117892140B (en)
Inventor
徐聪
赵雅倩
范宝余
刘璐
贾麒
金良
闫瑞栋
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202410295706.9A
Publication of CN117892140A
Application granted
Publication of CN117892140B
Status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a visual question answering method, a model training method therefor, an apparatus, an electronic device, and a storage medium, applied to the field of artificial intelligence. The method comprises: acquiring a visual question-answer training sample data set; inputting the question-image pair samples into the visual question-answering model, where an image-text encoder performs image-text encoding processing on each question-image pair sample, an interactive decoder extracts semantic features of the interaction object from the received image-text encoding features, and an inference decoder fuses the received image-text encoding features with the interaction object features; and continuously and iteratively updating the model based on the losses between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from a knowledge base, until a preset model training end condition is met. The method and the apparatus solve the problem that the related art cannot meet users' requirements for high-precision question answering and interpretable answers, improve the accuracy of visual question answering in scene-based interaction tasks, and make the answers more interpretable.

Description

Visual question answering and model training method and apparatus thereof, electronic device, and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a visual question answering method, a model training method therefor, an apparatus, an electronic device, and a readable storage medium.
Background
Visual question answering is a learning task involving computer vision and natural language processing: given a video or image and a question posed by a user, the task is to give a corresponding answer after deep understanding and reasoning about the visual content and the question. Visual question answering based on scene-interaction tasks can understand the interaction behavior between humans and scenes, and is widely applied.
A visual question-answering model that executes such tasks must have strong compositional understanding capability and must reason jointly over the knowledge graph, the question, and the image. However, the visual question-answering models in the related art have relatively low question-answering accuracy and knowledge-reasoning accuracy, and cannot meet users' requirements for high-precision question answering and interpretable answers.
In view of this, improving the accuracy of visual question answering in scene-based interaction tasks and making the answers more interpretable are technical problems that need to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a visual question answering method, a model training method therefor, an apparatus, an electronic device, and a readable storage medium, which can effectively improve the accuracy of visual question answering in scene-based interaction tasks and make the answers more interpretable.
In order to solve the above technical problems, the invention provides the following technical solutions:
The first aspect of the invention provides a visual question-answering model training method, which comprises the following steps:
acquiring a visual question-answer training sample data set; the visual question-answer training sample data set comprises a knowledge base and a plurality of groups of question-image pair samples carrying correct answer-correct event knowledge labels; each question-image pair sample comprises a question sample and its corresponding image sample, the question sample comprises a behavior of a target object, and the image sample at least comprises the interaction object to which the behavior of the target object interacting with the scene is directed;
inputting the question-image pair samples into a pre-constructed visual question-answering model; the visual question-answering model comprises an image-text encoder, an interactive decoder and an inference decoder;
the image-text encoder performs image-text encoding processing on the question-image pair samples and inputs the image-text encoding features to the interactive decoder and the inference decoder respectively; the interactive decoder extracts semantic features of the interaction object from the received image-text encoding features and sends the extracted interaction object features to the inference decoder; and the inference decoder fuses the received image-text encoding features with the interaction object features, and iteratively updates the model based on the losses between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base, until a preset model training end condition is met.
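Purely for orientation, the following is a minimal sketch of how this three-module layout (image-text encoder, interactive decoder, inference decoder) could be wired up; the framework (PyTorch), feature dimensions, layer choices and all names are assumptions made for illustration and are not taken from the patent.

```python
# Hedged sketch: a minimal, hypothetical PyTorch layout of the three modules
# described above (image-text encoder, interactive decoder, inference decoder).
# Dimensions, layer choices and names are assumptions, not the patent's design.
import torch
import torch.nn as nn

class ImageTextEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(300, dim)    # assumed text embedding size
        self.image_proj = nn.Linear(2048, dim)  # assumed image feature size
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, text_emb, image_emb):
        seq = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=1)
        return self.fuse(seq)                   # image-text encoding features

class InteractiveDecoder(nn.Module):
    """Extracts interaction-object semantics (the patent also outputs a location)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.object_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, enc_feats):
        q = self.object_query.expand(enc_feats.size(0), -1, -1)
        obj_feat, _ = self.attn(q, enc_feats, enc_feats)
        return obj_feat.squeeze(1)              # interaction-object feature

class InferenceDecoder(nn.Module):
    """Fuses encoder features with object features before retrieval-style losses."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, enc_feats, obj_feat):
        pooled = enc_feats.mean(dim=1)
        return self.fuse(torch.cat([pooled, obj_feat], dim=-1))

# toy forward pass with random stand-in features
enc = ImageTextEncoder(); inter = InteractiveDecoder(); infer = InferenceDecoder()
text, image = torch.randn(2, 12, 300), torch.randn(2, 49, 2048)
feats = enc(text, image)
obj = inter(feats)
fused = infer(feats, obj)
print(fused.shape)  # torch.Size([2, 256])
```

In a real implementation the encoder and decoders would be full transformer stacks with the output identifier tokens and the interaction-object localization head described in the embodiments below.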
In a first exemplary embodiment, the inference decoder includes an answer inference branch and a knowledge inference branch; the input of the image-text encoder further comprises an answer output identifier and an event output identifier; and the fusing of the received image-text encoding features with the interaction object features and the iterative updating based on the losses between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base comprise:
the answer inference branch receives the first-type image-text encoding features output at the position of the image-text encoder corresponding to the answer output identifier, and is iteratively updated based on the loss between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base;
the knowledge inference branch receives the second-type image-text encoding features output at the position of the image-text encoder corresponding to the event output identifier, fuses the second-type image-text encoding features with the interaction object features, and is iteratively updated based on the loss between the fused image-text encoding features and the event knowledge of the knowledge base;
the answer output identifier is used to identify the image-text encoding features that the image-text encoder inputs to the answer inference branch, and the event output identifier is used to identify the image-text encoding features that the image-text encoder inputs to the knowledge inference branch.
In a second exemplary embodiment, the iteratively updating based on the loss between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base includes:
carrying out vectorization representation on each answer of the knowledge base in advance to obtain an answer space containing a plurality of answer characterizations;
for each first type of image-text coding feature, obtaining a correct answer representation corresponding to the current first type of image-text coding feature based on a correct answer-correct event knowledge label of a question-image sample pair corresponding to the current first type of image-text coding feature, and determining standard similarity between the current first type of image-text coding feature and the correct answer representation corresponding to the current first type of image-text coding feature;
determining reference similarity between the current first-type image-text coding features and answer characterization of the answer space;
and determining loss information of each answer representation of the current first-type image-text coding feature and the answer space according to the standard similarity and each reference similarity.
In a third exemplary embodiment, the determining the standard similarity between the current first-type image-text encoding feature and its corresponding correct answer representation includes:
invoking a similarity calculation relation to calculate the standard similarity between the current first-type image-text encoding feature and its corresponding correct answer representation; the similarity calculation relation is:

$$s\left(P_n, a\right)=\frac{f\left(P_n\right)^{T}\, g(a)}{\tau}$$

where s(P_n, a) is the standard similarity, T denotes transposition, P_n is the question-image pair sample with index n, a denotes the correct answer, f(P_n) denotes the first-type image-text encoding feature corresponding to P_n, g(a) denotes the correct answer representation, and τ denotes the adjustment parameter.
In a fourth exemplary embodiment, the iteratively updating based on the loss between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base includes:
invoking an answer inference loss function calculation relation to calculate the answer inference loss between the first-type image-text encoding features and the answers retrieved from the knowledge base; the answer inference loss function calculation relation is:

$$L_a=-\frac{1}{N}\sum_{n=1}^{N}\log\frac{\exp\left(f(P_n)^{T} g(a)/\tau\right)}{\sum_{a'\in A}\exp\left(f(P_n)^{T} g(a')/\tau\right)}$$

where L_a is the answer inference loss, N is the total number of question-image pair samples, T denotes transposition, P_n is the question-image pair sample with index n, a denotes the correct answer, f(P_n) denotes the first-type image-text encoding feature corresponding to P_n, g(a) denotes the correct answer representation, τ denotes the adjustment parameter, a' denotes an answer in the answer space A, and g(a') denotes its answer representation.
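For illustration only, the sketch below shows one way to compute a loss of this softmax-over-a-retrieval-space form; the function name, tensor shapes and temperature value are assumptions, and the same routine would apply unchanged to the fused features and the event knowledge space of the knowledge inference branch described below.

```python
# Hedged sketch of a retrieval-style loss of the reconstructed form above:
# a softmax over similarities between a sample feature and every representation
# in a space, with the correct representation as the positive. Shapes/names assumed.
import torch
import torch.nn.functional as F

def retrieval_loss(features: torch.Tensor,      # (N, d) first-type or fused features
                   space: torch.Tensor,          # (K, d) answer or event representations
                   target_idx: torch.Tensor,     # (N,) index of the correct entry
                   tau: float = 0.07) -> torch.Tensor:
    sims = features @ space.t() / tau            # (N, K) scaled dot-product similarities
    return F.cross_entropy(sims, target_idx)     # mean of -log softmax at the target

# toy usage with random stand-ins
feats = F.normalize(torch.randn(4, 256), dim=-1)
answer_space = F.normalize(torch.randn(100, 256), dim=-1)
labels = torch.tensor([3, 17, 42, 99])
print(retrieval_loss(feats, answer_space, labels))
```

Cross-entropy over the scaled similarity matrix equals the negative log-softmax at the correct entry, which matches the reconstructed loss form; the same function could be reused with the fused image-text encoding features and an event-knowledge space.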
In a fifth exemplary embodiment, the answer reasoning branch includes a semantic space layer, an answer feature extraction layer, and an answer feature representation layer;
the semantic space layer receives first-type image-text coding features output by the positions corresponding to answer output identifiers of the image-text encoder, and calculates similarity between each first-type image-text coding feature and each answer representation;
the answer characteristic extraction layer maps each answer characteristic of the answer characteristic representation layer to the semantic space layer;
and the answer characteristic representation layer is used for carrying out vectorization representation on each answer of the knowledge base, generating corresponding answer characterizations and sending each answer characterizations to the answer characteristic extraction layer.
In a sixth exemplary embodiment, the fusing the second-type image-text encoding features with the interaction object features includes:
calculating distance metric information between the interaction object features and the second-type image-text encoding features to obtain initial fused image-text encoding features;
and adding the initial fused image-text encoding features to the corresponding second-type image-text encoding features to obtain the fused image-text encoding features.
In a seventh exemplary embodiment, the fusing the second-type image-text encoding features with the interaction object features includes:
invoking a feature fusion relation to fuse the second-type image-text encoding features with the interaction object features; the feature fusion relation is:

$$f_{es}=D_{KL}\left(f_e \,\|\, f_s\right)+f_e$$

where f_es is the fused image-text encoding feature, f_e is the second-type image-text encoding feature, f_s is the interaction object feature, and D_KL(f_e||f_s) denotes computing the KL divergence between the second-type image-text encoding feature and the interaction object feature.
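A minimal sketch of the fusion relation reconstructed above follows; treating the feature vectors as probability distributions via softmax, and keeping the per-dimension KL terms before the residual addition, are assumptions not spelled out in the text.

```python
# Hedged sketch of the fusion f_es = D_KL(f_e || f_s) + f_e reconstructed above.
# Converting features to distributions with softmax and keeping per-dimension
# KL terms before adding f_e back are illustrative assumptions.
import torch
import torch.nn.functional as F

def fuse_features(f_e: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    p = F.softmax(f_e, dim=-1)                   # second-type image-text encoding feature
    q = F.softmax(f_s, dim=-1)                   # interaction-object feature
    kl_terms = p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())
    return kl_terms + f_e                        # initial fused features plus residual f_e

f_e, f_s = torch.randn(2, 256), torch.randn(2, 256)
print(fuse_features(f_e, f_s).shape)             # torch.Size([2, 256])
```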
In an eighth exemplary embodiment, the iteratively updating based on the loss between the fused image-text encoding features and the event knowledge of the knowledge base includes:
vectorizing each piece of event knowledge in the knowledge base in advance to obtain an event knowledge space containing a plurality of event knowledge representations;
for each fused image-text encoding feature, obtaining the correct event knowledge representation corresponding to the current fused image-text encoding feature based on the correct answer-correct event knowledge label of the question-image pair sample corresponding to the current fused image-text encoding feature, and determining the event standard similarity between the current fused image-text encoding feature and its corresponding correct event knowledge representation;
determining the event reference similarities between the current fused image-text encoding feature and the event knowledge representations of the event knowledge space;
and determining the loss between the current fused image-text encoding feature and the event knowledge representations of the event knowledge space according to the event standard similarity and the event reference similarities.
In a ninth exemplary embodiment, the determining the event standard similarity between the current fused image-text encoding feature and its corresponding correct event knowledge representation comprises:
invoking an event similarity calculation relation to calculate the event standard similarity between the current fused image-text encoding feature and its corresponding correct event knowledge representation; the event similarity calculation relation is:

$$s\left(P_n, e, s\right)=\frac{f_{es}\left(P_n\right)^{T}\, g(e)}{\tau}$$

where s(P_n, e, s) is the event standard similarity, T denotes transposition, P_n is the question-image pair sample with index n, e denotes the correct event knowledge, s denotes the interaction object, f_es(P_n) denotes the fused image-text encoding feature corresponding to P_n, g(e) denotes the correct event knowledge representation, and τ denotes the adjustment parameter.
In a tenth exemplary embodiment, the iteratively updating based on the loss between the fused image-text encoding features and the event knowledge of the knowledge base includes:
invoking a knowledge inference loss function calculation relation to calculate the knowledge inference loss between the fused image-text encoding features and the event knowledge representations of the knowledge base; the knowledge inference loss function calculation relation is:

$$L_e=-\frac{1}{N}\sum_{n=1}^{N}\log\frac{\exp\left(f_{es}(P_n)^{T} g(e)/\tau\right)}{\sum_{e'\in E}\exp\left(f_{es}(P_n)^{T} g(e')/\tau\right)}$$

where L_e is the knowledge inference loss, N is the total number of question-image pair samples, T denotes transposition, P_n is the question-image pair sample with index n, e denotes the correct event knowledge, s denotes the interaction object, f_es(P_n) denotes the fused image-text encoding feature corresponding to P_n, g(e) denotes the correct event knowledge representation, τ denotes the adjustment parameter, e' denotes event knowledge in the event knowledge space E, and g(e') denotes its event knowledge representation.
In an eleventh exemplary embodiment, the knowledge reasoning branch includes a feature fusion layer, an event space layer, an event feature extraction layer, and an event knowledge feature representation layer;
the feature fusion layer receives second-type image-text coding features output by the position corresponding to the event output identifier of the image-text encoder, fuses the second-type image-text coding features with the interactive object features, and sends the fused image-text coding features to the event space layer;
the event space layer calculates the similarity between the fusion image-text coding characteristics and each event knowledge representation;
The event feature extraction layer maps each event knowledge representation of the event knowledge feature representation layer to the event space layer;
the event knowledge feature representation layer carries out vectorization representation on each event knowledge in the knowledge base, generates corresponding event knowledge representation, and sends each event knowledge representation to the event feature extraction layer.
In a twelfth exemplary embodiment, the performing image-text encoding processing on the question-image pair samples includes:
for each question-image pair sample, text-encoding the question sample corresponding to the current question-image pair sample to obtain text encoding features;
image-encoding the image sample corresponding to the current question-image pair sample to obtain image encoding features;
and feature-fusing the text encoding features and the image encoding features, and outputting the image-text encoding features generated by the fusion to the interactive decoder and the inference decoder.
In a thirteenth exemplary embodiment, the feature-fusing the text encoding features and the image encoding features and outputting the image-text encoding features generated by the fusion to the interactive decoder and the inference decoder includes: feature-splicing the text encoding features and the image encoding features, encoding the spliced features, and outputting the image-text encoding features corresponding to the spliced features to the interactive decoder;
and splicing the text encoding features and the image encoding features into an input sequence, inserting an answer output identifier and an event output identifier in front of the input sequence, encoding the input sequence, and outputting the image-text encoding features corresponding to the input sequence to the inference decoder.
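As a rough illustration of prepending the two output identifiers to the spliced text-image sequence, a short sketch follows; the identifier embeddings, dimensions and stand-in features are assumptions, and the downstream encoder itself is omitted.

```python
# Hedged sketch of assembling the inference-decoder input sequence: the answer
# and event output identifiers are prepended to the spliced text/image features.
# Identifier names ("[answer]", "[event]"), dimensions and embeddings are assumed.
import torch
import torch.nn as nn

dim = 256
identifier_emb = nn.Embedding(2, dim)            # 0 -> [answer], 1 -> [event]
text_feats  = torch.randn(1, 12, dim)            # stand-in text encoding features
image_feats = torch.randn(1, 49, dim)            # stand-in image encoding features

ids = identifier_emb(torch.tensor([[0, 1]]))     # (1, 2, dim): [answer], [event]
input_seq = torch.cat([ids, text_feats, image_feats], dim=1)

# after encoding, the outputs at positions 0 and 1 would feed the answer
# inference branch and the knowledge inference branch, respectively
answer_slot, event_slot = input_seq[:, 0], input_seq[:, 1]
print(input_seq.shape, answer_slot.shape, event_slot.shape)
```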
In a fourteenth exemplary embodiment, the image-text encoder comprises a text input end, an image input end, an answer output identifier input end, an event output identifier input end, an image encoding layer, a text encoding layer, a feature splicing layer, a first cross-attention layer, and a second cross-attention layer;
the feature splicing layer performs feature splicing on the image coding features output by the image coding layer and the text coding features output by the text coding layer;
the first cross attention layer encodes the splicing features output by the feature splicing layer;
and the second cross attention layer is used for carrying out coding processing on the answer output identifier input by the answer output identifier input end, the event output identifier input by the event output identifier input end, the image coding feature output by the image coding layer and the text coding feature output by the text coding layer.
In a fifteenth exemplary embodiment, the interactive decoder includes an interactive object feature extraction model;
the interactive object feature extraction model is used for extracting semantic features of the interactive object based on the received image-text coding features and positioning the interactive object in a corresponding image sample; and outputting the characteristics of the interactive object to the reasoning decoder, and outputting the position information of the interactive object.
In a sixteenth exemplary embodiment, the overall loss function relation of the visual question-answering model is:

$$L_r = L_v - \frac{1}{N}\sum_{n=1}^{N}\left[\log\frac{\exp\left(f_{es}(P_n)^{T} g(e)/\tau\right)}{\sum_{e'\in E}\exp\left(f_{es}(P_n)^{T} g(e')/\tau\right)} + \log\frac{\exp\left(f(P_n)^{T} g(a)/\tau\right)}{\sum_{a'\in A}\exp\left(f(P_n)^{T} g(a')/\tau\right)}\right]$$

where L_r is the total loss, L_v is the interaction-object localization loss of the interactive decoder, N is the total number of question-image pair samples, T denotes transposition, P_n is the question-image pair sample with index n, e denotes the correct event knowledge, s denotes the interaction object, f_es(P_n) denotes the fused image-text encoding feature corresponding to P_n, g(e) denotes the correct event knowledge representation, τ denotes the adjustment parameter, e' denotes event knowledge in the event knowledge space E and g(e') its event knowledge representation, a denotes the correct answer, f(P_n) denotes the corresponding first-type image-text encoding feature, g(a) denotes the correct answer representation, and a' denotes an answer in the answer space A with answer representation g(a'). That is, the total loss is the sum of the interaction-object localization loss, the answer inference loss and the knowledge inference loss, L_r = L_v + L_a + L_e.
The second aspect of the present invention provides a visual question-answering method, comprising:
acquiring a to-be-answered question and a corresponding target image;
inputting the questions to be answered and the corresponding target images into a visual question-answering model which is trained in advance by the visual question-answering model training method according to any one of the previous claims;
obtaining candidate answers, target interaction object characteristics and supporting knowledge of the questions to be answered according to the output of the visual question-answering model, and selecting correct answers from the candidate answers based on the similarity between the candidate answers and the supporting knowledge;
the target interaction object features are the features of the interaction object in the target image with which the target object corresponding to the question to be answered and the target image performs the scene interaction; and the supporting knowledge is the event knowledge representations, retrieved from the knowledge base, that are involved in the reasoning process of the question to be answered.
In a first exemplary embodiment, the obtaining the candidate answer, the target interactive object feature and the supporting knowledge of the to-be-answered question according to the output of the visual question-answering model includes:
The image-text encoder of the visual question-answering model carries out image-text encoding on the questions to be answered and the corresponding target images, and outputs image-text encoding characteristics to be processed;
the interactive decoder of the visual question-answering model extracts the interactive object features of the image-text coding features to be processed and outputs target interactive object features;
the inference decoder of the visual question-answering model retrieves a plurality of candidate answers and a plurality of pieces of associated supporting knowledge from the knowledge base based on the image-text encoding features to be processed.
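Purely as an illustration of this retrieval-style inference flow, the sketch below scores candidate answers and supporting knowledge by similarity; the fusion shortcut, the normalization and the top-k retrieval are assumptions rather than the patent's exact procedure.

```python
# Hedged sketch of the inference flow described above: encode the question-image
# pair, extract the interaction-object feature, then retrieve top-k candidate
# answers and associated supporting knowledge by similarity. All names assumed.
import torch
import torch.nn.functional as F

def answer_question(enc_feats, obj_feat, answer_space, event_space, k=5):
    # enc_feats: (d,) image-text feature to be processed; spaces: (K, d) representations
    fused = enc_feats + obj_feat                        # stand-in for the fusion step
    cand_scores = F.normalize(enc_feats, dim=-1) @ F.normalize(answer_space, dim=-1).t()
    know_scores = F.normalize(fused, dim=-1) @ F.normalize(event_space, dim=-1).t()
    candidates = cand_scores.topk(k).indices            # candidate answer ids
    supporting = know_scores.topk(k).indices            # supporting knowledge ids
    return candidates, supporting

enc, obj = torch.randn(256), torch.randn(256)
answers, events = answer_question(enc, obj, torch.randn(1000, 256), torch.randn(5000, 256))
print(answers.tolist(), events.tolist())
```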
In a second exemplary embodiment, the selecting a correct answer from the candidate answers based on the similarity between the candidate answers and the supporting knowledge includes:
calculating the similarity between each candidate answer and each supporting knowledge, and determining the score of the current candidate answer based on the numerical relation between each similarity and a preset similarity threshold;
and taking the candidate answer with the highest score as a correct answer.
In a third exemplary embodiment, the supporting knowledge constitutes a supporting knowledge set, and the calculating the similarity between the current candidate answer and each piece of supporting knowledge includes:
invoking an answer similarity calculation relation to calculate the similarity between each candidate answer and each piece of supporting knowledge in the supporting knowledge set; the answer similarity calculation relation is:

$$M = \alpha\,\mathrm{sim}\left(a_m, e_j\right),\quad e_j \in E_s$$

where M is the similarity between the candidate answer a_m and the supporting knowledge e_j, α is a weight coefficient, E_s is the supporting knowledge set, and sim(·) denotes the similarity calculation.
In a fourth exemplary embodiment, the supporting knowledge constitutes a supporting knowledge set, and the determining the score of the current candidate answer based on the numerical relation between each similarity and a preset similarity threshold includes:
invoking an answer score calculation relation to calculate the score of each candidate answer, where the answer score calculation relation is:

$$\mathrm{SIM}\left(P, a_m\right)=\frac{f(P)^{T}\, g(a_m)}{\tau}+\sum_{e_j\in E_s}\mathbb{1}\left(M>\beta\right)$$

where P denotes the question-target image pair to be answered, SIM(P, a_m) is the score of the candidate answer a_m for the question-target image pair to be answered, f(P) is the first-type image-text encoding feature of the question-target image pair to be answered, g(a_m) denotes the answer representation corresponding to the candidate answer a_m, T denotes transposition, τ denotes the adjustment parameter, β is the preset similarity threshold, and M is the similarity between the candidate answer a_m and the supporting knowledge e_j.
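A small sketch of the candidate re-scoring just described follows; the exact combination of the encoder similarity with a thresholded supporting-knowledge count mirrors the reconstructed relation above, which is itself an assumption, as are all names and values.

```python
# Hedged sketch of answer selection: each candidate gets its encoder similarity
# plus credit for supporting knowledge whose similarity to it exceeds a threshold.
# Scoring form, weights and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def select_answer(q_feat, cand_reprs, support_reprs, alpha=1.0, beta=0.5, tau=0.07):
    # q_feat: (d,) question-image feature; cand_reprs: (M, d); support_reprs: (J, d)
    base = (F.normalize(q_feat, dim=-1) @ F.normalize(cand_reprs, dim=-1).t()) / tau
    sim = alpha * F.normalize(cand_reprs, dim=-1) @ F.normalize(support_reprs, dim=-1).t()
    bonus = (sim > beta).float().sum(dim=1)          # supporting knowledge above beta
    scores = base + bonus
    return int(scores.argmax())                      # index of the selected answer

q = torch.randn(256)
print(select_answer(q, torch.randn(5, 256), torch.randn(8, 256)))
```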
A third aspect of the present invention provides a visual question-answering model training device, including:
the training data acquisition module is used for acquiring a visual question-answer training sample data set; the visual question-answer training sample data set comprises a knowledge base and a plurality of groups of question-image pair samples carrying correct answer-correct event knowledge labels; each question-image pair sample comprises a question sample and its corresponding image sample, the question sample comprises a behavior of a target object, and the image sample at least comprises the interaction object to which the behavior of the target object interacting with the scene is directed;
the model training module is used for inputting the question-image pair samples into a pre-constructed visual question-answering model; the visual question-answering model comprises an image-text encoder, an interactive decoder and an inference decoder; the image-text encoder performs image-text encoding processing on the question-image pair samples and inputs the image-text encoding features to the interactive decoder and the inference decoder respectively; the interactive decoder extracts semantic features of the interaction object from the received image-text encoding features and sends the extracted interaction object features to the inference decoder; and the inference decoder fuses the received image-text encoding features with the interaction object features, and iteratively updates the model based on the losses between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base, until a preset model training end condition is met.
A fourth aspect of the present invention provides a visual question-answering apparatus, comprising:
the question and answer data acquisition module is used for acquiring questions to be answered and corresponding target images;
the answer output module is used for inputting the question to be answered and the corresponding target image into a visual question-answering model trained in advance by the visual question-answering model training method according to any one of the above; obtaining candidate answers, target interaction object features and supporting knowledge of the question to be answered according to the output of the visual question-answering model, and selecting a correct answer from the candidate answers based on the similarity between the candidate answers and the supporting knowledge; the target interaction object features are the features of the interaction object in the target image with which the target object corresponding to the question to be answered and the target image performs the scene interaction; and the supporting knowledge is the event knowledge representations, retrieved from the knowledge base, that are involved in the reasoning process of the question to be answered.
A fifth aspect of the invention provides an electronic device, comprising a memory and a processor, the processor being configured to implement the steps of the visual question-answering model training method according to any one of the above and/or of the visual question-answering method according to any one of the above when executing a computer program stored in the memory.
The sixth aspect of the present invention also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the visual question-answering model training method according to any one of the preceding claims and/or the visual question-answering method according to any one of the preceding claims.
The technical solution provided by the invention has the advantages that the visual question-answering model can perform event knowledge reasoning on the input question-image pairs based on the answers and event knowledge in the knowledge base of the visual question-answer training sample data set, can also extract the interaction object features, and can use the extracted semantic information of the interaction object to assist the reasoning over event knowledge, so that questions about interaction with a scene can be answered accurately, the accuracy of visual question answering in scene-based interaction tasks is effectively improved, the answers are given a certain interpretability through the inferred supporting knowledge and the interaction object features, and users' requirements for high-precision question answering and interpretable answers are met.
In addition, the invention also provides a visual question-answering method, a corresponding implementation device, electronic equipment and a readable storage medium aiming at the visual question-answering model training method, so that the method has more practicability, and the visual question-answering method, the device, the electronic equipment and the readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
For a clearer description of the present invention or of the technical solutions related thereto, the following brief description will be given of the drawings used in the description of the embodiments or of the related art, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without the inventive effort of a person skilled in the art.
FIG. 1 is a schematic flow chart of a training method of a visual question-answering model provided by the invention;
FIG. 2 is a schematic diagram of an exemplary structural framework of a visual question-answering model provided by the present invention;
FIG. 3 is a schematic diagram of a knowledge reasoning process in an illustrative example provided by the present invention;
FIG. 4 is a schematic flow chart of a visual question-answering method provided by the invention;
FIG. 5 is a schematic flow chart of a visual question-answering model for executing a visual question-answering task according to the present invention;
FIG. 6 is a schematic diagram of a hardware architecture to which the visual question-answering method provided by the present invention is applicable;
FIG. 7 is a schematic diagram of another exemplary structural framework of a visual question-answering model provided by the present invention;
FIG. 8 is a block diagram of an embodiment of a training device for a visual question-answering model provided by the present invention;
FIG. 9 is a block diagram of an embodiment of a visual question-answering apparatus provided by the present invention;
fig. 10 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and the detailed description. Wherein the terms "first," "second," and the like in the description and in the above-described figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Visual question-answering tasks based on scene interaction, such as AIVQA (Agent Interaction Visual Question Answering), are used to examine a visual question-answering model's ability to understand interactions between humans and scenes: given a picture and a question containing human behavior, the visual question-answering model must make inferences by combining the picture, the question and an external knowledge base, and give a corresponding answer.
In the current related art, the visual question-answering process based on scene interaction can complete the two tasks of question answering and knowledge reasoning simultaneously, and the visual question-answering model also gives a factual basis for its answers, but it cannot locate the interaction object. Meanwhile, because the scene requires the model to have very strong compositional reasoning capability and to reason jointly over the knowledge graph, the question and the image, the visual question-answering models of the related art have relatively low question-answering accuracy and knowledge-reasoning accuracy.
In view of the above, the invention can infer supporting knowledge and locate the interaction object while answering the question, so that the answers are more interpretable, and uses the extracted semantic information of the interaction object to assist knowledge reasoning, so that questions about interaction with a scene can be answered accurately and the accuracy of visual question answering in scene-based interaction tasks is effectively improved. Having described aspects of the invention, various non-limiting embodiments of the invention are described in detail below. Numerous specific details are set forth in the following description in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
Referring first to fig. 1, fig. 1 is a flow chart of a visual question-answer model training method provided in this embodiment, where the embodiment may include the following contents:
s101: a visual question and answer training sample dataset is obtained.
The visual question-answer training sample data set of this embodiment includes a knowledge base and a plurality of groups of question-image pair samples with correct answer-correct event knowledge labels. The number of question-image pair samples contained in the visual question-answer training sample data set can be chosen flexibly according to the actual application; within a certain range, the more question-image pair samples there are, the better the performance of the finally trained visual question-answering model. A group of question-image pair samples comprises at least one question sample and the image sample corresponding to it. The questions in the question-image pair samples are questions about the behavior of a user interacting with a scene, so each question sample comprises a behavior of a target object, where the target object may be a human, an animal or a microorganism; the image sample is the scene with which the target object interacts in the scene interaction task, carried in image format. When the target object interacts with the scene, the designated object is the target to which the behavior of the target object is directed, namely the object to which the behavior in the question refers, and this object also appears in the corresponding image sample, so the image sample at least comprises the interaction object to which the behavior of the target object interacting with the scene is directed. For example, if the question sample is "how does user A train a large language pre-training model by using server B located in the data center in the picture", the target object of the question sample is user A, the behavior is "using", the image sample is an image containing a data center and a server marked B, and the interaction object is server B. Each question-image pair sample of the visual question-answer training sample data set has a plurality of labels, including but not limited to the correct answer corresponding to the question sample, the interaction object features corresponding to the interaction object, and the correct event knowledge involved in reasoning out the correct answer corresponding to the question sample. For convenience of description, this embodiment defines the answer label and the event knowledge label of a question-image pair sample together as the correct answer-correct event knowledge label; in the actual calculation process, the labels of a question-image pair sample include, but are not limited to, the correct answer representation corresponding to the question sample, the position information and semantic information of the interaction object in the image sample, and the correct event knowledge representation involved in reasoning out the correct answer corresponding to the question sample. The correct answer representation refers to the feature representation of the correct answer corresponding to the question, and the correct event knowledge representation refers to the feature representation of the event knowledge related to the question that is used in the knowledge reasoning.
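As a hedged illustration of what one question-image pair sample with its labels might look like when serialized for training, a toy example follows; every field name and value below is hypothetical and merely echoes the user A / server B example above.

```python
# Hedged illustration of a single question-image pair sample with its labels.
# All field names and values are hypothetical, not the patent's data format.
sample = {
    "question": "How does user A train a large language pre-training model "
                "with server B located in the data center in the picture?",
    "image_path": "images/data_center_0001.jpg",          # hypothetical path
    "target_object": "user A",                            # subject of the behavior
    "behavior": "using",
    "interaction_object": {"name": "server B", "box": [120, 80, 360, 420]},
    "correct_answer": "submit the training job to server B",   # invented answer
    "correct_event_knowledge": [
        ("server B", "located_in", "data center"),
        ("server B", "used_for", "model training"),
    ],
}
print(sample["interaction_object"]["name"])
```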
The knowledge base of this embodiment includes answers and event knowledge. In the present invention, an event generally refers to the occurrence of a certain action or state involving participants, or a change of world state. In terms of granularity, an event lies between a word and a sentence: compared with a word, an event usually consists of several words that describe the occurrence of the event and its constituent elements, and is therefore a text unit with more complete semantics; compared with a sentence, an event focuses more on describing actions or changes in the real world and is a finer-grained depiction of the real world. In order to improve the accuracy of the visual question-answering model, this embodiment performs reasoning based on event knowledge. In practice, for ease of management, the knowledge base may be split into an answer base and an event knowledge base, where the answer base contains the answers and the event knowledge base contains the event knowledge. The event knowledge base and the answer base are constructed based on the actual application scene, the interaction behaviors between users and scenes, and the question-image pair samples; the set obtained by vectorizing the answer base is taken as the answer space. Similarly, the event knowledge base is vectorized to obtain the event knowledge space, that is, the event knowledge space is a set of event features (the deep learning field generally defines a set of domain-specific features as a space). Each event can be represented in a formalized script based on structured tuples, although other ways of representing events may also be adopted. Representing an event in a formalized script based on structured tuples means formalizing the event into a combination of a group of elements, which may be a two-tuple, a triple or a multi-tuple. For example, each event in the event knowledge space of this embodiment is a group of features of the event knowledge space, and may be a feature triple, where an event triple may include a head entity, a relation and a tail entity; since answers exist only on the entities, only the event entities, and not the relations, are matched when the supporting knowledge and the associated event knowledge are later matched in the event knowledge space.
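As a toy illustration of the triple-based event representation and the vectorized answer/event spaces just described, the snippet below uses a stand-in hashing "encoder"; the triples, answers and the embedding routine are all invented for the example.

```python
# Hedged sketch of (head entity, relation, tail entity) event triples and of
# vectorizing the answer base / event knowledge base into spaces; the encoder
# used for vectorization is a stand-in assumption.
import torch

event_knowledge_base = [
    ("tree", "cut_down_with", "axe"),
    ("axe", "needs", "sharpening"),
]
answer_base = ["sharpen the axe", "press the power button"]

def embed(text: str, dim: int = 256) -> torch.Tensor:
    torch.manual_seed(abs(hash(text)) % (2**31))   # deterministic stand-in encoder
    return torch.randn(dim)

# entity-level representations (only entities are matched when retrieving support)
event_space = torch.stack([embed(h) + embed(t) for h, _, t in event_knowledge_base])
answer_space = torch.stack([embed(a) for a in answer_base])
print(event_space.shape, answer_space.shape)       # torch.Size([2, 256]) each
```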
S102: and inputting the question-image pair samples of the visual question-answer training sample data set into a pre-constructed visual question-answer model.
The visual question-answering model in this step is a pre-constructed network model framework, and small batches of question-image pair samples are read from the visual question-answer training sample data set according to preset training parameters to train the visual question-answering model. The visual question-answering model comprises an image-text encoder, an inference decoder and an interactive decoder; the image-text encoder receives the question-image pair samples, text-encodes the question of each question-image pair sample, image-encodes the image, fuses the text encoding features with the image encoding features, and then inputs the fused features to the inference decoder and the interactive decoder respectively. The inference decoder is used for performing knowledge reasoning based on the image-text encoding features and the interaction object features output by the interactive decoder, so as to obtain the correct answer inferred at the current step and the associated event knowledge used in the reasoning process.
S103: the picture-text encoder is used for carrying out picture-text encoding processing on the problem-image pair sample, and picture-text encoding characteristics are respectively input into the interactive decoder and the reasoning decoder; the interactive decoder extracts semantic features of the interactive object from the received image-text coding features and sends the extracted interactive object features to the reasoning decoder; the reasoning decoder fuses the received image-text coding features and the interactive object features, and carries out iterative updating on the basis of the correct answer-correct event knowledge label corresponding to the fused image-text coding features and loss information between answers and event knowledge retrieved from a knowledge base until the preset model training ending condition is met.
In this embodiment, as shown in fig. 2, the input of the visual question-answering model is a question-image pair sample, that is, the question text and the image are input together into the image-text encoder. The image-text encoder encodes the question text and the image, and then fuses the encoded text features and image features; the fused features are defined as the image-text encoding features and are input to the interactive decoder and the inference decoder respectively. The interactive decoder extracts the semantic features of the interaction object from the image-text encoding features, outputs an image in which the interaction object is marked, and sends the extracted interaction object features to the inference decoder. In deep learning, semantic information refers to information such as the texture and color of an image or the category of a target; for example, in a detection network, after an image is input and convolved layer by layer, the semantic information becomes increasingly apparent while the relative position information becomes weaker, because the higher the convolution layer, the larger the receptive field of the feature map in the original image, and the weaker the perception of local position information. The interaction object is the object with which the user interacts in the scene, namely the object designated in the question asked by the user; for example, for the question "how to light the indicator lamp of the server", the interaction object is the indicator lamp, the image is an image of the server, and the behavior is lighting it. The interaction object features include, but are not limited to, position information of the interaction object, such as coordinate values, semantic information, and other required features. The interaction object features are used to assist the knowledge reasoning process of the inference decoder and can serve as clues in the knowledge reasoning. Each group of question-image pair samples has a correct answer and correct event knowledge as labels; the differences between the correct answer representations and the answers retrieved from the knowledge base, and between the correct event knowledge and the event knowledge retrieved from the knowledge base, are gradually reduced, and the model parameters of the visual question-answering model are continuously updated. For example, the visual question-answering model can be trained by mini-batch stochastic gradient descent until a preset model training end condition is met; the preset model training end condition may be, for example, that the number of iterations reaches a preset value, that the visual question-answering model converges, or that the accuracy of the visual question-answering model reaches a preset accuracy threshold, none of which affects the implementation of the present application. Before the gradient update iterations, the model needs to initialize the gradient descent algorithm and set the epoch (training period), batch_size (batch size), weight update period t, and number of iterations.
For example, the visual question-answer training sample data set may contain a total of 60,000 question-image pair samples, and the visual question-answering model is trained for at least 100 training periods. One training period (epoch) means that the model parameters of the neural network have been updated using all training samples in the training set exactly once; each update of the model parameters of the visual question-answering model uses one mini-batch of data, which completes one training step. In the gradient update iterations, 500 question-image pair samples are used per iteration, and these 500 question-image pair samples are referred to as one mini-batch, i.e. batch_size samples. The number of iterations refers to the number of training steps using batch_size samples; completing one epoch requires iteration = 60000 / 500 = 120 iterations. The weight update period means that, when training the visual question-answering model, the weights are updated once every t iterations. When the preset model training end condition is reached, the visual question-answering model is the trained visual question-answering model.
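The schedule arithmetic above can be written down directly; the sketch below only mirrors the 60,000-sample, 500-batch, 100-epoch numbers from the text, while the optimizer step placement and the value of the update period t are assumptions.

```python
# Hedged sketch of the training schedule described above: 60,000 samples with a
# mini-batch of 500 give 120 iterations per epoch; the update-period handling
# and the skipped forward/backward details are illustrative assumptions.
num_samples     = 60_000
epochs          = 100
batch_size      = 500
update_period_t = 1                                   # update weights every t iterations

iterations_per_epoch = num_samples // batch_size      # 120
print(iterations_per_epoch)

for epoch in range(epochs):
    for it in range(iterations_per_epoch):
        # forward pass on one mini-batch and gradient accumulation would go here
        if (it + 1) % update_period_t == 0:
            pass                                      # optimizer step / weight update
```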
For example, the task to be performed by the visual question-answering model is: given a picture and a question containing human behavior, where the image is a scene image containing a lawn, a white building, a blue sky and a tall green tree, and the question is "what should a person do if they want to chop down the tall object in front of the white building in the image", the visual question-answering model combines the picture, the question and an external knowledge base, infers the answer and locates the interaction object, and at the same time must give the factual basis for the answer. The visual question-answering model first has to reason within the image; the knowledge reasoning process of the visual question-answering model is shown in fig. 3: first, the object in front is located with respect to the white building, and then "green" and "tall" are used to locate the tree in front of the white building. Then, taking the behavior in the question (chopping down) and the visually located interaction object (the tree) as clues for knowledge reasoning, related events are found in the event knowledge base and the answer (sharpening) is inferred. At the same time, the tree marked with a bounding box is output as the interaction object of the behavior, and the event knowledge used in the reasoning is given as the basis, so that it can be judged whether the model has truly inferred the answer rather than merely memorized it through a mapping relation.
In the technical scheme provided by the embodiment, the visual question-answer model can carry out event knowledge reasoning on the input question-image pairs based on answers and event knowledge in the knowledge base of the visual question-answer training sample data set, can also extract the characteristics of interaction objects, and can also assist in reasoning of the event knowledge by utilizing the extracted semantic information of the interaction objects, so that the questions interacted with scenes can be accurately answered, the visual question-answer precision in the scene-based interaction task is effectively improved, a certain interpretability is provided for the answers through reasoning supporting the knowledge and the characteristics of the interaction objects, and the high-precision question-answer requirements and the answer interpretable requirements of users are met.
The above embodiment does not limit how the inference decoder performs the knowledge reasoning. On the basis of the above embodiment, this embodiment further provides an exemplary implementation of the knowledge reasoning, which may include the following:
in this embodiment, the answer reasoning branch receives the first type of image-text coding feature output by the position corresponding to the answer output identifier of the image-text encoder, and carries out iterative update based on the correct answer label corresponding to the first type of image-text coding feature and the loss information between the answers retrieved from the knowledge base; the knowledge reasoning branch receives second-class image-text coding features output by the corresponding positions of event output identifiers of the image-text encoder, fuses the second-class image-text coding features with interactive object features, and carries out iterative updating based on the fused image-text coding features and loss information among event knowledge of the knowledge base.
In this embodiment, since the invention needs both to answer the question and to reason with event knowledge, the inference decoder of this embodiment may include an answer inference branch and a knowledge inference branch. In order to identify to which branch each image-text encoding feature output by the image-text encoder is input, this embodiment may set an answer output identifier and an event output identifier, where the answer output identifier is used to identify the image-text encoding features that the image-text encoder inputs to the answer inference branch, and the event output identifier is used to identify the image-text encoding features that the image-text encoder inputs to the knowledge inference branch; that is, the answer output identifier and the event output identifier need to be added to the input of the image-text encoder. By way of example, each identifier may be in the format of a token, which is the smallest processing unit in text and, depending on the requirements and method of text processing, may be a word, a phrase, a punctuation mark or a character. As shown in fig. 7, the answer output identifier may be an answer token [answer], and the event output identifier may be an event token [event]. In this embodiment, the image-text encoder has three outputs: one output position corresponds to the input position of the answer output identifier and is used to input image-text encoding features to the answer inference branch; one output position corresponds to the input position of the event output identifier and is used to input image-text encoding features to the knowledge inference branch; and the last output position is used to input image-text encoding features to the interactive decoder. For convenience of description, the image-text encoding features input to the answer inference branch are defined as the first-type image-text encoding features, and the image-text encoding features input to the knowledge inference branch are defined as the second-type image-text encoding features.
For the convenience of data processing, each answer of the knowledge base is vectorized in advance to obtain an answer space containing a plurality of answer characterizations; and similarly, vectorizing the correct answer-correct event knowledge labels of the sample of each question-image pair to obtain correct event knowledge representation and correct answer representation, so that the correct answer representation corresponding to the current first-type image-text coding feature is obtained based on the correct answer-correct event knowledge labels of the question-image sample pair corresponding to the current first-type image-text coding feature. The loss calculation process of the answer inference branch may include: for each first type of image-text coding feature, determining standard similarity between the current first type of image-text coding feature and the corresponding correct answer representation, and reference similarity between the current first type of image-text coding feature and the answer representation of the answer space; and determining the loss information of each answer characterization of the current first-type image-text coding feature and answer space according to the standard similarity and each reference similarity.
In this embodiment, the similarity between the first-type image-text encoding feature and its corresponding correct answer representation is defined as the standard similarity, and the similarity between the first-type image-text encoding feature and each answer representation in the answer space is defined as the reference similarity. The similarity calculation may use any similarity calculation mode in the related art, such as hash-based similarity or cosine similarity, without affecting the implementation of the present application. In order to improve the training efficiency of the visual question-answering model, a similarity calculation relation and an answer inference loss function calculation relation may be stored in advance. The standard similarity between the current first-type image-text encoding feature and its corresponding correct answer representation can be calculated by invoking the similarity calculation relation:

$$s\left(P_n, a\right)=\frac{f\left(P_n\right)^{T}\, g(a)}{\tau}$$

Similarly, the reference similarity between the current first-type image-text encoding feature and each answer representation in the answer space can be calculated by the similarity calculation relation:

$$s\left(P_n, a'\right)=\frac{f\left(P_n\right)^{T}\, g(a')}{\tau},\quad a'\in A$$

where s(P_n, a) is the standard similarity and s(P_n, a') is the reference similarity, T denotes transposition, P_n is the question-image pair sample with index n, a denotes the correct answer, f(P_n) denotes the first-type image-text encoding feature corresponding to P_n, g(a) denotes the correct answer representation, τ denotes the adjustment parameter, a' denotes an answer in the answer space A, and g(a') denotes its answer representation.
Correspondingly, the answer reasoning loss function calculation relation can be directly called, and the answer reasoning loss of each answer characterization of each first-type image-text coding feature and answer space is calculated; the answer reasoning loss function calculation relation is:
in the method, in the process of the invention,L a for answer reasoning loss, N is the total number of question-image pairs samples,Tthe transpose is represented by the number,P n is given by index numbernIs a problem-image pair sample of (c),athe answer is represented by the sign of the answer,representing corresponding first type of graphic coding features +.>Representing the correct answer representation->Representing the adjustment parameters->Representing answers in answer space A, +.>Representing the answer characterizations in answer space a.
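For illustration only, the following PyTorch-style sketch shows how such a contrastive answer reasoning loss could be computed; the tensor names, shapes, and the temperature value are assumptions of the example rather than part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def answer_inference_loss(first_type_feats, answer_space_reps, correct_idx, tau=0.07):
    """Contrastive answer reasoning loss (sketch).

    first_type_feats:  (N, D) first-type image-text coding features, one per question-image pair
    answer_space_reps: (K, D) vectorized representations of all answers in the answer space
    correct_idx:       (N,)   index of the correct answer for each sample
    tau:               adjustment (temperature) parameter
    """
    # Reference similarities between each sample and every answer representation
    ref_sim = first_type_feats @ answer_space_reps.T / tau      # (N, K)
    # The standard similarity is the entry at the correct-answer index;
    # cross_entropy maximizes it relative to all reference similarities.
    return F.cross_entropy(ref_sim, correct_idx)
```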
Furthermore, this embodiment also provides an exemplary network structure for the answer inference branch used in practical applications, which enables answers to be learned simply and efficiently; it may comprise a semantic space layer, an answer feature extraction layer, and an answer feature representation layer. The answer feature representation layer vectorizes each answer in the knowledge base, generates the corresponding answer characterizations, and sends each answer characterization to the answer feature extraction layer; the answer feature extraction layer maps each answer characterization of the answer feature representation layer to the semantic space layer, and may for example be a multi-layer perceptron, although other network structures capable of performing the mapping may also be used. The semantic space layer receives the first-type image-text coding features output at the position corresponding to the answer output identifier of the image-text encoder and calculates the similarity between each first-type image-text coding feature and each answer characterization of the knowledge base; for example, the semantic space layer may have a built-in similarity calculation relation, with which it computes the similarity with each answer characterization in the answer space each time a first-type image-text coding feature is received.
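A minimal sketch of such an answer inference branch is given below, assuming an embedding table stands in for the answer feature representation layer and a small multi-layer perceptron for the answer feature extraction layer; all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class AnswerInferenceBranch(nn.Module):
    """Sketch of the answer inference branch: representation layer, MLP extraction layer,
    and a semantic space layer that scores features against every answer representation."""

    def __init__(self, num_answers, rep_dim, feat_dim, tau=0.07):
        super().__init__()
        self.answer_reps = nn.Embedding(num_answers, rep_dim)    # answer feature representation layer
        self.extract = nn.Sequential(                             # answer feature extraction layer (MLP)
            nn.Linear(rep_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        self.tau = tau

    def forward(self, first_type_feats):
        # Semantic space layer: similarity of each feature to each mapped answer representation
        mapped = self.extract(self.answer_reps.weight)            # (K, feat_dim)
        return first_type_feats @ mapped.T / self.tau             # (N, K) similarity scores
```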
In order to further improve the accuracy of knowledge reasoning, for the fusion of the second-type image-text coding features and the interactive object features, this embodiment also provides an exemplary fusion mode, which helps improve the learning accuracy of the knowledge inference branch and may comprise the following contents:
Calculate distance measurement information between the interactive object features and each second-type image-text coding feature to obtain initial fused image-text coding features; and perform feature summation between each initial fused image-text coding feature and the corresponding second-type image-text coding feature to obtain the fused image-text coding features.
The purpose of fusing the second-type image-text coding features with the interactive object features is to use the semantic information of the interactive object obtained by the interactive decoder to help the learning of the knowledge inference branch: the KL divergence (Kullback-Leibler divergence) or cross entropy from the interactive object feature to the second-type image-text coding feature is calculated and then added to the original second-type image-text coding feature. KL divergence measures the distance between two probability distributions, that is, the degree to which one probability distribution deviates from the other; given the degree to which either distribution deviates from the true distribution, a better fusion effect can be achieved. In order to further improve the fusion efficiency, a feature fusion relation can be stored in advance and called directly to fuse each second-type image-text coding feature with the interactive object features; the feature fusion relation can be expressed as:

$$f_{es}=D_{KL}(f_{e}\,\|\,f_{s})+f_{e}$$

where $f_{es}$ is the fused image-text coding feature, $f_{e}$ is the second-type image-text coding feature, $f_{s}$ is the interactive object feature, and $D_{KL}(f_{e}\|f_{s})$ represents the KL divergence between the interactive object feature and the second-type image-text coding feature.
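The following sketch illustrates one way this fusion could be implemented, under the assumption that the divergence term is computed element-wise so that the result keeps the shape of the second-type feature; the function name and the softmax normalization of the feature vectors are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def fuse_features(second_type_feat, interact_feat, eps=1e-8):
    """Fuse a second-type image-text coding feature with the interactive object feature:
    a KL-divergence term plus the original second-type feature (f_es = KL(f_e || f_s) + f_e)."""
    # Treat the feature vectors as distributions for the divergence term (assumption)
    p = F.softmax(second_type_feat, dim=-1)
    q = F.softmax(interact_feat, dim=-1)
    kl = p * (torch.log(p + eps) - torch.log(q + eps))            # element-wise KL contribution
    return kl + second_type_feat                                   # feature-sum addition
```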
For example, in order to facilitate data processing, each event knowledge of the knowledge base is vectorized in advance to obtain an event knowledge space containing a plurality of event knowledge characterizations. And similarly, vectorizing the correct answer-correct event knowledge labels of the sample of each question-image pair to obtain correct event knowledge representation and correct answer representation, so that the correct event knowledge representation corresponding to the current fusion image-text coding feature can be obtained based on the correct answer-correct event knowledge labels of the question-image sample pair corresponding to the current fusion image-text coding feature. The loss calculation process of the knowledge reasoning branch can comprise: determining event standard similarity between the current fusion image-text coding feature and the corresponding correct event knowledge representation and event reference similarity between the current fusion image-text coding feature and the event knowledge representation of the event knowledge space; and determining loss information between the current fusion graphic coding feature and each event knowledge representation according to the event standard similarity and each event reference similarity.
In this embodiment, the similarity between a fused image-text coding feature and its corresponding correct event knowledge representation is defined as the event standard similarity, and the similarity between a fused image-text coding feature and each event knowledge representation in the event knowledge space is defined as the event reference similarity. The similarity calculation may use any similarity calculation mode in the related technology, such as hash-based similarity or cosine similarity, which does not affect the implementation of the application. In order to improve the training efficiency of the visual question-answering model, an event similarity calculation relation and a knowledge reasoning loss function calculation relation can be stored locally in advance; the event similarity calculation relation can be called directly to calculate the event standard similarity between the current fused image-text coding feature and its corresponding correct event knowledge representation, and it can be expressed as:

$$s^{\mathrm{std}}_{n}=\exp\!\left(\frac{f_{es}(P_n)^{T}\,h(e^{*}_{n})}{\tau}\right)$$

Similarly, the event reference similarity between the current fused image-text coding feature and each event knowledge representation in the event knowledge space can be calculated with the same relation:

$$s^{\mathrm{ref}}_{n,e}=\exp\!\left(\frac{f_{es}(P_n)^{T}\,h(e)}{\tau}\right),\quad e\in E$$

where $s^{\mathrm{std}}_{n}$ is the event standard similarity, $s^{\mathrm{ref}}_{n,e}$ is the event reference similarity, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $e$ denotes event knowledge in the event knowledge space $E$, $s$ represents the interactive object, $f_{es}(P_n)$ represents the fused image-text coding feature corresponding to $P_n$, $h(e^{*}_{n})$ represents the correct event knowledge representation, $h(e)$ represents the event knowledge representation of $e$ in the event knowledge space $E$, and $\tau$ represents the adjustment parameter.

Correspondingly, the knowledge reasoning loss function calculation relation can be called directly to calculate the knowledge reasoning loss between each fused image-text coding feature and each event knowledge representation of the knowledge base; the knowledge reasoning loss function calculation relation can be expressed as:

$$L_{e}=-\frac{1}{N}\sum_{n=1}^{N}\log\frac{\exp\!\big(f_{es}(P_n)^{T}\,h(e^{*}_{n})/\tau\big)}{\sum_{e\in E}\exp\!\big(f_{es}(P_n)^{T}\,h(e)/\tau\big)}$$

where $L_{e}$ is the knowledge reasoning loss and $N$ is the total number of question-image pair samples; the remaining symbols are as defined above.
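As with the answer branch, the knowledge reasoning loss can be sketched as a contrastive objective over the event knowledge space; the function below is purely illustrative and mirrors the answer loss sketch given earlier.

```python
import torch
import torch.nn.functional as F

def knowledge_inference_loss(fused_feats, event_space_reps, correct_event_idx, tau=0.07):
    """Contrastive knowledge reasoning loss (sketch): scores the fused image-text coding
    features against every event knowledge representation in the event knowledge space."""
    sims = fused_feats @ event_space_reps.T / tau          # event reference similarities (N, K)
    return F.cross_entropy(sims, correct_event_idx)        # maximizes the event standard similarity
```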
Furthermore, this embodiment also provides an exemplary network structure for the knowledge inference branch used in practical applications, which enables event knowledge to be learned simply and efficiently; it may comprise a feature fusion layer, an event space layer, an event feature extraction layer, and an event knowledge feature representation layer. The feature fusion layer receives the second-type image-text coding features output at the position corresponding to the event output identifier of the image-text encoder, fuses each second-type image-text coding feature with the interactive object features, and sends the fused image-text coding features to the event space layer; the event knowledge feature representation layer vectorizes each event knowledge in the knowledge base, generates the corresponding event knowledge representations, and sends each event knowledge representation to the event feature extraction layer. The event feature extraction layer maps each event knowledge representation of the event knowledge feature representation layer to the event space layer; it may for example be a multi-layer perceptron, although other network structures capable of performing the mapping may also be used. The event space layer calculates the similarity between each fused image-text coding feature and each event knowledge representation of the event knowledge space; for example, the event space layer may have a built-in event similarity calculation relation, with which it computes the similarity with each event knowledge representation in the event knowledge space each time a fused image-text coding feature is received.
As can be seen from the above, the present embodiment uses the interaction decoder to determine the feature information of the scene interaction object; and then, by utilizing a knowledge reasoning branch, the most accurate answer is obtained through the combined reasoning of the image, the question and the knowledge, the supporting knowledge of the answer and the position of the interactive object can be given, and a certain interpretability is provided for the reasoning of the answer.
The above embodiment does not limit how the image-text encoder performs image-text encoding on each question-image pair sample; this embodiment also provides an exemplary image-text encoding implementation, which may include the following:
For each question-image pair sample: text-encode the question sample corresponding to the current question-image pair sample to obtain text coding features; image-encode the image sample corresponding to the current question-image pair sample to obtain image coding features; and perform feature fusion on the text coding features and the image coding features, outputting the image-text coding features generated by fusion to the interactive decoder and the inference decoder.
The text encoding may be implemented by any network structure capable of encoding text features, including but not limited to RoBERTa (Robustly Optimized Bidirectional Encoder Representations from Transformers, a robustly optimized BERT) and long short-term memory networks; the image encoder may be implemented by any network structure capable of encoding images, including but not limited to convolutional neural networks and the 101-layer residual network (ResNet-101), which does not affect the implementation of the present invention. Further, the process of fusing the text coding features with the image coding features and adding the answer output identifier and the event output identifier may include the following. For one path, the text coding features and the image coding features are spliced, for example with a callable CONCATENATE function that merges the two feature sequences into one sequence; the spliced features are then encoded, for example through a cross-attention layer and a multi-head attention layer, and the image-text coding features corresponding to the spliced features are output to the interactive decoder. For the other path, the text coding features and the image coding features are spliced into an input sequence, the answer output identifier and the event output identifier are inserted before the input sequence, and the input sequence is encoded, for example through a cross-attention layer and a multi-head attention layer, with the image-text coding features corresponding to the input sequence output to the inference decoder. The inference decoder and the interactive decoder apply attention adjustment to the image-text coding features output by the image-text encoder so as to obtain the encoder information related to the current decoding position. The image-text encoder encodes the input sequence into a series of feature vectors, and the inference decoder and the interactive decoder gradually generate an output sequence from these feature vectors, so that they can effectively model the context of the current generation position.
Furthermore, this embodiment also provides an exemplary network structure for the image-text encoder used in practical applications, enabling deeper fusion and encoding of the question text features and the image features and improving the accuracy of the image-text coding features. The image-text encoder may include a text input for inputting the question, an image input for inputting the image, an answer output identifier input for inputting the answer output identifier, an event output identifier input for inputting the event output identifier, an image encoding layer, a text encoding layer, a feature splicing layer, a first cross-attention layer, and a second cross-attention layer. The feature splicing layer performs feature splicing on the image coding features output by the image encoding layer and the text coding features output by the text encoding layer; the first cross-attention layer encodes the spliced features output by the feature splicing layer and inputs the generated image-text coding features to the interactive decoder; the second cross-attention layer encodes the answer output identifier from the answer output identifier input, the event output identifier from the event output identifier input, the image coding features output by the image encoding layer, and the text coding features output by the text encoding layer, and inputs the generated image-text coding features to the inference decoder.
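A simplified sketch of such an encoder is shown below; self-attention layers stand in for the cross-attention layers, the text and image encoders are passed in as generic modules rather than a specific RoBERTa or ResNet, and all names and dimensions are assumptions of the example.

```python
import torch
import torch.nn as nn

class ImageTextEncoder(nn.Module):
    """Sketch of the image-text encoder: separate text/image encoders, feature splicing,
    one attention path for the interactive decoder and one for the inference decoder."""

    def __init__(self, text_encoder, image_encoder, dim, n_heads=8):
        super().__init__()
        self.text_encoder = text_encoder        # e.g. a RoBERTa-style module returning (B, Lt, dim)
        self.image_encoder = image_encoder      # e.g. a ResNet-style module returning (B, Li, dim)
        self.answer_tok = nn.Parameter(torch.randn(1, 1, dim))   # [answer] output identifier
        self.event_tok = nn.Parameter(torch.randn(1, 1, dim))    # [event] output identifier
        self.attn1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # path to interactive decoder
        self.attn2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # path to inference decoder

    def forward(self, question, image):
        t = self.text_encoder(question)                            # text coding features
        v = self.image_encoder(image)                              # image coding features
        spliced = torch.cat([t, v], dim=1)                         # feature splicing layer
        to_interactive, _ = self.attn1(spliced, spliced, spliced)  # features for the interactive decoder
        b = t.size(0)
        seq = torch.cat([self.answer_tok.expand(b, -1, -1),
                         self.event_tok.expand(b, -1, -1), spliced], dim=1)
        out, _ = self.attn2(seq, seq, seq)                         # features for the inference decoder
        answer_feat, event_feat = out[:, 0], out[:, 1]             # first-/second-type coding features
        return to_interactive, answer_feat, event_feat
```

The two returned identifier positions correspond to the first-type and second-type image-text coding features described above.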
The above embodiment does not limit how to extract the features of the interactive object by using the interactive decoder, and the present embodiment provides an extraction method of the features of the interactive object, which may include the following:
in this embodiment, the interactive decoder includes an interactive object feature extraction model; the interactive object feature extraction model is used for positioning the interactive object in the corresponding image sample based on the received image-text coding feature and outputting the position information and semantic information of the interactive object.
The interactive object feature extraction model may be any network model structure capable of locating an interactive object, including but not limited to the pre-trained decoding Transformer (converter network) model in MDETR (Multi-modal Detection with Transformers, a multi-modal object detector) and the YOLOv5 (You Only Look Once version 5, deep-learning-based object detection) model. The received image-text coding features are input into the interactive object feature extraction model of the interactive decoder; the model processes the image-text coding features, locates the position of the interactive object, outputs the position information and semantic information of the interactive object to the inference decoder, and outputs an image marking the position of the object.
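Purely as an illustration, the following sketch shows the kind of interface such an interactive decoder could expose, using a generic Transformer decoder with learned object queries in place of a specific MDETR or YOLOv5 model; the query count, box parameterization, and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class InteractiveDecoder(nn.Module):
    """Sketch of the interactive decoder: a query-based localization head returning
    semantic features of the interactive object plus normalized box coordinates."""

    def __init__(self, dim, num_queries=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.box_head = nn.Linear(dim, 4)            # interactive-object position (x, y, w, h)

    def forward(self, image_text_feats):
        b = image_text_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        obj = self.decoder(q, image_text_feats)       # semantic features of the interactive object
        return obj, self.box_head(obj).sigmoid()      # features for the inference decoder + boxes
```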
From the above, the embodiment utilizes the position information and semantic information of the interactive object to assist knowledge reasoning, which is beneficial to improving learning accuracy of supporting knowledge.
It may be appreciated that, in order to improve the performance of the visual question-answering model and improve the question-answering accuracy and the inference accuracy, on the basis of the above embodiment, this embodiment further provides a visual question-answering model loss function; the total loss function relation of the visual question-answering model may be expressed as:

$$L_{r}=L_{v}-\frac{1}{N}\sum_{n=1}^{N}\left[\log\frac{\exp\!\big(f_{es}(P_n)^{T}\,h(e^{*}_{n})/\tau\big)}{\sum_{e\in E}\exp\!\big(f_{es}(P_n)^{T}\,h(e)/\tau\big)}+\log\frac{\exp\!\big(f_a(P_n)^{T}\,g(a^{*}_{n})/\tau\big)}{\sum_{a\in A}\exp\!\big(f_a(P_n)^{T}\,g(a)/\tau\big)}\right]$$

where $L_{r}$ is the total loss, $L_{v}$ is the interactive object localization loss of the interactive decoder, $N$ is the total number of question-image pair samples, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $e$ denotes event knowledge, $s$ represents the interactive object, $f_{es}(P_n)$ represents the fused image-text coding feature corresponding to $P_n$, $h(e^{*}_{n})$ represents the correct event knowledge representation, $h(e)$ represents the event knowledge representation of $e$ in the event knowledge space $E$, $a$ denotes an answer, $f_a(P_n)$ represents the corresponding first-type image-text coding feature, $g(a^{*}_{n})$ represents the correct answer representation, $g(a)$ represents the answer representation of answer $a$ in the answer space $A$, and $\tau$ represents the adjustment parameter; that is, the total loss is the sum of the interactive object localization loss, the answer reasoning loss, and the knowledge reasoning loss defined above.
Illustratively, the interactive decoder employs a Transformer (converter) network architecture, and correspondingly $L_{v}$, as the loss function for interactive object localization, may employ the contrastive learning loss used in MDETR; $L_{v}$ may be any of the contrastive learning losses described in the related art, and this embodiment is not limited in this regard.
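Assuming the total loss is the sum of the localization loss and the two contrastive losses sketched above, the training combination can be written as a trivial helper; all names are placeholders.

```python
def total_training_loss(localization_loss, answer_loss, knowledge_loss):
    """Total loss of the visual question-answering model, assuming L_r = L_v + L_a + L_e."""
    return localization_loss + answer_loss + knowledge_loss
```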
After training to obtain the visual question-answering model, the visual question-answering model may be used to execute visual question-answering tasks based on scene interaction, referring to fig. 4 and 5, the task execution process may include the following:
s401: and acquiring the questions to be answered and the corresponding target images.
S402: and inputting the questions to be answered and the corresponding target images into a pre-trained visual question-answering model.
S403: and obtaining candidate answers of the questions to be answered, target interaction object characteristics and supporting knowledge according to the output of the visual question-answering model, and selecting correct answers from the candidate answers based on the similarity between the candidate answers and the supporting knowledge.
The question to be answered is the question asked by the user in the visual question-answering task, and the target image is the scene in which the user interacts. The visual question-answering model is obtained by training with the visual question-answering model training method described in any of the above embodiments. The visual question-answering model sorts the answers in the knowledge base from high to low by the probability of being the correct answer, and outputs a number of top-ranked answers as candidate answers; the number of candidate answers can be flexibly set according to actual requirements such as answer precision and answer output efficiency, for example 100, 50, or 1000 answers, which does not affect the implementation of the invention. In addition to the candidate answers, the visual question-answering model may also output the target interactive object features supporting the answer and at least one piece of supporting knowledge. The correct answer is the final answer selected from the candidate answers by the visual question-answering model using supporting knowledge on the basis of semantic similarity. The target interactive object features are the features, in the target image, of the interactive object that the question to be answered concerns, and may further comprise the position information and semantic features of the target interactive object. Supporting knowledge is the event knowledge characterization retrieved from the knowledge base that is relevant to the reasoning process of the question to be answered. The event knowledge is used to support answer reasoning; for convenience of description and without ambiguity, this embodiment defines the associated event knowledge output during execution of the visual question-answering task as supporting knowledge.
Therefore, the visual question-answering accuracy in the scene-based interaction task can be effectively improved, and the answer is more interpretable.
The above embodiment does not limit how knowledge is used, through semantic similarity, to enhance answer reasoning; this embodiment also provides an implementation of selecting the correct answer based on semantic similarity, which may include the following:
the image-text encoder of the visual question-answering model performs image-text encoding on the question to be answered and the corresponding target image, and outputs the image-text coding features to be processed; the interactive decoder of the visual question-answering model extracts the interactive object features from the image-text coding features to be processed and outputs the target interactive object features; the inference decoder of the visual question-answering model retrieves a plurality of candidate answers in the answer space based on the image-text coding features to be processed, and retrieves a plurality of pieces of associated supporting knowledge in the event knowledge space; the correct answer is then selected based on the similarity between each candidate answer and each piece of supporting knowledge.
In this embodiment, the result of performing image-text encoding on the question to be answered and the corresponding target image with the image-text encoder is defined as the image-text coding features to be processed. The number of candidate answers may be selected according to the actual application scenario, for example 100 or 50 candidate answers, which does not affect the implementation of the present invention. The process of using supporting knowledge to assist in selecting the optimal answer in this embodiment is: calculate the similarity between each candidate answer and each piece of supporting knowledge, determine the score of the current candidate answer based on the numerical relation between each similarity and a preset similarity threshold, and take the candidate answer with the highest score as the correct answer. In other words, this embodiment takes the answer most similar to the supporting knowledge as the optimal answer. When calculating the similarity between a candidate answer and each piece of supporting knowledge, any similarity calculation method may be used, for example hash-based similarity calculation or Sentence-BERT (Sentence Bidirectional Encoder Representations from Transformers), and this is not limited here. As an exemplary high-precision similarity calculation method, this embodiment may generate sentence vectors using a twin (Siamese) network or a triplet network based on the Sentence-BERT model, and then calculate the similarity of the sentence vectors through cosine similarity. In order to improve the similarity calculation efficiency for the candidate answers, an answer similarity calculation relation can be stored locally in advance and then called directly to calculate the similarity between each candidate answer and each piece of supporting knowledge in the supporting knowledge set; the answer similarity calculation relation may be expressed as:
$$M_{m,j}=\alpha\,\mathrm{sim}(a_{m},e_{j}),\quad e_{j}\in \hat{E}$$

where $M_{m,j}$ is the similarity between candidate answer $a_{m}$ and supporting knowledge $e_{j}$, $\alpha$ is a weight coefficient, $\hat{E}$ is the supporting knowledge set, and $\mathrm{sim}(\cdot)$ represents the similarity calculation.
After calculating the similarity between a candidate answer and each piece of supporting knowledge: when the similarity between the candidate answer and a piece of supporting knowledge is greater than or equal to the preset similarity threshold, an increment is added to the score of the candidate answer on the basis of its original score, and as a simple implementation the increment may directly adopt the preset similarity threshold; when the similarity between the candidate answer and a piece of supporting knowledge is smaller than the preset similarity threshold, the score of the candidate answer is unchanged. Illustratively, the score of each candidate answer may be calculated using an answer score calculation relation, which may be expressed as:
$$SIM(P,a_{m})=\frac{f(P)^{T}\,g(a_{m})}{\tau}+\sum_{e_{j}\in \hat{E}}\beta\cdot\mathbb{1}\big[M_{m,j}\ge \beta\big]$$

where $P$ represents the question-target image pair to be answered, $SIM(P,a_{m})$ is the score of candidate answer $a_{m}$ for the question-target image pair to be answered, $f(P)$ is the first-type image-text coding feature of the question-target image pair to be answered, $g(a_{m})$ represents the answer representation corresponding to candidate answer $a_{m}$, $T$ indicates the transpose, $\tau$ represents the adjustment parameter, $\beta$ is the preset similarity threshold, $\mathbb{1}[\cdot]$ is the indicator function, and $M_{m,j}$ is the similarity between candidate answer $a_{m}$ and supporting knowledge $e_{j}$.
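The knowledge-assisted selection described above can be sketched as follows; `sim_fn`, the weight `alpha`, and the threshold value are placeholders for whichever similarity method (for example Sentence-BERT cosine similarity) and hyperparameters are actually used, and the base scores are assumed to come from the answer inference branch.

```python
import torch

def select_answer(cand_sims, cand_answers, support_knowledge, sim_fn, beta=0.5, alpha=1.0):
    """Knowledge-assisted answer selection sketch: start from the model's retrieval score for
    each candidate answer, add the threshold beta once for every supporting knowledge whose
    similarity to the candidate reaches beta, then return the highest-scoring candidate."""
    scores = cand_sims.clone()                                    # base scores from the answer branch
    for m, ans in enumerate(cand_answers):
        for knowledge in support_knowledge:
            if alpha * sim_fn(ans, knowledge) >= beta:            # e.g. Sentence-BERT cosine similarity
                scores[m] += beta                                 # increment by the preset threshold
    return cand_answers[int(torch.argmax(scores))]
```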
From the above, it can be known that the embodiment is suitable for all visual question-answering tasks providing a knowledge base, and the knowledge-aided answer reasoning is utilized, so that the visual question-answering model can simultaneously reason answers, support knowledge and interactive objects, thereby realizing combined reasoning of images, questions and the knowledge base, and improving the accuracy of the visual question-answering model to reason answers.
It should be noted that, in the present invention, the steps are not strictly executed sequentially, so long as they conform to the logic sequence, and the steps may be executed simultaneously or according to a certain preset sequence, and fig. 1 and fig. 4 are only schematic, and do not represent only such an execution sequence.
Finally, based on the above technical solution of the present invention, the following description will be given by way of example with reference to fig. 6, where fig. 6 is a schematic diagram of a hardware composition framework to which the visual question-answering method provided by the present invention is applicable, and the following may be included:
The hardware composition framework may include a first electronic device 61 and a second electronic device 62 connected through a network 63. The first electronic device 61 deploys a processor for executing the visual question-answering model training method described in any of the above embodiments, and trains the visual question-answering model based on the visual network model structure framework shown in fig. 7; this structure includes an image-text encoder, an inference decoder, and an interactive decoder, where the framed part of the inference decoder is the inference-phase structure and the rest is the training-phase structure. The image-text encoder comprises a text input, an image input, an answer word input, an event word input, an image encoding layer, a text encoding layer, a feature splicing layer, a first cross-attention layer, and a second cross-attention layer; the image encoding layer encodes image features with ResNet-101, and the text encoding layer encodes the question text with a RoBERTa model. The image features and the text features are spliced together by calling the splicing function of the feature splicing layer, and the spliced output serves as the input sequence of the first cross-attention layer to obtain the image-text coding features input to the interactive decoder. For the other path, the image features and the text features are spliced into an input sequence, two tokens are inserted before the input sequence as the corresponding identifiers, namely [Answer] and [Event], and the sequence is input into the second cross-attention layer; the position vectors corresponding to [Answer] and [Event] in the output sequence are used as the first-type and second-type image-text coding features, which are input into the answer reasoning branch and the knowledge reasoning branch of the inference decoder respectively. The knowledge reasoning branch comprises a feature fusion layer, an event space layer, an event feature extraction layer, and an event knowledge feature representation layer; the feature fusion layer receives the second-type image-text coding features output at the position corresponding to the event output identifier of the image-text encoder, fuses each second-type image-text coding feature with the interactive object features, and sends the fused image-text coding features to the event space layer; the event space layer calculates the similarity between each fused image-text coding feature and each event knowledge representation of the event knowledge space; the event feature extraction layer maps each event knowledge representation of the event knowledge feature representation layer to the event space layer; the event knowledge feature representation layer extracts each event knowledge representation from the event knowledge space and sends each event knowledge representation to the event feature extraction layer.
The answer reasoning branch comprises a semantic space layer, an answer feature extraction layer, and an answer feature representation layer; the semantic space layer receives the first-type image-text coding features output at the position corresponding to the answer output identifier of the image-text encoder and calculates the similarity between each first-type image-text coding feature and each answer characterization; the answer feature extraction layer maps each answer characterization of the answer feature representation layer to the semantic space layer; the answer feature representation layer extracts each answer characterization from the answer space and sends each answer characterization to the answer feature extraction layer. The interactive decoder uses a Transformer structure to learn the position information and semantic information for locating the interactive object.
The first electronic device 61 sends the trained visual question-answering model to the second electronic device 62. The second electronic device 62 also deploys a client providing a human-machine interaction interface; the user inputs the question to be processed and the target image through this interface, the model retrieves associated event knowledge in the event knowledge space and a candidate answer set in the answer space, fuses them, and jointly infers the best answer, with the event knowledge assisting the answer reasoning on the basis of semantic similarity.
It should be noted that the above application scenario is only shown for the convenience of understanding the idea and principle of the present invention, and the embodiment of the present invention is not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
From the above, the present embodiment uses the object positioning branch to determine the position and semantic information of the scene interaction object; and then, obtaining a related event knowledge set through combined reasoning of the image and the problem and semantic information of the interaction object by utilizing a knowledge reasoning branch of the reasoning decoder, obtaining an answer candidate set with the ranking of top 100 through an answer searching branch, and finally obtaining a final problem answer through the answer candidate set and the event knowledge set together. Finally, the most accurate answer is obtained through the combined reasoning of the image, the question and the knowledge, the supporting knowledge of the answer and the position of the interactive object can be given, and a certain interpretability is provided for the reasoning of the answer.
The invention also provides a corresponding device for the visual question-answering and model training method thereof, so that the method has more practicability. Wherein the device may be described separately from the functional module and the hardware. In the following description, a visual question-answering model training device and a visual question-answering device provided by the present invention are described, which are used to implement the visual question-answering and its corresponding model training method provided by the present invention, in this embodiment, the visual question-answering and model training device and the visual question-answering device may include or be divided into one or more program modules, where the one or more program modules are stored in a storage medium and executed by one or more processors, to complete the visual question-answering and its model training method disclosed in the first embodiment. Program modules in this embodiment refer to a series of computer program instruction segments capable of performing a specific function, and are more suitable than programs themselves for describing the execution of the visual question-answering model training device and the visual question-answering device in the storage medium. The following description will specifically describe the functions of each program module of the present embodiment, and the visual question-answering model training device and the visual question-answering device described below and the visual question-answering and corresponding model training method described above may be referred to correspondingly to each other.
Based on the angles of the functional modules, referring to fig. 8, fig. 8 is a structural diagram of the visual question-answering model training device provided in this embodiment under a specific implementation manner, where the device may include:
a training data acquisition module 801, configured to acquire a visual question-answer training sample data set;
model training module 802 for inputting question-image pair samples into a pre-constructed visual question-answer model; the visual question-answering model comprises an image-text encoder, an interactive decoder and an inference decoder; the image-text encoder carries out image-text encoding processing on the problem-image pair sample, and respectively inputs image-text encoding characteristics to the interactive decoder and the reasoning decoder; the interactive decoder extracts semantic features of the interactive object from the received image-text coding features and sends the extracted interactive object features to the reasoning decoder; and the reasoning decoder fuses the received image-text coding features and the interactive object features, and iteratively updates loss information between correct answers-correct event knowledge labels corresponding to the fused image-text coding features and answers and event knowledge retrieved from the knowledge base until a preset model training ending condition is met.
Illustratively, in some implementations of the present embodiment, the above-mentioned reasoning decoder includes a knowledge reasoning branch and an answer reasoning branch, where the input of the graphic encoder further includes an answer output identifier and an event output identifier, and the answer reasoning branch receives a first type of graphic encoding feature output by a location corresponding to the answer output identifier of the graphic encoder, and performs iterative update based on a correct answer label corresponding to the first type of graphic encoding feature and loss information between answers retrieved from the knowledge base; the knowledge reasoning branch receives second-class image-text coding features output by the position corresponding to the event output identifier of the image-text encoder, fuses the second-class image-text coding features with the interactive object features, and carries out iterative updating based on the fused image-text coding features and loss information among event knowledge of the knowledge base; the answer output identifier is used for identifying the image-text coding characteristics of the image-text coder input to the answer reasoning branch, and the event output identifier is used for identifying the image-text coding characteristics of the image-text coder input to the knowledge reasoning branch.
As an exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
carrying out vectorization representation on each answer of the knowledge base in advance to obtain an answer space containing a plurality of answer characterizations;
for each first type of image-text coding feature, obtaining a correct answer representation corresponding to the current first type of image-text coding feature based on a correct answer-correct event knowledge label of a question-image sample pair corresponding to the current first type of image-text coding feature, and determining standard similarity between the current first type of image-text coding feature and the correct answer representation corresponding to the current first type of image-text coding feature;
determining reference similarity between the current first-type image-text coding features and answer characterization of the answer space;
and determining the loss information of each answer characterization of the current first-type image-text coding feature and answer space according to the standard similarity and each reference similarity.
As an exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
invoking a similarity calculation relation, and calculating the standard similarity between the current first-type image-text coding features and the corresponding correct answer characterizations; the similarity calculation relation is:
$$s^{\mathrm{std}}_{n}=\exp\!\left(\frac{f_a(P_n)^{T}\,g(a^{*}_{n})}{\tau}\right)$$

where $s^{\mathrm{std}}_{n}$ is the standard similarity, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $a$ denotes an answer, $f_a(P_n)$ represents the first-type image-text coding feature corresponding to $P_n$, $g(a^{*}_{n})$ represents the correct answer representation, and $\tau$ represents the adjustment parameter; the reference similarity is obtained in the same way against each answer representation in the answer space.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
invoking an answer reasoning loss function to calculate a relation, and calculating answer reasoning loss between the first-class image-text coding features and each answer retrieved from the knowledge base; the answer reasoning loss function calculation relation is:
$$L_{a}=-\frac{1}{N}\sum_{n=1}^{N}\log\frac{\exp\!\big(f_a(P_n)^{T}\,g(a^{*}_{n})/\tau\big)}{\sum_{a\in A}\exp\!\big(f_a(P_n)^{T}\,g(a)/\tau\big)}$$

where $L_{a}$ is the answer reasoning loss, $N$ is the total number of question-image pair samples, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $a$ denotes an answer, $f_a(P_n)$ represents the corresponding first-type image-text coding feature, $g(a^{*}_{n})$ represents the correct answer representation, $\tau$ represents the adjustment parameter, and $g(a)$ represents the answer representation of answer $a$ in the answer space $A$.
As another exemplary implementation of the above embodiment, the answer inference branch may further include a semantic space layer, an answer feature extraction layer, and an answer feature representation layer;
the semantic space layer receives first-type image-text coding features output by the positions corresponding to answer output identifiers of the image-text encoder, and calculates similarity between each first-type image-text coding feature and each answer representation;
The answer characteristic extraction layer maps each answer characteristic of the answer characteristic representation layer to a semantic space layer;
and the answer characteristic representation layer is used for carrying out vectorization representation on each answer of the knowledge base, generating corresponding answer characterizations and sending each answer characterizations to the answer characteristic extraction layer.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
calculating distance measurement information of the interactive object features and the second-class image-text coding features respectively to obtain initial fusion image-text coding features;
and carrying out feature sum addition on each initial fusion image-text coding feature and the corresponding image-text coding feature of the second type to obtain the fusion image-text coding feature.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
calling a feature fusion relation, and fusing the image-text coding features of each second type with the interactive object features; the feature fusion relation is:
$$f_{es}=D_{KL}(f_{e}\,\|\,f_{s})+f_{e}$$

where $f_{es}$ is the fused image-text coding feature, $f_{e}$ is the second-type image-text coding feature, $f_{s}$ is the interactive object feature, and $D_{KL}(f_{e}\|f_{s})$ represents the KL divergence between the interactive object feature and the second-type image-text coding feature.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
vectorizing each event knowledge of the knowledge base in advance to obtain an event knowledge space containing a plurality of event knowledge characterizations;
for each fusion image-text coding feature, obtaining a correct event knowledge representation corresponding to the current fusion image-text coding feature based on a correct answer-correct event knowledge label of a question-image sample pair corresponding to the current fusion image-text coding feature, and determining event standard similarity between the current fusion image-text coding feature and the correct event knowledge representation corresponding to the current fusion image-text coding feature;
determining event reference similarity between the current fusion graphic coding feature and the event knowledge representation of the event knowledge space; and determining loss information between the current fusion graphic coding feature and each event knowledge representation according to the event standard similarity and each event reference similarity.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
invoking an event similarity calculation relation to calculate event standard similarity between the current fusion image-text coding feature and the corresponding correct event knowledge representation; the event similarity calculation relationship is:
$$s^{\mathrm{std}}_{n}=\exp\!\left(\frac{f_{es}(P_n)^{T}\,h(e^{*}_{n})}{\tau}\right)$$

where $s^{\mathrm{std}}_{n}$ is the event standard similarity, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $e$ denotes an event, $s$ represents the interactive object, $f_{es}(P_n)$ represents the fused image-text coding feature corresponding to $P_n$, $h(e^{*}_{n})$ represents the correct event knowledge representation, and $\tau$ represents the adjustment parameter.
As another exemplary implementation of the above embodiment, the above model training module 802 may also be used to:
invoking a knowledge reasoning loss function to calculate a relation, and calculating knowledge reasoning loss between each fusion image-text coding feature and each event knowledge representation of the knowledge base; the knowledge reasoning loss function calculates the relation as:
$$L_{e}=-\frac{1}{N}\sum_{n=1}^{N}\log\frac{\exp\!\big(f_{es}(P_n)^{T}\,h(e^{*}_{n})/\tau\big)}{\sum_{e\in E}\exp\!\big(f_{es}(P_n)^{T}\,h(e)/\tau\big)}$$

where $L_{e}$ is the knowledge reasoning loss, $N$ is the total number of question-image pair samples, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $e$ denotes event knowledge, $s$ represents the interactive object, $f_{es}(P_n)$ represents the fused image-text coding feature corresponding to $P_n$, $h(e^{*}_{n})$ represents the correct event knowledge representation, $\tau$ represents the adjustment parameter, and $h(e)$ represents the event knowledge representation of $e$ in the event knowledge space $E$.
As another exemplary implementation manner of the foregoing embodiment, the foregoing knowledge reasoning branch includes a feature fusion layer, an event space layer, an event feature extraction layer, and an event knowledge feature representation layer;
The feature fusion layer receives second-type image-text coding features output by the position corresponding to the event output identifier of the image-text encoder, fuses each second-type image-text coding feature with the interactive object feature, and sends the fused image-text coding features to the event space layer;
the event space layer calculates the similarity between each fusion image-text coding feature and each event knowledge representation of the event knowledge space;
the event feature extraction layer maps each event knowledge representation of the event knowledge feature representation layer to an event space layer;
and the event knowledge feature representation layer is used for vectorizing and representing each event knowledge in the knowledge base, generating corresponding event knowledge representation and sending each event knowledge representation to the event feature extraction layer.
Illustratively, in other implementations of the present embodiment, the model training module 802 may be further configured to:
carrying out text coding on each question-image pair sample, and carrying out text coding on the question sample corresponding to the current question-image pair sample to obtain text coding characteristics;
image coding is carried out on the image sample corresponding to the current problem-image pair sample, so that image coding characteristics are obtained;
and carrying out feature fusion on the text coding features and the image coding features, and outputting the image-text coding features generated by fusion to the interactive decoder and the reasoning decoder.
As an exemplary implementation of the above embodiment, the above model training module 802 may be further configured to:
performing feature splicing on the text coding features and the image coding features, coding the spliced features, and outputting image-text coding features corresponding to the spliced features to an interactive decoder;
splicing the text coding features and the image coding features into an input sequence, inserting the answer output identifier and the event output identifier before the input sequence, encoding the input sequence, and outputting the image-text coding features corresponding to the input sequence to the inference decoder.
Illustratively, in other implementations of the present embodiment, the above-described graphic encoder includes a text input, an image input, an answer output identifier input, an event output identifier input, an image encoding layer, a text encoding layer, a feature stitching layer, a first cross-attention layer, and a second cross-attention layer;
the feature splicing layer is used for carrying out feature splicing on the image coding features output by the image coding layer and the text coding features output by the text coding layer;
the first cross attention layer is used for encoding the spliced features output by the feature splicing layer;
And the second cross attention layer is used for carrying out coding processing on the answer output identifier input by the answer output identifier input end, the event output identifier input by the event output identifier input end, the image coding feature output by the image coding layer and the text coding feature output by the text coding layer.
Illustratively, in other implementations of the present embodiment, the interactive decoder includes an interactive object feature extraction model;
the interactive object feature extraction model is used for extracting semantic features of the interactive object based on the received image-text coding features and positioning the interactive object in a corresponding image sample; and outputting the characteristics of the interactive object to the reasoning decoder, and outputting the position information of the interactive object.
Illustratively, in other implementations of the present embodiment, the model training module 802 described above may be further configured to: and calling a total loss function relation of the visual question-answering model to carry out model training, wherein the total loss function relation is as follows:
$$L_{r}=L_{v}-\frac{1}{N}\sum_{n=1}^{N}\left[\log\frac{\exp\!\big(f_{es}(P_n)^{T}\,h(e^{*}_{n})/\tau\big)}{\sum_{e\in E}\exp\!\big(f_{es}(P_n)^{T}\,h(e)/\tau\big)}+\log\frac{\exp\!\big(f_a(P_n)^{T}\,g(a^{*}_{n})/\tau\big)}{\sum_{a\in A}\exp\!\big(f_a(P_n)^{T}\,g(a)/\tau\big)}\right]$$

where $L_{r}$ is the total loss, $L_{v}$ is the interactive object localization loss of the interactive decoder, $N$ is the total number of question-image pair samples, $T$ represents the transpose, $P_n$ is the question-image pair sample with index number $n$, $e$ denotes event knowledge, $s$ represents the interactive object, $f_{es}(P_n)$ represents the fused image-text coding feature corresponding to $P_n$, $h(e^{*}_{n})$ represents the correct event knowledge representation, $h(e)$ represents the event knowledge representation of $e$ in the event knowledge space $E$, $a$ denotes an answer, $f_a(P_n)$ represents the corresponding first-type image-text coding feature, $g(a^{*}_{n})$ represents the correct answer representation, $g(a)$ represents the answer representation of answer $a$ in the answer space $A$, and $\tau$ represents the adjustment parameter.
Based on the angles of the functional modules, referring to fig. 9, fig. 9 is a block diagram of the visual question-answering device provided in this embodiment under a specific implementation manner, where the device may include:
the question and answer data acquisition module 901 is used for acquiring questions to be answered and corresponding target images;
the answer output module 902 is configured to input the question to be answered and the corresponding target image into a visual question-answering model trained in advance by the visual question-answering model training method according to any one of the above; to obtain, according to the output of the visual question-answering model, the candidate answers to the question to be answered, the target interactive object features, and the supporting knowledge; and to select the correct answer from the candidate answers based on the similarity between each candidate answer and the supporting knowledge. The target interactive object features are the features, in the target image, of the interactive object that performs scene interaction with the target object corresponding to the question to be answered; the supporting knowledge is the event knowledge characterization retrieved from the knowledge base that is relevant to the reasoning process of the question to be answered.
Illustratively, in some implementations of this embodiment, the answer output module 902 may be further configured to:
the image-text encoder of the visual question-answering model carries out image-text encoding on the questions to be answered and the corresponding target images, and outputs image-text encoding characteristics to be processed;
the interactive decoder of the visual question-answering model extracts the interactive object features from the image-text coding features to be processed and outputs the target interactive object features;
The inference decoder of the visual question-answering model retrieves a plurality of candidate answers and a plurality of pieces of associated supporting knowledge from the knowledge base based on the image-text coding features to be processed.
As an exemplary implementation of the foregoing embodiment, the answer output module 902 may be further configured to:
calculating the similarity between each candidate answer and each supporting knowledge, and determining the score of the current candidate answer based on the numerical relation between each similarity and a preset similarity threshold;
and taking the candidate answer with the highest score as a correct answer.
As an exemplary implementation of the foregoing embodiment, the answer output module 902 may be further configured to:
invoking an answer similarity calculation relation, and calculating the similarity between each candidate answer and each support knowledge in the support knowledge set; answer similarity calculation relation:
$$M_{m,j}=\alpha\,\mathrm{sim}(a_{m},e_{j}),\quad e_{j}\in \hat{E}$$

where $M_{m,j}$ is the similarity between candidate answer $a_{m}$ and supporting knowledge $e_{j}$, $\alpha$ is a weight coefficient, $\hat{E}$ is the supporting knowledge set, and $\mathrm{sim}(\cdot)$ represents the similarity calculation.
As another exemplary implementation of the above embodiment, the answer output module 902 may be further configured to:
and (3) calculating the score of each candidate answer by calling an answer score calculation relational expression, wherein the answer score calculation relational expression is as follows:
$$SIM(P,a_{m})=\frac{f(P)^{T}\,g(a_{m})}{\tau}+\sum_{e_{j}\in \hat{E}}\beta\cdot\mathbb{1}\big[M_{m,j}\ge \beta\big]$$

where $P$ represents the question-target image pair to be answered, $SIM(P,a_{m})$ is the score of candidate answer $a_{m}$ for the question-target image pair to be answered, $f(P)$ is the first-type image-text coding feature of the question-target image pair to be answered, $g(a_{m})$ represents the answer representation corresponding to candidate answer $a_{m}$, $T$ indicates the transpose, $\tau$ represents the adjustment parameter, $\beta$ is the preset similarity threshold, $\mathbb{1}[\cdot]$ is the indicator function, and $M_{m,j}$ is the similarity between candidate answer $a_{m}$ and supporting knowledge $e_{j}$.
The functions of the above-mentioned visual question-answering model training device and each functional module of the visual question-answering device in this embodiment may be specifically implemented according to the method in the above-mentioned corresponding method embodiment, and the specific implementation process may refer to the relevant description of the corresponding method embodiment, which is not repeated herein.
As can be seen from the above, this embodiment can address the inability of the related art to meet users' high-precision question-answering requirements and the requirement for interpretable answers, improve the visual question-answering precision in scene-based interaction tasks, and make the answers more interpretable.
The above describes the visual question-answering model training device and the visual question-answering device from the perspective of functional modules; further, the invention also provides an electronic device, described below from the perspective of hardware. Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 10, the electronic device comprises a memory 100 for storing a computer program and a processor 101 configured to implement, when executing the computer program, the steps of the visual question answering and model training methods mentioned in any of the above embodiments, that is, the visual question-answering model training method and the visual question-answering method.
The processor 101 may include one or more processing cores, for example a 4-core or 8-core processor; the processor 101 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 101 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 101 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 100 may include one or more computer-readable storage media, which may be non-transitory. The memory 100 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 100 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 100 may also be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. Further, the memory 100 may include both an internal storage unit and an external storage device of the electronic device. The memory 100 may be used to store not only the application software installed in the electronic device and various types of data, such as the code of the program that performs the visual question answering and model training methods, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 100 is at least used for storing a computer program 1001 which, when loaded and executed by the processor 101, is capable of implementing the relevant steps of the visual question answering and model training methods disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 100 may further include an operating system 1002, data 1003, and the like, and the storage manner may be transient or permanent. The operating system 1002 may include Windows, Unix, Linux, and the like. The data 1003 may include, but is not limited to, data corresponding to the visual question answering results and the model training results, and the like.
In some embodiments, the electronic device may further include a display 102, an input/output interface 103, a communication interface 104 (also called a network interface), a power supply 105, and a communication bus 106. The display 102 and the input/output interface 103, such as a keyboard, belong to the user interface, which may optionally also include standard wired and wireless interfaces, and the like. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 104 may illustratively include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 106 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or only one type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 10 does not limit the electronic device, which may include more or fewer components than shown; for example, it may also include a sensor 107 for performing various functions.
The functions of each functional module of the electronic device in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
As can be seen from the above, this embodiment can address the inability of the related art to meet users' high-precision question-answering requirements and the requirement for interpretable answers, improve the visual question-answering precision in scene-based interaction tasks, and make the answers more interpretable.
It will be appreciated that, if the visual question answering and model training methods of the above embodiments are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the related art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, an optical disk, or any other medium that can store program code.
Based on this, the invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the visual question-answering and model training method thereof according to any one of the embodiments above.
In this specification, each embodiment is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. For the devices and electronic equipment disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively brief, and the relevant parts can be found in the description of the methods.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The visual question and answer and the model training method, the device, the electronic equipment and the readable storage medium provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that, based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without making any inventive effort fall within the scope of protection of the present invention. The present invention is capable of numerous modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to be within the scope of the present invention.

Claims (26)

1. A visual question-answering model training method, comprising:
acquiring a visual question-answer training sample data set; the visual question-answer training sample data set comprises a knowledge base and a plurality of groups of question-image pair samples with correct answer-correct event knowledge labels; the question-image pair sample comprises a question sample and a corresponding image sample thereof, wherein the question sample comprises a behavior of a target object, and the image sample at least comprises the interaction object at which the scene-interaction behavior of the target object is directed;
Inputting a question-image pair sample into a pre-constructed visual question-answer model; the visual question-answering model comprises an image-text encoder, an interactive decoder and an inference decoder;
the image-text encoder carries out image-text encoding processing on the question-image pair sample, and respectively outputs image-text encoding features to the interactive decoder and the reasoning decoder; the interactive decoder extracts semantic features of the interactive object from the received image-text encoding features and sends the extracted interactive object features to the reasoning decoder; and the reasoning decoder fuses the received image-text encoding features with the interactive object features, and performs iterative updating based on loss information between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base, until a preset model training ending condition is met.
2. The visual question-answering model training method according to claim 1, wherein the reasoning decoder includes an answer reasoning branch and a knowledge reasoning branch, and the input of the image-text encoder further comprises an answer output identifier and an event output identifier; the fusing of the received image-text encoding features with the interactive object features and the iterative updating based on loss information between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base comprise:
the answer reasoning branch receives the first-type image-text encoding features output at the position corresponding to the answer output identifier of the image-text encoder, and performs iterative updating based on loss information between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base;
the knowledge reasoning branch receives the second-type image-text encoding features output at the position corresponding to the event output identifier of the image-text encoder, fuses the second-type image-text encoding features with the interactive object features, and performs iterative updating based on loss information between the fused image-text encoding features and the event knowledge of the knowledge base;
the answer output identifier is used for identifying the image-text coding characteristics of the image-text coder input to the answer reasoning branch, and the event output identifier is used for identifying the image-text coding characteristics of the image-text coder input to the knowledge reasoning branch.
3. The visual question-answering model training method according to claim 2, wherein the iterative updating based on the loss information between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base comprises:
Carrying out vectorization representation on each answer of the knowledge base in advance to obtain an answer space containing a plurality of answer characterizations;
for each first type of image-text coding feature, obtaining a correct answer representation corresponding to the current first type of image-text coding feature based on a correct answer-correct event knowledge label of a question-image sample pair corresponding to the current first type of image-text coding feature, and determining standard similarity between the current first type of image-text coding feature and the correct answer representation corresponding to the current first type of image-text coding feature;
determining reference similarity between the current first-type image-text coding features and answer characterization of the answer space;
and determining loss information of each answer representation of the current first-type image-text coding feature and the answer space according to the standard similarity and each reference similarity.
4. A visual question-answering model training method according to claim 3, wherein said determining standard similarity between the current first type of teletext encoding features and their corresponding correct answer characterizations comprises:
invoking a similarity calculation relation, and calculating the standard similarity between the current first-type image-text coding features and the corresponding correct answer characterizations; the similarity calculation relational expression is as follows:
wherein the first symbol denotes the standard similarity, T denotes the transpose, P_n is the question-image pair sample with index n, a denotes the answer, and the remaining symbols denote, respectively, the first-type image-text encoding feature corresponding to P_n, the correct answer representation, and the adjustment parameter.
5. The visual question-answering model training method according to claim 2, wherein the iterative updating based on the loss information between the correct answer labels corresponding to the first-type image-text encoding features and the answers retrieved from the knowledge base comprises:
invoking an answer reasoning loss function calculation relation to calculate the answer reasoning loss between the first-type image-text encoding features and each answer retrieved from the knowledge base; in the answer reasoning loss function calculation relation,
L_a is the answer reasoning loss, N is the total number of question-image pair samples, T denotes the transpose, P_n is the question-image pair sample with index n, a denotes the answer, and the remaining symbols denote, respectively, the first-type image-text encoding feature corresponding to P_n, the correct answer representation, the adjustment parameter, an answer in the answer space A, and the answer representation in the answer space A.
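The loss relation appears as a formula image in the published text. By way of illustration only, the sketch below assumes a contrastive (InfoNCE-style) form in which the similarity between each first-type image-text encoding feature and its correct answer representation is contrasted against the similarities to all answer representations in the answer space A, with the adjustment parameter acting as a temperature; the function and argument names are hypothetical.

import torch
import torch.nn.functional as F

def answer_reasoning_loss(pair_features, answer_space, correct_idx, tau=0.07):
    """Hypothetical InfoNCE-style answer reasoning loss L_a.

    pair_features: (N, d) first-type image-text encoding features f(P_n)
    answer_space:  (|A|, d) vectorized answer representations of the knowledge base
    correct_idx:   (N,) index of the correct answer representation for each sample
    tau:           adjustment (temperature) parameter
    """
    # Similarity of each sample feature to every answer representation: (N, |A|)
    logits = pair_features @ answer_space.T / tau
    # Cross-entropy over the answer space raises the correct-answer similarity relative to the rest
    return F.cross_entropy(logits, correct_idx)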
6. The visual question-answering model training method according to claim 2, wherein the answer reasoning branches include a semantic space layer, an answer feature extraction layer, an answer feature representation layer;
The semantic space layer receives first-type image-text coding features output by the positions corresponding to answer output identifiers of the image-text encoder, and calculates similarity between each first-type image-text coding feature and each answer representation;
the answer characteristic extraction layer maps each answer characteristic of the answer characteristic representation layer to the semantic space layer;
and the answer characteristic representation layer is used for carrying out vectorization representation on each answer of the knowledge base, generating corresponding answer characterizations and sending each answer characterizations to the answer characteristic extraction layer.
7. The visual question-answering model training method according to claim 2, wherein the fusing of the second-type image-text encoding features with the interactive object features comprises:
calculating distance measurement information of the interactive object features and the second-class image-text coding features respectively to obtain initial fusion image-text coding features;
and carrying out feature sum addition on the initial fusion image-text coding feature and the corresponding image-text coding feature of the second type to obtain the fusion image-text coding feature.
8. The visual question-answering model training method according to claim 2, wherein the fusing of the second-type image-text encoding features with the interactive object features comprises:
Calling a feature fusion relation, and fusing the second-class image-text coding features with the interactive object features; the feature fusion relation is as follows:
wherein f_es is the fused image-text encoding feature, f_e is the second-type image-text encoding feature, f_s is the interactive object feature, and D_KL(f_e||f_s) denotes computing the KL divergence between the second-type image-text encoding feature and the interactive object feature.
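Claims 7 and 8 describe the fusion only at the level of a distance measure and a residual addition, and the relation itself is a formula image; the sketch below is one possible reading in which a per-dimension KL term between softmax-normalized features serves as the initial fused feature, which is then added back to the second-type feature. Treating the features as distributions and keeping the KL term per dimension are assumptions made for the example.

import torch
import torch.nn.functional as F

def fuse_features(f_e, f_s, eps=1e-8):
    """Hypothetical fusion of second-type image-text encoding features with interactive object features.

    f_e: (N, d) second-type image-text encoding features
    f_s: (N, d) interactive object features
    """
    # Normalize each feature vector so a KL divergence can be formed
    p_e = F.softmax(f_e, dim=-1)
    p_s = F.softmax(f_s, dim=-1)
    # Per-dimension contribution to D_KL(f_e || f_s), used as the initial fused feature
    initial = p_e * (torch.log(p_e + eps) - torch.log(p_s + eps))
    # Residual addition with the second-type feature yields the fused encoding feature
    return initial + f_e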
9. The visual question-answering model training method according to claim 2, wherein the iterative updating based on the loss information between the fused image-text encoding features and the event knowledge of the knowledge base comprises:
vectorizing each event knowledge of the knowledge base in advance to obtain an event knowledge space containing a plurality of event knowledge characterizations;
for each fusion image-text coding feature, obtaining a correct event knowledge representation corresponding to the current fusion image-text coding feature based on a correct answer-correct event knowledge label of a question-image sample pair corresponding to the current fusion image-text coding feature, and determining event standard similarity between the current fusion image-text coding feature and the correct event knowledge representation corresponding to the current fusion image-text coding feature;
determining event reference similarity between the current fusion graphic coding feature and the event knowledge representation of the event knowledge space;
And determining loss information between the current fusion graphic coding feature and each event knowledge representation of the event knowledge space according to the event standard similarity and each event reference similarity.
10. The visual question-answering model training method according to claim 9, wherein the determining of the event standard similarity between the current fused image-text encoding feature and its corresponding correct event knowledge representation comprises:
invoking an event similarity calculation relation to calculate event standard similarity between the current fusion image-text coding feature and the corresponding correct event knowledge representation; the event similarity calculation relational expression is as follows:
wherein the first symbol denotes the event standard similarity, T denotes the transpose, P_n is the question-image pair sample with index n, e denotes the event, s denotes the interactive object, and the remaining symbols denote, respectively, the fused image-text encoding feature corresponding to P_n, the correct event knowledge representation, and the adjustment parameter.
11. The visual question-answering model training method according to claim 2, wherein the iterative updating based on the loss information between the fused image-text encoding features and the event knowledge of the knowledge base comprises:
invoking a knowledge reasoning loss function calculation relation to calculate the knowledge reasoning loss between the fused image-text encoding features and each event knowledge representation of the knowledge base; in the knowledge reasoning loss function calculation relation,
L_e is the knowledge reasoning loss, N is the total number of question-image pair samples, T denotes the transpose, P_n is the question-image pair sample with index n, e denotes the event, s denotes the interactive object, and the remaining symbols denote, respectively, the fused image-text encoding feature corresponding to P_n, the correct event knowledge representation, the adjustment parameter, event knowledge in the event knowledge space E, and the event knowledge representation in the event knowledge space E.
12. The visual question-answering model training method according to claim 2, wherein the knowledge reasoning branches comprise a feature fusion layer, an event space layer, an event feature extraction layer and an event knowledge feature representation layer;
the feature fusion layer receives second-type image-text coding features output by the position corresponding to the event output identifier of the image-text encoder, fuses the second-type image-text coding features with the interactive object features, and sends the fused image-text coding features to the event space layer;
The event space layer calculates the similarity between the fusion image-text coding characteristics and each event knowledge representation;
the event feature extraction layer maps each event knowledge representation of the event knowledge feature representation layer to the event space layer;
the event knowledge feature representation layer carries out vectorization representation on each event knowledge in the knowledge base, generates corresponding event knowledge representation, and sends each event knowledge representation to the event feature extraction layer.
13. The visual question-answering model training method according to claim 1, wherein the image-text encoding processing of the question-image pair samples comprises:
for each question-image pair sample, carrying out text encoding on the question sample corresponding to the current question-image pair sample to obtain text encoding features;
image coding is carried out on the image sample corresponding to the current problem-image pair sample, so that image coding characteristics are obtained;
and carrying out feature fusion on the text coding features and the image coding features, and outputting the image-text coding features generated by fusion to the interactive decoder and the reasoning decoder.
14. The visual question-answering model training method according to claim 13, wherein the feature fusion of the text encoding features and the image encoding features and the outputting of the fusion-generated image-text encoding features to the interactive decoder and the reasoning decoder comprise:
Performing feature stitching on the text coding features and the image coding features, coding the stitching features, and outputting image-text coding features corresponding to the stitching features to the interactive decoder;
splicing the text coding features and the image coding features into an input sequence, inserting an answer output identifier and an event output identifier in front of the input sequence, coding the input sequence, and outputting the image-text coding features corresponding to the input sequence to the reasoning decoder.
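As an illustration of claim 14, the sketch below assembles the two encoder inputs described above before they are encoded; the tensor shapes, the use of learned embeddings for the answer and event output identifiers, and the function name are assumptions made for the example.

import torch

def build_encoder_inputs(text_feats, image_feats, answer_token, event_token):
    """Hypothetical construction of the two encoder input sequences of claim 14.

    text_feats:   (L_t, d) text encoding features
    image_feats:  (L_v, d) image encoding features
    answer_token: (d,) learned answer output identifier embedding
    event_token:  (d,) learned event output identifier embedding
    """
    # Spliced features whose encoding is sent to the interactive decoder
    spliced = torch.cat([text_feats, image_feats], dim=0)
    # Input sequence for the reasoning decoder: the two output identifiers are inserted in front
    sequence = torch.cat([answer_token.unsqueeze(0), event_token.unsqueeze(0), text_feats, image_feats], dim=0)
    return spliced, sequence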
15. The visual question-answering model training method according to claim 1, wherein the graphic encoder comprises a text input, an image input, an answer output identifier input, an event output identifier input, an image encoding layer, a text encoding layer, a feature stitching layer, a first cross-attention layer, and a second cross-attention layer;
the feature splicing layer performs feature splicing on the image coding features output by the image coding layer and the text coding features output by the text coding layer;
the first cross attention layer encodes the splicing features output by the feature splicing layer;
And the second cross attention layer is used for carrying out coding processing on the answer output identifier input by the answer output identifier input end, the event output identifier input by the event output identifier input end, the image coding feature output by the image coding layer and the text coding feature output by the text coding layer.
16. The visual question-answering model training method according to claim 1, wherein the interactive decoder includes an interactive object feature extraction model;
the interactive object feature extraction model is used for extracting semantic features of the interactive object based on the received image-text coding features and positioning the interactive object in a corresponding image sample; and outputting the characteristics of the interactive object to the reasoning decoder, and outputting the position information of the interactive object.
17. A visual question-answering model training method according to any one of claims 1 to 16, wherein the overall loss function relationship of the visual question-answering model is:
wherein L_r is the total loss function, L_v is the interactive object localization loss of the interactive decoder, N is the total number of question-image pair samples, T denotes the transpose, P_n is the question-image pair sample with index n, e denotes the event, s denotes the interactive object, the next symbols denote, respectively, the fused image-text encoding feature corresponding to P_n, the correct event knowledge representation, the adjustment parameter, event knowledge in the event knowledge space E, and the event knowledge representation in the event knowledge space E; a denotes the answer, and the remaining symbols denote, respectively, the first-type image-text encoding feature corresponding to P_n, the correct answer representation, an answer in the answer space A, and the answer representation in the answer space A.
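The total loss relation is likewise a formula image in the published text; as a purely illustrative reading, the sketch below combines the interactive object localization loss L_v with the two reasoning losses sketched earlier, assuming an unweighted sum (any weighting between the terms is an assumption).

def total_loss(localization_loss, answer_loss, knowledge_loss):
    """Hypothetical combination of the three training losses into L_r.

    localization_loss: interactive object localization loss L_v of the interactive decoder
    answer_loss:       answer reasoning loss L_a of the answer reasoning branch
    knowledge_loss:    knowledge reasoning loss L_e of the knowledge reasoning branch
    """
    # Assumed unweighted sum; the published relation may weight the terms differently
    return localization_loss + answer_loss + knowledge_loss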
18. A method of visual question answering, comprising:
acquiring a to-be-answered question and a corresponding target image;
inputting the questions to be answered and the corresponding target images into a visual question-answering model trained in advance by the visual question-answering model training method according to any one of claims 1 to 17;
obtaining candidate answers, target interaction object characteristics and supporting knowledge of the questions to be answered according to the output of the visual question-answering model, and selecting correct answers from the candidate answers based on the similarity between the candidate answers and the supporting knowledge;
the target interaction object features are the features of the interaction object, in the target image, with which the target object corresponding to the question to be answered performs scene interaction; the supporting knowledge is the event knowledge representations retrieved from the knowledge base that are relevant to the reasoning process of the question to be answered.
19. The visual question answering method according to claim 18, wherein the obtaining the candidate answers to the questions to be answered, the target interactive object features and the supporting knowledge according to the output of the visual question answering model comprises:
the image-text encoder of the visual question-answering model carries out image-text encoding on the questions to be answered and the corresponding target images, and outputs image-text encoding characteristics to be processed;
the interactive decoder of the visual question-answering model extracts the interactive object features of the image-text coding features to be processed and outputs target interactive object features;
the inference decoder of the visual question-answer model retrieves a plurality of candidate answers and a plurality of associated supporting knowledge in a knowledge base based on the graphic encoding features to be processed.
20. The visual question-answering method according to claim 19, wherein the selecting a correct answer from among the candidate answers based on the similarity between the candidate answers and the supporting knowledge comprises:
calculating the similarity between each candidate answer and each supporting knowledge, and determining the score of the current candidate answer based on the numerical relation between each similarity and a preset similarity threshold;
And taking the candidate answer with the highest score as a correct answer.
21. The visual question-answering method according to claim 20, wherein the pieces of supporting knowledge constitute a supporting knowledge set, and the calculating of the similarity between the current candidate answer and each piece of supporting knowledge comprises:
invoking an answer similarity calculation relation, and calculating the similarity between each candidate answer and each support knowledge in the support knowledge set; the answer similarity calculation relational expression is as follows:
wherein M is the similarity between the candidate answer a_m and the supporting knowledge e_j, α is a weight coefficient, the supporting knowledge e_j is drawn from the supporting knowledge set, and sim() denotes the similarity calculation.
22. The visual question-answering method according to claim 20, wherein the determining the score of the current candidate answer based on the numerical relationship between each similarity and a preset similarity threshold comprises:
and calculating the score of each candidate answer by calling an answer score calculation relational expression, wherein the answer score calculation relational expression is as follows:
wherein P denotes the question-target image pair to be answered, SIM(P, a_m) is the score of candidate answer a_m for the question-target image pair to be answered, f(P) is the first-type image-text encoding feature of the question-target image pair to be answered, T denotes the transpose, β is the preset similarity threshold, M is the similarity between candidate answer a_m and supporting knowledge e_j, and the remaining symbols denote the answer representation corresponding to candidate answer a_m and the adjustment parameter, respectively.
23. A visual question-answering model training device, comprising:
the training data acquisition module is used for acquiring a visual question-answer training sample data set; the visual question-answer training sample data set comprises a knowledge base and a plurality of groups of question-image pair samples with correct answer-correct event knowledge labels; the question-image pair sample comprises a question sample and a corresponding image sample thereof, wherein the question sample comprises a behavior of a target object, and the image sample at least comprises the interaction object at which the scene-interaction behavior of the target object is directed;
the model training module is used for inputting the question-image pair samples into a pre-constructed visual question-answering model; the visual question-answering model comprises an image-text encoder, an interactive decoder and a reasoning decoder; the image-text encoder carries out image-text encoding processing on the question-image pair sample, and respectively outputs image-text encoding features to the interactive decoder and the reasoning decoder; the interactive decoder extracts semantic features of the interactive object from the received image-text encoding features and sends the extracted interactive object features to the reasoning decoder; and the reasoning decoder fuses the received image-text encoding features with the interactive object features, and performs iterative updating based on loss information between the correct answer-correct event knowledge labels corresponding to the fused image-text encoding features and the answers and event knowledge retrieved from the knowledge base, until a preset model training ending condition is met.
24. A visual question-answering apparatus, comprising:
the question and answer data acquisition module is used for acquiring questions to be answered and corresponding target images;
the answer output module is used for inputting the question to be answered and the corresponding target image into a visual question-answering model trained in advance by the visual question-answering model training method according to any one of claims 1 to 17; obtaining candidate answers to the question to be answered, target interaction object features and supporting knowledge according to the output of the visual question-answering model, and selecting the correct answer from the candidate answers based on the similarity between the candidate answers and the supporting knowledge; the target interaction object features are the features of the interaction object, in the target image, with which the target object corresponding to the question to be answered performs scene interaction; the supporting knowledge is the event knowledge representations retrieved from the knowledge base that are relevant to the reasoning process of the question to be answered.
25. An electronic device comprising a processor and a memory, the processor being configured to implement the steps of the visual question-answering model training method according to any one of claims 1 to 17 and/or the visual question-answering method according to any one of claims 18 to 22 when executing a computer program stored in the memory.
26. A readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the visual question-answering model training method according to any one of claims 1 to 17 and/or the visual question-answering method according to any one of claims 18 to 22.
CN202410295706.9A 2024-03-15 Visual question and answer and model training method and device thereof, electronic equipment and storage medium Active CN117892140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410295706.9A CN117892140B (en) 2024-03-15 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410295706.9A CN117892140B (en) 2024-03-15 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117892140A true CN117892140A (en) 2024-04-16
CN117892140B CN117892140B (en) 2024-05-31



Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11599749B1 (en) * 2019-12-23 2023-03-07 Thales Sa Method of and system for explainable knowledge-based visual question answering
CN115186072A (en) * 2021-04-07 2022-10-14 四川大学 Knowledge graph visual question-answering method based on double-process cognitive theory
CN113240046A (en) * 2021-06-02 2021-08-10 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN115862837A (en) * 2021-09-23 2023-03-28 四川大学 Medical visual question-answering method based on type reasoning and semantic constraint
CN114969459A (en) * 2022-04-02 2022-08-30 复旦大学 Cognitive-channel-based cognitive-inference visual question-answering method
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium
WO2024045444A1 (en) * 2022-09-02 2024-03-07 苏州浪潮智能科技有限公司 Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
CN115588193A (en) * 2022-09-23 2023-01-10 中银金融科技(苏州)有限公司 Visual question-answering method and device based on graph attention neural network and visual relation
CN115761753A (en) * 2022-09-29 2023-03-07 浙江大学 Retrieval type knowledge prefix guide visual question-answering method fused with knowledge graph
CN116303945A (en) * 2023-02-15 2023-06-23 平安科技(深圳)有限公司 Sample generation method of visual question-answering model and related equipment
CN116701590A (en) * 2023-06-14 2023-09-05 江苏科海智能系统有限公司 Visual question-answering method for constructing answer semantic space based on knowledge graph
CN116431793A (en) * 2023-06-14 2023-07-14 华南理工大学 Visual question-answering method, device and storage medium based on knowledge generation
CN117390224A (en) * 2023-09-12 2024-01-12 支付宝(杭州)信息技术有限公司 Training method, device, interaction method and system of visual voice question-answering model
CN117271792A (en) * 2023-09-18 2023-12-22 苏州新歌科技有限责任公司 Method for constructing enterprise domain knowledge base based on large model
CN117648429A (en) * 2024-01-30 2024-03-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
YANG DING: "MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering", COMPUTER SCIENCE, 7 March 2022 (2022-03-07) *
YU JUN; WANG LIANG; YU ZHOU: "Research on Visual Question Answering Techniques", Journal of Computer Research and Development, no. 09, 15 September 2018 (2018-09-15) *
YUE SHIFENG; LIN ZHENG; WANG WEIPING; MENG DAN: "A Survey of Intelligent Reply Systems", Journal of Cyber Security, no. 01, 15 January 2020 (2020-01-15) *
ZHANG YUANMING: "Research on Multi-hop Intelligent Question Answering Models Based on Knowledge Graph Relation Paths", Acta Electronica Sinica, 21 December 2023 (2023-12-21) *
YAN RUYU; LIU XUELIANG: "Visual Question Answering Model Combining Bottom-Up Attention Mechanism and Memory Network", Journal of Image and Graphics, no. 05, 16 May 2020 (2020-05-16) *
GAO HONGBIN; MAO JINYING; WANG HUIYONG: "K-VQA: A Knowledge Graph-Assisted Visual Question Answering Method", Journal of Hebei University of Science and Technology, no. 04, 15 August 2020 (2020-08-15) *

Similar Documents

Publication Publication Date Title
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN116824278B (en) Image content analysis method, device, equipment and medium
CN113705313A (en) Text recognition method, device, equipment and medium
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN114972823A (en) Data processing method, device, equipment and computer medium
CN114611498A (en) Title generation method, model training method and device
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN110334340B (en) Semantic analysis method and device based on rule fusion and readable storage medium
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN112597299A (en) Text entity classification method and device, terminal equipment and storage medium
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CN115906861B (en) Sentence emotion analysis method and device based on interaction aspect information fusion
CN117892140B (en) Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN114970666B (en) Spoken language processing method and device, electronic equipment and storage medium
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN117892140A (en) Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal
CN114676705A (en) Dialogue relation processing method, computer and readable storage medium
CN112287159A (en) Retrieval method, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant