CN114595348A

CN114595348A - Answer information acquisition method, device, equipment and storage medium

Info

Publication number: CN114595348A
Application number: CN202210209684.0A
Authority: CN
Inventors: 钱羽希; 王小捷; 江会星
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2022-03-04
Filing date: 2022-03-04
Publication date: 2022-06-07

Abstract

The application discloses an answer information acquisition method, apparatus, device, and storage medium, and belongs to the field of computer technology. The method comprises: acquiring description information of a target image based on the target image, where the description information comprises a plurality of pieces of descriptor information; determining, based on question information of the target image, the association degree between each piece of descriptor information in the description information and the question information; weakening at least one piece of descriptor information in the description information based on the determined association degrees to obtain processed description information; and determining answer information for the question information based on the question information and the processed description information. The scheme weakens noise information in the target image and reduces its interference, so that the answer information is determined more accurately and the accuracy of visual question answering is improved.

Description

Answer information acquisition method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to an answer information acquisition method, apparatus, device, and storage medium.

Background

Visual Question Answering (VQA) is a typical task that combines computer vision and natural language processing, and it has attracted wide attention. A VQA system generates an answer based on a given image and a question posed about that image; in short, VQA is question answering over a given image. For example, if the given image shows a football match and the question is "how many players are in the picture", the VQA system takes the image and the question information as input and combines these two pieces of information to produce an answer (e.g., 11) as output.

In the related art, a VQA system processes an input image to obtain description information of the image and, based on that description information and the input question information, selects from a plurality of pieces of candidate answer information the answer information that matches the image and the question. However, the accuracy of such VQA systems has been found to be low in practical applications.

Disclosure of Invention

The embodiments of the present application provide an answer information acquisition method, apparatus, device, and storage medium that improve the accuracy of visual question answering. The technical solutions are as follows:

In one aspect, an answer information acquisition method is provided. The method includes: acquiring description information of a target image based on the target image, where the description information comprises a plurality of pieces of descriptor information; determining, based on question information of the target image, the association degree between each piece of descriptor information in the description information and the question information; weakening at least one piece of descriptor information in the description information based on the determined association degrees to obtain processed description information; and determining answer information for the question information based on the question information and the processed description information.
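The four steps above can be sketched end to end as follows. This is a minimal, hypothetical illustration rather than the implementation described in this application: it assumes that the first step has already produced each piece of descriptor information as a feature vector, that the question information is a vector of the same dimension, that the association degree is cosine similarity, that weakening means softmax weighting, and that the answer is the candidate (given as hypothetical answer vectors) closest to the combined feature. The names `cosine` and `answer_question` are illustrative only.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_question(descriptors: List[Vector], question: Vector,
                    candidates: Dict[str, Vector]) -> str:
    """Hypothetical pipeline; step one (feature extraction) is assumed done."""
    # Step two: association degree of each descriptor with the question.
    association = [cosine(d, question) for d in descriptors]
    # Step three: weaken weakly associated descriptors via softmax weights.
    exps = [math.exp(a) for a in association]
    total = sum(exps)
    weights = [e / total for e in exps]
    processed = [sum(w * d[i] for w, d in zip(weights, descriptors))
                 for i in range(len(question))]
    # Step four: combine with the question and pick the closest candidate.
    joint = [p + q for p, q in zip(processed, question)]
    return max(candidates, key=lambda name: cosine(joint, candidates[name]))
```

For example, `answer_question([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], {"yes": [1.0, 0.0], "no": [0.0, 1.0]})` returns `"yes"`: the second descriptor is unrelated to the question, so its weight shrinks and it barely moves the combined feature.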

In one aspect, an answer information acquisition apparatus is provided. The apparatus includes: an information acquisition module, configured to acquire description information of a target image based on the target image, where the description information comprises a plurality of pieces of descriptor information; an association degree determining module, configured to determine, based on question information of the target image, the association degree between each piece of descriptor information in the description information and the question information; a processing module, configured to weaken at least one piece of descriptor information in the description information based on the determined association degrees to obtain processed description information; and an answer determining module, configured to determine answer information for the question information based on the question information and the processed description information.

In one possible implementation, the description information includes relationship description information, the relationship description information includes a plurality of pieces of relationship descriptor information, and different pieces of relationship descriptor information describe different relationships between objects in the target image; the question information represents a question associated with an object in the target image; and the association degree determining module is configured to determine, based on the question information, the association degree between each piece of relationship descriptor information in the relationship description information and the question information, where the association degree represents how relevant the relationship described by that relationship descriptor information is to the question information.

In one possible implementation, each piece of relationship descriptor information is a relationship graph of the target image, and different pieces of relationship descriptor information are different relationship graphs of the target image. A node in a relationship graph represents an object in the target image, and the node feature of the node is the object feature of that object. An edge between two nodes represents either (a) a relationship between the two corresponding objects, in which case the edge feature is the relationship feature of that relationship and the edge direction is the direction indicated by the relationship, or (b) the association degree between the two corresponding objects, in which case the edge direction is the direction indicated by that association degree. The association degree determining module is configured to: determine a first global feature of each of the multiple relationship graphs of the target image; determine a second global feature of the question information; and determine the similarity between the first global feature of each relationship graph and the second global feature as the association degree between that relationship graph and the question information.
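The association-degree computation for relationship graphs can be sketched as follows. The text does not say how the global features are computed, so this sketch assumes, hypothetically, that a graph's first global feature comes from one round of message passing along its directed edges followed by mean pooling of the node features, that the question's second global feature is the mean of its word features, and that the similarity is cosine similarity. Edge features are omitted for brevity, and `RelationGraph` and the function names are illustrative only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


@dataclass
class RelationGraph:
    """One piece of relationship descriptor information (hypothetical layout)."""
    node_features: List[Vector]  # one object feature per node
    edges: List[Tuple[int, int]] = field(default_factory=list)  # (source, target)


def mean_pool(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_global_feature(graph: RelationGraph) -> Vector:
    """Each node adds the mean of its incoming neighbours, then nodes are pooled."""
    dim = len(graph.node_features[0])
    incoming = [[0.0] * dim for _ in graph.node_features]
    counts = [0] * len(graph.node_features)
    for source, target in graph.edges:
        for i in range(dim):
            incoming[target][i] += graph.node_features[source][i]
        counts[target] += 1
    updated = [
        [f + (a / c if c else 0.0) for f, a in zip(feature, agg)]
        for feature, agg, c in zip(graph.node_features, incoming, counts)
    ]
    return mean_pool(updated)


def graph_question_association(graphs: List[RelationGraph],
                               word_features: List[Vector]) -> List[float]:
    second_global = mean_pool(word_features)  # global feature of the question
    return [cosine(first_global_feature(g), second_global) for g in graphs]
```

Because the message passing follows edge direction, two graphs over the same objects but with opposite edges receive different association degrees with the same question.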

In one possible implementation, the processing module includes: a weight determination unit, configured to determine a weight for each relationship graph based on the determined association degrees; and a fusion unit, configured to perform weighted fusion of the first global features of the multiple relationship graphs based on these weights to obtain a third global feature of the target image, where the processed description information includes the third global feature of the target image; or, where the processed description information includes a target relationship graph of the target image, configured to perform weighted fusion of the multiple relationship graphs based on these weights to obtain the target relationship graph of the target image.
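A minimal sketch of the weight determination and fusion units follows, assuming (hypothetically) that the weights are a softmax over the association degrees and that each relationship graph is represented by an adjacency matrix over a node set shared by all graphs. The softmax is one common way to make higher association degrees yield higher weights; the text does not prescribe it.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Weights that are positive, sum to 1, and grow with the score."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_global_features(first_globals: List[Vector],
                         association: List[float]) -> Vector:
    """Third global feature: weighted sum of the graphs' first global features."""
    weights = softmax(association)
    dim = len(first_globals[0])
    return [sum(w * g[i] for w, g in zip(weights, first_globals))
            for i in range(dim)]


def fuse_graphs(adjacencies: List[Matrix], association: List[float]) -> Matrix:
    """Target relationship graph: weighted sum of the adjacency matrices."""
    weights = softmax(association)
    n = len(adjacencies[0])
    return [[sum(w * a[r][c] for w, a in zip(weights, adjacencies))
             for c in range(n)] for r in range(n)]
```

Fusing features keeps a single vector per image, while fusing adjacency matrices keeps a graph whose edge strengths reflect how relevant each relationship type is to the question.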

In one possible implementation, the description information includes object description information, the object description information includes a plurality of pieces of object descriptor information, different pieces of object descriptor information describe different objects in the target image, and the question information represents a question associated with an object in the target image; the association degree determining module is configured to determine, based on the question information, the association degree between each piece of object descriptor information in the object description information and the question information, where the association degree indicates how relevant the object described by that object descriptor information is to the question information.

In one possible implementation, the association degree determining module includes: a fusion unit, configured to fuse each piece of object descriptor information in the description information with the question information to obtain a plurality of pieces of fused object descriptor information; and a determining unit, configured to determine, based on the plurality of pieces of fused object descriptor information, the association degree between every two pieces of fused object descriptor information. The determining unit is further configured to determine, for each piece of fused object descriptor information, its association degree with the question information based on the association degrees and association values between that piece and each of the other pieces of fused object descriptor information.

In one possible implementation, the object descriptor information is an object feature. The fusion unit is configured to acquire the text feature of each word in the question information, and, for each object feature in the description information, to acquire the similarity between the object feature and each text feature and fuse the text features with the object feature based on the acquired similarities to obtain a fused object feature.
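The fusion unit can be sketched as an attention step in which each object feature attends over the text features of the question's words. The dot-product similarity, the softmax normalization, and fusion by element-wise addition are assumptions made for illustration, as is the requirement that word and object features share one dimension; `fuse_objects_with_question` is a hypothetical name.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_objects_with_question(object_features: List[Vector],
                               word_features: List[Vector]) -> List[Vector]:
    fused = []
    for obj in object_features:
        # Similarity between this object feature and each word's text feature.
        weights = softmax([dot(obj, word) for word in word_features])
        # Text features weighted by similarity: the question as seen by this object.
        attended = [sum(w * word[i] for w, word in zip(weights, word_features))
                    for i in range(len(obj))]
        # Hypothetical fusion rule: element-wise addition.
        fused.append([o + t for o, t in zip(obj, attended)])
    return fused
```

Words that are more similar to an object contribute more to its fused feature, so each fused object feature carries the part of the question most relevant to that object.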

In one possible implementation, the object descriptor information is an object feature. The determining unit is configured to: obtain a query feature, a key feature, and a value feature for each fused object feature; for each fused object feature, obtain the similarity between its query feature and the key feature of every other fused object feature in the plurality; and determine, based on the obtained similarities, the association degree between that fused object feature and each of the other fused object features.
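The query/key/value computation resembles scaled dot-product self-attention, and a sketch under that reading follows. The projection matrices `w_q`, `w_k`, and `w_v` are hypothetical (in practice they would be learned), and scaling by the square root of the key dimension and normalizing with a softmax over the other objects are assumptions; the text only specifies that similarities between one object's query feature and the other objects' key features determine the association degrees. The sketch needs at least two fused object features.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_association(fused: List[Vector], w_q: Matrix, w_k: Matrix,
                         w_v: Matrix) -> Tuple[Matrix, List[Vector]]:
    """Row i holds the association degrees of fused feature i with the others."""
    queries = [matvec(w_q, f) for f in fused]
    keys = [matvec(w_k, f) for f in fused]
    values = [matvec(w_v, f) for f in fused]
    scale = math.sqrt(len(keys[0]))
    association = []
    for i, query in enumerate(queries):
        others = [j for j in range(len(fused)) if j != i]
        weights = softmax([sum(q * k for q, k in zip(query, keys[j])) / scale
                           for j in others])
        row = [0.0] * len(fused)  # an object is not compared with itself
        for j, w in zip(others, weights):
            row[j] = w
        association.append(row)
    return association, values
```

The value features are returned alongside the association matrix because they are the natural input for combining the pairwise association degrees into a per-object score, a step the text leaves open.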

In one possible implementation, the processing module is configured to weaken at least one piece of fused object descriptor information in the description information based on the determined association degrees to obtain the processed description information.

In one possible implementation, the processing module is configured to determine a weight for each piece of descriptor information based on the determined association degrees, and to weaken the plurality of pieces of descriptor information based on these weights to obtain the processed description information; or, the processing module is configured to select at least one piece of descriptor information from the description information based on the determined association degrees, and to weaken the selected descriptor information to obtain the processed description information.

In one possible implementation, the processing module is configured to determine a weight for each of the at least one piece of descriptor information based on the determined association degrees, and to weaken the at least one piece of descriptor information based on these weights to obtain the processed description information; or, the processing module is configured to delete the at least one piece of descriptor information from the description information to obtain the processed description information.

In one aspect, a computer device is provided, which includes one or more processors and one or more memories, where at least one program code is stored in the one or more memories, and the at least one program code is loaded by the one or more processors and executed to implement the operations performed by the answer information obtaining method according to any one of the possible implementations described above.

In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the operations performed by the answer information obtaining method according to any one of the above possible implementation manners.

In one aspect, there is provided a computer program or computer program product comprising: computer program code, which, when executed by a computer, causes the computer to implement the operations performed by the answer information acquisition method according to any one of the above possible implementations.

With the answer information acquisition method, apparatus, device, and storage medium provided in the embodiments of the present application, the fact that much of the information in the target image is irrelevant to the question information is taken into account, and such irrelevant information is regarded as noise. By determining the association degree between each piece of descriptor information of the target image and the question information, the descriptor information with a low association degree can be identified. Weakening at least one piece of descriptor information based on these association degrees therefore weakens the descriptor information that is only loosely associated with the question, that is, the noise information in the target image. This reduces the interference of the noise, allows the answer information to be determined more accurately, and improves the accuracy of visual question answering.

Drawings

To describe the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application;

FIG. 2 is a flowchart of an answer information acquisition method according to an embodiment of the present application;

FIG. 3 is a flowchart of an answer information acquisition method according to an embodiment of the present application;

FIG. 4 is a flowchart of a weakening method according to an embodiment of the present application;

FIG. 5 is a flowchart of an answer information acquisition method according to an embodiment of the present application;

FIG. 6 is a flowchart of a weakening method according to an embodiment of the present application;

FIG. 7 is a flowchart of an answer information acquisition method according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an answer information acquisition apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of another answer information acquisition apparatus according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

It will be understood that the terms "first", "second", and the like used herein may describe various concepts, but unless otherwise specified these concepts are not limited by the terms. The terms are only used to distinguish one concept from another. For example, a first object may be termed a second object, and similarly a second object may be termed a first object, without departing from the scope of the present application.

As used herein, "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to every one of a corresponding plurality; and "any" refers to any one of a plurality. For example, if a plurality of pieces of information includes 3 pieces of information, "each" refers to every one of the 3 pieces, and "any" refers to any one of the 3 pieces, which may be the first, the second, or the third.

It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions.

The answer information acquisition method provided by the embodiments of the present application is executed by a computer device. In one possible implementation, the computer device is a terminal, for example, any type of terminal such as a desktop computer, a tablet computer, or a mobile phone. In another possible implementation, the computer device is a server, for example, a single server, a server cluster composed of several servers, or a cloud computing service center. In yet another possible implementation, the computer device includes a terminal and a server.

FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application. As shown in FIG. 1, the implementation environment includes a terminal 101 and a server 102, which are connected through a wireless or wired network.

Optionally, the terminal 101 is any type of terminal such as a desktop computer, a tablet computer, or a mobile phone, and the server 102 is a single server, a server cluster composed of a plurality of servers, or a cloud computing service center.

A target application served by the server 102 is installed on the terminal 101, through which the terminal 101 can implement functions such as data transmission and message interaction. Optionally, the target application is an application in the operating system of the terminal 101 or an application provided by a third party. For example, the target application is a question-and-answer application with an answer-providing function; of course, the question-and-answer application may also have other functions, such as commenting and sharing.

Optionally, the terminal 101 sends a target image, together with question information entered by the user for the target image, to the server 102; the server 102 determines answer information for the question information based on the target image and the question information and sends the answer information to the terminal 101, which receives and displays it.

The visual question answering method provided by the embodiments of the present application can be applied to any question answering scenario.

For example, in an e-commerce question-and-answer scenario, a user can enter, in an e-commerce application, a picture of a target object together with question information about the picture (for example, the price of the target object).

With the answer information acquisition method provided by the embodiments of the present application, the e-commerce application can obtain the answer information for the question information more accurately.

It should be noted that the e-commerce question-and-answer scenario is described only as an example; the embodiments of the present application do not limit the question answering scenario.

FIG. 2 is a flowchart of an answer information acquisition method according to an embodiment of the present application. In this embodiment, the method is described using a computer device as the execution subject. The embodiment includes the following steps:

201. The computer device acquires description information of the target image based on the target image, where the description information comprises a plurality of pieces of descriptor information.

The target image may be any image. For example, it may be an image stored locally on the computer device, an image captured by the computer device, or an image that the computer device acquired from another device.

The target image may be an image of at least one object, i.e. the target image may comprise at least one object. The object may be any object such as a person, an article, a tree, a flower, a beach, a mountain peak, and the like, and the content of the target image is not limited in the embodiments of the present application.

The description information of the target image is information describing the content of the target image, for example, the objects in the target image. The description information includes a plurality of pieces of descriptor information that describe the target image from different angles. For example, the pieces of descriptor information may describe different objects in the target image: if the target image is a seaside photograph, descriptor information 1 may describe the beach, descriptor information 2 the sea, and descriptor information 3 a coconut tree. As another example, the pieces of descriptor information may describe different attributes of the same object: if the target image is a photograph of a beach, descriptor information 1 may describe the color of the beach and descriptor information 2 its size.

Optionally, the target image includes a plurality of objects, and the description information describes these objects. The description information may describe each of the objects individually, or describe at least two of them together; in the embodiments of the present application, describing at least two objects together means describing a relationship between at least two of the objects.

In one possible implementation, the description information includes relationship description information that describes the relationships between objects in the target image, and the relationship description information includes a plurality of pieces of relationship descriptor information. In this case, acquiring the description information of the target image based on the target image includes: the computer device acquires a plurality of pieces of relationship descriptor information of the target image based on the target image.

Optionally, different pieces of relationship descriptor information describe the relationships between different objects in the target image. For example, the target image includes object 1, object 2, and object 3, and the plurality of pieces of relationship descriptor information consists of 3 pieces: relationship descriptor information 1, relationship descriptor information 2, and relationship descriptor information 3.

Relationship descriptor information 1 describes the relationship between object 1 and object 2, relationship descriptor information 2 describes the relationship between object 2 and object 3, and relationship descriptor information 3 describes the relationship between object 1 and object 3.

Optionally, different pieces of relationship descriptor information describe different relationships between objects in the target image, that is, different types of relationships between the objects. In this case, acquiring the plurality of pieces of relationship descriptor information of the target image based on the target image includes at least one of the following:

(1) The relationship description information includes semantic relationship descriptor information, which represents semantic relationships between objects in the target image; the computer device acquires the semantic relationship descriptor information of the target image based on the target image.

(2) The relationship description information includes spatial relationship descriptor information, which represents spatial relationships between objects in the target image; the computer device acquires the spatial relationship descriptor information of the target image based on the target image.

(3) The relationship description information includes implicit relationship descriptor information, which represents implicit relationships between objects in the target image; the computer device acquires the implicit relationship descriptor information of the target image based on the target image.

In another possible implementation, the description information includes object description information, and the object description information includes a plurality of pieces of object descriptor information. Optionally, different pieces of object descriptor information describe different objects in the target image; optionally, they describe different attributes of the same object in the target image. In this case, acquiring the description information of the target image based on the target image includes: the computer device acquires a plurality of pieces of object descriptor information of the target image based on the target image.

In another possible implementation, the description information includes both relationship description information and object description information. In this case, acquiring the description information of the target image based on the target image includes: the computer device acquires the relationship description information and the object description information of the target image based on the target image.

202. The computer device determines, based on question information of the target image, the association degree between each piece of descriptor information in the description information and the question information.

The question information is a question posed about the target image; it may be entered by a user or acquired from a network. For example, if the target image is a picture of a soccer match, the question information may be "how many players are there in total in the picture" or "what movement is the person in the picture doing".

In general, besides information related to the question information, the target image also contains information unrelated to it. For example, if the target image is a picture of a soccer match and the question information is "how many players are in the picture", information about the positions of the spectators is unrelated to the question information, whereas information about the players wearing uniforms is related to it.

203. The computer device weakens at least one piece of descriptor information in the description information based on the determined association degrees to obtain processed description information.

In the embodiments of the present application, the at least one piece of descriptor information may be part or all of the description information. Weakening a piece of descriptor information means reducing its influence within the description information.

In one possible implementation, the at least one piece of descriptor information is all of the description information. Weakening it based on the determined association degrees to obtain the processed description information includes: determining a weight for each piece of descriptor information based on the determined association degrees, and weakening the plurality of pieces of descriptor information based on these weights to obtain the processed description information.

When the computer device determines the weights based on the association degrees, descriptor information with a higher association degree receives a higher weight, and descriptor information with a lower association degree receives a lower weight. As a result, the more strongly a piece of descriptor information is associated with the question information, the larger its share in the processed description information, and the more accurately the computer device can determine the answer information from it.
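Weakening all pieces of descriptor information by weight can be sketched as below. The softmax mapping from association degree to weight is a hypothetical choice that satisfies the stated property (a higher association degree gives a higher weight); each descriptor vector is then scaled by its weight, and the pieces are kept separate rather than summed. The function names are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def association_weights(association: List[float]) -> List[float]:
    """Monotonic mapping: a higher association degree gives a higher weight."""
    peak = max(association)
    exps = [math.exp(a - peak) for a in association]
    total = sum(exps)
    return [e / total for e in exps]


def weaken_all(descriptors: List[Vector], association: List[float]) -> List[Vector]:
    weights = association_weights(association)
    # Scale each descriptor so that weakly associated ones contribute less.
    return [[w * x for x in d] for w, d in zip(weights, descriptors)]
```

A temperature or a different monotonic function could be substituted; the only property the text requires is that weight increases with association degree.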

In another possible implementation, the at least one piece of descriptor information is part of the description information. Weakening it based on the determined association degrees to obtain the processed description information includes: selecting at least one piece of descriptor information from the description information based on the determined association degrees, and weakening the selected descriptor information to obtain the processed description information.

Optionally, weakening the at least one piece of descriptor information to obtain the processed description information includes: determining a weight for each of the at least one piece of descriptor information based on the determined association degrees, and weakening the at least one piece of descriptor information based on these weights to obtain the processed description information.

Optionally, the computer device weakens the at least one piece of descriptor information by deleting it from the description information to obtain the processed description information.
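The two optional ways of weakening a selected subset can be sketched as follows. This is an illustration only: it assumes descriptors are feature vectors, that association degrees are already normalized to [0, 1], and it uses the association degree itself as the weight of a selected descriptor. `weaken_subset` is a hypothetical name.

```python
from typing import Iterable, List, Sequence


def weaken_subset(features: Sequence[Sequence[float]],
                  degrees: Sequence[float],
                  selected: Iterable[int],
                  delete: bool = False) -> List[List[float]]:
    """Weaken only the descriptors whose indices are in `selected`.

    delete=False: scale each selected descriptor by its association degree.
    delete=True:  remove each selected descriptor from the description information.
    Unselected descriptors are kept unchanged.
    """
    chosen = set(selected)
    processed = []
    for idx, (feat, degree) in enumerate(zip(features, degrees)):
        if idx not in chosen:
            processed.append(list(feat))
        elif not delete:
            processed.append([degree * x for x in feat])
        # when delete is True, the selected descriptor is simply skipped
    return processed


feats = [[1.0], [2.0], [3.0]]
degs = [0.9, 0.1, 0.5]
print(weaken_subset(feats, degs, {1}))               # down-weighted
print(weaken_subset(feats, degs, {1}, delete=True))  # deleted
```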

In another possible implementation, the computer device weakens, based on the determined association degrees, the descriptor information in the description information whose association degree with the question information is smaller than an association degree threshold, to obtain the processed description information.

The association degree threshold may be any value, for example an empirical value, a value set by a technician, or a default value of the computer device.
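Selecting descriptors by threshold reduces to a strict comparison, matching "smaller than" in the text. The sketch below (hypothetical function names, threshold value chosen arbitrarily) returns the indices to weaken and, as one possible form of weakening, drops them.

```python
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")


def indices_below_threshold(degrees: Sequence[float], threshold: float) -> Set[int]:
    """Indices of descriptors whose association degree is strictly below the threshold."""
    return {i for i, d in enumerate(degrees) if d < threshold}


def drop_indices(items: Sequence[T], drop: Set[int]) -> List[T]:
    """Delete the descriptors at the given indices, preserving the order of the rest."""
    return [item for i, item in enumerate(items) if i not in drop]


degrees = [0.6, 0.1, 0.3]
weak = indices_below_threshold(degrees, threshold=0.3)  # 0.3 itself is not weakened
print(drop_indices(["person", "ball", "grass"], weak))
```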

In another possible implementation, the computer device determines a first number of association degrees from the determined association degrees, where each of the first number of association degrees is smaller than every remaining association degree; that is, they are the first number of smallest association degrees. The computer device then weakens the descriptor information corresponding to these association degrees to obtain the processed description information.

For example, after obtaining the association degrees, the computer device sorts them by magnitude. If they are sorted from large to small, the first number of association degrees at the end of the list are selected; if they are sorted from small to large, the first number of association degrees at the start of the list are selected. In either case, the descriptor information corresponding to the selected association degrees is weakened to obtain the processed description information.
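Both sorting orders can be sketched in Python; `k` stands for the first number, and the function names are hypothetical. When association degrees tie, the condition "smaller than all remaining ones" cannot be met exactly, and the two variants may then pick different descriptors; the sketch does not resolve ties.

```python
from typing import Sequence, Set


def lowest_k_ascending(degrees: Sequence[float], k: int) -> Set[int]:
    """Sort from small to large and take the first k indices."""
    order = sorted(range(len(degrees)), key=lambda i: degrees[i])
    return set(order[:k])


def lowest_k_descending(degrees: Sequence[float], k: int) -> Set[int]:
    """Sort from large to small and take the last k indices."""
    k = min(k, len(degrees))  # guard: a negative slice start would select the wrong items
    order = sorted(range(len(degrees)), key=lambda i: degrees[i], reverse=True)
    return set(order[len(order) - k:])


degrees = [0.5, 0.1, 0.9, 0.2]
print(lowest_k_ascending(degrees, 2), lowest_k_descending(degrees, 2))
```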

204. The computer device determines answer information for the question information based on the question information and the processed description information.

In this embodiment of the present application, the answer information that matches the question information may be determined from a plurality of candidate answers based on the question information and the processed description information. The candidate answers may be answers in an answer library, or answer information recalled based on the question information and the target image; the embodiments of the present application do not limit the candidate answers.
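The text leaves open how a candidate is matched. As one common pattern, assumed here rather than taken from the application, the question feature and the processed image feature can be combined by element-wise product, each candidate answer embedding scored by a dot product, and the highest-scoring candidate returned. All names and vectors are hypothetical.

```python
from typing import Dict, Sequence


def pick_answer(question_vec: Sequence[float],
                image_vec: Sequence[float],
                candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate answer whose embedding best matches the joint feature."""
    # Joint question-image feature (element-wise product is an assumption).
    joint = [q * v for q, v in zip(question_vec, image_vec)]

    def score(embedding: Sequence[float]) -> float:
        return sum(j * e for j, e in zip(joint, embedding))

    return max(candidates, key=lambda name: score(candidates[name]))


answers = {"11": [1.0, 0.0], "2": [0.0, 1.0]}  # toy answer library
print(pick_answer([1.0, 1.0], [1.0, 0.0], answers))
```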

In the answer information acquisition method provided by this embodiment, much of the information in the target image is irrelevant to the question information and can be regarded as noise. By determining the association degree between each piece of descriptor information of the target image and the question information, descriptor information weakly associated with the question can be identified. Weakening at least one piece of descriptor information based on the determined association degrees suppresses this weakly associated information, that is, the noise in the target image. This reduces interference from noise, so that the answer information for the question information is determined more accurately and the accuracy of visual question answering is improved.

It should be noted that, in the embodiments of the present application, the description information may include relationship description information comprising a plurality of pieces of relationship descriptor information; or object description information comprising a plurality of pieces of object descriptor information; or both relationship description information and object description information. The embodiment shown in fig. 3 illustrates the case where the description information includes relationship description information comprising a plurality of pieces of relationship descriptor information. The embodiment shown in fig. 5 illustrates the case where the description information includes object description information comprising a plurality of pieces of object descriptor information. The embodiment shown in fig. 6 illustrates the case where the description information includes both object description information and relationship description information.

Fig. 3 is a flowchart of an answer information obtaining method according to an embodiment of the present disclosure. This embodiment is described by taking a computer device as the execution subject, and includes the following steps:

301. The computer device obtains description information of the target image based on the target image, where the description information includes relationship description information comprising a plurality of pieces of relationship descriptor information.

Optionally, different pieces of relationship descriptor information describe different types of relationships between objects in the target image. For example, the description information includes semantic relationship descriptor information, spatial relationship descriptor information, and implicit relationship descriptor information. The semantic relationship descriptor information describes the semantic relationships between objects in the target image; the spatial relationship descriptor information describes the spatial relationships between objects in the target image; and the implicit relationship descriptor information describes the implicit relationships between objects in the target image.

Optionally, the different relationship descriptor information is used to describe a relationship between different objects in the target image. For example, the description information includes relationship descriptor information 1, relationship descriptor information 2, and relationship descriptor information 3. Wherein, the relationship descriptor information 1 is used for describing the relationship between the object 1 and the object 2; the relationship descriptor information 2 is used to describe the relationship between the object 2 and the object 3; the relationship descriptor information 3 is used to describe the relationship between the object 1 and the object 3.

302. The computer device determines, based on the question information of the target image, the association degree between each piece of relationship descriptor information in the description information and the question information.

The question information is used to indicate a question associated with the content of the target image. For example, the target image is a picture of a soccer game, and the question information is "how many players are in the picture". Optionally, the target image comprises at least one object, and the question information is used to represent a question associated with the object in the target image.

Since the question information indicates a question associated with the content of the target image, the higher the similarity of the question information to a certain content in the target image, the higher the association degree of the question information with the content. For example, the target image is a picture of a soccer game, the question information is "how many players are in total in the picture", and the question information includes "players", so that the content of the target image having a high similarity to the question information is content related to "players", and the content of the target image related to "players" is also content related to the question information.

In one possible implementation, the computer device determines the association degree between each piece of relationship descriptor information in the description information and the question information as follows: based on the question information of the target image, the computer device determines the similarity between each piece of relationship descriptor information in the description information and the question information, and then determines the association degree between each piece of relationship descriptor information and the question information based on that similarity.

Optionally, the computer device uses the similarity between each piece of relationship descriptor information and the question information directly as the association degree between them. Alternatively, the computer device normalizes the similarities between the pieces of relationship descriptor information and the question information to obtain the association degree between each piece of relationship descriptor information and the question information.
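Both options can be sketched with cosine similarity as the similarity measure and softmax as the normalization. The application fixes neither, so both are assumptions. Each piece of relationship descriptor information is represented here by a single feature vector (for a relationship graph, this could be its global feature), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def association_degrees(descriptor_vecs: Sequence[Sequence[float]],
                        question_vec: Sequence[float],
                        normalize: bool = True) -> List[float]:
    """Similarity of each descriptor to the question, optionally softmax-normalized."""
    sims = [cosine(d, question_vec) for d in descriptor_vecs]
    if not normalize:
        return sims  # option 1: the similarity itself is the association degree
    m = max(sims)  # option 2: normalize across descriptors
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


print(association_degrees([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]))
```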

It should be noted that the description information, the relationship descriptor information, and the like in the embodiments of the present application may take any form, which is not limited here. For example, the relationship descriptor information may be a relationship label, such as "object A squats on the right side of object B" or "object A and object B are looking at each other". As another example, the relationship descriptor information may be a relationship graph of the target image, such as a semantic relationship graph, a spatial relationship graph, or an implicit relationship graph of the target image.

In one possible implementation, each piece of relationship descriptor information is one relationship graph of the target image, and different pieces of relationship descriptor information are different relationship graphs of the target image. Each node in a relationship graph represents one object in the target image, and the node feature of that node is the object feature of the object.

In addition, an edge between two nodes in the relationship graph represents the relationship between the two corresponding objects; the edge feature is the relationship feature of that relationship, and the direction of the edge is the direction indicated by the relationship. Alternatively, an edge between two nodes represents the association degree between the two corresponding objects, and the direction of the edge is the direction indicated by that association degree. The embodiments of the present application do not limit the form of the relationship graph.

The computer device determines, based on the question information, the association degree between each piece of relationship descriptor information in the relationship description information and the question information as follows: the computer device determines a first global feature of each of the plurality of relationship graphs of the target image, determines a second global feature of the question information, and determines the similarity between the first global feature of each relationship graph and the second global feature as the association degree between that relationship graph and the question information.

It should be noted that any global feature extraction method may be used to obtain the first global feature and the second global feature in the embodiments of the present application; details are not described herein.
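As a sketch of one possible extraction, under assumptions not stated in the application: node features are refined by one round of mean aggregation over each node and its outgoing neighbours in a given graph, so that graphs over the same objects but with different edges get different global features. The refined features are mean-pooled into the first global feature, the question's word features are mean-pooled into the second global feature, and cosine similarity gives the association degree. Graph names, values, and function names are illustrative.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def propagate(node_feats: Sequence[Sequence[float]],
              adj: Sequence[Sequence[float]]) -> List[Vector]:
    """One round of mean aggregation over each node itself and its out-neighbours."""
    refined = []
    for i in range(len(node_feats)):
        neighbours = [i] + [j for j, a in enumerate(adj[i]) if a and j != i]
        refined.append(mean_pool([node_feats[j] for j in neighbours]))
    return refined


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def graph_associations(node_feats: Sequence[Sequence[float]],
                       graphs: Dict[str, Sequence[Sequence[float]]],
                       word_feats: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Association degree of each relationship graph with the question."""
    question_global = mean_pool(word_feats)  # second global feature
    return {
        # first global feature of each graph compared with the second global feature
        name: cosine(mean_pool(propagate(node_feats, adj)), question_global)
        for name, adj in graphs.items()
    }


objects = [[1.0, 0.0], [0.0, 1.0]]
graphs = {"semantic": [[0, 1], [0, 0]], "spatial": [[0, 0], [0, 0]]}
print(graph_associations(objects, graphs, word_feats=[[0.0, 1.0]]))
```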

303. The computer device weakens at least one piece of relationship descriptor information in the description information based on the determined association degrees to obtain the processed description information.

It should be noted that the above step 303 is the same as the above step 203, and any weakening method provided in the above step 203 is also applicable to the step 303, which is not described in detail herein.

It should be noted that this embodiment takes the case where the relationship descriptor information is a relationship graph as an example to illustrate "weakening at least one piece of descriptor information in the description information based on the weight of each piece of descriptor information". In one possible implementation, the processed description information includes a third global feature of the target image, and the computer device weakens at least one piece of descriptor information in the description information based on the determined association degrees as follows: it determines the weight of each relationship graph based on the determined association degrees, and performs weighted fusion on the first global features of the plurality of relationship graphs based on these weights to obtain the third global feature of the target image.
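A minimal sketch of this weighted fusion, assuming the first global features are plain vectors keyed by graph name and that the weights are a softmax over the association degrees (the application only requires that weights follow the association degrees). `fuse_global_features` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def fuse_global_features(global_feats: Dict[str, Sequence[float]],
                         degrees: Dict[str, float]) -> List[float]:
    """Weighted fusion of per-graph first global features into a third global feature."""
    names = list(global_feats)
    m = max(degrees[n] for n in names)
    exps = {n: math.exp(degrees[n] - m) for n in names}  # softmax weights (assumption)
    total = sum(exps.values())
    dim = len(global_feats[names[0]])
    fused = [0.0] * dim
    for n in names:
        weight = exps[n] / total
        for k in range(dim):
            fused[k] += weight * global_feats[n][k]
    return fused


third = fuse_global_features(
    {"semantic": [1.0, 0.0], "spatial": [0.0, 1.0], "implicit": [1.0, 1.0]},
    {"semantic": 0.9, "spatial": 0.2, "implicit": 0.4},
)
print(third)
```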

In another possible implementation, the processed description information includes a target relationship graph of the target image, and the computer device weakens at least one piece of descriptor information in the description information based on the determined association degrees as follows: it determines the weight of each relationship graph based on the determined association degrees, and performs weighted fusion on the plurality of relationship graphs based on these weights to obtain the target relationship graph of the target image.

For example, as shown in fig. 4, the association degrees between the question information and the semantic relationship graph, the spatial relationship graph, and the implicit relationship graph of the target image are determined respectively, and the three graphs are fused according to their corresponding association degrees to obtain the target relationship graph of the target image.
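A minimal sketch of fusing relationship graphs, assuming each graph is an n x n edge-weight (adjacency) matrix over the same objects and that the weights come from the association degrees. Edge-wise weighted averaging is one plausible reading of "weighted fusion", not necessarily the one used in the application; the names are hypothetical.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def fuse_graphs(adjs: Dict[str, Matrix], weights: Dict[str, float]) -> List[List[float]]:
    """Edge-wise weighted average of relationship graphs that share the same node set."""
    n = len(next(iter(adjs.values())))
    total = sum(weights[g] for g in adjs)
    return [
        [sum(weights[g] * adjs[g][i][j] for g in adjs) / total for j in range(n)]
        for i in range(n)
    ]


graphs = {
    "semantic": [[0, 1], [0, 0]],
    "spatial": [[0, 0], [1, 0]],
    "implicit": [[0, 1], [1, 0]],
}
target = fuse_graphs(graphs, {"semantic": 0.6, "spatial": 0.3, "implicit": 0.1})
print(target)
```

A graph with a high association degree thus dominates the edge weights of the target relationship graph, while edges present only in weakly associated graphs are attenuated.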

304. The computer device determines answer information of the question information based on the question information and the processed description information.

Step 304 is similar to step 204 and is not described in detail herein.

In the answer information acquisition method provided by this embodiment, much of the information in the target image is irrelevant to the question information and can be regarded as noise. By determining the association degree between each piece of relationship descriptor information of the target image and the question information, relationship descriptor information weakly associated with the question can be identified. Weakening at least one piece of relationship descriptor information in the description information based on the determined association degrees suppresses this weakly associated information, that is, the noise in the target image. This reduces interference from noise, so that the answer information for the question information is determined more accurately and the accuracy of visual question answering is improved.

Fig. 5 is a flowchart of an answer information obtaining method according to an embodiment of the present disclosure. This embodiment is described by taking a computer device as the execution subject, and includes the following steps:

501. The computer device obtains description information of a target image based on the target image, where the description information includes object description information comprising a plurality of pieces of object descriptor information.

In this embodiment, the target image includes at least one object, and the object may be anything such as a person, an article, a flower, a tree, a beach, or a mountain peak. Optionally, different pieces of object descriptor information describe different objects in the target image. For example, object descriptor information 1 describes person A in the target image, and object descriptor information 2 describes person B in the target image. Optionally, different pieces of object descriptor information describe different attributes of an object in the target image. For example, object descriptor information 1 describes the identity of an object, object descriptor information 2 describes its location, object descriptor information 3 describes its shape, object descriptor information 4 describes its behavior, and so on.

502. The computer device determines the association degree of each piece of object descriptor information in the description information with question information based on the question information of the target image.

In an embodiment of the application, the target image comprises at least one object, and the question information is used for representing a question associated with the object in the target image. For example, the target image is a picture of a soccer game, and the question information is "how many players are in the picture".

Since the question information is a question associated with an object in the target image, the higher the similarity between the question information and a piece of object descriptor information of the target image, the higher the association degree between them, and the more accurately the answer can be obtained from that object descriptor information.

In one possible implementation, the computer device determines the association degree between each piece of object descriptor information in the description information and the question information as follows: based on the question information, the computer device determines the similarity between each piece of object descriptor information in the description information and the question information, and then determines the association degree between each piece of object descriptor information and the question information based on that similarity.

Optionally, the computer device uses the similarity between each piece of object descriptor information and the question information directly as the association degree between them. Alternatively, the computer device normalizes the similarities between the pieces of object descriptor information and the question information to obtain the association degree between each piece of object descriptor information and the question information.

In some embodiments, the question information is associated with a plurality of objects in the target image. For this case, the present application further provides another method for determining the association degree. In one possible implementation, the computer device determines the association degree between each piece of object descriptor information in the description information and the question information as follows: it fuses each piece of object descriptor information in the description information with the question information to obtain a plurality of pieces of fused object descriptor information; determines the association degree between every two pieces of fused object descriptor information; and, for each piece of fused object descriptor information, determines its association degree with the question information based on the sum of its association degrees with the other pieces of fused object descriptor information.

Optionally, the object descriptor information is an object feature, and the computer device fuses each piece of object descriptor information in the description information with the question information as follows: it acquires the text feature of each word in the question information; then, for each object feature in the description information, it acquires the similarity between the object feature and each text feature, and fuses the text features with the object feature based on these similarities to obtain a fused object feature.

In this embodiment of the present application, the computer device may use a cross-modal self-attention mechanism to obtain the fused object features. For example, the computer device acquires a query feature, a key feature, and a value feature for each object feature and each text feature. Then, for each object feature, it fuses the value feature of the object feature with the value features of the text features, based on the similarity between the query feature of the object feature and the key feature of each text feature, to obtain the fused object feature.
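A minimal sketch of this fusion as scaled dot-product attention, assuming linear projections `wq`, `wk`, `wv` (plain matrices, hypothetical) shared by both modalities, and assuming the object's own value feature is combined with the attention-weighted text value features by addition. The application does not specify the projection shapes, the scaling, or the combination rule.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Sequence[float]) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_fuse(obj_feats: Sequence[Sequence[float]],
                     text_feats: Sequence[Sequence[float]],
                     wq: Matrix, wk: Matrix, wv: Matrix) -> List[Vector]:
    """Fuse each object feature with the question's word features via attention."""
    scale = math.sqrt(len(wk))  # scaled dot product; key dimension = rows of wk
    keys = [matvec(wk, t) for t in text_feats]
    text_values = [matvec(wv, t) for t in text_feats]
    fused = []
    for obj in obj_feats:
        query = matvec(wq, obj)
        # similarity between the object's query and each word's key, normalized
        attn = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        context = [sum(a * v[d] for a, v in zip(attn, text_values)) for d in range(len(wv))]
        own_value = matvec(wv, obj)
        fused.append([o + c for o, c in zip(own_value, context)])  # additive fusion (assumption)
    return fused


identity = [[1.0, 0.0], [0.0, 1.0]]
print(cross_modal_fuse([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], identity, identity, identity))
```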

Optionally, the computer device determines the association degree between every two pieces of fused object descriptor information as follows: for each fused object feature, it acquires the query, key, and value features of the fused object feature, acquires the similarity between its query feature and the key feature of each other fused object feature, and determines the association degree between this fused object feature and each other fused object feature based on the acquired similarity. For example, the computer device directly uses the acquired similarity as the association degree; alternatively, it normalizes the acquired similarities to obtain the association degrees.
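The pairwise step and the summation into a per-object association degree can be sketched together. Assumptions: projections `wq` and `wk` are plain matrices (value features are not needed for this step and are omitted); pairwise similarities are softmax-normalized over the other objects in each row. Because each normalized row then sums to 1, the sketch sums the association degrees an object receives from the other objects (column sums); this reading of "sum of association degrees" is ours. At least two objects are required.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Sequence[float]) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_question_association(fused: Sequence[Sequence[float]],
                                wq: Matrix, wk: Matrix) -> List[float]:
    """Per-object association with the question from attention among fused objects."""
    n = len(fused)
    scale = math.sqrt(len(wk))
    queries = [matvec(wq, f) for f in fused]
    keys = [matvec(wk, f) for f in fused]
    received = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]  # exclude the object itself
        scores = [sum(q * k for q, k in zip(queries[i], keys[j])) / scale for j in others]
        for j, degree in zip(others, softmax(scores)):  # normalized pairwise degrees
            received[j] += degree  # accumulate the sum for object j
    return received


identity = [[1.0, 0.0], [0.0, 1.0]]
print(object_question_association([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], identity, identity))
```

In this toy input the two similar objects reinforce each other, while the dissimilar third object ends up with the lowest association degree and would be the candidate for weakening.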

503. The computer device weakens at least one piece of object descriptor information in the description information based on the determined association degrees to obtain the processed description information.

Step 503 is the same as step 303 and step 203, and is not described in detail here.

504. The computer device determines answer information for the question information based on the question information and the processed description information.

Step 504 is similar to step 204, and is not described in detail herein.

In the answer information acquisition method provided by this embodiment, much of the information in the target image is irrelevant to the question information and can be regarded as noise. By determining the association degree between each piece of object descriptor information of the target image and the question information, object descriptor information weakly associated with the question can be identified. Weakening at least one piece of object descriptor information in the description information based on the determined association degrees suppresses this weakly associated information, that is, the noise in the target image. This reduces interference from noise, so that the answer information for the question information is determined more accurately and the accuracy of visual question answering is improved.

In addition, the object descriptor information of objects weakly associated with the question information can be discarded, so that objects in the image that are irrelevant to the question need not be attended to and attention can be focused more precisely on the relevant content of the image. As a result, the answer information for the question information is determined more accurately and the accuracy of visual question answering is improved.

Fig. 6 is a flowchart of an answer information obtaining method according to an embodiment of the present disclosure. This embodiment is described by taking a computer device as the execution subject, and includes the following steps:

601. The computer device obtains description information of a target image based on the target image, where the description information includes relationship description information and object description information; the relationship description information comprises a plurality of pieces of relationship descriptor information, and the object description information comprises a plurality of pieces of object descriptor information.

602. The computer device determines, based on the question information of the target image, the association degree between each piece of relationship descriptor information in the relationship description information and the question information.

603. The computer device weakens at least one piece of relationship descriptor information in the relationship description information based on the determined association degrees to obtain processed relationship description information.

604. The computer device determines the association degree of each piece of object descriptor information in the object description information with the question information based on the question information of the target image.

605. The computer device weakens at least one piece of object descriptor information in the object description information based on the determined association degrees to obtain processed object description information.

606. The computer device determines answer information of the question information based on the question information, the processed relationship description information, and the processed object description information.

It should be noted that steps 601 to 603 are the same as steps 301 to 303, and are not described in detail herein. The above steps 604 to 606 are similar to the above steps 502 to 504, and are not described in detail herein.

It should be noted that, in this embodiment, steps 602 to 603 may be performed before steps 604 to 605; steps 602 to 603 and steps 604 to 605 may be performed simultaneously; or steps 604 to 605 may be performed before steps 602 to 603. This is not limited in the embodiments of the present application.

In the answer information acquisition method provided by this embodiment, much of the information in the target image is irrelevant to the question information and can be regarded as noise. By determining the association degree between each piece of object descriptor information of the target image and the question information, object descriptor information weakly associated with the question can be identified, and weakening at least one piece of object descriptor information based on these association degrees suppresses it. Likewise, by determining the association degree between each piece of relationship descriptor information of the target image and the question information, relationship descriptor information weakly associated with the question can be identified, and weakening at least one piece of relationship descriptor information based on these association degrees suppresses it. This weakens the noise in the target image and reduces its interference, so that the answer information for the question information is determined more accurately and the accuracy of visual question answering is improved.

It should be noted that, in one possible implementation, the relationship descriptor information is a relationship graph of the target image, in which each node represents one object in the target image and the node feature of that node is the object feature of the object. Since the object descriptor information may be an object feature, the relationship description information in this case also contains the object description information. In this embodiment, the method of the embodiment shown in fig. 3 may be used to weaken at least one of the plurality of relationship graphs of the target image, the method of the embodiment shown in fig. 5 may then be used to weaken the object features in the relationship graph, and the answer information is obtained based on the weakened result.

For example, as shown in fig. 7, three kinds of relationship graphs are weighted according to the degree of association between each kind of relationship graph and the question information, and are fused into a first target relationship graph. The object features with a low degree of association with the question information are determined based on the degree of association between the question information and each object feature in the first target relationship graph. The nodes and edges corresponding to those object features are then deleted from the first target relationship graph to obtain a second target relationship graph. Finally, the answer information is obtained based on the second target relationship graph and the question information.
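The fig. 7 flow can be illustrated with a minimal Python sketch. It makes several assumptions that the application does not specify. Each relationship graph is an adjacency matrix over the same objects. The per-graph association scores are already given and are turned into fusion weights with a softmax. The relevance of a node to the question is the dot product of its object feature with a question vector. `threshold` is a hypothetical cut-off below which a node and its edges are deleted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_graphs(adjs: List[Matrix], graph_scores: List[float]) -> Matrix:
    """Weighted fusion of same-sized adjacency matrices.
    Assumption: weights = softmax of each graph's association with the question."""
    weights = softmax(graph_scores)
    n = len(adjs[0])
    return [[sum(w * a[i][j] for w, a in zip(weights, adjs)) for j in range(n)]
            for i in range(n)]


def prune_nodes(adj: Matrix, node_feats: List[List[float]], question: List[float],
                threshold: float) -> Tuple[Matrix, List[List[float]], List[int]]:
    """Delete nodes whose (assumed) dot-product relevance to the question is
    below the hypothetical threshold, together with every edge touching them."""
    keep = [i for i, f in enumerate(node_feats)
            if sum(a * b for a, b in zip(f, question)) >= threshold]
    pruned_adj = [[adj[i][j] for j in keep] for i in keep]
    return pruned_adj, [node_feats[i] for i in keep], keep


# Usage: three 3-node graphs (e.g. semantic, spatial, implicit) with equal scores.
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]
fused = fuse_graphs([ones, zeros, zeros], [0.0, 0.0, 0.0])
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pruned_adj, pruned_feats, kept = prune_nodes(fused, feats, [1.0, 0.0], threshold=0.5)
```

The result is the adjacency matrix of the second target relationship graph restricted to the kept nodes. Soft weighting, for example scaling rows by relevance instead of deleting them, would be an equally plausible way to "weaken" nodes. The hard deletion shown here simply mirrors the fig. 7 description.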

It should further be noted that the embodiments of the present application also provide a method for "obtaining description information of a target image based on the target image".

In one possible implementation, the computer device obtains the description information of the target image based on the target image in at least one of the following ways:

(1) The description information includes relationship description information, which includes a plurality of pieces of relationship descriptor information. The plurality of pieces of relationship descriptor information of the target image are acquired based on the target image.

(2) The description information includes object description information, which includes a plurality of pieces of object descriptor information, where different pieces of object descriptor information describe different objects in the target image. The plurality of pieces of object descriptor information of the target image are acquired based on the target image.

Optionally, different pieces of relationship descriptor information describe different relationships between objects in the target image. Acquiring the plurality of pieces of relationship descriptor information of the target image based on the target image includes at least one of the following:

(1) The relationship description information includes semantic relationship descriptor information, which represents the semantic relationships between objects in the target image. The semantic relationship descriptor information of the target image is acquired based on the target image.

(2) The relationship description information includes spatial relationship descriptor information, which represents the spatial relationships between objects in the target image. The spatial relationship descriptor information of the target image is acquired based on the target image.

(3) The relationship description information includes implicit relationship descriptor information, which represents the implicit relationships between objects in the target image. The implicit relationship descriptor information of the target image is acquired based on the target image.

First, the acquisition of semantic relationship descriptor information is described as an example. In one possible implementation, the semantic relationship descriptor information includes semantic relationship labels. The computer device acquires the semantic relationship descriptor information of the target image based on the target image by processing the target image through a first relation determination model to obtain at least one semantic relationship label. Each semantic relationship label represents the semantic relationship between two objects in the target image. The first relation determination model is used to select, for two objects, a semantic relationship label from a plurality of candidate semantic relationship labels.

When the first relation determination model processes the target image, it obtains, for any two objects in the target image, a probability for each of the plurality of semantic relationship labels. The label with the highest probability is determined as the semantic relationship label of the two objects.
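The highest-probability selection above is an argmax over the candidate labels for each object pair. The minimal Python sketch below assumes that the model's output is already available as one probability list per ordered object pair. The label vocabulary and probabilities are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def assign_semantic_labels(pair_probs: Dict[Pair, List[float]],
                           labels: List[str]) -> Dict[Pair, str]:
    """For each ordered object pair, keep the label with the highest probability."""
    result: Dict[Pair, str] = {}
    for pair, probs in pair_probs.items():
        best = max(range(len(probs)), key=lambda k: probs[k])  # argmax index
        result[pair] = labels[best]
    return result


labels = ["on", "holding", "wearing"]          # hypothetical label vocabulary
pair_probs = {(0, 1): [0.1, 0.7, 0.2],          # hypothetical model outputs
              (1, 0): [0.6, 0.3, 0.1]}
print(assign_semantic_labels(pair_probs, labels))
# {(0, 1): 'holding', (1, 0): 'on'}
```

Pairs are ordered, so (0, 1) and (1, 0) can receive different labels. This matches the directed edges used in the relationship graphs below.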

In another possible implementation, the semantic relationship descriptor information is a first semantic relationship graph of the target image:

- Each node in the first semantic relationship graph represents an object in the target image, and its node feature is the object feature of that object.
- An edge between two nodes represents the semantic relationship between the two corresponding objects.
- The edge feature of that edge is the relationship feature of that semantic relationship.
- The direction of that edge is the direction indicated by that semantic relationship.

The computer device acquires the semantic relationship descriptor information of the target image based on the target image as follows:

1. The target image is processed through the first relation determination model to obtain at least one semantic relationship label. Each semantic relationship label represents the relationship between two objects in the target image, and the first relation determination model is used to select, for two objects, a semantic relationship label from a plurality of candidate semantic relationship labels.
2. Feature extraction is performed on each semantic relationship label and on the semantic information of each object to obtain the relationship feature of each semantic relationship label and the object feature of each object.
3. The first semantic relationship graph of the target image is determined based on the relationship feature of each semantic relationship label and the object feature of each object.

The semantic information of each object may be determined by an object recognition model. Optionally, the semantic information of an object is its category information, for example "man", "woman", "old person", or "child".

Optionally, the computer device determines the first semantic relationship graph of the target image based on the relationship feature of each semantic relationship label and the object feature of each object as follows. The relationship feature of each semantic relationship label is used as the edge feature of the corresponding edge, and the object feature of each object is used as the node feature of the corresponding node. This yields the first semantic relationship graph of the target image.
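Assembling the first semantic relationship graph amounts to placing object features on nodes and label features on directed edges. The minimal Python sketch below rests on two assumptions. First, a hypothetical lookup table `embed` stands in for the feature extraction step. Second, relationships are given as (subject, label, object) triples whose edge points from subject to object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec = List[float]


@dataclass
class SemanticGraph:
    node_feats: List[Vec]                                   # one node per object
    edge_feats: Dict[Tuple[int, int], Vec] = field(default_factory=dict)  # directed


def build_semantic_graph(categories: List[str],
                         triples: List[Tuple[int, str, int]],
                         embed: Dict[str, Vec]) -> SemanticGraph:
    """categories[i] is the semantic information of object i.
    Each triple (subject, label, object) yields an edge subject -> object.
    Assumption: `embed` is a hypothetical stand-in for feature extraction."""
    nodes = [embed[c] for c in categories]                  # node feature = object feature
    edges = {(s, o): embed[label] for s, label, o in triples}  # edge feature = relation feature
    return SemanticGraph(nodes, edges)


embed = {"man": [1.0, 0.0], "horse": [0.0, 1.0], "riding": [0.5, 0.5]}
graph = build_semantic_graph(["man", "horse"], [(0, "riding", 1)], embed)
```

Keying edges by ordered index pairs keeps the edge direction explicit. The edge (0, 1) exists here while (1, 0) does not.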

In another possible implementation, the semantic relationship descriptor information is the first semantic relationship graph of the target image. In this graph, each node represents an object in the target image and has the object feature of that object as its node feature. Each edge between two nodes represents the semantic relationship between the two corresponding objects, has the relationship feature of that semantic relationship as its edge feature, and points in the direction indicated by that semantic relationship.

The computer device acquires the semantic relationship descriptor information of the target image based on the target image as follows:

1. At least one semantic relationship label of the target image and the semantic information of each object in the target image are acquired.
2. Feature extraction is performed on the at least one semantic relationship label and the semantic information of each object through the first relation determination model, yielding the relationship feature of each semantic relationship label and the object feature of each object.
3. For each of the obtained relationship features and object features, the first relation determination model fuses the other obtained features into that feature according to the similarity between that feature and each of the other obtained features, yielding an updated feature.
4. The first semantic relationship graph is determined based on the updated features.

Optionally, the computer device fuses, through the first relation determination model, the other obtained features into each relationship feature or object feature according to their similarity as follows. Each obtained feature is first processed through the first relation determination model to obtain the query feature, key feature, and value feature of that feature. Then, according to the similarity between the query feature of a feature and the key feature of each of the other obtained features, the value feature of that feature is fused with the value features of the other obtained features, yielding the updated feature.
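The query/key/value fusion described above resembles scaled dot-product attention in which each feature attends only to the other features. The minimal Python sketch below makes the following assumptions:

- The projection matrices `wq`, `wk`, and `wv` are hypothetical, learned parameters.
- All relationship and object features share one dimension.
- Similarity is a scaled dot product normalized with a softmax.
- "Fusing" means adding the similarity-weighted value features of the others to the feature's own value feature.

```python
import math
from typing import List

Vec = List[float]
Mat = List[List[float]]


def matvec(m: Mat, v: Vec) -> Vec:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_update(feats: List[Vec], wq: Mat, wk: Mat, wv: Mat) -> List[Vec]:
    """Update every feature from all *other* features (at least two required).
    Assumption: fusion = own value + softmax-weighted values of the others."""
    q = [matvec(wq, f) for f in feats]
    k = [matvec(wk, f) for f in feats]
    v = [matvec(wv, f) for f in feats]
    d = len(q[0])
    updated = []
    for i in range(len(feats)):
        others = [j for j in range(len(feats)) if j != i]
        scores = [dot(q[i], k[j]) / math.sqrt(d) for j in others]  # query-key similarity
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        fused = list(v[i])                      # start from the feature's own value
        for e, j in zip(exps, others):
            for t in range(len(fused)):
                fused[t] += (e / z) * v[j][t]   # add similarity-weighted values
        updated.append(fused)
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
print(attention_update([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity))
# [[1.0, 1.0], [1.0, 1.0]]
```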

In some embodiments, a multi-head attention mechanism may be used when updating the features. In one possible implementation, the first relation determination model includes a feature extraction layer and a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sublayers.

In this implementation, feature extraction is performed on each semantic relationship label and the semantic information of each object through the feature extraction layer. This yields the relationship feature of each semantic relationship label and the object feature of each object, and the number of dimensions of each obtained feature equals the number of attention sublayers.

The fusion of the other obtained features into each relationship feature or object feature, according to their similarity, proceeds in two steps:

1. The feature extraction layer inputs the sub-features that belong to the same dimension in each obtained feature into the corresponding attention sublayer.
2. Each attention sublayer, for each input sub-feature, fuses the other input sub-features into that sub-feature according to the similarity between that sub-feature and each of the other input sub-features, yielding the updated sub-feature.
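The routing of same-dimension sub-features to one attention sublayer each can be sketched in Python. This minimal sketch makes four assumptions:

- Each feature is split into equal contiguous slices, one slice per sublayer. When the number of dimensions equals the number of sublayers, each slice is a single dimension, as in the text.
- Query, key, and value projections are omitted, so each sub-feature serves as its own query, key, and value.
- Each sublayer performs the same "own value + weighted others" fusion.
- The updated slices are concatenated back together.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def head_attention(subs: List[Vec]) -> List[Vec]:
    """One attention sublayer. Assumption: query = key = value = sub-feature."""
    d = len(subs[0])
    out = []
    for i in range(len(subs)):
        others = [j for j in range(len(subs)) if j != i]
        scores = [dot(subs[i], subs[j]) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        fused = list(subs[i])
        for e, j in zip(exps, others):
            for t in range(d):
                fused[t] += (e / z) * subs[j][t]
        out.append(fused)
    return out


def multi_head_update(feats: List[Vec], num_heads: int) -> List[Vec]:
    """Send the same slice of every feature to the same sublayer, then concatenate."""
    dim = len(feats[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by the number of sublayers")
    size = dim // num_heads          # size == 1 when dims == number of sublayers
    per_head = [[f[h * size:(h + 1) * size] for f in feats] for h in range(num_heads)]
    updated = [head_attention(subs) for subs in per_head]
    return [[x for h in range(num_heads) for x in updated[h][i]] for i in range(len(feats))]


print(multi_head_update([[1.0, 0.0], [0.0, 1.0]], num_heads=2))
# [[1.0, 1.0], [1.0, 1.0]]
```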

It should be noted that the embodiments of the present application use, only as an example, the case in which both the object features and the relationship features have as many dimensions as there are attention sublayers. In other embodiments, the number of dimensions of the relationship feature equals the number of attention sublayers, while the number of dimensions of the object feature is greater. In this case, the sub-features of the object feature and the relationship feature corresponding to the first dimension are input into the first attention sublayer, and the sub-features corresponding to the second dimension are input into the second attention sublayer, and so on. The last attention sublayer receives the sub-feature corresponding to the last dimension of the relationship feature, together with all remaining sub-features of the object feature that have not yet been input.
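The uneven split described above can be sketched as a partition function. Dimension h goes to sublayer h, and the last sublayer absorbs every leftover dimension. The function name is hypothetical. The sketch shows only the routing, because the text does not say how the last sublayer reconciles sub-features of unequal length.

```python
from typing import List

Vec = List[float]


def partition_for_sublayers(feat: Vec, num_sublayers: int) -> List[Vec]:
    """Dimension h goes to sublayer h; the last sublayer also receives every
    remaining dimension when the feature has more dimensions than sublayers."""
    if num_sublayers < 1 or len(feat) < num_sublayers:
        raise ValueError("need at least one sublayer and one dimension per sublayer")
    groups = [[x] for x in feat[:num_sublayers - 1]]
    groups.append(list(feat[num_sublayers - 1:]))   # last sublayer takes the rest
    return groups


relation_feat = [7.0, 8.0, 9.0]             # as many dimensions as sublayers
object_feat = [1.0, 2.0, 3.0, 4.0, 5.0]     # more dimensions than sublayers
for h, (r, o) in enumerate(zip(partition_for_sublayers(relation_feat, 3),
                               partition_for_sublayers(object_feat, 3))):
    print(f"sublayer {h}: relation {r}, object {o}")
# sublayer 0: relation [7.0], object [1.0]
# sublayer 1: relation [8.0], object [2.0]
# sublayer 2: relation [9.0], object [3.0, 4.0, 5.0]
```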

In another possible implementation, the semantic relationship descriptor information is a second semantic relationship graph of the target image. In this graph, each node represents an object in the target image and has the object feature of that object as its node feature. Each edge between two nodes represents the degree of association between the two corresponding objects and points in the direction indicated by that degree of association.

The computer device acquires the semantic relationship descriptor information of the target image based on the target image as follows:

1. At least one semantic relationship label of the target image and the semantic information of each object in the target image are acquired.
2. Feature extraction is performed on the at least one semantic relationship label and the semantic information of each object through the first relation determination model, yielding the relationship feature of each semantic relationship label and the object feature of each object.
3. The first relation determination model updates each obtained object feature based on that object feature, the other obtained object features, and each relationship feature, yielding an updated object feature.
4. For every two updated object features, the first relation determination model determines the degree of association represented by the corresponding edge, based on the two updated object features.
5. The second semantic relationship graph is determined based on the updated object features and the degrees of association.

Optionally, the computer device updates each obtained object feature through the first relation determination model, based on that object feature, the other obtained object features, and each relationship feature, as follows:

1. Each object feature is processed through the first relation determination model to obtain the query feature, key feature, and value feature of that object feature.
2. For each object feature, the product of its query feature and the key feature of each other object feature is computed.
3. The corresponding relationship feature is added to each product. The relationship feature corresponding to a product is the relationship feature between the two object features used to compute that product.
4. The resulting sums are normalized to obtain a weight for each other object feature.
5. Each other object feature is fused into the object feature based on its weight, yielding the updated object feature.
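The update above resembles attention with an additive relation bias. The minimal Python sketch below makes four assumptions:

- Each relationship feature has already been reduced to a scalar bias `rel[i][j]`. Adding it to a scalar product requires a scalar, and the text does not say how the reduction is done.
- Pairs without a relationship label get a bias of 0.0.
- Softmax is used as the normalization.
- Fusion adds the weighted value features of the other objects to the object's own value feature.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def relation_aware_update(q: List[Vec], k: List[Vec], v: List[Vec],
                          rel: List[List[float]]) -> List[Vec]:
    """Assumption: rel[i][j] is a scalar relationship feature between objects
    i and j (0.0 where no relationship label exists)."""
    n = len(q)
    updated = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        sums = [dot(q[i], k[j]) + rel[i][j] for j in others]   # product + relation
        weights = softmax(sums)                                 # normalization
        fused = list(v[i])
        for w, j in zip(weights, others):
            for t in range(len(fused)):
                fused[t] += w * v[j][t]                         # weighted fusion
        updated.append(fused)
    return updated


zeros = [[0.0], [0.0], [0.0]]
values = [[0.0], [2.0], [4.0]]
no_rel = [[0.0] * 3 for _ in range(3)]
strong = [[0.0, 0.0, 50.0], [0.0] * 3, [0.0] * 3]   # object 0 strongly related to 2
print(relation_aware_update(zeros, zeros, values, no_rel)[0])   # [3.0]
print(relation_aware_update(zeros, zeros, values, strong)[0])   # close to [4.0]
```

The second call shows the intended effect. A large relationship term concentrates almost all of object 0's weight on object 2, even though the query-key products are identical.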

Optionally, the computer device determines, through the first relation determination model, the degree of association represented by the edge between every two updated object features as follows. Each updated object feature is first processed through the first relation determination model to obtain its updated query feature, key feature, and value feature. Then, for two updated object features, the degree of association represented by the corresponding edge is determined from the similarity between the query feature of one object feature and the key feature of the other. The corresponding edge points from the object represented by the first object feature to the object represented by the other object feature.
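Because the edge i -> j uses the query of i and the key of j, the resulting degrees of association are directed and generally asymmetric. The minimal Python sketch below assumes that similarity is a scaled dot product normalized with a softmax over the other objects. The text only says "similarity".

```python
import math
from typing import List

Vec = List[float]


def edge_association(q: List[Vec], k: List[Vec]) -> List[List[float]]:
    """assoc[i][j]: association of the edge pointing from object i to object j.
    Assumption: softmax over j != i of scaled dot products; diagonal stays 0."""
    n, d = len(q), len(q[0])
    assoc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for j, e in zip(others, exps):
            assoc[i][j] = e / z
    return assoc


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
assoc = edge_association(q, k)
```

Here `assoc[0][1]` is 0.5 while `assoc[1][0]` is larger. The two directions of the same object pair therefore carry different degrees of association.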

Optionally, the first relation determination model includes a feature extraction layer and a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sublayers.

In this implementation, the computer device performs feature extraction on each semantic relationship label and the semantic information of each object through the feature extraction layer. This yields the relationship feature of each semantic relationship label and the object feature of each object, and the number of dimensions of each obtained feature equals the number of attention sublayers.

The computer device then updates each obtained object feature through the first relation determination model in two steps:

1. The feature extraction layer inputs the relationship sub-features of each dimension of each relationship feature into the corresponding attention sublayer. It likewise inputs the object sub-features that belong to the same dimension of each object feature into the corresponding attention sublayer.
2. Each attention sublayer, for each input object sub-feature, updates that object sub-feature based on itself, the other input object sub-features, and the input relationship sub-features, yielding the updated object sub-feature.

Optionally, the computer device determines, through the first relation determination model, the degree of association represented by the edge between every two updated object features in two steps. First, each attention sublayer determines, for every two updated object sub-features, their similarity as the sub-degree of association represented by the corresponding edge. Second, the sub-degrees of association that the attention sublayers determine for each edge are fused to obtain the degree of association of that edge.
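The per-sublayer computation and the fusion step can be sketched together in Python. This minimal sketch makes three assumptions. Each sublayer's sub-degrees of association come from a softmax over scaled dot products of its sub-features. The sub-degrees are fused by simple averaging. The input is already split into per-sublayer sub-feature lists. The text specifies none of these choices.

```python
import math
from typing import List

Vec = List[float]
Mat = List[List[float]]


def sub_association(subs: List[Vec]) -> Mat:
    """Sub-degrees of association from one sublayer (assumed softmax similarity)."""
    n, d = len(subs), len(subs[0])
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        scores = [sum(a * b for a, b in zip(subs[i], subs[j])) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        for j, e in zip(others, exps):
            out[i][j] = e / z
    return out


def fused_association(per_head_subs: List[List[Vec]]) -> Mat:
    """Assumed fusion rule: average the sub-association matrices of all sublayers."""
    mats = [sub_association(subs) for subs in per_head_subs]
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)] for i in range(n)]


# Two sublayers, three objects, one-dimensional sub-features per sublayer.
assoc = fused_association([[[1.0], [0.0], [2.0]], [[0.0], [1.0], [1.0]]])
```

Averaging keeps each row summing to 1. A learned weighted combination of the sublayers would be an equally plausible reading of "fusing".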

Optionally, the computer device determines the second semantic relationship graph based on the updated object features and the degrees of association as follows. Each updated object feature is used as the node feature of the corresponding node, and each degree of association is used as the degree of association represented by the corresponding edge. This yields the second semantic relationship graph of the target image.

Optionally, the computer device determines the second semantic relationship graph based on the updated object features and the degrees of association as follows. The first relation determination model continues to update the updated object features and degrees of association until the number of updates reaches a target number. The second semantic relationship graph is then determined based on the object features and degrees of association obtained in the last update.

The target number may be any positive integer greater than 1. It may be an empirical value, a value set by a technician, or a default value of the device.
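The repeated update can be sketched as a loop that runs a single update step a target number of times and keeps the result of the last round. This minimal Python sketch assumes a simplified step in which relationship features are omitted. In that step, the degrees of association come from a softmax over scaled dot products, and each feature becomes the mean of itself and its association-weighted neighbours. Both `update_step` and `refine` are hypothetical names.

```python
import math
from typing import List, Tuple

Vec = List[float]
Mat = List[List[float]]


def update_step(feats: List[Vec]) -> Tuple[List[Vec], Mat]:
    """One refinement round (assumed): association from softmax similarity, then
    each feature becomes the mean of itself and its association-weighted neighbours."""
    n, d = len(feats), len(feats[0])
    assoc = [[0.0] * n for _ in range(n)]
    new_feats = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        scores = [sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        agg = [0.0] * d
        for j, e in zip(others, exps):
            assoc[i][j] = e / z
            for t in range(d):
                agg[t] += assoc[i][j] * feats[j][t]
        new_feats.append([0.5 * (a + b) for a, b in zip(feats[i], agg)])
    return new_feats, assoc


def refine(feats: List[Vec], target_times: int) -> Tuple[List[Vec], Mat]:
    """Repeat the update until the number of updates reaches target_times (> 1)."""
    if target_times <= 1:
        raise ValueError("target number must be an integer greater than 1")
    assoc: Mat = []
    for _ in range(target_times):
        feats, assoc = update_step(feats)
    return feats, assoc   # last update -> node features and edge associations


nodes, edges = refine([[1.0, 0.0], [0.0, 1.0]], target_times=2)
```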

It should be noted that the spatial relationship descriptor information and the implicit relationship descriptor information are obtained in the same manner as the semantic relationship descriptor information, and the details are not repeated here.

For example, in one possible implementation, the spatial relationship descriptor information includes spatial relationship labels. The spatial relationship descriptor information of the target image is acquired based on the target image as follows. For any two objects in the target image, the position condition satisfied by their position information is determined based on that position information. The spatial relationship label corresponding to that position condition is then used as the spatial relationship label of the two objects.
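The application does not list its position conditions, so the minimal Python sketch below uses a hypothetical rule set. It assumes that position information is an axis-aligned bounding box (x1, y1, x2, y2) with y growing downward. The conditions are checked in order:

1. Containment ("inside").
2. Intersection-over-union of at least a hypothetical `overlap_threshold` ("overlapping").
3. Otherwise, the dominant centre offset ("left of", "right of", "above", "below").

```python
from typing import Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), y grows downward


def area(r: Box) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a: Box, b: Box) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def spatial_label(a: Box, b: Box, overlap_threshold: float = 0.5) -> str:
    """Label of the first (hypothetical) position condition that box a
    satisfies relative to box b."""
    if a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"
    if iou(a, b) >= overlap_threshold:
        return "overlapping"
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2   # centre offsets
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


print(spatial_label((1, 1, 2, 2), (0, 0, 10, 10)))   # inside
print(spatial_label((0, 0, 2, 2), (5, 0, 7, 2)))     # left of
print(spatial_label((0, 0, 2, 2), (0, 5, 2, 7)))     # above
```

Because the rules are checked in order, each pair satisfies exactly one condition and therefore receives exactly one label. This matches the one-label-per-pair description above.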

In one possible implementation, the spatial relationship descriptor information is a first spatial relationship graph of the target image:

- Each node in the first spatial relationship graph represents an object in the target image, and its node feature is the object feature of that object.
- An edge between two nodes represents the spatial relationship between the two corresponding objects.
- The edge feature of that edge is the relationship feature of that spatial relationship.
- The direction of that edge is the direction indicated by that spatial relationship.

The spatial relationship descriptor information of the target image is acquired based on the target image as follows:

1. At least one spatial relationship label of the target image and the position information of each object in the target image are acquired.
2. Feature extraction is performed on the obtained spatial relationship labels and the position information of each object to obtain the relationship feature of each spatial relationship label and the object feature of each object.
3. The first spatial relationship graph of the target image is determined based on the relationship feature of each spatial relationship label and the object feature of each object.

In one possible implementation, the first spatial relationship graph of the target image is determined based on the relationship feature of each spatial relationship label and the object feature of each object as follows. The relationship feature of each spatial relationship label is used as the edge feature of the corresponding edge, and the object feature of each object is used as the node feature of the corresponding node. This yields the first spatial relationship graph of the target image.
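The two steps above, extracting features from positions and labels and then assigning them to nodes and edges, can be sketched in Python. This minimal sketch makes two assumptions. The position feature of an object is a hypothetical hand-crafted encoding: the box corners normalized by the image size, plus the box's relative area. A learned extractor could replace it. Label features come from a hypothetical lookup table `label_embed`.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels


def encode_position(box: Box, width: float, height: float) -> List[float]:
    """Hypothetical position feature: normalized corners plus relative area."""
    x1, y1, x2, y2 = box
    rel_area = (x2 - x1) * (y2 - y1) / (width * height)
    return [x1 / width, y1 / height, x2 / width, y2 / height, rel_area]


def build_spatial_graph(boxes: List[Box], width: float, height: float,
                        labels: Dict[Tuple[int, int], str],
                        label_embed: Dict[str, List[float]]):
    """Nodes carry position features; each directed edge (i, j) carries the
    feature of its spatial relationship label (hypothetical lookup table)."""
    nodes = [encode_position(b, width, height) for b in boxes]
    edges = {pair: label_embed[lab] for pair, lab in labels.items()}
    return nodes, edges


nodes, edges = build_spatial_graph([(0, 0, 50, 100), (50, 0, 100, 100)], 100, 200,
                                   {(0, 1): "left of"}, {"left of": [1.0, 0.0]})
```

Normalizing by the image size makes the node features independent of image resolution.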

In another possible implementation, the spatial relationship descriptor information is the first spatial relationship graph of the target image. In this graph, each node represents an object in the target image and has the object feature of that object as its node feature. Each edge between two nodes represents the spatial relationship between the two corresponding objects, has the relationship feature of that spatial relationship as its edge feature, and points in the direction indicated by that spatial relationship.

The spatial relationship descriptor information of the target image is acquired based on the target image as follows:

1. At least one spatial relationship label of the target image and the position information of each object in the target image are acquired.
2. Feature extraction is performed on the at least one spatial relationship label and the position information of each object through a third relation determination model, yielding the relationship feature of each spatial relationship label and the object feature of each object.
3. For each of the obtained relationship features and object features, the third relation determination model fuses the other obtained features into that feature according to the similarity between that feature and each of the other obtained features, yielding an updated feature.
4. The first spatial relationship graph is determined based on the updated features.

In one possible implementation, the third relation determination model fuses the other obtained features into each relationship feature or object feature, according to their similarity, as follows. Each obtained feature is first processed through the third relation determination model to obtain its query feature, key feature, and value feature. Then, for each obtained feature, the third relation determination model fuses the value feature of that feature with the value features of the other obtained features, according to the similarity between the query feature of that feature and the key features of the other obtained features. This yields the updated feature.

In one possible implementation, the third relation determination model includes a feature extraction layer and a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sublayers.

In this implementation, feature extraction is performed on each spatial relationship label and the position information of each object through the feature extraction layer. This yields the relationship feature of each spatial relationship label and the object feature of each object, and the number of dimensions of each obtained feature equals the number of attention sublayers.

The fusion of the other obtained features into each relationship feature or object feature, according to their similarity, proceeds in two steps:

1. The feature extraction layer inputs the sub-features that belong to the same dimension in each obtained feature into the corresponding attention sublayer.
2. Each attention sublayer, for each input sub-feature, fuses the other input sub-features into that sub-feature according to the similarity between that sub-feature and each of the other input sub-features, yielding the updated sub-feature.

In one possible implementation, the spatial relationship descriptor information is a second spatial relationship graph of the target image. In this graph, each node represents an object in the target image and has the object feature of that object as its node feature. Each edge between two nodes represents the degree of association between the two corresponding objects and points in the direction indicated by that degree of association.

The spatial relationship descriptor information of the target image is acquired based on the target image as follows:

1. At least one spatial relationship label of the target image and the position information of each object in the target image are acquired.
2. Feature extraction is performed on the at least one spatial relationship label and the position information of each object through the third relation determination model, yielding the relationship feature of each spatial relationship label and the object feature of each object.
3. The third relation determination model updates each obtained object feature based on that object feature, the other obtained object features, and each relationship feature, yielding an updated object feature.
4. For every two updated object features, the third relation determination model determines the degree of association represented by the corresponding edge, based on the two updated object features.
5. The second spatial relationship graph is determined based on the updated object features and the degrees of association.

In one possible implementation, the third relation determination model updates each obtained object feature, based on that object feature, the other obtained object features, and each relationship feature, as follows:

1. Each object feature is processed through the third relation determination model to obtain the query feature, key feature, and value feature of that object feature.
2. For each object feature, the product of its query feature and the key feature of each other object feature is computed.
3. The corresponding relationship feature is added to each product. The relationship feature corresponding to a product is the relationship feature between the two object features used to compute that product.
4. The resulting sums are normalized to obtain a weight for each other object feature.
5. Each other object feature is fused into the object feature based on its weight, yielding the updated object feature.

In one possible implementation, the third relation determination model determines the degree of association represented by the edge between every two updated object features as follows. Each updated object feature is first processed through the third relation determination model to obtain its updated query feature, key feature, and value feature. Then, for two updated object features, the degree of association represented by the corresponding edge is determined from the similarity between the query feature of one object feature and the key feature of the other. The corresponding edge points from the object represented by the first object feature to the object represented by the other object feature.

In one possible implementation, the third relation determination model includes a feature extraction layer and a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sublayers.

In this implementation, feature extraction is performed on each spatial relationship label and the position information of each object through the feature extraction layer. This yields the relationship feature of each spatial relationship label and the object feature of each object, and the number of dimensions of each obtained feature equals the number of attention sublayers.

Each obtained object feature is then updated through the third relation determination model in two steps:

1. The feature extraction layer inputs the relationship sub-features of each dimension of each relationship feature into the corresponding attention sublayer. It likewise inputs the object sub-features that belong to the same dimension of each object feature into the corresponding attention sublayer.
2. Each attention sublayer, for each input object sub-feature, updates that object sub-feature based on itself, the other input object sub-features, and the input relationship sub-features, yielding the updated object sub-feature.

In one possible implementation, the third relation determination model determines the degree of association represented by the edge between every two updated object features in two steps. First, each attention sublayer determines, for every two updated object sub-features, the sub-degree of association represented by the corresponding edge, based on the similarity of the two sub-features. Second, the sub-degrees of association that the attention sublayers determine for each edge are fused to obtain the degree of association of that edge.

In one possible implementation, determining the second spatial relationship graph based on the updated object features and the association degrees includes: taking each updated object feature as the node feature of the corresponding node and each association degree as the association degree of the corresponding edge, to obtain the second spatial relationship graph of the target image.

In one possible implementation, determining the second spatial relationship graph based on the updated object features and the association degrees includes: repeatedly updating, through the third relation determination model, the updated object features and the association degrees until the number of updates reaches a target number; and determining the second spatial relationship graph based on the object features and the association degrees obtained in the last update.
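
The repeated update can be sketched as a loop that calls one update step a target number of times and keeps only the result of the last step. The step shown here (`toy_step`) is a hypothetical stand-in for the relation determination model: it uses softmax-normalised dot products as association degrees and blends each feature half-and-half with the weighted mean of the other features. `target_times` must be at least 1 for any association degrees to exist, and the step needs at least two objects.

```python
import math


def toy_step(feats):
    """Hypothetical single update: softmax-normalised dot-product association degrees,
    then each feature is blended half-and-half with the weighted mean of the others."""
    n = len(feats)
    assoc, new_feats = {}, []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        scores = [sum(a * b for a, b in zip(feats[i], feats[j])) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        for j, w in zip(others, weights):
            assoc[(i, j)] = w
        msg = [sum(w * feats[j][d] for w, j in zip(weights, others)) for d in range(len(feats[i]))]
        new_feats.append([0.5 * a + 0.5 * b for a, b in zip(feats[i], msg)])
    return new_feats, assoc


def refine(obj_feats, target_times, step=toy_step):
    """Repeat the update until the number of updates reaches target_times; the graph is
    built from the features and association degrees of the last update only."""
    feats, assoc = obj_feats, None
    for _ in range(target_times):
        feats, assoc = step(feats)
    return feats, assoc
```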

In one possible implementation, the implicit relationship descriptor information refers to a first implicit relationship graph of the target image. A node in the first implicit relationship graph refers to an object in the target image, and the node feature of the node refers to the object feature of the object; an edge between two nodes refers to the implicit relationship between the two corresponding objects, the edge feature of the edge refers to the relationship feature of that implicit relationship, and the direction of the edge is the direction indicated by that implicit relationship. Obtaining the implicit relationship descriptor information of the target image based on the target image includes: performing, through a second relation determination model, feature extraction on the semantic information and the position information of each object to obtain the object feature of each object; for each object feature, fusing, through the second relation determination model, the object feature with the other obtained object features according to the similarity between the object feature and each of the other object features, to obtain an updated object feature; determining, through the second relation determination model and for every two updated object features, the corresponding edge feature based on the two updated object features, the corresponding edge feature being the feature of the edge between the two objects to which the two updated object features correspond; and determining the first implicit relationship graph based on each updated object feature and each edge feature.

In one possible implementation, fusing, through the second relation determination model and for each object feature, the other object features with the object feature according to the similarity between the object feature and each of the other obtained object features to obtain the updated object feature includes: processing each object feature through the second relation determination model to obtain a query feature, a key feature, and a value feature of each object feature; and, for each object feature, fusing, through the second relation determination model, the value feature of the object feature with the value features of the other object features according to the similarity between the query feature of the object feature and the key features of the other object features, to obtain the updated object feature.
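
The query/key/value fusion above is the standard scaled dot-product attention pattern. A minimal sketch follows, assuming hypothetical projection matrices `Wq`, `Wk`, and `Wv` (given as lists of rows), a softmax over query-key products as the similarity normalisation, and fusion by adding the object's own value feature to the attention-weighted values of the other objects. It needs at least two objects.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attention_update(feats, Wq, Wk, Wv):
    """Fuse each object's value feature with the others' value features, weighted by
    softmax-normalised query-key products."""
    q = [matvec(Wq, f) for f in feats]
    k = [matvec(Wk, f) for f in feats]
    v = [matvec(Wv, f) for f in feats]
    d = len(q[0])
    updated = []
    for i in range(len(feats)):
        others = [j for j in range(len(feats)) if j != i]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused = [sum(w * v[j][t] for w, j in zip(weights, others)) for t in range(len(v[i]))]
        # Own value feature plus the attention-weighted value features of the other objects.
        updated.append([a + b for a, b in zip(v[i], fused)])
    return updated
```

With only two objects the single weight is 1.0, so each updated feature is simply the sum of both value features.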

In one possible implementation, determining, through the second relation determination model and for every two updated object features, the corresponding edge feature based on the two updated object features includes: fusing, through the second relation determination model, every two updated object features to obtain the edge feature corresponding to those two updated object features.
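
The text does not specify the fusion operation. One simple possibility, sketched below as an assumption, is to concatenate the two updated object features and project the result with a hypothetical learned matrix `We`. Because concatenation order matters, the edge from object `i` to object `j` gets a different feature from the edge from `j` to `i`, which matches the directed edges of the first implicit relationship graph.

```python
def edge_features(feats, We):
    """Edge feature for every ordered pair (i, j): project the concatenation [f_i ; f_j]
    with a hypothetical learned matrix We (list of rows, each of length 2 * feature size)."""
    edges = {}
    for i, fi in enumerate(feats):
        for j, fj in enumerate(feats):
            if i != j:
                pair = fi + fj  # list concatenation: source feature first, target second
                edges[(i, j)] = [sum(w * x for w, x in zip(row, pair)) for row in We]
    return edges
```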

In one possible implementation, the second relation determination model includes a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sub-layers. Fusing, through the second relation determination model and for each object feature, the other object features with the object feature according to the similarity between the object feature and each of the other obtained object features to obtain the updated object feature includes: inputting the object sub-features of the same dimension of every object feature into the corresponding attention sub-layer; and fusing, through each attention sub-layer, each input object sub-feature with the other input object sub-features according to the similarity between them, to obtain the updated object sub-feature.

In one possible implementation, the implicit relationship descriptor information refers to a second implicit relationship graph of the target image. A node in the second implicit relationship graph refers to an object in the target image, and the node feature of the node refers to the object feature of the object; an edge between two nodes refers to the association degree between the two corresponding objects, and the direction of the edge is the direction indicated by that association degree. Obtaining the implicit relationship descriptor information of the target image based on the target image includes: performing, through the second relation determination model, feature extraction on the semantic information and the position information of each object to obtain the object feature of each object; updating, through the second relation determination model and for each object feature, the object feature based on the object feature itself and the other obtained object features, to obtain an updated object feature; determining, through the second relation determination model and for every two updated object features, the association degree of the corresponding edge based on the two updated object features; and determining the second implicit relationship graph based on the updated object features and the association degrees.

In one possible implementation, updating, through the second relation determination model and for each object feature, the object feature based on the object feature itself and the other obtained object features to obtain the updated object feature includes: processing each object feature through the second relation determination model to obtain its query feature, key feature, and value feature; and, through the second relation determination model and for each object feature, computing the product of the query feature of the object feature and the key feature of each other object feature, normalizing the resulting products to obtain a weight for each other object feature, and fusing the other object features with the object feature according to these weights to obtain the updated object feature.

In one possible implementation, determining, through the second relation determination model and for every two updated object features, the association degree of the corresponding edge based on the two updated object features includes: processing each updated object feature through the second relation determination model to obtain the query feature, key feature, and value feature of the updated object feature; and, for every two updated object features, determining, through the second relation determination model, the association degree of the corresponding edge based on the similarity between the query feature of one object feature and the key feature of the other object feature, where the corresponding edge points from the object to which the first object feature corresponds to the object to which the other object feature corresponds.
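
The directed association degree can be sketched as follows, assuming hypothetical projection matrices `Wq` and `Wk` and a softmax over each object's outgoing scores (the normalisation is an assumption; the text only requires a query-key similarity). Because the query of the source object is compared with the key of the target object, `assoc[(i, j)]` and `assoc[(j, i)]` generally differ, which gives each edge its direction.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def directed_association(feats, Wq, Wk):
    """Association degree of edge i -> j from the similarity of query_i and key_j,
    softmax-normalised over all edges leaving object i (assumed normalisation)."""
    q = [matvec(Wq, f) for f in feats]
    k = [matvec(Wk, f) for f in feats]
    d = len(q[0])
    assoc = {}
    for i in range(len(feats)):
        others = [j for j in range(len(feats)) if j != i]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in others]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        for j, e in zip(others, exps):
            assoc[(i, j)] = e / total  # edge from object i to object j
    return assoc
```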

In one possible implementation, the second relation determination model includes a multi-head attention layer, and the multi-head attention layer includes a plurality of attention sub-layers. Updating, through the second relation determination model and for each object feature, the object feature based on the object feature itself and the other obtained object features to obtain an updated object feature includes: inputting the object sub-features of the same dimension of every object feature into the corresponding attention sub-layer; and updating, through each attention sub-layer and for each input object sub-feature, the object sub-feature based on the object sub-feature itself and the other input object sub-features, to obtain the updated object sub-feature.

In one possible implementation, determining, through the second relation determination model and for every two updated object features, the association degree of the corresponding edge based on the two updated object features includes: determining, through each attention sub-layer and for every two updated object sub-features, the sub-association degree of the corresponding edge based on the similarity of the two updated object sub-features; and fusing the sub-association degrees of each edge determined by the attention sub-layers to obtain the association degree of that edge.

In one possible implementation, determining the second implicit relationship graph based on the updated object features and the association degrees includes: taking each updated object feature as the node feature of the corresponding node and each association degree as the association degree of the corresponding edge, to obtain the second implicit relationship graph of the target image.

In one possible implementation, determining the second implicit relationship graph based on the updated object features and the association degrees includes: repeatedly updating, through the second relation determination model, the updated object features and the association degrees until the number of updates reaches a target number; and determining the second implicit relationship graph based on the object features and the association degrees obtained in the last update.

It should also be noted that, after the semantic relationship graph, the spatial relationship graph, and the implicit relationship graph are obtained, the three graphs may be fused into a target relationship graph, and the node features in the target relationship graph are used as the object descriptor information.

Fig. 8 is a schematic structural diagram of an answer information acquisition apparatus according to an embodiment of the present application. Referring to Fig. 8, the apparatus includes: an information acquisition module 801, configured to acquire description information of a target image based on the target image, where the description information includes a plurality of pieces of descriptor information; an association degree determining module 802, configured to determine, based on question information of the target image, the association degree between each piece of descriptor information in the description information and the question information; a processing module 803, configured to weaken at least one piece of descriptor information in the description information based on the determined association degrees, to obtain processed description information; and an answer determining module 804, configured to determine answer information of the question information based on the question information and the processed description information.

As shown in Fig. 9, in one possible implementation, the description information includes relationship description information, the relationship description information includes a plurality of pieces of relationship descriptor information, and different pieces of relationship descriptor information describe different relationships between objects in the target image; the question information represents a question associated with an object in the target image; and the association degree determining module 802 is configured to determine, based on the question information, the association degree between each piece of relationship descriptor information in the relationship description information and the question information, where the association degree indicates how strongly the relationship described by that relationship descriptor information is associated with the question information.

In one possible implementation, each piece of relationship descriptor information refers to one relationship graph of the target image, and different pieces of relationship descriptor information refer to different relationship graphs of the target image. A node in a relationship graph refers to an object in the target image, and the node feature of the node refers to the object feature of the object. Either an edge between two nodes refers to the relationship between the two corresponding objects, the edge feature of the edge refers to the relationship feature of that relationship, and the direction of the edge is the direction indicated by that relationship; or an edge between two nodes refers to the association degree between the two corresponding objects, and the direction of the edge is the direction indicated by that association degree. The association degree determining module 802 is configured to: determine a first global feature of each of the plurality of relationship graphs of the target image; determine a second global feature of the question information; and determine the similarity between the first global feature of each relationship graph and the second global feature as the association degree between that relationship graph and the question information.
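
A minimal sketch of the association degree between each relationship graph and the question, assuming mean pooling as the way to obtain both the first global feature (over a graph's node features) and the second global feature (over the question's word features), and cosine similarity as the similarity measure. A learned readout could replace mean pooling; the function names are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed global pooling)."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 0.0 if na == 0 or nb == 0 else sum(x * y for x, y in zip(a, b)) / (na * nb)


def graph_question_association(graphs, word_feats):
    """graphs: one list of node features per relationship graph (e.g. semantic, spatial,
    implicit). Returns one association degree per graph."""
    question_global = mean_pool(word_feats)  # second global feature
    # First global feature of each graph, compared with the question's global feature.
    return [cosine(mean_pool(nodes), question_global) for nodes in graphs]
```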

In one possible implementation, the processing module 803 includes: a weight determination unit 8031, configured to determine the weight of each relationship graph based on the determined association degrees; and a fusion unit 8032, configured to, when the processed description information includes a third global feature of the target image, perform weighted fusion on the first global features of the plurality of relationship graphs based on the weight of each relationship graph to obtain the third global feature of the target image; or, when the processed description information includes a target relationship graph of the target image, perform weighted fusion on the plurality of relationship graphs based on the weight of each relationship graph to obtain the target relationship graph of the target image.
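
Both branches of the fusion unit can be sketched with one weighting step, assuming that the graph weights are a softmax over the association degrees (an assumption) and that all relationship graphs share the same node set, which the description implies since every node refers to an object in the target image. Graphs with a low association degree receive a small weight and are thereby weakened. The function names and the `temperature` parameter are hypothetical.

```python
import math


def graph_weights(assoc, temperature=1.0):
    """Softmax over the association degrees of the relationship graphs (assumed scheme)."""
    m = max(assoc)
    exps = [math.exp((a - m) / temperature) for a in assoc]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, weights):
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(vectors[0]))]


def fuse_global(global_feats, assoc):
    """Third global feature: weighted fusion of the graphs' first global features."""
    return weighted_sum(global_feats, graph_weights(assoc))


def fuse_graphs(graph_nodes, assoc):
    """Target relationship graph: node-by-node weighted fusion of the graphs' node features."""
    w = graph_weights(assoc)
    num_nodes = len(graph_nodes[0])
    return [weighted_sum([g[n] for g in graph_nodes], w) for n in range(num_nodes)]
```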

In one possible implementation, the description information includes object description information, the object description information includes a plurality of pieces of object descriptor information, different pieces of object descriptor information describe different objects in the target image, and the question information represents a question associated with an object in the target image; the association degree determining module 802 is configured to determine, based on the question information, the association degree between each piece of object descriptor information in the object description information and the question information, where the association degree indicates how strongly the object described by that object descriptor information is associated with the question information.

In one possible implementation, the association degree determining module 802 includes: a fusion unit 8021, configured to fuse each piece of object descriptor information in the description information with the question information to obtain a plurality of pieces of fused object descriptor information; and a determining unit 8022, configured to determine, based on the plurality of pieces of fused object descriptor information, the association degree between every two pieces of fused object descriptor information. The determining unit 8022 is further configured to, for each piece of fused object descriptor information, determine the association degree between that piece and the question information based on the sum of the association degrees between that piece and each of the other pieces of fused object descriptor information.

In one possible implementation, the object descriptor information refers to an object feature; the fusion unit 8021 is configured to acquire the text feature of each word in the question information and, for each object feature in the description information, acquire the similarity between the object feature and each text feature and fuse the text features with the object feature based on the acquired similarities, to obtain a fused object feature.
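
A minimal sketch of the fusion unit 8021, assuming that word text features and object features share one dimension, that similarity is a plain dot product normalised with softmax over the words, and that fusion means adding the attention-weighted text feature to the object feature. The name `fuse_with_question` is hypothetical; a learned cross-attention layer would be a natural replacement.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_question(obj_feats, word_feats):
    """For each object feature, weight every word's text feature by its dot-product
    similarity to the object, then add the weighted text feature to the object feature."""
    fused = []
    for obj in obj_feats:
        weights = softmax([sum(a * b for a, b in zip(obj, word)) for word in word_feats])
        attended = [sum(w * word[d] for w, word in zip(weights, word_feats)) for d in range(len(obj))]
        fused.append([a + b for a, b in zip(obj, attended)])
    return fused
```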

In one possible implementation, the object descriptor information refers to an object feature; the determining unit 8022 is configured to: obtain the query feature, key feature, and value feature of each fused object feature; for each fused object feature, obtain the similarity between its query feature and the key feature of each other fused object feature; and determine, based on the obtained similarity, the association degree between the fused object feature and that other fused object feature.
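
The determining unit 8022 can be sketched end to end: pairwise association degrees from query/key similarity, then each object's association degree with the question as the sum of its association degrees with the other objects. The sketch assumes hypothetical projection matrices `Wq` and `Wk` and unnormalised cosine similarity (a per-row softmax would make every row sum to 1 and the sums uninformative); the value features are not needed for this step and are omitted.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 0.0 if na == 0 or nb == 0 else sum(x * y for x, y in zip(a, b)) / (na * nb)


def object_question_association(fused_feats, Wq, Wk):
    """Return the pairwise association matrix and, per object, the sum of its
    association degrees with all other objects (used as its relevance to the question)."""
    q = [matvec(Wq, f) for f in fused_feats]
    k = [matvec(Wk, f) for f in fused_feats]
    n = len(fused_feats)
    pair = [[cosine(q[i], k[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    return pair, [sum(pair[i][j] for j in range(n) if j != i) for i in range(n)]
```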

In one possible implementation, the processing module 803 is configured to weaken at least one piece of fused object descriptor information in the description information based on the determined association degrees, to obtain the processed description information.

In one possible implementation, the processing module 803 is configured to determine the weight of each piece of descriptor information based on the determined association degrees and weaken the plurality of pieces of descriptor information based on those weights, to obtain the processed description information; or the processing module 803 is configured to determine at least one piece of descriptor information from the description information based on the determined association degrees and weaken the at least one piece of descriptor information, to obtain the processed description information.

In one possible implementation, the processing module 803 is configured to determine the weight of each of the at least one piece of descriptor information based on the determined association degrees and weaken the at least one piece of descriptor information based on those weights, to obtain the processed description information; or the processing module 803 is configured to delete the at least one piece of descriptor information from the description information, to obtain the processed description information.
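
The weakening options above can be sketched as follows. Assumptions: weights are `exp(a - a_max)`, so the most relevant descriptor keeps weight 1.0 and the others shrink toward zero; and the "at least one piece" to weaken is chosen as the `k` least relevant descriptors, where `k` is a hypothetical parameter (the text does not say how that subset is chosen). Both function names are hypothetical.

```python
import math


def weaken_by_weight(descriptors, assoc):
    """Scale every descriptor by exp(a - a_max): the most relevant one keeps weight 1.0,
    less relevant ones are shrunk toward zero (assumed weighting scheme)."""
    top = max(assoc)
    weights = [math.exp(a - top) for a in assoc]
    return [[w * x for x in d] for w, d in zip(weights, descriptors)]


def weaken_selected(descriptors, assoc, k, delete=True):
    """Pick the k least relevant descriptors, then either delete them or down-weight only them."""
    order = sorted(range(len(assoc)), key=lambda i: assoc[i])
    weak = set(order[:k])
    if delete:
        return [d for i, d in enumerate(descriptors) if i not in weak]
    top = max(assoc)
    return [[math.exp(assoc[i] - top) * x for x in d] if i in weak else d
            for i, d in enumerate(descriptors)]
```

Deletion is the hardest form of weakening; soft weights keep some information from low-relevance descriptors in case the association estimate is wrong.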

It should be noted that, when the answer information acquisition apparatus provided in the above embodiment acquires answer information, the division into the above functional modules is only an example. In practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the answer information acquisition apparatus and the answer information acquisition method provided in the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

In an exemplary embodiment, a computer device is provided, which includes one or more processors and one or more memories, in which at least one program code is stored, and the at least one program code is loaded and executed by the one or more processors to implement the answer information acquisition method as in the above embodiments.

Optionally, the computer device is provided as a terminal. Fig. 10 is a block diagram of a terminal 1000 according to an exemplary embodiment of the present application. The terminal 1000 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1000 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

Terminal 1000 can include: a processor 1001 and a memory 1002.

The processor 1001 may include one or more processing cores, for example, a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for computing operations related to machine learning.

Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1002 is used to store at least one program code for execution by the processor 1001 to implement the answer information acquisition method provided by the method embodiments in the present application.

In some embodiments, terminal 1000 may optionally further include a peripheral interface 1003 and at least one peripheral device. The processor 1001, the memory 1002, and the peripheral interface 1003 may be connected by a bus or signal line, and each peripheral device may be connected to the peripheral interface 1003 by a bus, signal line, or circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1004, a display screen 1005, a camera assembly 1006, an audio circuit 1007, a positioning component 1008, and a power supply 1009.

The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.

The radio frequency circuit 1004 is configured to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission and converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1004 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 1004 may communicate with other terminals through at least one wireless communication protocol, including but not limited to metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 1005 is configured to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, it can also capture touch signals on or above its surface; a touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 1005, arranged on the front panel of terminal 1000; in other embodiments, there are at least two display screens 1005, arranged on different surfaces of terminal 1000 or in a folding design; in still other embodiments, the display screen 1005 is a flexible display arranged on a curved or folded surface of terminal 1000. The display screen 1005 may even have a non-rectangular irregular shape, that is, be a shaped screen. The display screen 1005 may use an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or a similar technology.

The camera assembly 1006 is configured to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera: the front camera is arranged on the front panel of the terminal, and the rear camera on its back. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined to implement background blurring, and the main camera and the wide-angle camera can be combined to implement panoramic shooting, VR (Virtual Reality) shooting, or other combined shooting functions. In some embodiments, the camera assembly 1006 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

The audio circuit 1007 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs the signals to the processor 1001 for processing or to the radio frequency circuit 1004 for voice communication. For stereo sound collection or noise reduction, multiple microphones may be provided at different locations of terminal 1000. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves, and may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans, for example for distance measurement. In some embodiments, the audio circuit 1007 may also include a headphone jack.

The positioning component 1008 is configured to locate the current geographic location of terminal 1000 for navigation or LBS (Location Based Service). The positioning component 1008 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 1009 is used to supply power to various components in terminal 1000. The power source 1009 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 1009 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 1000 further includes one or more sensors 1010, including but not limited to an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, a fingerprint sensor 1014, an optical sensor 1015, and a proximity sensor 1016.

Those skilled in the art will appreciate that the structure shown in Fig. 10 does not limit terminal 1000, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.

Optionally, the computer device is provided as a server. Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1100 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memory 1102 stores at least one piece of program code that is loaded and executed by the processor 1101 to implement the methods provided in the above method embodiments. The server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here.

The server 1100 is configured to perform the steps performed by the server in the above-described method embodiments.

In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, including program code, which is executable by a processor in a computer device to perform the answer information acquisition method in the above-described embodiments. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, there is also provided a computer program or a computer program product including computer program code, which, when executed by a computer, causes the computer to implement the answer information acquisition method in the above-described embodiments.

Those skilled in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

The above descriptions are merely optional embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims

1. An answer information acquisition method, characterized by comprising:

acquiring description information of a target image based on the target image, wherein the description information comprises a plurality of pieces of descriptor information;

determining the association degree of each piece of descriptor information in the description information and the question information based on the question information of the target image;

weakening at least one piece of descriptor information in the description information based on the determined association degree to obtain processed description information;

and determining answer information of the question information based on the question information and the processed description information.
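Claim 1 reads as a four-step pipeline. The following is a minimal Python sketch under stated assumptions, not the applicant's implementation: the descriptor information and the question are assumed to be already encoded as float vectors of equal length (step 1 is treated as done upstream), the association degree is assumed to be cosine similarity, weakening is assumed to be softmax re-weighting, and `answer_head` is a hypothetical stand-in for whatever answer predictor is actually used.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def answer_question(
    descriptors: List[Vector],
    question: Vector,
    answer_head: Callable[[Vector, Vector], str],
) -> str:
    # Step 1 (assumed done upstream): `descriptors` are the pieces of
    # descriptor information extracted from the target image.
    # Step 2: association degree of each descriptor with the question.
    assoc = [cosine(d, question) for d in descriptors]
    # Step 3: weaken weakly associated descriptors by scaling each one
    # with its softmax weight (an assumed weakening rule).
    weights = softmax(assoc)
    processed = [[w * x for x in d] for w, d in zip(weights, descriptors)]
    # Step 4: pool the processed descriptors and let the hypothetical
    # answer head combine them with the question.
    pooled = [sum(col) for col in zip(*processed)]
    return answer_head(question, pooled)
```

For example, with descriptors `[[1.0, 0.0], [0.0, 1.0]]` and question `[1.0, 0.0]`, the first descriptor keeps most of its magnitude and the second is attenuated before pooling.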

2. The method according to claim 1, wherein the description information includes relationship description information, the relationship description information includes a plurality of pieces of relationship descriptor information, different relationship descriptor information is used for describing different relationships between objects in the target image; the question information is used for representing a question associated with an object in the target image;

the determining the association degree of each piece of descriptor information in the description information and the question information based on the question information of the target image includes:

determining, based on the question information, the association degree between each piece of relationship descriptor information in the relationship description information and the question information, wherein the association degree represents how strongly the relationship described by that relationship descriptor information is associated with the question information.

3. The method according to claim 2, wherein one piece of relationship descriptor information refers to one relationship graph of the target image, and different pieces of relationship descriptor information refer to different relationship graphs of the target image; a node in a relationship graph refers to an object in the target image, and the node feature of the node refers to the object feature of that object;

an edge between two nodes in the relationship graph refers to the relationship between the two corresponding objects, the edge feature of the edge refers to the relationship feature of that relationship, and the direction of the edge is the direction indicated by the relationship between the two objects; or, an edge between two nodes in the relationship graph refers to the association degree between the two corresponding objects, and the direction of the edge is the direction indicated by that association degree;

the determining, based on the question information, the association degree between each piece of relationship descriptor information in the relationship description information and the question information includes:

determining a first global feature of each of a plurality of relationship graphs of the target image;

determining a second global feature of the question information;

and determining the similarity between the first global feature of each relationship graph and the second global feature as the association degree between that relationship graph and the question information.
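Claim 3 leaves open how the global features are computed. The sketch below is one possible reading, under assumptions not stated in the claim: all relationship graphs share the same object nodes and differ only in their weighted directed edges; the first global feature is the mean of the node features after one round of message passing along the edges (without this step, graphs that share nodes would all pool to the same vector); the second global feature is the mean of the question's word features; and the similarity is cosine similarity. `RelationGraph`, `propagate` and `graph_question_association` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Vector = List[float]


@dataclass
class RelationGraph:
    node_features: List[Vector]  # one object feature per node
    # Directed edges (src, dst, weight); the weight may be an association
    # degree, or 1.0 for a plain relation edge.
    edges: List[Tuple[int, int, float]] = field(default_factory=list)


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def propagate(graph: RelationGraph) -> List[Vector]:
    """One round of message passing: each node adds the weighted features
    of its in-neighbours, following the edge direction src -> dst."""
    updated = [list(h) for h in graph.node_features]
    for src, dst, w in graph.edges:
        for k, x in enumerate(graph.node_features[src]):
            updated[dst][k] += w * x
    return updated


def graph_question_association(
    graphs: Sequence[RelationGraph], word_features: Sequence[Sequence[float]]
) -> List[float]:
    # Second global feature: mean of the question's word features.
    q_global = mean_pool(word_features)
    # First global feature per graph, compared with the question by cosine.
    return [cosine(mean_pool(propagate(g)), q_global) for g in graphs]
```

A learned graph neural network and a learned question encoder would normally replace the mean pooling here; the pooled-and-compared structure is what the claim describes.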

4. The method according to claim 3, wherein the weakening at least one piece of descriptor information in the description information based on the determined association degree to obtain processed description information comprises:

determining a weight of each relationship graph based on the determined association degree; and

performing weighted fusion of the first global features of the plurality of relationship graphs based on the weight of each relationship graph to obtain a third global feature of the target image, the processed description information comprising the third global feature; or, performing weighted fusion of the plurality of relationship graphs based on the weight of each relationship graph to obtain a target relationship graph of the target image, the processed description information comprising the target relationship graph.
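Both branches of claim 4 can be illustrated with a small sketch. It assumes, without support from the claim, that the weights are a softmax over the association degrees and that each relationship graph is represented as an adjacency matrix over the same set of objects; `fuse_global_features` and `fuse_adjacency` are hypothetical names.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_global_features(
    global_feats: Sequence[Sequence[float]], assoc: Sequence[float]
) -> Vector:
    """Third global feature: weighted sum of the graphs' first global features."""
    weights = softmax(assoc)
    dim = len(global_feats[0])
    return [sum(w * g[k] for w, g in zip(weights, global_feats)) for k in range(dim)]


def fuse_adjacency(
    adjs: Sequence[Sequence[Sequence[float]]], assoc: Sequence[float]
) -> Matrix:
    """Target relationship graph: weighted sum of the graphs' adjacency matrices."""
    weights = softmax(assoc)
    n = len(adjs[0])
    return [[sum(w * a[i][j] for w, a in zip(weights, adjs)) for j in range(n)]
            for i in range(n)]
```

Graphs with a low association degree receive a small softmax weight, so their contribution to the fused result is weakened rather than removed outright.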

5. The method according to claim 1, wherein the description information includes object description information, the object description information includes a plurality of pieces of object descriptor information, different object descriptor information is used for describing different objects in the target image, and the question information is used for representing a question associated with an object in the target image;

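the determining the association degree of each piece of descriptor information in the description information and the question information based on the question information of the target image includes: determining, based on the question information, the association degree between each piece of object descriptor information in the object description information and the question information, wherein the association degree represents how strongly the object described by that object descriptor information is associated with the question information.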

6. The method according to claim 5, wherein the determining the association degree of each piece of object descriptor information in the object description information with the question information based on the question information comprises:

fusing each piece of object descriptor information in the description information with the question information to obtain a plurality of pieces of fused object descriptor information;

determining the association degree between every two pieces of fused object descriptor information based on the multiple pieces of fused object descriptor information;

for each piece of fused object descriptor information, determining the association degree between that fused object descriptor information and the question information based on the sum of the association degrees between it and each of the other pieces of fused object descriptor information.
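The last step of claim 6 aggregates pairwise association degrees into one score per object. A minimal sketch, assuming the pairwise degrees are held in a square matrix in which `pairwise[j][i]` is the degree that object `j` assigns to object `i`: the score of object `i` is the sum over all other objects `j` (the diagonal is skipped, matching "other" in the claim). Summing down the column rather than along the row is an assumption; if each row were softmax-normalized, row sums would be identical for every object and carry no information. `object_question_association` is a hypothetical name.

```python
from typing import List, Sequence


def object_question_association(pairwise: Sequence[Sequence[float]]) -> List[float]:
    """Return, for each object i, the sum of pairwise[j][i] over all j != i,
    i.e. the total association that object i receives from the others."""
    n = len(pairwise)
    return [sum(pairwise[j][i] for j in range(n) if j != i) for i in range(n)]
```

Objects that the other question-fused objects consistently point to end up with a high score; isolated objects end up with a low one and are candidates for weakening.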

7. The method of claim 6, wherein the object descriptor information refers to object features; the fusing each piece of object descriptor information in the description information with the question information to obtain a plurality of pieces of fused object descriptor information, including:

acquiring a text feature of each word in the question information;

and for each object feature in the description information, acquiring the similarity between the object feature and each text feature, and fusing each text feature and the object feature based on the acquired similarity to obtain a fused object feature.
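Claim 7 describes question-guided fusion of each object feature. The sketch below assumes dot-product similarity between the object feature and each word's text feature, softmax normalization of those similarities, and additive fusion of the similarity-weighted word context into the object feature; concatenation or a gated sum would fit the claim equally well. `fuse_with_question` is a hypothetical name, and features are assumed to share one dimension.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_question(
    obj: Sequence[float], word_feats: Sequence[Sequence[float]]
) -> Vector:
    # Similarity between the object feature and each word's text feature.
    sims = [sum(o * t for o, t in zip(obj, w)) for w in word_feats]
    alphas = softmax(sims)
    # Question context for this object: similarity-weighted sum of word features.
    context = [sum(a * w[k] for a, w in zip(alphas, word_feats))
               for k in range(len(obj))]
    # Additive fusion of object feature and question context (assumption).
    return [o + c for o, c in zip(obj, context)]
```

Words that resemble the object contribute more to its fused feature, so the same object can end up with different fused features for different questions.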

8. The method according to claim 7, wherein the determining the association degree between each two pieces of fused object descriptor information based on the plurality of pieces of fused object descriptor information comprises:

for each fused object feature, acquiring a query feature, a key feature and a value feature of the fused object feature; acquiring the similarity between the query feature of the fused object feature and the key feature of each of the other fused object features; and determining the association degree between the fused object feature and each of the other fused object features based on the acquired similarity.
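Claim 8 matches the query/key/value pattern of attention. A minimal sketch, assuming the query, key and value features come from linear projections `w_q`, `w_k`, `w_v` (hypothetical learned matrices), that similarity is the scaled dot product familiar from Transformer attention, and that each object's similarities to the *other* objects are softmax-normalized so that its self-association stays at zero. `pairwise_association` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_association(
    features: Sequence[Sequence[float]], w_q: Matrix, w_k: Matrix, w_v: Matrix
) -> Tuple[Matrix, Matrix]:
    if not features:
        return [], []
    queries = [matvec(w_q, f) for f in features]
    keys = [matvec(w_k, f) for f in features]
    values = [matvec(w_v, f) for f in features]  # kept for downstream aggregation
    n = len(features)
    scale = math.sqrt(len(queries[0]))
    assoc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        # Scaled similarity of object i's query with each other object's key.
        scores = [dot(queries[i], keys[j]) / scale for j in others]
        for j, w in zip(others, softmax(scores)):
            assoc[i][j] = w
    return assoc, values
```

Row `i` of `assoc` holds the association degrees of object `i` with every other object; its diagonal entry is left at 0.0.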

9. The method according to claim 6, wherein the weakening at least one piece of descriptor information in the description information based on the determined association degree to obtain processed description information comprises:

weakening at least one piece of fused object descriptor information in the description information based on the determined association degree to obtain the processed description information.

10. The method according to claim 1, wherein the weakening at least one piece of descriptor information in the description information based on the determined association degree to obtain processed description information comprises:

determining a weight of each piece of descriptor information based on the determined association degree, and weakening the plurality of pieces of descriptor information based on the weight of each piece of descriptor information to obtain the processed description information; or,

determining at least one piece of descriptor information from the description information based on the determined association degree, and weakening the at least one piece of descriptor information to obtain the processed description information.

11. The method according to claim 10, wherein the weakening the at least one piece of descriptor information to obtain the processed description information comprises:

determining a weight of each of the at least one piece of descriptor information based on the determined association degree, and weakening the at least one piece of descriptor information in the description information based on the weight of each piece of descriptor information to obtain the processed description information; or,

deleting the at least one piece of descriptor information from the description information to obtain the processed description information.
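The weakening alternatives of claims 10 and 11 can be expressed as modes of one function. This is a sketch under assumptions not found in the claims: the weight is the sigmoid of the association degree (always strictly between 0 and 1, so scaling by it always attenuates), and "at least one piece of descriptor information" is selected as the `k` descriptors with the lowest association degree; a fixed threshold would fit the claims equally well. `weaken` and its `mode` values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def weaken(descriptors: Sequence[Sequence[float]], assoc: Sequence[float],
           mode: str = "weight_all", k: int = 1) -> List[Vector]:
    """Weaken descriptors according to their association degrees.

    "weight_all":      scale every descriptor by its weight (claim 10, first branch)
    "weight_selected": scale only the k least associated ones (claim 11, first branch)
    "delete_selected": drop the k least associated ones (claim 11, second branch)
    """
    if mode == "weight_all":
        return [[sigmoid(a) * x for x in d] for d, a in zip(descriptors, assoc)]
    # Bottom-k selection of weakly associated descriptors (assumption).
    selected = set(sorted(range(len(assoc)), key=lambda i: assoc[i])[:k])
    if mode == "weight_selected":
        return [[sigmoid(assoc[i]) * x for x in d] if i in selected else list(d)
                for i, d in enumerate(descriptors)]
    if mode == "delete_selected":
        return [list(d) for i, d in enumerate(descriptors) if i not in selected]
    raise ValueError(f"unknown mode: {mode}")
```

Soft weighting keeps every descriptor available to the answer step at reduced strength, while deletion removes the selected noise entirely at the risk of discarding something the question actually needed.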

12. An answer information acquisition apparatus, characterized in that the apparatus comprises:

the information acquisition module is used for acquiring description information of a target image based on the target image, wherein the description information comprises a plurality of pieces of descriptor information;

the association degree determining module is used for determining, based on the question information of the target image, the association degree between each piece of descriptor information in the description information and the question information;

the processing module is used for weakening at least one piece of descriptor information in the description information based on the determined association degree to obtain processed description information;

and the answer determining module is used for determining the answer information of the question information based on the question information and the processed description information.

13. A computer device comprising one or more processors and one or more memories having stored therein at least one program code, the at least one program code being loaded into and executed by the one or more processors to implement the operations executed by the answer information acquisition method according to any one of claims 1 to 11.

14. A computer-readable storage medium, wherein at least one program code is stored in the storage medium, and the at least one program code is loaded and executed by a processor to implement the operations performed by the answer information acquisition method according to any one of claims 1 to 11.