CN113139121A - Query method, model training method, device, equipment and storage medium - Google Patents

Query method, model training method, device, equipment and storage medium

Info

Publication number
CN113139121A
Authority
CN
China
Prior art keywords
feature
interaction
matched
features
queried
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010065633.6A
Other languages
Chinese (zh)
Inventor
黄龙涛
张东杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010065633.6A priority Critical patent/CN113139121A/en
Publication of CN113139121A publication Critical patent/CN113139121A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Abstract

The embodiment of the application provides a query method, a model training method, an apparatus, a device, and a storage medium. In the query method, when a query request is processed, a feature interaction operation is performed on the multi-modal description features of the object to be queried to obtain an interaction feature. The matching degree between the object to be queried and each object to be matched is then calculated from the interaction feature of the object to be queried and the interaction feature of the object to be matched, and a target object matching the object to be queried is determined from the objects to be matched according to the matching degree. Because description features of multiple modalities are used and made to interact with one another, the description features of the different modalities and the interaction relationships among them jointly drive the object recognition process, which reduces or eliminates the ambiguity of any single modality's features and helps to effectively improve the accuracy of object recognition.

Description

Query method, model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a query method, a model training method, an apparatus, a device, and a storage medium.
Background
In the field of object recognition, one existing method recognizes an object based on textual query description information about the object. However, in some scenarios, when the text is ambiguous, the object cannot be accurately identified from the textual query description information.
For example, in some scenarios, a user disguises published illicit content with ambiguous text that is not readily identifiable as illicit. For another example, in other scenarios, when a user searches for an item using ambiguous text, the returned search results may not match the item the user actually intended to find. A new solution is therefore needed.
Disclosure of Invention
Aspects of the present application provide a query method, a model training method, an apparatus, a device, and a storage medium, which help to effectively improve the accuracy of object recognition.
The embodiment of the application provides a query method, comprising: in response to a query request, acquiring multi-modal description features of an object to be queried; performing a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature; calculating the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the interaction feature of each object to be matched; and determining, according to the matching degree, a target object matching the object to be queried from the at least one object to be matched.
The embodiment of the present application further provides a model training method, comprising: acquiring multi-modal description features of an object to be queried according to query description information of the object to be queried; performing feature interaction operations on the multi-modal description features of the object to be queried and on the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature; calculating the matching degree between the first interaction feature and the second interaction feature according to the model parameters of a query link model; and updating the model parameters according to the calculated matching degree and the ground-truth matching degree between the object to be queried and the object to be matched, so as to optimize the query link model.
An embodiment of the present application further provides a query apparatus, comprising: a first feature acquisition module, configured to: in response to a query request, acquire multi-modal description features of an object to be queried; a first interaction module, configured to: perform a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature; a matching degree calculation module, configured to: calculate the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the interaction feature of each object to be matched; and a target object determination module, configured to: determine, according to the matching degree, a target object matching the object to be queried from the at least one object to be matched.
The embodiment of the present application further provides a model training apparatus, comprising: a first feature acquisition module, configured to: acquire multi-modal description features of an object to be queried according to query description information of the object to be queried; a first interaction module, configured to: perform feature interaction operations on the multi-modal description features of the object to be queried and on the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature; a matching degree calculation module, configured to: calculate the matching degree between the first interaction feature and the second interaction feature according to the model parameters of a query link model; and a parameter optimization module, configured to: update the model parameters according to the calculated matching degree and the ground-truth matching degree between the object to be queried and the object to be matched, so as to optimize the query link model.
An embodiment of the present application further provides an electronic device, comprising a memory and a processor. The memory is configured to store one or more computer instructions; the processor is configured to execute the one or more computer instructions to perform the steps of the query method or the model training method provided by the embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed, implements the steps of the query method or the model training method provided in the embodiments of the present application.
In the query method provided by the embodiment of the application, when a query request is processed, a feature interaction operation is performed on the multi-modal description features of the object to be queried to obtain an interaction feature. The matching degree between the object to be queried and each object to be matched is then calculated from the interaction feature of the object to be queried and the interaction feature of the object to be matched, and a target object matching the object to be queried is determined from the objects to be matched according to the matching degree. Because description features of multiple modalities are used and made to interact with one another, the description features of the different modalities and the interaction relationships among them jointly drive the object recognition process, which reduces or eliminates the ambiguity of any single modality's features and helps to effectively improve the accuracy of object recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart illustrating a query method according to an exemplary embodiment of the present application;
FIG. 2a is a schematic flow chart diagram illustrating a query method according to another exemplary embodiment of the present application;
FIG. 2b is a schematic diagram of the self-interaction operation provided by an exemplary embodiment of the present application;
FIG. 2c is a schematic diagram of bi-directional interaction provided by an exemplary embodiment of the present application;
FIG. 2d is a schematic diagram of bi-directional interaction provided by another exemplary embodiment of the present application;
FIG. 2e is a schematic diagram of a merchandise query scenario provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a model training method provided in an exemplary embodiment of the present application;
FIG. 4 is a schematic structural diagram of a query apparatus according to an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of a model training apparatus according to an exemplary embodiment of the present application;
FIG. 6 is a schematic structural diagram of a query device according to an exemplary embodiment of the present application;
FIG. 7 is a schematic structural diagram of a model training device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, when a user initiates an object query, if the provided query description information is ambiguous, an object to be queried cannot be accurately identified according to the query description information. In view of this technical problem, in some embodiments of the present application, a solution is provided. Technical solutions provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a query method according to an exemplary embodiment of the present application, and as shown in fig. 1, the method includes:
Step 101: in response to a query request, acquire multi-modal description features of an object to be queried.
Step 102: perform a feature interaction operation on the multi-modal description features to obtain a first interaction feature.
Step 103: calculate the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the interaction feature of the at least one object to be matched.
Step 104: determine, according to the matching degree, a target object matching the object to be queried from the at least one object to be matched.
The object may be an objectively existing entity, such as a commodity, an animal, a plant, or a building; the object may also be abstract information, such as a mathematical formula, a concept definition, or a written work, which is not limited in this embodiment.
Based on different implementation forms of the object, the embodiment can be applied to a plurality of different object query scenarios: for example, entity query scenarios such as a commodity query scenario, an animal query scenario, and a plant query scenario, or information query scenarios such as a mathematical formula query scenario, a concept query scenario, and an illegal information query scenario.
When the user has the object query requirement, the query description information can be input. From the query description information, an object matching the query description information can be selected from known objects. For convenience of description and distinction, an object described by the query description information input by the user is called an object to be queried, and an object known in a pre-established query knowledge base is called an object to be matched.
A modality represents a distinct information source or a distinct information representation form. For example, when information is collected by different sensors such as a radar, an infrared sensor, and a camera, the information collected by each of them may be referred to as information of a different modality. Likewise, when information is expressed in different forms such as image, video, audio, and text semantics, the information in each of those forms may be referred to as information of a different modality.
Acquiring the multi-modal description features of the object to be queried thus includes: acquiring features that describe the object to be queried and come from different information sources; or acquiring features that describe the object to be queried in different representation forms.
The first interaction feature is obtained from a feature interaction operation among the multi-modal description features of the object to be queried. The feature interaction operation may include cross calculation and combination calculation over the features of the multiple modalities. Through such interaction, the description features of different modalities can effectively influence, associate with, and act upon one another, which in turn reduces or eliminates the ambiguity of any single modality's features.
The at least one object to be matched may include one object to be matched or a plurality of objects to be matched, which is not limited in this embodiment. Each object to be matched corresponds to its own interaction feature, computed in the same or a similar way as the first interaction feature, which is not repeated here. It should be understood that "first" is used only for convenience of distinction and does not limit the order, number, or level of the interaction features.
Based on the first interaction feature of the object to be queried and the interaction feature of each of the at least one object to be matched, the matching degree between the object to be queried and the at least one object to be matched can be calculated. For any object to be matched, its matching degree with the object to be queried may be obtained from the matching degree between its interaction feature and the first interaction feature of the object to be queried. From the at least one calculated matching degree, a target object matching the object to be queried can then be determined from the at least one object to be matched. For example, the object to be matched with the highest matching degree may be selected as the target object, or every object to be matched whose matching degree exceeds a set threshold may be selected, which is not limited in this embodiment.
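The selection logic can be illustrated with a minimal sketch. It assumes the interaction features are already computed as fixed-length vectors and uses cosine similarity as a stand-in for the model's learned matching-degree score; the function name and threshold value are illustrative, not part of this embodiment:

```python
import numpy as np

def select_target_objects(query_feat, candidate_feats, threshold=0.5):
    """Rank candidate objects by matching degree; keep those above the threshold,
    falling back to the single best match. Cosine similarity stands in for the
    query link model's learned scoring layer."""
    q = query_feat / np.linalg.norm(query_feat)
    scores = {obj_id: float(q @ (f / np.linalg.norm(f)))
              for obj_id, f in candidate_feats.items()}
    matched = [obj_id for obj_id, s in scores.items() if s > threshold]
    best = max(scores, key=scores.get)
    return (matched or [best]), scores
```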
In this embodiment, when the query request is processed, a feature interaction operation is performed on the multi-modal description features of the object to be queried to obtain an interaction feature. The matching degree between the object to be queried and each object to be matched is calculated from the two objects' interaction features, and a target object matching the object to be queried is determined from the objects to be matched according to the matching degree. Because description features of multiple modalities are used and made to interact with one another, the description features of the different modalities and the interaction relationships among them jointly drive the object recognition process, which reduces or eliminates the ambiguity of any single modality's features and helps to effectively improve the accuracy of object recognition.
Fig. 2a is a schematic flowchart of a query method according to another exemplary embodiment of the present application. As shown in fig. 2a, the method includes:
step 201, responding to the query request, and acquiring first query description information provided by a user.
Step 202, providing at least one other description information which has an interaction relation with the first query description information to the user.
Step 203, responding to the selection operation of the user for the at least one other description information, and acquiring the selected description information as second query description information.
Step 204, inputting the first query description information and the second query description information into a query link model.
Step 205, in the query link model, according to the first query description information and the second query description information, respectively extracting the description feature of the first modality and the description feature of the second modality as the multi-modality description feature of the object to be queried.
And step 206, in the first interaction layer in the query link model, performing feature interaction operation on the multi-modal description feature to obtain a first interaction feature.
Step 207, in the second interaction layer of the query link model, performing feature interaction operation on the first interaction feature and the respective interaction feature of at least one object to be matched to obtain at least one third interaction feature; the interactive features of the object to be matched are obtained by performing feature interactive operation on the multi-modal representation features of the object to be matched.
And step 208, in the layering of the query link model, scoring the at least one third interactive feature according to the parameters of the layering to obtain the matching degree of the first interactive feature and the interactive feature of the at least one object to be matched.
Step 209, in the output layer of the query link model, according to the at least one matching degree, determining a target object adapted to the object to be queried from the at least one object to be matched, and outputting the target object.
In step 201, optionally, the query request may be sent by the user through the terminal device. For example, the user may input the query request on a query page provided by the terminal device, or input the voice query instruction through a voice input device provided by the terminal device, which is not limited in this embodiment.
The first query description information is the query description information provided by the user. In some cases it contains only a single modality's features; if those features are ambiguous, the response to the query request will be inaccurate. To address this, the next step 202 may be executed to provide the user with other description information to choose from.
In step 202, optionally, the manner of providing the at least one other description information to the user may include: and displaying the at least one other description information through a display screen provided by the terminal, or playing the at least one other description information in a voice broadcast mode. The present embodiment is not limited.
The first query description information and the at least one piece of other description information have an interaction relationship, which means: they may describe the same object, so the other description information can help reduce or eliminate the ambiguity of the first query description information; conversely, the first query description information can also help reduce or eliminate the ambiguity of the other description information.
For example, when a user initiates a search query, the first query description information entered is the textual description "an animal that eats by picking leaves from trees". Because many animals pick leaves and eat them directly, an accurate search result cannot be returned to the user. In this step, other query description information may be provided to the user, such as pictures of the koala, the monkey, or the giraffe. Such a picture can effectively help reduce or eliminate the ambiguity of the first query description information.
Wherein the at least one other description information may be obtained by the following implementation:
optionally, a plurality of objects to be matched to which the first query description information is adapted may be obtained from a multimodal knowledge graph.
The multi-modal knowledge graph is pre-established and stores the full set of objects to be matched, the multi-modal representation information of each object to be matched, and the correspondence between the two, so that the objects to be matched are represented uniformly at the knowledge level. This lays the foundation for reducing or eliminating ambiguity during object recognition.
Based on this, the first query description information and the representation information in the multimodal knowledge graph may be matched. For example, if the first query description information is in a text format, the first query description information may be text-matched with the representation information in the multimodal knowledge-graph. If the representation information matched with the first query description information exists, the object to be matched corresponding to the representation information can be used as the object to be matched adaptive to the first query description information. Generally, when the first query description information is ambiguous, the number of objects to be matched with the first query description information is multiple.
Then, from the multi-modal representation information corresponding to each of the plurality of objects to be matched, other representation information belonging to a different modality from the first query description information is selected as the at least one other description information. After the user acquires the at least one type of other description information, the description information matched with the query requirement can be selected from the at least one type of other description information according to the query requirement.
Continuing the above example, the first query description information is: "an animal that eats by picking leaves from trees". If the multi-modal knowledge graph contains representation information matching "picking leaves from trees and eating" among the representation information of several animals such as the koala, the monkey, and the giraffe, then representation information of other modalities of the koala, the monkey, and the giraffe can be obtained. For example, pictures of the koala, the monkey, and the giraffe may be obtained and shown to the user; if the user actually wants to query the koala, the picture of the koala can be selected.
In step 203, the other description information selected by the user may be used as the second query description information based on the selection operation of the user.
In some preferred embodiments, the first query description information includes: text description information; the second query description information includes: image description information. Based on the above, the text description information can be supplemented by the image description information to reduce or eliminate ambiguity of the text description information. For example, taking the above example, if the user selects the picture of the koala, the picture of the koala may be used as the second query description information.
Next, in step 204, the first query description information and the second query description information may be input into the query link model. The query link model is obtained by pre-training and is used to recognize the multi-modal features contained in the query description information and to link the query result to one or more objects. The internal workings of the query link model are described by example in the subsequent steps.
In step 205, after the query link model obtains the first query description information and the second query description information, feature extraction may be performed on the two query description information to obtain a description feature of the first modality and a description feature of the second modality.
The description features of the first modality and the description features of the second modality may respectively include one or more different description features, and the embodiment is not limited.
Optionally, if the query description information is information in a text format, semantic feature extraction may be performed on the information in the text format. The semantic feature extraction operation may be implemented based on a word2vec (word-to-vector) model, or may be implemented based on an ELMo model (a deep contextualized word representation model), which is not limited in this embodiment. For example, semantic feature extraction may be performed on the name of the object to be queried included in the query description information based on a word2vec (word-to-vector) model, so as to obtain a semantic feature of the name of the object to be queried. Or, semantic feature extraction can be performed on the context information of the name of the object to be queried based on the ELMo model, so as to obtain the context semantic feature of the name of the object to be queried.
Optionally, if the query description information is information in a picture format, image feature extraction may be performed on it. The image feature extraction may be implemented based on at least one of the VGGNet, AlexNet, and Inception Net neural network models, which is not limited in this embodiment.
Optionally, if the query description information is information in an audio format, speech recognition may be performed on it and semantic features may then be extracted from the recognition result. Alternatively, audio features may be extracted directly from the audio, which is not limited in this embodiment.
When the query description information is in other formats, feature extraction can be performed by using feature extraction methods corresponding to the information in other formats, which is not described again.
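A minimal sketch of such per-modality extraction is given below. It assumes a gensim Word2Vec model for the text modality and a pretrained torchvision VGG16 for the image modality; the toy corpus, vector sizes, and helper names are illustrative only:

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from gensim.models import Word2Vec

# Text modality: average word vectors from a small in-domain Word2Vec model.
corpus = [["animal", "eats", "leaves", "from", "tree"],
          ["koala", "eats", "eucalyptus", "leaves"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, min_count=1)

def text_feature(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

# Image modality: penultimate-layer activations of a pretrained CNN.
cnn = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
cnn.classifier = cnn.classifier[:-1]  # drop the final classification layer
cnn.eval()
prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path):
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(x).squeeze(0).numpy()  # 4096-dimensional feature vector
```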
After the multi-modal description features of the object to be queried are obtained, step 206 may be executed next: perform a feature interaction operation on the multi-modal description features of the object to be queried to obtain the first interaction feature.
Optionally, the multi-modal description feature of the object to be queried may include: at least one of semantic features of the name of the object to be queried, contextual semantic features of the name of the object to be queried, and image features of the object to be queried. In step 206, optionally, for convenience of description and distinction, the computation layer for performing feature interaction operations on the multi-modal description features in the query link model can be referred to as a first interaction layer.
Optionally, in the first interaction layer of the query link model, self-interaction processing may be performed on the multi-modal description features of the object to be queried based on an Attention Mechanism (Attention Mechanism), so as to obtain self-interaction feature vectors of the multi-modal description features of the object to be queried.
Alternatively, the self-interaction based on the attention mechanism will be exemplified below by taking the description feature of any one modality in the multi-modal description feature of the object to be queried as an example. For convenience of description, the description feature of any modality is referred to as a first description feature.
Optionally, for the first description feature, the similarity between the first description feature and each feature in the multi-modal description feature of the object to be queried may be calculated to obtain a plurality of self-interaction weights corresponding to the first description feature; then, according to the self-interaction weights, the multi-modal description features of the object to be queried are weighted and calculated to obtain a self-interaction vector of the first description feature.
A specific example will be further described below in conjunction with fig. 2 b.
Suppose the description features of the object to be queried are X1, X2, and X3. As shown in fig. 2b, when the self-interaction calculation is performed for description feature X1, the following similarities can be computed: the similarity of X1 with itself, S11 = S(X1, X1); the similarity of X1 and X2, S12 = S(X1, X2); and the similarity of X1 and X3, S13 = S(X1, X3).
After S11, S12, and S13 are obtained, they are normalized to yield the three corresponding self-attention weights A11, A12, and A13. A weighted sum then gives the self-interaction vector corresponding to X1: X1′ = X1·A11 + X2·A12 + X3·A13.
Correspondingly, with the same method, the self-interaction vectors of the description features of the other modalities can be calculated. For example, following the calculation process illustrated in fig. 2b, the self-interaction vector X2′ corresponding to description feature X2 and the self-interaction vector X3′ corresponding to description feature X3 can be obtained.
Then, the self-interaction feature vectors of the multi-modal description features of the object to be queried can be fused to obtain the first interaction feature. Optionally, the fusion operation may be implemented as vector concatenation. For example, when the self-interaction vectors of the multi-modal description features are X1′, X2′, and X3′, the first interaction feature can be implemented as (X1′, X2′, X3′).
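The self-interaction procedure above amounts to one round of self-attention. A minimal NumPy sketch follows; it assumes the modality features have already been projected to a common dimension, and uses dot products for the similarity function S and a softmax for the normalization:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def self_interaction(features):
    """features: list of modality vectors X1..Xn, all of dimension d.
    Returns the first interaction feature: the concatenation (X1', ..., Xn')."""
    X = np.stack(features)              # shape (n, d)
    out = []
    for i in range(len(X)):
        sims = X @ X[i]                 # S(Xi, Xj) for every j, as dot products
        weights = softmax(sims)         # self-attention weights A_i1..A_in
        out.append(weights @ X)         # Xi' = sum_j A_ij * Xj
    return np.concatenate(out)
```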
Next, in step 207, a feature interaction operation may be performed on the first interaction feature and the interaction feature of each of the at least one object to be matched. The interaction feature of each object to be matched is obtained by performing a feature interaction operation on that object's multi-modal representation features. For an optional embodiment of this operation, refer to the optional embodiment of the feature interaction operation on the multi-modal description features of the object to be queried described above, which is not repeated here.
It should be noted that the pre-constructed multi-modal knowledge graph of the embodiments of the present application contains the multi-modal representation features of the full set of objects to be matched. In some optional embodiments, the interaction feature of each object to be matched can be computed in advance from its multi-modal representation features and stored in the multi-modal knowledge graph for later use. In other optional embodiments, when the first interaction feature is calculated, the multi-modal representation features of each object to be matched may be fetched from the multi-modal knowledge graph and the interaction feature computed in real time in the first interaction layer of the query link model, which is not limited in this embodiment.
Optionally, for any object to be matched, the multi-modal representation feature of the object to be matched may include: at least one of the structural feature of the object to be matched, the image feature of the object to be matched, the semantic feature of the name of the object to be matched and the abstract feature of the object to be matched.
An example is described below for any one of the at least one object to be matched. For convenience of description, the interaction feature of this object to be matched is denoted the second interaction feature.
Optionally, the feature interaction operation performed on the first interaction feature and the second interaction feature to obtain a third interaction feature may be implemented in the second interaction layer of the query link model.
Optionally, at the second interaction layer, bidirectional interaction processing may be performed on the first interaction feature and the second interaction feature based on an attention mechanism, so as to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched.
Optionally, as can be seen from the foregoing embodiment, the first interactive feature includes a plurality of feature vectors corresponding to multi-modal description features of the object to be queried, and the second interactive feature includes a plurality of feature vectors corresponding to multi-modal representation features of the object to be matched.
The following will exemplarily describe the bidirectional interaction operation based on the attention mechanism by taking any feature vector included in the first interaction feature as an example. For convenience of description, any one of the feature vectors is referred to as a first feature vector.
Optionally, for the first feature vector, its similarity with each of the feature vectors in the second interaction feature may be calculated to obtain a plurality of bidirectional interaction weights; the features in the second interaction feature are then weighted by these weights to obtain the bidirectional interaction vector corresponding to the first feature vector. A specific example is described below in conjunction with fig. 2c and 2d.
Suppose the first interaction feature comprises the feature vectors X1′, X2′, and X3′, and the second interaction feature comprises the feature vectors Y1′, Y2′, Y3′, and Y4′. As shown in fig. 2c, when the interaction is calculated for feature vector X1′ against the feature vectors of the second interaction feature, the following similarities can be computed: S11′ = S(X1′, Y1′), S12′ = S(X1′, Y2′), S13′ = S(X1′, Y3′), and S14′ = S(X1′, Y4′).
After S11′, S12′, S13′, and S14′ are obtained, they are normalized to yield the four corresponding attention weights A11′, A12′, A13′, and A14′. A weighted sum then gives the bidirectional interaction vector corresponding to X1′: X1″ = Y1′·A11′ + Y2′·A12′ + Y3′·A13′ + Y4′·A14′.
Correspondingly, using the method described above in conjunction with fig. 2c, the bidirectional interaction vectors corresponding to the other feature vectors in the first interaction feature can be calculated.
Then, the bidirectional interaction vectors corresponding to the feature vectors in the first interaction feature can be fused to obtain the bidirectional interaction feature vector of the object to be queried. Optionally, the fusion operation may be implemented as vector concatenation. For example, when the bidirectional interaction vectors corresponding to the feature vectors in the first interaction feature are X1″, X2″, and X3″, the bidirectional interaction feature vector of the object to be queried can be implemented as (X1″, X2″, X3″).
In step 207, the bidirectional interaction feature vector of the object to be matched may be obtained by the same method, which is not repeated here. Continuing the foregoing example, the calculated bidirectional interaction feature vector of the object to be matched can be implemented as (Y1″, Y2″, Y3″, Y4″).
Optionally, fusing the bidirectional interaction feature vector of the object to be queried with that of the object to be matched to obtain the third interaction feature may be done by concatenating the two vectors, as shown in fig. 2d.
For example, continuing the above example, the third interaction feature may be implemented as (X1″, X2″, X3″, Y1″, Y2″, Y3″, Y4″).
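A minimal sketch of the bidirectional interaction is shown below; the dot-product similarity and equal vector dimensions are again assumptions made for illustration:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def cross_interaction(A, B):
    """Re-express each vector of A as an attention-weighted sum over the vectors
    of B; this is one direction of the bidirectional interaction."""
    out = []
    for a in A:
        weights = softmax(B @ a)   # similarities of a with every vector in B
        out.append(weights @ B)    # a'' = sum_k weight_k * B_k
    return np.stack(out)

def third_interaction_feature(query_vecs, match_vecs):
    """Concatenate both directions into (X1'', ..., Xn'', Y1'', ..., Ym'')."""
    X = np.stack(query_vecs)
    Y = np.stack(match_vecs)
    return np.concatenate([cross_interaction(X, Y).ravel(),
                           cross_interaction(Y, X).ravel()])
```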
Through the above steps, feature interaction operations can be performed between the first interaction feature and the interaction feature of each of the at least one object to be matched, yielding at least one third interaction feature.
After the at least one third interaction feature is obtained, steps 208 and 209 may be executed: in the scoring layer of the query link model, the at least one third interaction feature is scored according to the parameters of the scoring layer, yielding the matching degree between the first interaction feature and the interaction feature of each of the at least one object to be matched. In the output layer of the query link model, a target object matching the object to be queried is then determined from the at least one object to be matched according to the at least one calculated matching degree, and is output. The target object may be returned to the user as the query result of the query request.
In this embodiment, the description features of the multiple modalities of the object to be queried interact to produce an interaction feature, and object recognition is performed on that basis, so the description features of different modalities and the interaction relationships among them jointly drive the recognition process, reducing or eliminating the ambiguity of any single modality's features and helping to effectively improve recognition accuracy. In addition, further interaction between the interaction feature of the object to be queried and that of the object to be matched improves the accuracy and reliability of the recognition result.
The above or below embodiments of the present application are applicable to a variety of different object recognition scenarios. As will be exemplified below.
A typical application scenario is a commodity query scenario. When a user queries commodities on an e-commerce platform displayed by the terminal, query description information of the commodity can be entered, for example the textual query description shown in fig. 2e: "a warm shoe". The terminal sends this description to the server, which provides the query result. In practice there are many kinds of warm shoes, such as cotton slippers, snow boots, padded boots, and electrically heated shoes. When the textual query description is ambiguous, the server cannot provide an accurate search result. The server may therefore obtain description information of other modalities for these shoe commodities, such as image description information or commodity detail descriptions, from the pre-established commodity multi-modal knowledge graph. Fig. 2e illustrates the case of image description information. The server can send the images of the various shoe commodities to the terminal for display.
After viewing the images of the shoe commodities on the terminal, the user can select one or more of them according to actual needs. Assume the user selects an image of a cotton slipper, as shown in fig. 2e. The terminal sends the selected image to the server. The server can then take the textual query description and the image of the cotton slipper as the multi-modal description information of the commodity to be queried, and use the query methods provided in the foregoing embodiments to identify the matching commodities. Fig. 2e shows the query results, various cotton-slipper commodities, returned by the server. In this way, even when the commodity query description entered by the user is ambiguous, accurate search results can be returned, improving the query efficiency and the order conversion rate of the e-commerce platform.
Another typical application scenario is a contraband content retrieval scenario. Existing contraband retrieval methods generally rely on analyzing textual information, so some publishers post ambiguous text to evade detection while simultaneously posting pictures that show the prohibited content explicitly, so that their audience still understands it. Published this way, the prohibited content cannot be accurately screened out by text analysis alone. With the query method provided by this embodiment, the suspicious text and the accompanying picture can together serve as the multi-modal description features of the contraband to be retrieved, so the prohibited content can be identified accurately, helping to purify the network environment.
The query linking model provided by the foregoing embodiment can be trained by using the following alternative embodiments, which will be described below with reference to the drawings.
Fig. 3 is a schematic flowchart of a model training method according to an exemplary embodiment of the present application, and as shown in fig. 3, the method includes:
Step 301: acquire multi-modal description features of an object to be queried according to query description information of the object to be queried.
Step 302: perform feature interaction operations on the multi-modal description features of the object to be queried and on the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature.
Step 303: calculate the matching degree between the first interaction feature and the second interaction feature according to the model parameters of the query link model.
Step 304: update the model parameters of the query link model according to the calculated matching degree and the ground-truth matching degree between the object to be queried and the object to be matched, so as to optimize the query link model.
The query description information of the object to be queried can be obtained according to the historical query record of the user. For example, query corpora input when a user initiates a history query operation may be acquired as query description information. And the query description information is used as a training sample of the query link model to optimize the model parameters of the query link model.
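A minimal PyTorch sketch of one such training step is given below. It assumes the interaction features have already been computed, stands in a small MLP for the scoring layer, and treats the ground-truth matching degree as a binary label with a binary cross-entropy loss; all shapes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class ScoringLayer(nn.Module):
    """Stand-in for the query link model's scoring layer: maps the concatenated
    (first, second) interaction features to a matching degree in [0, 1]."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, first_feat, second_feat):
        return torch.sigmoid(self.mlp(torch.cat([first_feat, second_feat], dim=-1)))

model = ScoringLayer(dim=256)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

# One optimization step on a batch of 32 (query, candidate) pairs.
first_feat = torch.randn(32, 128)              # interaction features of queried objects
second_feat = torch.randn(32, 128)             # interaction features of candidates
labels = torch.randint(0, 2, (32, 1)).float()  # ground-truth matching degree

loss = loss_fn(model(first_feat, second_feat), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```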
In some exemplary embodiments, the multi-modal description features of the object to be queried may include: at least one of the semantic features of the name of the object to be queried, the contextual semantic features of the name of the object to be queried, and the image features of the object to be queried. For optional implementations of acquiring these features, refer to the descriptions of the foregoing embodiments, which are not repeated here.
In some exemplary embodiments, the multi-modal representation of the object to be matched may include: at least one of structural features of the object to be matched, image features of the object to be matched, semantic features of the name of the object to be matched and abstract features of the object to be matched.
And the multi-modal representation characteristics of the object to be matched can be stored in the multi-modal knowledge graph. When the multi-modal knowledge graph is constructed, at least one of the following methods can be executed to obtain the multi-modal representation characteristics of the object to be matched:
Method 1: generate the structural features of the object to be matched according to the object to be matched and other objects having a set relationship with it.
Optionally, for each object to be matched in the dataset, triple data may be constructed, composed of a head object h, a relation r, and a tail object t, formally represented as (h, r, t). Other objects having a set relationship with the object to be matched may then be screened from the dataset based on a translation embedding algorithm (TransE): that is, the TransE algorithm searches for relations r and tail objects t satisfying h + r ≈ t. The vector composed of the object to be matched, the set relationship, and the other objects can then be used as the structure vector of the object to be matched.
When other objects having a set relationship with the object to be matched are computed based on TransE, the loss function can be set as a hinge loss:

L = Σ Σ max(γ + E(h, r, t) − E(h′, r, t′), 0)

where the sums run over the positive triples and their corresponding negative triples, γ is the margin of the hinge loss, E(h, r, t) is the energy of a positive triple, and E(h′, r, t′) is the energy of a negative triple. Negative triples may be generated by randomly replacing the head object and/or the tail object.
For example, when the triple corresponding to the object to be matched is (h, r, t), randomly replacing the head object and/or the tail object can generate the following energy terms:
E1 = ||h′ + r − t′||
E2 = ||h + r − t′||
E3 = ||h + r − t||
E4 = ||h′ + r − t||
E5 = ||(h′ + h) + r − (t′ + t)||

where h′ represents a randomly replaced head object and t′ represents a randomly replaced tail object.
E(h′, r, t′) in the hinge loss can be realized as any one of E1 through E5 or as a combination of several of them; this embodiment is not limited. In some embodiments, E(h′, r, t′) may be implemented as the sum of E1 through E5, i.e.:

E(h′, r, t′) = E1 + E2 + E3 + E4 + E5
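A minimal NumPy sketch of this TransE-style objective is shown below; the embedding training loop (gradient updates on the entity and relation vectors) is omitted, and the pairing of each positive triple with one negative triple is an assumption:

```python
import numpy as np

def energy(h, r, t):
    """E(h, r, t) = ||h + r - t||, on embedding vectors."""
    return np.linalg.norm(h + r - t)

def hinge_loss(pos_triples, neg_triples, gamma=1.0):
    """Sum of max(gamma + E(h, r, t) - E(h', r, t'), 0) over paired
    positive/negative triples; negatives come from replacing h and/or t."""
    loss = 0.0
    for (h, r, t), (hn, rn, tn) in zip(pos_triples, neg_triples):
        loss += max(gamma + energy(h, r, t) - energy(hn, rn, tn), 0.0)
    return loss
```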
the method 2 comprises the following steps: the image characteristics of the object to be matched can be calculated according to the image of the object to be matched. The operation of calculating the image features of the object to be matched can be realized based on at least one model of VGGNet and AlexNet Inception Net, and is not described again.
It should be understood that each object to be matched may correspond to a plurality of images, from which a plurality of image feature vectors can be calculated. The cosine similarity between each pair of image feature vectors can then be calculated, and an image feature similarity network constructed from the calculated values. In this network, each node represents an image feature vector, and an edge represents the cosine similarity between the two nodes it connects.
Then, the PageRank algorithm can be used to compute the PageRank value of each node in the image feature similarity network, and the several nodes with the highest PageRank values are selected as the image feature vectors of the object to be matched. Optionally, in some embodiments, bitwise averaging or a bitwise maximum may further be applied to these vectors to obtain an aggregate image feature vector of the object to be matched.
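A minimal sketch of this aggregation using networkx is given below; treating every pairwise cosine similarity as an edge weight (clipped at zero) and averaging the top-k vectors are choices made here for illustration:

```python
import numpy as np
import networkx as nx

def aggregate_image_features(vectors, top_k=3):
    """Build the cosine-similarity network over one object's image feature
    vectors, rank nodes by PageRank, and average the top-k vectors.
    (np.max along axis 0 would give the bitwise-maximum variant.)"""
    V = np.stack(vectors)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    G = nx.Graph()
    G.add_nodes_from(range(len(V)))
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            # clip negative similarities so PageRank edge weights stay valid
            G.add_edge(i, j, weight=max(0.0, float(V[i] @ V[j])))
    ranks = nx.pagerank(G, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return np.mean([vectors[i] for i in top], axis=0)
```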
Method 3: generate a word vector from the name of the object to be matched to obtain the semantic features of the name of the object to be matched.
Method 4: generate a word vector from the abstract information of the object to be matched to obtain the abstract features of the object to be matched.
In methods 3 and 4, the operation of generating the word vector may be implemented based on the word2vec model or the ELMo model, which is not described herein again.
In some exemplary embodiments, performing a feature interaction operation on the multi-modal description features of the object to be queried to obtain the first interaction feature includes: based on an attention mechanism, performing self-interaction processing on the multi-modal description features of the object to be queried in the first interaction layer of the query link model to obtain the self-interaction feature vectors of those description features; and fusing the self-interaction feature vectors to obtain the first interaction feature.
In some exemplary embodiments, performing the attention-based self-interaction processing includes: for the description feature of any one modality among the multi-modal description features of the object to be queried, calculating its similarity with each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature; and weighting the multi-modal description features of the object to be queried by these self-interaction weights to obtain the self-interaction vector of that description feature.
In some exemplary embodiments, calculating the matching degree between the first interaction feature and the second interaction feature according to the model parameters of the query link model includes: in the query link model, performing a feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature; and, in the scoring layer of the query link model, scoring the third interaction feature according to the parameters of the scoring layer to obtain the matching degree between the first interaction feature and the second interaction feature.
In some exemplary embodiments, one way of performing a feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature comprises: on the basis of an attention mechanism, performing bidirectional interaction processing on the first interaction feature and the second interaction feature on a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched; and fusing the bidirectional interaction feature vector of the object to be inquired and the bidirectional interaction feature vector of the object to be matched to obtain the third interaction feature.
In some exemplary embodiments, the first interaction feature comprises a plurality of feature vectors corresponding to the multi-modal description features of the object to be queried, and the second interaction feature comprises a plurality of feature vectors corresponding to the multi-modal representation features of the object to be matched. Correspondingly, performing bidirectional interaction processing on the first interaction feature and the second interaction feature based on an attention mechanism to obtain the bidirectional interaction feature vector of the object to be queried includes: for any feature vector in the first interaction feature, calculating its similarities with the feature vectors in the second interaction feature to obtain a plurality of bidirectional interaction weights; weighting the features in the second interaction feature by these weights to obtain the bidirectional interaction vector of that feature vector; and concatenating the bidirectional interaction vectors of all the feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
In this embodiment, constructing the multi-modal knowledge graph unifies how the objects to be matched are represented at the knowledge level, which facilitates the mutual influence and mutual combination of the representation features of the multiple modalities. The interaction features are calculated based on the multi-modal features of the object to be queried and the object to be matched, and the query link model is trained based on the calculated interaction features, so that the query link model can continuously and automatically learn the interaction relationships among the multi-modal features, as well as the relationship between the interaction features of the object to be queried and those of the object to be matched. This helps the model, in actual application, to accurately identify the object matching the query request according to the multi-modal features that carry these interaction relationships.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of step 201 to step 204 may be device a; for another example, the execution subject of steps 201 and 202 may be device a, and the execution subject of step 203 may be device B; and so on.
In addition, some of the flows described in the above embodiments and the drawings include a plurality of operations in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel; the sequence numbers of the operations, such as 201 and 202, are merely used to distinguish different operations and do not by themselves represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the descriptions of "first", "second", etc. in this document are used to distinguish different messages, devices, modules, and the like; they do not represent a sequential order, nor do they require that the "first" and "second" items be of different types.
Fig. 4 is a schematic structural diagram of a query apparatus according to an exemplary embodiment of the present application. As shown in Fig. 4, the apparatus includes:
a first feature obtaining module 401, configured to: respond to a query request and acquire multi-modal description features of an object to be queried.
A first interaction module 402, configured to: perform a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature.
A matching degree calculating module 403, configured to: calculate the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the respective interaction features of the at least one object to be matched.
A target object determining module 404, configured to: determine, from the at least one object to be matched according to the matching degree, a target object matching the object to be queried.
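As a hedged illustration of how modules 401-404 might cooperate end to end, the following toy pipeline uses random projections as stand-ins for the real feature extraction and interaction layers; every function here is hypothetical, not part of this application:

```python
import numpy as np

rng = np.random.default_rng(0)

def description_features(request_text):          # stand-in for module 401
    # A real module would extract text/image features; this is a placeholder.
    return rng.normal(size=(3, 8))               # 3 modalities, 8-dim each

def interact(feats):                             # stand-in for module 402
    w = np.exp(feats @ feats.T)
    w /= w.sum(axis=1, keepdims=True)
    return (w @ feats).mean(axis=0)

def matching_degree(q_feat, m_feat):             # stand-in for module 403
    return float(q_feat @ m_feat)

def best_match(request_text, candidate_feats):   # module 404's selection
    q = interact(description_features(request_text))
    scores = [matching_degree(q, c) for c in candidate_feats]
    return int(np.argmax(scores))                # highest matching degree

candidates = [rng.normal(size=8) for _ in range(5)]
print(best_match("apple", candidates))
```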
In some exemplary embodiments, when acquiring the multi-modal description features of the object to be queried in response to the query request, the first feature obtaining module 401 is specifically configured to: respond to the query request and acquire first query description information provided by a user; provide the user with at least one piece of other description information having an interaction relationship with the first query description information; in response to a selection operation performed by the user on the at least one piece of other description information, acquire the selected description information as second query description information; and acquire the description feature of a first modality and the description feature of a second modality from the first query description information and the second query description information, respectively, as the multi-modal description features of the object to be queried.
In some exemplary embodiments, the first feature obtaining module 401 is further configured to: acquire, from a multi-modal knowledge graph, a plurality of objects to be matched that are adapted to the first query description information; and select, from the multi-modal representation information corresponding to the plurality of objects to be matched, other representation information belonging to a modality different from that of the first query description information as the at least one piece of other description information.
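For example (purely illustrative; the toy graph and field names below are not from this application), filtering a multi-modal knowledge graph by the textual query and surfacing image-modality representations of the candidates could look like:

```python
kg = [
    {"name": "apple (fruit)",   "image": "apple_fruit.jpg"},
    {"name": "Apple (company)", "image": "apple_logo.jpg"},
    {"name": "banana",          "image": "banana.jpg"},
]

def candidate_images(query_text):
    # Objects to be matched that are adapted to the textual query.
    matches = [e for e in kg if query_text.lower() in e["name"].lower()]
    # Their image-modality representations can be offered to the user as
    # the "other description information" for disambiguation.
    return [e["image"] for e in matches]

print(candidate_images("apple"))  # ['apple_fruit.jpg', 'apple_logo.jpg']
```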
In some exemplary embodiments, the first query description information includes: text description information; the second query description information includes: image description information.
In some exemplary embodiments, when performing the feature interaction operation on the multi-modal description features of the object to be queried to obtain the first interaction feature, the first interaction module 402 is specifically configured to: based on an attention mechanism, perform self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of a query link model to obtain self-interaction feature vectors of the respective multi-modal description features; and fuse the self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
In some exemplary embodiments, when performing self-interaction processing on the multi-modal description features of the object to be queried based on an attention mechanism to obtain the respective self-interaction feature vectors, the first interaction module 402 is specifically configured to: for the description feature of any one modality among the multi-modal description features of the object to be queried, calculate the similarity between that description feature and each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature; and perform weighted calculation on the multi-modal description features of the object to be queried according to the plurality of self-interaction weights to obtain the self-interaction vector of that description feature.
In some exemplary embodiments, the first interaction module 402 is further configured to: for any object to be matched among the at least one object to be matched, perform a feature interaction operation on the multi-modal representation features of the object to be matched to obtain the interaction feature of the object to be matched as a second interaction feature.
In some exemplary embodiments, the matching degree calculating module 403 includes a second interaction module 4031 and a scoring module 4032. When calculating the matching degree between the object to be queried and the at least one object to be matched according to the first interaction feature and the respective interaction features of the at least one object to be matched, the matching degree calculating module 403 is specifically configured to: perform, through the second interaction module 4031, a feature interaction operation on the first interaction feature and the second interaction feature in a query link model to obtain a third interaction feature; and score, through the scoring module 4032, the third interaction feature according to the parameters of the scoring layer at the scoring layer of the query link model to obtain the matching degree between the object to be queried and the object to be matched.
In some exemplary embodiments, when performing the feature interaction operation on the first interaction feature and the second interaction feature to obtain the third interaction feature, the second interaction module 4031 is specifically configured to: based on an attention mechanism, perform bidirectional interaction processing on the first interaction feature and the second interaction feature at a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched; and fuse the two bidirectional interaction feature vectors to obtain the third interaction feature.
In some exemplary embodiments, the first interaction feature comprises a plurality of feature vectors corresponding to the multi-modal description features of the object to be queried, and the second interaction feature comprises a plurality of feature vectors corresponding to the multi-modal representation features of the object to be matched; when performing bidirectional interaction processing on the first interaction feature and the second interaction feature based on an attention mechanism to obtain the bidirectional interaction feature vector of the object to be queried, the second interaction module 4031 is specifically configured to: for any feature vector in the first interaction feature, calculate the similarity between that feature vector and each of the plurality of feature vectors in the second interaction feature to obtain a plurality of bidirectional interaction weights; perform weighted calculation on the plurality of feature vectors in the second interaction feature according to the plurality of bidirectional interaction weights to obtain the bidirectional interaction vector of that feature vector; and splice the respective bidirectional interaction vectors of the plurality of feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
In this embodiment, when the query request is processed, an interaction operation is performed according to the multi-modal description features of the object to be queried to obtain interaction features. The matching degree between the object to be queried and each object to be matched is calculated based on the interaction features of the two, and a target object matching the object to be queried is determined from the objects to be matched according to the matching degree. Because description features of multiple modalities are adopted and interaction operations are performed on them, the description features of different modalities and the interaction relationships among them jointly act on the object recognition process, which reduces or eliminates the ambiguity of single-modality features and effectively improves the accuracy of object recognition.
Fig. 5 is a schematic structural diagram of a model training apparatus according to an exemplary embodiment of the present application. As shown in Fig. 5, the apparatus includes:
a first feature obtaining module 501, configured to: acquire multi-modal description features of an object to be queried according to query description information of the object to be queried.
A first interaction module 502, configured to: perform feature interaction operations on the multi-modal description features of the object to be queried and the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature.
A matching degree calculating module 503, configured to: calculate the matching degree of the first interaction feature and the second interaction feature according to model parameters of a query link model.
A parameter optimization module 504, configured to: update the model parameters according to the matching degree and the true value of the matching degree of the object to be queried and the object to be matched, so as to optimize the query link model.
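A minimal training-step sketch for the parameter update, assuming a sigmoid scoring layer and a binary cross-entropy loss between the predicted matching degree and the 0/1 ground truth (this application does not prescribe the loss; this is one plausible choice):

```python
import numpy as np

def train_step(third_feat, y_true, w, b, lr=0.01):
    """third_feat: (d,) fused interaction feature; y_true: 1.0 if the pair
    truly matches, else 0.0; w, b: scoring-layer parameters."""
    y_pred = 1.0 / (1.0 + np.exp(-(third_feat @ w + b)))
    # For a sigmoid + binary cross-entropy, d(loss)/d(logit) = y_pred - y_true.
    grad_logit = y_pred - y_true
    w = w - lr * grad_logit * third_feat
    b = b - lr * grad_logit
    loss = -(y_true * np.log(y_pred + 1e-9)
             + (1.0 - y_true) * np.log(1.0 - y_pred + 1e-9))
    return w, b, loss
```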
In some exemplary embodiments, the multi-modal description feature of the object to be queried may include: at least one of semantic features of the name of the object to be queried, contextual semantic features of the name of the object to be queried, and image features of the object to be queried.
In some exemplary embodiments, the multi-modal representation features of the object to be matched may include: at least one of the structural feature of the object to be matched, the image feature of the object to be matched, the semantic feature of the name of the object to be matched, and the summary feature of the object to be matched.
In some exemplary embodiments, the apparatus further comprises a second feature obtaining module 505, configured to perform at least one of the following operations to obtain the multi-modal representation features of the object to be matched: generating the structural feature of the object to be matched according to the object to be matched and other objects having a set relationship with it; calculating the image feature of the object to be matched according to the image of the object to be matched; generating a word vector according to the name of the object to be matched to obtain the semantic feature of the name of the object to be matched; and generating a word vector according to the summary information of the object to be matched to obtain the summary feature of the object to be matched.
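Two of these operations are sketched below with deliberately crude stand-ins (averaged toy word vectors for the name's semantic feature, a color histogram for the image feature); a real system would use trained embeddings and a vision model, and all names here are hypothetical:

```python
import numpy as np

VOCAB = {"apple": 0, "fruit": 1, "red": 2}
EMB = np.random.default_rng(0).normal(size=(len(VOCAB), 8))  # toy word vectors

def name_semantic_feature(name):
    # Average the word vectors of the words in the object's name.
    ids = [VOCAB[w] for w in name.lower().split() if w in VOCAB]
    return EMB[ids].mean(axis=0)

def image_feature(pixels):
    """pixels: (H, W, 3) uint8 array; a normalized color histogram serves
    as a crude image feature."""
    hist, _ = np.histogramdd(pixels.reshape(-1, 3), bins=(4, 4, 4),
                             range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()
```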
In some exemplary embodiments, when generating the structural feature of the object to be matched according to the object to be matched and other objects having a set relationship with it, the second feature obtaining module 505 is specifically configured to: determine, based on a translation embedding algorithm, the other objects having the set relationship with the object to be matched; and take the vector formed by the object to be matched, the set relationship, and the other objects as the structural vector of the object to be matched.
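The translation embedding (TransE) idea referenced here models head + relation ≈ tail, so a related object can be found as the entity whose embedding lies closest to that sum; the toy embeddings below are random placeholders, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
entities = {name: rng.normal(size=4) for name in ("apple", "fruit", "tree")}
relations = {"is_a": rng.normal(size=4)}

def related_object(head, relation):
    target = entities[head] + relations[relation]   # head + relation ~ tail
    others = {n: v for n, v in entities.items() if n != head}
    # Nearest entity under L2 distance is the predicted related object.
    return min(others, key=lambda n: np.linalg.norm(others[n] - target))

tail = related_object("apple", "is_a")
# The structural vector can then be the splice of (head, relation, tail).
struct_vec = np.concatenate(
    [entities["apple"], relations["is_a"], entities[tail]])
```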
In some exemplary embodiments, when performing feature interaction on the multi-modal description features of the object to be queried to obtain the first interaction feature, the first interaction module 502 is specifically configured to: based on an attention mechanism, perform self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of the query link model to obtain self-interaction feature vectors of the respective multi-modal description features; and fuse the self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
In some exemplary embodiments, when performing self-interaction processing on the multi-modal description features of the object to be queried based on an attention mechanism to obtain the respective self-interaction feature vectors, the first interaction module 502 is specifically configured to: for the description feature of any one modality among the multi-modal description features of the object to be queried, calculate the similarity between that description feature and each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature; and perform weighted calculation on the multi-modal description features of the object to be queried according to the plurality of self-interaction weights to obtain the self-interaction vector of that description feature.
In some exemplary embodiments, the matching degree calculating module 503 includes a second interaction module 5031 and a scoring module 5032. When calculating the matching degree of the first interaction feature and the second interaction feature according to the model parameters of the query link model, the matching degree calculating module 503 is specifically configured to: perform, through the second interaction module 5031, a feature interaction operation on the first interaction feature and the second interaction feature in the query link model to obtain a third interaction feature; and score, through the scoring module 5032, the third interaction feature according to the parameters of the scoring layer at the scoring layer of the query link model to obtain the matching degree of the first interaction feature and the second interaction feature.
In some exemplary embodiments, when performing the feature interaction operation on the first interaction feature and the second interaction feature to obtain the third interaction feature, the second interaction module 5031 is specifically configured to: based on an attention mechanism, perform bidirectional interaction processing on the first interaction feature and the second interaction feature at a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched; and fuse the two bidirectional interaction feature vectors to obtain the third interaction feature.
In some exemplary embodiments, the first interaction feature comprises a plurality of feature vectors corresponding to the multi-modal description features of the object to be queried, and the second interaction feature comprises a plurality of feature vectors corresponding to the multi-modal representation features of the object to be matched; correspondingly, when performing bidirectional interaction processing on the first interaction feature and the second interaction feature based on an attention mechanism to obtain the bidirectional interaction feature vector of the object to be queried, the second interaction module 5031 is specifically configured to: for any feature vector in the first interaction feature, calculate the similarity between that feature vector and each of the plurality of feature vectors in the second interaction feature to obtain a plurality of bidirectional interaction weights; perform weighted calculation on the plurality of feature vectors in the second interaction feature according to the plurality of bidirectional interaction weights to obtain the bidirectional interaction vector of that feature vector; and splice the respective bidirectional interaction vectors of the plurality of feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
In this embodiment, the interaction features are calculated based on the multi-modal features of the object to be queried and of the object to be matched, and the query link model is trained based on the calculated interaction features, so that the query link model can continuously and automatically learn the interaction relationships among the multi-modal features, as well as the relationship between the interaction features of the object to be queried and those of the object to be matched. This helps the model, in actual application, to accurately identify the object that matches the query request according to the multi-modal features carrying these interaction relationships.
Fig. 6 is a schematic structural diagram of a query device according to an exemplary embodiment of the present application. As shown in Fig. 6, the query device includes: a memory 601 and a processor 602.
The memory 601 is used to store computer programs and may be configured to store various other data to support operations on the query device. Examples of such data include instructions for any application or method operating on the query device, contact data, phonebook data, messages, pictures, videos, and so forth.
A processor 602, coupled to the memory 601, for executing the computer programs in the memory 601 to: respond to a query request and acquire multi-modal description features of an object to be queried; perform a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature; calculate the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the respective interaction features of the at least one object to be matched; and determine, from the at least one object to be matched according to the matching degree, a target object matching the object to be queried.
Further optionally, when responding to the query request and acquiring the multi-modal description features of the object to be queried, the processor 602 is specifically configured to: respond to the query request and acquire first query description information provided by a user; provide the user with at least one piece of other description information having an interaction relationship with the first query description information; in response to a selection operation performed by the user on the at least one piece of other description information, acquire the selected description information as second query description information; and acquire the description feature of a first modality and the description feature of a second modality from the first query description information and the second query description information, respectively, as the multi-modal description features of the object to be queried.
Further optionally, the processor 602 is further configured to: acquire, from a multi-modal knowledge graph, a plurality of objects to be matched that are adapted to the first query description information; and select, from the multi-modal representation information corresponding to the plurality of objects to be matched, other representation information belonging to a modality different from that of the first query description information as the at least one piece of other description information.
Further optionally, the first query description information includes: text description information; and the second query description information includes: image description information.
Further optionally, when performing the feature interaction operation on the multi-modal description features of the object to be queried to obtain the first interaction feature, the processor 602 is specifically configured to: based on an attention mechanism, perform self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of a query link model to obtain self-interaction feature vectors of the respective multi-modal description features; and fuse the self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
Further optionally, when performing self-interaction processing on the multi-modal description features of the object to be queried based on an attention mechanism to obtain the respective self-interaction feature vectors, the processor 602 is specifically configured to: for the description feature of any one modality among the multi-modal description features of the object to be queried, calculate the similarity between that description feature and each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature; and perform weighted calculation on the multi-modal description features of the object to be queried according to the plurality of self-interaction weights to obtain the self-interaction vector of that description feature.
Further optionally, the processor 602 is further configured to: for any object to be matched among the at least one object to be matched, perform a feature interaction operation on the multi-modal representation features of the object to be matched to obtain the interaction feature of the object to be matched as a second interaction feature.
Further optionally, when calculating the matching degree between the object to be queried and the at least one object to be matched according to the first interaction feature and the respective interaction features of the at least one object to be matched, the processor 602 is specifically configured to: in a query link model, perform a feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature; and score the third interaction feature according to the parameters of the scoring layer at the scoring layer of the query link model to obtain the matching degree between the object to be queried and the object to be matched.
Further optionally, when performing the feature interaction operation on the first interaction feature and the second interaction feature to obtain the third interaction feature, the processor 602 is specifically configured to: based on an attention mechanism, perform bidirectional interaction processing on the first interaction feature and the second interaction feature at a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched; and fuse the two bidirectional interaction feature vectors to obtain the third interaction feature.
Further optionally, the first interaction feature includes a plurality of feature vectors corresponding to the multi-modal description features of the object to be queried, and the second interaction feature includes a plurality of feature vectors corresponding to the multi-modal representation features of the object to be matched; when performing bidirectional interaction processing on the first interaction feature and the second interaction feature based on an attention mechanism to obtain the bidirectional interaction feature vector of the object to be queried, the processor 602 is specifically configured to: for any feature vector in the first interaction feature, calculate the similarity between that feature vector and each of the plurality of feature vectors in the second interaction feature to obtain a plurality of bidirectional interaction weights; perform weighted calculation on the plurality of feature vectors in the second interaction feature according to the plurality of bidirectional interaction weights to obtain the bidirectional interaction vector of that feature vector; and splice the respective bidirectional interaction vectors of the plurality of feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
Further, as shown in Fig. 6, the query device further includes: a communication component 603, a display 604, a power component 605, an audio component 606, and the like. Only some of the components are shown schematically in Fig. 6, which does not mean that the query device includes only the components shown in Fig. 6.
In this embodiment, when the query request is processed, an interaction operation is performed according to the multi-modal description features of the object to be queried to obtain interaction features. The matching degree between the object to be queried and each object to be matched is calculated based on the interaction features of the two, and a target object matching the object to be queried is determined from the objects to be matched according to the matching degree. Because description features of multiple modalities are adopted and interaction operations are performed on them, the description features of different modalities and the interaction relationships among them jointly act on the object recognition process, which reduces or eliminates the ambiguity of single-modality features and effectively improves the accuracy of object recognition.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program which, when executed, is capable of implementing the steps that can be executed by the query device in the foregoing method embodiments.
Fig. 7 is a schematic structural diagram of a model training device according to an exemplary embodiment of the present application. As shown in Fig. 7, the model training device includes: a memory 701 and a processor 702.
The memory 701 is used to store computer programs and may be configured to store various other data to support operations on the model training device. Examples of such data include instructions for any application or method operating on the model training device, contact data, phonebook data, messages, pictures, videos, and so forth.
A processor 702, coupled to the memory 701, for executing the computer program in the memory 701 to: acquire multi-modal description features of an object to be queried according to query description information of the object to be queried; perform feature interaction operations on the multi-modal description features of the object to be queried and the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature; calculate the matching degree of the first interaction feature and the second interaction feature according to model parameters of a query link model; and update the model parameters according to the matching degree and the true value of the matching degree of the object to be queried and the object to be matched, so as to optimize the query link model.
Further optionally, the multi-modal description features of the object to be queried may include: at least one of the semantic feature of the name of the object to be queried, the contextual semantic feature of the name of the object to be queried, and the image feature of the object to be queried.
Further optionally, the multi-modal representation features of the object to be matched may include: at least one of the structural feature of the object to be matched, the image feature of the object to be matched, the semantic feature of the name of the object to be matched, and the summary feature of the object to be matched.
Further optionally, the processor 702 is further configured to perform at least one of the following operations to obtain the multi-modal representation features of the object to be matched: generate the structural feature of the object to be matched according to the object to be matched and other objects having a set relationship with it; calculate the image feature of the object to be matched according to the image of the object to be matched; generate a word vector according to the name of the object to be matched to obtain the semantic feature of the name of the object to be matched; and generate a word vector according to the summary information of the object to be matched to obtain the summary feature of the object to be matched.
Further optionally, when generating the structural feature of the object to be matched according to the object to be matched and other objects having a set relationship with it, the processor 702 is specifically configured to: determine, based on a translation embedding algorithm, the other objects having the set relationship with the object to be matched; and take the vector formed by the object to be matched, the set relationship, and the other objects as the structural vector of the object to be matched.
Further optionally, when performing the feature interaction operation on the multi-modal description features of the object to be queried to obtain the first interaction feature, the processor 702 is specifically configured to: based on an attention mechanism, perform self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of the query link model to obtain self-interaction feature vectors of the respective multi-modal description features; and fuse the self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
Further optionally, when performing self-interaction processing on the multi-modal description features of the object to be queried based on an attention mechanism to obtain the respective self-interaction feature vectors, the processor 702 is specifically configured to: for the description feature of any one modality among the multi-modal description features of the object to be queried, calculate the similarity between that description feature and each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature; and perform weighted calculation on the multi-modal description features of the object to be queried according to the plurality of self-interaction weights to obtain the self-interaction vector of that description feature.
Further optionally, when calculating the matching degree of the first interaction feature and the second interaction feature according to the model parameters of the query link model, the processor 702 is specifically configured to: in the query link model, perform a feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature; and score the third interaction feature according to the parameters of the scoring layer at the scoring layer of the query link model to obtain the matching degree of the first interaction feature and the second interaction feature.
Further optionally, when performing the feature interaction operation on the first interaction feature and the second interaction feature to obtain the third interaction feature, the processor 702 is specifically configured to: based on an attention mechanism, perform bidirectional interaction processing on the first interaction feature and the second interaction feature at a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched; and fuse the two bidirectional interaction feature vectors to obtain the third interaction feature.
Further optionally, the first interaction feature includes a plurality of feature vectors corresponding to the multi-modal description features of the object to be queried, and the second interaction feature includes a plurality of feature vectors corresponding to the multi-modal representation features of the object to be matched; correspondingly, when performing bidirectional interaction processing on the first interaction feature and the second interaction feature based on an attention mechanism to obtain the bidirectional interaction feature vector of the object to be queried, the processor 702 is specifically configured to: for any feature vector in the first interaction feature, calculate the similarity between that feature vector and each of the plurality of feature vectors in the second interaction feature to obtain a plurality of bidirectional interaction weights; perform weighted calculation on the plurality of feature vectors in the second interaction feature according to the plurality of bidirectional interaction weights to obtain the bidirectional interaction vector of that feature vector; and splice the respective bidirectional interaction vectors of the plurality of feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
Further, as shown in Fig. 7, the model training device further includes: a communication component 703, a display 704, a power component 705, an audio component 706, and other components. Only some of the components are schematically shown in Fig. 7, which does not mean that the model training device includes only the components shown in Fig. 7.
In this embodiment, the interaction features are calculated based on the multi-modal features of the object to be queried and of the object to be matched, and the query link model is trained based on the calculated interaction features, so that the query link model can continuously and automatically learn the interaction relationships among the multi-modal features, as well as the relationship between the interaction features of the object to be queried and those of the object to be matched. This helps the model, in actual application, to accurately identify the object that matches the query request according to the multi-modal features carrying these interaction relationships.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program which, when executed, is capable of implementing the steps that can be performed by the model training device in the foregoing method embodiments.
The memories of fig. 6 and 7 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication components of fig. 6 and 7 described above are configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in fig. 6 described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply components of fig. 6 and 7 described above provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (20)

1. A method of querying, comprising:
responding to a query request, and acquiring multi-modal description features of an object to be queried;
performing a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature;
calculating a matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and a respective interaction feature of each of the at least one object to be matched;
and according to the matching degree, determining, from the at least one object to be matched, a target object matching the object to be queried.
2. The method of claim 1, wherein obtaining multimodal descriptive features of an object to be queried in response to a query request comprises:
responding to the query request, and acquiring first query description information provided by a user;
providing the user with at least one other descriptive information having an interactive relationship with the first query descriptive information;
responding to the selection operation of the user from at least one other description information, and acquiring the selected description information as second query description information;
and respectively acquiring the description features of the first modality and the description features of the second modality from the first query description information and the second query description information as the multi-modal description features of the object to be queried.
3. The method of claim 2, further comprising:
acquiring a plurality of objects to be matched, which are adapted to the first query description information, from a multi-modal knowledge graph;
and selecting other representation information which belongs to different modalities from the first query description information from the multi-modal representation information corresponding to the plurality of objects to be matched as the at least one other description information.
4. The method of claim 2, wherein the first query description information comprises: text description information; the second query description information includes: image description information.
5. The method according to claim 1, wherein performing feature interaction operation on the multi-modal description feature of the object to be queried to obtain a first interaction feature comprises:
performing, based on an attention mechanism, self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of a query link model to obtain self-interaction feature vectors of the respective multi-modal description features of the object to be queried;
and fusing self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
6. The method of claim 5, wherein performing self-interaction processing on the multi-modal description features of the object to be queried based on an attention mechanism to obtain the respective self-interaction feature vectors of the multi-modal description features of the object to be queried comprises:
for the description feature of any one modality among the multi-modal description features of the object to be queried, calculating the similarity between that description feature and each feature in the multi-modal description features to obtain a plurality of self-interaction weights corresponding to that description feature;
and performing weighted calculation on the multi-modal description features of the object to be queried according to the plurality of self-interaction weights to obtain the self-interaction vector of that description feature.
7. The method of claim 1, further comprising:
and for any object to be matched among the at least one object to be matched, performing a feature interaction operation on the multi-modal representation features of the object to be matched to obtain the interaction feature of the object to be matched as a second interaction feature.
8. The method according to claim 7, wherein calculating the matching degree between the object to be queried and the at least one object to be matched according to the first interactive feature and the respective interactive feature of the at least one object to be matched comprises:
in a query link model, performing feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature;
and at the scoring layer of the query link model, scoring the third interaction feature according to the parameters of the scoring layer to obtain the matching degree between the object to be queried and the object to be matched.
9. The method of claim 8, wherein performing a feature interaction operation on the first interactive feature and the second interactive feature to obtain a third interactive feature comprises:
on the basis of an attention mechanism, performing bidirectional interaction processing on the first interaction feature and the second interaction feature on a second interaction layer in the query link model to obtain a bidirectional interaction feature vector of the object to be queried and a bidirectional interaction feature vector of the object to be matched;
and fusing the bidirectional interaction feature vector of the object to be queried and the bidirectional interaction feature vector of the object to be matched to obtain the third interaction feature.
10. The method according to claim 9, wherein the first interactive feature comprises a plurality of feature vectors corresponding to multi-modal description features of the object to be queried, and the second interactive feature comprises a plurality of feature vectors corresponding to multi-modal representation features of the object to be matched;
based on an attention mechanism, performing bidirectional interaction processing on the first interaction feature and the second interaction feature to obtain a bidirectional interaction feature vector of the object to be queried, including:
for any feature vector in the first interactive feature, calculating the similarity between that feature vector and each of the plurality of feature vectors in the second interactive feature to obtain a plurality of bidirectional interaction weights;
performing weighted calculation on the plurality of feature vectors in the second interactive feature according to the plurality of bidirectional interaction weights to obtain the bidirectional interaction vector of that feature vector;
and splicing the respective bidirectional interaction vectors of the plurality of feature vectors contained in the first interaction feature to obtain the bidirectional interaction feature vector of the object to be queried.
11. A method of model training, comprising:
acquiring multi-modal description characteristics of an object to be queried according to query description information of the object to be queried;
respectively carrying out feature interaction operation on the multi-mode description features of the object to be inquired and the multi-mode representation features of the object to be matched to obtain a first interaction feature and a second interaction feature;
calculating the matching degree of the first interaction characteristic and the second interaction characteristic according to the model parameters of the query link model;
and updating the model parameters according to the matching degree and a true value of the matching degree of the object to be queried and the object to be matched, so as to optimize the query link model.
12. The method according to claim 11, wherein the multi-modal description of the object to be queried comprises: at least one of semantic features of the name of the object to be queried, context semantic features of the name of the object to be queried, and image features of the object to be queried.
13. The method of claim 11, further comprising: performing at least one of the following operations to obtain multi-modal representation characteristics of the object to be matched:
generating the structural characteristics of the object to be matched according to the object to be matched and other objects with set relations;
calculating the image characteristics of the object to be matched according to the image of the object to be matched;
generating a word vector according to the name of the object to be matched to obtain semantic features of the name of the object to be matched;
and generating a word vector according to the summary information of the object to be matched to obtain the summary characteristics of the object to be matched.
14. The method according to claim 13, wherein generating the structural feature of the object to be matched according to the object to be matched and other objects having a set relationship comprises:
calculating other objects having a set relationship with the object to be matched based on a translation embedding algorithm;
and taking a vector formed by the object to be matched, the set relation and the other objects as a structural vector of the object to be matched.
15. The method according to claim 11, wherein performing feature interaction on the multi-modal description feature of the object to be queried to obtain a first interaction feature comprises:
performing, based on an attention mechanism, self-interaction processing on the multi-modal description features of the object to be queried at a first interaction layer of a query link model to obtain self-interaction feature vectors of the respective multi-modal description features of the object to be queried;
and fusing self-interaction feature vectors of the multi-modal description features of the object to be queried to obtain the first interaction feature.
16. The method of claim 11, wherein calculating the matching degree of the first interactive feature and the second interactive feature according to the model parameters of the query link model comprises:
in a query link model, performing feature interaction operation on the first interaction feature and the second interaction feature to obtain a third interaction feature;
and at the scoring layer of the query link model, scoring the third interaction feature according to the parameters of the scoring layer to obtain the matching degree of the first interaction feature and the second interaction feature.
17. A query apparatus, comprising:
a first feature acquisition module, configured to: respond to a query request and acquire multi-modal description features of an object to be queried;
a first interaction module, configured to: perform a feature interaction operation on the multi-modal description features of the object to be queried to obtain a first interaction feature;
a matching degree calculation module, configured to: calculate the matching degree between the object to be queried and at least one object to be matched according to the first interaction feature and the respective interaction features of the at least one object to be matched;
and a target object determination module, configured to: determine, from the at least one object to be matched according to the matching degree, a target object matching the object to be queried.
18. A model training apparatus, comprising:
a first feature acquisition module configured to: acquire multi-modal description features of an object to be queried according to query description information of the object to be queried;
a first interaction module configured to: perform feature interaction operations on the multi-modal description features of the object to be queried and on the multi-modal representation features of an object to be matched, respectively, to obtain a first interaction feature and a second interaction feature;
a matching degree calculation module configured to: calculate the matching degree of the first interaction feature and the second interaction feature according to model parameters of a query link model;
and a parameter optimization module configured to: update the model parameters according to the calculated matching degree and the ground-truth matching degree of the object to be queried and the object to be matched, so as to optimize the query link model.
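With the toy sigmoid scoring head above, claim 18's parameter update reduces to a gradient step on binary cross-entropy between the predicted matching degree and its ground-truth value; for sigmoid plus BCE, the gradient of the loss with respect to the pre-activation is simply (p - truth). The loss choice and update rule are assumptions of this sketch, not the application's training procedure.

    import numpy as np

    def train_step(third_feat: np.ndarray, truth: float, w: np.ndarray,
                   b: float, lr: float = 0.1):
        """One toy update of the scoring-layer parameters toward the
        ground-truth matching degree (sigmoid + binary cross-entropy)."""
        p = 1.0 / (1.0 + np.exp(-(w @ third_feat + b)))
        grad = p - truth                     # dLoss/dz for sigmoid + BCE
        return w - lr * grad * third_feat, b - lr * grad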
19. An electronic device, comprising: a memory and a processor; wherein the memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions to perform the steps of the method of any one of claims 1-16.
20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed, performs the steps of the method of any one of claims 1-16.
CN202010065633.6A 2020-01-20 2020-01-20 Query method, model training method, device, equipment and storage medium Pending CN113139121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065633.6A CN113139121A (en) 2020-01-20 2020-01-20 Query method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113139121A (en) 2021-07-20

Family

ID=76809768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065633.6A Pending CN113139121A (en) 2020-01-20 2020-01-20 Query method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113139121A (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050060232A1 (en) * 2000-10-12 2005-03-17 Maggio Frank S. Method and system for interacting with a writing
CN101479728A (en) * 2006-06-28 2009-07-08 Microsoft Corp Visual and multi-dimensional search
US20080010275A1 (en) * 2006-07-04 2008-01-10 Samsung Electronics Co., Ltd Method, system, and medium for retrieving photo using multimodal information
US20080072234A1 (en) * 2006-09-20 2008-03-20 Gerald Myroup Method and apparatus for executing commands from a drawing/graphics editor using task interaction pattern recognition
CN102402593A (en) * 2010-11-05 2012-04-04 Microsoft Corp Multi-modal approach to search query input
CN103473327A (en) * 2013-09-13 2013-12-25 Guangdong Tutusou Network Technology Co., Ltd. Image retrieval method and image retrieval system
WO2015058332A1 (en) * 2013-10-21 2015-04-30 Microsoft Technology Licensing, Llc Mobile video search
US20190236394A1 (en) * 2015-11-18 2019-08-01 Adobe Inc. Utilizing interactive deep learning to select objects in digital visual media
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN110168531A (en) * 2016-12-30 2019-08-23 Mitsubishi Electric Corp Method and system for multi-modal fusion model
US20190087780A1 (en) * 2017-09-21 2019-03-21 International Business Machines Corporation System and method to extract and enrich slide presentations from multimodal content through cognitive computing
CN110134783A (en) * 2018-02-09 2019-08-16 Alibaba Group Holding Ltd Personalized recommendation method, apparatus, device and medium
CN109359196A (en) * 2018-10-22 2019-02-19 Beijing Baidu Netcom Science and Technology Co., Ltd. Text multimodal representation method and device
CN109885723A (en) * 2019-02-20 2019-06-14 Tencent Technology (Shenzhen) Co., Ltd. Method for generating a dynamic video thumbnail, and model training method and device
CN110188236A (en) * 2019-04-22 2019-08-30 Beijing Dajia Internet Information Technology Co., Ltd. Music recommendation method, apparatus and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUANG LONGTAO ET AL: "A Multimodal Text Matching Model for Obfuscated Language Identification in Adversarial Communication", WWW '19, 13 May 2019 (2019-05-13) *
ZHANG DONGJIE ET AL: "Multimodal Knowledge Learning for Named Entity Disambiguation", EMNLP 2022, 31 December 2022 (2022-12-31) *
ZHUANG YUETING, WU CONGMIAO, WU FEI, LIU XIANG: "Research on a multimedia cross-reference retrieval system", Journal of Computer-Aided Design & Computer Graphics, no. 04, 20 April 2005 (2005-04-20) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547273A (en) * 2022-03-18 2022-05-27 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium
CN114547273B (en) * 2022-03-18 2022-08-16 科大讯飞(苏州)科技有限公司 Question answering method and related device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11200611B2 (en) Computer vision for unsuccessful queries and iterative search
US10394878B2 (en) Associating still images and videos
US11949964B2 (en) Generating action tags for digital videos
US11341186B2 (en) Cognitive video and audio search aggregation
KR101511050B1 (en) Method, apparatus, system and computer program for offering and displaying a product information
US20230005178A1 (en) Method and apparatus for retrieving target
US10013633B1 (en) Object retrieval
KR101017016B1 (en) Method, system and computer-readable recording medium for providing information on goods based on image matching
CN107960125A (en) Select the representative video frame of video
US9330104B2 (en) Indexing and searching heterogenous data entities
US10762150B2 (en) Searching method and searching apparatus based on neural network and search engine
CN111291765A (en) Method and device for determining similar pictures
CN107644036B (en) Method, device and system for pushing data object
CN104620240A (en) Gesture-based search queries
US20190286647A1 (en) Sketch and Style Based Image Retrieval
US20180357519A1 (en) Combined Structure and Style Network
JP6767342B2 (en) Search device, search method and search program
US11461801B2 (en) Detecting and resolving semantic misalignments between digital messages and external digital content
CN113657087B (en) Information matching method and device
CN113139121A (en) Query method, model training method, device, equipment and storage medium
US11727051B2 (en) Personalized image recommendations for areas of interest
KR102393517B1 (en) Method of providing a fulfillment service and service system therefor
US20180157714A1 (en) System, method and non-transitory computer readable storage medium for matching cross-area products
CN110110199B (en) Information output method and device
CN113297452A (en) Multi-level search method, multi-level search device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination