CN113343664A - Method and device for determining matching degree between image texts


Info

Publication number
CN113343664A
CN113343664A
Authority
CN
China
Prior art keywords
image
common sense
information
text
matched
Prior art date
Legal status
Granted
Application number
CN202110724610.6A
Other languages
Chinese (zh)
Other versions
CN113343664B (en)
Inventor
白亚龙 (Bai Yalong)
张炜 (Zhang Wei)
梅涛 (Mei Tao)
Current Assignee
Jingdong Shuke Haiyi Information Technology Co Ltd
Original Assignee
Jingdong Shuke Haiyi Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Shuke Haiyi Information Technology Co Ltd
Priority to CN202110724610.6A
Publication of CN113343664A
Application granted
Publication of CN113343664B
Legal status: Active
Anticipated expiration

Classifications

    All of the following fall under section G (Physics), class G06 (Computing; calculating or counting):
    • G06F 40/194: Handling natural language data; text processing; calculation of difference between files
    • G06F 18/214: Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/044: Computing arrangements based on biological models; neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for determining the matching degree between image texts. One embodiment of the method comprises: determining image feature information of an image to be matched and text feature information of a text to be matched; determining image common sense feature information and text common sense feature information, where the image common sense feature information represents common sense information related to target information in the image to be matched and the text common sense feature information represents common sense information related to the target information in the text to be matched; and determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information. By combining the feature information of the image and the text with the related common sense information, the method improves generalization ability.

Description

Method and device for determining matching degree between image texts
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and a device for determining the matching degree between image texts.
Background
Multimodal content understanding is an important topic in the multimedia and computer vision fields. Within it, cross-modal retrieval between images and texts, i.e. image-text matching, is a challenging research problem with important application value. With the rapid development of deep learning and the growing volume of multimedia data, image-text matching technology has made great progress. The idea behind current mainstream image-text matching methods can be summarized as follows: a deep neural network maps data of the two modalities, images and texts, into a common latent space, where a similarity measure is applied.
Disclosure of Invention
The embodiment of the application provides a method and a device for determining matching degree between image texts.
In a first aspect, an embodiment of the present application provides a method for determining the matching degree between image texts, including: determining image feature information of an image to be matched and text feature information of a text to be matched; determining image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched; and determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information.
In some embodiments, before determining the image common sense feature information and the text common sense feature information, the method further includes: generating logic type common sense feature information through a graph convolution network characterizing the logic type common sense information; and generating, on the basis of the logic type common sense feature information, common sense feature information comprising both the logic type common sense information and the statistical type common sense information through a hypergraph convolution network characterizing the statistical type common sense information. Accordingly, the determining of the image common sense feature information and the text common sense feature information includes: determining the image common sense feature information and the text common sense feature information according to this common sense feature information.
In some embodiments, the determining of the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information includes: combining the common sense feature information and the image common sense feature information to obtain combined image common sense feature information; combining the common sense feature information and the text common sense feature information to obtain combined text common sense feature information; and determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
In some embodiments, the determining of the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information includes: determining a first matching degree between the image feature information and the text feature information, and a second matching degree between the combined image common sense feature information and the combined text common sense feature information; and determining the matching degree between the image to be matched and the text to be matched according to the first matching degree and the second matching degree.
In some embodiments, the determining the image common sense feature information and the text common sense feature information according to the common sense feature information includes: determining target information in the image to be matched according to the image characteristic information; determining target information in the text to be matched according to the text characteristic information; and determining image common sense feature information corresponding to the target information in the image to be matched and text common sense feature information corresponding to the target information in the text to be matched from the common sense feature information.
In some embodiments, the generating the common sense feature information of logical type through the graph convolution network for characterizing the common sense information of logical type includes: determining a data set corresponding to target information in an image to be matched and target information in a text to be matched; and inputting the initialization vector information of each concept in the data set into the graph convolution network to generate logic type common sense characteristic information.
In some embodiments, in the hypergraph characterized by the hypergraph convolution network, the semantic relevance among the multiple concepts connected by a hyperedge is characterized by that hyperedge.
In some embodiments, the determining the image feature information of the image to be matched includes: determining first characteristic information of target information in an image to be matched through a target detection network; and determining the image characteristic information of the image to be matched through the first self-attention network based on the first characteristic information.
In some embodiments, determining text feature information of a text to be matched includes: determining second characteristic information of the text to be matched through a characteristic extraction network; and determining text characteristic information of the text to be matched through a second self-attention network based on the second characteristic information.
In some embodiments, the first self-attention network and the second self-attention network employ a multi-headed self-attention mechanism.
In a second aspect, an embodiment of the present application provides an apparatus for determining the matching degree between image texts, including a first determining unit, a second determining unit and a third determining unit. The first determining unit is configured to determine image feature information of an image to be matched and text feature information of a text to be matched; the second determining unit is configured to determine image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched; and the third determining unit is configured to determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information.
In some embodiments, the above apparatus further comprises: a generating unit configured to generate the logic type common sense feature information through a graph convolution network representing the logic type common sense information; on the basis of the logic type common sense feature information, generating common sense feature information comprising the logic type common sense information and the statistical type common sense information through a hypergraph convolution network representing the statistical type common sense information; and a second determination unit further configured to: and determining the image common sense feature information and the text common sense feature information according to the common sense feature information.
In some embodiments, the third determining unit is further configured to: combine the common sense feature information and the image common sense feature information to obtain combined image common sense feature information; combine the common sense feature information and the text common sense feature information to obtain combined text common sense feature information; and determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
In some embodiments, the third determining unit is further configured to: determine a first matching degree between the image feature information and the text feature information, and a second matching degree between the combined image common sense feature information and the combined text common sense feature information; and determine the matching degree between the image to be matched and the text to be matched according to the first matching degree and the second matching degree.
In some embodiments, the second determining unit is further configured to: determining target information in the image to be matched according to the image characteristic information; determining target information in the text to be matched according to the text characteristic information; and determining image common sense feature information corresponding to the target information in the image to be matched and text common sense feature information corresponding to the target information in the text to be matched from the common sense feature information.
In some embodiments, the generating unit is further configured to: determining a data set corresponding to target information in an image to be matched and target information in a text to be matched; and inputting the initialization vector information of each concept in the data set into the graph convolution network to generate logic type common sense characteristic information.
In some embodiments, in the hypergraph characterized by the hypergraph convolution network, the semantic relevance among the multiple concepts connected by a hyperedge is characterized by that hyperedge.
In some embodiments, the first determining unit is further configured to: determining first characteristic information of target information in an image to be matched through a target detection network; and determining the image characteristic information of the image to be matched through the first self-attention network based on the first characteristic information.
In some embodiments, the first determining unit is further configured to: determining second characteristic information of the text to be matched through a characteristic extraction network; and determining text characteristic information of the text to be matched through a second self-attention network based on the second characteristic information.
In some embodiments, the first self-attention network and the second self-attention network employ a multi-headed self-attention mechanism.
In a third aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method as described in any implementation of the first aspect.
According to the method and the device for determining the matching degree between image texts provided by embodiments of the present application, image feature information of an image to be matched and text feature information of a text to be matched are determined; image common sense feature information and text common sense feature information are determined, where the image common sense feature information represents common sense information related to target information in the image to be matched and the text common sense feature information represents common sense information related to the target information in the text to be matched; and the matching degree between the image to be matched and the text to be matched is determined according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information. A method for determining the matching degree between image texts that combines the feature information of the image and text with related common sense information is thus provided, which improves generalization ability.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for determining the matching degree between image texts according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the method for determining the matching degree between image texts according to the present embodiment;
FIG. 4 is a flow diagram of yet another embodiment of a method for determining the matching degree between image texts according to the present application;
FIG. 5 is a detailed schematic diagram according to an embodiment of the present application;
FIG. 6 is a block diagram of an embodiment of a device for determining the matching degree between image texts according to the present application;
FIG. 7 is a block diagram of a computer system suitable for implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary architecture 100 to which the method and apparatus for determining a degree of matching between image texts of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be hardware devices or software that support network connection for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display and processing, including but not limited to smartphones, tablet computers, e-book readers, laptop portable computers, desktop computers and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example, a background processing server that acquires an image to be matched and a text to be matched, which are sent by the user through the terminal devices 101, 102, and 103, and determines whether the image to be matched and the text to be matched are matched. Optionally, the server may feed back the matching degree result to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the method for determining the matching degree between the image texts provided by the embodiment of the present application may be executed by a server, or may be executed by a terminal device, or may be executed by the server and the terminal device in cooperation with each other. Accordingly, the parts (for example, the units) included in the device for determining the matching degree between the image texts may be all provided in the server, all provided in the terminal device, or provided in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the determination method of the degree of matching between image texts is executed does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the determination method of the degree of matching between image texts is executed.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for determining a degree of match between image text is shown, comprising the steps of:
step 201, determining image feature information of an image to be matched and text feature information of a text to be matched.
In this embodiment, the executing body of the method for determining the matching degree between image texts (for example, the server in FIG. 1) may obtain the image to be matched and the text to be matched remotely or locally through a wired or wireless network connection, and determine the image feature information of the image to be matched and the text feature information of the text to be matched.
The image to be matched and the text to be matched are any image and text whose matching degree is to be determined. When the visual semantic information of the image to be matched is consistent with the textual semantic information of the text to be matched, the two can be considered to match.
As an example, the executing body may perform feature extraction on the image to be matched through an image feature extraction network corresponding to the image to be matched, so as to obtain the image feature information of the image to be matched, and perform feature extraction on the text to be matched through a text feature extraction network corresponding to the text to be matched, so as to obtain the text feature information of the text to be matched. The feature extraction network may be any network model with a feature extraction function, such as a convolutional neural network, a residual network or a recurrent neural network.
In some optional implementations of this embodiment, the executing body may extract the image feature information of the image to be matched as follows:
First, first feature information of the target information in the image to be matched is determined through a target detection network.
The target detection network is used to determine target frames for the target information in the image to be matched and the feature information of the target information in each target frame. The target information may be all the concept information involved in the image to be matched, including target objects (e.g., concepts such as people and objects), state information of the target objects (e.g., if a target object is a person, its state information may be the concept of sleeping), and correlation information between target objects (e.g., the correlation between the target objects 'person' and 'hat' may be the concept of a person wearing a hat).
Then, based on the first feature information, the image feature information of the image to be matched is determined through the first self-attention network.
Through the self-attention network, the information in the image to be matched that receives high attention during the image-text matching operation can be emphasized, thereby improving the accuracy of image-text matching.
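By way of illustration, a minimal PyTorch sketch of such an image branch follows, assuming region features from a pre-trained detector (e.g., Faster R-CNN); the dimensions, module names and the use of nn.MultiheadAttention are illustrative choices rather than the patent's prescribed implementation.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Sketch of the image branch: project detector region features, then
    apply multi-head self-attention (the 'first self-attention network')."""

    def __init__(self, region_dim=2048, embed_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(region_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, region_feats):
        # region_feats: (batch, regions, region_dim), e.g. 36 detector boxes
        x = self.proj(region_feats)           # first feature information
        attended, _ = self.attn(x, x, x)      # emphasise regions that matter
        return attended                       # image feature information

regions = torch.randn(2, 36, 2048)            # assumed detector output
image_feats = ImageEncoder()(regions)         # (2, 36, 512)
```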
In some optional implementations of this embodiment, the executing body may extract the text feature information of the text to be matched as follows:
First, second feature information of the text to be matched is determined through a feature extraction network.
In this implementation, the executing body may input the text to be matched into the feature extraction network, which takes each word of the text as a basic input unit and models the text to obtain the second feature information of the text to be matched.
Then, based on the second feature information, the text feature information of the text to be matched is determined through a second self-attention network.
In this implementation, similarly to the feature extraction process for the image to be matched, the self-attention network emphasizes the information in the text to be matched that receives high attention during the image-text matching operation, thereby improving the accuracy of image-text matching.
In some optional implementations of the present embodiment, the first self-attention network and the second self-attention network employ a multi-head self-attention mechanism.
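A mirror-image sketch of the text branch follows, under the same caveats: the vocabulary size, the bidirectional GRU standing in for the feature extraction network, and the multi-head attention module are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text branch: embed each word (the basic input unit),
    model the sequence, then apply the 'second self-attention network'."""

    def __init__(self, vocab_size=30000, embed_dim=512, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim // 2,
                          batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, token_ids):
        x, _ = self.rnn(self.embed(token_ids))   # second feature information
        attended, _ = self.attn(x, x, x)         # multi-head self-attention
        return attended                          # text feature information

tokens = torch.randint(0, 30000, (2, 12))        # a batch of 12-word texts
text_feats = TextEncoder()(tokens)               # (2, 12, 512)
```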
Step 202, determining image common sense feature information and text common sense feature information.
In this embodiment, the execution subject may determine the image common sense feature information and the text common sense feature information. The image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched.
The common sense information may be any common sense information related to the target information in the image to be matched and the text to be matched. Taking a seat as an example, the corresponding common sense information includes the fact that a seat is a piece of furniture.
As an example, first, the execution subject initializes each concept related to the target information, and determines a vectorized representation of each concept; and then updating the vectorization representation of each concept through a graph convolution network representing the common sense information to obtain the common sense feature information of each concept after the common sense information is fused. And finally, determining concepts (namely target information related to the image to be matched) in the image characteristic information through a concept prediction model, and determining common sense characteristic information corresponding to the concepts in the image characteristic information from the common sense characteristic information of each concept. Wherein the associations between concepts may be characterized by a knowledge graph. The knowledge graph is formed by using concepts as nodes and using the relevance between the concepts as edges.
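By way of illustration, the sketch below runs one graph-convolution step over a toy knowledge graph (concepts as nodes, relevance as edges). The symmetric normalization H' = ReLU(A_hat H W) is a standard GCN formulation assumed here for concreteness; the patent does not prescribe a particular formula, and all sizes are placeholders.

```python
import torch

def gcn_layer(concept_vecs, adjacency, weight):
    """One graph-convolution step, H' = ReLU(A_hat H W), with A_hat the
    symmetrically normalised adjacency plus self-loops (assumed form)."""
    a_hat = adjacency + torch.eye(adjacency.size(0))
    d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
    return torch.relu(d_inv_sqrt @ a_hat @ d_inv_sqrt @ concept_vecs @ weight)

num_concepts, dim = 100, 300
init_vecs = torch.randn(num_concepts, dim)        # initialised concept vectors
edges = (torch.rand(num_concepts, num_concepts) > 0.95).float()
adjacency = ((edges + edges.t()) > 0).float()     # undirected knowledge graph
fused = gcn_layer(init_vecs, adjacency, torch.randn(dim, dim))
# each row of `fused` is a concept vector with common sense information mixed in
```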
In this implementation, after determining the common sense feature information of each concept, the executing entity may determine, from the common sense feature information, image common sense feature information representing common sense information related to the target information related to the image to be matched and text common sense feature information representing common sense information related to the target information related to the text to be matched.
In some optional implementations of this embodiment, before performing step 202, the executing main body may perform the following operations:
first, logical common sense feature information is generated by a graph convolution network representing logical common sense information.
Secondly, on the basis of the logic type common sense feature information, common sense feature information including both the logic type common sense information and the statistical type common sense information is generated through a hypergraph convolution network characterizing the statistical type common sense information. In this implementation, the common sense information includes logic type common sense information and statistical type common sense information. The logic type common sense information is common sense that can be determined directly from everyday knowledge, for example, that humans include men and women. The statistical type common sense information is obtained by statistically analysing the semantic correlations between concepts on the basis of the logic type common sense information, so as to determine further correlation information between concepts. For example, the executing body may count the probability that the concepts of "man" and "woman" appear together in various information, and use this probability value as the weight of the edge between the two concepts to obtain the corresponding statistical type common sense information.
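By way of illustration, the co-occurrence statistics described above could be collected as follows; the toy caption corpus and the use of plain co-occurrence frequency as the edge weight are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# statistical type common sense: the probability that two concepts appear in
# the same piece of information becomes the weight of the edge between them
# (toy corpus; real counts would come from a large caption/text collection)
captions = [
    {"man", "hat", "dog"},
    {"woman", "hat"},
    {"man", "woman", "seat"},
]
pair_counts = Counter()
for concepts in captions:
    pair_counts.update(combinations(sorted(concepts), 2))
weights = {pair: n / len(captions) for pair, n in pair_counts.items()}
print(weights[("hat", "man")])   # 1/3: 'man' and 'hat' co-occur in 1 of 3 items
```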
In order to increase the richness of the statistical common sense information, the execution subject may represent the statistical common sense information in the form of a hypergraph. In the hypergraph, each concept is a node, and a hyperedge can exist between the concept and other concepts, and the hyperedge is determined based on similarity measurement between the concepts characterized by the statistical common sense information.
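A sketch of one hypergraph-convolution step is given below, using the widely used HGNN-style normalization X' = ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta); since the patent does not fix a formula, this particular form, and the randomly built incidence matrix, are assumptions.

```python
import torch

def hypergraph_conv(x, inc, w, theta):
    """One hypergraph-convolution step (assumed HGNN-style normalisation).
    inc: (nodes, hyperedges) incidence matrix; w: hyperedge weights."""
    W = torch.diag(w)
    De_inv = torch.diag(1.0 / inc.sum(dim=0))       # hyperedge degrees
    Dv_inv_sqrt = torch.diag((inc @ w).pow(-0.5))   # weighted node degrees
    return torch.relu(Dv_inv_sqrt @ inc @ W @ De_inv @ inc.t()
                      @ Dv_inv_sqrt @ x @ theta)

num_concepts, dim, num_edges = 100, 300, 20
x = torch.randn(num_concepts, dim)    # e.g. the logic type features from the GCN
inc = torch.zeros(num_concepts, num_edges)
for e in range(num_edges):            # each hyperedge links several concepts
    inc[torch.randperm(num_concepts)[:5], e] = 1.0
inc[:, 0] = 1.0                       # keep every node on at least one hyperedge
out = hypergraph_conv(x, inc, torch.rand(num_edges) + 0.1, torch.randn(dim, dim))
```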
High-order semantic information represented by the statistical common sense information plays an important role in cross-modal semantic inference between images and texts.
In this implementation, the execution subject may determine the image common sense feature information and the text common sense feature information according to the common sense feature information.
Specifically, the execution subject may determine target information in the image to be matched according to the image feature information; determining target information in the text to be matched according to the text characteristic information; and determining image common sense feature information corresponding to the target information in the image to be matched and text common sense feature information corresponding to the target information in the text to be matched from the common sense feature information. The target information in the image feature information may be various concepts related to the image feature information, and the target information in the text feature information may be various concepts related to the text feature information.
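One possible realization of this selection step is sketched below; the linear concept predictor, the mean pooling and the top-k selection are illustrative assumptions, not the patent's prescribed concept prediction model.

```python
import torch
import torch.nn as nn

num_concepts, dim = 100, 512
commonsense = torch.randn(num_concepts, dim)  # fused common sense features
concept_clf = nn.Linear(dim, num_concepts)    # assumed concept prediction model

def select_commonsense(feats, top_k=5):
    """Score which concepts the features involve, then gather the matching
    rows of the common sense feature matrix."""
    scores = concept_clf(feats.mean(dim=0))   # pool regions/words, score concepts
    return commonsense[scores.topk(top_k).indices]

image_cs = select_commonsense(torch.randn(36, dim))  # image common sense features
text_cs = select_commonsense(torch.randn(12, dim))   # text common sense features
```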
In some optional implementations of this embodiment, the executing body may execute the first step by: firstly, determining a data set corresponding to target information in an image to be matched and target information in a text to be matched; then, initialization vector information of each concept in the data set is input into the graph convolution network, and logical common sense feature information is generated.
As an example, the executing entity may divide various concepts in the corpus to obtain data sets of various classifications, and determine a data set including a concept corresponding to the target information in the image to be matched and a concept corresponding to the target information in the text to be matched as a corresponding data set in the present implementation.
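A tiny illustration of assembling such a data set and its initialization vectors follows; the concept list and the random stand-in for pre-trained word embeddings (e.g., GloVe) are placeholders.

```python
import torch

# data set: the concepts touched by the image and the text to be matched
concepts = ["man", "woman", "hat", "seat", "furniture"]   # illustrative
glove = {c: torch.randn(300) for c in concepts}           # stand-in embeddings
init_vecs = torch.stack([glove[c] for c in concepts])     # (5, 300)
# `init_vecs` would be the initialization vector information fed to the
# graph convolution network (see the gcn_layer sketch above)
```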
It is to be understood that, in the present embodiment, the common sense feature information generated by the hypergraph convolution network on the basis of the logic type common sense feature information is the common sense feature information related to the concepts corresponding to the determined data set.
Step 203, determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information.
In this embodiment, the execution main body may determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information, and the text common sense feature information.
As an example, the executing body may fuse the image feature information of the image to be matched with the image common sense feature information to obtain a fused image feature into which the common sense information related to the target information in the image to be matched has been merged; fuse the text feature information of the text to be matched with the text common sense feature information to obtain a fused text feature into which the common sense information related to the target information in the text to be matched has been merged; and then determine the similarity between the fused image feature and the fused text feature, taking this similarity as the matching degree between the image to be matched and the text to be matched. The similarity between vectors can be determined by computing the distance (e.g., Euclidean distance or Manhattan distance) between them.
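As a concrete instance of this fuse-then-measure scheme, the sketch below mean-pools and concatenates each modality's two kinds of features and scores the pair by cosine similarity; these are assumed choices, and the paragraph equally allows a Euclidean or Manhattan distance in place of the cosine measure.

```python
import torch
import torch.nn.functional as F

def matching_degree(img_feats, img_cs, txt_feats, txt_cs):
    """Fuse each modality's features with its common sense features, then
    score the pair by cosine similarity of the fused vectors."""
    img = F.normalize(torch.cat([img_feats.mean(0), img_cs.mean(0)]), dim=0)
    txt = F.normalize(torch.cat([txt_feats.mean(0), txt_cs.mean(0)]), dim=0)
    return torch.dot(img, txt)        # higher value, better image-text match

score = matching_degree(torch.randn(36, 512), torch.randn(5, 512),
                        torch.randn(12, 512), torch.randn(5, 512))
```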
The matching degree determination process shown in steps 201 to 203 may be performed by a matching model. The matching model is obtained by training as follows: first, a training sample set is acquired, in which each training sample comprises a sample image, a sample text and a label indicating whether the sample image and the sample text match; then, training samples are selected from the training sample set, and the image feature information and image common sense feature information corresponding to the sample image in a selected training sample, and the text feature information and text common sense feature information corresponding to the sample text, are determined through the initial matching model; next, the sample matching degree between the sample image and the sample text in the selected training sample is determined according to the image feature information, the image common sense feature information, the text feature information and the text common sense feature information; and the initial matching model is updated based on the target loss between the sample matching degree and the label, until the trained matching model is obtained.
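The patent specifies only "a target loss between the sample matching degree and the label". A bidirectional hinge ranking loss is a common concrete choice for image-text matching and is assumed in the training-step sketch below, where `model` is taken to score every image-text pair in a batch so that matched pairs sit on the diagonal.

```python
import torch

def training_step(model, images, texts, optimizer, margin=0.2):
    """One training step with a bidirectional hinge ranking loss (an assumed
    instantiation of the target loss): matched pairs sit on the diagonal."""
    scores = model(images, texts)                # (B, B) matching degrees
    pos = scores.diag().unsqueeze(1)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    loss = cost_txt.sum() + cost_img.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```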
By combining, on top of the image's and the text's own feature information, the common sense information corresponding to the image to be matched and the text to be matched, this approach addresses a shortcoming of existing matching methods: they attend only to the image's and text's own information and ignore common sense, so the resulting model fits the common data in the training set well but generalizes poorly to rare samples. The generalization ability of the matching model is thereby improved.
In some optional implementations of this embodiment, the executing main body may execute the step 203 by:
firstly, combining the common sense feature information and the image common sense feature information to obtain the combined image common sense feature information.
Secondly, the common sense feature information and the text common sense feature information are combined to obtain the combined common sense feature information.
As an example, the executing entity may perform a hadamard product on the common sense feature information and the image common sense feature information, and perform a hadamard product on the common sense feature information and the text common sense feature information to obtain the combined image common sense feature information and the combined common sense feature information, respectively.
Thirdly, the matching degree between the image to be matched and the text to be matched is determined according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
In some optional implementations of this embodiment, the executing body may execute this third step as follows: firstly, determining a first matching degree between the image feature information and the text feature information, and a second matching degree between the combined image common sense feature information and the combined text common sense feature information; then, determining the matching degree between the image to be matched and the text to be matched according to the first matching degree and the second matching degree. For example, the first matching degree and the second matching degree may be averaged with weights to obtain the final matching degree between the image to be matched and the text to be matched.
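Putting the three sub-steps together, a minimal sketch follows; the pooled one-vector shapes and the fixed weight alpha are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def combined_matching_degree(img_feats, txt_feats, cs, img_cs, txt_cs, alpha=0.5):
    """All inputs are pooled vectors of equal length. Combine by Hadamard
    product, then fuse the two matching degrees by a weighted average."""
    combined_img_cs = cs * img_cs                 # Hadamard product
    combined_txt_cs = cs * txt_cs
    first = F.cosine_similarity(img_feats, txt_feats, dim=0)
    second = F.cosine_similarity(combined_img_cs, combined_txt_cs, dim=0)
    return alpha * first + (1 - alpha) * second   # assumed weighting

d = 512
score = combined_matching_degree(torch.randn(d), torch.randn(d), torch.randn(d),
                                 torch.randn(d), torch.randn(d))
```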
With continued reference to FIG. 3, FIG. 3 is a schematic diagram 300 of an application scenario of the method for determining the matching degree between image texts according to the present embodiment. In the application scenario of FIG. 3, the server first obtains an image to be matched 301 and a text to be matched 302. The server then determines image feature information 303 of the image to be matched 301 and text feature information 304 of the text to be matched 302 through a feature extraction network. Next, the server determines image common sense feature information 305 representing common sense information related to the target information in the image to be matched 301, and text common sense feature information 306 representing common sense information related to the target information in the text to be matched 302. Finally, the server determines the matching degree between the image to be matched and the text to be matched according to the image feature information 303, the text feature information 304, the image common sense feature information 305 and the text common sense feature information 306.
In the method provided by this embodiment of the present application, image feature information of an image to be matched and text feature information of a text to be matched are determined; image common sense feature information and text common sense feature information are determined, where the image common sense feature information represents common sense information related to target information in the image to be matched and the text common sense feature information represents common sense information related to the target information in the text to be matched; and the matching degree between the image to be matched and the text to be matched is determined according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information. A method for determining the matching degree between image texts that combines the image and text feature information with related common sense information is thus provided, improving generalization ability.
With continuing reference to FIG. 4, a flow 400 of yet another embodiment of a method for determining the matching degree between image texts according to the present application is shown, comprising the steps of:
step 401, generating logic type common sense feature information through a graph convolution network representing logic type common sense information.
Step 402, generating, on the basis of the logic type common sense feature information, common sense feature information comprising the logic type common sense information and the statistical type common sense information through a hypergraph convolution network characterizing the statistical type common sense information.
Step 403, determining first feature information of the target information in the image to be matched through a target detection network.
Step 404, determining image feature information of the image to be matched through the first self-attention network based on the first feature information.
Step 405, determining second feature information of the text to be matched through a feature extraction network.
Step 406, determining text feature information of the text to be matched through a second self-attention network based on the second feature information.
Step 407, determining image common sense feature information and text common sense feature information according to the common sense feature information.
The image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched.
Step 408, combining the common sense feature information and the image common sense feature information to obtain combined image common sense feature information.
Step 409, combining the common sense feature information and the text common sense feature information to obtain combined text common sense feature information.
Step 410, determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
As shown in FIG. 5, a detailed schematic diagram of the method for determining the matching degree between image texts according to the present embodiment is given. First, the initialized vector of each concept in the corpus 501 passes through the graph convolution network 502 characterizing logic type common sense information and the hypergraph convolution network 503 characterizing statistical type common sense information, yielding for each concept common sense feature information in which the logic type and statistical type common sense information are fused. Then, the image to be matched 504 passes in turn through the target detection network 505 and the first self-attention network 506, yielding the image feature information of the image 504. Based on the image feature information and the common sense feature information of each concept, the first concept prediction model 507 obtains the image common sense feature information, and the common sense feature information and the image common sense feature information are combined to obtain the combined image common sense feature information. Meanwhile, the text to be matched 508 passes in turn through the feature extraction network 509 and the second self-attention network 510, yielding the text feature information of the text 508. Based on the text feature information and the common sense feature information of each concept, the second concept prediction model 511 obtains the text common sense feature information, and the common sense feature information and the text common sense feature information are combined to obtain the combined text common sense feature information. Finally, the matching degree between the image to be matched and the text to be matched is determined according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for determining the matching degree between image texts in this embodiment specifically describes an obtaining process of feature information, an obtaining process of common sense feature information, and a determining process of the matching degree between image texts, so that the generalization ability and accuracy of the determination of the matching degree are further improved.
With continuing reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for determining the matching degree between image texts; this apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.
As shown in FIG. 6, the apparatus for determining the matching degree between image texts includes: a first determining unit 601 configured to determine image feature information of an image to be matched and text feature information of a text to be matched; a second determining unit 602 configured to determine image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched; and a third determining unit 603 configured to determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information.
In some embodiments, the above apparatus further comprises: a generating unit (not shown in the figure) configured to generate the common sense feature information of logical type through a graph convolution network characterizing the common sense information of logical type; on the basis of the logic type common sense feature information, generating common sense feature information comprising the logic type common sense information and the statistical type common sense information through a hypergraph convolution network representing the statistical type common sense information; and a second determining unit 602, further configured to: and determining the image common sense feature information and the text common sense feature information according to the common sense feature information.
In some embodiments, the third determining unit 603 is further configured to: combine the common sense feature information and the image common sense feature information to obtain combined image common sense feature information; combine the common sense feature information and the text common sense feature information to obtain combined text common sense feature information; and determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information and the combined text common sense feature information.
In some embodiments, the third determining unit 603 is further configured to: determine a first matching degree between the image feature information and the text feature information, and a second matching degree between the combined image common sense feature information and the combined text common sense feature information; and determine the matching degree between the image to be matched and the text to be matched according to the first matching degree and the second matching degree.
In some embodiments, the second determining unit 602 is further configured to: determining target information in the image to be matched according to the image characteristic information; determining target information in the text to be matched according to the text characteristic information; and determining image common sense feature information corresponding to the target information in the image to be matched and text common sense feature information corresponding to the target information in the text to be matched from the common sense feature information.
In some embodiments, the generating unit (not shown in the figures) is further configured to: determining a data set corresponding to target information in an image to be matched and target information in a text to be matched; and inputting the initialization vector information of each concept in the data set into the graph convolution network to generate logic type common sense characteristic information.
In some embodiments, in the hypergraph characterized by the hypergraph convolution network, the semantic relevance among the multiple concepts connected by a hyperedge is characterized by that hyperedge.
In some embodiments, the first determining unit 601 is further configured to: determining first characteristic information of target information in an image to be matched through a target detection network; and determining the image characteristic information of the image to be matched through the first self-attention network based on the first characteristic information.
In some embodiments, the first determining unit 601 is further configured to: determining second characteristic information of the text to be matched through a characteristic extraction network; and determining text characteristic information of the text to be matched through a second self-attention network based on the second characteristic information.
In some embodiments, the first self-attention network and the second self-attention network employ a multi-headed self-attention mechanism.
In this embodiment, the first determining unit of the apparatus for determining the matching degree between image texts is configured to determine image feature information of an image to be matched and text feature information of a text to be matched; the second determining unit is configured to determine image common sense feature information and text common sense feature information, where the image common sense feature information represents common sense information related to target information in the image to be matched and the text common sense feature information represents common sense information related to the target information in the text to be matched; and the third determining unit is configured to determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information. An apparatus for determining the matching degree between image texts that combines the image and text feature information with related common sense information is thus provided, improving generalization ability.
Referring now to FIG. 7, a block diagram of a computer system 700 suitable for implementing devices of embodiments of the present application (e.g., the devices 101, 102, 103 and 105 shown in FIG. 1) is shown. The device shown in FIG. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 7, the computer system 700 includes a processor (e.g., a CPU, central processing unit) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the system 700. The processor 701, the ROM 702 and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the processor 701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the client computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including a first determining unit, a second determining unit, and a third determining unit. The names of these units do not, in some cases, limit the units themselves; for example, the third determining unit may also be described as "a unit that determines the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information, and the text common sense feature information".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: determine image feature information of an image to be matched and text feature information of a text to be matched; determine image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched; and determine the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information, and the text common sense feature information.
The above description is only a preferred embodiment of the present application and an illustration of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of the above features, and also covers other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention; for example, arrangements in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (12)

1. A method for determining a matching degree between image texts, comprising the following steps:
determining image feature information of an image to be matched and text feature information of a text to be matched;
determining image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched;
and determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information, and the text common sense feature information.
2. The method of claim 1, wherein, prior to the determining image common sense feature information and text common sense feature information, the method further comprises:
generating logic-type common sense feature information through a graph convolution network characterizing the logic-type common sense information;
generating, on the basis of the logic-type common sense feature information, common sense feature information comprising the logic-type common sense information and statistical-type common sense information through a hypergraph convolution network characterizing the statistical-type common sense information;
and
the determining image common sense feature information and text common sense feature information includes:
and determining the image common sense feature information and the text common sense feature information according to the common sense feature information.
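The claim does not fix the exact propagation rule of the graph convolution network; purely as an illustration, the following sketch uses the well-known rule of Kipf and Welling (2017) over a tiny invented concept graph, where nodes are common sense concepts and edges are logic-type relations between them.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)  # adjacency over 3 invented concepts
X = np.random.randn(3, 8)               # initialization vectors (dim 8 assumed)
W = np.random.randn(8, 8)               # learnable weight (random stand-in)

A_hat = A + np.eye(3)                    # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
# One GCN layer with ReLU: the logic-type common sense feature information.
logic_cs_features = np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)
print(logic_cs_features.shape)           # (3, 8)
```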
3. The method according to claim 2, wherein the determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information and the text common sense feature information comprises:
combining the common sense feature information with the image common sense feature information to obtain combined image common sense feature information;
combining the common sense feature information with the text common sense feature information to obtain combined text common sense feature information;
and determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information, and the combined text common sense feature information.
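The claim does not specify the combination operator; concatenation along the feature dimension, sketched below with invented dimensions, is one common and assumed choice (element-wise addition would be an equally plausible alternative).

```python
import torch

common_sense_info = torch.randn(36, 256)  # shared common sense features
image_cs_info = torch.randn(36, 256)      # image-side common sense features
text_cs_info = torch.randn(36, 256)       # text-side common sense features

# Combined features double the dimension under the concatenation assumption.
combined_image_cs = torch.cat([common_sense_info, image_cs_info], dim=-1)
combined_text_cs = torch.cat([common_sense_info, text_cs_info], dim=-1)
print(combined_image_cs.shape)            # torch.Size([36, 512])
```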
4. The method according to claim 3, wherein the determining the matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the combined image common sense feature information, and the combined text common sense feature information comprises:
determining a first matching degree between the image feature information and the text feature information, and a second matching degree between the combined image common sense feature information and the combined text common sense feature information;
and determining the matching degree between the image to be matched and the text to be matched according to the first matching degree and the second matching degree.
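One plausible way to combine the two matching degrees is a convex combination; the weight lam below is a hypothetical hyperparameter, not something specified by the claims.

```python
def overall_matching_degree(m1: float, m2: float, lam: float = 0.5) -> float:
    # lam trades off feature-level similarity (m1) against
    # common-sense-level similarity (m2); 0.5 is an assumed default.
    return lam * m1 + (1.0 - lam) * m2

print(overall_matching_degree(0.82, 0.64))  # 0.73
```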
5. The method of claim 2, wherein the determining the image common sense feature information and the text common sense feature information from the common sense feature information comprises:
determining target information in the image to be matched according to the image feature information;
determining target information in the text to be matched according to the text feature information;
and determining image common sense feature information corresponding to the target information in the image to be matched and text common sense feature information corresponding to the target information in the text to be matched from the common sense feature information.
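Purely as an illustration of this selection step, the sketch below assumes that the common sense feature information is indexed by concept name; all names and vectors are invented.

```python
# Hypothetical common sense feature table keyed by concept.
common_sense_features = {
    "dog": [0.12, 0.80, 0.33],
    "frisbee": [0.51, 0.07, 0.64],
    "grass": [0.22, 0.45, 0.18],
}

image_targets = ["dog", "frisbee"]  # targets found via image feature info
text_targets = ["dog", "grass"]     # targets found via text feature info

# Image/text common sense feature information for the detected targets.
image_cs = [common_sense_features[t] for t in image_targets]
text_cs = [common_sense_features[t] for t in text_targets]
```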
6. The method of claim 2, wherein the generating logic-type common sense feature information through the graph convolution network characterizing the logic-type common sense information comprises:
determining a data set corresponding to the target information in the image to be matched and the target information in the text to be matched;
and inputting initialization vector information of each concept in the data set into the graph convolution network to generate the logic-type common sense feature information.
7. The method of claim 6, wherein hyperedges in the hypergraph characterized by the hypergraph convolution network represent semantic correlations between the concepts connected by the hyperedges.
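A well-known instantiation of such a hypergraph convolution, assumed here only for illustration, is the rule of Feng et al., "Hypergraph Neural Networks" (2019); each hyperedge jointly connects several semantically correlated concepts, e.g. concepts that statistically co-occur.

```python
import numpy as np

# Incidence matrix: H[v, e] = 1 if concept v belongs to hyperedge e.
H = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0]], dtype=float)  # 4 invented concepts, 2 hyperedges
X = np.random.randn(4, 8)            # concept features (dim 8 assumed)
Theta = np.random.randn(8, 8)        # learnable weight (random stand-in)

Dv = np.diag(H.sum(axis=1))          # vertex degrees
De = np.diag(H.sum(axis=0))          # hyperedge degrees
Dv_is = np.linalg.inv(np.sqrt(Dv))   # D_v^{-1/2}

# One hypergraph convolution layer (unit hyperedge weights assumed).
X_out = Dv_is @ H @ np.linalg.inv(De) @ H.T @ Dv_is @ X @ Theta
print(X_out.shape)                   # (4, 8)
```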
8. The method of claim 1, wherein the determining image feature information of the image to be matched comprises:
determining first feature information of target information in the image to be matched through a target detection network;
and determining, based on the first feature information, the image feature information of the image to be matched through a first self-attention network.
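For illustration only, the sketch below stands in for this image path with an off-the-shelf torchvision detector and a multi-headed self-attention layer; the detector choice, the random region features, and all dimensions are assumptions, and a real system would pool region features from the detector's feature maps.

```python
import torch
import torch.nn as nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Off-the-shelf detector as a stand-in target detection network;
# weights=None avoids any download, at the cost of meaningless detections.
detector = fasterrcnn_resnet50_fpn(weights=None)
detector.eval()

image = torch.rand(3, 480, 640)  # random stand-in for the image to be matched
with torch.no_grad():
    detections = detector([image])[0]  # dict with "boxes", "labels", "scores"

# Stand-in first feature information: one 256-d vector per detected region.
num_regions = max(len(detections["boxes"]), 1)
first_feature_info = torch.randn(1, num_regions, 256)

# First self-attention network over the region features.
self_attention = nn.MultiheadAttention(embed_dim=256, num_heads=4,
                                       batch_first=True)
image_feature_info, _ = self_attention(first_feature_info,
                                       first_feature_info,
                                       first_feature_info)
```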
9. The method of claim 1, wherein determining text feature information of the text to be matched comprises:
determining second feature information of the text to be matched through a feature extraction network;
and determining, based on the second feature information, the text feature information of the text to be matched through a second self-attention network.
10. An apparatus for determining a matching degree between image texts, comprising:
the image matching device comprises a first determining unit, a second determining unit and a matching unit, wherein the first determining unit is configured to determine image characteristic information of an image to be matched and text characteristic information of a text to be matched;
a second determining unit configured to determine image common sense feature information and text common sense feature information, wherein the image common sense feature information represents common sense information related to target information in the image to be matched, and the text common sense feature information represents common sense information related to the target information in the text to be matched;
a third determining unit configured to determine a matching degree between the image to be matched and the text to be matched according to the image feature information, the text feature information, the image common sense feature information, and the text common sense feature information.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-9.
12. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110724610.6A CN113343664B (en) 2021-06-29 2021-06-29 Method and device for determining matching degree between image texts

Publications (2)

Publication Number Publication Date
CN113343664A (en) 2021-09-03
CN113343664B CN113343664B (en) 2023-08-08

Family

ID=77481334

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070065040A1 (en) * 2005-09-22 2007-03-22 Konica Minolta Systems Laboratory, Inc. Photo image matching method and apparatus
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108288067A (en) * 2017-09-12 2018-07-17 腾讯科技(深圳)有限公司 Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN109543690A (en) * 2018-11-27 2019-03-29 北京百度网讯科技有限公司 Method and apparatus for extracting information
WO2020122456A1 (en) * 2018-12-12 2020-06-18 주식회사 인공지능연구원 System and method for matching similarities between images and texts
CN111897950A (en) * 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN112148839A (en) * 2020-09-29 2020-12-29 北京小米松果电子有限公司 Image-text matching method and device and storage medium
CN112287738A (en) * 2020-04-20 2021-01-29 北京沃东天骏信息技术有限公司 Text matching method and device for graphic control, medium and electronic equipment
WO2021052358A1 (en) * 2019-09-16 2021-03-25 腾讯科技(深圳)有限公司 Image processing method and apparatus, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李志欣 (Li Zhixin) et al.: "Cross-media image-text retrieval fusing two-level similarity" (融合两级相似度的跨媒体图像文本检索), 《电子学报》 (Acta Electronica Sinica) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392389A (en) * 2022-09-01 2022-11-25 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN115392389B (en) * 2022-09-01 2023-08-29 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113343664B (en) 2023-08-08

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing
Applicant after: Jingdong Technology Information Technology Co.,Ltd.
Address before: 100176 601, 6th floor, building 2, No. 18, Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing
Applicant before: Jingdong Shuke Haiyi Information Technology Co.,Ltd.
GR01 Patent grant