CN110879844B

CN110879844B - Cross-media reasoning method and system based on heterogeneous interactive learning

Info

Publication number: CN110879844B
Application number: CN201911023636.7A
Authority: CN
Inventors: 彭宇新; 黄鑫
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2022-10-14
Anticipated expiration: 2039-10-25
Also published as: CN110879844A

Abstract

The invention relates to a cross-media reasoning method and system based on heterogeneous interactive learning. The method comprises the following steps: 1. A cross-media implication reasoning data set is established, in which the premise includes two different media types and the conclusion includes one media type. 2. A heterogeneous interactive learning network structure is trained on the cross-media implication reasoning data set; the training mainly comprises cross-media interactive attention learning and heterogeneous tensor space construction. 3. The trained heterogeneous interactive learning network is used for reasoning, judging the implication relation between a given premise and conclusion. Compared with the prior art, the method realizes implication reasoning based on premises of different media and improves the accuracy of implication reasoning.

Description

Cross-media reasoning method and system based on heterogeneous interactive learning

Technical Field

The invention relates to the field of multimedia analysis, in particular to a cross-media reasoning method and system based on heterogeneous interactive learning.

Background

Reasoning is a key human ability for perceiving the external world, and implication reasoning is an important basic form of reasoning. Implication reasoning refers to judging whether a conclusion H holds given a premise P. It has wide application value in semantic retrieval, intelligent question answering and other applications. Existing implication reasoning methods focus on text, that is, the case where the premise and the conclusion are each a piece of text, and they emphasize judging the textual similarity between premise and conclusion. However, human knowledge and reasoning often involve vision, language and other senses, and a form of reasoning that relies on text alone greatly limits the breadth and depth of reasoning. Therefore, how to extend the existing text-based form of reasoning to cross-media reasoning involving multiple media has become a key problem for research and application.

The related technology falls into two categories: text implication reasoning and cross-media analysis. In text implication reasoning, the goal is to judge, given a premise P, which of three conditions holds for the conclusion H: it certainly holds (implication), it certainly does not hold (contradiction), or it cannot be judged (irrelevance). As a basic task of natural language processing, text implication reasoning has received extensive attention from researchers. Inference-rule-based approaches, such as that proposed by Mirkin et al. in the document "Source-Language Entailment Modeling for Translating Unknown Terms", attempt to transform the premise into the conclusion through known text rules. The rules involved include inclusion relationships (e.g., dog → animal) and causal relationships (e.g., buy → own), among others. If the conclusion can be obtained from the premise through rule transformation, the premise and the conclusion are in an implication relationship. Bowman et al. propose a deep-network-based method in the document "A large annotated corpus for learning natural language inference", which uses two independent recurrent neural network models to extract text features of the premise and the conclusion and then judges the implication relation through several fully connected layers. However, these methods all take a text premise and a text conclusion as input, so they can only reason about textual implication relationships. This greatly limits the depth and breadth of reasoning.

In cross-media analysis, existing research has focused on the retrieval task. The mainstream method is unified representation learning, in which different media such as images and text are mapped into the same semantic space so that similarity can be measured between their representations. For example, Rasiwasia et al. proposed a high-level semantic mapping method in the document "A New Approach to Cross-Modal Multimedia Retrieval", which maps images and text into the same space using canonical correlation analysis, labels the data according to their categories, and learns the semantics using logistic regression. Ngiam et al., in the document "Multimodal Deep Learning", propose a multimodal autoencoder method, which uses two autoencoders to receive the inputs of two media simultaneously and is trained by minimizing the reconstruction error. The two autoencoders have a shared encoding layer, so that the association between different media can be learned. However, these methods are all aimed at retrieval tasks and emphasize judging the similarity of data from different media; they cannot support the implication reasoning task.

Disclosure of Invention

To address the defects of the prior art, the invention provides a cross-media reasoning method and system based on heterogeneous interactive learning, which can jointly consider premises given in two different media and judge whether the conclusion holds. Through cross-media interactive attention learning and heterogeneous tensor space construction, complementary fine-grained cross-media clues can be fully mined, and comprehensive reasoning is achieved.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A cross-media reasoning method based on heterogeneous interactive learning comprehensively analyzes the reasoning clues contained in different media and judges how likely the conclusion is to hold, thereby realizing cross-media implication reasoning. The method comprises the following steps:

(1) Establishing a cross-media implication reasoning data set, wherein the premise comprises two different media types and the conclusion comprises one media type;

(2) Training a heterogeneous interactive learning network structure on the cross-media implication reasoning data set, mainly comprising cross-media interactive attention learning and heterogeneous tensor space construction;

(3) Reasoning with the trained heterogeneous interactive learning network, and judging the implication relation between the given premise and the conclusion.

Further, in the above cross-media reasoning method based on heterogeneous interactive learning, the media types of the premise in step (1) are text and image; the media type of the conclusion is text.

Further, in the above cross-media reasoning method based on heterogeneous interactive learning, the network structure in step (2) includes two main parts: cross-media interactive attention learning and heterogeneous tensor space construction. The network first generates fine-grained representations for images and text, and then simultaneously mines the reasoning clues among the image premise, the text premise and the conclusion in a tensor space to realize implication reasoning.

Further, in the above cross-media reasoning method based on heterogeneous interactive learning, the implication relation in step (3) is one of implication, contradiction and irrelevance. Reasoning proceeds as follows: the image premise, the text premise and the text conclusion are input simultaneously, the network outputs probability values for the three implication relations, and the relation with the maximum probability is taken as the output result.

Based on the same inventive concept, the invention also provides a cross-media reasoning system based on heterogeneous interactive learning, which comprises:

the data set establishing module is responsible for establishing a cross-media implication reasoning data set, wherein the premises comprise two different media types, and the conclusion comprises one media type;

the network training module is in charge of training a heterogeneous interactive learning network structure by using a cross-media implication reasoning data set, and comprises cross-media interactive attention learning and heterogeneous tensor space construction;

and the reasoning module is used for reasoning by utilizing the trained heterogeneous interactive learning network and judging the implication relation of the given premise and the conclusion.

The invention has the beneficial effects that: compared with the prior art, the implication relationship reasoning based on different media premises can be realized. In addition, through interactive attention learning and heterogeneous tensor space construction, complementary clues of different media are more fully utilized, and accuracy of implication reasoning is improved.

The present invention has the above-described effects because: fine-grained semantic alignment of images-texts and texts-texts is realized through interactive attention learning, and cross-media association relations of different media are fully mined; through the construction of the heterogeneous tensor space, the premises and the conclusions of different media are constructed in the same tensor space, reasoning clues between the premises and the conclusions of the different media can be comprehensively analyzed, and the accuracy of cross-media implication reasoning is improved.

Drawings

FIG. 1 is a flow chart of a cross-media reasoning method based on heterogeneous interactive learning according to the present invention.

Fig. 2 is a schematic diagram of the complete network architecture of the present invention.

FIG. 3 is a schematic diagram of cross-media interactive attention learning in the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific embodiments.

The invention discloses a cross-media reasoning method based on heterogeneous interactive learning, a flow chart of which is shown in figure 1, and the method comprises the following steps:

(1) A cross-media implication reasoning data set is established, wherein the premise includes two different media types and the conclusion includes one media type.

In this embodiment, the two media types in the premise are text and image, and the media type of the conclusion is text. Let Data denote the cross-media implication reasoning data set; then

Data = {(P(I)ⁿ, P(T)ⁿ, hⁿ, eⁿ)}, n = 1, …, N

where N is the number of data items, and (P(I)ⁿ, P(T)ⁿ) and hⁿ form the n-th premise-conclusion pair. P(I)ⁿ denotes the image premise and P(T)ⁿ denotes the text premise. eⁿ denotes the implication relation label of the n-th premise-conclusion pair; it is a 1 × 3 vector in which exactly one dimension is 1 and the rest are 0. The dimension equal to 1 indicates one of the three implication relations: implication, contradiction and irrelevance.
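As an illustration of this data layout, the following minimal Python sketch models one data item and its 1 × 3 one-hot label. The class name `Sample`, the field names, the use of a file path for the image premise, the example strings and the ordering of the three relations in `LABELS` are assumptions made for the example; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed ordering of the three relations inside the 1 x 3 label vector.
LABELS = ("implication", "contradiction", "irrelevance")


@dataclass(frozen=True)
class Sample:
    """One data item: image premise, text premise, text conclusion, label."""
    image_premise: str           # hypothetical: path to the premise image P(I)^n
    text_premise: str            # P(T)^n
    conclusion: str              # h^n
    label: Tuple[int, int, int]  # e^n, exactly one entry equal to 1


def one_hot(relation: str) -> Tuple[int, int, int]:
    """Build the 1 x 3 label vector for a relation name."""
    idx = LABELS.index(relation)
    return tuple(1 if k == idx else 0 for k in range(len(LABELS)))


def label_name(vec: Tuple[int, int, int]) -> str:
    """Map a valid one-hot label vector back to its relation name."""
    if sorted(vec) != [0, 0, 1]:
        raise ValueError("label must contain exactly one 1 and two 0s")
    return LABELS[list(vec).index(1)]


sample = Sample(
    image_premise="images/example.jpg",  # hypothetical file name
    text_premise="A man is playing a guitar on stage.",
    conclusion="A person is performing music.",
    label=one_hot("implication"),
)
```

Keeping the label as a validated one-hot tuple mirrors the "one dimension and only one dimension is 1" constraint stated above.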

(2) The heterogeneous interactive learning network structure is trained by using a cross-media implication reasoning data set, and the method mainly comprises cross-media interactive attention learning and heterogeneous tensor space construction.

The network structure of this step is shown in Fig. 2. In this embodiment, for image input, a VGG19 convolutional neural network is used, and the pool5 layer features are extracted as the local features of the image (each image has v = 49 regions); for text input, the method of the document "Natural Language Inference over Interaction Space" (authors: Yichen Gong, Heng Luo, and Jian Zhang, published at the International Conference on Learning Representations 2018) is used to extract the feature of each word as the local features of the text. For convenience, features are extracted for w = 49 words per text: longer texts are truncated and shorter ones are padded with zeros. The local features of the image are converted by a fully connected layer into representations with the same dimensionality as the local features of the text, and these representations serve as the input of interactive attention learning.
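The length normalization of the text and the projection of the image features can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names, the text feature dimension `d` and the randomly initialized projection weights are hypothetical, random arrays stand in for real VGG19 and word features, and the 49 × 512 shape of the region features assumes VGG19 pool5 output for a 224 × 224 input image.

```python
import numpy as np

W_WORDS = 49    # w in the text: words kept per text
V_REGIONS = 49  # v in the text: image regions (7 x 7 grid assumed)


def pad_or_truncate(word_feats, w=W_WORDS):
    """Return exactly w rows: cut extra words, zero-pad missing ones."""
    n, d = word_feats.shape
    out = np.zeros((w, d), dtype=word_feats.dtype)
    keep = min(n, w)
    out[:keep] = word_feats[:keep]
    return out


def project_regions(region_feats, weight, bias):
    """Fully connected layer mapping region features to the text dimension."""
    return region_feats @ weight + bias


rng = np.random.default_rng(0)
d = 8                                        # hypothetical text feature size
short_text = rng.normal(size=(5, d))         # text with 5 words
long_text = rng.normal(size=(60, d))         # text with 60 words
regions = rng.normal(size=(V_REGIONS, 512))  # stand-in for pool5 features

weight = rng.normal(size=(512, d)) * 0.01    # untrained, for illustration only
bias = np.zeros(d)

text_a = pad_or_truncate(short_text)         # 5 real rows, then zeros
text_b = pad_or_truncate(long_text)          # first 49 rows only
image = project_regions(regions, weight, bias)
```

After these two steps the image and both texts are all 49 × d arrays, which is what lets the later attention matrix and tensors share one shape.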

In this embodiment, cross-media interactive attention learning takes place simultaneously between text and text and between text and image. Its goal is to perform cross coding between the premise and the conclusion and to mine cross-media semantic associations, as shown in Fig. 3. Specifically, let the local features of P(I)ⁿ be [formula not reproduced], where [symbol not reproduced] is the i-th local feature in sequence; similarly, let the local features of P(T)ⁿ be [formula not reproduced] and the local features of hⁿ be [formula not reproduced].

Taking P(I)ⁿ and hⁿ as an example, the interactive attention is expressed as a v × w matrix, where v is the number of image regions and w is the number of text words, and each element is [formula not reproduced], where the symbol "∘" denotes element-wise multiplication of vectors and [symbol not reproduced] denotes a fully connected layer with an output dimension of 1 that takes x as input and has network parameters θ. According to the interactive attention, the cross coding of [symbol not reproduced] is obtained as follows: [formula not reproduced].

Likewise, the cross codings of the text premise and the text conclusion, [symbol not reproduced] and [symbol not reproduced], can be calculated by the same method.
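Because the formulas themselves are not reproduced in this text, the following NumPy sketch is only one plausible reading of the description, not the patented formulation. It assumes that each attention element is a fully connected layer with output dimension 1 applied to the element-wise product of one premise feature and one conclusion feature, and that cross coding is a softmax-weighted sum over the other side (a common choice in attention models, which the source does not confirm). The names `interactive_attention`, `cross_code`, `theta_w` and `theta_b` are hypothetical.

```python
import numpy as np


def interactive_attention(P, H, theta_w, theta_b=0.0):
    """v x w attention matrix: FC layer (output dim 1) on each p_i * h_j."""
    pair = P[:, None, :] * H[None, :, :]  # (v, w, d) element-wise products
    return pair @ theta_w + theta_b       # (v, w)


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_code(A, P, H):
    """Assumed cross coding: attention-weighted sums over the other side."""
    P_tilde = softmax(A, axis=1) @ H    # (v, d): each region attends to words
    H_tilde = softmax(A, axis=0).T @ P  # (w, d): each word attends to regions
    return P_tilde, H_tilde


rng = np.random.default_rng(0)
v, w, d = 49, 49, 8                     # d is a hypothetical feature size
P = rng.normal(size=(v, d))             # image premise local features
H = rng.normal(size=(w, d))             # text conclusion local features
theta_w = rng.normal(size=d)            # untrained FC weights

A = interactive_attention(P, H, theta_w)
P_tilde, H_tilde = cross_code(A, P, H)
```

The same two functions would apply unchanged to the text premise and text conclusion branch, since both inputs have the same shape.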

In this embodiment, the heterogeneous tensor space is constructed by expressing the interaction between premises and conclusions of different media types in the same tensor space. Specifically, taking the image premise and text conclusion branch of the network as an example, the inputs required for the tensor space construction are as follows:

[formula not reproduced]

where the symbol ";" denotes the concatenation of vectors. The tensor of the image premise and text conclusion branch, Tensor(IT), is then obtained:

[formula not reproduced]

Similarly, the tensor Tensor(TT) of the text premise and text conclusion branch can be obtained, and from the two the final heterogeneous tensor space is obtained: Tensor(HT) = [Tensor(IT); Tensor(TT)]. Then a convolutional neural network model (DenseNet in this embodiment) takes the heterogeneous tensor Tensor(HT) as input, a classifier performs probabilistic inference over the implication relations, and the relation with the highest probability is the inference result.
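The construction of Tensor(HT) can be illustrated with a short NumPy sketch. Since the branch formulas are not reproduced in this text, everything below except the final concatenation Tensor(HT) = [Tensor(IT); Tensor(TT)] is an assumption: each local feature is enriched by concatenating it with its cross coding and their element-wise product, and a branch tensor is formed from pairwise element-wise products, in the style of an interaction tensor. The function names and the feature size `d` are hypothetical, and random arrays stand in for real features and cross codings.

```python
import numpy as np


def branch_input(X, X_tilde):
    """Assumed enrichment: [x; x_tilde; x * x_tilde] for every local feature."""
    return np.concatenate([X, X_tilde, X * X_tilde], axis=-1)  # (n, 3d)


def branch_tensor(P_hat, H_hat):
    """Assumed branch tensor: element-wise product of every premise/conclusion pair."""
    return P_hat[:, None, :] * H_hat[None, :, :]               # (v, w, 3d)


def heterogeneous_tensor(tensor_it, tensor_tt):
    """Tensor(HT) = [Tensor(IT); Tensor(TT)], concatenated along the channel axis."""
    return np.concatenate([tensor_it, tensor_tt], axis=-1)


rng = np.random.default_rng(0)
v = w = 49
d = 8  # hypothetical feature size

# Stand-ins for local features and their cross codings in each branch.
img, img_x = rng.normal(size=(v, d)), rng.normal(size=(v, d))
txt, txt_x = rng.normal(size=(w, d)), rng.normal(size=(w, d))
con, con_xi = rng.normal(size=(w, d)), rng.normal(size=(w, d))  # conclusion vs image
con_xt = rng.normal(size=(w, d))                                # conclusion vs text

tensor_it = branch_tensor(branch_input(img, img_x), branch_input(con, con_xi))
tensor_tt = branch_tensor(branch_input(txt, txt_x), branch_input(con, con_xt))
tensor_ht = heterogeneous_tensor(tensor_it, tensor_tt)
```

Because v = w = 49, both branch tensors share the same 49 × 49 spatial layout, so they can be stacked along the channel axis and passed to a convolutional network as a single image-like input.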

(3) The trained heterogeneous interactive learning network is used for reasoning, judging the implication relation between the given premise and the conclusion.

After the deep network has been trained, the implication relation between premise and conclusion can be judged from an input image premise, text premise and text conclusion. Specifically, the image premise, the text premise and the text conclusion are input into the network structure simultaneously, the network outputs a probability value for each of implication, contradiction and irrelevance, and the relation with the maximum probability value is taken as the inference result.
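The final decision rule can be sketched in a few lines of plain Python. The sketch assumes that the classifier emits three raw scores that are turned into probabilities by a softmax, and that the relations are ordered as in `LABELS`; both points, and the function names, are assumptions for illustration.

```python
import math

# Assumed ordering of the classifier outputs.
LABELS = ("implication", "contradiction", "irrelevance")


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer(scores):
    """Return the most probable relation and all three probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


relation, probs = infer([0.1, 2.0, -1.0])  # hypothetical classifier scores
```

Here `relation` is the entry of `LABELS` whose probability is largest, matching the rule that the relation with the maximum probability value is taken as the inference result.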

The following experimental results show that compared with the existing method, the cross-media reasoning method based on heterogeneous interactive learning can obtain higher implication reasoning accuracy.

In this embodiment, experiments were carried out on the implication reasoning data set SNLI, which was proposed in the document "A large annotated corpus for learning natural language inference" (authors: Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning, published at the 2015 Conference on Empirical Methods in Natural Language Processing) and contains 570,000 premise text-conclusion text pairs. Since each premise in the SNLI data set corresponds to an image in Flickr30k, we added this image to construct data consisting of a premise image, a premise text and a conclusion text. We tested the following 4 methods for comparison:

Existing method 1: the 100-d LSTM encoders method in the document "A large annotated corpus for learning natural language inference" (authors: Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning), which extracts text features with two independent recurrent neural networks and then infers the implication relation with several fully connected layers.

Existing method 2: the BiMPM method in the document "Bilateral Multi-Perspective Matching for Natural Language Sentences" (authors: Zhiguo Wang, Wael Hamza, and Radu Florian), which treats implication reasoning as a bidirectional matching problem, from premise to conclusion and from conclusion to premise, and matches the texts from multiple perspectives.

Existing method 3: the DIIN method in the document "Natural Language Inference over Interaction Space" (authors: Yichen Gong, Heng Luo, and Jian Zhang), which extracts features from the text premise and the text conclusion with a self-attention mechanism and then models the association between premise and conclusion in an interaction tensor space to perform implication reasoning.

The invention: the method of this embodiment.

The experiment uses accuracy as the metric for evaluating implication reasoning. Accuracy is the ratio of the number of data items whose implication relation is judged correctly to the total number of data items in the data set. The higher the accuracy, the better the implication reasoning.
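The metric defined above amounts to the following computation. The function name and the toy label lists are hypothetical and serve only to illustrate the ratio; they are not experimental data.

```python
def accuracy(predicted, expected):
    """Fraction of items whose predicted relation equals the annotated one."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Toy example: 3 of the 4 judgments agree with the annotation.
toy_predicted = ["implication", "contradiction", "irrelevance", "implication"]
toy_expected = ["implication", "contradiction", "implication", "implication"]
toy_score = accuracy(toy_predicted, toy_expected)
```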

Table 1. Experimental results of the invention

Method	Accuracy
Existing method 1	77.6%
Existing method 2	86.9%
Existing method 3	88.0%
The invention	90.3%

As can be seen from Table 1, the invention achieves higher implication reasoning accuracy than the existing methods. Existing method 1 has a simple network structure, using only recurrent neural networks and several fully connected layers, so its accuracy is low. Existing methods 2 and 3 adopt a bidirectional matching mechanism and an attention mechanism respectively, and thus obtain higher accuracy. However, all of these methods can only perform implication reasoning on text and cannot use the complementary information provided by images, so their accuracy is limited. In the invention, on the one hand, fine-grained semantic alignment of image-text and text-text is realized through interactive attention learning, fully mining cross-media associations; on the other hand, through heterogeneous tensor space construction, the reasoning clues between premises and conclusions of different media can be comprehensively analyzed, improving the accuracy of cross-media implication reasoning.

Based on the same inventive concept, another embodiment of the present invention provides a cross-media inference system based on heterogeneous interactive learning, which includes:

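the data set establishing module is responsible for establishing a cross-media implication reasoning data set, wherein the premise comprises two different media types, and the conclusion comprises one media type;

the network training module is responsible for training a heterogeneous interactive learning network structure by using the cross-media implication reasoning data set, including cross-media interactive attention learning and heterogeneous tensor space construction;

and the reasoning module is responsible for reasoning by utilizing the trained heterogeneous interactive learning network and judging the implication relation of the given premise and the conclusion.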

Based on the same inventive concept, another embodiment of the present invention provides a computer/server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.

In the foregoing embodiment, the media types of the premise are text and image, and the media type of the conclusion is text. The method of the invention also supports implication reasoning with other media types, for example a premise of image and audio media types with a conclusion of text media type.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A cross-media reasoning method based on heterogeneous interactive learning comprises the following steps:

establishing a cross-media implication reasoning data set, wherein the premises comprise two different media types, and the conclusion comprises one media type;

training a heterogeneous interactive learning network structure by using a cross-media implication reasoning data set, wherein the structure comprises cross-media interactive attention learning and heterogeneous tensor space construction;

reasoning by using the trained heterogeneous interactive learning network, and judging the implication relation of the given premise and the conclusion;

the cross-media interactive attention learning occurs simultaneously between text and text and between text and image, and aims to perform cross coding between the premise and the conclusion and to mine cross-media semantic associations;

assuming that the number of image regions is v and the number of text words is w, the local features of the image premise P(I)ⁿ are [formula not reproduced] and the local features of the text conclusion hⁿ are [formula not reproduced]; the cross-media interactive attention is expressed as a v × w matrix, each element being [formula not reproduced], wherein the symbol "∘" denotes element-wise multiplication of vectors, and [symbol not reproduced] denotes a fully connected layer with an output dimension of 1, taking x as input, with θ denoting the network parameters; according to the interactive attention, the cross coding of the i-th local feature of the image premise P(I)ⁿ is as follows: [formula not reproduced];

similarly, the cross codings of the text premise and the text conclusion, [symbol not reproduced] and [symbol not reproduced], are calculated according to the same method.

2. The method of claim 1, wherein the media types of the premise are text and image; the media type of the conclusion is text; the n-th data item in the cross-media implication reasoning data set is composed of an image premise P(I)ⁿ, a text premise P(T)ⁿ and a text conclusion hⁿ, with an implication relation label eⁿ.

3. The method of claim 1, the heterogeneous interactive learning network structure comprising two main parts: performing cross-media interactive attention learning and constructing a heterogeneous tensor space; the method comprises the steps of firstly generating fine-grained representation of an image and a text, and then simultaneously mining implication relations of image preconditions, text preconditions and conclusions in a heterogeneous tensor space to realize implication reasoning.

4. A method as in claim 3, wherein the heterogeneous tensor space is constructed with the goal of expressing inference cues for the premises and conclusions of different media types in the same tensor space.

5. The method of claim 4, wherein for the image premise and text conclusion branch of the heterogeneous interactive learning network structure, the inputs required for the heterogeneous tensor space construction are as follows:

[formula not reproduced]

wherein [symbol not reproduced] is the i-th local feature of the image premise P(I)ⁿ; [symbol not reproduced] is the i-th local feature of the text conclusion hⁿ; [symbol not reproduced] is the cross coding of the image premise; [symbol not reproduced] is the cross coding of the text conclusion; the symbol "∘" denotes element-wise multiplication of vectors; [symbol not reproduced] denotes a fully connected layer with an output dimension of 1, taking x as input, with θ denoting the network parameters; the symbol ";" denotes concatenation of vectors; then, the tensor of the image premise and text conclusion branch, Tensor(IT), is obtained:

[formula not reproduced]

similarly, the tensor Tensor(TT) of the text premise and text conclusion branch is obtained, and from the two the final heterogeneous tensor space Tensor(HT) = [Tensor(IT); Tensor(TT)] is obtained; then, using a convolutional neural network model, the heterogeneous tensor Tensor(HT) is taken as input, and probabilistic inference of the implication relation is performed by a classifier, the relation with the maximum probability being the inference result.

6. The method of claim 1, wherein reasoning with the trained heterogeneous interactive learning network means that the image premise, the text premise and the text conclusion are simultaneously input into the heterogeneous interactive learning network structure, the network outputs a probability value for each of implication, contradiction and irrelevance, and the item with the maximum probability value is taken as the reasoning result.

7. A cross-media inference system based on heterogeneous interactive learning using the method of any one of claims 1 to 6, comprising:

the network training module is responsible for training a heterogeneous interactive learning network structure by using a cross-media implication reasoning data set, and comprises cross-media interactive attention learning and heterogeneous tensor space construction;

8. A computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 6.