CN110889505B - Cross-media comprehensive reasoning method and system for image-text sequence matching - Google Patents
- Publication number
- CN110889505B (application CN201911127259.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- cross
- reasoning
- media
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a cross-media comprehensive reasoning method and system for image-text sequence matching. The method comprises the following steps: 1. Establish a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text. 2. Train the image-text sequence matching depth network on the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training. 3. Perform reasoning with the trained image-text sequence matching network and judge the implication relation between the given premise and conclusion. Compared with the prior art, the method can perform implication reasoning from image and text premises and migrates the association knowledge implied in the retrieval task to the reasoning task network, thereby improving reasoning accuracy.
Description
Technical Field
The invention relates to the field of multimedia analysis, in particular to a cross-media comprehensive reasoning method and system for image-text sequence matching.
Background
Humans can infer new conclusions from known facts, an ability that is critical for perceiving and understanding the outside world. Implication reasoning is a basic and important form of reasoning: it judges whether a conclusion H follows from a premise P, that is, whether the logical expression "P → H" holds. It is a supporting technology for intelligent human-machine interaction and has wide application value in intelligent question answering, knowledge extraction and other applications. Existing methods mainly study text implication reasoning, in which both the conclusion H and the premise P are pieces of text. However, human reasoning is a cross-media process involving senses such as vision and language, and the prior art cannot handle reasoning whose premises span multiple media (such as images and text). Breaking through the limitation of a single media type and realizing comprehensive reasoning with premises from multiple media has therefore become an important problem in extending the breadth and depth of reasoning.
Existing techniques for text implication reasoning mainly fall into two categories. The first is rule-based reasoning, whose idea is to transform the premise text into the conclusion text using a given set of rules. Rule sets typically encode relations such as inclusion and cause-and-effect, e.g., dog → animal, buy → own. For example, the method proposed by Mirkin et al. in "Source-Language Entailment Modeling for Translating Unknown Terms" applies rule transformations to words and phrases in the premise and the conclusion to see whether a match can be completed. Since rules are mostly defined manually, they hardly cover the reasoning phenomena of the real world, so rule-based methods generalize poorly. The second category is based on feature extraction and is the mainstream approach in existing research. Its basic idea is to extract feature vectors from the premise and the conclusion and then judge the relation between the two vectors with a classifier. For example, Bowman et al. in "A Large Annotated Corpus for Learning Natural Language Inference" propose to encode the premise and the conclusion into feature vectors with a recurrent neural network and then classify the implication relation with fully connected layers. Because sequence information is very important for judging the text implication relation, the idea of text sequence matching has also been applied in some existing methods and improves accuracy. However, the above methods are still limited to text reasoning and cannot handle implication reasoning whose premises involve multiple media (such as images and text).
In cross-media analysis, the main challenge is the "heterogeneity gap": different media such as images and text have inconsistent representations, so cross-media similarity is difficult to measure and semantic associations are difficult to establish. The existing mainstream solution is cross-media unified representation, i.e., mapping different media into the same semantic space. For example, the high-level semantic mapping method proposed by Rasiwasia et al. in "A New Approach to Cross-Modal Multimedia Retrieval" first maps images and text into the same space with canonical correlation analysis according to their paired relations, and then obtains semantic representations with logistic regression using the class labels of the training set. The bimodal deep autoencoder (Ngiam et al., "Multimodal Deep Learning") is a representative deep cross-media analysis model: based on the deep Boltzmann machine, it builds one autoencoder for each of the audio and video media, and the two autoencoders share the same encoding output layer, which captures their common semantics. However, these methods focus on learning cross-media association relations in order to compute cross-media similarity, and cannot support implication reasoning tasks.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-media comprehensive reasoning method and system based on image-text sequence matching. By modeling the implication reasoning task as a sequence matching problem and combining it with a cross-task, cross-media migration training method, implication reasoning over image and text premises can be realized; the association knowledge implied in the retrieval task is migrated to the implication reasoning network, improving reasoning accuracy.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a cross-media comprehensive reasoning method for image-text sequence matching is used for carrying out sequence coding on images and preconditions and conclusions of the images and carrying out sequence matching of fine-granularity reasoning clues so as to realize implication reasoning based on the preconditions of the images and the texts. The method comprises the following steps:
(1) Establishing a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text;
(2) Training an image-text sequence matching depth network with the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training;
(3) Reasoning with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
In step (2), within the image-text sequence matching depth network, the cross-media implication reasoning task is modeled as a comprehensive sequence matching problem, and comprehensive reasoning is realized by exploiting the logical relations possibly contained in the three combinations text-text, text-image and text-image-text. In the cross-task, cross-media migration training method, a triplet metric learning loss is used as the retrieval-task loss, and the association knowledge contained in the retrieval task is migrated to the implication reasoning network (i.e., the image-text sequence matching depth network), improving reasoning accuracy.
Furthermore, in the above cross-media comprehensive reasoning method for image-text sequence matching, the implication relation in step (3) falls into three classes: implication, contradiction and irrelevance. Reasoning is performed as follows: the image premise, the text premise and the text conclusion are input simultaneously, the image-text sequence matching depth network outputs prediction scores for the three implication relations, and the relation with the highest score is taken as the output result.
Based on the same inventive concept, the invention also provides a cross-media comprehensive reasoning system for image-text sequence matching, which comprises:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network and judging the implication relation between the given premise and conclusion.
The invention has the following beneficial effects: compared with existing methods, the invention can comprehensively use reasoning clues in both images and text to infer the implication relation. In addition, the method can use the association annotations of cross-media retrieval tasks to support model training for implication reasoning, alleviating the difficulty of annotating implication relations.
The invention achieves the above effects because the image-text sequence matching depth network structure extracts fine-grained information from the image premise, the text premise and the conclusion, and fully mines complementary cross-media local clues through sequence matching of the three combinations text-text, text-image and text-image-text. The cross-task, cross-media migration training method effectively exploits the training information of the cross-media retrieval task and improves the accuracy of implication reasoning.
Drawings
FIG. 1 is a flow chart of a cross-media comprehensive reasoning method of image-text sequence matching in the invention.
FIG. 2 is a schematic diagram of the image-text sequence matching depth network structure of the present invention.
FIG. 3 is a schematic diagram of a training method of cross-task and cross-media migration of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific embodiments.
The invention relates to a cross-media comprehensive reasoning method for matching image-text sequences, which is shown in a flow chart in figure 1 and comprises the following steps:
(1) Establish a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text.
In this embodiment, the established cross-media implication reasoning data set is represented as D = {(i_n, t_n, h_n, e_n)}, n = 1, ..., N, where (i_n, t_n) and h_n form the n-th premise-conclusion pair and N is the total number of such pairs. The image premise is denoted i_n, the text premise t_n, the text conclusion h_n, and the implication label e_n. i_n and t_n have a pairwise association relation. e_n is a 1×3 vector with exactly one dimension equal to 1, indicating one of three implication relations: implication, contradiction or irrelevance; the other dimensions are all 0.
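To make the data-set structure concrete, the following is a minimal sketch (not part of the patent) of one premise-conclusion entry in Python; the field names, relation strings and dictionary layout are illustrative assumptions, while the (i_n, t_n, h_n, e_n) structure and the 1×3 one-hot label come from the description above.

```python
import numpy as np

# Illustrative relation names; the patent speaks of implication, contradiction and irrelevance.
RELATIONS = ["implication", "contradiction", "irrelevance"]

def make_entry(image_path, premise_text, conclusion_text, relation):
    """Build one premise-conclusion pair (i_n, t_n, h_n) with a 1x3 one-hot label e_n."""
    e_n = np.zeros(3, dtype=np.float32)
    e_n[RELATIONS.index(relation)] = 1.0          # exactly one dimension is 1
    return {"i_n": image_path, "t_n": premise_text, "h_n": conclusion_text, "e_n": e_n}

# Example usage with a hypothetical Flickr30k image path
entry = make_entry("flickr30k/123.jpg",
                   "A man rides a bicycle down a hill.",
                   "A person is outdoors.",
                   "implication")
```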
(2) Training the image-text sequence matching depth network by using the cross-media implication reasoning data set. The training method of cross-task and cross-media migration is used in network training.
FIG. 2 shows the image-text sequence matching depth network structure. In this embodiment, the input part comprises three branches: the image premise, the text conclusion and the text premise. For each image, the pool5-layer features of the convolutional neural network VGG19 are extracted as local image features; the pool5 layer has V = 7×7 = 49 regions in total, each represented by a 512-dimensional feature. For each piece of input text, every English word is represented by a 350-dimensional concatenated feature, consisting of a 300-dimensional GloVe feature, a 49-dimensional character-composed embedding feature and a 1-dimensional exact-match feature. The local features of the three branches are all mapped by fully connected layers (fc-I, fc-H and fc-T in FIG. 2) into local unified representations of the same dimension, which serve as the inputs of the respective sequence encoders. Specifically, the local unified representation of i_n is denoted {I_n^1, ..., I_n^V}, where I_n^1 is the 1st local unified representation; similarly, the local unified representations of t_n and h_n are denoted {T_n^1, T_n^2, ...} and {H_n^1, H_n^2, ...}, one vector per word.
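A minimal sketch of the image side of this step, assuming PyTorch/torchvision and a 224×224 input; the unified dimension of 300 and the function names are illustrative assumptions, while the 49 pool5 regions of 512 dimensions, the 350-dimensional word features and the fc-I/fc-T/fc-H projections come from the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

d_unified = 300                                   # assumed common dimension of the unified space

vgg = models.vgg19().features.eval()              # ends at pool5; pretrained weights would be loaded in practice
fc_I = nn.Linear(512, d_unified)                  # fc-I: image region features -> unified space
fc_T = nn.Linear(350, d_unified)                  # fc-T: 350-d premise word features -> unified space
fc_H = nn.Linear(350, d_unified)                  # fc-H: 350-d conclusion word features -> unified space

def image_local_unified(image_batch):
    """image_batch: (B, 3, 224, 224) -> (B, 49, d_unified) local unified representation."""
    with torch.no_grad():
        fmap = vgg(image_batch)                   # (B, 512, 7, 7) pool5 feature map
    regions = fmap.flatten(2).transpose(1, 2)     # (B, 49, 512), one vector per region
    return fc_I(regions)
```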
In the sequence encoding part, each branch encodes its local unified representation with a bidirectional long short-term memory network (BiLSTM). For the image, the viewing order of the human eye is simulated by arranging the 7×7 = 49 regions from left to right and from top to bottom; for text, the words are arranged in their original order. The BiLSTM converts each local unified representation into a local sequence representation. For the image premise, the text premise and the text conclusion, the resulting local sequence representations are recorded separately and passed to the matching stage.
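A minimal sketch of the sequence-encoding step, again assuming PyTorch; the hidden size of 256 is an illustrative assumption, and one such module would be instantiated per branch.

```python
import torch
import torch.nn as nn

d_unified, hidden = 300, 256                       # assumed dimensions
bilstm = nn.LSTM(input_size=d_unified, hidden_size=hidden,
                 bidirectional=True, batch_first=True)

def sequence_encode(local_unified):
    """local_unified: (B, L, d_unified), regions ordered left-to-right/top-to-bottom for
    images and words in order for text -> (B, L, 2*hidden) local sequence representation."""
    out, _ = bilstm(local_unified)
    return out

# e.g. for the image premise branch: seq_I = sequence_encode(image_local_unified(images))
```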
As shown in FIG. 2, the local sequence representations of the three branches are concatenated in the three combinations text-text, text-image and text-image-text and fed into DMAN (Delta Memory Attention Network) models ("Memory Fusion Network for Multi-view Sequential Learning", A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria and L.-P. Morency, AAAI Conference on Artificial Intelligence 2018) to obtain context codes. In each of the three branches, the context code passes through a softmax classifier, and the classification probabilities are added to obtain the final implication relation prediction score. In this embodiment, model training uses the cross-task, cross-media migration training method shown in FIG. 3. Cross-task migration takes place between the cross-media retrieval task and the implication reasoning task, and aims to migrate the association knowledge implied in the retrieval task to the implication reasoning network. Since (i_n, t_n) is a paired image-text association, it can support training of an association retrieval model. Taking i_n as an example, a global representation G(i_n) is computed from all of its local sequence representations by max pooling; the global representation G(t_n) of t_n is obtained in the same way. The triplet metric learning loss of the retrieval task is as follows:
L_Sim = max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(i_n)^+, G(t_n)^-)) + max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(t_n)^+, G(i_n)^-))

where G(i_n)^+, G(t_n)^+ are image and text premises that have a pairwise association, while G(i_n)^+, G(t_n)^- and G(t_n)^+, G(i_n)^- are image and text premises without a pairwise association. max is the maximum operation, sim is a distance measure such as the Euclidean distance, and α is a margin parameter. After this part of the network is trained, its parameters are shared with the reasoning network.
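A minimal sketch of this retrieval-task loss, assuming PyTorch, Euclidean distance as sim and an illustrative margin α = 0.2; the global representations G(·) are assumed to have already been obtained by max pooling as described above.

```python
import torch

def triplet_retrieval_loss(G_i_pos, G_t_pos, G_t_neg, G_i_neg, alpha=0.2):
    """All inputs: (B, d) global representations. Implements
    L_Sim = max(0, a + sim(G(i)+, G(t)+) - sim(G(i)+, G(t)-))
          + max(0, a + sim(G(i)+, G(t)+) - sim(G(t)+, G(i)-)), with sim = Euclidean distance."""
    dist = lambda a, b: torch.norm(a - b, dim=1)
    pos = dist(G_i_pos, G_t_pos)                                  # distance of the paired premises
    term1 = torch.clamp(alpha + pos - dist(G_i_pos, G_t_neg), min=0.0)  # image anchor, unpaired text
    term2 = torch.clamp(alpha + pos - dist(G_t_pos, G_i_neg), min=0.0)  # text anchor, unpaired image
    return (term1 + term2).mean()
```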
Cross-media migration takes place between the image premise branch and the text premise branch of the implication reasoning network, and aims to keep cross-media consistency during model training. Specifically, the maximum mean discrepancy (MMD) is used as the migration loss. Let a and b be the distributions of the local unified representations of images and text respectively; the squared MMD between them in the reproducing kernel Hilbert space H_k is defined as:
MMD_k^2(a, b) = || E_a[φ(I_n^x)] − E_b[φ(T_n^j)] ||_{H_k}^2

where E_a and E_b denote expectations over the local unified representations of images and text, x and j index image regions and text words respectively, and φ is the feature mapping of H_k. By minimizing this MMD, knowledge migration between the image and text premises is achieved.
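A minimal sketch of the squared-MMD migration loss, assuming PyTorch and a single RBF kernel with an illustrative bandwidth; the patent specifies only that the squared MMD between the image and text local-representation distributions in an RKHS is minimized, not the kernel choice.

```python
import torch

def mmd2_rbf(x, y, sigma=1.0):
    """x: (n, d) image local unified reps, y: (m, d) text local unified reps.
    Biased empirical estimate of MMD_k^2(a, b) with an RBF kernel k."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
```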
(3) Reason with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
After the depth network has been trained, the implication relation between premise and conclusion can be judged from the input image premise, text premise and text conclusion. Specifically, the network takes the image premise, the text premise and the text conclusion as inputs, and each branch produces a probability value for each of the three cases: implication, contradiction and irrelevance. The probability values of the three branches are added to obtain the implication relation prediction score, and the relation with the largest score is taken as the reasoning result.
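A minimal sketch of this prediction step, under the assumption that the three trained matching branches are available as callables, each returning a softmax probability vector over the three relations; the branch names are hypothetical.

```python
import torch

RELATIONS = ["implication", "contradiction", "irrelevance"]

def infer(branch_tt, branch_ti, branch_tit, image_premise, text_premise, conclusion):
    """Sum the class probabilities of the three branches and return the top relation."""
    scores = (branch_tt(text_premise, conclusion)                        # text-text
              + branch_ti(image_premise, conclusion)                     # text-image
              + branch_tit(image_premise, text_premise, conclusion))     # text-image-text
    return RELATIONS[int(torch.argmax(scores))]
```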
The following experimental results show that, compared with existing methods, the cross-media comprehensive reasoning method for image-text sequence matching of the invention achieves higher implication reasoning accuracy.
In this embodiment, the extended text implication reasoning data set SNLI is used for experimental verification. The original SNLI data set comes from Stanford University and was proposed in "A large annotated corpus for learning natural language inference" (Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning, Conference on Empirical Methods in Natural Language Processing, 2015). It contains 570,000 premise-conclusion pairs, in which each premise and each conclusion is a piece of text. Each text in the SNLI data set describes one image in the Flickr30k data set, so the SNLI data set is extended with these images such that each premise contains both text and image media. The following four methods are compared in the experiments:
the existing method comprises the following steps: the 100-d LSTM encoders method in document "A large annotated corpus for learning natural language inference" (authors: samuel R. Bowman, gabor Angeli, christopher Potts, and Christopher D. Manning). The method extracts feature vectors from preconditions and conclusions by using an independent cyclic neural network respectively, and analyzes implication relations through a plurality of fully-connected network layers to obtain reasoning results.
Existing method 2: the BiMPM method from "Bilateral Multi-Perspective Matching for Natural Language Sentences" (Zhiguo Wang, Wael Hamza and Radu Florian). This method introduces the idea of sequence matching, matches the premise and the conclusion in both the forward and the backward direction, and classifies the implication relation.
Existing method 3: the DIIN method from "Natural Language Inference over Interaction Space" (Yichen Gong, Heng Luo and Jian Zhang). This method first extracts features from the text premise and the conclusion, builds an interaction tensor space, and models implication reasoning as vector interactions in that tensor space.
The invention: the method of this embodiment.
Accuracy is used as the evaluation metric for implication reasoning: it is the ratio of the number of premise-conclusion pairs whose implication relation is judged correctly to the total number of pairs in the data set. The higher the accuracy, the better the reasoning effect.
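As a small illustration of the metric (plain Python, names hypothetical): 450 correctly judged pairs out of 500 would give an accuracy of 0.90.

```python
def accuracy(predicted_relations, true_relations):
    """Fraction of premise-conclusion pairs whose implication relation is judged correctly."""
    correct = sum(p == t for p, t in zip(predicted_relations, true_relations))
    return correct / len(true_relations)
```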
TABLE 1. Experimental results of the invention

Method | Accuracy |
---|---|
Existing method 1 | 77.6% |
Existing method 2 | 86.9% |
Existing method 3 | 88.0% |
The invention | 90.0% |
As can be seen from Table 1, the invention achieves higher reasoning accuracy than the existing methods. Existing method 1 is an early method that uses only simple structures such as recurrent neural networks, which makes it difficult to fully model complex reasoning clues. Existing method 2 adopts the sequence matching idea and therefore surpasses method 1 in accuracy, but it uses only text information, so further improvement is difficult. Existing method 3 fully models the logical relation between premise and conclusion in a tensor interaction space, but again uses only text information and cannot exploit the complementary clues provided by images. The invention realizes sequence matching of the three combinations text-text, text-image and text-image-text through the image-text sequence matching depth network structure, fully mining complementary cross-media local clues; through the cross-task, cross-media migration training method, it effectively exploits the training information of the cross-media retrieval task and improves the accuracy of implication reasoning.
Based on the same inventive concept, another embodiment of the present invention provides a cross-media comprehensive reasoning system for image-text sequence matching, which comprises:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network and judging the implication relation between the given premise and conclusion.
Based on the same inventive concept, another embodiment of the present invention provides a computer/server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
Other embodiments: in the previous embodiment, VGG19 network features are used for images, and GloVe features, character-composed embedding features and exact-match features are used for text. The method also supports other features as inputs, such as word embedding features of text. In the previous embodiment, the BiLSTM model is used for sequence encoding of images and text; the method also supports LSTM and other sequence encoding models.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A cross-media comprehensive reasoning method for image-text sequence matching comprises the following steps:
establishing a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text;
training an image-text sequence matching depth network with the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training; the image-text sequence matching depth network first extracts local features from the image premise, the text premise and the text conclusion, then encodes the representations of the three combinations text-text, text-image and text-image-text, and obtains an implication relation prediction score through adaptive threshold aggregation and an implication relation classifier; in the cross-task, cross-media migration training method, each group of image premises and text premises has a pairwise association relation, a triplet metric learning loss is used to train the retrieval task, the network parameters are shared with the implication reasoning network part, and the association knowledge of the retrieval task is migrated to the implication reasoning task network; at the same time, the maximum mean discrepancy is used as the migration loss to minimize the difference between the distributions of the local unified representations of image and text, realizing knowledge migration between the image and text premises;
and reasoning with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
2. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein each group of data in the cross-media implication reasoning data set consists of an image premise, a text premise and a text conclusion and carries an implication relation label.
3. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein the image-text sequence matching depth network models the cross-media implication reasoning task as a comprehensive sequence matching problem and realizes comprehensive reasoning by exploiting the logical relations possibly contained in the three combinations text-text, text-image and text-image-text.
4. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 3, wherein in the image-text sequence matching depth network, local features are extracted from the image premise, the text premise and the text conclusion and converted through fully connected layers into local unified representations of the same dimension; the local unified representations are converted into local sequence representations by a sequence encoding model; for the image premise, the text premise and the text conclusion, the local sequence representations are concatenated in the three combinations text-text, text-image and text-image-text and input into DMAN models respectively to obtain context codes; the context codes pass through adaptive threshold aggregation and the implication relation classifier, and the classification probabilities are added to obtain the final implication relation prediction score.
5. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein in the cross-task, cross-media migration training method, the cross-task migration takes place between the cross-media retrieval task and the implication reasoning task, with the goal of migrating the association knowledge implied in the retrieval task to the implication reasoning task network; in the cross-media implication reasoning data set, each group of image premises and text premises has a pairwise association relation, the paired data are regarded as relevant and the non-paired data as irrelevant, the retrieval task is trained with the triplet metric learning loss, and the network parameters are then shared with the implication reasoning network part, realizing cross-task migration.
6. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 5, wherein the triplet metric learning loss is calculated using the formula:

L_Sim = max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(i_n)^+, G(t_n)^-)) + max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(t_n)^+, G(i_n)^-))

wherein G(i_n)^+, G(t_n)^+ are image and text premises with a pairwise association, G(i_n)^+, G(t_n)^- and G(t_n)^+, G(i_n)^- are image and text premises without a pairwise association, max is the maximum operation, sim is a distance measure, and α is a margin parameter.
7. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein in the cross-task, cross-media migration training method, the cross-media migration takes place between the image premise branch and the text premise branch of the implication reasoning network, with the goal of maintaining cross-media consistency during model training; the maximum mean discrepancy is used as the migration loss to minimize the difference between the distributions of the local unified representations of image and text, realizing knowledge migration between the image and text premises.
8. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein reasoning with the trained image-text sequence matching depth network structure means that, after the image-text sequence matching depth network is trained, it receives the image premise, the text premise and the text conclusion as inputs simultaneously, and each branch obtains a probability value for each of the three cases of implication, contradiction and irrelevance; the probability values of the three branches are added to obtain the implication relation prediction score, and the relation with the largest score is taken as the reasoning result.
9. A cross-media comprehensive reasoning system for image-text sequence matching using the method of any one of claims 1-8, comprising:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network structure and judging the implication relation between the given premise and conclusion.
10. A computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911127259.1A CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911127259.1A CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110889505A CN110889505A (en) | 2020-03-17 |
CN110889505B true CN110889505B (en) | 2023-05-02 |
Family
ID=69747841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911127259.1A Active CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889505B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112905820B (en) * | 2021-03-30 | 2022-11-11 | 山西大学 | Multi-graph retrieval method based on logic learning |
CN113111161B (en) * | 2021-04-09 | 2023-09-08 | 北京语言大学 | Cross-media association analysis method |
CN113568735B (en) * | 2021-06-17 | 2024-01-23 | 阿里巴巴新加坡控股有限公司 | Data processing method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | 河南大学 | The target identification method based on class brain across media intelligent towards unmanned autonomous system |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110263912A (en) * | 2019-05-14 | 2019-09-20 | 杭州电子科技大学 | A kind of image answering method based on multiple target association depth reasoning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10375200B2 (en) * | 2016-09-26 | 2019-08-06 | Disney Enterprises, Inc. | Recommender engine and user model for transmedia content data |
-
2019
- 2019-11-18 CN CN201911127259.1A patent/CN110889505B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | 河南大学 | The target identification method based on class brain across media intelligent towards unmanned autonomous system |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110263912A (en) * | 2019-05-14 | 2019-09-20 | 杭州电子科技大学 | A kind of image answering method based on multiple target association depth reasoning |
Non-Patent Citations (5)
Title |
---|
A Unified Semantic Model for Cross-Media Events Analysis in Online Social Networks; Mingzhe Fang et al.; IEEE Access; 2019-02-17; full text *
An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges; Xin Huang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2017-05-17; full text *
Deep Cross-Media Knowledge Transfer; Xin Huang, Yuxin Peng; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-06-23; full text *
TPCKT: Two-Level Progressive Cross-Media Knowledge Transfer; Xin Huang, Yuxin Peng; IEEE Transactions on Multimedia; 2019-04-14; full text *
Cross-media retrieval based on semantic matching of different modalities; Chen Xiang et al.; Journal of Shandong Normal University (Natural Science Edition); 2017-09-15 (No. 03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110889505A (en) | 2020-03-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021233112A1 (en) | Multimodal machine learning-based translation method, device, equipment, and storage medium | |
CN111291181B (en) | Representation learning for input classification via topic sparse self-encoder and entity embedding | |
CN110121706B (en) | Providing responses in a conversation | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN109783666B (en) | Image scene graph generation method based on iterative refinement | |
CN112084331A (en) | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium | |
CN111444709A (en) | Text classification method, device, storage medium and equipment | |
CN111104512B (en) | Game comment processing method and related equipment | |
Zhang et al. | Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition | |
CN110889505B (en) | Cross-media comprehensive reasoning method and system for image-text sequence matching | |
Sun et al. | Dynamic emotion modelling and anomaly detection in conversation based on emotional transition tensor | |
CN113987179A (en) | Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium | |
Qi et al. | MEDT: Using multimodal encoding-decoding network as in transformer for multimodal sentiment analysis | |
Zhao et al. | Videowhisper: Toward discriminative unsupervised video feature learning with attention-based recurrent neural networks | |
Liu et al. | Dual-feature-embeddings-based semi-supervised learning for cognitive engagement classification in online course discussions | |
Bansal et al. | Multilingual personalized hashtag recommendation for low resource Indic languages using graph-based deep neural network | |
Zheng et al. | Multimodal representations learning based on mutual information maximization and minimization and identity embedding for multimodal sentiment analysis | |
CN117173497B (en) | Image generation method and device, electronic equipment and storage medium | |
CN116958997B (en) | Graphic summary method and system based on heterogeneous graphic neural network | |
Ange et al. | Semi-supervised multimodal deep learning model for polarity detection in arguments | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis | |
Bai et al. | Low-rank multimodal fusion algorithm based on context modeling | |
Ke et al. | CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension | |
CN114330297A (en) | Language model pre-training method, language text processing method and device | |
Dharaniya et al. | Automatic scene generation using sentiment analysis and bidirectional recurrent neural network with multi-head attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |