CN107346328B - Cross-modal association learning method based on multi-granularity hierarchical network - Google Patents

Cross-modal association learning method based on multi-granularity hierarchical network

Info

Publication number
CN107346328B
CN107346328B (application CN201710378513.XA)
Authority
CN
China
Prior art keywords
data
modal
representation
grained
text
Prior art date
Legal status
Active
Application number
CN201710378513.XA
Other languages
Chinese (zh)
Other versions
CN107346328A (en)
Inventor
Peng Yuxin (彭宇新)
Qi Jinwei (綦金玮)
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN201710378513.XA
Publication of CN107346328A
Application granted
Publication of CN107346328B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, which comprises the following steps: 1. Establish a cross-modal database containing multiple modality types, divide the data in the database into a training set, a validation set and a test set, perform blocking processing on the data of the different modalities, and extract feature vectors for the original data and the blocked data of every modality. 2. Train a multi-granularity hierarchical network structure with the original data and the blocked data, learning a unified representation for the different modality data. 3. Obtain the unified representations of the different modality data with the trained multi-granularity hierarchical network structure and use them to compute the similarity of data across modalities. 4. Take any one modality type in the test set as the query modality and another modality type as the target modality, compute the similarity between each query sample and the query targets, and obtain a list of relevant target-modality results ordered by similarity. The invention can improve the accuracy of cross-modal retrieval.

Description

Cross-modal association learning method based on multi-granularity hierarchical network
Technical Field
The invention relates to the field of multimedia retrieval, in particular to a cross-modal association learning method based on a multi-granularity hierarchical network.
Background
In recent years, with the rapid development of computer technology, information acquisition and processing have been changed from a single modality form of text, image, audio, video, and the like to a form in which multiple modalities are fused with each other. Multimodal retrieval has become an important issue in the field of information retrieval, and has wide application in both search engines and big data management. The traditional retrieval mode is mainly a single mode form, namely, a user submits data of one mode type as a query, and a retrieval system returns retrieval results of the same mode, such as image retrieval, text retrieval and the like. This retrieval approach does not directly measure the similarity between different modality data, such as the similarity of an image to an audio clip, and therefore limits the flexibility of retrieval. In order to solve the above problems, cross-modal retrieval becomes a new research hotspot, which can retrieve relevant results containing multiple modal types according to data of any modal type uploaded by a user as a query. Compared with the traditional single-mode retrieval, the cross-mode retrieval can provide more flexible and practical retrieval experience.
A key problem in cross-modal retrieval is how to learn the intrinsic correlations between different modalities. Because the distribution characteristics and feature representations of different modality data are inconsistent, measuring cross-modal similarity is very challenging. Existing cross-modal retrieval methods mainly learn a unified space for the different modality data: the feature representations of the different modalities are mapped from their original single-modality spaces into a unified cross-modal space, yielding unified representations whose cross-modal similarity can be measured directly. Existing methods can be divided into two main categories. The first learns linear mappings under a traditional framework, and includes methods based on canonical correlation analysis (CCA), which analyze the pairwise correlations of different modality data, map them into a common subspace of the same dimension, and maximize the correlation between paired data. There are also methods based on graph regularization; for example, Zhai et al., in the document "Learning Cross-Media Joint Representation with Sparse and Semi-Supervised Regularization", propose a cross-modal retrieval method based on sparse and semi-supervised regularization, which constructs graph models for the different modality data and performs cross-modal association learning and high-level semantic abstraction simultaneously.
The second category comprises cross-modal unified representation learning methods based on deep neural networks; the main idea is to exploit the strong modeling capability of deep neural networks to analyze and mine the complex cross-modal association relationships. For example, Ngiam et al., in the document "Multimodal Deep Learning", propose a multimodal autoencoder that takes data of two modalities as inputs, models the cross-modal correlation information in an intermediate layer, and simultaneously models the reconstruction errors of both. Feng et al., in the document "Cross-modal Retrieval with Correspondence Autoencoder", propose the correspondence autoencoder (Corr-AE), which constructs two networks connected by a coding layer and models the correlation information and the reconstruction information simultaneously. Most existing deep-network-based cross-modal retrieval methods can be divided into two learning stages: the first stage learns a separate feature representation for each modality, and the second stage learns the cross-modal unified representation. However, existing methods have three limitations. First, in the first stage they model only the intra-modal association relationships and ignore the complementary role of inter-modal associations in learning the separate feature representations. Second, in the second stage they use only a single loss function as a constraint and cannot fully balance the intra-modal and inter-modal association learning processes. Third, they consider only the original data of the different modalities and ignore the rich fine-grained information provided by the parts inside each data item, so they cannot fully mine the cross-modal association relationships.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a cross-modal association learning method based on a multi-granularity hierarchical network. The hierarchical network structure fully mines the multi-level association relationships within and between modalities, while a multi-task framework dynamically balances the learning of intra-modal semantic category constraints and inter-modal pairwise similarity constraints. In addition, modeling the multi-granularity information of the different modality data improves the accuracy of cross-modal retrieval.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a cross-modal association learning method based on a multi-granularity hierarchical network is used for comprehensively modeling multi-granularity information of cross-modal data and association information in and among modalities to obtain uniform representations of different modal data, so that cross-modal retrieval is realized, and the method comprises the following steps:
(1) establishing a cross-modal database containing various modal types, dividing data in the cross-modal database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the cross-modal database, and extracting original data of all the modes and feature vectors of the blocked data;
(2) training a multi-granularity hierarchical network structure by utilizing original data and partitioned data, and learning unified representation for different modal data through the multi-granularity hierarchical network structure;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) using any one modality type in the test set as the query modality and another modality type as the target modality, taking each data item of the query modality as a query sample, retrieving data in the target modality, calculating the similarity between the query sample and the query targets, and obtaining a relevant result list of the target modality data according to the similarity.
Further, in the above cross-modality association learning method based on the multi-granularity hierarchical network, the cross-modality database in step (1) may contain a plurality of modality types, such as images, texts, and the like.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the blocking processing of the different modality data in the database in step (1) may use a different blocking method for each modality to divide the original data into several parts. Specifically, for image data, a selective search algorithm extracts several candidate regions that contain rich fine-grained information such as visual objects; for text data, the text is divided into several parts in units of sentences. Other blocking methods are also supported, such as segmenting an image into 2 × 2 or 4 × 4 regions, or segmenting text by phrases.
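As an illustration only, the simpler blocking variants of this step could look like the following sketch, which splits a document into sentence units with a basic punctuation rule and cuts an image array into a 2 × 2 grid; the selective-search region extraction itself is not reproduced here, and all function names and shapes are illustrative assumptions rather than the patent's implementation.

```python
import re
import numpy as np

def split_text_into_sentences(document):
    """Block a text document into sentence units (simple punctuation rule)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

def split_image_into_grid(image, rows=2, cols=2):
    """Block an image (H x W x C array) into rows x cols regions."""
    h, w = image.shape[0] // rows, image.shape[1] // cols
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

print(split_text_into_sentences("A dog runs. It catches a ball! Then it rests."))
print(len(split_image_into_grid(np.zeros((224, 224, 3)))))   # 4 regions
```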
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the feature vectors in step (1) are specifically: for text data, word-frequency feature vectors are extracted; for image data, convolutional neural network feature vectors are extracted. Other kinds of features are also supported, such as bag-of-words feature vectors for images and latent Dirichlet allocation feature vectors for text.
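A small sketch of the text word-frequency features follows, using scikit-learn's CountVectorizer as one possible implementation; the CNN image features are assumed to be extracted elsewhere (for example from a pretrained network) and are only stubbed here with random values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Word-frequency feature vectors for text data
texts = ["a dog runs in the park", "a cat sleeps on the sofa"]
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(texts).toarray()   # shape: (2, vocab_size)

# Placeholder for CNN image features: in practice these would come from a
# pretrained convolutional network; here they are stubbed with random values.
image_features = np.random.rand(2, 4096)

print(text_features.shape, image_features.shape)
```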
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, step (2) uses a multi-pathway network structure: the blocked data of the different modalities are used to fully mine the multi-granularity information within the data; the intra-modal and inter-modal association relationships of the cross-modal data are modeled to obtain the separate feature representation of each single modality; and a multi-task learning framework is built to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, finally obtaining the cross-modal unified representation.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the distance measure in step (3) is the cosine distance: the similarity of two modality data items is measured by the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the retrieval in step (4) uses one modality type in the test set as the query modality and another modality type as the target modality. Each data item of the query modality in the test set is taken as a query sample; its similarity to all data of the target modality in the test set is computed as in step (3), and the similarities are sorted in descending order to obtain the relevant result list.
The invention has the following effects: compared with existing methods, the method fully mines the multi-granularity information of the different modality data, models the intra-modal and inter-modal association relationships to learn the separate feature representation of each single modality, and further adopts a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
The reason the method achieves the above effect is as follows: for the two stages of single-modality separate feature representation learning and cross-modal unified representation learning, a hierarchical network structure fully models the association relationships within and between modalities. On the one hand, in learning the single-modality separate feature representations, the multi-granularity feature representations of the different modality data are fused and intra-modal and inter-modal association learning is jointly optimized. On the other hand, in learning the cross-modal unified representation, a multi-task learning framework dynamically balances the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
Drawings
FIG. 1 is a flow chart of a cross-modal association learning method based on a multi-granularity hierarchical network according to the present invention.
Fig. 2 is a schematic diagram of the complete network architecture of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, the flow of which is shown in figure 1, and the method comprises the following steps:
(1) establishing a cross-modal database containing multiple modal types, dividing the database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the database, and extracting all modal original data and feature vectors of the blocked data.
In this embodiment, the cross-modal database may contain a plurality of modality types, and a different blocking method is used for each modality to divide the original data into several parts. Taking images and text as examples, a selective search algorithm extracts from each image several candidate regions containing rich fine-grained information such as visual objects; each text document is divided into several parts in units of sentences. Further, the feature vectors of the two modality types are extracted as follows: word-frequency feature vectors are extracted for text data, and deep convolutional neural network feature vectors are extracted for image data. The framework of the method also supports other modality types, such as audio and video, and other kinds of features, such as bag-of-words feature vectors for images and latent Dirichlet allocation feature vectors for text.
The cross-modal dataset is denoted by D = {D(i), D(t)}, where D(r) = {x_p(r)} (p = 1, ..., n(r)) denotes the data of media type r, with r ∈ {i, t} (i denotes image, t denotes text) and n(r) the number of data items of that type. Each data item in the training set has one and only one semantic category. x_p(r) denotes the feature vector of the p-th data item of media type r; it is a d(r) × 1 vector, where d(r) denotes the feature dimension of media type r. y_p(r) denotes the semantic label of x_p(r); it is a c × 1 vector, where c denotes the total number of semantic categories. Exactly one dimension of y_p(r) is 1 and the rest are 0, indicating that the semantic category of the data item is the label corresponding to the dimension whose value is 1.
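To make the label convention concrete, here is a tiny sketch of building the c × 1 one-hot semantic label vector y_p(r) described above; the category index and dimensions are illustrative only.

```python
import numpy as np

def one_hot_label(category_index, num_categories):
    """Build the c-dimensional semantic label vector: one dimension is 1, the rest 0."""
    y = np.zeros(num_categories)
    y[category_index] = 1.0
    return y

print(one_hot_label(category_index=3, num_categories=10))
```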
(2) Train a multi-granularity hierarchical network structure with the original data and the blocked data, and learn a unified representation for the different modality data.
The process of this step is shown in Fig. 2, in which the circles represent hidden units of the neural network and the dotted lines represent connections between hidden units of two adjacent layers. In this embodiment, a two-pathway network models the original image and text data. First, two deep belief networks (DBNs) model the feature distributions of images and text respectively, using the following probability distribution formulas:

P(v_i) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_i | h(1))

P(v_t) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_t | h(1))

where h(1) and h(2) denote the two hidden layers of a DBN, v_i denotes image data and v_t denotes text data. This yields feature representations Q(i) and Q(t) that contain intra-modal high-level semantic information. The two networks are then connected by a shared coding layer, the intra-modal and inter-modal associations of the image and text data are modeled simultaneously, and the reconstruction learning error and the correlation learning error are jointly optimized by minimizing the following loss function:

L = L_r(Q(i), Q'(i)) + L_r(Q(t), Q'(t)) + L_c(Q(i), Q(t))

where Q'(i) and Q'(t) denote the reconstructed representation of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error. This yields coarse-grained feature representations C(i) and C(t) that contain intra-modal and inter-modal associations, where C_p(i) and C_p(t) denote the coarse-grained feature representation of the p-th data item of the image and text media types respectively.
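The joint objective above can be pictured with a small numerical sketch. The snippet below is only an illustration under assumed shapes and an assumed squared-error form for both L_r and L_c; the single linear map standing in for the shared coding layer is a hypothetical simplification, not the patent's exact network.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    # L_r: squared reconstruction error (assumed form)
    return np.sum((x - x_hat) ** 2)

def correlation_error(q_img, q_txt):
    # L_c: distance between the two modalities' representations (assumed form)
    return np.sum((q_img - q_txt) ** 2)

# Q(i), Q(t): DBN outputs for one image/text pair (toy values)
q_img = np.random.rand(128)
q_txt = np.random.rand(128)

# Hypothetical shared coding layer: one linear encoder/decoder shared by both pathways
W_enc = np.random.randn(64, 128) * 0.01
W_dec = np.random.randn(128, 64) * 0.01

code_img, code_txt = W_enc @ q_img, W_enc @ q_txt          # shared codes
q_img_hat, q_txt_hat = W_dec @ code_img, W_dec @ code_txt   # reconstructions

loss = (reconstruction_error(q_img, q_img_hat)
        + reconstruction_error(q_txt, q_txt_hat)
        + correlation_error(code_img, code_txt))
print(f"joint coarse-grained loss: {loss:.3f}")
```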
In this embodiment, another two-pathway network models the fine-grained image and text data. Specifically, two deep belief networks (DBNs) model the fine-grained image and text data, and an average-fusion strategy yields feature representations U(i) and U(t) that contain intra-modal fine-grained information. A shared coding layer is then constructed to connect the two networks, and the intra-modal and inter-modal associations of the fine-grained features of images and text are modeled simultaneously by minimizing the following loss function:

L = L_r(U(i), U'(i)) + L_r(U(t), U'(t)) + L_c(U(i), U(t))

where U'(i) and U'(t) denote the reconstructed representations of the fine-grained features of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error. This yields fine-grained feature representations F(i) and F(t) that contain intra-modal and inter-modal associations, where F_p(i) and F_p(t) denote the fine-grained feature representation of the p-th data item of the image and text media types respectively.
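A minimal sketch of the average-fusion step, assuming the fine-grained blocks (image regions or text sentences) have already been encoded into fixed-length vectors; the names and shapes are illustrative only.

```python
import numpy as np

def average_fusion(block_features):
    """Fuse the per-block feature vectors of one data item by averaging.

    block_features: array of shape (num_blocks, feature_dim), e.g. one row per
    image region from selective search or per sentence of a text document.
    """
    block_features = np.asarray(block_features, dtype=np.float64)
    return block_features.mean(axis=0)

# Toy example: 5 image regions with 4096-dimensional CNN features each
region_features = np.random.rand(5, 4096)
u_img = average_fusion(region_features)   # U(i): fine-grained image representation
print(u_img.shape)                        # (4096,)
```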
In this embodiment, a joint restricted Boltzmann machine (RBM) fuses the coarse-grained representation and the fine-grained representation of each modality (C(i) with F(i) for images, and C(t) with F(t) for text). Specifically, the following joint distribution is defined:

P(v_1, v_2) = Σ_{h_1(1), h_2(1), h(2)} P(v_1 | h_1(1)) P(v_2 | h_2(1)) P(h_1(1), h_2(1), h(2))

where h_1(1) and h_2(1) denote the two hidden layers of the joint restricted Boltzmann machine and h(2) denotes its joint layer. For images, v_1 denotes the coarse-grained feature representation C(i) of the image and v_2 denotes the fine-grained feature representation F(i) of the image; for text, the same joint distribution is used, with v_1 denoting the coarse-grained feature representation C(t) of the text and v_2 denoting the fine-grained feature representation F(t) of the text. This yields single-modality feature representations S(i) and S(t) that contain both coarse-grained and fine-grained information, where S_p(i) and S_p(t) denote the single-modality feature representation of the p-th data item of the image and text media types respectively.
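The fusion can be pictured with the following simplified sketch, which replaces the joint RBM by a plain feed-forward joint layer purely for illustration; it is not the patent's RBM inference or training procedure, and all names and dimensions are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(coarse, fine, W1, W2, b):
    """Illustrative joint layer: map the coarse-grained and fine-grained
    representations of one modality into a single fused representation.
    (A stand-in for the joint RBM's joint layer, not its actual inference.)"""
    return sigmoid(W1 @ coarse + W2 @ fine + b)

dim_c, dim_f, dim_s = 64, 64, 128
W1 = np.random.randn(dim_s, dim_c) * 0.01
W2 = np.random.randn(dim_s, dim_f) * 0.01
b = np.zeros(dim_s)

c_img = np.random.rand(dim_c)            # C(i): coarse-grained image representation
f_img = np.random.rand(dim_f)            # F(i): fine-grained image representation
s_img = fuse(c_img, f_img, W1, W2, b)    # S(i): single-modality representation
print(s_img.shape)                       # (128,)
```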
In this embodiment, a multi-task learning framework models the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint. Specifically, for the inter-modal pairwise similarity constraint, a neighbor graph G = (V, E) is first constructed over all image and text data, where V denotes the image and text data and E denotes the similarity relationship between image and text data, defined as follows:

E_pq = 1 if y_p(i) = y_q(t), and E_pq = 0 otherwise,

where y_p(i) and y_q(t) denote the labels of the image and text data. The following contrastive loss function is then defined to model the pairwise similar and dissimilar constraints:

L_con = Σ_{p,q} [ E_pq · d(S_p(i), S_q(t)) + (1 − E_pq) · max(0, α − d(S_p(i), S_q(t))) ]

where S_p(i) and S_q(t) denote the single-modality feature representations of images and text (S(i) and S(t)), d(·,·) denotes the distance between two representations, and the margin parameter is set to α.

Then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n denotes the number of categories, and the following cross-entropy loss function is defined:

L_ce = − Σ_{i=1..n} p_i log p'_i

where p'_i denotes the predicted distribution probability and p_i denotes the target distribution probability. Minimizing this loss function enhances the semantic discrimination capability of the unified representation.

Finally, through the multi-task learning framework, the learning processes of the intra-modal semantic category constraint and the inter-modal pairwise association constraint are dynamically balanced, and a more accurate cross-modal unified representation M(i) and M(t) is finally obtained, where M_p(i) and M_p(t) denote the cross-modal unified representation of the p-th data item of the image and text media types respectively.
(3) Calculate the similarity of different modality data using the unified representations obtained from the trained multi-granularity hierarchical network structure.
After the deep network is trained, data of different media are mapped through it to unified representations of the same dimension, and the similarity of two data items is defined as a distance measure between their unified representations. In this embodiment, the cosine distance is adopted: the similarity of two modality data items is measured by the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
(4) Any one modality type in the test set is used as the query modality, and another modality type is used as the target modality. Each data item of the query modality is taken as a query sample to retrieve data in the target modality; the similarity between the query sample and each query target is computed as in step (3), and the similarities are sorted in descending order to obtain the relevant result list of the target modality data.
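A minimal retrieval sketch under the cosine-distance choice above; the unified representations are assumed to be precomputed matrices with one row per item, and all names and sizes are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two unified representation vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_vec, target_matrix):
    """Rank all target-modality items for one query sample, highest similarity first."""
    sims = np.array([cosine_similarity(query_vec, t) for t in target_matrix])
    order = np.argsort(-sims)                 # descending similarity
    return order, sims[order]

# Toy unified representations: one image query against 500 text items
m_img = np.random.rand(128)          # M(i) of the query sample
m_txt = np.random.rand(500, 128)     # M(t) of the target modality (toy size)
ranked_indices, ranked_sims = retrieve(m_img, m_txt)
print(ranked_indices[:5], ranked_sims[:5])
```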
The following experimental results show that compared with the existing method, the cross-modal association learning method based on the multi-granularity hierarchical network can achieve higher retrieval accuracy.
This example was conducted on the Wikipedia cross-modal dataset, which was proposed in the document "A New Approach to Cross-Modal Multimedia Retrieval" (authors N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy and N. Vasconcelos, published at the ACM International Conference on Multimedia in 2010). The dataset contains 2866 texts and 2866 images in one-to-one correspondence, divided into 10 categories in total; 2173 texts and 2173 images are used as the training set, 231 texts and 231 images as the validation set, and 462 texts and 462 images as the test set. The following 3 methods were tested as experimental comparisons:
the prior method comprises the following steps: joint Representation Learning (JRL) method in the document "Learning Cross-Media Joint retrieval with spark and Semi-supervisory reconstruction" (author x.zhai, y.peng, and j.xiao), constructs a graph model for different modal data, performs Cross-modal associative Learning and high-level semantic abstraction simultaneously, and introduces sparse and Semi-Supervised conventions.
Existing method 2: the multimodal autoencoder (Bimodal AE) method in the document "Multimodal Deep Learning" (authors J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng), which takes data of multiple media types as input and models the cross-modal correlation information in an intermediate layer to obtain a unified representation, while requiring the network to reconstruct the original feature inputs from that representation, so that the correlation information between the different media is learned effectively and the reconstruction information within each medium is retained.
Existing method 3: the correspondence autoencoder (Corr-AE) method in the document "Cross-modal Retrieval with Correspondence Autoencoder" (authors F. Feng, X. Wang, and R. Li), which constructs two networks connected at an intermediate layer and models the correlation information and the reconstruction information simultaneously.
The invention: the method of the present embodiment.
The experiment adopts the MAP (mean average precision) metric, which is commonly used in the information retrieval field, to evaluate the accuracy of cross-modal retrieval. MAP is the mean of the average precision of each query sample; the larger the MAP value, the better the cross-modal retrieval result.
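For reference, a small sketch of how a MAP score can be computed from ranked retrieval results; here `relevant` flags whether each returned item shares the query's category, and all names are illustrative.

```python
import numpy as np

def average_precision(relevant):
    """Average precision for one query given a ranked 0/1 relevance list."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    precisions = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precisions * relevant).sum() / relevant.sum())

def mean_average_precision(relevance_lists):
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Toy example: two queries with ranked relevance judgements
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))
```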
Table 1. Experimental results of the invention (MAP scores)

Method               Image query text   Text query image   Average
Existing method 1    0.453              0.400              0.427
Existing method 2    0.314              0.290              0.302
Existing method 3    0.402              0.395              0.399
The invention        0.504              0.457              0.481
As can be seen from Table 1, the method of the invention clearly improves on the existing methods in both the image-query-text and text-query-image tasks. Existing method 1 constructs a graph model under a traditional framework and linearly maps the different modality data into a unified space, so it can hardly model the complex cross-modal association relationships fully. Existing methods 2 and 3 both adopt a deep network structure, but they use only the original data of the different modality types and learn the cross-modal unified representation through a simple network structure. The present method, on the one hand, fuses the multi-granularity feature representations of the different modality data and jointly optimizes intra-modal and inter-modal association learning to obtain the single-modality separate feature representations; on the other hand, it adopts a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints to obtain the cross-modal unified representation, thereby improving the accuracy of cross-modal retrieval.
In other embodiments, in the cross-modal unified characterization learning method in step (2), a Deep Belief Network (DBN) is used to model original and fine-grained image and text data, and a Stacked auto-encoder (SAE) may also be used as a substitute.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A cross-modal association learning method based on a multi-granularity hierarchical network comprises the following steps:
(1) establishing a cross-modal database containing various modal types, dividing data in the cross-modal database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the cross-modal database, and extracting feature vectors of original data of all the modes and the blocked data;
(2) training a multi-granularity hierarchical network structure by utilizing the original data and the blocked data, and learning a unified representation for the different modality data through the multi-granularity hierarchical network structure, which comprises: modeling the original image and text data by using two networks, modeling the fine-grained image and text data by using two networks, fusing the coarse-grained representation and the fine-grained representation of each modality by using a joint restricted Boltzmann machine, and finally modeling the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint by using a multi-task learning framework;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) and using any one mode type in the test set as a query mode, using the other mode type as a target mode, using each data of the query mode as a query sample, retrieving data in the target mode, calculating the similarity between the query sample and the query target, and obtaining a related result list of the target mode data according to the similarity.
2. The method of claim 1, wherein the cross-modality database contains a plurality of modality types including images, text.
3. The method according to claim 1, wherein the step (1) employs different chunking processing methods for different modality data to slice the original data into a plurality of parts, wherein a plurality of candidate regions containing rich fine-grained information are extracted using a selective search algorithm for the image data, or the image is sliced into 2 x 2 or 4 x 4 regions; for text data, the text data is divided into a plurality of pieces in units of sentences, or the text is divided into phrases.
4. The method of claim 1, wherein the feature vectors extracted in step (1) are: for text data, word-frequency feature vectors or latent Dirichlet allocation feature vectors; for image data, convolutional neural network feature vectors or bag-of-words feature vectors.
5. The method of claim 1, wherein the modeling of the original image and text data using two networks first models the feature distributions of images and text respectively with two deep belief networks, using the following probability distribution formulas:

P(v_i) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_i | h(1))

P(v_t) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_t | h(1))

wherein h(1) and h(2) denote two hidden layers of a DBN, v_i denotes image data and v_t denotes text data, thereby obtaining feature representations Q(i) and Q(t) containing intra-modal high-level semantic information; the two networks are then connected by a shared coding layer, the intra-modal and inter-modal associations of the image and text data are modeled simultaneously, and the reconstruction learning error and the correlation learning error are jointly optimized by minimizing the following loss function:

L = L_r(Q(i), Q'(i)) + L_r(Q(t), Q'(t)) + L_c(Q(i), Q(t))

wherein Q'(i) and Q'(t) denote the reconstructed representation of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error, thereby obtaining coarse-grained feature representations C(i) and C(t) containing intra-modal and inter-modal associations.
6. The method of claim 5, wherein the modeling of the fine-grained image and text data using two networks models the fine-grained image and text data with two deep belief networks (DBNs), and an average fusion strategy is adopted to obtain feature representations U(i) and U(t) containing intra-modal fine-grained information; a shared coding layer is then constructed to connect the two networks, and the intra-modal and inter-modal associations of the fine-grained features of the images and texts are modeled simultaneously by minimizing the following loss function:

L = L_r(U(i), U'(i)) + L_r(U(t), U'(t)) + L_c(U(i), U(t))

wherein U'(i) and U'(t) denote the reconstructed representations of the fine-grained features of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error, thereby obtaining fine-grained feature representations F(i) and F(t) containing intra-modal and inter-modal associations.
7. The method of claim 6, wherein the fusing of the coarse-grained representation and the fine-grained representation of each modality using a joint restricted Boltzmann machine defines the following joint distribution:

P(v_1, v_2) = Σ_{h_1(1), h_2(1), h(2)} P(v_1 | h_1(1)) P(v_2 | h_2(1)) P(h_1(1), h_2(1), h(2))

wherein h_1(1) and h_2(1) respectively denote two hidden layers of the joint restricted Boltzmann machine and h(2) denotes the joint layer therein; for images, v_1 denotes the coarse-grained feature representation C(i) of the image and v_2 denotes the fine-grained feature representation F(i) of the image; for text, the same joint distribution as defined above is used, with v_1 denoting the coarse-grained feature representation C(t) of the text and v_2 denoting the fine-grained feature representation F(t) of the text, thereby obtaining single-modality feature representations S(i) and S(t) containing both coarse-grained and fine-grained information.
8. The method of claim 7, wherein the use of a multi-task learning framework to model the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint is as follows: for the inter-modal pairwise similarity constraint, a neighbor graph G = (V, E) is first constructed for all image and text data, where V represents image or text data and E represents the similarity relationship between image and text data, defined as follows:

E_pq = 1 if y_p(i) = y_q(t), and E_pq = 0 otherwise,

wherein y_p(i) and y_q(t) denote the labels of the image and text data; the following contrastive loss function is then defined to model the pairwise similar and dissimilar constraints:

L_con = Σ_{p,q} [ E_pq · d(S_p(i), S_q(t)) + (1 − E_pq) · max(0, α − d(S_p(i), S_q(t))) ]

wherein S_p(i) and S_q(t) denote the single-modality feature representations of the images and texts (S(i) and S(t)), d(·,·) denotes the distance between two representations, and the margin parameter is set to α;

then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n represents the number of categories, and the following cross-entropy loss function is defined:

L_ce = − Σ_{i=1..n} p_i log p'_i

wherein p'_i represents the predicted distribution probability and p_i represents the target distribution probability; the semantic discrimination capability of the unified representation is enhanced by minimizing this loss function; finally, through the multi-task learning framework, the learning processes of the intra-modal semantic category constraint and the inter-modal pairwise association constraint are dynamically balanced, and a more accurate cross-modal unified representation M(i) and M(t) is finally obtained.
9. The method as claimed in claim 1, wherein the step (3) adopts cosine distance, and measures the similarity of two modal data by calculating cosine value of the included angle of the unified characterization vector of the two modal data; or step (3) adopts Euclidean distance to measure similarity.
CN201710378513.XA 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network Active CN107346328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710378513.XA CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710378513.XA CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Publications (2)

Publication Number Publication Date
CN107346328A CN107346328A (en) 2017-11-14
CN107346328B (en) 2020-09-08

Family

ID=60253337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710378513.XA Active CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Country Status (1)

Country Link
CN (1) CN107346328B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189968B (en) * 2018-08-31 2020-07-03 深圳大学 Cross-modal retrieval method and system
CN109271486B (en) * 2018-09-19 2021-11-26 九江学院 Similarity-preserving cross-modal Hash retrieval method
CN112116095B (en) * 2019-06-19 2024-05-24 北京搜狗科技发展有限公司 Method and related device for training multi-task learning model
CN110457516A (en) * 2019-08-12 2019-11-15 桂林电子科技大学 A kind of cross-module state picture and text search method
CN110781319B (en) * 2019-09-17 2022-06-21 北京邮电大学 Common semantic representation and search method and device for cross-media big data
CN110807465B (en) 2019-11-05 2020-06-30 北京邮电大学 Fine-grained image identification method based on channel loss function
CN111275130B (en) * 2020-02-18 2023-09-08 上海交通大学 Multi-mode-based deep learning prediction method, system, medium and equipment
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 Multi-mode emotion feature learning and identifying method based on attention mechanism
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112819052B (en) * 2021-01-25 2021-12-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112990048B (en) * 2021-03-26 2021-11-23 中科视语(北京)科技有限公司 Vehicle pattern recognition method and device
CN113516286B (en) * 2021-05-14 2024-05-10 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN114064967B (en) * 2022-01-18 2022-05-06 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114219049B (en) * 2022-02-22 2022-05-10 天津大学 Fine-grained curbstone image classification method and device based on hierarchical constraint
CN116012679B (en) * 2022-12-19 2023-06-16 中国科学院空天信息创新研究院 Self-supervision remote sensing representation learning method based on multi-level cross-modal interaction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701227A (en) * 2016-01-15 2016-06-22 北京大学 Cross-media similarity measure method and search method based on local association graph
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776710B2 (en) * 2015-03-24 2020-09-15 International Business Machines Corporation Multimodal data fusion by hierarchical multi-view dictionary learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701227A (en) * 2016-01-15 2016-06-22 北京大学 Cross-media similarity measure method and search method based on local association graph
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure

Also Published As

Publication number Publication date
CN107346328A (en) 2017-11-14

Similar Documents

Publication Publication Date Title
CN107346328B (en) Cross-modal association learning method based on multi-granularity hierarchical network
CN107562812B (en) Cross-modal similarity learning method based on specific modal semantic space modeling
Peng et al. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges
US10504010B2 (en) Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN113971209B (en) Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN116975615A (en) Task prediction method and device based on video multi-mode information
Zhang et al. Deep unsupervised self-evolutionary hashing for image retrieval
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
He et al. Category alignment adversarial learning for cross-modal retrieval
CN114743029A (en) Image text matching method
CN112528062B (en) Cross-modal weapon retrieval method and system
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
CN109670071B (en) Serialized multi-feature guided cross-media Hash retrieval method and system
Su et al. Semantically guided projection for zero-shot 3D model classification and retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant