CN107346328B - Cross-modal association learning method based on multi-granularity hierarchical network - Google Patents

Cross-modal association learning method based on multi-granularity hierarchical network

Info

Publication number
CN107346328B
CN107346328B (application CN201710378513.XA)
Authority
CN
China
Prior art keywords
data
modal
representation
grained
text
Prior art date
Legal status
Active
Application number
CN201710378513.XA
Other languages
Chinese (zh)
Other versions
CN107346328A (en)
Inventor
Peng Yuxin (彭宇新)
Qi Jinwei (綦金玮)
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN201710378513.XA
Publication of CN107346328A
Application granted
Publication of CN107346328B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, which comprises the following steps: 1. Establish a cross-modal database containing multiple modality types, divide the data in the database into a training set, a validation set and a test set, perform blocking processing on the data of the different modalities, and extract feature vectors for the original data and the blocked data of every modality. 2. Train a multi-granularity hierarchical network structure with the original data and the blocked data, learning a unified representation for the different modality data. 3. Obtain the unified representations of the different modality data with the trained multi-granularity hierarchical network structure and use them to compute the similarity of data across modalities. 4. Take any one modality type in the test set as the query modality and another modality type as the target modality, compute the similarity between each query sample and the query targets, and obtain a list of relevant target-modality results ordered by similarity. The invention can improve the accuracy of cross-modal retrieval.

Description

Cross-modal association learning method based on multi-granularity hierarchical network
Technical Field
The invention relates to the field of multimedia retrieval, in particular to a cross-modal association learning method based on a multi-granularity hierarchical network.
Background
In recent years, with the rapid development of computer technology, information acquisition and processing have been changed from a single modality form of text, image, audio, video, and the like to a form in which multiple modalities are fused with each other. Multimodal retrieval has become an important issue in the field of information retrieval, and has wide application in both search engines and big data management. The traditional retrieval mode is mainly a single mode form, namely, a user submits data of one mode type as a query, and a retrieval system returns retrieval results of the same mode, such as image retrieval, text retrieval and the like. This retrieval approach does not directly measure the similarity between different modality data, such as the similarity of an image to an audio clip, and therefore limits the flexibility of retrieval. In order to solve the above problems, cross-modal retrieval becomes a new research hotspot, which can retrieve relevant results containing multiple modal types according to data of any modal type uploaded by a user as a query. Compared with the traditional single-mode retrieval, the cross-mode retrieval can provide more flexible and practical retrieval experience.
A key problem in cross-modal retrieval is how to learn the intrinsic correlations between different modalities. Because the distribution characteristics and feature representations of different modality data are inconsistent, measuring cross-modal similarity is very challenging. Existing cross-modal retrieval methods mainly learn a unified space for the different modality data: the feature representations of the different modalities are mapped from their original single-modality spaces into a unified cross-modal space, yielding unified representations whose cross-modal similarity can be measured directly. Existing methods can be divided into two main categories. The first learns linear mappings under a traditional framework, and includes methods based on canonical correlation analysis (CCA), which analyze the pairwise correlations of different modality data, map them into a common subspace of the same dimension, and maximize the correlation between paired data. There are also methods based on graph regularization; for example, Zhai et al., in the document "Learning Cross-Media Joint Representation with Sparse and Semi-Supervised Regularization", propose a cross-modal retrieval method based on sparse and semi-supervised regularization, which constructs graph models for the different modality data and performs cross-modal association learning and high-level semantic abstraction simultaneously.
The second category comprises cross-modal unified representation learning methods based on deep neural networks; the main idea is to exploit the strong modeling capability of deep neural networks to analyze and mine the complex cross-modal association relationships. For example, Ngiam et al., in the document "Multimodal Deep Learning", propose a multimodal autoencoder that takes data of two modalities as inputs, models the cross-modal correlation information in an intermediate layer, and simultaneously models the reconstruction errors of both. Feng et al., in the document "Cross-modal Retrieval with Correspondence Autoencoder", propose the correspondence autoencoder (Corr-AE), which constructs two networks connected by a coding layer and models the correlation information and the reconstruction information simultaneously. Most existing deep-network-based cross-modal retrieval methods can be divided into two learning stages: the first stage learns a separate feature representation for each modality, and the second stage learns the cross-modal unified representation. However, existing methods have three limitations. First, in the first stage they model only the intra-modal association relationships and ignore the complementary role of inter-modal associations in learning the separate feature representations. Second, in the second stage they use only a single loss function as a constraint and cannot fully balance the intra-modal and inter-modal association learning processes. Third, they consider only the original data of the different modalities and ignore the rich fine-grained information provided by the parts inside each data item, so they cannot fully mine the cross-modal association relationships.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a cross-modal association learning method based on a multi-granularity hierarchical network. The hierarchical network structure fully mines the multi-level association relationships within and between modalities, while a multi-task framework dynamically balances the learning of intra-modal semantic category constraints and inter-modal pairwise similarity constraints. In addition, modeling the multi-granularity information of the different modality data improves the accuracy of cross-modal retrieval.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a cross-modal association learning method based on a multi-granularity hierarchical network is used for comprehensively modeling multi-granularity information of cross-modal data and association information in and among modalities to obtain uniform representations of different modal data, so that cross-modal retrieval is realized, and the method comprises the following steps:
(1) establishing a cross-modal database containing various modal types, dividing data in the cross-modal database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the cross-modal database, and extracting original data of all the modes and feature vectors of the blocked data;
(2) training a multi-granularity hierarchical network structure by utilizing original data and partitioned data, and learning unified representation for different modal data through the multi-granularity hierarchical network structure;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) using any one modality type in the test set as the query modality and another modality type as the target modality, taking each data item of the query modality as a query sample, retrieving data in the target modality, calculating the similarity between the query sample and the query targets, and obtaining a relevant result list of the target modality data according to the similarity.
Further, in the above cross-modality association learning method based on the multi-granularity hierarchical network, the cross-modality database in step (1) may contain a plurality of modality types, such as images, texts, and the like.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the blocking processing of the different modality data in the database in step (1) may use a different blocking method for each modality to divide the original data into several parts. Specifically, for image data, a selective search algorithm extracts several candidate regions that contain rich fine-grained information such as visual objects; for text data, the text is divided into several parts in units of sentences. Other blocking methods are also supported, such as segmenting an image into 2 × 2 or 4 × 4 regions, or segmenting text by phrases.
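As an illustration only, the simpler blocking variants of this step could look like the following sketch, which splits a document into sentence units with a basic punctuation rule and cuts an image array into a 2 × 2 grid; the selective-search region extraction itself is not reproduced here, and all function names and shapes are illustrative assumptions rather than the patent's implementation.

```python
import re
import numpy as np

def split_text_into_sentences(document):
    """Block a text document into sentence units (simple punctuation rule)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

def split_image_into_grid(image, rows=2, cols=2):
    """Block an image (H x W x C array) into rows x cols regions."""
    h, w = image.shape[0] // rows, image.shape[1] // cols
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

print(split_text_into_sentences("A dog runs. It catches a ball! Then it rests."))
print(len(split_image_into_grid(np.zeros((224, 224, 3)))))   # 4 regions
```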
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the feature vectors in step (1) are specifically: for text data, word-frequency feature vectors are extracted; for image data, convolutional neural network feature vectors are extracted. Other kinds of features are also supported, such as bag-of-words feature vectors for images and latent Dirichlet allocation feature vectors for text.
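A small sketch of the text word-frequency features follows, using scikit-learn's CountVectorizer as one possible implementation; the CNN image features are assumed to be extracted elsewhere (for example from a pretrained network) and are only stubbed here with random values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Word-frequency feature vectors for text data
texts = ["a dog runs in the park", "a cat sleeps on the sofa"]
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(texts).toarray()   # shape: (2, vocab_size)

# Placeholder for CNN image features: in practice these would come from a
# pretrained convolutional network; here they are stubbed with random values.
image_features = np.random.rand(2, 4096)

print(text_features.shape, image_features.shape)
```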
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, step (2) uses a multi-pathway network structure: the blocked data of the different modalities are used to fully mine the multi-granularity information within the data; the intra-modal and inter-modal association relationships of the cross-modal data are modeled to obtain the separate feature representation of each single modality; and a multi-task learning framework is built to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, finally obtaining the cross-modal unified representation.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the distance measure in step (3) is the cosine distance: the similarity of two modality data items is measured by the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the retrieval in step (4) uses one modality type in the test set as the query modality and another modality type as the target modality. Each data item of the query modality in the test set is taken as a query sample; its similarity to all data of the target modality in the test set is computed as in step (3), and the similarities are sorted in descending order to obtain the relevant result list.
The invention has the following effects: compared with existing methods, the method fully mines the multi-granularity information of the different modality data, models the intra-modal and inter-modal association relationships to learn the separate feature representation of each single modality, and further adopts a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
The reason the method achieves the above effect is as follows: for the two stages of single-modality separate feature representation learning and cross-modal unified representation learning, a hierarchical network structure fully models the association relationships within and between modalities. On the one hand, in learning the single-modality separate feature representations, the multi-granularity feature representations of the different modality data are fused and intra-modal and inter-modal association learning is jointly optimized. On the other hand, in learning the cross-modal unified representation, a multi-task learning framework dynamically balances the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
Drawings
FIG. 1 is a flow chart of a cross-modal association learning method based on a multi-granularity hierarchical network according to the present invention.
Fig. 2 is a schematic diagram of the complete network architecture of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, the flow of which is shown in figure 1, and the method comprises the following steps:
(1) establishing a cross-modal database containing multiple modal types, dividing the database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the database, and extracting all modal original data and feature vectors of the blocked data.
In this embodiment, the cross-modal database may contain a plurality of modality types, and a different blocking method is used for each modality to divide the original data into several parts. Taking images and text as examples, a selective search algorithm extracts from each image several candidate regions containing rich fine-grained information such as visual objects; each text document is divided into several parts in units of sentences. Further, the feature vectors of the two modality types are extracted as follows: word-frequency feature vectors are extracted for text data, and deep convolutional neural network feature vectors are extracted for image data. The framework of the method also supports other modality types, such as audio and video, and other kinds of features, such as bag-of-words feature vectors for images and latent Dirichlet allocation feature vectors for text.
The cross-modal dataset is denoted by D = {D(i), D(t)}, where D(r) = {x_p(r)} (p = 1, ..., n(r)) denotes the data of media type r, with r ∈ {i, t} (i denotes image, t denotes text) and n(r) the number of data items of that type. Each data item in the training set has one and only one semantic category. x_p(r) denotes the feature vector of the p-th data item of media type r; it is a d(r) × 1 vector, where d(r) denotes the feature dimension of media type r. y_p(r) denotes the semantic label of x_p(r); it is a c × 1 vector, where c denotes the total number of semantic categories. Exactly one dimension of y_p(r) is 1 and the rest are 0, indicating that the semantic category of the data item is the label corresponding to the dimension whose value is 1.
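To make the label convention concrete, here is a tiny sketch of building the c × 1 one-hot semantic label vector y_p(r) described above; the category index and dimensions are illustrative only.

```python
import numpy as np

def one_hot_label(category_index, num_categories):
    """Build the c-dimensional semantic label vector: one dimension is 1, the rest 0."""
    y = np.zeros(num_categories)
    y[category_index] = 1.0
    return y

print(one_hot_label(category_index=3, num_categories=10))
```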
(2) Train a multi-granularity hierarchical network structure with the original data and the blocked data, and learn a unified representation for the different modality data.
The process of this step is shown in Fig. 2, in which the circles represent hidden units of the neural network and the dotted lines represent connections between hidden units of two adjacent layers. In this embodiment, a two-pathway network models the original image and text data. First, two deep belief networks (DBNs) model the feature distributions of images and text respectively, using the following probability distribution formulas:

P(v_i) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_i | h(1))

P(v_t) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_t | h(1))

where h(1) and h(2) denote the two hidden layers of a DBN, v_i denotes image data and v_t denotes text data. This yields feature representations Q(i) and Q(t) that contain intra-modal high-level semantic information. The two networks are then connected by a shared coding layer, the intra-modal and inter-modal associations of the image and text data are modeled simultaneously, and the reconstruction learning error and the correlation learning error are jointly optimized by minimizing the following loss function:

L = L_r(Q(i), Q'(i)) + L_r(Q(t), Q'(t)) + L_c(Q(i), Q(t))

where Q'(i) and Q'(t) denote the reconstructed representation of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error. This yields coarse-grained feature representations C(i) and C(t) that contain intra-modal and inter-modal associations, where C_p(i) and C_p(t) denote the coarse-grained feature representation of the p-th data item of the image and text media types respectively.
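The joint objective above can be pictured with a small numerical sketch. The snippet below is only an illustration under assumed shapes and an assumed squared-error form for both L_r and L_c; the single linear map standing in for the shared coding layer is a hypothetical simplification, not the patent's exact network.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    # L_r: squared reconstruction error (assumed form)
    return np.sum((x - x_hat) ** 2)

def correlation_error(q_img, q_txt):
    # L_c: distance between the two modalities' representations (assumed form)
    return np.sum((q_img - q_txt) ** 2)

# Q(i), Q(t): DBN outputs for one image/text pair (toy values)
q_img = np.random.rand(128)
q_txt = np.random.rand(128)

# Hypothetical shared coding layer: one linear encoder/decoder shared by both pathways
W_enc = np.random.randn(64, 128) * 0.01
W_dec = np.random.randn(128, 64) * 0.01

code_img, code_txt = W_enc @ q_img, W_enc @ q_txt          # shared codes
q_img_hat, q_txt_hat = W_dec @ code_img, W_dec @ code_txt   # reconstructions

loss = (reconstruction_error(q_img, q_img_hat)
        + reconstruction_error(q_txt, q_txt_hat)
        + correlation_error(code_img, code_txt))
print(f"joint coarse-grained loss: {loss:.3f}")
```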
In this embodiment, another two-pathway network models the fine-grained image and text data. Specifically, two deep belief networks (DBNs) model the fine-grained image and text data, and an average-fusion strategy yields feature representations U(i) and U(t) that contain intra-modal fine-grained information. A shared coding layer is then constructed to connect the two networks, and the intra-modal and inter-modal associations of the fine-grained features of images and text are modeled simultaneously by minimizing the following loss function:

L = L_r(U(i), U'(i)) + L_r(U(t), U'(t)) + L_c(U(i), U(t))

where U'(i) and U'(t) denote the reconstructed representations of the fine-grained features of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error. This yields fine-grained feature representations F(i) and F(t) that contain intra-modal and inter-modal associations, where F_p(i) and F_p(t) denote the fine-grained feature representation of the p-th data item of the image and text media types respectively.
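A minimal sketch of the average-fusion step, assuming the fine-grained blocks (image regions or text sentences) have already been encoded into fixed-length vectors; the names and shapes are illustrative only.

```python
import numpy as np

def average_fusion(block_features):
    """Fuse the per-block feature vectors of one data item by averaging.

    block_features: array of shape (num_blocks, feature_dim), e.g. one row per
    image region from selective search or per sentence of a text document.
    """
    block_features = np.asarray(block_features, dtype=np.float64)
    return block_features.mean(axis=0)

# Toy example: 5 image regions with 4096-dimensional CNN features each
region_features = np.random.rand(5, 4096)
u_img = average_fusion(region_features)   # U(i): fine-grained image representation
print(u_img.shape)                        # (4096,)
```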
In this embodiment, a joint restricted Boltzmann machine (RBM) fuses the coarse-grained representation and the fine-grained representation of each modality (C(i) with F(i) for images, and C(t) with F(t) for text). Specifically, the following joint distribution is defined:

P(v_1, v_2) = Σ_{h_1(1), h_2(1), h(2)} P(v_1 | h_1(1)) P(v_2 | h_2(1)) P(h_1(1), h_2(1), h(2))

where h_1(1) and h_2(1) denote the two hidden layers of the joint restricted Boltzmann machine and h(2) denotes its joint layer. For images, v_1 denotes the coarse-grained feature representation C(i) of the image and v_2 denotes the fine-grained feature representation F(i) of the image; for text, the same joint distribution is used, with v_1 denoting the coarse-grained feature representation C(t) of the text and v_2 denoting the fine-grained feature representation F(t) of the text. This yields single-modality feature representations S(i) and S(t) that contain both coarse-grained and fine-grained information, where S_p(i) and S_p(t) denote the single-modality feature representation of the p-th data item of the image and text media types respectively.
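The fusion can be pictured with the following simplified sketch, which replaces the joint RBM by a plain feed-forward joint layer purely for illustration; it is not the patent's RBM inference or training procedure, and all names and dimensions are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(coarse, fine, W1, W2, b):
    """Illustrative joint layer: map the coarse-grained and fine-grained
    representations of one modality into a single fused representation.
    (A stand-in for the joint RBM's joint layer, not its actual inference.)"""
    return sigmoid(W1 @ coarse + W2 @ fine + b)

dim_c, dim_f, dim_s = 64, 64, 128
W1 = np.random.randn(dim_s, dim_c) * 0.01
W2 = np.random.randn(dim_s, dim_f) * 0.01
b = np.zeros(dim_s)

c_img = np.random.rand(dim_c)            # C(i): coarse-grained image representation
f_img = np.random.rand(dim_f)            # F(i): fine-grained image representation
s_img = fuse(c_img, f_img, W1, W2, b)    # S(i): single-modality representation
print(s_img.shape)                       # (128,)
```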
In this embodiment, a multi-task learning framework models the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint. Specifically, for the inter-modal pairwise similarity constraint, a neighbor graph G = (V, E) is first constructed over all image and text data, where V denotes the image and text data and E denotes the similarity relationship between image and text data, defined as follows:

E_pq = 1 if y_p(i) = y_q(t), and E_pq = 0 otherwise,

where y_p(i) and y_q(t) denote the labels of the image and text data. The following contrastive loss function is then defined to model the pairwise similar and dissimilar constraints:

L_con = Σ_{p,q} [ E_pq · d(S_p(i), S_q(t)) + (1 − E_pq) · max(0, α − d(S_p(i), S_q(t))) ]

where S_p(i) and S_q(t) denote the single-modality feature representations of images and text (S(i) and S(t)), d(·,·) denotes the distance between two representations, and the margin parameter is set to α.

Then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n denotes the number of categories, and the following cross-entropy loss function is defined:

L_ce = − Σ_{i=1..n} p_i log p'_i

where p'_i denotes the predicted distribution probability and p_i denotes the target distribution probability. Minimizing this loss function enhances the semantic discrimination capability of the unified representation.

Finally, through the multi-task learning framework, the learning processes of the intra-modal semantic category constraint and the inter-modal pairwise association constraint are dynamically balanced, and a more accurate cross-modal unified representation M(i) and M(t) is finally obtained, where M_p(i) and M_p(t) denote the cross-modal unified representation of the p-th data item of the image and text media types respectively.
(3) Calculate the similarity of different modality data using the unified representations obtained from the trained multi-granularity hierarchical network structure.
After the deep network is trained, data of different media are mapped through it to unified representations of the same dimension, and the similarity of two data items is defined as a distance measure between their unified representations. In this embodiment, the cosine distance is adopted: the similarity of two modality data items is measured by the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
(4) Any one modality type in the test set is used as the query modality, and another modality type is used as the target modality. Each data item of the query modality is taken as a query sample to retrieve data in the target modality; the similarity between the query sample and each query target is computed as in step (3), and the similarities are sorted in descending order to obtain the relevant result list of the target modality data.
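A minimal retrieval sketch under the cosine-distance choice above; the unified representations are assumed to be precomputed matrices with one row per item, and all names and sizes are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two unified representation vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_vec, target_matrix):
    """Rank all target-modality items for one query sample, highest similarity first."""
    sims = np.array([cosine_similarity(query_vec, t) for t in target_matrix])
    order = np.argsort(-sims)                 # descending similarity
    return order, sims[order]

# Toy unified representations: one image query against 500 text items
m_img = np.random.rand(128)          # M(i) of the query sample
m_txt = np.random.rand(500, 128)     # M(t) of the target modality (toy size)
ranked_indices, ranked_sims = retrieve(m_img, m_txt)
print(ranked_indices[:5], ranked_sims[:5])
```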
The following experimental results show that compared with the existing method, the cross-modal association learning method based on the multi-granularity hierarchical network can achieve higher retrieval accuracy.
This example was conducted on the Wikipedia cross-modal dataset, which was proposed in the document "A New Approach to Cross-Modal Multimedia Retrieval" (authors N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy and N. Vasconcelos, published at the ACM International Conference on Multimedia in 2010). The dataset contains 2866 texts and 2866 images in one-to-one correspondence, divided into 10 categories in total; 2173 texts and 2173 images are used as the training set, 231 texts and 231 images as the validation set, and 462 texts and 462 images as the test set. The following 3 methods were tested as experimental comparisons:
the prior method comprises the following steps: joint Representation Learning (JRL) method in the document "Learning Cross-Media Joint retrieval with spark and Semi-supervisory reconstruction" (author x.zhai, y.peng, and j.xiao), constructs a graph model for different modal data, performs Cross-modal associative Learning and high-level semantic abstraction simultaneously, and introduces sparse and Semi-Supervised conventions.
Existing method 2: the multimodal autoencoder (Bimodal AE) method in the document "Multimodal Deep Learning" (authors J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng), which takes data of multiple media types as input and models the cross-modal correlation information in an intermediate layer to obtain a unified representation, while requiring the network to reconstruct the original feature inputs from that representation, so that the correlation information between the different media is learned effectively and the reconstruction information within each medium is retained.
Existing method 3: the correspondence autoencoder (Corr-AE) method in the document "Cross-modal Retrieval with Correspondence Autoencoder" (authors F. Feng, X. Wang, and R. Li), which constructs two networks connected at an intermediate layer and models the correlation information and the reconstruction information simultaneously.
The invention: the method of the present embodiment.
The experiment adopts the MAP (mean average precision) metric, which is commonly used in the information retrieval field, to evaluate the accuracy of cross-modal retrieval. MAP is the mean of the average precision of each query sample; the larger the MAP value, the better the cross-modal retrieval result.
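For reference, a small sketch of how a MAP score can be computed from ranked retrieval results; here `relevant` flags whether each returned item shares the query's category, and all names are illustrative.

```python
import numpy as np

def average_precision(relevant):
    """Average precision for one query given a ranked 0/1 relevance list."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    precisions = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precisions * relevant).sum() / relevant.sum())

def mean_average_precision(relevance_lists):
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Toy example: two queries with ranked relevance judgements
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))
```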
Table 1. Experimental results of the invention (MAP scores)

Method               Image query text   Text query image   Average
Existing method 1    0.453              0.400              0.427
Existing method 2    0.314              0.290              0.302
Existing method 3    0.402              0.395              0.399
The invention        0.504              0.457              0.481
As can be seen from Table 1, the method of the invention clearly improves on the existing methods in both the image-query-text and text-query-image tasks. Existing method 1 constructs a graph model under a traditional framework and linearly maps the different modality data into a unified space, so it can hardly model the complex cross-modal association relationships fully. Existing methods 2 and 3 both adopt a deep network structure, but they use only the original data of the different modality types and learn the cross-modal unified representation through a simple network structure. The present method, on the one hand, fuses the multi-granularity feature representations of the different modality data and jointly optimizes intra-modal and inter-modal association learning to obtain the single-modality separate feature representations; on the other hand, it adopts a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints to obtain the cross-modal unified representation, thereby improving the accuracy of cross-modal retrieval.
In other embodiments, in the cross-modal unified characterization learning method in step (2), a Deep Belief Network (DBN) is used to model original and fine-grained image and text data, and a Stacked auto-encoder (SAE) may also be used as a substitute.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A cross-modal association learning method based on a multi-granularity hierarchical network comprises the following steps:
(1) establishing a cross-modal database containing various modal types, dividing data in the cross-modal database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the cross-modal database, and extracting feature vectors of original data of all the modes and the blocked data;
(2) training a multi-granularity hierarchical network structure by utilizing the original data and the blocked data, and learning a unified representation for the different modality data through the multi-granularity hierarchical network structure, which comprises: modeling the original image and text data by using two networks, modeling the fine-grained image and text data by using two networks, fusing the coarse-grained representation and the fine-grained representation of each modality by using a joint restricted Boltzmann machine, and finally modeling the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint by using a multi-task learning framework;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) and using any one mode type in the test set as a query mode, using the other mode type as a target mode, using each data of the query mode as a query sample, retrieving data in the target mode, calculating the similarity between the query sample and the query target, and obtaining a related result list of the target mode data according to the similarity.
2. The method of claim 1, wherein the cross-modality database contains a plurality of modality types including images, text.
3. The method according to claim 1, wherein the step (1) employs different chunking processing methods for different modality data to slice the original data into a plurality of parts, wherein a plurality of candidate regions containing rich fine-grained information are extracted using a selective search algorithm for the image data, or the image is sliced into 2 x 2 or 4 x 4 regions; for text data, the text data is divided into a plurality of pieces in units of sentences, or the text is divided into phrases.
4. The method of claim 1, wherein the feature vectors extracted in step (1) are: for text data, word-frequency feature vectors or latent Dirichlet allocation feature vectors; for image data, convolutional neural network feature vectors or bag-of-words feature vectors.
5. The method of claim 1, wherein the modeling of the original image and text data using two networks first models the feature distributions of images and text respectively with two deep belief networks, using the following probability distribution formulas:

P(v_i) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_i | h(1))

P(v_t) = Σ_{h(1), h(2)} P(h(2), h(1)) P(v_t | h(1))

wherein h(1) and h(2) denote two hidden layers of a DBN, v_i denotes image data and v_t denotes text data, thereby obtaining feature representations Q(i) and Q(t) containing intra-modal high-level semantic information; the two networks are then connected by a shared coding layer, the intra-modal and inter-modal associations of the image and text data are modeled simultaneously, and the reconstruction learning error and the correlation learning error are jointly optimized by minimizing the following loss function:

L = L_r(Q(i), Q'(i)) + L_r(Q(t), Q'(t)) + L_c(Q(i), Q(t))

wherein Q'(i) and Q'(t) denote the reconstructed representation of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error, thereby obtaining coarse-grained feature representations C(i) and C(t) containing intra-modal and inter-modal associations.
6. The method of claim 5, wherein the modeling of the fine-grained image and text data using two networks models the fine-grained image and text data with two deep belief networks (DBNs), and an average fusion strategy is adopted to obtain feature representations U(i) and U(t) containing intra-modal fine-grained information; a shared coding layer is then constructed to connect the two networks, and the intra-modal and inter-modal associations of the fine-grained features of the images and texts are modeled simultaneously by minimizing the following loss function:

L = L_r(U(i), U'(i)) + L_r(U(t), U'(t)) + L_c(U(i), U(t))

wherein U'(i) and U'(t) denote the reconstructed representations of the fine-grained features of each modality, L_r denotes the reconstruction learning error and L_c denotes the correlation learning error, thereby obtaining fine-grained feature representations F(i) and F(t) containing intra-modal and inter-modal associations.
7. The method of claim 6, wherein the fusing of the coarse-grained representation and the fine-grained representation of each modality using a joint restricted Boltzmann machine defines the following joint distribution:

P(v_1, v_2) = Σ_{h_1(1), h_2(1), h(2)} P(v_1 | h_1(1)) P(v_2 | h_2(1)) P(h_1(1), h_2(1), h(2))

wherein h_1(1) and h_2(1) respectively denote two hidden layers of the joint restricted Boltzmann machine and h(2) denotes the joint layer therein; for images, v_1 denotes the coarse-grained feature representation C(i) of the image and v_2 denotes the fine-grained feature representation F(i) of the image; for text, the same joint distribution as defined above is used, with v_1 denoting the coarse-grained feature representation C(t) of the text and v_2 denoting the fine-grained feature representation F(t) of the text, thereby obtaining single-modality feature representations S(i) and S(t) containing both coarse-grained and fine-grained information.
8. The method of claim 7, wherein the use of a multi-task learning framework to model the intra-modal semantic category constraint and the inter-modal pairwise similarity constraint is as follows: for the inter-modal pairwise similarity constraint, a neighbor graph G = (V, E) is first constructed for all image and text data, where V represents image or text data and E represents the similarity relationship between image and text data, defined as follows:

E_pq = 1 if y_p(i) = y_q(t), and E_pq = 0 otherwise,

wherein y_p(i) and y_q(t) denote the labels of the image and text data; the following contrastive loss function is then defined to model the pairwise similar and dissimilar constraints:

L_con = Σ_{p,q} [ E_pq · d(S_p(i), S_q(t)) + (1 − E_pq) · max(0, α − d(S_p(i), S_q(t))) ]

wherein S_p(i) and S_q(t) denote the single-modality feature representations of the images and texts (S(i) and S(t)), d(·,·) denotes the distance between two representations, and the margin parameter is set to α;

then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n represents the number of categories, and the following cross-entropy loss function is defined:

L_ce = − Σ_{i=1..n} p_i log p'_i

wherein p'_i represents the predicted distribution probability and p_i represents the target distribution probability; the semantic discrimination capability of the unified representation is enhanced by minimizing this loss function; finally, through the multi-task learning framework, the learning processes of the intra-modal semantic category constraint and the inter-modal pairwise association constraint are dynamically balanced, and a more accurate cross-modal unified representation M(i) and M(t) is finally obtained.
9. The method as claimed in claim 1, wherein the step (3) adopts cosine distance, and measures the similarity of two modal data by calculating cosine value of the included angle of the unified characterization vector of the two modal data; or step (3) adopts Euclidean distance to measure similarity.
CN201710378513.XA 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network Active CN107346328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710378513.XA CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710378513.XA CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Publications (2)

Publication Number Publication Date
CN107346328A CN107346328A (en) 2017-11-14
CN107346328B (en) 2020-09-08

Family

ID=60253337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710378513.XA Active CN107346328B (en) 2017-05-25 2017-05-25 Cross-modal association learning method based on multi-granularity hierarchical network

Country Status (1)

Country Link
CN (1) CN107346328B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189968B (en) * 2018-08-31 2020-07-03 深圳大学 Cross-modal retrieval method and system
CN109271486B (en) * 2018-09-19 2021-11-26 九江学院 Similarity-preserving cross-modal Hash retrieval method
CN112116095B (en) * 2019-06-19 2024-05-24 北京搜狗科技发展有限公司 Method and related device for training multi-task learning model
CN110457516A (en) * 2019-08-12 2019-11-15 桂林电子科技大学 A kind of cross-module state picture and text search method
CN110781319B (en) * 2019-09-17 2022-06-21 北京邮电大学 Common semantic representation and search method and device for cross-media big data
CN110807465B (en) 2019-11-05 2020-06-30 北京邮电大学 Fine-grained image identification method based on channel loss function
CN111275130B (en) * 2020-02-18 2023-09-08 上海交通大学 Multi-mode-based deep learning prediction method, system, medium and equipment
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 Multi-mode emotion feature learning and identifying method based on attention mechanism
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112819052B (en) * 2021-01-25 2021-12-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112990048B (en) * 2021-03-26 2021-11-23 中科视语(北京)科技有限公司 Vehicle pattern recognition method and device
CN113516286B (en) * 2021-05-14 2024-05-10 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN114064967B (en) * 2022-01-18 2022-05-06 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114219049B (en) * 2022-02-22 2022-05-10 天津大学 Fine-grained curbstone image classification method and device based on hierarchical constraint
CN116012679B (en) * 2022-12-19 2023-06-16 中国科学院空天信息创新研究院 Self-supervision remote sensing representation learning method based on multi-level cross-modal interaction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701227A (en) * 2016-01-15 2016-06-22 北京大学 Cross-media similarity measure method and search method based on local association graph
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776710B2 (en) * 2015-03-24 2020-09-15 International Business Machines Corporation Multimodal data fusion by hierarchical multi-view dictionary learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701227A (en) * 2016-01-15 2016-06-22 北京大学 Cross-media similarity measure method and search method based on local association graph
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure

Also Published As

Publication number Publication date
CN107346328A (en) 2017-11-14

Similar Documents

Publication Publication Date Title
CN107346328B (en) Cross-modal association learning method based on multi-granularity hierarchical network
CN107562812B (en) Cross-modal similarity learning method based on specific modal semantic space modeling
Peng et al. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges
US10504010B2 (en) Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN113971209B (en) Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN116975615A (en) Task prediction method and device based on video multi-mode information
Zhang et al. Deep unsupervised self-evolutionary hashing for image retrieval
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
He et al. Category alignment adversarial learning for cross-modal retrieval
CN114743029A (en) Image text matching method
CN112528062B (en) Cross-modal weapon retrieval method and system
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
CN109670071B (en) Serialized multi-feature guided cross-media Hash retrieval method and system
Su et al. Semantically guided projection for zero-shot 3D model classification and retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant