CN107346328B - Cross-modal association learning method based on multi-granularity hierarchical network - Google Patents
- Publication number
- CN107346328B (application CN201710378513.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- modal
- representation
- grained
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, comprising the following steps. 1. Establish a cross-modal database containing multiple modality types, divide its data into a training set, a validation set and a test set, perform blocking processing on the data of the different modalities, and extract feature vectors of the original data and the blocked data of all modalities. 2. Train a multi-granularity hierarchical network structure with the original and blocked data, learning a unified representation for the data of different modalities. 3. Use the trained multi-granularity hierarchical network structure to obtain the unified representations of different modality data, and from them compute cross-modal similarities. 4. Take any one modality type in the test set as the query modality and another as the target modality, compute the similarity between each query sample and the query targets, and obtain the related result list of target-modality data ordered by similarity. The invention can improve the accuracy of cross-modal retrieval.
Description
Technical Field
The invention relates to the field of multimedia retrieval, in particular to a cross-modal association learning method based on a multi-granularity hierarchical network.
Background
In recent years, with the rapid development of computer technology, information acquisition and processing have shifted from single-modality forms such as text, image, audio and video to forms in which multiple modalities are fused. Multimedia retrieval has therefore become an important problem in the field of information retrieval, with wide application in search engines and big-data management. Traditional retrieval is mainly single-modality: a user submits data of one modality type as a query, and the retrieval system returns results of the same modality, e.g. image retrieval or text retrieval. This approach cannot directly measure the similarity between data of different modalities, such as the similarity between an image and an audio clip, and therefore limits the flexibility of retrieval. To address this problem, cross-modal retrieval has become a new research hotspot: given data of any modality type uploaded by the user as a query, it retrieves relevant results spanning multiple modality types. Compared with traditional single-modality retrieval, cross-modal retrieval provides a more flexible and practical retrieval experience.
A key problem in cross-modal retrieval is how to learn the intrinsic correlations between different modalities. Because data of different modalities have inconsistent distribution characteristics and feature representations, measuring cross-modal similarity is very challenging. Existing cross-modal retrieval methods mainly learn a unified space for data of different modalities: feature representations are mapped from the original single-modality spaces into a cross-modal unified space, yielding unified representations whose cross-modal similarity can be measured directly. These methods fall into two main categories. The first learns linear mappings under a traditional framework, including methods based on Canonical Correlation Analysis (CCA), which map data of different modalities into a common subspace of the same dimension by analyzing the pairwise associations of the data and maximizing the correlation between paired data. There are also methods based on graph regularization; for example, Zhai et al., in the document "Learning Cross-Media Joint Representation with Sparse and Semi-Supervised Regularization", propose a cross-modal retrieval method based on sparse and semi-supervised regularization, which constructs graph models for data of different modalities and performs cross-modal association learning and high-level semantic abstraction simultaneously.
The second category comprises cross-modal unified representation learning methods based on deep neural networks, whose main idea is to exploit the strong modeling capability of deep networks to analyze and mine complex cross-modal associations. For example, Ngiam et al., in the document "Multimodal Deep Learning", propose a multimodal autoencoder that takes data of two modalities as input, models cross-modal association information in an intermediate layer, and simultaneously models the reconstruction errors of both. Feng et al. propose the correspondence autoencoder (Corr-AE) in the document "Cross-modal Retrieval with Correspondence Autoencoder", constructing two networks connected by a coding layer that model association information and reconstruction information simultaneously. Most existing deep-network-based cross-modal retrieval methods can be divided into two learning stages: the first learns a separate feature representation for each modality, and the second learns the cross-modal unified representation. However, existing methods have three limitations. First, in the first stage they model only intra-modal associations, ignoring the complementary role of inter-modal associations in learning the separate feature representations. Second, in the second stage they use only a single loss function as constraint, and cannot adequately balance intra-modal and inter-modal association learning. Third, they consider only the original data of each modality, ignoring the rich fine-grained information provided by its internal parts, and thus cannot fully mine cross-modal associations.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal association learning method based on a multi-granularity hierarchical network, which can fully mine the multi-level association relationship in the modalities and among the modalities by utilizing a hierarchical network structure, and simultaneously dynamically balance semantic category constraints in the modalities and pairwise similarity constraint learning processes among the modalities by utilizing a multi-task framework. In addition, the accuracy of cross-modal retrieval is improved by modeling the multi-granularity information of different modal data.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a cross-modal association learning method based on a multi-granularity hierarchical network is used for comprehensively modeling multi-granularity information of cross-modal data and association information in and among modalities to obtain uniform representations of different modal data, so that cross-modal retrieval is realized, and the method comprises the following steps:
(1) establishing a cross-modal database containing multiple modality types, dividing the data in the cross-modal database into a training set, a validation set and a test set, performing blocking processing on the data of the different modalities in the cross-modal database, and extracting feature vectors of the original data and the blocked data of all modalities;
(2) training a multi-granularity hierarchical network structure by utilizing original data and partitioned data, and learning unified representation for different modal data through the multi-granularity hierarchical network structure;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) and using any one mode type in the test set as a query mode, using the other mode type as a target mode, using each data of the query mode as a query sample, retrieving data in the target mode, calculating the similarity between the query sample and the query target, and obtaining a related result list of the target mode data according to the similarity.
Further, in the above cross-modality association learning method based on the multi-granularity hierarchical network, the cross-modality database in step (1) may contain a plurality of modality types, such as images, texts, and the like.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the different modality data in the database are subjected to blocking processing in step (1): different blocking methods are used for different modalities to divide the original data into several parts. Specifically, for image data, a selective search algorithm is used to extract several candidate regions rich in fine-grained information such as visual objects; for text data, the text is divided into several parts in units of sentences. Other blocking methods are also supported, such as segmenting an image into 2 × 2 or 4 × 4 regions, or segmenting text by phrases.
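As a concrete illustration of the blocking step, the following minimal sketch shows sentence-level text blocking and fixed-grid image blocking; the function names and the plain-list image format are illustrative assumptions, not part of the claimed method (a real selective search would use an external implementation):

```python
import re

def block_text(text):
    """Fine-grained blocking for text: split into sentence units."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def block_image(image, grid=2):
    """Fine-grained blocking for an image given as a list of pixel rows:
    slice it into a grid x grid set of rectangular regions."""
    rows, cols = len(image), len(image[0])
    rh, cw = rows // grid, cols // grid
    return [[row[c * cw:(c + 1) * cw] for row in image[r * rh:(r + 1) * rh]]
            for r in range(grid) for c in range(grid)]
```

Each returned block can then be fed to the same feature extractors as the original data.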
Further, in the above cross-modal association learning method based on a multi-granularity hierarchical network, the feature vectors in step (1) are specifically: word-frequency feature vectors extracted from the text data, and convolutional neural network feature vectors extracted from the image data. Other kinds of features are also supported, such as bag-of-words feature vectors for images and latent Dirichlet allocation (LDA) feature vectors for text.
Further, according to the cross-modal association learning method based on the multi-granularity hierarchical network, step (2) uses a multi-path network structure: the blocked data of the different modalities are used to fully mine the multi-granularity information within the data; the intra-modal and inter-modal associations of the cross-modal data are modeled to obtain the separate feature representation of each single modality; and a multi-task learning framework is constructed to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, finally yielding the cross-modal unified representation.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the distance metric in step (3) is the cosine distance: the similarity of two data items of different modalities is measured by computing the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
Further, in the above cross-modal association learning method based on the multi-granularity hierarchical network, the retrieval in step (4) takes one modality type in the test set as the query modality and another modality type as the target modality. Each data item of the query modality in the test set serves as a query sample; its similarity to all data of the target modality in the test set is computed according to step (3), and the similarities are sorted in descending order to obtain the related result list.
The invention has the following effects: compared with existing methods, the invention can fully mine the multi-granularity information of different modality data, model intra-modal and inter-modal associations simultaneously to learn the separate single-modality feature representations, and further adopt a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
The reason the method achieves this effect is as follows: for the two stages of single-modality separate feature learning and cross-modal unified representation learning, a hierarchical network structure is adopted to fully model the intra-modal and inter-modal associations. On the one hand, during single-modality separate feature learning, multi-granularity feature representations of different modality data are fused, and intra-modal and inter-modal association learning is jointly optimized. On the other hand, during cross-modal unified representation learning, a multi-task learning framework dynamically balances the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints, thereby improving the accuracy of cross-modal retrieval.
Drawings
FIG. 1 is a flow chart of a cross-modal association learning method based on a multi-granularity hierarchical network according to the present invention.
Fig. 2 is a schematic diagram of the complete network architecture of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
The invention relates to a cross-modal association learning method based on a multi-granularity hierarchical network, the flow of which is shown in figure 1, and the method comprises the following steps:
(1) establishing a cross-modal database containing multiple modal types, dividing the database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the database, and extracting all modal original data and feature vectors of the blocked data.
In this embodiment, the cross-modal database may contain multiple modality types, and different blocking methods are used for different modalities to divide the original data into several parts. Taking images and text as an example: for image data, a selective search algorithm extracts several candidate regions rich in fine-grained information such as visual objects; for text data, the text is divided into several parts in units of sentences. The feature vectors for these two modality types are extracted as follows: word-frequency feature vectors for the text data, and deep convolutional neural network feature vectors for the image data. The framework of the method also supports other modality types, such as audio and video, and other kinds of features, such as bag-of-words feature vectors for images and latent Dirichlet allocation (LDA) feature vectors for text.
For media type r, where r ∈ {i, t} (i denotes image, t denotes text), define n^(r) as the number of data items of type r. Each data item in the training set has one and only one semantic category.
Define x_p^(r) as the feature vector of the p-th data item in media type r; it is a vector of size d^(r) × 1, where d^(r) denotes the feature vector dimension of media type r.
Define y_p^(r) as the semantic label of the p-th data item in media type r; it is a vector of size c × 1, where c denotes the total number of semantic categories. Exactly one dimension of y_p^(r) is 1 and the others are 0, indicating that the semantic category of the data item is the one corresponding to the dimension whose value is 1.
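The one-hot semantic label vector described above (size c × 1, with a single dimension set to 1) can be sketched as follows; the function name is illustrative:

```python
def one_hot(label_index, num_classes):
    """c x 1 semantic label vector: 1 at the category's position, 0 elsewhere."""
    return [1 if i == label_index else 0 for i in range(num_classes)]
```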
(2) And training a multi-granularity hierarchical network structure by using the original data and the partitioned data, and learning unified characterization for different modal data.
The process of this step is shown in fig. 2, where the circles represent hidden units in the neural network and the dotted lines represent connections between hidden units in two adjacent layers. In this embodiment, two networks are used to model the original image and text data. First, two Deep Belief Networks (DBNs) are used to model the feature distributions of images and text respectively, each factorizing the distribution over the input and its hidden layers as

P(v, h^(1), h^(2)) = P(v | h^(1)) P(h^(1) | h^(2)) P(h^(2))

where h^(1) and h^(2) denote the two hidden layers of the DBN, and v stands for the image data v_i or the text data v_t. From this, feature representations Q^(i) and Q^(t) containing intra-modal high-level semantic information are obtained. The two networks are then connected by a shared coding layer, which models the intra-modal and inter-modal associations of the image and text data simultaneously and jointly optimizes the reconstruction learning error and the association learning error by minimizing a loss of the form L = L_r + L_c:
where L_r denotes the reconstruction learning error between each modality's input and its reconstructed representation, and L_c denotes the association learning error between the two modalities. Thus, coarse-grained feature representations incorporating intra-modal and inter-modal associations are obtained for both modalities; their p-th entries are the coarse-grained representations of the p-th data items in the image and text media types respectively.
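A minimal numeric sketch of this joint objective — reconstruction errors of both modalities plus a cross-modal association error — is given below; the function names and the weighting parameter `lam` are illustrative assumptions, not the patent's exact formulation:

```python
def reconstruction_error(x, x_hat):
    """Squared L2 error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def association_error(q_img, q_txt):
    """Squared L2 distance between paired image/text representations."""
    return sum((a - b) ** 2 for a, b in zip(q_img, q_txt))

def joint_loss(x_i, xhat_i, x_t, xhat_t, q_i, q_t, lam=1.0):
    """Combined objective L = L_r + lam * L_c: reconstruction errors of
    both modalities plus a weighted cross-modal association error."""
    l_r = reconstruction_error(x_i, xhat_i) + reconstruction_error(x_t, xhat_t)
    l_c = association_error(q_i, q_t)
    return l_r + lam * l_c
```

Minimizing such a loss pulls the shared-layer representations of paired image and text data together while keeping each modality reconstructable.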
In this embodiment, two further networks are used to model the fine-grained image and text data. Specifically, two Deep Belief Networks (DBNs) model the fine-grained (blocked) image and text data, and an average-fusion strategy over the blocks yields feature representations U^(i) and U^(t) containing intra-modal fine-grained information. A shared coding layer then connects the two networks and models the intra-modal and inter-modal associations of the fine-grained image and text features simultaneously, again by minimizing a loss combining a reconstruction error L_r and an association error L_c:
where L_r denotes the reconstruction learning error of each modality's fine-grained features and L_c denotes the association learning error. Thus, fine-grained feature representations incorporating intra-modal and inter-modal associations are obtained; their p-th entries are the fine-grained representations of the p-th data items in the image and text media types respectively.
In this embodiment, a joint Restricted Boltzmann Machine (RBM) is used to fuse the coarse-grained and fine-grained representations of each modality. Specifically, the following joint distribution is defined:

P(v_1, v_2, h_1^(1), h_2^(1), h^(2)) = P(v_1 | h_1^(1)) P(v_2 | h_2^(1)) P(h_1^(1) | h^(2)) P(h_2^(1) | h^(2)) P(h^(2))

where h_1^(1) and h_2^(1) denote the two input-side hidden layers of the joint RBM and h^(2) denotes its joint layer. For images, v_1 is the coarse-grained feature representation of the image and v_2 its fine-grained feature representation; for text, the same joint distribution is used, with v_1 the coarse-grained and v_2 the fine-grained text representation. This yields, for each modality, a single-modality feature representation containing both coarse-grained and fine-grained information; the p-th entries are the fused single-modality representations of the p-th data items in the image and text media types respectively.
In this embodiment, a multi-task learning framework is used to model the intra-modal semantic category constraints and the inter-modal pairwise similarity constraints. Specifically, for the inter-modal pairwise similarity constraint, a neighbor graph G = (V, E) is first constructed over all image and text data, where V denotes the image and text data items and E denotes the similarity relations between them, defined as

E(p, q) = 1 if y_p^(i) = y_q^(t), and E(p, q) = 0 otherwise,

where y_p^(i) and y_q^(t) denote the labels of the image and text data. The following contrastive loss function is then defined to model the similar and dissimilar pair constraints:
L_s = Σ_{E(p,q)=1} d(s_p^(i), s_q^(t))² + Σ_{E(p,q)=0} max(0, α − d(s_p^(i), s_q^(t)))²

where s_p^(i) and s_q^(t) denote the single-modality feature representations of images and text (the entries of S^(i) and S^(t)), d(·, ·) is the distance between them, and the margin parameter is set to α.
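A minimal sketch of such a margin-based contrastive loss for a single cross-modal pair at distance `distance` follows; the function name is illustrative and the patent's exact formulation may differ:

```python
def contrastive_loss(distance, similar, alpha=1.0):
    """Margin-based contrastive loss for one cross-modal pair:
    similar pairs (same semantic category) are pulled together,
    dissimilar pairs are pushed at least alpha (the margin) apart."""
    if similar:
        return distance ** 2
    return max(0.0, alpha - distance) ** 2
```

Dissimilar pairs already farther apart than the margin contribute zero loss, so optimization focuses on violating pairs.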
Then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n denotes the number of categories, with the following cross-entropy loss function:

L_e = − Σ_i p_i log p̂_i

where p̂_i denotes the predicted distribution probability and p_i the target distribution probability. Minimizing this loss enhances the semantic discrimination capability of the unified representation.
Finally, through the multi-task learning framework, the learning processes of the intra-modal semantic category constraint and the inter-modal pairwise association constraint are dynamically balanced, yielding a more accurate cross-modal unified representation; its p-th entries are the unified representations of the p-th data items in the image and text media types respectively.
(3) And calculating the similarity of the different modal data by utilizing the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure.
After the deep network is trained, data of different media obtain unified representations of the same dimension through the network, and their similarity is defined as a distance measure between these unified representations. In this embodiment, the cosine distance is adopted: the similarity of two data items of different modalities is measured by computing the cosine of the angle between their unified representation vectors. The framework also supports other kinds of distance metrics, such as the Euclidean distance.
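The cosine similarity of step (3) can be sketched as:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two unified-representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Because both modalities share one representation space, the same function applies to any image/text pair.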
(4) Any one modality type in the test set is used as the query modality, and another modality type as the target modality. Each data item of the query modality serves as a query sample for retrieving data in the target modality: its similarity to each query target is computed as in step (3), and the similarities are sorted in descending order to obtain the related result list of target-modality data.
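The retrieval and ranking of step (4) can be sketched as follows; the dot-product default similarity is an illustrative stand-in for the cosine metric of step (3):

```python
def dot(u, v):
    """Illustrative similarity: inner product of two representation vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_repr, target_reprs, sim_fn=dot):
    """Rank target-modality items by similarity to the query, descending.
    Returns (index, score) pairs: the related result list of step (4)."""
    scored = [(idx, sim_fn(query_repr, rep)) for idx, rep in enumerate(target_reprs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```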
The following experimental results show that compared with the existing method, the cross-modal association learning method based on the multi-granularity hierarchical network can achieve higher retrieval accuracy.
This example was conducted on the Wikipedia cross-modal dataset proposed in the document "A New Approach to Cross-Modal Multimedia Retrieval" (N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy and N. Vasconcelos, ACM International Conference on Multimedia, 2010). It contains 2866 texts and 2866 images in one-to-one correspondence, divided into 10 categories in total; 2173 texts and 2173 images are used as the training set, 231 texts and 231 images as the validation set, and 492 texts and 492 images as the test set. The following three methods were tested as experimental comparisons:
the prior method comprises the following steps: joint Representation Learning (JRL) method in the document "Learning Cross-Media Joint retrieval with spark and Semi-supervisory reconstruction" (author x.zhai, y.peng, and j.xiao), constructs a graph model for different modal data, performs Cross-modal associative Learning and high-level semantic abstraction simultaneously, and introduces sparse and Semi-Supervised conventions.
Existing method II: the multimodal autoencoder (Bimodal AE) method in the document "Multimodal Deep Learning" (J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng), which takes multiple media types as input and models cross-modal association information in an intermediate layer to obtain a unified representation, while requiring the network to reconstruct the original feature inputs from that representation, so that the association information among different media is learned effectively and the reconstruction information within each medium is retained.
Existing method III: the correspondence autoencoder (Corr-AE) method in the document "Cross-modal Retrieval with Correspondence Autoencoder" (F. Feng, X. Wang, and R. Li), which constructs two networks connected at the intermediate layers to model association information and reconstruction information simultaneously.
The invention comprises the following steps: the method of the present embodiment.
The experiments evaluate the accuracy of cross-modal retrieval with the MAP (mean average precision) metric commonly used in the field of information retrieval: MAP is the mean of the average precision of each query sample, and a larger MAP value indicates a better cross-modal retrieval result.
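The MAP evaluation can be sketched as follows; the relevance lists in the usage are illustrative, and `average_precision` follows the standard information-retrieval definition:

```python
def average_precision(relevance):
    """AP for one ranked result list; relevance holds 1 (relevant) or 0
    per rank position, best-ranked first."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(relevance_lists):
    """MAP: mean of the average precision over all query samples."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```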
Table 1. Experimental results
                    | Image query text | Text query image | Average
Existing method I   | 0.453            | 0.400            | 0.427
Existing method II  | 0.314            | 0.290            | 0.302
Existing method III | 0.402            | 0.395            | 0.399
The invention       | 0.504            | 0.457            | 0.481
As can be seen from Table 1, the method of the invention improves substantially over the existing methods on both the image-query-text and text-query-image tasks. Existing method I constructs a graph model under a traditional framework and maps data of different modalities linearly into a unified space, making it difficult to fully model complex cross-modal associations. Existing methods II and III both adopt deep network structures, but use only the original data of each modality type and learn the cross-modal unified representation through simple network structures. The method of the invention, on the one hand, fuses multi-granularity feature representations of different modality data and jointly optimizes intra-modal and inter-modal association learning to obtain the separate single-modality feature representations; on the other hand, it adopts a multi-task learning framework to dynamically balance the learning of intra-modal semantic category constraints and inter-modal pairwise association constraints when obtaining the cross-modal unified representation, thereby improving the accuracy of cross-modal retrieval.
In other embodiments, in the cross-modal unified characterization learning method in step (2), a Deep Belief Network (DBN) is used to model original and fine-grained image and text data, and a Stacked auto-encoder (SAE) may also be used as a substitute.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (9)
1. A cross-modal association learning method based on a multi-granularity hierarchical network comprises the following steps:
(1) establishing a cross-modal database containing various modal types, dividing data in the cross-modal database into a training set, a verification set and a test set, carrying out blocking processing on data of different modes in the cross-modal database, and extracting feature vectors of original data of all the modes and the blocked data;
(2) training a multi-granularity hierarchical network structure with the original data and the blocked data, and learning a unified representation for the different modal data through that structure, comprising: modeling the original image and text data with two networks; modeling the fine-grained image and text data with two networks; fusing the coarse-grained representation and fine-grained representation of each modality with a joint Restricted Boltzmann Machine; and finally modeling intra-modal semantic category constraints and inter-modal pairwise similarity constraints with a multi-task learning framework;
(3) calculating the similarity of different modal data by using the unified representation of the different modal data obtained according to the trained multi-granularity hierarchical network structure;
(4) using any one modality type in the test set as the query modality and the other modality type as the target modality, taking each data item of the query modality as a query sample to retrieve data in the target modality, computing the similarity between the query sample and each query target, and obtaining a ranked result list of target-modality data according to the similarity.
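Steps (3) and (4) amount to a nearest-neighbor search over the learned unified representations. A minimal Python sketch, assuming the unified representation vectors have already been produced by the trained network (the helper names are illustrative, not from the patent):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two unified-representation vectors (step 3).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, target_vecs):
    """Rank target-modality items by similarity to the query sample (step 4)."""
    scores = [cosine_similarity(query_vec, t) for t in target_vecs]
    # Indices of target-modality items, most similar first.
    return sorted(range(len(target_vecs)), key=lambda i: -scores[i])
```

The ranked index list is the "related result list of the target modal data" returned to the user.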
2. The method of claim 1, wherein the cross-modal database contains a plurality of modality types including images and text.
3. The method according to claim 1, wherein step (1) employs different blocking processing methods for different modality data to divide the original data into a plurality of parts: for image data, a selective search algorithm is used to extract a plurality of candidate regions containing rich fine-grained information, or the image is divided into 2 x 2 or 4 x 4 regions; for text data, the text is divided into a plurality of pieces in units of sentences, or into phrases.
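A minimal sketch of the blocking step of claim 3, assuming the image is given as a 2-D array of pixel values and using sentence-final punctuation to split text (the selective-search variant is omitted; the function names are illustrative):

```python
import re

def block_image(img, grid=2):
    """Split an H x W image (2-D list or array) into grid x grid regions."""
    h, w = len(img), len(img[0])
    bh, bw = h // grid, w // grid
    return [[row[c * bw:(c + 1) * bw] for row in img[r * bh:(r + 1) * bh]]
            for r in range(grid) for c in range(grid)]

def block_text(text):
    """Split a paragraph into sentence-level blocks."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
```

With `grid=2` this yields the 2 x 2 regions mentioned in the claim; `grid=4` yields 4 x 4.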
4. The method of claim 1, wherein the feature vectors extracted in step (1) are: for text data, word frequency feature vectors or latent Dirichlet allocation (LDA) feature vectors; for image data, convolutional neural network feature vectors or bag-of-words feature vectors.
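The word-frequency variant of claim 4 can be sketched as a bag-of-words count over a fixed vocabulary (a simple illustration, not the patent's exact feature pipeline; the CNN and LDA variants would normally rely on pretrained models):

```python
from collections import Counter

def word_frequency_vector(text, vocabulary):
    """Term-frequency feature vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]
```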
5. The method of claim 1, wherein the modeling of the original image and text data with two networks comprises first modeling the feature distributions of images and text, respectively, with two Deep Belief Networks, using the following conditional probability distribution formulas:
where h^(1) and h^(2) represent the two hidden layers in the DBN, v_i represents the image data, and v_t represents the text data; feature representations Q^(i) and Q^(t) containing intra-modal high-level semantic information are thereby obtained. A shared coding layer then connects the two networks, modeling the intra-modal and inter-modal associations of the image and text data simultaneously, and the reconstruction learning error and association learning error are jointly optimized by minimizing the following loss function:
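The conditional probability formulas of claim 5 appear only as images in the published text. For a binary Restricted Boltzmann Machine layer inside a DBN, the standard conditionals take the sigmoid form below; this is a common formulation offered as an assumption, not the patent's exact equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_hidden_given_visible(v, W, b):
    """p(h_j = 1 | v) for a binary RBM layer: sigmoid(b + W^T v)."""
    return sigmoid(b + W.T @ v)

def rbm_visible_given_hidden(h, W, a):
    """p(v_i = 1 | h) for a binary RBM layer: sigmoid(a + W h)."""
    return sigmoid(a + W @ h)
```

Stacking two such layers gives the h^(1), h^(2) hierarchy referenced in the claim.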
6. The method of claim 5, wherein the modeling of the fine-grained image and text data with two networks comprises modeling the fine-grained image and text data with two Deep Belief Networks (DBNs) and applying an average fusion strategy to obtain feature representations U^(i) and U^(t) containing intra-modal fine-grained information; a shared coding layer is then constructed to connect the two networks, and the intra-modal and inter-modal associations of the fine-grained feature representations of image and text are modeled simultaneously by minimizing the following loss function:
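The average fusion strategy of claim 6 can be sketched as an element-wise mean over the feature vectors of an item's blocks, assuming each block has already been encoded to a fixed-length vector:

```python
import numpy as np

def average_fusion(block_features):
    """Average the per-block feature vectors of one item into a single
    fine-grained representation (element-wise mean over blocks)."""
    return np.mean(np.stack(block_features), axis=0)
```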
7. The method of claim 6, wherein the fusing of the coarse-grained representation and fine-grained representation of each modality with a joint Restricted Boltzmann Machine defines the following joint distribution:
where the first two symbols respectively represent the two hidden layers in the joint Restricted Boltzmann Machine and h^(2) represents the joint layer; for an image, v_1 represents the coarse-grained feature representation of the image and v_2 represents the fine-grained feature representation of the image; likewise for text, the joint distribution defined above is still used, where v_1 represents the coarse-grained feature representation of the text and v_2 represents the fine-grained feature representation of the text. Single-modal feature representations S^(i) and S^(t) containing both coarse-grained and fine-grained information are thereby obtained.
8. The method of claim 7, wherein the multi-task learning framework models intra-modal semantic category constraints and inter-modal pairwise similarity constraints; for the inter-modal pairwise similarity constraint, a neighborhood graph G = (V, E) is first constructed over all image and text data, where V represents the image and text data and E represents the similarity relationships between image and text data, defined as follows:
where the two label symbols represent the labels of the image and text data; the following contrastive loss function is then defined to model the similar-pair and dissimilar-pair constraints:
where the two inputs are the single-modal feature representations of an image and a text (S^(i) and S^(t), respectively), and the margin parameter is set to α;
then, for the intra-modal semantic category constraint, an n-way softmax layer is constructed, where n represents the number of categories, and the following cross-entropy loss function is defined:
where the first term represents the predicted distribution probability and p_i represents the target distribution probability; minimizing this loss function enhances the semantic discrimination capability of the unified representation. Finally, the multi-task learning framework dynamically balances the learning of the intra-modal semantic category constraints and inter-modal pairwise association constraints, and the more accurate cross-modal unified representations M^(i) and M^(t) are finally obtained.
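The two constraint terms of claim 8 can be sketched with standard loss formulations; the patent's exact formulas appear only as images, so the margin-based contrastive loss and softmax cross-entropy below are common stand-ins offered as assumptions:

```python
import numpy as np

def contrastive_loss(x, y, similar, alpha=1.0):
    """Inter-modal pairwise constraint: pull similar image-text pairs
    together, push dissimilar pairs at least alpha apart (a common
    margin-based formulation)."""
    d = np.linalg.norm(x - y)
    return d ** 2 if similar else max(0.0, alpha - d) ** 2

def softmax(z):
    # Numerically stable softmax over the n class logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p_hat, p):
    """Intra-modal semantic category constraint: cross-entropy between the
    predicted distribution p_hat and the target distribution p."""
    return float(-np.sum(p * np.log(p_hat + 1e-12)))
```

A multi-task objective would weight and sum these two losses, with the weights adjusted dynamically during training as the claim describes.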
9. The method of claim 1, wherein step (3) adopts the cosine distance, measuring the similarity of two modal data items by computing the cosine of the angle between their unified representation vectors; alternatively, step (3) adopts the Euclidean distance to measure similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710378513.XA CN107346328B (en) | 2017-05-25 | 2017-05-25 | Cross-modal association learning method based on multi-granularity hierarchical network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107346328A CN107346328A (en) | 2017-11-14 |
CN107346328B true CN107346328B (en) | 2020-09-08 |
Family
ID=60253337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710378513.XA Active CN107346328B (en) | 2017-05-25 | 2017-05-25 | Cross-modal association learning method based on multi-granularity hierarchical network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107346328B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109189968B (en) * | 2018-08-31 | 2020-07-03 | 深圳大学 | Cross-modal retrieval method and system |
CN109271486B (en) * | 2018-09-19 | 2021-11-26 | 九江学院 | Similarity-preserving cross-modal Hash retrieval method |
CN112116095B (en) * | 2019-06-19 | 2024-05-24 | 北京搜狗科技发展有限公司 | Method and related device for training multi-task learning model |
CN110457516A (en) * | 2019-08-12 | 2019-11-15 | 桂林电子科技大学 | A kind of cross-module state picture and text search method |
CN110781319B (en) * | 2019-09-17 | 2022-06-21 | 北京邮电大学 | Common semantic representation and search method and device for cross-media big data |
CN110807465B (en) | 2019-11-05 | 2020-06-30 | 北京邮电大学 | Fine-grained image identification method based on channel loss function |
CN111275130B (en) * | 2020-02-18 | 2023-09-08 | 上海交通大学 | Multi-mode-based deep learning prediction method, system, medium and equipment |
CN111753549B (en) * | 2020-05-22 | 2023-07-21 | 江苏大学 | Multi-mode emotion feature learning and identifying method based on attention mechanism |
CN111859635A (en) * | 2020-07-03 | 2020-10-30 | 中国人民解放军海军航空大学航空作战勤务学院 | Simulation system based on multi-granularity modeling technology and construction method |
CN112819052B (en) * | 2021-01-25 | 2021-12-24 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN112990048B (en) * | 2021-03-26 | 2021-11-23 | 中科视语(北京)科技有限公司 | Vehicle pattern recognition method and device |
CN113516286B (en) * | 2021-05-14 | 2024-05-10 | 山东建筑大学 | Student academic early warning method and system based on multi-granularity task joint modeling |
CN114064967B (en) * | 2022-01-18 | 2022-05-06 | 之江实验室 | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network |
CN114219049B (en) * | 2022-02-22 | 2022-05-10 | 天津大学 | Fine-grained curbstone image classification method and device based on hierarchical constraint |
CN116012679B (en) * | 2022-12-19 | 2023-06-16 | 中国科学院空天信息创新研究院 | Self-supervision remote sensing representation learning method based on multi-level cross-modal interaction |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701227A (en) * | 2016-01-15 | 2016-06-22 | 北京大学 | Cross-media similarity measure method and search method based on local association graph |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10776710B2 (en) * | 2015-03-24 | 2020-09-15 | International Business Machines Corporation | Multimodal data fusion by hierarchical multi-view dictionary learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107346328B (en) | Cross-modal association learning method based on multi-granularity hierarchical network | |
CN107562812B (en) | Cross-modal similarity learning method based on specific modal semantic space modeling | |
Peng et al. | An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges | |
US10504010B2 (en) | Systems and methods for fast novel visual concept learning from sentence descriptions of images | |
CN112819023B (en) | Sample set acquisition method, device, computer equipment and storage medium | |
CN112417097B (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
CN110619051B (en) | Question sentence classification method, device, electronic equipment and storage medium | |
CN107220337B (en) | Cross-media retrieval method based on hybrid migration network | |
CN113971209B (en) | Non-supervision cross-modal retrieval method based on attention mechanism enhancement | |
CN110647904A (en) | Cross-modal retrieval method and system based on unmarked data migration | |
CN112487822A (en) | Cross-modal retrieval method based on deep learning | |
CN113239159B (en) | Cross-modal retrieval method for video and text based on relational inference network | |
CN113537304A (en) | Cross-modal semantic clustering method based on bidirectional CNN | |
CN116975615A (en) | Task prediction method and device based on video multi-mode information | |
Zhang et al. | Deep unsupervised self-evolutionary hashing for image retrieval | |
Su et al. | Semi-supervised knowledge distillation for cross-modal hashing | |
CN111368176B (en) | Cross-modal hash retrieval method and system based on supervision semantic coupling consistency | |
Liu et al. | Open intent discovery through unsupervised semantic clustering and dependency parsing | |
CN117273134A (en) | Zero-sample knowledge graph completion method based on pre-training language model | |
He et al. | Category alignment adversarial learning for cross-modal retrieval | |
CN114743029A (en) | Image text matching method | |
CN112528062B (en) | Cross-modal weapon retrieval method and system | |
Perdana et al. | Instance-based deep transfer learning on cross-domain image captioning | |
CN109670071B (en) | Serialized multi-feature guided cross-media Hash retrieval method and system | |
Su et al. | Semantically guided projection for zero-shot 3D model classification and retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||