CN105760507A - Cross-modal topic correlation modeling method based on deep learning - Google Patents


Info

Publication number
CN105760507A
Authority
CN
China
Prior art keywords
text
theme
image
vocabulary
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610099438.9A
Other languages
Chinese (zh)
Other versions
CN105760507B (en)
Inventor
张玥杰
程勇
刘志鑫
金城
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201610099438.9A
Publication of CN105760507A
Application granted
Publication of CN105760507B
Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems
    • G06F16/94Hypermedia
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention belongs to the technical field of cross-media correlation learning, and specifically relates to a cross-modal topic correlation modeling method based on deep learning. The method comprises two main algorithms: multi-modal document representation based on deep words, and relational topic model construction fusing cross-modal topic correlation learning. Deep learning techniques are used to construct deep semantic words and deep visual words to describe the semantic-description part and the image part of a multi-modal document. On the basis of this multi-modal document representation, a cross-modal relational topic model is constructed to model the whole multi-modal document set, so as to describe the generation process of multi-modal documents and the correlation between different modalities. The method achieves high accuracy and strong adaptability. It is significant for efficient cross-media information retrieval that takes multi-modal semantic information into account over large-scale multi-modal documents (text and images); it can improve retrieval relevance and enhance user experience, and has great application value in the field of cross-media information retrieval.

Description

Cross-modal topic correlation modeling method based on deep learning
Technical field
The invention belongs to the field of cross-media correlation learning, and specifically relates to a deep-learning-based method for learning the topic correlation between images and text across modalities.
Background art
With the development of Internet technology and the maturation of Web 2.0, massive multi-modal documents have accumulated on the Internet. How to analyze and process the complex structure of these multi-modal documents, and thereby provide theoretical support for practical applications such as cross-media retrieval, has become a very important research topic. In general, a multi-modal document exists as a co-occurrence of several modalities; for example, many web images are accompanied by user-defined descriptions or annotations, and some web documents contain illustrations. Although these multi-modal data are usually associated with each other, the semantic gap means that there is a large difference between the visual information of an image and its textual description [1], which makes it very difficult to fully exploit the semantic association between modalities. Therefore, how to mine the implicit relations behind data of different modalities and model multi-modal documents by better fusing multi-modal information has become very important [2,3]. Using topic models to model multi-modal documents and then mine the association between modalities is a key strategy. In research on cross-modal topic modeling, three interrelated problems must be solved simultaneously:
1. Find and construct more representative and more valuable document elements to represent, respectively, the image content and the text content of a multi-modal document.
2. Establish a more reasonable topic correlation model to describe the association between the data of different modalities in a multi-modal document, i.e., the association between the visual image and the textual description.
3. Establish, through cross-modal topic correlation learning, an objective mechanism for measuring the internal association between image and text content.
To solve the first problem, the key exploration is how to construct an optimized set of document elements, so that these optimized elements express the visual and semantic features of a multi-modal document more accurately and more comprehensively.
To solve the second problem, the key is to build a more robust probabilistic topic model that maximizes the likelihood of the observed multi-modal documents and thereby mines the latent topic information behind them.
To solve the third problem, the most effective approach is to map the attribute features of the different modalities into a common embedding subspace, thereby maximizing the correlation between the information of the different modalities.
Several researchers have proposed methods for multi-modal data modeling. From the modeling perspective, these methods fall roughly into two classes: statistical dependence modeling methods, and methods that build a joint probability generative model.
(1) Statistical dependence modeling methods
The core idea of statistical modeling methods is to map the data features of different modalities into the same latent space, in the hope of maximally exploiting the statistical correlation between the features of different modalities. For images and text, mapping matrices are constructed to project the differently structured image features and text features into the same common subspace, where the correlation between an image and a text is computed: the more relevant an image and a text are, the closer they are in the common subspace; conversely, a larger distance means a lower image-text correlation. Canonical Correlation Analysis (CCA) is the most typical statistical dependence method; it obtains the corresponding basis-vector matrices by maximizing the correlation between the visual feature matrix and the semantic feature matrix. The basis-vector matrices largely preserve the correlation between visual and semantic features and provide the mappings into an isomorphic subspace; the visual and semantic feature vectors of an image are then mapped into this subspace of the same dimension to build a cross-modal fused feature, achieving a unified representation of the different modalities of media data. Later work such as Kernel CCA (KCCA) and Deep CCA (DCCA) explored the dependence between image and text at a deeper level.
[4] combines statistical modeling with topic models: it first uses a latent Dirichlet allocation model to extract the visual topic features of images and the textual topic features of texts, and then uses CCA to map the visual and textual topic features into an isomorphic subspace in order to find and compute their correlation. This work is extended in [5], where KCCA is used to compute the correlation.
(2) Joint probability generative model methods
Multi-modal topic models are the typical representatives of joint probability generative models; in recent years there has been much related work on probabilistic topic modeling of the visual content and semantic descriptions in multi-modal documents [6,7,8,9,10]. In 2003, [Blei 2003] established a series of progressively more complex topic models [11], among which Correspondence Latent Dirichlet Allocation (Corr-LDA) is the best cross-modal topic model; it assumes a correspondence between the latent topics of different modalities, i.e., the latent topic of an annotation comes from the latent topic behind the visual information of the image. This assumption establishes a one-directional mapping, with the generation of text words depending on the visual content of the image. Later, [Wang 2009] proposed a supervised topic model to learn the latent relation between images and annotation words [12], and [Putthividhya 2010] proposed a topic-regression multi-modal latent Dirichlet allocation model [13]. [Rasiwasia 2010] studied joint modeling of the text and image content of multi-modal documents [3]. [Nguyen 2013] proposed an image annotation method based on the joint feature-word distribution and the word-topic distribution [9]. [Niu 2014] proposed a semi-supervised relational topic model that explicitly models the relation between image content and images [14]. [Wang 2014] proposed a semi-supervised multi-modal mutual topic reinforcement model that explores the mutually reinforcing relation between the topics of different modalities [15]. [Zheng 2014] proposed a supervised variant of DocNADE that jointly models the visual words, annotation words, and class labels of an image [16]. [Chen 2015] bridges the modeling gap between image and text by building a visual-emotional LDA model [17].
The above analysis shows that current methods have all made some progress in multi-modal document modeling, but none of them fully considers the impact of the following three aspects:
(1) Deep information mining in multi-modal documents. Most existing image-tag correlation learning methods only explore the association between modalities on top of traditional visual feature representations and annotation features, without considering the deep features contained in the different modalities. This causes serious information loss when building the overall visual semantics and internal semantic associations. Deep exploration of multi-modal documents can make up for this defect, so that the obtained feature elements represent multi-modal documents better.
(2) Relational topic correlation based on deep analysis. When building the topic correlation between different modalities, most existing topic modeling methods assume that the latent topics behind the different modalities are identical. Such an assumption is usually too absolute and introduces unnecessary noise into the constructed topic correlation. It is therefore particularly important to build a more reasonable assumption, fuse deep feature information, and form a better relational topic correlation modeling mechanism.
(3) Cross-modal correlation learning based on deep topic features. When computing the correlation between modalities, most existing multi-modal topic models directly match the latent topic distribution features behind the different modalities in order to capture the internal association between visual images and textual descriptions. However, such direct matching does not properly account for the heterogeneity of image and text; mapping the deep topic features into a common space and learning the correlation there can mine the correlation much better, thereby solving the problem raised above.
It is therefore highly desirable to draw on existing mature techniques while considering all of the above problems, and to analyze and compute the topic correlation between modalities more comprehensively. Motivated by this, the present invention designs, from the local to the global level, a novel technical framework comprising three main algorithms: deep word construction for multi-modal documents, relational topic model construction, and heterogeneous topic correlation learning. This establishes an effective cross-modal topic correlation computation method that ultimately improves cross-media image retrieval performance.
Summary of the invention
The object of the present invention is to propose a cross-modal topic correlation modeling method based on deep learning, so as to improve cross-media social image retrieval performance.
The present invention first proposes a novel deep cross-modal topic correlation model. The model is built over a large-scale multi-modal corpus and can deeply analyze and understand the association between images and text in multi-modal documents; using the constructed model, cross-media retrieval performance can be effectively improved. The model mainly includes the following components:
(1) Deep Word Construction. For a multi-modal document, deep learning techniques are used to construct deep words as its basic representation elements. Deep words include deep visual words and deep text words: deep visual words describe the visual content of the images in the document, while deep text words serve as the basic elements describing the text content. Compared with traditional visual words and text words, deep words can mine the semantic information of a document at a deeper level, so that a multi-modal document can be represented better.
(2) Multi-modal Topic Information Generation. On the basis of the constructed deep words, the topic model LDA is used to further mine the topic information hidden behind the data of each modality. The topic model assumes that a common topic set lies behind the document set and that each word in a document corresponds to a topic; under this assumption, the topic features behind each document can be obtained by inference, giving a further representation of the document.
(3) Cross-modal Topic Correlation Analysis. The topics hidden behind documents of different modalities are assumed to be heterogeneous but correlated; for example, the topic corresponding to "wedding" in a text document is likely to be highly correlated with the topic behind "white" in an image. Therefore, by constructing a common subspace, the topic features of the different modalities are mapped into the common subspace to find the correlation between modalities.
(4) Relational Topic Modeling. When generating the topic features of the different modalities, the relational topic model considers the correlation between image and text at the same time: when building the topics of a document, it considers not only the information of the same modality but also the correlation with the other modality, so that the final topics fuse multi-modal information; this finally yields the topic distribution behind each multi-modal document and the cross-modal correlation.
Compared with existing multi-modal topic modeling methods, the proposed method has two major advantages in application. First, high accuracy: the method replaces traditional words with the constructed deep words, which can mine deep modality information and alleviate the problems brought by the semantic gap, thereby improving the efficiency of cross-media retrieval. Second, strong adaptability: because the constructed model directly models the association between modalities, it is applicable to bidirectional cross-media retrieval (retrieving text by image and retrieving images by text), and the model can easily be extended to cross-media retrieval over other modalities (such as audio).
The cross-modal topic correlation modeling method based on deep learning provided by the present invention comprises the following steps:
(1) Data preprocessing: collect images of different modalities from a multimedia data set to obtain images and image descriptions, and remove annotation words that rarely appear in the image annotation data set or are useless.
(2) Multi-modal deep feature extraction: use deep learning methods to extract the visual features of images and the semantic features of image descriptions. Specifically, a Region-CNN (Regions with Convolutional Neural Network features) model and a Skip-gram model are used to extract the region features of images and the word features of text, respectively. Region-CNN first detects a set of representative region candidates in an image and then uses a pre-trained convolutional neural network to extract the feature of each region; the Skip-gram model directly trains word feature vectors using the co-occurrence information between text words.
(3) Deep bag-of-words construction: cluster the image region features and text word features obtained in step (2) with the K-means clustering algorithm to obtain a deep visual dictionary and a deep text dictionary of limited size; then map all region features of each image to the visual dictionary to build a deep visual bag-of-words model. Similarly, the words in each text are mapped to the text dictionary to obtain a deep text bag-of-words model.
(4) Multi-modal topic generation: use the assumptions of the latent Dirichlet allocation model to simulate the generation process of the whole multi-modal data set and infer the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words.
(5) Relational topic model construction fusing cross-modal topic correlation analysis: build the corresponding relational topic model, i.e., consider the correlation between the topic features of different modalities while building the topic model. Using the multi-modal topic features obtained in step (4) as initial values, the correlation between image and text is computed from their association information, and the computed correlation is then used to update the topic information of the multi-modal documents; correlation computation and topic distribution update thus alternate iteratively, finally yielding the relational topic model.
(6) Topic-correlation-based cross-media retrieval: apply the obtained cross-modal topic correlation to cross-media retrieval; given a query of one modality, use the correlation computation to obtain the most relevant data of the other modality.
Each of the above steps is described in detail below:
(1) Data preprocessing
This step performs preliminary preprocessing on the collected images of different modalities. Specifically, the annotations attached to images contain noise caused by the randomness of user annotation; the annotations are therefore filtered by word frequency, removing words whose frequency falls below a certain threshold and thereby obtaining a new dictionary.
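For illustration, the following is a minimal Python sketch of the word-frequency filtering described above; the threshold MIN_FREQ and the function name are illustrative assumptions, not values or names prescribed by the patent.

```python
# A minimal sketch of annotation filtering by word frequency; MIN_FREQ is an
# assumed threshold, not a value prescribed by the patent.
from collections import Counter

MIN_FREQ = 5  # assumed frequency threshold

def filter_annotations(annotations):
    """annotations: one list of annotation words per image."""
    freq = Counter(w for tags in annotations for w in tags)
    vocab = {w for w, c in freq.items()
             if c >= MIN_FREQ and not any(ch.isdigit() for ch in w)}
    return [[w for w in tags if w in vocab] for tags in annotations], vocab

tags, vocab = filter_annotations([["beach", "sunset", "dsc123"], ["beach", "sea"]] * 5)
```

The digit check also drops meaningless digit-bearing words, matching the annotation cleanup described in the embodiment below.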
(2) Multi-modal deep feature extraction
In the present invention, Region-CNN and the Skip-gram model are used to extract the region features of images and the word features of text, respectively. They are described separately below:
Given an image, Region-CNN first uses selective search to select positions where objects are likely to appear as a candidate set (usually about 2,000 regions). A CNN feature is then extracted for each region. In the implementation, Region-CNN converts each image region to a fixed size of 227x227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers. Compared with traditional visual features, the deep CNN features extracted by Region-CNN are closer to the semantics of the image itself and can alleviate the problem of the semantic gap to a certain extent.
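As a sketch of the per-region feature extraction, the following Python snippet assumes a torchvision (version 0.13 or later) AlexNet pre-trained on ImageNet stands in for the pre-trained CNN described above, and that region proposals are supplied as bounding boxes; selective search itself is not sketched here.

```python
# A minimal sketch of per-region CNN feature extraction; the torchvision
# AlexNet and the given bounding boxes are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()
# Keep the classifier up to the second fully connected layer (4096-d output).
feature_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((227, 227)),  # the fixed region size used in the description
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def region_features(pil_image, boxes):
    """Extract one CNN feature vector per candidate region (x0, y0, x1, y1)."""
    feats = []
    with torch.no_grad():
        for box in boxes:
            crop = preprocess(pil_image.crop(box)).unsqueeze(0)
            conv = alexnet.avgpool(alexnet.features(crop)).flatten(1)
            feats.append(feature_head(conv).squeeze(0))
    return torch.stack(feats)  # shape: (num_regions, 4096)

demo = Image.new("RGB", (500, 375))
feats = region_features(demo, [(0, 0, 250, 200), (100, 50, 400, 300)])
```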
Given a text document, the Skip-gram model is trained to obtain a feature vector for each word that occurs in the text. The Skip-gram model is a very effective method for learning distributed representations of text words; it was first proposed by Mikolov et al. in 2013 and has since been widely used in different natural language processing tasks. Compared with traditional word-vector learning methods, the model captures the syntactic and semantic relations between text words well and clusters semantically similar words together. One important advantage of Skip-gram is its high training efficiency on massive data, because no complex dense matrix operations are involved. Let TD denote the textual description part of the whole multi-modal document set, TW the set of all text words occurring in TD, and TV the dictionary of text words. For each word tw in TW, iv_tw and ov_tw are the input and output feature vectors of tw, and Context(tw) is the set of words occurring around tw (its context); in the present invention the context window size is set to 5. All input and output vectors of the whole text data set are concatenated into one long parameter vector W ∈ R^{2·|TV|·dim}, where dim is the dimension of the input and output vectors. The objective function of the whole Skip-gram model can then be written as:
$$B_{SG}(\omega) = \arg\max_{\omega} \frac{1}{|W|}\sum_{i=1}^{|W|}\;\sum_{j \in Context(w_i)} \log P(w_j \mid w_i) = \arg\max_{\omega} \frac{1}{|W|}\sum_{i=1}^{|W|}\;\sum_{j \in Context(w_i)} \log\frac{\exp(O_{w_j} \cdot I_{w_i})}{\sum_{k=1}^{|TV|}\exp(O_{w_k} \cdot I_{w_i})} \qquad (1)$$
When training Skip-gram, using the traditional softmax incurs a very high computational cost; the negative sampling method is therefore used to approximate log P(tw_j | tw_i), computed as follows:
$$\log P(w_j \mid w_i) = \log\sigma(O_{w_j} \cdot I_{w_i}) + \sum_{k=1}^{m}\mathbb{E}_{w_k \sim P(w)}\left[\log\sigma(-O_{w_k} \cdot I_{w_i})\right] \qquad (2)$$
where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(w).
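For illustration, a minimal numpy sketch of one stochastic update for the negative-sampling objective in formula (2) follows; all sizes, the learning rate, and the variable names are illustrative assumptions.

```python
# A minimal numpy sketch of one negative-sampling SGD update per formula (2).
import numpy as np

rng = np.random.default_rng(0)
V, dim, m, lr = 1000, 100, 5, 0.025  # vocabulary, dimension, negatives, step size
in_vecs = rng.normal(scale=0.1, size=(V, dim))    # input vectors I_w
out_vecs = np.zeros((V, dim))                     # output vectors O_w
freq = rng.integers(1, 100, size=V).astype(float)
noise_dist = freq ** 0.75 / (freq ** 0.75).sum()  # word-frequency-based P(w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context):
    """One update of log sigma(O_ctx.I_c) + sum_k E[log sigma(-O_neg.I_c)]."""
    negatives = rng.choice(V, size=m, p=noise_dist)
    grad_in = np.zeros(dim)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        err = label - sigmoid(out_vecs[w] @ in_vecs[center])
        grad_in += err * out_vecs[w]                 # accumulate input gradient
        out_vecs[w] += lr * err * in_vecs[center]    # update output vector
    in_vecs[center] += lr * grad_in

sgd_step(center=3, context=17)
```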
(3) Deep bag-of-words construction
On the basis of the deep words obtained in step (2), the deep bag-of-words model is further built by vector quantization [25]. Specifically, for the region candidate sets extracted by R-CNN and their corresponding features, K-means is first used to cluster the region features of all images in the multi-modal document set into a fixed number of classes; the center of each cluster serves as the representative element of that class, and all the classes together form the corresponding dictionary. Afterwards, each candidate region of an image is mapped to its class: the Euclidean distance between the region feature and each cluster center is computed to find the nearest class, and the count at the corresponding position of the image's vector is incremented. In this way, every image in the whole data set is represented as a deep visual bag-of-words, i.e., each image corresponds to one vector whose dimension equals the number of classes and whose elements are the numbers of occurrences of each class in the image, written VT ∈ R^C, where C is the number of clusters obtained. Similarly, for all the word vectors of the text documents, clustering yields the corresponding deep text dictionary, and each text is finally represented as a deep text bag-of-words by the same mapping.
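As an illustration of this vector-quantization step, the following is a minimal sketch with scikit-learn's K-means; the stand-in region features and the dictionary size C are illustrative assumptions.

```python
# A minimal sketch of the vector-quantization step with scikit-learn K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
region_feats = [rng.normal(size=(20, 128)) for _ in range(10)]  # stand-in per-image regions

C = 100  # dictionary size (number of clusters)
kmeans = KMeans(n_clusters=C, n_init=10, random_state=0).fit(np.vstack(region_feats))

def deep_bag_of_words(image_regions):
    """Map each region to its nearest cluster center and count occurrences."""
    labels = kmeans.predict(image_regions)
    return np.bincount(labels, minlength=C)  # the vector VT in R^C

bow_matrix = np.array([deep_bag_of_words(r) for r in region_feats])
```

The same quantization applies unchanged to the text word vectors, yielding the deep text bag-of-words.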
(4) Multi-modal topic generation
Multi-modal information is a very important way of expressing the content of a multi-modal document: the visual information of the image is combined with the semantic description. Therefore, to compute the cross-modal correlation between a visual image and its text annotation better, extracting representative multi-modal features accurately is particularly important; multi-modal feature representations can better explore the association between the perceptual attributes of an image and its semantic features.
The latent Dirichlet allocation (LDA) algorithm is a generative probabilistic model for discrete data that has received great attention in image/text research. LDA represents each document by a set of probability distributions, and each word in a document is generated from a single topic. The advantage of LDA is that it considers the internal statistical structure of documents across the whole collection, such as the co-occurrence information of different words: it assumes that each word in a document is generated from a single topic, which is in turn drawn from a Dirichlet-distributed mixture over all topics. LDA represents each document as a probability distribution vector over the topic set; here these vectors represent the visual and textual features of social images.
In step (4), latent Dirichlet allocation models are used to perform probabilistic modeling on the image collection and the text collection, respectively. The LDA model assumes that a common topic set lies hidden behind the document set, that each document corresponds to a probability distribution over this topic set, and that each word in a document corresponds to a topic generated from this distribution; the per-document distributions are independent of each other and are all drawn from a common Dirichlet distribution. Under these assumptions, the deep visual bags-of-words and deep text bags-of-words obtained in step (3) serve as input, and the LDA model is used to infer the probabilistic topic distribution hidden behind the documents of each modality (text documents and visual documents), laying the foundation for the next step, the relational topic model fusing cross-modal correlation.
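For illustration, the following is a minimal sketch of the per-modality topic inference, with scikit-learn's LDA implementation standing in for the model described above; matrix contents and topic counts are illustrative assumptions.

```python
# A minimal sketch of per-modality LDA inference on the deep bags-of-words.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
bow_matrix = rng.integers(0, 5, size=(10, 100))       # deep visual bags-of-words
text_bow_matrix = rng.integers(0, 5, size=(10, 200))  # deep text bags-of-words

visual_lda = LatentDirichletAllocation(n_components=50, random_state=0)
text_lda = LatentDirichletAllocation(n_components=50, random_state=0)

theta_v = visual_lda.fit_transform(bow_matrix)        # per-image topic mixtures
theta_t = text_lda.fit_transform(text_bow_matrix)     # per-text topic mixtures
# theta_v and theta_t serve as initial multi-modal topic features for step (5).
```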
(5) Relational topic model construction fusing cross-modal topic correlation analysis
Building the relational topic model fuses the correlation information between modalities into the topic modeling process. Specifically, the topic distributions of the different modalities obtained in step (4) serve as initial values; the correlation between the topic features of the different modalities is computed by mapping them into a common subspace, and this correlation computation is fused into the topic model, so that when inferring the topics hidden behind a document of one modality, the correlation with the other modality is also considered. The final topic information thus reflects not only the distributional information within the same modality but also the relation to the other modality.
The main goal of this step is to build a joint probability distribution that maximizes the likelihood of the observed multi-modal documents. In building the model, the multi-modal document set D^M is divided into three parts: the first part is the visual image set D^V, the second part is the text description set D^T, and the third part is the link set L^VT (this set indicates the correlation between images and texts). D^V is composed of deep visual words DW^V over the deep visual dictionary DV^V, and D^T is composed of deep text words DW^T over the deep text dictionary DV^T. For l_vt ∈ L^VT, l_vt = 1 means that the visual image d_v ∈ D^V and the text description d_t ∈ D^T are correlated, while l_vt = 0 means that they are uncorrelated. On this basis, the relational topic model is formalized as follows: let DT^V be the visual topic set and DT^T the text topic set; α and β are two hyperparameters, where α parameterizes the Dirichlet distribution of topics and β parameterizes the Dirichlet distribution of topic-deep-word distributions; θ_v is the topic distribution behind the visual image d_v, and θ_t is the topic distribution behind the text d_t; Φ is the multinomial distribution over all deep words for each topic; z is the latent topic of each word, generated from θ; Dir(·) and Mult(·) denote the Dirichlet and multinomial distributions; N_d is the number of deep words in document d, and n indexes the n-th deep word. The generation process of the whole relational topic model is as follows:
(1) For each topic tv ∈ DT^V in the visual topic set:
(a) Sample from the topic-visual-word Dirichlet distribution the multinomial distribution of tv over all visual words: φ^v_tv ~ Dir(β^v).
(2) For each topic tt ∈ DT^T in the text topic set:
(a) Sample from the topic-text-word Dirichlet distribution the multinomial distribution of tt over all text words: φ^t_tt ~ Dir(β^t).
(3) For each visual document d ∈ D^V:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set: θ^v_d ~ Dir(α^v).
(b) For each deep visual word w^v_{d,n} in d:
i. Sample the topic of this word from the topic distribution behind document d: z^v_{d,n} ~ Mult(θ^v_d);
ii. Sample the word at this position of the document from the topic-visual-word distribution: w^v_{d,n} ~ Mult(φ^v_{z_{d,n}});
(4) For each text document d ∈ D^T:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set: θ^t_d ~ Dir(α^t);
(b) For each deep text word w^t_{d,n} in d:
i. Sample the topic of this word from the topic distribution behind document d: z^t_{d,n} ~ Mult(θ^t_d);
ii. Sample the word at this position of the document from the topic-text-word distribution: w^t_{d,n} ~ Mult(φ^t_{z_{d,n}});
(5) For each link l_vt ∈ L^VT, representing the correlation between visual document d_v and text document d_t:
(a) Compute the correlation from the topic features of d_v and d_t and sample l_vt accordingly: l_vt ~ TCor(l_vt | z̄_{d_v}, z̄_{d_t}, M_v, M_t), where z̄_{d_v} and z̄_{d_t} are the empirical topic distributions of documents d_v and d_t respectively, M_v and M_t are two matrices that map the visual and text topic features, respectively, into a common subspace of dimension dim, TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation.
Based on the above process, the joint probability distribution modeling the whole multi-modal document set is finally built as formula (3):

$$P(DW^V, DW^T, L^{VT}, Z^V, Z^T, \theta, \Phi \mid \alpha, \beta, M_v, M_t) = \prod_{tv \in DT^V} P(\phi^v_{tv} \mid \beta^v)\prod_{tt \in DT^T} P(\phi^t_{tt} \mid \beta^t) \cdot \prod_{d \in D^V} P(\theta^v_d \mid \alpha^v)\prod_{n=1}^{N_d} P(z^v_{d,n} \mid \theta^v_d)\,P(w^v_{d,n} \mid \phi^v_{z_{d,n}}) \cdot \prod_{d \in D^T} P(\theta^t_d \mid \alpha^t)\prod_{n=1}^{N_d} P(z^t_{d,n} \mid \theta^t_d)\,P(w^t_{d,n} \mid \phi^t_{z_{d,n}}) \cdot \prod_{l_{vt} \in L^{VT}} TCor(l_{vt} \mid \bar{z}_{d_v}, \bar{z}_{d_t}, M_v, M_t) \qquad (3)$$

where the first term corresponds to the generation of the topic-deep-word distributions, the middle two terms correspond to the generation of the deep visual words and the deep text words, and the last term represents the generation of the image-description links.
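For illustration, the following is a minimal numpy sketch of the single-modality part of this generative process, covering steps (1) through (4); sizes and hyperparameter values are illustrative assumptions, and the link step (5) depends on TCor, which is detailed with formula (4) below.

```python
# A minimal numpy sketch of LDA-style generation for one modality.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N_d = 10, 100, 5, 20  # topics, dictionary size, documents, words/doc
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(V, beta), size=K)  # steps (1)/(2): topic-word dists
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))   # steps (3a)/(4a): topic mixture
    z = rng.choice(K, size=N_d, p=theta)       # step i: one topic per word
    w = np.array([rng.choice(V, p=phi[t]) for t in z])  # step ii: word per topic
    docs.append((z, w))
```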
(6) Cross-media retrieval (application of the relational topic model)
Step (6) applies the relational topic model established in step (5) to cross-media retrieval. For images and text, cross-media retrieval divides into two classes, text-query-to-image and image-query-to-text: text-query-to-image ranks all images by the relevance of each image to a given query text, computed with the relational topic model, while image-query-to-text ranks all text documents by their relevance to a given query image.
For a given query (e.g., querying text with an image), the relational topic model is used to infer the corresponding topic features, and the correlation computation of step (5) is used to compute the correlation with the documents of the other modality (e.g., text documents); the text documents are ranked by this correlation, and the text documents most relevant to the query image are returned. The same process applies to cross-media retrieval that queries images with text.
In summary, aiming at the content heterogeneity and relatedness between modalities in multi-modal documents, the present invention proposes a cross-modal topic correlation modeling method based on deep learning; the generation process of whole multi-modal documents is described in the form of a probabilistic model, and the correlation between documents of different modalities is quantified. The method can be effectively applied to large-scale cross-media image retrieval, improving retrieval relevance and enhancing user experience.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Fig. 2 is a schematic diagram of building the deep-word representation of a multi-modal document.
Fig. 3 is a schematic diagram of the cross-modal relational topic correlation modeling process.
Fig. 4 compares the proposed relational topic model with traditional multi-modal topic models.
Fig. 5 shows the effect of cross-media retrieval using the constructed relational topic model.
Detailed description of the invention
The cross-modal correlation computation method of the present invention for social images is discussed in detail below in conjunction with the drawings.
(1) Data object collection
Collect the data objects to obtain images and image annotations, and remove annotation words that rarely appear in the whole data set or are useless. A collected data set typically carries a lot of noise, so it should be properly processed and filtered before being used for feature extraction. The collected images are all in uniform JPG format and need no conversion. The text annotations of the images, however, contain many meaningless words, such as words containing digits that carry no meaning. Some images carry dozens of annotations; to let the annotations describe the main information of an image well, those useless and meaningless annotations should be discarded. The processing steps taken are therefore as follows:
Step 1: count the frequency with which each annotation word occurs in the data set;
Step 2: filter out the meaningless words containing digits;
Step 3: for each image annotation in the whole data set, treat low-frequency words as minor information of the image and delete them.
Through the above steps, the processed image annotations are obtained. As for removing low-frequency words in Step 3, the reason is that the annotations of images in the same cluster contain many identical or synonymous words; filtering by occurrence frequency is therefore entirely reasonable.
(2) Multi-modal feature extraction
Fig. 2 shows the process of extracting features by deep learning and building deep words. In the present invention, Region-CNN is used to detect image regions and extract the corresponding CNN features; the feature dimension is 4,096. In general, Region-CNN selects about 2,000 candidate regions per image, so the feature matrix of one image has size 2,000x4,096. If the regions of all images were clustered directly, the data volume would be M*2,000*4,096, M being the number of images; the space-time cost of such a volume is obviously huge. To solve this practical problem, a combined internal-external clustering method is used: first, an internal clustering (into 10 classes) is performed over the regions of each image; afterwards, an external clustering (into 100 classes) is performed over all the resulting centers, so the external clustering actually operates on only M*10*4,096 values, greatly reducing the space-time cost of clustering. Another point to note is that both the Region-CNN visual feature extraction and the Skip-gram word feature extraction use pre-trained models: Region-CNN uses AlexNet pre-trained on ImageNet, while Skip-gram uses a model trained on a Wikipedia corpus containing 6 billion words. This is mainly because training deep neural networks requires large amounts of data; to avoid over-fitting, models trained on large-scale data sets are used to extract the corresponding features from the real data.
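A minimal sketch of this internal-external clustering follows; the inner and outer cluster counts (10 and 100) follow the description above, while the stand-in features and their dimension are illustrative assumptions.

```python
# A minimal sketch of internal (per-image) then external (global) clustering.
import numpy as np
from sklearn.cluster import KMeans

def internal_external_dictionary(region_feats, inner_k=10, outer_k=100):
    # Internal step: compress each image's regions down to inner_k centers.
    per_image_centers = []
    for feats in region_feats:
        k = min(inner_k, len(feats))
        inner = KMeans(n_clusters=k, n_init=5, random_state=0).fit(feats)
        per_image_centers.append(inner.cluster_centers_)
    # External step: cluster only M * inner_k centers instead of all regions.
    outer = KMeans(n_clusters=outer_k, n_init=5, random_state=0)
    return outer.fit(np.vstack(per_image_centers))  # centers form the dictionary

rng = np.random.default_rng(0)
toy_feats = [rng.normal(size=(50, 64)) for _ in range(30)]  # stand-in region features
dictionary = internal_external_dictionary(toy_feats)
```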
(3) Cross-modal topic correlation computation
Fig. 3 shows the cross-modal relational topic correlation modeling process. As introduced above, TCor(l_vt | z̄_{d_v}, z̄_{d_t}, M_v, M_t) is used to compute the correlation between a visual document d_v and a text document d_t, where M_v and M_t are the mapping matrices for the visual and text topic features, TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation. TCor(·) is defined as follows:
$$TCor(l_{vt} \mid \bar{z}_{d_v}, \bar{z}_{d_t}, M_v, M_t) = \begin{cases} \mathrm{sigmoid}(f_v \cdot f_t), & l_{vt}=1 \\ 1-\mathrm{sigmoid}(f_v \cdot f_t), & l_{vt}=0 \end{cases} \quad\text{or}\quad \begin{cases} 0.5+0.5\,\mathrm{cosine}(f_v, f_t), & l_{vt}=1 \\ 0.5-0.5\,\mathrm{cosine}(f_v, f_t), & l_{vt}=0 \end{cases} \qquad (4)$$

$$f_v = \bar{z}_{d_v} M_v, \qquad f_t = \bar{z}_{d_t} M_t$$
Two modes are adopted here for different data types: mode one maps the dot product into the range [0,1] with the sigmoid function, while mode two computes the topic correlation as the normalized cosine similarity of the two vectors. Meanwhile, based on the generated multi-modal topic distributions, the parameters M_v and M_t can be trained by maximum likelihood estimation (MLE), i.e., by maximizing the log-likelihood of formula (4); the objective function is defined as:
$$F(M_v, M_t) = \arg\max_{M_v, M_t} \sum_{l_{vt}=1} \log\frac{1}{1+e^{-(f_v \cdot f_t)}} + \sum_{l_{vt}=0} \log\frac{e^{-(f_v \cdot f_t)}}{1+e^{-(f_v \cdot f_t)}} \quad\text{or}\quad \arg\max_{M_v, M_t} \sum_{l_{vt}=1} \log\left(0.5+\frac{f_v \cdot f_t}{2\,|f_v|\,|f_t|}\right) + \sum_{l_{vt}=0} \log\left(0.5-\frac{f_v \cdot f_t}{2\,|f_v|\,|f_t|}\right) \qquad (5)$$
Based on this objective function, the mapping matrices M_v and M_t can be computed by gradient descent. Note that in actual training, with |D^M| multi-modal documents and each multi-modal document usually comprising one image-text pair, the number of image documents and the number of text documents are essentially equal to the number of multi-modal documents, i.e., |D^V| = |D^T| = |D^M|. If the text and image occurring in the same multi-modal document are taken as correlated and those from different multi-modal documents as uncorrelated, the ratio of positive training samples (correlated image-text pairs) to negative samples (uncorrelated image-text pairs) would be about 1/|D^M|. Such a ratio causes serious imbalance between negative and positive samples; moreover, an image and a text not appearing in the same multi-modal document does not mean they are completely uncorrelated (they may belong to the same category). In practice, therefore, the ratio of negative to positive samples is set to 1:1, and negative samples are randomly chosen under the constraint that the image and text must not come from the same category.
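For illustration, the following is a minimal numpy sketch of learning M_v and M_t by gradient ascent on the sigmoid form of formula (5); sizes, the learning rate, and the toy pairing below are illustrative assumptions (real negatives would be sampled avoiding same-category pairs, as described above).

```python
# A minimal numpy sketch of training the mapping matrices per formula (5).
import numpy as np

rng = np.random.default_rng(0)
K_v, K_t, dim, lr = 50, 50, 20, 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mappings(M_v, M_t, z_v, z_t, pairs, epochs=50):
    for _ in range(epochs):
        for v, t, l in pairs:
            f_v, f_t = z_v[v] @ M_v, z_t[t] @ M_t
            err = l - sigmoid(f_v @ f_t)  # d(log-likelihood)/d(f_v . f_t)
            M_v += lr * err * np.outer(z_v[v], f_t)  # in-place updates
            M_t += lr * err * np.outer(z_t[t], f_v)

M_v = rng.normal(scale=0.01, size=(K_v, dim))
M_t = rng.normal(scale=0.01, size=(K_t, dim))
z_v = rng.dirichlet(np.ones(K_v), size=200)  # empirical visual topic mixtures
z_t = rng.dirichlet(np.ones(K_t), size=200)  # empirical text topic mixtures
pos = [(i, i, 1.0) for i in range(200)]      # co-occurring image-text pairs
neg = [(i, (i + 7) % 200, 0.0) for i in range(200)]  # 1:1 sampled negatives
train_mappings(M_v, M_t, z_v, z_t, pos + neg)
```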
(4) Multi-modal relational topic model inference
Formula (3) specifies the relational topic model constructed in the present invention; the parameters of the model are inferred by Gibbs sampling [26]. The aim of Gibbs sampling is to obtain the topic implied behind each word in the multi-modal documents. The sampling procedure first derives the marginal distribution over the topic assignments of the deep words, the words themselves, and the corresponding cross-modal links, as formula (6):
where m_{d,tt} is the number of times topic tt occurs in document d, and n_{tt,w} is the number of words generated by topic tt in the whole document set. From formula (6), the univariate probability distribution of the topic assignment z can be further derived, giving the sampling rule for the topic behind each word in a document, as shown in formula (7):
$$P(z^v_{d,n} = tv \mid Z_{-d,n}, DW^V, DW^T, L^{VT}) \propto \frac{\hat{m}^v_{d,tv}+\alpha^v}{\sum_{tv \in DT^V}\hat{m}^v_{d,tv}+|DT^V|\,\alpha^v} \cdot \frac{\hat{n}^v_{tv,\,w^v_{d,n}}+\beta^v}{\sum_{w \in DV^V}\hat{n}^v_{tv,w}+|DV^V|\,\beta^v} \cdot \prod_{l_{vt} \in L^{VT},\, d \in l_{vt}} TCor(l_{vt} \mid \bar{z}_{d_v}, \bar{z}_{d_t}, M_v, M_t)$$

$$P(z^t_{d,n} = tt \mid Z_{-d,n}, DW^V, DW^T, L^{VT}) \propto \frac{\hat{m}^t_{d,tt}+\alpha^t}{\sum_{tt \in DT^T}\hat{m}^t_{d,tt}+|DT^T|\,\alpha^t} \cdot \frac{\hat{n}^t_{tt,\,w^t_{d,n}}+\beta^t}{\sum_{w \in DV^T}\hat{n}^t_{tt,w}+|DV^T|\,\beta^t} \cdot \prod_{l_{vt} \in L^{VT},\, d \in l_{vt}} TCor(l_{vt} \mid \bar{z}_{d_v}, \bar{z}_{d_t}, M_v, M_t) \qquad (7)$$
where m̂ denotes the number of occurrences of a topic in document d after removing the current word, and n̂ denotes the number of words assigned to that topic excluding the current word. Based on this sampling rule, the topic implied behind each word in the whole document set can be obtained by sampling. After each sampling pass finishes, formula (5) is used to compute the mapping matrices M_t and M_v on the basis of the topic distributions obtained in the current pass, and the resulting M_t and M_v serve as input to the next sampling pass; this cycle repeats until the iteration termination condition is reached, yielding the final topic information and the mapping matrices M_t and M_v. Correspondingly, the other parameters of the relational topic model, such as Φ^V, Φ^T, θ^V and θ^T, are finally obtained from formula (8), the standard smoothed-count estimates of [26], e.g. θ^v_{d,tv} = (m^v_{d,tv} + α^v) / (Σ_{tv' ∈ DT^V} m^v_{d,tv'} + |DT^V| α^v).
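For illustration, the following is a minimal collapsed Gibbs sampling sketch of the count-based factors of formula (7) for one modality; the TCor link factor would multiply into the unnormalized probabilities before sampling and is omitted here for brevity, and all sizes are illustrative assumptions.

```python
# A minimal collapsed Gibbs sampling sketch of the count factors in formula (7).
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha, beta = 10, 100, 0.1, 0.01
docs = [list(rng.integers(0, V, size=30)) for _ in range(5)]
z = [list(rng.integers(0, K, size=len(d))) for d in docs]

m_dt = np.zeros((len(docs), K))  # topic counts per document
n_tw = np.zeros((K, V))          # word counts per topic
n_t = np.zeros(K)                # total words per topic
for d, words in enumerate(docs):
    for t, w in zip(z[d], words):
        m_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

def gibbs_pass():
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            t = z[d][n]  # remove the current assignment from the counts
            m_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            p = (m_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t = rng.choice(K, p=p / p.sum())  # sample per formula (7)
            z[d][n] = t  # add the new assignment back
            m_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

for _ in range(10):
    gibbs_pass()
```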
(5) Application example
Fig. 5 shows the effect of cross-media retrieval using the constructed relational topic model, divided into two modes: retrieving text with an image (Image Query-to-Text) and retrieving images with a text (Text Query-to-Image). The relevance score is computed as shown in formula (9):
$$RankingScore(\text{image-query-to-text}) = RankingScore(d_t \mid d_v) = \frac{TCor(l_{vt}=1 \mid \theta^v_{d_v}, \theta^t_{d_t}, M_v, M_t)}{\sum_{d_t \in D^T} TCor(l_{vt}=1 \mid \theta^v_{d_v}, \theta^t_{d_t}, M_v, M_t)}$$

$$RankingScore(\text{text-query-to-image}) = RankingScore(d_v \mid d_t) = \frac{TCor(l_{vt}=1 \mid \theta^v_{d_v}, \theta^t_{d_t}, M_v, M_t)}{\sum_{d_v \in D^V} TCor(l_{vt}=1 \mid \theta^v_{d_v}, \theta^t_{d_t}, M_v, M_t)} \qquad (9)$$
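For illustration, a minimal sketch of image-query-to-text ranking per formula (9) follows, using the sigmoid mode of TCor; all matrix contents below are illustrative assumptions.

```python
# A minimal sketch of image-query-to-text ranking per formula (9).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rank_texts_for_image(q, theta_v, theta_t, M_v, M_t):
    f_v = theta_v[q] @ M_v                 # query image in the common subspace
    scores = sigmoid(theta_t @ M_t @ f_v)  # TCor(l_vt = 1) against every text
    scores = scores / scores.sum()         # normalize as in formula (9)
    return np.argsort(-scores)             # most relevant text indices first

rng = np.random.default_rng(0)
theta_v = rng.dirichlet(np.ones(50), size=10)   # inferred image topic mixtures
theta_t = rng.dirichlet(np.ones(50), size=100)  # inferred text topic mixtures
M_v, M_t = rng.normal(size=(50, 20)), rng.normal(size=(50, 20))
order = rank_texts_for_image(0, theta_v, theta_t, M_v, M_t)
```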
List of references
[1] Fan, J.P.; He, X.F.; Zhou, N.; Peng, J.Y.; and Jain, R. 2012. Quantitative Characterization of Semantic Gaps for Learning Complexity Estimation and Inference Model Selection. IEEE Transactions on Multimedia 14(5): 1414-1428.
[2] Datta, R.; Joshi, D.; Li, J.; and Wang, J.Z. 2008. Image Retrieval: Ideas, Influences, and Trends of the New Age. ACM Computing Surveys (CSUR) 40(2), Article 5.
[3] Rasiwasia, N.; Pereira, J.C.; Coviello, E.; Doyle, G.; Lanckriet, G.R.G.; Levy, R.; and Vasconcelos, N. 2010. A New Approach to Cross-modal Multimedia Retrieval. In Proceedings of MM 2010, 251-260.
[4] Pereira, J.C.; Coviello, E.; Doyle, G.; Rasiwasia, N.; Lanckriet, G.R.G.; Levy, R.; and Vasconcelos, N. 2014. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 36(3): 521-535.
[5] Barnard, K.; Duygulu, P.; Forsyth, D.; Freitas, N.; Blei, D.M.; and Jordan, M.I. 2003. Matching Words and Pictures. Journal of Machine Learning Research 3: 1107-1135.
[6] Wang, X.; Liu, Y.; Wang, D.; and Wu, F. 2013. Cross-media Topic Mining on Wikipedia. In Proceedings of MM 2013, 689-692.
[7] Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.A.; and Mikolov, T. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Proceedings of NIPS 2013.
[8] Feng, F.X.; Wang, X.J.; and Li, R.F. 2014. Cross-modal Retrieval with Correspondence Autoencoder. In Proceedings of MM 2014, 7-16.
[9] Nguyen, C.T.; Kaothanthong, N.; Tokuyama, T.; and Phan, X.H. 2013. A Feature-Word-Topic Model for Image Annotation and Retrieval. ACM Transactions on the Web 7(3), Article 12.
[10] Ramage, D.; Heymann, P.; Manning, C.D.; and Molina, H.G. 2009. Clustering the Tagged Web. In Proceedings of WSDM 2009, 54-63.
[11] Blei, D.M.; and Jordan, M.I. 2003. Modeling Annotated Data. In Proceedings of SIGIR 2003, 127-134.
[12] Wang, C.; Blei, D.; and Fei-Fei, L. 2009. Simultaneous Image Classification and Annotation. In Proceedings of CVPR 2009, 1903-1910.
[13] Putthividhya, D.; Attias, H.T.; and Nagarajan, S.S. 2010. Topic Regression Multi-Modal Latent Dirichlet Allocation for Image Annotation. In Proceedings of CVPR 2010, 3408-3415.
[14] Niu, Z.X.; Hua, G.; Gao, X.B.; and Tian, Q. 2014. Semi-supervised Relational Topic Model for Weakly Annotated Image Recognition in Social Media. In Proceedings of CVPR 2014, 4233-4240.
[15] Wang, Y.F.; Wu, F.; Song, J.; Li, X.; and Zhuang, Y.T. 2014. Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval. In Proceedings of MM 2014, 307-316.
[16] Zheng, Y.; Zhang, Y.J.; and Larochelle, H. 2014. Topic Modeling of Multimodal Data: an Autoregressive Approach. In Proceedings of CVPR 2014, 1370-1377.
[17] Chen, T.; SalahEldeen, H.M.; He, X.N.; Kan, M.Y.; and Lu, D.Y. 2015. VELDA: Relating an Image Tweet's Text and Images. In Proceedings of AAAI 2015.
[18] Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of CVPR 2014, 580-587.
[19] Hariharan, B.; Arbelaez, P.; Girshick, R.; and Malik, J. 2014. Simultaneous Detection and Segmentation. In Proceedings of ECCV 2014, 297-312.
[20] Karpathy, A.; Joulin, A.; and Fei-Fei, L. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Proceedings of NIPS 2014.
[21] Zhang, N.; Donahue, J.; Girshick, R.; and Darrell, T. 2014. Part-Based R-CNNs for Fine-Grained Category Detection. In Proceedings of ECCV 2014, 834-849.
[22] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS 2013.
[23] Tang, D.Y.; Wei, F.R.; Qin, B.; Zhou, M.; and Liu, T. 2014. Building Large-Scale Twitter-Specific Sentiment Lexicon: A Representation Learning Approach. In Proceedings of COLING 2014, 172-182.
[24] Karpathy, A.; Joulin, A.; and Fei-Fei, L. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Proceedings of NIPS 2014.
[25] Sivic, J.; and Zisserman, A. 2003. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of ICCV 2003, 2: 1470-1477.
[26] Griffiths, T.L.; and Steyvers, M. 2004. Finding Scientific Topics. In Proceedings of the National Academy of Sciences of the United States of America, 101(1): 5228-5235.

Claims (6)

1. A cross-modal topic correlation modeling method based on deep learning, characterized in that it comprises the following steps:
(1) data preprocessing: collecting images of different modalities from a multimedia data set to obtain images and image descriptions, and removing annotation words that rarely appear in the image annotation data set or are useless;
(2) multi-modal deep feature extraction: using deep learning methods to extract the visual features of images and the semantic features of image descriptions; specifically, a Region-CNN model and a Skip-gram model are used to extract the region features of images and the word features of text, respectively; wherein Region-CNN first detects a set of representative region candidates in an image, and then uses a pre-trained convolutional neural network to extract the feature of each region; the Skip-gram model directly trains word feature vectors using the co-occurrence information between text words;
(3) deep bag-of-words construction: clustering the image region features and text word features obtained in step (2) with the K-means clustering algorithm to obtain a deep visual dictionary and a deep text dictionary of limited size, and then mapping all region features of each image to the visual dictionary to build a deep visual bag-of-words model; similarly, the words in each text are mapped to the text dictionary to obtain a deep text bag-of-words model;
(4) multi-modal topic generation: using the assumptions of the latent Dirichlet allocation model to simulate the generation process of the whole multi-modal data set, and inferring the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words;
(5) relational topic model construction fusing cross-modal topic correlation analysis: building the corresponding relational topic model, i.e., considering the correlation between the topic features of different modalities while building the topic model; using the multi-modal topic features obtained in step (4) as initial values, the correlation between image and text is computed from their association information, and the computed correlation is used to update the topic information of the multi-modal documents, so that correlation computation and topic distribution update alternate iteratively, finally yielding the relational topic model;
(6) topic-correlation-based cross-media retrieval: applying the obtained cross-modal topic correlation to cross-media retrieval, where, given a query of one modality, the correlation computation is used to obtain the most relevant data of the other modality.
2. The method according to claim 1, characterized in that in step (2), the Region-CNN and Skip-gram models are used to extract the region features of images and the word features of text, respectively, as follows:
given an image, Region-CNN first uses selective search to select positions where objects are likely to appear as a candidate set, in the form of regions; a CNN feature is then extracted for each region; in the implementation, Region-CNN converts each image region to a fixed size of 227x227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers;
given a text document, the Skip-gram model is trained to obtain a feature vector for each word occurring in the text; TD denotes the textual description part of the whole multi-modal document set, TW is the set of all text words occurring in TD, and TV is the dictionary of text words; for each word tw in TW, iv_tw and ov_tw are the input and output feature vectors of tw, and Context(tw) is the set of words occurring around tw; the context window size is set to 5, and all input and output vectors of the whole text data set are concatenated into one long parameter vector W ∈ R^{2·|TV|·dim}, where dim is the dimension of the input and output vectors; the objective function of the whole Skip-gram model is:
$$B_{SG}(\omega) = \arg\max_{\omega} \frac{1}{|W|}\sum_{i=1}^{|W|}\;\sum_{j \in Context(w_i)} \log P(w_j \mid w_i) = \arg\max_{\omega} \frac{1}{|W|}\sum_{i=1}^{|W|}\;\sum_{j \in Context(w_i)} \log\frac{\exp(O_{w_j} \cdot I_{w_i})}{\sum_{k=1}^{|TV|}\exp(O_{w_k} \cdot I_{w_i})} \qquad (1)$$
the negative sampling method is used to approximate log P(tw_j | tw_i), computed as follows:
$$\log P(w_j \mid w_i) = \log\sigma(O_{w_j} \cdot I_{w_i}) + \sum_{k=1}^{m}\mathbb{E}_{w_k \sim P(w)}\left[\log\sigma(-O_{w_k} \cdot I_{w_i})\right] \qquad (2)$$
wherein σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(w).
3. The method according to claim 1, characterized in that step (3), on the basis of the deep words obtained in step (2), further builds the deep bag-of-words model by vector quantization, as follows: for the region candidate sets extracted by R-CNN and their corresponding features, K-means is first used to cluster the region features of all images in the multi-modal document set into a fixed number of classes, the center of each cluster serving as the representative element of that class, and all the classes together forming the corresponding dictionary; afterwards, each candidate region of an image is mapped to its class by computing the Euclidean distance between the region feature and each cluster center to find the nearest class, and the count at the corresponding position of the image's vector is incremented, so that every image in the whole data set is represented as a deep visual bag-of-words, i.e., each image corresponds to one vector whose dimension equals the number of classes and whose elements are the numbers of occurrences of each class in the image, written VT ∈ R^C, where C is the number of clusters obtained; similarly, for all the word vectors of the text documents, clustering yields the corresponding deep text dictionary, and each text is finally represented as a deep text bag-of-words by the same mapping.
4. The method according to claim 1, characterized in that: in step (4), latent Dirichlet allocation (LDA) is used to probabilistically model the image and the text collections separately. LDA assumes that a common hidden topic set lies behind the document collection, that each concrete document corresponds to a probability distribution over this topic set, and that each word in the document corresponds to a topic generated from that distribution; the distributions of the individual documents are mutually independent, each generated from a common Dirichlet distribution. Under this model assumption, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the LDA model is used to infer the latent topic distribution behind each document of either modality.
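As an illustration of this step, the sketch below fits LDA separately on the two deep bag-of-words matrices with scikit-learn; the topic count and prior values are assumptions.

```python
# Sketch of claim 4: infer latent topic distributions from deep BoW input.
from sklearn.decomposition import LatentDirichletAllocation

def infer_topics(bow_matrix, n_topics=50, alpha=0.1, beta=0.01):
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        doc_topic_prior=alpha,    # Dirichlet prior on document-topic dist.
        topic_word_prior=beta,    # Dirichlet prior on topic-word dist.
        learning_method="batch",
    )
    theta = lda.fit_transform(bow_matrix)  # one topic distribution per document
    return lda, theta

# lda_v, theta_v = infer_topics(visual_bow)   # visual modality
# lda_t, theta_t = infer_topics(text_bow)     # textual modality
```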
5. The method according to claim 1, characterized in that: in step (5), when building the model, the multi-modal document set D^M is divided into three constituent parts: the first part is the visual image set D^V, the second part is the text description set D^T, and the third part is the link set L^VT, which records the association information between images and texts. D^V is built from the deep visual vocabulary set DW^V with deep visual dictionary DV^V, while D^T is built from the deep text vocabulary set DW^T with deep text dictionary DV^T. For l_vt ∈ L^VT, l_vt = 1 means visual image d_v ∈ D^V and text description d_t ∈ D^T are related, and l_vt = 0 means they are unrelated. On this basis, the relational topic model is formalized as follows: DT^V is the visual topic set and DT^T the text topic set; α and β are two hyperparameters, α for the Dirichlet distribution over topics and β for the Dirichlet distribution of the topic-deep-vocabulary multinomials; θ_v is the topic distribution behind visual image d_v, and θ_t the topic distribution behind text document d_t; Φ comprises, for each topic, the multinomial distribution over all deep vocabulary; z holds the latent topics of all words, actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet and multinomial distributions respectively; N_d is the number of deep words in document d, and n indexes the n-th deep word. The generative process of the whole relational topic model is as follows:
(1) For each topic tv ∈ DT^V in the visual topic set: sample the multinomial distribution of tv over all visual words from the topic-visual-word Dirichlet, i.e. φ^v_tv ~ Dir(β^v);
(2) For each topic tt ∈ DT^T in the text topic set: sample the multinomial distribution of tt over all text words from the topic-text-word Dirichlet, i.e. φ^t_tt ~ Dir(β^t);
(3) For each visual document d ∈ D^V:
(a) sample the topic distribution behind d from the Dirichlet over the topic set, i.e. θ^v_d ~ Dir(α^v);
(b) for each deep visual word w^v_{d,n} in d:
i. sample the topic of this word from the topic distribution behind document d, i.e. z^v_{d,n} ~ Mult(θ^v_d);
ii. sample the word at this position of the document from the topic-visual-word multinomial, i.e. w^v_{d,n} ~ Mult(φ^v_{z^v_{d,n}});
(4) For each text document d ∈ D^T:
(a) sample the topic distribution behind d from the Dirichlet over the topic set, i.e. θ^t_d ~ Dir(α^t);
(b) for each deep text word w^t_{d,n} in d:
i. sample the topic of this word from the topic distribution behind document d, i.e. z^t_{d,n} ~ Mult(θ^t_d);
ii. sample the word at this position of the document from the topic-text-word multinomial, i.e. w^t_{d,n} ~ Mult(φ^t_{z^t_{d,n}});
(5) For each link l_vt ∈ L^VT, representing the association information between visual document d_v and text document d_t:
(a) compute the topic correlation of d_v and d_t from their topic features and sample l_vt from it: the correlation is computed from the empirical topic distributions of d_v and d_t, projected by two mapping matrices for the visual and the text topic features respectively into a common subspace of dimension dim (a sketch follows below); TCor(l_vt = 1) expresses the topic correlation of d_t and d_v, and TCor(l_vt = 0) their topic non-correlation.
Based on the above process, a joint probability distribution modeling the whole multi-modal document set is finally built as a product of terms: the first term corresponds to the generation of the topic-deep-vocabulary distributions, the middle two terms to the generation of the deep visual words and the deep text words, and the last term to the generation of the image-description links.
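A hedged sketch of the link step (5)(a): the empirical topic distributions of an image/text pair (written zbar_v and zbar_t below, names assumed) are projected by the two mapping matrices (written Mv and Mt) into the common dim-dimensional subspace, and a correlation score for l_vt is derived; since the exact TCor formula is not reproduced in the text, a sigmoid of the inner product is assumed here.

```python
# Sketch of the link probability in step (5)(a); the sigmoid-of-inner-product
# form of TCor is an assumption, as the claim's formula is not reproduced.
import numpy as np

def empirical_topics(z_assignments, n_topics):
    """Empirical topic distribution: topic frequencies among a doc's words."""
    return np.bincount(z_assignments, minlength=n_topics) / len(z_assignments)

def link_probability(zbar_v, zbar_t, Mv, Mt):
    """P(l_vt = 1): correlate the projected visual and text topic features."""
    u = Mv @ zbar_v          # visual topics -> common dim-d subspace
    v = Mt @ zbar_t          # text topics   -> common dim-d subspace
    return 1.0 / (1.0 + np.exp(-(u @ v)))   # assumed sigmoid correlation
```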
6. The method according to claim 1, characterized in that: in step (6), the relational topic model established in step (5) is applied to cross-media information retrieval. Cross-media retrieval divides into two classes, text-query-image and image-query-text: text-query-image ranks all images for a given query text according to the image-text relevance computed with the relational topic model, while image-query-text ranks all text documents according to their relevance to a given query image.
For image-query-text, the relational topic model is used to infer the topic features of the given query image; the correlation computation of step (5) applied to these topic features then yields the relevance to the documents of the other modality, the text documents are ranked by this relevance, and the text documents most related to the query image are returned. The same procedure applies to cross-media retrieval that queries images with a text.
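For illustration, image-query-text retrieval then reduces to scoring and sorting; the sketch below reuses the assumed link_probability helper defined above.

```python
# Sketch of claim 6's image-query-text retrieval: rank all texts by their
# (assumed) link probability with the query image and return the order.
def rank_texts(query_zbar_v, text_zbars, Mv, Mt):
    scores = [link_probability(query_zbar_v, zbar_t, Mv, Mt)
              for zbar_t in text_zbars]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores   # indices of text documents, most related first

# Text-query-image retrieval is symmetric: fix zbar_t and rank images instead.
```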
CN201610099438.9A 2016-02-23 2016-02-23 Cross-modal subject correlation modeling method based on deep learning Expired - Fee Related CN105760507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610099438.9A CN105760507B (en) 2016-02-23 2016-02-23 Cross-modal subject correlation modeling method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610099438.9A CN105760507B (en) 2016-02-23 2016-02-23 Cross-modal subject correlation modeling method based on deep learning

Publications (2)

Publication Number Publication Date
CN105760507A true CN105760507A (en) 2016-07-13
CN105760507B CN105760507B (en) 2019-05-03

Family

ID=56330274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610099438.9A Expired - Fee Related CN105760507B (en) Cross-modal subject correlation modeling method based on deep learning

Country Status (1)

Country Link
CN (1) CN105760507B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559192A (en) * 2013-09-10 2014-02-05 浙江大学 Media-crossed retrieval method based on modal-crossed sparse topic modeling
CN103559193A (en) * 2013-09-10 2014-02-05 浙江大学 Topic modeling method based on selected cell
CN104317837A (en) * 2014-10-10 2015-01-28 浙江大学 Cross-modal searching method based on topic model
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Fei et al.: "Cross-Media Combinational Semantic Deep Learning" (跨媒体组合语义深度学习), 2015 Annual Conference of the Zhejiang Signal Processing Society: Signal Processing in Big Data *

Cited By (93)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11621075B2 (en) 2016-09-07 2023-04-04 Koninklijke Philips N.V. Systems, methods, and apparatus for diagnostic inferencing with a multimodal deep memory network
CN106156374A (en) * 2016-09-13 2016-11-23 华侨大学 A kind of view-based access control model dictionary optimizes and the image search method of query expansion
US11068652B2 (en) * 2016-11-04 2021-07-20 Mitsubishi Electric Corporation Information processing device
CN108073576A (en) * 2016-11-09 2018-05-25 上海诺悦智能科技有限公司 Intelligent search method, searcher and search engine system
WO2018103538A1 (en) * 2016-12-08 2018-06-14 北京推想科技有限公司 Deep learning method and device for analysis of high-dimensional medical data
CN106777050A (en) * 2016-12-09 2017-05-31 大连海事大学 It is a kind of based on bag of words and to take into account the footwear stamp line expression and system of semantic dependency
CN106777050B (en) * 2016-12-09 2019-09-06 大连海事大学 It is a kind of based on bag of words and to take into account the shoes stamp line expression and system of semantic dependency
CN106778880B (en) * 2016-12-23 2020-04-07 南开大学 Microblog topic representation and topic discovery method based on multi-mode deep Boltzmann machine
CN106778880A (en) * 2016-12-23 2017-05-31 南开大学 Microblog topic based on multi-modal depth Boltzmann machine is represented and motif discovery method
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image text description method based on knowledge transfer multi-modal recurrent neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106886783A (en) * 2017-01-20 2017-06-23 清华大学 A kind of image search method and system based on provincial characteristics
US11024066B2 (en) 2017-05-08 2021-06-01 Boe Technology Group Co., Ltd. Presentation generating system for medical images, training method thereof and presentation generating method
WO2018205715A1 (en) * 2017-05-08 2018-11-15 京东方科技集团股份有限公司 Medical image representation-generating system, training method therefor and representation generation method
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN107273517B (en) * 2017-06-21 2021-07-23 复旦大学 Graph-text cross-modal retrieval method based on graph embedding learning
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
CN109213988A (en) * 2017-06-29 2019-01-15 武汉斗鱼网络科技有限公司 Barrage subject distillation method, medium, equipment and system based on N-gram model
CN109325583B (en) * 2017-07-31 2022-03-08 财团法人工业技术研究院 Deep neural network structure, method using deep neural network, and readable medium
CN109325583A (en) * 2017-07-31 2019-02-12 财团法人工业技术研究院 Deep neural network, method and readable media using deep neural network
CN107480289B (en) * 2017-08-24 2020-06-30 成都澳海川科技有限公司 User attribute acquisition method and device
CN107480289A (en) * 2017-08-24 2017-12-15 成都澳海川科技有限公司 User property acquisition methods and device
US11907851B2 (en) * 2017-08-30 2024-02-20 Tencent Technology (Shenzhen) Company Limited Image description generation method, model training method, device and storage medium
CN108305296A (en) * 2017-08-30 2018-07-20 深圳市腾讯计算机系统有限公司 Iamge description generation method, model training method, equipment and storage medium
US11270160B2 (en) * 2017-08-30 2022-03-08 Tencent Technology (Shenzhen) Company Limited Image description generation method, model training method, device and storage medium
WO2019042244A1 (en) * 2017-08-30 2019-03-07 腾讯科技(深圳)有限公司 Image description generation method, model training method and device, and storage medium
US20220156518A1 (en) * 2017-08-30 2022-05-19 Tencent Technology (Shenzhen) Company Limited. Image description generation method, model training method, device and storage medium
TWI803514B (en) * 2017-08-30 2023-06-01 大陸商騰訊科技(深圳)有限公司 Image description generation method, model training method, devices and storage medium
CN107870992A (en) * 2017-10-27 2018-04-03 上海交通大学 Editable image of clothing searching method based on multichannel topic model
CN107798624A (en) * 2017-10-30 2018-03-13 北京航空航天大学 A kind of technical label in software Ask-Answer Community recommends method
CN107798624B (en) * 2017-10-30 2021-09-28 北京航空航天大学 Technical label recommendation method in software question-and-answer community
CN108256549B (en) * 2017-12-13 2019-03-15 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108256549A (en) * 2017-12-13 2018-07-06 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108399409A (en) * 2018-01-19 2018-08-14 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108399409B (en) * 2018-01-19 2019-06-18 北京达佳互联信息技术有限公司 Image classification method, device and terminal
US11048983B2 (en) 2018-01-19 2021-06-29 Beijing Dajia Internet Information Technology Co., Ltd. Method, terminal, and computer storage medium for image classification
WO2019141042A1 (en) * 2018-01-19 2019-07-25 北京达佳互联信息技术有限公司 Image classification method, device, and terminal
WO2019149135A1 (en) * 2018-02-05 2019-08-08 阿里巴巴集团控股有限公司 Word vector generation method, apparatus and device
US11030411B2 (en) 2018-02-05 2021-06-08 Alibaba Group Holding Limited Methods, apparatuses, and devices for generating word vectors
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study
CN108830903A (en) * 2018-04-28 2018-11-16 杨晓春 A kind of steel billet method for detecting position based on CNN
CN109145936A (en) * 2018-06-20 2019-01-04 北京达佳互联信息技术有限公司 A kind of model optimization method and device
CN109145936B (en) * 2018-06-20 2019-07-09 北京达佳互联信息技术有限公司 A kind of model optimization method and device
CN110110122A (en) * 2018-06-22 2019-08-09 北京交通大学 Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN109214412A (en) * 2018-07-12 2019-01-15 北京达佳互联信息技术有限公司 A kind of training method and device of disaggregated model
CN109213853A (en) * 2018-08-16 2019-01-15 昆明理工大学 A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm
CN109213853B (en) * 2018-08-16 2022-04-12 昆明理工大学 CCA algorithm-based Chinese community question-answer cross-modal retrieval method
CN111078902A (en) * 2018-10-22 2020-04-28 三星电子株式会社 Display device and operation method thereof
CN109472232A (en) * 2018-10-31 2019-03-15 山东师范大学 Video semanteme characterizing method, system and medium based on multi-modal fusion mechanism
CN110442721A (en) * 2018-11-28 2019-11-12 腾讯科技(深圳)有限公司 Neural network language model, training method, device and storage medium
CN110442721B (en) * 2018-11-28 2023-01-06 腾讯科技(深圳)有限公司 Neural network language model, training method, device and storage medium
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
JP2022509327A (en) * 2019-01-31 2022-01-20 シェンチェン センスタイム テクノロジー カンパニー リミテッド Cross-modal information retrieval method, its device, and storage medium
CN109886326B (en) * 2019-01-31 2022-01-04 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
TWI785301B (en) * 2019-01-31 2022-12-01 大陸商深圳市商湯科技有限公司 A cross-modal information retrieval method, device and storage medium
JP7164729B2 (en) 2019-01-31 2022-11-01 シェンチェン センスタイム テクノロジー カンパニー リミテッド CROSS-MODAL INFORMATION SEARCH METHOD AND DEVICE THEREOF, AND STORAGE MEDIUM
WO2020155423A1 (en) * 2019-01-31 2020-08-06 深圳市商汤科技有限公司 Cross-modal information retrieval method and apparatus, and storage medium
WO2020155418A1 (en) * 2019-01-31 2020-08-06 深圳市商汤科技有限公司 Cross-modal information retrieval method and device, and storage medium
JP2022510704A (en) * 2019-01-31 2022-01-27 シェンチェン センスタイム テクノロジー カンパニー リミテッド Cross-modal information retrieval methods, devices and storage media
TWI737006B (en) * 2019-01-31 2021-08-21 大陸商深圳市商湯科技有限公司 Cross-modal information retrieval method, device and storage medium
CN109886326A (en) * 2019-01-31 2019-06-14 深圳市商汤科技有限公司 A kind of cross-module state information retrieval method, device and storage medium
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN110337016A (en) * 2019-06-13 2019-10-15 山东大学 Short-sighted frequency personalized recommendation method and system based on multi-modal figure convolutional network
CN110647632A (en) * 2019-08-06 2020-01-03 上海孚典智能科技有限公司 Image and text mapping technology based on machine learning
CN110647632B (en) * 2019-08-06 2020-09-04 上海孚典智能科技有限公司 Image and text mapping technology based on machine learning
CN110503147A (en) * 2019-08-22 2019-11-26 山东大学 Multi-mode image categorizing system based on correlation study
CN110503147B (en) * 2019-08-22 2022-04-08 山东大学 Multi-mode image classification system based on correlation learning
CN111310453B (en) * 2019-11-05 2023-04-25 上海金融期货信息技术有限公司 User theme vectorization representation method and system based on deep learning
CN111310453A (en) * 2019-11-05 2020-06-19 上海金融期货信息技术有限公司 User theme vectorization representation method and system based on deep learning
CN111259152A (en) * 2020-01-20 2020-06-09 刘秀萍 Deep multilayer network driven feature aggregation category divider
CN112257445A (en) * 2020-10-19 2021-01-22 浙大城市学院 Multi-modal tweet named entity recognition method based on text-picture relation pre-training
CN112257445B (en) * 2020-10-19 2024-01-26 浙大城市学院 Multi-mode push text named entity recognition method based on text-picture relation pre-training
CN112507064A (en) * 2020-11-09 2021-03-16 国网天津市电力公司 Cross-modal sequence-to-sequence generation method based on topic perception
CN112507064B (en) * 2020-11-09 2022-05-24 国网天津市电力公司 Cross-modal sequence-to-sequence generation method based on topic perception
CN112632969A (en) * 2020-12-13 2021-04-09 复旦大学 Incremental industry dictionary updating method and system
CN112632969B (en) * 2020-12-13 2022-06-21 复旦大学 Incremental industry dictionary updating method and system
CN113157959A (en) * 2020-12-17 2021-07-23 云知声智能科技股份有限公司 Cross-modal retrieval method, device and system based on multi-modal theme supplement
CN112836746B (en) * 2021-02-02 2022-09-09 中国科学技术大学 Semantic correspondence method based on consistency graph modeling
CN112836746A (en) * 2021-02-02 2021-05-25 中国科学技术大学 Semantic correspondence method based on consistency graph modeling
CN113051932B (en) * 2021-04-06 2023-11-03 合肥工业大学 Category detection method for network media event of semantic and knowledge expansion theme model
CN113051932A (en) * 2021-04-06 2021-06-29 合肥工业大学 Method for detecting category of network media event of semantic and knowledge extension topic model
CN113139468A (en) * 2021-04-24 2021-07-20 西安交通大学 Video abstract generation method fusing local target features and global features
CN113298265A (en) * 2021-05-22 2021-08-24 西北工业大学 Heterogeneous sensor potential correlation learning method based on deep learning
CN113298265B (en) * 2021-05-22 2024-01-09 西北工业大学 Heterogeneous sensor potential correlation learning method based on deep learning
CN113297485B (en) * 2021-05-24 2023-01-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113297485A (en) * 2021-05-24 2021-08-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113392196A (en) * 2021-06-04 2021-09-14 北京师范大学 Topic retrieval method and system based on multi-mode cross comparison
CN113343679A (en) * 2021-07-06 2021-09-03 合肥工业大学 Multi-modal topic mining method based on label constraint
CN113343679B (en) * 2021-07-06 2024-02-13 合肥工业大学 Multi-mode subject mining method based on label constraint
CN113516118A (en) * 2021-07-29 2021-10-19 西北大学 Image and text combined embedded multi-mode culture resource processing method
CN113408282A (en) * 2021-08-06 2021-09-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN114880527B (en) * 2022-06-09 2023-03-24 哈尔滨工业大学(威海) Multi-modal knowledge graph representation method based on multi-prediction task
CN114880527A (en) * 2022-06-09 2022-08-09 哈尔滨工业大学(威海) Multi-modal knowledge graph representation method based on multi-prediction task

Also Published As

Publication number Publication date
CN105760507B (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN105760507A (en) Cross-modal subject correlation modeling method based on deep learning
Liu et al. A survey of sentiment analysis based on transfer learning
CN110489395B (en) Method for automatically acquiring knowledge of multi-source heterogeneous data
Vadicamo et al. Cross-media learning for image sentiment analysis in the wild
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
Zhang et al. A quantum-inspired multimodal sentiment analysis framework
Zhu et al. Unsupervised visual hashing with semantic assistant for content-based image retrieval
Gan et al. Recognizing an action using its name: A knowledge-based approach
CN110598005A (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN104899253A (en) Cross-modality image-label relevance learning method facing social image
CN104933029A (en) Text image joint semantics analysis method based on probability theme model
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
Mallik et al. Acquisition of multimedia ontology: an application in preservation of cultural heritage
Zuo et al. Representation learning of knowledge graphs with entity attributes and multimedia descriptions
CN114997288A (en) Design resource association method
Wang et al. Rare-aware attention network for image–text matching
CN113128237B (en) Semantic representation model construction method for service resources
CN112632223B (en) Case and event knowledge graph construction method and related equipment
Lang et al. A Survey on Out-of-Distribution Detection in NLP
Long et al. Bi-calibration networks for weakly-supervised video representation learning
Qian et al. Boosted multi-modal supervised latent Dirichlet allocation for social event classification
Zhang et al. An al-based spatial knowledge graph for enhancing spatial data and knowledge search and discovery
Xiao et al. Research on multimodal emotion analysis algorithm based on deep learning
CN114595370A (en) Model training and sorting method and device, electronic equipment and storage medium
Yang et al. Graph convolutional networks with dependency parser towards multiview representation learning for sentiment analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190503