CN105760507B - Cross-modal topic correlation modeling method based on deep learning - Google Patents
Cross-modal topic correlation modeling method based on deep learning
- Publication number
- CN105760507B CN201610099438.9A CN201610099438A
- Authority
- CN
- China
- Prior art keywords
- text
- theme
- image
- document
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 92
- 238000013135 deep learning Methods 0.000 title claims abstract description 11
- 230000008569 process Effects 0.000 claims abstract description 25
- 230000004927 fusion Effects 0.000 claims abstract description 7
- 238000009826 distribution Methods 0.000 claims description 48
- 230000000007 visual effect Effects 0.000 claims description 42
- 239000013598 vector Substances 0.000 claims description 36
- 238000013527 convolutional neural network Methods 0.000 claims description 24
- 238000013507 mapping Methods 0.000 claims description 15
- 238000012549 training Methods 0.000 claims description 15
- 238000004364 calculation method Methods 0.000 claims description 10
- 238000002372 labelling Methods 0.000 claims description 10
- 238000005070 sampling Methods 0.000 claims description 9
- 230000006870 function Effects 0.000 claims description 7
- 238000004458 analytical method Methods 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 4
- 238000013480 data collection Methods 0.000 claims description 3
- 239000004744 fabric Substances 0.000 claims description 3
- 238000013139 quantization Methods 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 claims description 2
- 238000001514 detection method Methods 0.000 claims description 2
- 239000000203 mixture Substances 0.000 claims description 2
- 238000005516 engineering process Methods 0.000 abstract description 4
- 238000010276 construction Methods 0.000 abstract 1
- 230000000875 corresponding effect Effects 0.000 description 35
- 239000011159 matrix material Substances 0.000 description 10
- 230000008901 benefit Effects 0.000 description 5
- 238000011160 research Methods 0.000 description 4
- 238000010219 correlation analysis Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000001914 filtration Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000012098 association analyses Methods 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000035800 maturation Effects 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
- G06F16/94—Hypermedia
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- Business, Economics & Management (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of cross-media correlation learning, and specifically concerns a cross-modal topic correlation modeling method based on deep learning. The invention comprises two main algorithms: multi-modal document representation based on deep words, and relational topic model modeling that fuses cross-modal topic correlation learning. The invention uses deep learning techniques to construct deep semantic words and deep visual words that describe, respectively, the semantic description part and the image part of a multi-modal document. On the basis of this multi-modal document representation, the entire multi-modal document collection is modeled by building a cross-modal relational topic model, so that both the generative process of multi-modal documents and the associations between different modalities are described. The method is highly accurate and adaptable. For large-scale multi-modal documents (text plus images), it is of great importance for cross-media information retrieval that efficiently considers multi-modal semantic information; it can improve retrieval relevance and enhance user experience, and has broad application prospects in the field of cross-media information retrieval.
Description
Technical field
The invention belongs to the field of cross-media correlation learning, and in particular to a deep-learning-based method for learning cross-modal image-text topic correlation.
Background art
With the development of Internet technology and the maturation of Web 2.0, massive numbers of multi-modal documents have accumulated on the Internet. How to analyze and handle the complex structure of these multi-modal documents, so as to provide theoretical support for practical applications such as cross-media retrieval, has become a very important research hotspot. A multi-modal document usually exists in the form of several co-occurring modalities; for example, many web images are accompanied by user-defined image descriptions or tags, and many web documents contain illustrations. However, although these multi-modal data are usually associated with each other, the semantic gap between the visual information of an image and its textual description leaves large differences between them [1], which makes it very difficult to fully exploit the semantic associations between different modalities. Therefore, how to fully mine the implicit relationships behind the data of different modalities, and how to better fuse the multi-modal information when modeling multi-modal documents, has become very important [2,3]. Modeling multi-modal documents with topic models, and then mining the associations between the modalities, is a key strategy. In research on cross-modal topic modeling, three interrelated problems need to be solved simultaneously:
1. Finding and constructing more representative and more valuable document elements to describe the image content and the text content of a multi-modal document respectively.
2. Establishing a more reasonable topic correlation model to better describe the associations between data of different modalities in a multi-modal document, i.e., the association between a visual image and its textual description.
3. Establishing an objective measurement mechanism for the internal association between image and text content through cross-modal topic correlation learning.
To solve the first problem, the key is to explore and establish a set of optimized document elements, so that the visual and semantic features in multi-modal documents can be represented more accurately and more completely with these elements.
To solve the second problem, the key is to establish a more robust probabilistic topic model, so as to mine the implicit topic information behind the data and maximize the likelihood of the observed multi-modal documents.
To solve the third problem, the most effective approach is to map the attribute features of the different modalities into a common embedding subspace, so as to maximize the correlation information between the modalities.
A number of researchers have already proposed methods for multi-modal data modeling. From the modeling perspective, these methods can be roughly divided into two types: statistical dependence modeling methods, and methods that build a joint probabilistic generative model.
(1) Statistical dependence modeling methods
The core idea of statistical modeling methods is to map the data features of different modalities into the same latent space, in the hope of maximally exploiting the statistical correlations between the features of the different modalities. Taking image and text as an example, corresponding mapping matrices are built to map image features and text features of different structures into the same common subspace, and the correlation between an image and a text is computed in that subspace: the more relevant an image and a text are, the closer they lie in the common subspace, while a large distance means a low image-text correlation. Canonical Correlation Analysis (CCA) is the most typical statistical dependence method. It obtains the corresponding basis matrices by seeking the maximum correlation between the visual feature matrix and the semantic feature matrix; the basis matrices maximally preserve the correlation between the visual and semantic features of an image and provide the mappings into an isomorphic subspace. The visual feature vector and the semantic feature vector of an image are then mapped to the same dimensionality in this isomorphic subspace to construct a cross-modal fused feature, realizing a unified representation of the different modalities of media data. Later work such as Kernel CCA (KCCA) and Deep CCA (DCCA) explores the dependence between image and text at deeper levels.
Work [4] combines statistical modeling with topic models. The method first uses latent Dirichlet allocation to extract the visual topic features of the images and the textual topic features of the texts, and then uses canonical correlation analysis to map the visual and textual topic features into an isomorphic subspace and compute their correlation. [5] extends this work and computes the correlation with KCCA.
(2) Methods that build a joint probabilistic generative model
Multi-modal topic models are the typical representatives of joint probabilistic generative models, and in recent years much related work has carried out probabilistic topic modeling of the visual content and the semantic descriptions in multi-modal documents [6,7,8,9,10]. In 2003, [Blei2003] established a series of increasingly complex topic models [11], among which Correspondence Latent Dirichlet Allocation (Corr-LDA) is the best cross-modal topic model. The model assumes a correspondence between the latent topics of the different modalities, i.e., the latent topic behind an annotation comes from the latent topics of the visual information of the image. This assumption establishes a one-way mapping: the generation of the text words depends on the visual content of the image. Later, [Wang2009] proposed a supervised topic model to learn the latent relationship between images and annotation words [12], and [Putthividhya2010] proposed a topic-regression multi-modal latent Dirichlet allocation model [13]. [Rasiwasia2010] studied the joint modeling of text and image content in multi-modal documents [3]. [Nguyen2013] proposed an image annotation method based on the joint distributions of features and words, and of words and topics [9]. [Niu2014] proposed a semi-supervised relational topic model to explicitly model the relationship between image content and images [14]. [Wang2014] proposed a semi-supervised multi-modal mutual topic reinforcement model that explores the mutually reinforcing relationships between the topics of different modalities [15]. [Zheng2014] proposed a supervised variant of DocNADE that models the joint distribution of the visual words, annotation words and class labels of an image [16]. [Chen2015] addresses the modeling gap between image and text by building a visual-emotional LDA model [17].
As the above analysis shows, current methods have all made progress in multi-modal document modeling; however, none of them fully considers the influence of the following three aspects:
(1) Deep information mining in multi-modal documents. Most existing image-tag correlation learning methods only explore the associations between modalities on top of traditional visual feature representations and annotation features, without considering the deep features contained in these modalities. For constructing holistic visual semantics and internal semantic associations, this leads to a series of serious information-loss problems. Exploring the depth of multi-modal documents can make up for this deficiency, so that the resulting feature elements represent multi-modal documents better.
(2) Relational topic correlation modeling based on deep analysis. When constructing the topic correlations of different modalities, most existing topic modeling methods are based on the assumption that the topics hidden behind the different modalities are identical. Such an assumption is usually too absolute and introduces unnecessary noise while the topic correlations are being constructed. It is therefore particularly important to build a more reasonable assumption that fuses the deep feature information and forms a better relational topic correlation modeling mechanism.
(3) Cross-modal correlation learning through deep topic features. When computing the correlation between modalities, most existing multi-modal topic models directly match the topic distribution features hidden behind the different modalities in order to capture the internal association between the visual image and the textual description. However, such a direct matching does not take the heterogeneity of image and text into account; mapping the deep topic features into a common space to learn their correlation can mine the correlation well and solve the problem raised above.
Therefore, it is highly desirable to draw on existing mature techniques and, taking all of the above problems into account, to analyze and compute the topic correlation between the different modalities more comprehensively. The present invention is motivated by exactly this. Proceeding from the parts to the whole, it designs a novel technical framework (comprising three main algorithms) covering deep word construction in multi-modal documents, relational topic model construction, and heterogeneous topic correlation learning, so as to establish an effective cross-modal topic correlation computation method and ultimately improve cross-media image retrieval performance.
Summary of the invention
The object of the invention is to propose a cross-modal topic correlation modeling method based on deep learning, so as to improve cross-media social image retrieval performance.
The invention first proposes a novel deep cross-modal topic correlation model. The model is used to model a large-scale multi-modal corpus and can deeply analyze and understand the correlation information between the images and the text in multi-modal documents; with the constructed model, the performance of cross-media retrieval can be effectively promoted. The model mainly comprises the following components:
(1) Deep word construction (Deep Word Construction). For a multi-modal document, deep learning techniques are used to construct deep words as its basic representation elements. The deep words include deep visual words and deep text words: the deep visual words are used to better describe the visual content of the images in the document, while the deep text words serve as the basic elements for describing the text content in the document. Compared with traditional visual words and text words, deep words can mine the semantic information of a document at a deeper level. With this construction, a multi-modal document can be better represented by deep words.
(2) Multi-modal topic information generation (Multimodal Topic Information Generation). On the basis of the constructed deep words, the topic model LDA is used to further mine the topic information hidden behind the data of the different modalities. The topic model assumes that a common set of topics lies behind the document collection and that each word in a document corresponds to one topic; under this assumption, each document is further represented by the topic features derived for it.
(3) Cross-modal topic correlation analysis (Cross-modal Topic Correlation Analysis). It is assumed that the topics hidden behind documents of different modalities are heterogeneous but correlated; for example, the topic "wedding" behind a text document may have very high correlation with the topic "white" behind an image. Therefore, by constructing a common subspace, the topic features of the different modalities are mapped into the common subspace so as to find the correlation information between the modalities.
(4) Relational topic modeling (Relational Topic Modeling). When generating the topic features of the different modalities, the relational topic model simultaneously considers the correlation information between image and document; that is, when constructing the topics of a document, it considers not only the information of the same modality but also the correlation information with the other modalities, so that the final topics fuse the multi-modal information, and the construction finally yields the topic distributions behind the multi-modal documents together with the cross-modal correlation information.
Compared with current multi-modal topic modeling methods, the proposed method has two major advantages in application. First, high accuracy, mainly reflected in the following: the method replaces traditional words with the constructed deep words, can mine the deep information of each modality more thoroughly, and can alleviate the problems brought by the semantic gap well, thus better improving the efficiency of cross-media retrieval. Second, strong adaptability: because the constructed model models the associations between the different modalities well, it is suitable for bidirectional cross-media information retrieval, i.e., retrieving text with images and retrieving images with text, and the model can also easily be extended to cross-media information retrieval involving other modalities (such as audio).
The cross-modal topic correlation modeling method based on deep learning provided by the invention comprises the following specific steps:
(1) Data preprocessing: collect images of different modalities from a multimedia data set to obtain images and image description data, and remove annotation words that rarely appear or are useless from the image annotation data set;
(2) Extracting multi-modal deep features: extract the visual features of the images and the semantic features of the image descriptions using deep learning methods. Specifically, a Region-CNN (Convolutional Neural Network) model and a Skip-gram model are used to extract the region features of the images and the word features of the texts respectively. Region-CNN first detects a set of representative candidate regions in an image, and then uses a pre-trained convolutional neural network to extract the feature corresponding to each region; the Skip-gram model directly trains on the co-occurrence information between text words to obtain a feature-vector representation of each word;
(3) Constructing deep bag-of-words: first cluster the image region features and text word features obtained in step (2) with the clustering algorithm K-means to obtain a deep visual dictionary and a deep text dictionary of limited size; then map all region features of each image to the corresponding visual dictionary, thereby constructing deep visual bag-of-words; similarly, the words of all texts are mapped to the text dictionary to obtain deep text bag-of-words;
(4) Multi-modal topic generation: simulate the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, and derive the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words;
(5) Relational topic model modeling fusing cross-modal topic correlation analysis: construct the corresponding relational topic model, which considers the correlation between the topic features of the different modalities while the topic model is being constructed. Taking the multi-modal topic features obtained in step (4) as initial values, the correlation between images and texts is computed using the correlation information between them, and the computed correlation in turn updates the topic information of the multi-modal documents; correlation computation and topic distribution update thus alternate iteratively until the final relational topic model is constructed;
(6) Cross-media information retrieval based on topic correlation: apply the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, use correlation computation to obtain the data of the other modality most relevant to the query.
Each of the above steps is described in detail below:
(1) Data preprocessing
This step performs preliminary preprocessing on the collected images of the different modalities. Specifically, the annotations attached to images contain noise caused by the randomness of user annotation; this noise is removed by word-frequency filtering, i.e., words whose frequency is below a threshold are filtered out to obtain a new dictionary.
(2) Extracting multi-modal deep features
In the invention, Region-CNN and the Skip-gram model are used to extract the region features of the images and the word features of the texts respectively. They are described in turn:
Given an image, Region-CNN first uses the method of selective search to select the positions where objects are likely to appear as a candidate set (usually around 2,000 candidates), each existing in the form of a region, and then extracts a CNN feature for each region. In the concrete implementation, Region-CNN converts each image region to a fixed size of 227*227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers. Compared with traditional visual features, the advantage of extracting visual features with Region-CNN is that the deep features extracted by the CNN are closer to the semantics of the image itself, which alleviates the semantic-gap problem to some extent.
Given a text document, the feature vector corresponding to each word appearing in the text document is obtained by training a Skip-gram model. The Skip-gram model is a very effective method for learning distributed representations of text words; it was first proposed by Mikolov et al. in 2013 and has since been widely used in various natural language processing tasks. The model captures the syntactic and semantic relationships between text words well and, compared with traditional word-vector learning methods, aggregates semantically similar words together. An important advantage of Skip-gram is that, since it involves no complex dense matrix operations, training is efficient even on massive data. Let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words appearing in TD, and TV the dictionary corresponding to the text words. For each word tw in TW, iv_tw and ov_tw are the input feature vector and output feature vector of tw, and Context(tw) is the set of words appearing in the context of tw; the context window size is set to 5 in the invention. All input vectors and output vectors of the entire text data set are jointly represented by one long parameter vector W ∈ R^{2·|TV|·dim}, where dim is the dimensionality of the input and output vectors. The objective function of the entire Skip-gram model can therefore be written as:
L = \frac{1}{|TW|} \sum_{tw_i \in TD} \sum_{tw_j \in Context(tw_i)} \log P(tw_j \mid tw_i)   (1)
Training Skip-gram with the traditional softmax is computationally very expensive, so the negative sampling method is used to approximate log P(tw_j | tw_i):
\log P(tw_j \mid tw_i) \approx \log \sigma(ov_{tw_j}^T iv_{tw_i}) + \sum_{k=1}^{m} E_{tw_k \sim P(tw)}[\log \sigma(-ov_{tw_k}^T iv_{tw_i})]   (2)
where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a noise distribution P(tw) based on word frequency.
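As an illustration of the negative-sampling approximation in formula (2), the following is a minimal NumPy sketch (Python is used purely for illustration; the embedding matrices, index arguments and function name are assumptions of the sketch, not part of the invention):

```python
import numpy as np

def neg_sampling_logprob(iv, ov, center, context, negatives):
    """Approximate log P(tw_j | tw_i) per formula (2).
    iv, ov: input/output embedding matrices of shape (|TV|, dim);
    center, context: the word indices tw_i and tw_j;
    negatives: m word indices drawn from the frequency-based noise
    distribution P(tw)."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigma(ov[context] @ iv[center]))       # positive-pair term
    neg = sum(np.log(sigma(-ov[k] @ iv[center]))        # m negative samples
              for k in negatives)
    return pos + neg
```

In practice an off-the-shelf word2vec implementation would be used for training; the sketch only makes the two terms of formula (2) explicit.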
(3) Constructing deep bag-of-words
On the basis of the deep word features obtained in step (2), deep bag-of-words are further constructed by the method of vector quantization [25]. Specifically, for the region candidate sets and corresponding features extracted with R-CNN, the region features contained in all images of the multi-modal document set are first clustered with the K-means method to obtain a fixed number of categories; the center of each cluster serves as the representative element of that category, and all these categories together constitute the corresponding dictionary. Then every candidate region of an image is mapped to its category for representation: the mapping computes the Euclidean distance between the feature of each region and each cluster-center feature, finds the category nearest to the region feature, and increments the position of the vector corresponding to that category. In this way, every image in the entire data set is represented in the form of a deep visual bag-of-words: each image corresponds to one vector, the dimensionality of the vector is the number of categories, and each element of the vector is the number of times that category occurs in the image, denoted by the vector VT ∈ R^C, where C is the number of clusters obtained. Similarly, for all word vectors corresponding to the text documents, a corresponding deep text dictionary is obtained by means of clustering, and with the same mapping method each text is finally represented in the form of a deep text bag-of-words.
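The vector-quantization construction of the deep visual bag-of-words can be sketched as follows (a minimal sketch with scikit-learn; the dictionary size C = 100 is taken from the external clustering size mentioned in the embodiment below, and the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_visual_bow(region_feats_per_image, C=100):
    """region_feats_per_image: list of (N_regions, 4096) arrays, one per image.
    Returns the deep visual dictionary (cluster centers) and the (M, C)
    bag-of-words matrix, one VT ∈ R^C vector per image."""
    all_feats = np.vstack(region_feats_per_image)
    km = KMeans(n_clusters=C).fit(all_feats)       # dictionary = cluster centers
    bows = np.zeros((len(region_feats_per_image), C))
    for i, feats in enumerate(region_feats_per_image):
        for label in km.predict(feats):            # nearest center (Euclidean)
            bows[i, label] += 1                    # accumulate at that category
    return km, bows
```

The same construction, applied to the Skip-gram word vectors, yields the deep text bag-of-words.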
(4) Multi-modal topic generation
Multi-modal information, i.e., the visual information of an image combined with its semantic description, is a very important form of expression for multi-modal document content. Therefore, in order to better compute the cross-modal correlation between a visual image and its text annotation, it is particularly important to extract representative multi-modal features more accurately; such multi-modal feature representations can better explore the association between the perceptual properties and the semantic expression features of images.
The latent Dirichlet allocation (LDA) algorithm is a generative probabilistic model for discrete data and has received the highest attention in the image/text research fields. LDA represents each document with a set of probability distributions, and each word in a document is generated from an individual topic. The advantage of LDA is that it considers the inherent statistical structure of documents, such as the co-occurrence information of different words over the entire document collection; it assumes that each word of each document is generated from an individual topic, and that the topic itself is generated from a Dirichlet distribution over all topics. LDA represents each document as a probability distribution vector over the topic set, and these vectors are used to represent the visual features and text features of social images.
In step (4), the latent Dirichlet allocation model is used to probabilistically model the image collection and the text collection separately. The model assumes that a common set of topics is hidden behind a document collection; each specific document corresponds behind it to a probability distribution over this topic set, and each word in the document corresponds behind it to a topic generated by that probability distribution. Moreover, the probability distributions of all documents are not unrelated to each other: they are all generated from a common Dirichlet distribution. On the basis of this model assumption, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the probability topic distributions hidden behind the documents of the different modalities (text documents and visual documents) are derived with the LDA model, which lays the foundation for establishing, in the next step, the relational topic model that fuses the cross-modal correlation information.
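As a sketch of this step, the topic distributions used to initialize step (5) could be obtained with a standard LDA implementation (the topic count and priors are assumed settings, and random data stands in for the bag-of-words matrix so the sketch runs on its own):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the (M, C) deep visual bag-of-words matrix from step (3).
visual_bows = np.random.randint(0, 5, size=(200, 100))

lda_v = LatentDirichletAllocation(n_components=50,       # topic count (assumed)
                                  doc_topic_prior=0.1,   # α (assumed)
                                  topic_word_prior=0.01) # β (assumed)
theta_v = lda_v.fit_transform(visual_bows)  # θ^v: per-image topic distributions
phi_v = lda_v.components_                   # unnormalized topic–visual-word weights
```

The analogous call on the deep text bag-of-words yields θ^t and the topic–text-word distributions.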
(5) Relational topic model modeling fusing cross-modal topic correlation analysis
Building the relational topic model integrates the correlation information between the different modalities into the topic model construction process. Specifically, the topic distributions of the different modalities obtained in step (4) are taken as initial values, the correlation between the topic features of the different modalities is computed by mapping the topic features into a common subspace, and this correlation computation is integrated into the topic model, so that when the topics hidden behind the documents of one modality are derived, the correlation information with the other modality is also considered. The finally obtained topic information thus considers not only the distribution information within the same modality but also the relationship with the other modalities.
The main goal of this step is to construct a joint probability distribution that maximizes the likelihood of the observed multi-modal documents. In constructing the model, the multi-modal document collection D_M is divided into three parts: the first part is the visual image collection D_V, the second part is the text description collection D_T, and the third part is the link set L_VT (this set represents the correlation information between the images and the texts). D_V is composed of the deep visual word set DW_V, with DV_V the deep visual dictionary; the text description collection D_T is composed of the deep text word set DW_T, with DV_T the deep text dictionary. For l_vt ∈ L_VT, l_vt = 1 means that the visual image d_v ∈ D_V and the text description d_t ∈ D_T are correlated, while l_vt = 0 means that the visual image d_v and the text description d_t are uncorrelated. Based on the above, the relational topic model is formalized as follows. Let TS_V be the visual topic set and TS_T the text topic set; α and β are two hyperparameters, where α is the Dirichlet prior over the topics and β the Dirichlet prior over the topic–deep-word distributions; θ_v is the topic distribution behind the visual image d_v, and θ_t is the topic distribution behind the text document d_t; Φ is, for each topic, the corresponding multinomial distribution over all deep words; z is the topic behind each word actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet distribution and the multinomial distribution respectively; N_d denotes the number of deep words in document d, and n indexes the n-th deep word. The generative process of the entire relational topic model is as follows:
(1) For each topic tv in the visual topic set TS_V:
(a) Sample the multinomial distribution of tv over all visual words from the topic–visual-word Dirichlet distribution, i.e.: φ^v_tv ~ Dir(φ^v | β^v).
(2) For each topic tt in the text topic set TS_T:
(a) Sample the multinomial distribution of tt over all text words from the topic–text-word Dirichlet distribution, i.e.: φ^t_tt ~ Dir(φ^t | β^t).
(3) For each visual document d ∈ D_V:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e.: θ^v_d ~ Dir(θ^v | α^v).
(b) For each deep visual word w^v_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e.: z^v_{d,n} ~ Mult(θ^v_d);
ii. Sample the word at this position of the document from the topic–visual-word distribution, i.e.: w^v_{d,n} ~ Mult(φ^v_{z_{d,n}}).
(4) For each text document d ∈ D_T:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e.: θ^t_d ~ Dir(θ^t | α^t);
(b) For each deep text word w^t_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e.: z^t_{d,n} ~ Mult(θ^t_d);
ii. Sample the word at this position of the document from the topic–text-word distribution, i.e.: w^t_{d,n} ~ Mult(φ^t_{z_{d,n}}).
(5) For each link l_vt ∈ L_VT, representing the correlation information between visual document d_v and text document d_t:
(a) Sample l_vt according to the correlation computed from the topic features of d_v and d_t, i.e.: l_vt ~ TCor(z̄_v, z̄_t, M_v, M_t), where z̄_v and z̄_t are the empirical topic distributions of documents d_v and d_t respectively, and M_v and M_t are two mapping matrices that map the visual and text topic features into the common subspace, whose dimensionality is dim. TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation.
Based on the above process, the joint probability distribution modeling the entire multi-modal document collection is finally constructed as:
P(D_M) = P(Φ | β) · P(DW_V, z^v, θ^v | α^v, Φ^v) · P(DW_T, z^t, θ^t | α^t, Φ^t) · P(L_VT | z̄^v, z̄^t, M_v, M_t)   (3)
where the first term corresponds to the topic–deep-word generative process, the middle two terms correspond to the generative processes of the deep visual words and the deep text words, and the last term corresponds to the image–description link generative process.
(6) Cross-media information retrieval (application of the relational topic model)
Step (6) applies the relational topic model established in step (5) to cross-media information retrieval. Taking image and text as an example, cross-media information retrieval can be divided into two classes, text-query-image and image-query-text. Text-query-image ranks all images according to their relevance to a given query text, computed with the relational topic model; image-query-text ranks all text documents according to their relevance to a given query image.
For a given query (for example, querying texts with an image), the corresponding topic features are derived with the relational topic model, and the correlation information with the documents of the other modality (in this example, the text documents) is computed with the topic-feature correlation computation method obtained in step (5); the text documents are then ranked by the magnitude of the correlation, so that the text documents most relevant to the query image are returned. The same process likewise applies to the cross-media information retrieval of images with a text query.
In conclusion the present invention is proposed for content isomerism and relevance between different modalities in multi-modal document
A kind of cross-module state topic relativity modeling method based on deep learning, and then can be with the form of probabilistic model to entire multimode
The generating process of state document is described, and the correlation between the document of different modalities is quantified.The method of the present invention can
Effectively to apply to improve retrieval relevance across in media information retrieval for large-scale image, enhance user experience.
Description of the drawings
Fig. 1 is the flow chart of the invention.
Fig. 2 is a schematic diagram of constructing the deep-word representation of a multi-modal document.
Fig. 3 is a schematic diagram of the cross-modal relational topic correlation modeling process.
Fig. 4 compares the proposed relational topic model with a traditional multi-modal topic model.
Fig. 5 shows the results of cross-media information retrieval with the constructed relational topic model.
Specific embodiment
The cross-modal correlation computation method of the invention for social images is described in detail below with reference to the accompanying drawings.
(1) Collecting data objects
Data objects are collected to obtain images and image annotation data, and annotation words that rarely appear in the whole data set or are useless are removed. The collected data set generally contains a lot of noisy data, so the data are appropriately processed and filtered before feature extraction. The obtained images are all in a unified JPG format and need no transformation. The obtained image annotations, however, contain many meaningless words, such as words mixed with digits that carry no meaning, and some images have dozens of annotations; to let the annotations describe the main information of an image well, the useless and meaningless annotations should be discarded. The processing steps are as follows:
Step 1: count the frequency with which every word appears in the annotations of the data set;
Step 2: filter out the meaningless words and the words containing digits;
Step 3: delete the words that rarely occur in the image annotations of the data set, treating them as minor information about the images.
Through the above steps the processed image annotations are obtained. Low-frequency words are removed in step 3 because, within a cluster of images of the same class, the annotations contain many identical or semantically close words; it is therefore entirely reasonable to filter annotations by their frequency of occurrence.
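A minimal sketch of steps 1-3 above (the frequency threshold is an assumed value; the invention only requires a cutoff):

```python
from collections import Counter

def filter_annotations(annotations, min_freq=5):
    """annotations: dict mapping image id -> list of tag words.
    Returns the annotations with digit-bearing and rare words removed."""
    # Step 1: count the frequency of every word over the whole data set.
    freq = Counter(w for tags in annotations.values() for w in tags)
    # Steps 2-3: drop words containing digits and low-frequency words.
    keep = {w for w, c in freq.items()
            if c >= min_freq and not any(ch.isdigit() for ch in w)}
    return {img: [w for w in tags if w in keep]
            for img, tags in annotations.items()}
```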
(2) Multi-modal feature extraction
Fig. 2 shows the process of extracting features by means of deep learning and constructing the deep words. In the invention, Region-CNN is used to detect the regions of an image and extract the corresponding CNN features; the feature dimensionality is 4,096. Usually, Region-CNN selects around 2,000 regions per image as candidates, so the feature matrix corresponding to one image has 2,000*4,096 dimensions. If the regions of all images were then clustered directly, the data volume would be M*2,000*4,096, where M is the number of images; obviously the space-time cost brought by such a data volume is huge. To solve this practical problem, a combination of internal and external clustering is used in the concrete operation: first, the regions contained in each image are clustered once internally (into 10 classes), and then one external clustering is performed over all images (into 100 classes); the data volume actually entering the final external clustering is thus only M*10*4,096, which greatly reduces the space-time cost of clustering. Another point that needs explanation is that both the Region-CNN visual feature extraction and the Skip-gram word feature extraction operate with pre-trained models: Region-CNN is pre-trained with AlexNet on ImageNet, and Skip-gram uses a model trained on Wikipedia documents containing 6 billion words. This is mainly because training deep neural networks requires a large amount of data; to avoid the problem of over-fitting, models trained on large-scale data sets are applied to the real data to extract the corresponding features.
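The internal-external clustering trick can be sketched as follows (a minimal sketch; the class counts 10 and 100 are those given above, the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_cluster(region_feats_per_image):
    """Cluster each image's ~2,000 region features into 10 internal centers,
    then cluster the pooled M*10 centers into the 100 external classes that
    form the deep visual dictionary."""
    internal = [KMeans(n_clusters=10).fit(f).cluster_centers_   # (10, 4096)
                for f in region_feats_per_image]
    pooled = np.vstack(internal)                                # (M*10, 4096)
    return KMeans(n_clusters=100).fit(pooled)                   # external dictionary
```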
(3) Cross-modal topic correlation computation
Fig. 3 shows the cross-modal relational topic correlation modeling process. The function TCor(z̄_v, z̄_t, M_v, M_t) introduced above is used to compute the correlation between a visual document d_v and a text document d_t, where M_v and M_t are the mapping matrices for the visual topic features and the text topic features respectively; TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation. TCor(·) is defined in formula (4) in one of two modes, for different data types: the first mode maps the dot product of the mapped topic features into the range [0, 1] with the sigmoid function, i.e., TCor(l_vt = 1) = σ((M_v z̄_v)^T (M_t z̄_t)); the second mode computes the topic correlation as the normalized cosine similarity of the two mapped vectors. Meanwhile, based on the generated multi-modal topic distributions, the parameters M_v and M_t can be trained by maximum likelihood estimation (MLE), i.e., by maximizing the log-likelihood of formula (4), with the objective function:
(M_v*, M_t*) = argmax_{M_v, M_t} Σ_{l_vt ∈ L_VT} log TCor(l_vt | z̄_v, z̄_t, M_v, M_t)   (5)
Based on this objective function, the mapping matrices M_v and M_t can be computed by the gradient descent method. It should be noted that, in the actual training process, if the number of multi-modal documents is |D_M| and each multi-modal document normally contains only one group of image and text, the number of image documents and the number of text documents are essentially the same and both equal the number of multi-modal documents, i.e., |D_V| = |D_T| = |D_M|. If the text and the image appearing in the same multi-modal document are treated as correlated, and those not in the same multi-modal document as uncorrelated, the ratio of positive samples (correlated image-text pairs) to negative samples (uncorrelated image-text pairs) in the converted training data is about 1/|D_M|. Such a ratio leads to a serious disproportion between the negative and positive samples; moreover, an image and a text not being in the same multi-modal document does not mean that they are completely uncorrelated (they may belong to the same category). In practice, the ratio of negative to positive samples is therefore set to 1:1, and the randomly selected negative samples must satisfy the following constraint: the corresponding image and text must not come from the same category.
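The two modes of TCor can be sketched as follows (assumptions of the sketch: the cosine mode is rescaled to [0, 1], and the function name is hypothetical):

```python
import numpy as np

def tcor(theta_v, theta_t, Mv, Mt, mode="sigmoid"):
    """Topic correlation TCor(l_vt = 1) of a visual/text document pair.
    Mv, Mt: mapping matrices of shape (dim, n_topics) projecting the
    topic features into the common dim-dimensional subspace."""
    pv, pt = Mv @ theta_v, Mt @ theta_t
    if mode == "sigmoid":                      # mode 1: sigmoid of dot product
        return 1.0 / (1.0 + np.exp(-(pv @ pt)))
    cos = pv @ pt / (np.linalg.norm(pv) * np.linalg.norm(pt))
    return 0.5 * (1.0 + cos)                   # mode 2: normalized cosine

```

M_v and M_t would then be fitted by gradient ascent on the log-likelihood of the sampled 1:1 positive/negative links, as described above.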
(4) Derivation of the multi-modal relational topic model
Formula (3) gives the relational topic model constructed in the invention; the parameters of the model are derived with the Gibbs sampling method [26]. The purpose of Gibbs sampling is to obtain the topic implied behind each word in the multi-modal documents. During sampling, the marginal distribution over the deep words, the topic assignments of the words, and the corresponding cross-modal association links is first derived (formula (6)), where m_{d,tt} is the number of times topic tt occurs in document d, and n_{tt,w} is the number of words generated by topic tt in the entire document collection. From formula (6), the univariate probability distribution of the topic variable z can be further derived, which yields the sampling rule for the topic behind each word in a document, shown in formula (7), where m^{-n}_{d,tt} denotes the number of occurrences of topic tt in document d after removing the current word, and n^{-n}_{tt,w} denotes the number of words assigned to topic tt after removing the current word. Based on this sampling rule, the topic implied behind every word in the entire document collection can be sampled. After each sampling pass, the mapping matrices M_t and M_v are computed with formula (5) on the basis of the topic distributions obtained in the current pass, and the M_t and M_v obtained in the current pass serve as the input of the next sampling pass; this alternation continues until the iteration termination condition is reached, yielding the final topic information together with the mapping matrices M_t and M_v. Correspondingly, the other parameters of the relational topic model, such as Φ^V, Φ^T, θ^V and θ^T, are finally obtained from formula (8).
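Since formulas (6)-(8) are given here only by reference, the following sketch shows a collapsed-Gibbs-style topic update of the kind described, with the cross-modal correlation contribution represented schematically by a `link_factor` argument (its exact form follows formula (7) and is not reproduced; all names are assumptions of the sketch):

```python
import numpy as np

def sample_topic(w, d, m_dt, n_tw, n_t, alpha, beta, V, link_factor):
    """One topic draw for word w of document d. The counts m_{d,tt}
    (m_dt, shape (D, T)), n_{tt,w} (n_tw, shape (T, V)) and per-topic
    totals n_t are assumed already decremented for the current word."""
    p = (m_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
    p *= link_factor            # correlation with the linked document's topics
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```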
(5) Application example
Fig. 5 shows the results of cross-media information retrieval with the constructed relational topic model, in two modes: retrieving text with an image query (Image Query-to-Text) and retrieving images with a text query (Text Query-to-Image). The relevance score is computed as shown in formula (9).
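The retrieval itself reduces to ranking by the correlation score, e.g. for Image Query-to-Text (a sketch reusing the tcor function from section (3) above; the names are hypothetical):

```python
def image_query_to_text(theta_q, text_thetas, Mv, Mt, topk=10):
    """Rank all text documents by TCor against the query image's topic
    feature theta_q and return the indices of the topk most relevant."""
    scores = [tcor(theta_q, th, Mv, Mt) for th in text_thetas]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:topk]
```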
Bibliography
[1]Fan,J.P.;He,X.F.;Zhou,N.;Peng,J.Y.;and Jain,R.2012.Quantitative
Characterization of Semantic Gaps for Learning Complexity Estimation and
Inference Model Selection.IEEE Transactions on Multimedia 14(5):1414-1428.
[2]Datta,R.;Joshi,D.;Li,J.;and Wang,J.Z.2008.Image Retrieval:Ideas,
Influences,and Trends of the New Age.ACM Computing Surveys(CSUR)40(2),
Article 5.
[3]Rasiwasia,N.;Pereira,J.C.;Coviello,E.;Doyle,G.;Lanckriet,G.R.G.;
Levy,R.;and Vasconcelos,N.2010.A New Approach to Cross-modal Multimedia
Retrieval.In Proceedings of MM 2010,251-260.
[4]Pereira,J.C.;Coviello,E.;Doyle,G.;Rasiwasia,N.;Lanckriet,G.R.G.;
Levy,R.;and Vasconcelos,N.2014.On the Role of Correlation and Abstraction in
Cross-Modal Multimedia Retrieval.IEEE Transactions on Pattern Analysis and
Machine Intelligence(PAMI)36(3):521-535.
[5]Barnard,K.;Duygulu,P.;Forsyth,D.;Freitas,N.;Blei,D.M.;and Jordan,
M.I.2003.Matching Words and Pictures.Journal of Machine Learning Research.3:
1107-1135.
[6]Wang,X.;Liu,Y.;Wang,D.;and Wu,F.2013.Cross-media Topic Mining on
Wikipedia.In Proceedings of MM 2013,689-692.
[7]Frome,A.;Corrado,G.S.;Shlens,J.;Bengio,S.;Dean,J.;Ranzato,M.A.;and
Mikolov,T.2013.DeViSE:A Deep Visual-Semantic Embedding Model.In Proceedings
of NIPS 2013.
[8]Feng,F.X.;Wang,X.J.;and Li,R.F.2014.Cross-modal Retrieval with
Correspondence Autoencoder.In Proceedings of MM 2014,7-16.
[9]Nguyen,C.T.;Kaothanthong,N.;Tokuyama,T.;and Phan X.H.2013.A
Feature-Word-Topic Model for Image Annotation and Retrieval.ACM Transactions
on the Web 7(3),Article 12.
[10]Ramage,D.;Heymann,P.;Manning,C.D.;and Molina,H.G.2009.Clustering
the Tagged Web.In Proceedings of WSDM 2009,54-63.
[11]Blei,D.M.;and Jordan,M.I.2003.Modeling Annotated Data.In
Proceedings of SIGIR 2003,127-134.
[12]Wang,C.;Blei,D.;and Fei-Fei L.2009.Simultaneous Image
Classification and Annotation.In Proceedings of CVPR 2009,1903-1910.
[13]Putthividhya,D.;Attias,H.T.;and Nagarajan,S.S.2010.Topic
Regression Multi-Modal Latent Dirichlet Allocation for Image Annotation.In
Proceedings of CVPR2010,3408-3415.
[14]Niu,Z.X.;Hua,G.;Gao,X.B.;and Tian,Q.2014.Semi-supervised
Relational Topic Model for Weakly Annotated Image Recognition in Social
Media.In Proceedings of CVPR2014,4233-4240.
[15]Wang,Y.F.;Wu,F.;Song,J.;Li,X.;and Zhuang,Y.T.2014.Multi-modal
Mutual Topic Reinforce Modeling for Cross-media Retrieval.In Proceedings of
MM 2014,307-316.
[16]Zheng,Y.;Zhang,Y.J.;and Larochelle,H.2014.Topic Modeling of
Multimodal Data:an Autoregressive Approach.In Proceedings of CVPR 2014,1370-
1377.
[17]Chen,T.;SalahEldeen,H.M.;He,X.N.;Kan,M.Y.;and Lu,D.Y.2015.VELDA:
Relating an Image Tweet’s Text and Images.In Proceedings of AAAI 2015.
[18]Girshick,R.;Donahue,J.;Darrell,T.;and Malik,J.2014.Rich feature
hierarchies for accurate object detection and semantic segmentation.In
Proceedings of CVPR 2014,580-587.
[19]Hariharan,B.;Arbelaez,P.;Girshick,R.;and Malik,
J.2014.Simultaneous Detection and Segmentation.In Proceedings of ECCV 2014,
297-312.
[20]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings
for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[21]Zhang,N.;Donahue,J.;Girshick,R.;and Darrell,T.2014.Part-Based R-
CNNs for Fine-Grained Category Detection.In Proceedings of ECCV 2014,834-849.
[22]Mikolov,T.;Sutskever,I.;Chen,K.;Corrado,G.;and Dean,
J.2013.Distributed Representations of Words and Phrases and their
Compositionality.In Proceedings of NIPS 2013.
[23]Tang,D.Y.;Wei,F.R.;Qin,B.;Zhou,M.;and Liu,T.2014.Building Large-
Scale Twitter-Specific Sentiment Lexicon:A Representation Learning
Approach.In Proceedings of COLING 2014,172-182.
[24]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings
for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[25]Sivic,J.,and Zisserman,A.2003.Video Google:A Text Retrieval
Approach to Object Matching in Videos.In Proceedings of ICCV 2003,2:1470-
1477.
[26]Griffiths,T.L.;and Steyvers,M.2004.Finding Scientific Topics.
Proceedings of the National Academy of Sciences of the United
States of America,101(1):5228-5235.
Claims (6)
1. A cross-modal topic correlation modeling method based on deep learning, characterized by the following specific steps:
(1) data preprocessing: collecting images of different modalities from a multimedia data set to obtain images and image description data, and removing annotation words that rarely appear or are useless from the image annotation data set;
(2) extracting multi-modal deep features: extracting the visual features of the images and the semantic features of the image descriptions using deep learning methods; specifically, a Region-CNN model and a Skip-gram model are used to extract the region features of the images and the word features of the texts respectively; wherein Region-CNN first detects a set of representative candidate regions in an image, and then uses a pre-trained convolutional neural network to extract the feature corresponding to each region; the Skip-gram model directly trains on the co-occurrence information between text words to obtain a feature-vector representation of each word;
(3) constructing deep bag-of-words: first clustering the image region features and text word features obtained in step (2) with the clustering algorithm K-means to obtain a deep visual dictionary and a deep text dictionary of limited size; then mapping all region features of each image to the corresponding visual dictionary, thereby constructing deep visual bag-of-words; similarly, mapping the words of all texts to the text dictionary to obtain deep text bag-of-words;
(4) multi-modal topic generation: simulating the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, deriving the topic distribution features hidden behind the text collection and the image collection, and making full use of the co-occurrence information between words;
(5) relational topic model modeling fusing cross-modal topic correlation analysis: constructing the corresponding relational topic model, i.e., considering the correlation between the topic features of the different modalities while the topic model is being constructed; taking the multi-modal topic features obtained in step (4) as initial values, computing the correlation between images and texts using the correlation information between them, and updating the topic information of the multi-modal documents with the computed correlation; correlation computation and topic distribution update thus alternate iteratively until the final relational topic model is constructed;
(6) cross-media information retrieval based on topic correlation: applying the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, obtaining the data of the other modality most relevant to the query by correlation computation.
2. The method according to claim 1, characterized in that in step (2), the Region-CNN and Skip-gram models are used to extract the image region features and the text word features, respectively, as follows:
Given an image, Region-CNN first applies the selective search method to pick positions where objects are likely to appear, forming a candidate set that exists in the form of regions; a CNN feature is then extracted for each region; in a specific implementation, Region-CNN warps each image region to a fixed pixel size of 227*227, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers;
Given a text document, the feature vector of each word occurring in it is obtained by training the Skip-gram model; let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words occurring in TD, and TV the dictionary of the text words; for each word tw in TW, iv_tw and ov_tw are the input feature vector and the output feature vector of tw, and Context(tw) is the set of words occurring in the context of tw; the context window size is set to 5, and all input vectors and output vectors of the entire text data set are represented jointly by one long parameter vector W ∈ R^(2*|TV|*dim), where dim is the dimension of the input and output vectors; the objective function of the whole Skip-gram model is:

L = Σ_{tw_i ∈ TW} Σ_{tw_j ∈ Context(tw_i)} log P(tw_j | tw_i)

The negative sampling method is used to approximate log P(tw_j | tw_i); the calculation formula is as follows:

log P(tw_j | tw_i) ≈ log σ(ov_{tw_j} · iv_{tw_i}) + Σ_{k=1..m} E_{tw_k ~ P(tw)} [ log σ(−ov_{tw_k} · iv_{tw_i}) ]

where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(tw).
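For illustration (not part of the claim), the negative-sampling approximation above can be written as a minimal NumPy sketch; the vectors here are random toy data, whereas in practice the negatives would be drawn from the word-frequency noise distribution P(tw):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_prob(iv_center, ov_context, ov_negatives):
    """Approximate log P(tw_j | tw_i) as in claim 2:
    log sigma(ov_j . iv_i) + sum_k log sigma(-ov_k . iv_i)."""
    pos = np.log(sigmoid(ov_context @ iv_center))
    neg = np.sum(np.log(sigmoid(-(ov_negatives @ iv_center))))
    return pos + neg

# toy setup: dim=4, m=3 negative samples
dim, m = 4, 3
iv_i = rng.normal(size=dim)         # input vector of the center word tw_i
ov_j = rng.normal(size=dim)         # output vector of the context word tw_j
ov_neg = rng.normal(size=(m, dim))  # output vectors of m sampled noise words
print(neg_sampling_log_prob(iv_i, ov_j, ov_neg))
```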
3. The method according to claim 1, characterized in that, on the basis of the deep dictionaries obtained in step (2), step (3) further constructs the deep bag-of-words representations by vector quantization, as follows: for the region candidate set and corresponding features extracted with R-CNN, first cluster the region features of all images in the multi-modal document data set with the K-means method to obtain a fixed number of classes, take the center of each cluster as the representative element of that class, and let all these classes together constitute the corresponding dictionary; afterwards, each candidate region of an image is mapped to its corresponding class: the Euclidean distance between the region feature and each class center is computed, the nearest class is selected, and the vector position corresponding to that class is incremented; in this way every image in the entire data set is represented in the form of a deep visual bag-of-words, i.e., each image corresponds to one vector whose dimension is the number of classes and whose elements are the counts with which each class occurs in the image, written VT ∈ R^C, where C is the number of clusters; similarly, the corresponding deep text dictionary is obtained by clustering all word vectors of the text documents, and with the same mapping method each text is finally represented in the form of a deep text bag-of-words.
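The following sketch illustrates, with scikit-learn and toy data, how the K-means dictionary and the deep visual bag-of-words of claim 3 could be built; the feature dimension, image count, and dictionary size C are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# toy stand-ins: 3 images with a variable number of region features (dim=16)
images = [rng.normal(size=(n, 16)) for n in (7, 4, 9)]

C = 5  # dictionary size (number of clusters); an illustrative choice
kmeans = KMeans(n_clusters=C, n_init=10, random_state=0)
kmeans.fit(np.vstack(images))  # cluster all region features of all images

def deep_bow(region_features):
    """Assign each region to the nearest cluster center (Euclidean distance,
    as in claim 3) and count occurrences -> vector VT in R^C."""
    ids = kmeans.predict(region_features)
    return np.bincount(ids, minlength=C)

for feats in images:
    print(deep_bow(feats))  # one count vector per image
```

The same clustering-then-mapping routine would apply unchanged to the text word vectors to produce the deep text bag-of-words.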
4. The method according to claim 1, characterized in that in step (4) the latent Dirichlet allocation model is used to model the image collection and the text collection probabilistically; the latent Dirichlet allocation model assumes that a common set of topics lies behind the document collection, that each individual document corresponds to a probability distribution over that topic set, and that every word in a document is generated by a topic drawn from the document's underlying distribution; the distributions of the documents are not independent of one another but are generated from one common Dirichlet distribution; under these model assumptions, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the LDA model is used to derive the latent probabilistic topic distributions of the documents of the different modalities.
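As a hedged illustration (the patent derives the topic distributions from its own LDA formulation; scikit-learn's variational LDA is used here only as a stand-in), the deep bag-of-words matrices could be fed to an off-the-shelf LDA as follows; the topic count and prior values are assumptions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# toy deep bag-of-words matrix: 20 documents over a dictionary of 30 deep words
bow = rng.poisson(1.0, size=(20, 30))

# alpha (doc-topic prior) and beta (topic-word prior) correspond to the
# hyperparameters of the claims; the concrete values are illustrative
lda = LatentDirichletAllocation(n_components=8, doc_topic_prior=0.1,
                                topic_word_prior=0.01, random_state=0)
theta = lda.fit_transform(bow)        # per-document topic distributions
print(theta.shape, theta[0].round(3))  # (20, 8); each row sums to ~1
```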
5. The method according to claim 1, characterized in that, while constructing the model in step (5), the multi-modal document set D^M is divided into three parts: the first part is the visual image set D^V, the second part is the text description set D^T, and the third part is the link set L^VT, which expresses the association information between images and texts; D^V consists of the deep visual word set DW^V, with DV^V as the deep visual dictionary, while the text description set D^T consists of the deep text word set DW^T, with DV^T as the deep text dictionary; for l_vt ∈ L^VT, l_vt = 1 means that the visual image d_v ∈ D^V and the text description d_t ∈ D^T are related, and l_vt = 0 means that they are unrelated; on this basis, the relational topic model is formalized as follows: let T^V be the visual topic set and T^T the text topic set; α and β are two hyperparameters, where α parameterizes the Dirichlet distribution over topics and β the topic-deep-word Dirichlet distribution; θ^v is the latent topic distribution of the visual image d_v and θ^t the latent topic distribution of the text document d_t; Φ is the multinomial distribution of each topic over all deep words; z is the latent topic assignment of each word, actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet distribution and the multinomial distribution respectively; N_d is the number of deep words in document d, and n indexes the n-th deep word; the generative process of the whole relational topic model is as follows:
(1) for each topic tv ∈ T^V in the visual topic set:
sample from the topic-visual-word Dirichlet distribution the multinomial distribution of tv over all visual words, i.e.: φ^v_tv ~ Dir(φ^v | β^v);
(2) for each topic tt ∈ T^T in the text topic set:
sample from the topic-text-word Dirichlet distribution the multinomial distribution of tt over all text words, i.e.: φ^t_tt ~ Dir(φ^t | β^t);
(3) for each visual document d ∈ D^V:
(a) sample from the Dirichlet distribution over the topic set the latent topic distribution of d, i.e.: θ^v_d ~ Dir(θ^v | α^v);
(b) for each deep visual word w^v_{d,n} in d:
i. sample the topic of the word from the latent topic distribution of document d, i.e.: z^v_{d,n} ~ Mult(θ^v_d);
ii. sample the word itself from the topic-visual-word distribution, i.e.: w^v_{d,n} ~ Mult(φ^v_{z^v_{d,n}});
(4) for each text document d ∈ D^T:
(a) sample from the Dirichlet distribution over the topic set the latent topic distribution of d, i.e.: θ^t_d ~ Dir(θ^t | α^t);
(b) for each deep text word w^t_{d,n} in d:
i. sample the topic of the word from the latent topic distribution of document d, i.e.: z^t_{d,n} ~ Mult(θ^t_d);
ii. sample the word itself from the topic-text-word distribution, i.e.: w^t_{d,n} ~ Mult(φ^t_{z^t_{d,n}});
(5) for each link l_vt ∈ L^VT, which expresses the association information between a visual document d_v and a text document d_t:
(a) compute the correlation of the topic features of d_v and d_t and sample l_vt from it, i.e.: l_vt ~ TCor(l_vt | z̄^v, z̄^t, M^v, M^t), where z̄^v and z̄^t are the empirical topic distributions of documents d_v and d_t respectively, and M^v and M^t are two mapping matrices that project the visual and text topic features into a common subspace of dimension dim; TCor(l_vt = 1) expresses the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation (an illustrative TCor sketch follows this claim);
Based on the above process, the joint probability distribution modeling the entire multi-modal document collection is finally constructed as follows:

P(DW^V, DW^T, L^VT, z, θ, Φ | α, β) = ∏_{t ∈ T^V ∪ T^T} P(φ_t | β) × ∏_{d ∈ D^V} P(θ^v_d | α^v) ∏_{n=1..N_d} P(z^v_{d,n} | θ^v_d) P(w^v_{d,n} | φ^v_{z^v_{d,n}}) × ∏_{d ∈ D^T} P(θ^t_d | α^t) ∏_{n=1..N_d} P(z^t_{d,n} | θ^t_d) P(w^t_{d,n} | φ^t_{z^t_{d,n}}) × ∏_{l_vt ∈ L^VT} TCor(l_vt | z̄^v, z̄^t, M^v, M^t)

where the first factor corresponds to the topic-deep-word generative process, the middle two to the generation of the deep visual words and the deep text words, and the last to the generation of the image-description links.
6. The method according to claim 1, characterized in that step (6) applies the relational topic model established in step (5) to cross-media information retrieval; cross-media retrieval falls into two classes, text-query-image and image-query-text: text-query-image ranks all images by the relevance of each image to a given query text, computed with the relational topic model, while image-query-text ranks all text documents by their relevance to a given query image;
for a given query image, the corresponding topic feature is derived with the relational topic model, the correlation to the documents of the other modality is computed with the topic-feature correlation method of step (5), and the text documents are ranked by this correlation score, so that the text documents most relevant to the query image are returned; the same procedure likewise applies to cross-media retrieval with a text query against images.
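A minimal ranking sketch for the image-query-text direction follows (text-query-image is symmetric); the dot-product score merely stands in for the TCor computation of claim 5, and all topic features are toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: one query image's topic feature and 5 candidate text documents',
# assumed to be already projected into the common subspace of claim 5
query_img = rng.dirichlet(np.ones(6))
texts = rng.dirichlet(np.ones(6), size=5)

# score each text against the query and rank in descending order of relevance
scores = texts @ query_img
ranking = np.argsort(-scores)
print("texts ranked by relevance:", ranking, scores[ranking].round(3))
```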
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610099438.9A CN105760507B (en) | 2016-02-23 | 2016-02-23 | Cross-module state topic relativity modeling method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105760507A CN105760507A (en) | 2016-07-13 |
CN105760507B true CN105760507B (en) | 2019-05-03 |
Family
ID=56330274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610099438.9A Expired - Fee Related CN105760507B (en) | 2016-02-23 | 2016-02-23 | Cross-module state topic relativity modeling method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105760507B (en) |
Families Citing this family (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018046452A1 (en) | 2016-09-07 | 2018-03-15 | Koninklijke Philips N.V. | Systems, methods, and apparatus for diagnostic inferencing with a multimodal deep memory network |
CN106156374A (en) * | 2016-09-13 | 2016-11-23 | 华侨大学 | A kind of view-based access control model dictionary optimizes and the image search method of query expansion |
US11068652B2 (en) * | 2016-11-04 | 2021-07-20 | Mitsubishi Electric Corporation | Information processing device |
CN108073576A (en) * | 2016-11-09 | 2018-05-25 | 上海诺悦智能科技有限公司 | Intelligent search method, searcher and search engine system |
CN108198625B (en) * | 2016-12-08 | 2021-07-20 | 推想医疗科技股份有限公司 | Deep learning method and device for analyzing high-dimensional medical data |
CN106777050B (en) * | 2016-12-09 | 2019-09-06 | 大连海事大学 | It is a kind of based on bag of words and to take into account the shoes stamp line expression and system of semantic dependency |
CN106778880B (en) * | 2016-12-23 | 2020-04-07 | 南开大学 | Microblog topic representation and topic discovery method based on multi-mode deep Boltzmann machine |
CN106650756B (en) * | 2016-12-28 | 2019-12-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | knowledge migration-based image text description method of multi-mode recurrent neural network |
CN106886783B (en) * | 2017-01-20 | 2020-11-10 | 清华大学 | Image retrieval method and system based on regional characteristics |
CN107145910A (en) * | 2017-05-08 | 2017-09-08 | 京东方科技集团股份有限公司 | Performance generation system, its training method and the performance generation method of medical image |
CN107273517B (en) * | 2017-06-21 | 2021-07-23 | 复旦大学 | Graph-text cross-modal retrieval method based on graph embedding learning |
CN109213988B (en) * | 2017-06-29 | 2022-06-21 | 武汉斗鱼网络科技有限公司 | Barrage theme extraction method, medium, equipment and system based on N-gram model |
TWI636404B (en) * | 2017-07-31 | 2018-09-21 | 財團法人工業技術研究院 | Deep neural network and method for using the same and computer readable media |
CN107480289B (en) * | 2017-08-24 | 2020-06-30 | 成都澳海川科技有限公司 | User attribute acquisition method and device |
CN108305296B (en) * | 2017-08-30 | 2021-02-26 | 深圳市腾讯计算机系统有限公司 | Image description generation method, model training method, device and storage medium |
CN107870992A (en) * | 2017-10-27 | 2018-04-03 | 上海交通大学 | Editable image of clothing searching method based on multichannel topic model |
CN107798624B (en) * | 2017-10-30 | 2021-09-28 | 北京航空航天大学 | Technical label recommendation method in software question-and-answer community |
CN108256549B (en) * | 2017-12-13 | 2019-03-15 | 北京达佳互联信息技术有限公司 | Image classification method, device and terminal |
CN108399409B (en) | 2018-01-19 | 2019-06-18 | 北京达佳互联信息技术有限公司 | Image classification method, device and terminal |
CN110119505A (en) | 2018-02-05 | 2019-08-13 | 阿里巴巴集团控股有限公司 | Term vector generation method, device and equipment |
CN108595636A (en) * | 2018-04-25 | 2018-09-28 | 复旦大学 | The image search method of cartographical sketching based on depth cross-module state correlation study |
CN108830903B (en) * | 2018-04-28 | 2021-11-05 | 杨晓春 | Billet position detection method based on CNN |
CN109145936B (en) * | 2018-06-20 | 2019-07-09 | 北京达佳互联信息技术有限公司 | A kind of model optimization method and device |
CN110110122A (en) * | 2018-06-22 | 2019-08-09 | 北京交通大学 | Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval |
CN109214412A (en) * | 2018-07-12 | 2019-01-15 | 北京达佳互联信息技术有限公司 | A kind of training method and device of disaggregated model |
CN109213853B (en) * | 2018-08-16 | 2022-04-12 | 昆明理工大学 | CCA algorithm-based Chinese community question-answer cross-modal retrieval method |
EP3644616A1 (en) * | 2018-10-22 | 2020-04-29 | Samsung Electronics Co., Ltd. | Display apparatus and operating method of the same |
CN109472232B (en) * | 2018-10-31 | 2020-09-29 | 山东师范大学 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
CN110442721B (en) * | 2018-11-28 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Neural network language model, training method, device and storage medium |
CN111464881B (en) * | 2019-01-18 | 2021-08-13 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN109886326B (en) * | 2019-01-31 | 2022-01-04 | 深圳市商汤科技有限公司 | Cross-modal information retrieval method and device and storage medium |
CN109816039B (en) * | 2019-01-31 | 2021-04-20 | 深圳市商汤科技有限公司 | Cross-modal information retrieval method and device and storage medium |
CN110209822B (en) * | 2019-06-11 | 2021-12-21 | 中译语通科技股份有限公司 | Academic field data correlation prediction method based on deep learning and computer |
CN110337016B (en) * | 2019-06-13 | 2020-08-14 | 山东大学 | Short video personalized recommendation method and system based on multimodal graph convolution network, readable storage medium and computer equipment |
CN110647632B (en) * | 2019-08-06 | 2020-09-04 | 上海孚典智能科技有限公司 | Image and text mapping technology based on machine learning |
CN110503147B (en) * | 2019-08-22 | 2022-04-08 | 山东大学 | Multi-mode image classification system based on correlation learning |
CN111310453B (en) * | 2019-11-05 | 2023-04-25 | 上海金融期货信息技术有限公司 | User theme vectorization representation method and system based on deep learning |
CN111259152A (en) * | 2020-01-20 | 2020-06-09 | 刘秀萍 | Deep multilayer network driven feature aggregation category divider |
CN112257445B (en) * | 2020-10-19 | 2024-01-26 | 浙大城市学院 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
CN112507064B (en) * | 2020-11-09 | 2022-05-24 | 国网天津市电力公司 | Cross-modal sequence-to-sequence generation method based on topic perception |
CN114547259B (en) * | 2020-11-26 | 2024-05-24 | 北京大学 | Automatic formula description generation method and system based on topic relation graph |
CN112632969B (en) * | 2020-12-13 | 2022-06-21 | 复旦大学 | Incremental industry dictionary updating method and system |
CN113157959B (en) * | 2020-12-17 | 2024-05-31 | 云知声智能科技股份有限公司 | Cross-modal retrieval method, device and system based on multi-modal topic supplementation |
CN112836746B (en) * | 2021-02-02 | 2022-09-09 | 中国科学技术大学 | Semantic correspondence method based on consistency graph modeling |
CN115017911A (en) * | 2021-03-05 | 2022-09-06 | 微软技术许可有限责任公司 | Cross-modal processing for vision and language |
CN113051932B (en) * | 2021-04-06 | 2023-11-03 | 合肥工业大学 | Category detection method for network media event of semantic and knowledge expansion theme model |
CN113139468B (en) * | 2021-04-24 | 2023-04-11 | 西安交通大学 | Video abstract generation method fusing local target features and global features |
CN113298265B (en) * | 2021-05-22 | 2024-01-09 | 西北工业大学 | Heterogeneous sensor potential correlation learning method based on deep learning |
CN113297485B (en) * | 2021-05-24 | 2023-01-24 | 中国科学院计算技术研究所 | Method for generating cross-modal representation vector and cross-modal recommendation method |
CN113392196B (en) * | 2021-06-04 | 2023-04-21 | 北京师范大学 | Question retrieval method and system based on multi-mode cross comparison |
CN113343679B (en) * | 2021-07-06 | 2024-02-13 | 合肥工业大学 | Multi-mode subject mining method based on label constraint |
CN113516118B (en) * | 2021-07-29 | 2023-06-16 | 西北大学 | Multi-mode cultural resource processing method for joint embedding of images and texts |
CN113408282B (en) * | 2021-08-06 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for topic model training and topic prediction |
CN114880527B (en) * | 2022-06-09 | 2023-03-24 | 哈尔滨工业大学(威海) | Multi-modal knowledge graph representation method based on multi-prediction task |
CN118378168B (en) * | 2024-06-25 | 2024-09-06 | 北京联合永道软件股份有限公司 | Unstructured data modeling method and system |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559193A (en) * | 2013-09-10 | 2014-02-05 | 浙江大学 | Topic modeling method based on selected cell |
CN103559192A (en) * | 2013-09-10 | 2014-02-05 | 浙江大学 | Media-crossed retrieval method based on modal-crossed sparse topic modeling |
CN104317837A (en) * | 2014-10-10 | 2015-01-28 | 浙江大学 | Cross-modal searching method based on topic model |
CN104899253A (en) * | 2015-05-13 | 2015-09-09 | 复旦大学 | Cross-modality image-label relevance learning method facing social image |
Non-Patent Citations (1)
Title |
---|
"跨媒体组合语义深度学习";吴飞等;《浙江省信号处理学会2015年年会——信号处理在大数据》;20151031;第1-5页 |
Also Published As
Publication number | Publication date |
---|---|
CN105760507A (en) | 2016-07-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105760507B (en) | Cross-module state topic relativity modeling method based on deep learning | |
Zhang et al. | A quantum-inspired multimodal sentiment analysis framework | |
Liu et al. | A survey of sentiment analysis based on transfer learning | |
Peng et al. | An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges | |
Liu et al. | Image annotation via graph learning | |
Park et al. | Efficient extraction of domain specific sentiment lexicon with active learning | |
Gao et al. | Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework | |
Ranjan et al. | LFNN: Lion fuzzy neural network-based evolutionary model for text classification using context and sense based features | |
Li et al. | Modeling continuous visual features for semantic image annotation and retrieval | |
Niu et al. | Knowledge-based topic model for unsupervised object discovery and localization | |
Papadopoulos et al. | Image clustering through community detection on hybrid image similarity graphs | |
Sumathi et al. | An overview of automated image annotation approaches | |
Li et al. | Fusing semantic aspects for image annotation and retrieval | |
Tian et al. | Automatic image annotation based on Gaussian mixture model considering cross-modal correlations | |
Xie et al. | A semantic model for cross-modal and multi-modal retrieval | |
Wang et al. | Rare-aware attention network for image–text matching | |
Long et al. | Bi-calibration networks for weakly-supervised video representation learning | |
Wu et al. | Multiple hypergraph clustering of web images by miningword2image correlations | |
Chen et al. | An annotation rule extraction algorithm for image retrieval | |
Papapanagiotou et al. | Improving concept-based image retrieval with training weights computed from tags | |
Tian et al. | Scene graph generation by multi-level semantic tasks | |
CN105677830B (en) | A kind of dissimilar medium similarity calculation method and search method based on entity mapping | |
Guo | [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning | |
Xiao et al. | Research on multimodal emotion analysis algorithm based on deep learning | |
Xue et al. | Few-shot node classification via local adaptive discriminant structure learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190503 |