CN105760507B - Cross-modal topic correlation modeling method based on deep learning - Google Patents
Cross-modal topic correlation modeling method based on deep learning
- Publication number
- CN105760507B CN201610099438.9A CN201610099438A
- Authority
- CN
- China
- Prior art keywords
- text
- theme
- image
- document
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 92
- 238000013135 deep learning Methods 0.000 title claims abstract description 11
- 230000008569 process Effects 0.000 claims abstract description 25
- 230000004927 fusion Effects 0.000 claims abstract description 7
- 238000009826 distribution Methods 0.000 claims description 48
- 230000000007 visual effect Effects 0.000 claims description 42
- 239000013598 vector Substances 0.000 claims description 36
- 238000013527 convolutional neural network Methods 0.000 claims description 24
- 238000013507 mapping Methods 0.000 claims description 15
- 238000012549 training Methods 0.000 claims description 15
- 238000004364 calculation method Methods 0.000 claims description 10
- 238000002372 labelling Methods 0.000 claims description 10
- 238000005070 sampling Methods 0.000 claims description 9
- 230000006870 function Effects 0.000 claims description 7
- 238000004458 analytical method Methods 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 4
- 238000013480 data collection Methods 0.000 claims description 3
- 239000004744 fabric Substances 0.000 claims description 3
- 238000013139 quantization Methods 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 claims description 2
- 238000001514 detection method Methods 0.000 claims description 2
- 239000000203 mixture Substances 0.000 claims description 2
- 238000005516 engineering process Methods 0.000 abstract description 4
- 238000010276 construction Methods 0.000 abstract 1
- 230000000875 corresponding effect Effects 0.000 description 35
- 239000011159 matrix material Substances 0.000 description 10
- 230000008901 benefit Effects 0.000 description 5
- 238000011160 research Methods 0.000 description 4
- 238000010219 correlation analysis Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000001914 filtration Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000012098 association analyses Methods 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000035800 maturation Effects 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
- G06F16/94—Hypermedia
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- Business, Economics & Management (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of cross-media correlation learning, and specifically concerns a cross-modal topic correlation modeling method based on deep learning. The invention comprises two main algorithms: multi-modal document representation based on deep words, and relational topic model modeling that fuses cross-modal topic correlation learning. The invention uses deep learning techniques to construct deep semantic words and deep visual words that describe, respectively, the semantic description part and the image part of a multi-modal document. On the basis of this multi-modal document representation, the entire multi-modal document collection is modeled by building a cross-modal relational topic model, so that both the generative process of multi-modal documents and the associations between different modalities are described. The method is highly accurate and adaptable. For large-scale multi-modal documents (text plus images), it is of great importance for cross-media information retrieval that efficiently considers multi-modal semantic information; it can improve retrieval relevance and enhance user experience, and has broad application prospects in the field of cross-media information retrieval.
Description
Technical field
The invention belongs to the field of cross-media correlation learning, and in particular to a deep-learning-based method for learning cross-modal image-text topic correlation.
Background art
With the development of Internet technology and the maturation of Web 2.0, massive numbers of multi-modal documents have accumulated on the Internet. How to analyze and handle the complex structure of these multi-modal documents, so as to provide theoretical support for practical applications such as cross-media retrieval, has become a very important research hotspot. A multi-modal document usually exists in the form of several co-occurring modalities; for example, many web images are accompanied by user-defined image descriptions or tags, and many web documents contain illustrations. However, although these multi-modal data are usually associated with each other, the semantic gap between the visual information of an image and its textual description leaves large differences between them [1], which makes it very difficult to fully exploit the semantic associations between different modalities. Therefore, how to fully mine the implicit relationships behind the data of different modalities, and how to better fuse the multi-modal information when modeling multi-modal documents, has become very important [2,3]. Modeling multi-modal documents with topic models, and then mining the associations between the modalities, is a key strategy. In research on cross-modal topic modeling, three interrelated problems need to be solved simultaneously:
1. Finding and constructing more representative and more valuable document elements to describe the image content and the text content of a multi-modal document respectively.
2. Establishing a more reasonable topic correlation model to better describe the associations between data of different modalities in a multi-modal document, i.e., the association between a visual image and its textual description.
3. Establishing an objective measurement mechanism for the internal association between image and text content through cross-modal topic correlation learning.
To solve the first problem, the key is to explore and establish a set of optimized document elements, so that the visual and semantic features in multi-modal documents can be represented more accurately and more completely with these elements.
To solve the second problem, the key is to establish a more robust probabilistic topic model, so as to mine the implicit topic information behind the data and maximize the likelihood of the observed multi-modal documents.
To solve the third problem, the most effective approach is to map the attribute features of the different modalities into a common embedding subspace, so as to maximize the correlation information between the modalities.
A number of researchers have already proposed methods for multi-modal data modeling. From the modeling perspective, these methods can be roughly divided into two types: statistical dependence modeling methods, and methods that build a joint probabilistic generative model.
(1) Statistical dependence modeling methods
The core idea of statistical modeling methods is to map the data features of different modalities into the same latent space, in the hope of maximally exploiting the statistical correlations between the features of the different modalities. Taking image and text as an example, corresponding mapping matrices are built to map image features and text features of different structures into the same common subspace, and the correlation between an image and a text is computed in that subspace: the more relevant an image and a text are, the closer they lie in the common subspace, while a large distance means a low image-text correlation. Canonical Correlation Analysis (CCA) is the most typical statistical dependence method. It obtains the corresponding basis matrices by seeking the maximum correlation between the visual feature matrix and the semantic feature matrix; the basis matrices maximally preserve the correlation between the visual and semantic features of an image and provide the mappings into an isomorphic subspace. The visual feature vector and the semantic feature vector of an image are then mapped to the same dimensionality in this isomorphic subspace to construct a cross-modal fused feature, realizing a unified representation of the different modalities of media data. Later work such as Kernel CCA (KCCA) and Deep CCA (DCCA) explores the dependence between image and text at deeper levels.
Work [4] combines statistical modeling with topic models. The method first uses latent Dirichlet allocation to extract the visual topic features of the images and the textual topic features of the texts, and then uses canonical correlation analysis to map the visual and textual topic features into an isomorphic subspace and compute their correlation. [5] extends this work and computes the correlation with KCCA.
(2) Methods that build a joint probabilistic generative model
Multi-modal topic models are the typical representatives of joint probabilistic generative models, and in recent years much related work has carried out probabilistic topic modeling of the visual content and the semantic descriptions in multi-modal documents [6,7,8,9,10]. In 2003, [Blei2003] established a series of increasingly complex topic models [11], among which Correspondence Latent Dirichlet Allocation (Corr-LDA) is the best cross-modal topic model. The model assumes a correspondence between the latent topics of the different modalities, i.e., the latent topic behind an annotation comes from the latent topics of the visual information of the image. This assumption establishes a one-way mapping: the generation of the text words depends on the visual content of the image. Later, [Wang2009] proposed a supervised topic model to learn the latent relationship between images and annotation words [12], and [Putthividhya2010] proposed a topic-regression multi-modal latent Dirichlet allocation model [13]. [Rasiwasia2010] studied the joint modeling of text and image content in multi-modal documents [3]. [Nguyen2013] proposed an image annotation method based on the joint distributions of features and words, and of words and topics [9]. [Niu2014] proposed a semi-supervised relational topic model to explicitly model the relationship between image content and images [14]. [Wang2014] proposed a semi-supervised multi-modal mutual topic reinforcement model that explores the mutually reinforcing relationships between the topics of different modalities [15]. [Zheng2014] proposed a supervised variant of DocNADE that models the joint distribution of the visual words, annotation words and class labels of an image [16]. [Chen2015] addresses the modeling gap between image and text by building a visual-emotional LDA model [17].
As the above analysis shows, current methods have all made progress in multi-modal document modeling; however, none of them fully considers the influence of the following three aspects:
(1) Deep information mining in multi-modal documents. Most existing image-tag correlation learning methods only explore the associations between modalities on top of traditional visual feature representations and annotation features, without considering the deep features contained in these modalities. For constructing holistic visual semantics and internal semantic associations, this leads to a series of serious information-loss problems. Exploring the depth of multi-modal documents can make up for this deficiency, so that the resulting feature elements represent multi-modal documents better.
(2) Relational topic correlation modeling based on deep analysis. When constructing the topic correlations of different modalities, most existing topic modeling methods are based on the assumption that the topics hidden behind the different modalities are identical. Such an assumption is usually too absolute and introduces unnecessary noise while the topic correlations are being constructed. It is therefore particularly important to build a more reasonable assumption that fuses the deep feature information and forms a better relational topic correlation modeling mechanism.
(3) Cross-modal correlation learning through deep topic features. When computing the correlation between modalities, most existing multi-modal topic models directly match the topic distribution features hidden behind the different modalities in order to capture the internal association between the visual image and the textual description. However, such a direct matching does not take the heterogeneity of image and text into account; mapping the deep topic features into a common space to learn their correlation can mine the correlation well and solve the problem raised above.
Therefore, it is highly desirable to draw on existing mature techniques and, taking all of the above problems into account, to analyze and compute the topic correlation between the different modalities more comprehensively. The present invention is motivated by exactly this. Proceeding from the parts to the whole, it designs a novel technical framework (comprising three main algorithms) covering deep word construction in multi-modal documents, relational topic model construction, and heterogeneous topic correlation learning, so as to establish an effective cross-modal topic correlation computation method and ultimately improve cross-media image retrieval performance.
Summary of the invention
The object of the invention is to propose a cross-modal topic correlation modeling method based on deep learning, so as to improve cross-media social image retrieval performance.
The invention first proposes a novel deep cross-modal topic correlation model. The model is used to model a large-scale multi-modal corpus and can deeply analyze and understand the correlation information between the images and the text in multi-modal documents; with the constructed model, the performance of cross-media retrieval can be effectively promoted. The model mainly comprises the following components:
(1) Deep word construction (Deep Word Construction). For a multi-modal document, deep learning techniques are used to construct deep words as its basic representation elements. The deep words include deep visual words and deep text words: the deep visual words are used to better describe the visual content of the images in the document, while the deep text words serve as the basic elements for describing the text content in the document. Compared with traditional visual words and text words, deep words can mine the semantic information of a document at a deeper level. With this construction, a multi-modal document can be better represented by deep words.
(2) Multi-modal topic information generation (Multimodal Topic Information Generation). On the basis of the constructed deep words, the topic model LDA is used to further mine the topic information hidden behind the data of the different modalities. The topic model assumes that a common set of topics lies behind the document collection and that each word in a document corresponds to one topic; under this assumption, each document is further represented by the topic features derived for it.
(3) Cross-modal topic correlation analysis (Cross-modal Topic Correlation Analysis). It is assumed that the topics hidden behind documents of different modalities are heterogeneous but correlated; for example, the topic "wedding" behind a text document may have very high correlation with the topic "white" behind an image. Therefore, by constructing a common subspace, the topic features of the different modalities are mapped into the common subspace so as to find the correlation information between the modalities.
(4) Relational topic modeling (Relational Topic Modeling). When generating the topic features of the different modalities, the relational topic model simultaneously considers the correlation information between image and document; that is, when constructing the topics of a document, it considers not only the information of the same modality but also the correlation information with the other modalities, so that the final topics fuse the multi-modal information, and the construction finally yields the topic distributions behind the multi-modal documents together with the cross-modal correlation information.
Compared with current multi-modal topic modeling methods, the proposed method has two major advantages in application. First, high accuracy, mainly reflected in the following: the method replaces traditional words with the constructed deep words, can mine the deep information of each modality more thoroughly, and can alleviate the problems brought by the semantic gap well, thus better improving the efficiency of cross-media retrieval. Second, strong adaptability: because the constructed model models the associations between the different modalities well, it is suitable for bidirectional cross-media information retrieval, i.e., retrieving text with images and retrieving images with text, and the model can also easily be extended to cross-media information retrieval involving other modalities (such as audio).
The cross-modal topic correlation modeling method based on deep learning provided by the invention comprises the following specific steps:
(1) Data preprocessing: collect images of different modalities from a multimedia data set to obtain images and image description data, and remove annotation words that rarely appear or are useless from the image annotation data set;
(2) Extracting multi-modal deep features: extract the visual features of the images and the semantic features of the image descriptions using deep learning methods. Specifically, a Region-CNN (Convolutional Neural Network) model and a Skip-gram model are used to extract the region features of the images and the word features of the texts respectively. Region-CNN first detects a set of representative candidate regions in an image, and then uses a pre-trained convolutional neural network to extract the feature corresponding to each region; the Skip-gram model directly trains on the co-occurrence information between text words to obtain a feature-vector representation of each word;
(3) Constructing deep bag-of-words: first cluster the image region features and text word features obtained in step (2) with the clustering algorithm K-means to obtain a deep visual dictionary and a deep text dictionary of limited size; then map all region features of each image to the corresponding visual dictionary, thereby constructing deep visual bag-of-words; similarly, the words of all texts are mapped to the text dictionary to obtain deep text bag-of-words;
(4) Multi-modal topic generation: simulate the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, and derive the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words;
(5) Relational topic model modeling fusing cross-modal topic correlation analysis: construct the corresponding relational topic model, which considers the correlation between the topic features of the different modalities while the topic model is being constructed. Taking the multi-modal topic features obtained in step (4) as initial values, the correlation between images and texts is computed using the correlation information between them, and the computed correlation in turn updates the topic information of the multi-modal documents; correlation computation and topic distribution update thus alternate iteratively until the final relational topic model is constructed;
(6) Cross-media information retrieval based on topic correlation: apply the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, use correlation computation to obtain the data of the other modality most relevant to the query.
Each of the above steps is described in detail below:
(1) Data preprocessing
This step performs preliminary preprocessing on the collected images of the different modalities. Specifically, the annotations attached to images contain noise caused by the randomness of user annotation; this noise is removed by word-frequency filtering, i.e., words whose frequency is below a threshold are filtered out to obtain a new dictionary.
(2) Extracting multi-modal deep features
In the invention, Region-CNN and the Skip-gram model are used to extract the region features of the images and the word features of the texts respectively. They are described in turn:
Given an image, Region-CNN first uses the method of selective search to select the positions where objects are likely to appear as a candidate set (usually around 2,000 candidates), each existing in the form of a region, and then extracts a CNN feature for each region. In the concrete implementation, Region-CNN converts each image region to a fixed size of 227*227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers. Compared with traditional visual features, the advantage of extracting visual features with Region-CNN is that the deep features extracted by the CNN are closer to the semantics of the image itself, which alleviates the semantic-gap problem to some extent.
Given a text document, the feature vector corresponding to each word appearing in the text document is obtained by training a Skip-gram model. The Skip-gram model is a very effective method for learning distributed representations of text words; it was first proposed by Mikolov et al. in 2013 and has since been widely used in various natural language processing tasks. The model captures the syntactic and semantic relationships between text words well and, compared with traditional word-vector learning methods, aggregates semantically similar words together. An important advantage of Skip-gram is that, since it involves no complex dense matrix operations, training is efficient even on massive data. Let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words appearing in TD, and TV the dictionary corresponding to the text words. For each word tw in TW, iv_tw and ov_tw are the input feature vector and output feature vector of tw, and Context(tw) is the set of words appearing in the context of tw; the context window size is set to 5 in the invention. All input vectors and output vectors of the entire text data set are jointly represented by one long parameter vector W ∈ R^{2·|TV|·dim}, where dim is the dimensionality of the input and output vectors. The objective function of the entire Skip-gram model can therefore be written as:
L = \frac{1}{|TW|} \sum_{tw_i \in TD} \sum_{tw_j \in Context(tw_i)} \log P(tw_j \mid tw_i)   (1)
Training Skip-gram with the traditional softmax is computationally very expensive, so the negative sampling method is used to approximate log P(tw_j | tw_i):
\log P(tw_j \mid tw_i) \approx \log \sigma(ov_{tw_j}^T iv_{tw_i}) + \sum_{k=1}^{m} E_{tw_k \sim P(tw)}[\log \sigma(-ov_{tw_k}^T iv_{tw_i})]   (2)
where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a noise distribution P(tw) based on word frequency.
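As an illustration of the negative-sampling approximation in formula (2), the following is a minimal NumPy sketch (Python is used purely for illustration; the embedding matrices, index arguments and function name are assumptions of the sketch, not part of the invention):

```python
import numpy as np

def neg_sampling_logprob(iv, ov, center, context, negatives):
    """Approximate log P(tw_j | tw_i) per formula (2).
    iv, ov: input/output embedding matrices of shape (|TV|, dim);
    center, context: the word indices tw_i and tw_j;
    negatives: m word indices drawn from the frequency-based noise
    distribution P(tw)."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigma(ov[context] @ iv[center]))       # positive-pair term
    neg = sum(np.log(sigma(-ov[k] @ iv[center]))        # m negative samples
              for k in negatives)
    return pos + neg
```

In practice an off-the-shelf word2vec implementation would be used for training; the sketch only makes the two terms of formula (2) explicit.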
(3) Constructing deep bag-of-words
On the basis of the deep word features obtained in step (2), deep bag-of-words are further constructed by the method of vector quantization [25]. Specifically, for the region candidate sets and corresponding features extracted with R-CNN, the region features contained in all images of the multi-modal document set are first clustered with the K-means method to obtain a fixed number of categories; the center of each cluster serves as the representative element of that category, and all these categories together constitute the corresponding dictionary. Then every candidate region of an image is mapped to its category for representation: the mapping computes the Euclidean distance between the feature of each region and each cluster-center feature, finds the category nearest to the region feature, and increments the position of the vector corresponding to that category. In this way, every image in the entire data set is represented in the form of a deep visual bag-of-words: each image corresponds to one vector, the dimensionality of the vector is the number of categories, and each element of the vector is the number of times that category occurs in the image, denoted by the vector VT ∈ R^C, where C is the number of clusters obtained. Similarly, for all word vectors corresponding to the text documents, a corresponding deep text dictionary is obtained by means of clustering, and with the same mapping method each text is finally represented in the form of a deep text bag-of-words.
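The vector-quantization construction of the deep visual bag-of-words can be sketched as follows (a minimal sketch with scikit-learn; the dictionary size C = 100 is taken from the external clustering size mentioned in the embodiment below, and the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_visual_bow(region_feats_per_image, C=100):
    """region_feats_per_image: list of (N_regions, 4096) arrays, one per image.
    Returns the deep visual dictionary (cluster centers) and the (M, C)
    bag-of-words matrix, one VT ∈ R^C vector per image."""
    all_feats = np.vstack(region_feats_per_image)
    km = KMeans(n_clusters=C).fit(all_feats)       # dictionary = cluster centers
    bows = np.zeros((len(region_feats_per_image), C))
    for i, feats in enumerate(region_feats_per_image):
        for label in km.predict(feats):            # nearest center (Euclidean)
            bows[i, label] += 1                    # accumulate at that category
    return km, bows
```

The same construction, applied to the Skip-gram word vectors, yields the deep text bag-of-words.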
(4) Multi-modal topic generation
Multi-modal information, i.e., the visual information of an image combined with its semantic description, is a very important form of expression for multi-modal document content. Therefore, in order to better compute the cross-modal correlation between a visual image and its text annotation, it is particularly important to extract representative multi-modal features more accurately; such multi-modal feature representations can better explore the association between the perceptual properties and the semantic expression features of images.
The latent Dirichlet allocation (LDA) algorithm is a generative probabilistic model for discrete data and has received the highest attention in the image/text research fields. LDA represents each document with a set of probability distributions, and each word in a document is generated from an individual topic. The advantage of LDA is that it considers the inherent statistical structure of documents, such as the co-occurrence information of different words over the entire document collection; it assumes that each word of each document is generated from an individual topic, and that the topic itself is generated from a Dirichlet distribution over all topics. LDA represents each document as a probability distribution vector over the topic set, and these vectors are used to represent the visual features and text features of social images.
In step (4), the latent Dirichlet allocation model is used to probabilistically model the image collection and the text collection separately. The model assumes that a common set of topics is hidden behind a document collection; each specific document corresponds behind it to a probability distribution over this topic set, and each word in the document corresponds behind it to a topic generated by that probability distribution. Moreover, the probability distributions of all documents are not unrelated to each other: they are all generated from a common Dirichlet distribution. On the basis of this model assumption, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the probability topic distributions hidden behind the documents of the different modalities (text documents and visual documents) are derived with the LDA model, which lays the foundation for establishing, in the next step, the relational topic model that fuses the cross-modal correlation information.
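As a sketch of this step, the topic distributions used to initialize step (5) could be obtained with a standard LDA implementation (the topic count and priors are assumed settings, and random data stands in for the bag-of-words matrix so the sketch runs on its own):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the (M, C) deep visual bag-of-words matrix from step (3).
visual_bows = np.random.randint(0, 5, size=(200, 100))

lda_v = LatentDirichletAllocation(n_components=50,       # topic count (assumed)
                                  doc_topic_prior=0.1,   # α (assumed)
                                  topic_word_prior=0.01) # β (assumed)
theta_v = lda_v.fit_transform(visual_bows)  # θ^v: per-image topic distributions
phi_v = lda_v.components_                   # unnormalized topic–visual-word weights
```

The analogous call on the deep text bag-of-words yields θ^t and the topic–text-word distributions.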
(5) Relational topic model modeling fusing cross-modal topic correlation analysis
Building the relational topic model integrates the correlation information between the different modalities into the topic model construction process. Specifically, the topic distributions of the different modalities obtained in step (4) are taken as initial values, the correlation between the topic features of the different modalities is computed by mapping the topic features into a common subspace, and this correlation computation is integrated into the topic model, so that when the topics hidden behind the documents of one modality are derived, the correlation information with the other modality is also considered. The finally obtained topic information thus considers not only the distribution information within the same modality but also the relationship with the other modalities.
The main goal of this step is to construct a joint probability distribution that maximizes the likelihood of the observed multi-modal documents. In constructing the model, the multi-modal document collection D_M is divided into three parts: the first part is the visual image collection D_V, the second part is the text description collection D_T, and the third part is the link set L_VT (this set represents the correlation information between the images and the texts). D_V is composed of the deep visual word set DW_V, with DV_V the deep visual dictionary; the text description collection D_T is composed of the deep text word set DW_T, with DV_T the deep text dictionary. For l_vt ∈ L_VT, l_vt = 1 means that the visual image d_v ∈ D_V and the text description d_t ∈ D_T are correlated, while l_vt = 0 means that the visual image d_v and the text description d_t are uncorrelated. Based on the above, the relational topic model is formalized as follows. Let TS_V be the visual topic set and TS_T the text topic set; α and β are two hyperparameters, where α is the Dirichlet prior over the topics and β the Dirichlet prior over the topic–deep-word distributions; θ_v is the topic distribution behind the visual image d_v, and θ_t is the topic distribution behind the text document d_t; Φ is, for each topic, the corresponding multinomial distribution over all deep words; z is the topic behind each word actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet distribution and the multinomial distribution respectively; N_d denotes the number of deep words in document d, and n indexes the n-th deep word. The generative process of the entire relational topic model is as follows:
(1) For each topic tv in the visual topic set TS_V:
(a) Sample the multinomial distribution of tv over all visual words from the topic–visual-word Dirichlet distribution, i.e.: φ^v_tv ~ Dir(φ^v | β^v).
(2) For each topic tt in the text topic set TS_T:
(a) Sample the multinomial distribution of tt over all text words from the topic–text-word Dirichlet distribution, i.e.: φ^t_tt ~ Dir(φ^t | β^t).
(3) For each visual document d ∈ D_V:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e.: θ^v_d ~ Dir(θ^v | α^v).
(b) For each deep visual word w^v_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e.: z^v_{d,n} ~ Mult(θ^v_d);
ii. Sample the word at this position of the document from the topic–visual-word distribution, i.e.: w^v_{d,n} ~ Mult(φ^v_{z_{d,n}}).
(4) For each text document d ∈ D_T:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e.: θ^t_d ~ Dir(θ^t | α^t);
(b) For each deep text word w^t_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e.: z^t_{d,n} ~ Mult(θ^t_d);
ii. Sample the word at this position of the document from the topic–text-word distribution, i.e.: w^t_{d,n} ~ Mult(φ^t_{z_{d,n}}).
(5) For each link l_vt ∈ L_VT, representing the correlation information between visual document d_v and text document d_t:
(a) Sample l_vt according to the correlation computed from the topic features of d_v and d_t, i.e.: l_vt ~ TCor(z̄_v, z̄_t, M_v, M_t), where z̄_v and z̄_t are the empirical topic distributions of documents d_v and d_t respectively, and M_v and M_t are two mapping matrices that map the visual and text topic features into the common subspace, whose dimensionality is dim. TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation.
Based on the above process, the joint probability distribution modeling the entire multi-modal document collection is finally constructed as:
P(D_M) = P(Φ | β) · P(DW_V, z^v, θ^v | α^v, Φ^v) · P(DW_T, z^t, θ^t | α^t, Φ^t) · P(L_VT | z̄^v, z̄^t, M_v, M_t)   (3)
where the first term corresponds to the topic–deep-word generative process, the middle two terms correspond to the generative processes of the deep visual words and the deep text words, and the last term corresponds to the image–description link generative process.
(6) Cross-media information retrieval (application of the relational topic model)
Step (6) applies the relational topic model established in step (5) to cross-media information retrieval. Taking image and text as an example, cross-media information retrieval can be divided into two classes, text-query-image and image-query-text. Text-query-image ranks all images according to their relevance to a given query text, computed with the relational topic model; image-query-text ranks all text documents according to their relevance to a given query image.
For a given query (for example, querying texts with an image), the corresponding topic features are derived with the relational topic model, and the correlation information with the documents of the other modality (in this example, the text documents) is computed with the topic-feature correlation computation method obtained in step (5); the text documents are then ranked by the magnitude of the correlation, so that the text documents most relevant to the query image are returned. The same process likewise applies to the cross-media information retrieval of images with a text query.
In conclusion the present invention is proposed for content isomerism and relevance between different modalities in multi-modal document
A kind of cross-module state topic relativity modeling method based on deep learning, and then can be with the form of probabilistic model to entire multimode
The generating process of state document is described, and the correlation between the document of different modalities is quantified.The method of the present invention can
Effectively to apply to improve retrieval relevance across in media information retrieval for large-scale image, enhance user experience.
Description of the drawings
Fig. 1 is the flow chart of the invention.
Fig. 2 is a schematic diagram of constructing the deep-word representation of a multi-modal document.
Fig. 3 is a schematic diagram of the cross-modal relational topic correlation modeling process.
Fig. 4 compares the proposed relational topic model with a traditional multi-modal topic model.
Fig. 5 shows the results of cross-media information retrieval with the constructed relational topic model.
Specific embodiment
The cross-modal correlation computation method of the invention for social images is described in detail below with reference to the accompanying drawings.
(1) Collecting data objects
Data objects are collected to obtain images and image annotation data, and annotation words that rarely appear in the whole data set or are useless are removed. The collected data set generally contains a lot of noisy data, so the data are appropriately processed and filtered before feature extraction. The obtained images are all in a unified JPG format and need no transformation. The obtained image annotations, however, contain many meaningless words, such as words mixed with digits that carry no meaning, and some images have dozens of annotations; to let the annotations describe the main information of an image well, the useless and meaningless annotations should be discarded. The processing steps are as follows:
Step 1: count the frequency with which every word appears in the annotations of the data set;
Step 2: filter out the meaningless words and the words containing digits;
Step 3: delete the words that rarely occur in the image annotations of the data set, treating them as minor information about the images.
Through the above steps the processed image annotations are obtained. Low-frequency words are removed in step 3 because, within a cluster of images of the same class, the annotations contain many identical or semantically close words; it is therefore entirely reasonable to filter annotations by their frequency of occurrence.
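A minimal sketch of steps 1-3 above (the frequency threshold is an assumed value; the invention only requires a cutoff):

```python
from collections import Counter

def filter_annotations(annotations, min_freq=5):
    """annotations: dict mapping image id -> list of tag words.
    Returns the annotations with digit-bearing and rare words removed."""
    # Step 1: count the frequency of every word over the whole data set.
    freq = Counter(w for tags in annotations.values() for w in tags)
    # Steps 2-3: drop words containing digits and low-frequency words.
    keep = {w for w, c in freq.items()
            if c >= min_freq and not any(ch.isdigit() for ch in w)}
    return {img: [w for w in tags if w in keep]
            for img, tags in annotations.items()}
```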
(2) Multi-modal feature extraction
Fig. 2 shows the process of extracting features by means of deep learning and constructing the deep words. In the invention, Region-CNN is used to detect the regions of an image and extract the corresponding CNN features; the feature dimensionality is 4,096. Usually, Region-CNN selects around 2,000 regions per image as candidates, so the feature matrix corresponding to one image has 2,000*4,096 dimensions. If the regions of all images were then clustered directly, the data volume would be M*2,000*4,096, where M is the number of images; obviously the space-time cost brought by such a data volume is huge. To solve this practical problem, a combination of internal and external clustering is used in the concrete operation: first, the regions contained in each image are clustered once internally (into 10 classes), and then one external clustering is performed over all images (into 100 classes); the data volume actually entering the final external clustering is thus only M*10*4,096, which greatly reduces the space-time cost of clustering. Another point that needs explanation is that both the Region-CNN visual feature extraction and the Skip-gram word feature extraction operate with pre-trained models: Region-CNN is pre-trained with AlexNet on ImageNet, and Skip-gram uses a model trained on Wikipedia documents containing 6 billion words. This is mainly because training deep neural networks requires a large amount of data; to avoid the problem of over-fitting, models trained on large-scale data sets are applied to the real data to extract the corresponding features.
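The internal-external clustering trick can be sketched as follows (a minimal sketch; the class counts 10 and 100 are those given above, the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_cluster(region_feats_per_image):
    """Cluster each image's ~2,000 region features into 10 internal centers,
    then cluster the pooled M*10 centers into the 100 external classes that
    form the deep visual dictionary."""
    internal = [KMeans(n_clusters=10).fit(f).cluster_centers_   # (10, 4096)
                for f in region_feats_per_image]
    pooled = np.vstack(internal)                                # (M*10, 4096)
    return KMeans(n_clusters=100).fit(pooled)                   # external dictionary
```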
(3) Cross-modal topic correlation computation
Fig. 3 shows the cross-modal relational topic correlation modeling process. The function TCor(z̄_v, z̄_t, M_v, M_t) introduced above is used to compute the correlation between a visual document d_v and a text document d_t, where M_v and M_t are the mapping matrices for the visual topic features and the text topic features respectively; TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation. TCor(·) is defined in formula (4) in one of two modes, for different data types: the first mode maps the dot product of the mapped topic features into the range [0, 1] with the sigmoid function, i.e., TCor(l_vt = 1) = σ((M_v z̄_v)^T (M_t z̄_t)); the second mode computes the topic correlation as the normalized cosine similarity of the two mapped vectors. Meanwhile, based on the generated multi-modal topic distributions, the parameters M_v and M_t can be trained by maximum likelihood estimation (MLE), i.e., by maximizing the log-likelihood of formula (4), with the objective function:
(M_v*, M_t*) = argmax_{M_v, M_t} Σ_{l_vt ∈ L_VT} log TCor(l_vt | z̄_v, z̄_t, M_v, M_t)   (5)
Based on this objective function, the mapping matrices M_v and M_t can be computed by the gradient descent method. It should be noted that, in the actual training process, if the number of multi-modal documents is |D_M| and each multi-modal document normally contains only one group of image and text, the number of image documents and the number of text documents are essentially the same and both equal the number of multi-modal documents, i.e., |D_V| = |D_T| = |D_M|. If the text and the image appearing in the same multi-modal document are treated as correlated, and those not in the same multi-modal document as uncorrelated, the ratio of positive samples (correlated image-text pairs) to negative samples (uncorrelated image-text pairs) in the converted training data is about 1/|D_M|. Such a ratio leads to a serious disproportion between the negative and positive samples; moreover, an image and a text not being in the same multi-modal document does not mean that they are completely uncorrelated (they may belong to the same category). In practice, the ratio of negative to positive samples is therefore set to 1:1, and the randomly selected negative samples must satisfy the following constraint: the corresponding image and text must not come from the same category.
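The two modes of TCor can be sketched as follows (assumptions of the sketch: the cosine mode is rescaled to [0, 1], and the function name is hypothetical):

```python
import numpy as np

def tcor(theta_v, theta_t, Mv, Mt, mode="sigmoid"):
    """Topic correlation TCor(l_vt = 1) of a visual/text document pair.
    Mv, Mt: mapping matrices of shape (dim, n_topics) projecting the
    topic features into the common dim-dimensional subspace."""
    pv, pt = Mv @ theta_v, Mt @ theta_t
    if mode == "sigmoid":                      # mode 1: sigmoid of dot product
        return 1.0 / (1.0 + np.exp(-(pv @ pt)))
    cos = pv @ pt / (np.linalg.norm(pv) * np.linalg.norm(pt))
    return 0.5 * (1.0 + cos)                   # mode 2: normalized cosine

```

M_v and M_t would then be fitted by gradient ascent on the log-likelihood of the sampled 1:1 positive/negative links, as described above.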
(4) Derivation of the multi-modal relational topic model
Formula (3) gives the relational topic model constructed in the invention; the parameters of the model are derived with the Gibbs sampling method [26]. The purpose of Gibbs sampling is to obtain the topic implied behind each word in the multi-modal documents. During sampling, the marginal distribution over the deep words, the topic assignments of the words, and the corresponding cross-modal association links is first derived (formula (6)), where m_{d,tt} is the number of times topic tt occurs in document d, and n_{tt,w} is the number of words generated by topic tt in the entire document collection. From formula (6), the univariate probability distribution of the topic variable z can be further derived, which yields the sampling rule for the topic behind each word in a document, shown in formula (7), where m^{-n}_{d,tt} denotes the number of occurrences of topic tt in document d after removing the current word, and n^{-n}_{tt,w} denotes the number of words assigned to topic tt after removing the current word. Based on this sampling rule, the topic implied behind every word in the entire document collection can be sampled. After each sampling pass, the mapping matrices M_t and M_v are computed with formula (5) on the basis of the topic distributions obtained in the current pass, and the M_t and M_v obtained in the current pass serve as the input of the next sampling pass; this alternation continues until the iteration termination condition is reached, yielding the final topic information together with the mapping matrices M_t and M_v. Correspondingly, the other parameters of the relational topic model, such as Φ^V, Φ^T, θ^V and θ^T, are finally obtained from formula (8).
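Since formulas (6)-(8) are given here only by reference, the following sketch shows a collapsed-Gibbs-style topic update of the kind described, with the cross-modal correlation contribution represented schematically by a `link_factor` argument (its exact form follows formula (7) and is not reproduced; all names are assumptions of the sketch):

```python
import numpy as np

def sample_topic(w, d, m_dt, n_tw, n_t, alpha, beta, V, link_factor):
    """One topic draw for word w of document d. The counts m_{d,tt}
    (m_dt, shape (D, T)), n_{tt,w} (n_tw, shape (T, V)) and per-topic
    totals n_t are assumed already decremented for the current word."""
    p = (m_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
    p *= link_factor            # correlation with the linked document's topics
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```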
(5) Application example
Fig. 5 shows the results of cross-media information retrieval with the constructed relational topic model, in two modes: retrieving text with an image query (Image Query-to-Text) and retrieving images with a text query (Text Query-to-Image). The relevance score is computed as shown in formula (9).
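The retrieval itself reduces to ranking by the correlation score, e.g. for Image Query-to-Text (a sketch reusing the tcor function from section (3) above; the names are hypothetical):

```python
def image_query_to_text(theta_q, text_thetas, Mv, Mt, topk=10):
    """Rank all text documents by TCor against the query image's topic
    feature theta_q and return the indices of the topk most relevant."""
    scores = [tcor(theta_q, th, Mv, Mt) for th in text_thetas]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:topk]
```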
Bibliography
[1]Fan,J.P.;He,X.F.;Zhou,N.;Peng,J.Y.;and Jain,R.2012.Quantitative
Characterization of Semantic Gaps for Learning Complexity Estimation and
Inference Model Selection.IEEE Transactions on Multimedia 14(5):1414-1428.
[2]Datta,R.;Joshi,D.;Li,J.;and Wang,J.Z.2008.Image Retrieval:Ideas,
Influences,and Trends of the New Age.ACM Computing Surveys(CSUR)40(2),
Article 5.
[3]Rasiwasia,N.;Pereira,J.C.;Coviello,E.;Doyle,G.;Lanckriet,G.R.G.;
Levy,R.;and Vasconcelos,N.2010.A New Approach to Cross-modal Multimedia
Retrieval.In Proceedings of MM 2010,251-260.
[4]Pereira,J.C.;Coviello,E.;Doyle,G.;Rasiwasia,N.;Lanckriet,G.R.G.;
Levy,R.;and Vasconcelos,N.2014.On the Role of Correlation and Abstraction in
Cross-Modal Multimedia Retrieval.IEEE Transactions on Pattern Analysis and
Machine Intelligence(PAMI)36(3):521-535.
[5]Barnard,K.;Duygulu,P.;Forsyth,D.;Freitas,N.;Blei,D.M.;and Jordan,
M.I.2003.Matching Words and Pictures.Journal of Machine Learning Research.3:
1107-1135.
[6]Wang,X.;Liu,Y.;Wang,D.;and Wu,F.2013.Cross-media Topic Mining on
Wikipedia.In Proceedings of MM 2013,689-692.
[7]Frome,A.;Corrado,G.S.;Shlens,J.;Bengio,S.;Dean,J.;Ranzato,M.A.;and
Mikolov,T.2013.DeViSE:A Deep Visual-Semantic Embedding Model.In Proceedings
of NIPS 2013.
[8]Feng,F.X.;Wang,X.J.;and Li,R.F.2014.Cross-modal Retrieval with
Correspondence Autoencoder.In Proceedings of MM 2014,7-16.
[9]Nguyen,C.T.;Kaothanthong,N.;Tokuyama,T.;and Phan X.H.2013.A
Feature-Word-Topic Model for Image Annotation and Retrieval.ACM Transactions
on the Web 7(3),Article 12.
[10]Ramage,D.;Heymann,P.;Manning,C.D.;and Molina,H.G.2009.Clustering
the Tagged Web.In Proceedings of WSDM 2009,54-63.
[11]Blei,D.M.;and Jordan,M.I.2003.Modeling Annotated Data.In
Proceedings of SIGIR 2003,127-134.
[12]Wang,C.;Blei,D.;and Fei-Fei L.2009.Simultaneous Image
Classification and Annotation.In Proceedings of CVPR 2009,1903-1910.
[13]Putthividhya,D.;Attias,H.T.;and Nagarajan,S.S.2010.Topic
Regression Multi-Modal Latent Dirichlet Allocation for Image Annotation.In
Proceedings of CVPR2010,3408-3415.
[14]Niu,Z.X.;Hua,G.;Gao,X.B.;and Tian,Q.2014.Semi-supervised
Relational Topic Model for Weakly Annotated Image Recognition in Social
Media.In Proceedings of CVPR2014,4233-4240.
[15]Wang,Y.F.;Wu,F.;Song,J.;Li,X.;and Zhuang,Y.T.2014.Multi-modal
Mutual Topic Reinforce Modeling for Cross-media Retrieval.In Proceedings of
MM 2014,307-316.
[16]Zheng,Y.;Zhang,Y.J.;and Larochelle,H.2014.Topic Modeling of
Multimodal Data:an Autoregressive Approach.In Proceedings of CVPR 2014,1370-
1377.
[17]Chen,T.;SalahEldeen,H.M.;He,X.N.;Kan,M.Y.;and Lu,D.Y.2015.VELDA:
Relating an Image Tweet’s Text and Images.In Proceedings of AAAI 2015.
[18]Girshick,R.;Donahue,J.;Darrell,T.;and Malik,J.2014.Rich feature
hierarchies for accurate object detection and semantic segmentation.In
Proceedings of CVPR 2014,580-587.
[19]Hariharan,B.;Arbelaez,P.;Girshick,R.;and Malik,
J.2014.Simultaneous Detection and Segmentation.In Proceedings of ECCV 2014,
297-312.
[20]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings
for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[21]Zhang,N.;Donahue,J.;Girshick,R.;and Darrell,T.2014.Part-Based R-
CNNs for Fine-Grained Category Detection.In Proceedings of ECCV 2014,834-849.
[22]Mikolov,T.;Sutskever,I.;Chen,K.;Corrado,G.;and Dean,
J.2013.Distributed Representations of Words and Phrases and their
Compositionality.In Proceedings of NIPS 2013.
[23]Tang,D.Y.;Wei,F.R.;Qin,B.;Zhou,M.;and Liu,T.2014.Building Large-
Scale Twitter-Specific Sentiment Lexicon:A Representation Learning
Approach.In Proceedings of COLING 2014,172-182.
[24]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings
for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[25]Sivic,J.,and Zisserman,A.2003.Video Google:A Text Retrieval
Approach to Object Matching in Videos.In Proceedings of ICCV 2003,2:1470-
1477.
[26]Griffiths,T.L.;and Steyvers,M.2004.Finding Scientific Topics.
Proceedings of the National Academy of Sciences of the United
States of America,101(1):5228-5235.
Claims (6)
1. A cross-modal topic correlation modeling method based on deep learning, characterized by the following specific steps:
(1) data preprocessing: collecting images of different modalities from a multimedia data set to obtain images and image description data, and removing annotation words that rarely appear or are useless from the image annotation data set;
(2) extracting multi-modal deep features: extracting the visual features of the images and the semantic features of the image descriptions using deep learning methods; specifically, a Region-CNN model and a Skip-gram model are used to extract the region features of the images and the word features of the texts respectively; wherein Region-CNN first detects a set of representative candidate regions in an image, and then uses a pre-trained convolutional neural network to extract the feature corresponding to each region; the Skip-gram model directly trains on the co-occurrence information between text words to obtain a feature-vector representation of each word;
(3) constructing deep bag-of-words: first clustering the image region features and text word features obtained in step (2) with the clustering algorithm K-means to obtain a deep visual dictionary and a deep text dictionary of limited size; then mapping all region features of each image to the corresponding visual dictionary, thereby constructing deep visual bag-of-words; similarly, mapping the words of all texts to the text dictionary to obtain deep text bag-of-words;
(4) multi-modal topic generation: simulating the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, deriving the topic distribution features hidden behind the text collection and the image collection, and making full use of the co-occurrence information between words;
(5) relational topic model modeling fusing cross-modal topic correlation analysis: constructing the corresponding relational topic model, i.e., considering the correlation between the topic features of the different modalities while the topic model is being constructed; taking the multi-modal topic features obtained in step (4) as initial values, computing the correlation between images and texts using the correlation information between them, and updating the topic information of the multi-modal documents with the computed correlation; correlation computation and topic distribution update thus alternate iteratively until the final relational topic model is constructed;
(6) cross-media information retrieval based on topic correlation: applying the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, obtaining the data of the other modality most relevant to the query by correlation computation.
2. The method according to claim 1, characterized in that in step (2), the Region-CNN and Skip-gram models are used to extract the image region features and the text word features, respectively, as follows:
Given an image, Region-CNN first applies the selective search method to pick positions where objects are likely to appear, forming a candidate set that exists in the form of regions; a CNN feature is then extracted for each region; in a specific implementation, Region-CNN warps each image region to a fixed pixel size of 227*227, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers;
Given a text document, the feature vector of each word occurring in it is obtained by training the Skip-gram model; let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words occurring in TD, and TV the dictionary of the text words; for each word tw in TW, iv_tw and ov_tw are the input feature vector and the output feature vector of tw, and Context(tw) is the set of words occurring in the context of tw; the context window size is set to 5, and all input vectors and output vectors of the entire text data set are represented jointly by one long parameter vector W ∈ R^(2*|TV|*dim), where dim is the dimension of the input and output vectors; the objective function of the whole Skip-gram model is:

L = Σ_{tw_i ∈ TW} Σ_{tw_j ∈ Context(tw_i)} log P(tw_j | tw_i)

The negative sampling method is used to approximate log P(tw_j | tw_i); the calculation formula is as follows:

log P(tw_j | tw_i) ≈ log σ(ov_{tw_j} · iv_{tw_i}) + Σ_{k=1..m} E_{tw_k ~ P(tw)} [ log σ(−ov_{tw_k} · iv_{tw_i}) ]

where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(tw).
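For illustration (not part of the claim), the negative-sampling approximation above can be written as a minimal NumPy sketch; the vectors here are random toy data, whereas in practice the negatives would be drawn from the word-frequency noise distribution P(tw):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_prob(iv_center, ov_context, ov_negatives):
    """Approximate log P(tw_j | tw_i) as in claim 2:
    log sigma(ov_j . iv_i) + sum_k log sigma(-ov_k . iv_i)."""
    pos = np.log(sigmoid(ov_context @ iv_center))
    neg = np.sum(np.log(sigmoid(-(ov_negatives @ iv_center))))
    return pos + neg

# toy setup: dim=4, m=3 negative samples
dim, m = 4, 3
iv_i = rng.normal(size=dim)         # input vector of the center word tw_i
ov_j = rng.normal(size=dim)         # output vector of the context word tw_j
ov_neg = rng.normal(size=(m, dim))  # output vectors of m sampled noise words
print(neg_sampling_log_prob(iv_i, ov_j, ov_neg))
```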
3. The method according to claim 1, characterized in that, on the basis of the deep dictionaries obtained in step (2), step (3) further constructs the deep bag-of-words representations by vector quantization, as follows: for the region candidate set and corresponding features extracted with R-CNN, first cluster the region features of all images in the multi-modal document data set with the K-means method to obtain a fixed number of classes, take the center of each cluster as the representative element of that class, and let all these classes together constitute the corresponding dictionary; afterwards, each candidate region of an image is mapped to its corresponding class: the Euclidean distance between the region feature and each class center is computed, the nearest class is selected, and the vector position corresponding to that class is incremented; in this way every image in the entire data set is represented in the form of a deep visual bag-of-words, i.e., each image corresponds to one vector whose dimension is the number of classes and whose elements are the counts with which each class occurs in the image, written VT ∈ R^C, where C is the number of clusters; similarly, the corresponding deep text dictionary is obtained by clustering all word vectors of the text documents, and with the same mapping method each text is finally represented in the form of a deep text bag-of-words.
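The following sketch illustrates, with scikit-learn and toy data, how the K-means dictionary and the deep visual bag-of-words of claim 3 could be built; the feature dimension, image count, and dictionary size C are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# toy stand-ins: 3 images with a variable number of region features (dim=16)
images = [rng.normal(size=(n, 16)) for n in (7, 4, 9)]

C = 5  # dictionary size (number of clusters); an illustrative choice
kmeans = KMeans(n_clusters=C, n_init=10, random_state=0)
kmeans.fit(np.vstack(images))  # cluster all region features of all images

def deep_bow(region_features):
    """Assign each region to the nearest cluster center (Euclidean distance,
    as in claim 3) and count occurrences -> vector VT in R^C."""
    ids = kmeans.predict(region_features)
    return np.bincount(ids, minlength=C)

for feats in images:
    print(deep_bow(feats))  # one count vector per image
```

The same clustering-then-mapping routine would apply unchanged to the text word vectors to produce the deep text bag-of-words.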
4. The method according to claim 1, characterized in that in step (4) the latent Dirichlet allocation model is used to model the image collection and the text collection probabilistically; the latent Dirichlet allocation model assumes that a common set of topics lies behind the document collection, that each individual document corresponds to a probability distribution over that topic set, and that every word in a document is generated by a topic drawn from the document's underlying distribution; the distributions of the documents are not independent of one another but are generated from one common Dirichlet distribution; under these model assumptions, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the LDA model is used to derive the latent probabilistic topic distributions of the documents of the different modalities.
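As a hedged illustration (the patent derives the topic distributions from its own LDA formulation; scikit-learn's variational LDA is used here only as a stand-in), the deep bag-of-words matrices could be fed to an off-the-shelf LDA as follows; the topic count and prior values are assumptions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# toy deep bag-of-words matrix: 20 documents over a dictionary of 30 deep words
bow = rng.poisson(1.0, size=(20, 30))

# alpha (doc-topic prior) and beta (topic-word prior) correspond to the
# hyperparameters of the claims; the concrete values are illustrative
lda = LatentDirichletAllocation(n_components=8, doc_topic_prior=0.1,
                                topic_word_prior=0.01, random_state=0)
theta = lda.fit_transform(bow)        # per-document topic distributions
print(theta.shape, theta[0].round(3))  # (20, 8); each row sums to ~1
```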
5. The method according to claim 1, characterized in that, while constructing the model in step (5), the multi-modal document set D^M is divided into three parts: the first part is the visual image set D^V, the second part is the text description set D^T, and the third part is the link set L^VT, which expresses the association information between images and texts; D^V consists of the deep visual word set DW^V, with DV^V as the deep visual dictionary, while the text description set D^T consists of the deep text word set DW^T, with DV^T as the deep text dictionary; for l_vt ∈ L^VT, l_vt = 1 means that the visual image d_v ∈ D^V and the text description d_t ∈ D^T are related, and l_vt = 0 means that they are unrelated; on this basis, the relational topic model is formalized as follows: let T^V be the visual topic set and T^T the text topic set; α and β are two hyperparameters, where α parameterizes the Dirichlet distribution over topics and β the topic-deep-word Dirichlet distribution; θ^v is the latent topic distribution of the visual image d_v and θ^t the latent topic distribution of the text document d_t; Φ is the multinomial distribution of each topic over all deep words; z is the latent topic assignment of each word, actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet distribution and the multinomial distribution respectively; N_d is the number of deep words in document d, and n indexes the n-th deep word; the generative process of the whole relational topic model is as follows:
(1) for each topic tv ∈ T^V in the visual topic set:
sample from the topic-visual-word Dirichlet distribution the multinomial distribution of tv over all visual words, i.e.: φ^v_tv ~ Dir(φ^v | β^v);
(2) for each topic tt ∈ T^T in the text topic set:
sample from the topic-text-word Dirichlet distribution the multinomial distribution of tt over all text words, i.e.: φ^t_tt ~ Dir(φ^t | β^t);
(3) for each visual document d ∈ D^V:
(a) sample from the Dirichlet distribution over the topic set the latent topic distribution of d, i.e.: θ^v_d ~ Dir(θ^v | α^v);
(b) for each deep visual word w^v_{d,n} in d:
i. sample the topic of the word from the latent topic distribution of document d, i.e.: z^v_{d,n} ~ Mult(θ^v_d);
ii. sample the word itself from the topic-visual-word distribution, i.e.: w^v_{d,n} ~ Mult(φ^v_{z^v_{d,n}});
(4) for each text document d ∈ D^T:
(a) sample from the Dirichlet distribution over the topic set the latent topic distribution of d, i.e.: θ^t_d ~ Dir(θ^t | α^t);
(b) for each deep text word w^t_{d,n} in d:
i. sample the topic of the word from the latent topic distribution of document d, i.e.: z^t_{d,n} ~ Mult(θ^t_d);
ii. sample the word itself from the topic-text-word distribution, i.e.: w^t_{d,n} ~ Mult(φ^t_{z^t_{d,n}});
(5) for each link l_vt ∈ L^VT, which expresses the association information between a visual document d_v and a text document d_t:
(a) compute the correlation of the topic features of d_v and d_t and sample l_vt from it, i.e.: l_vt ~ TCor(l_vt | z̄^v, z̄^t, M^v, M^t), where z̄^v and z̄^t are the empirical topic distributions of documents d_v and d_t respectively, and M^v and M^t are two mapping matrices that project the visual and text topic features into a common subspace of dimension dim; TCor(l_vt = 1) expresses the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation (an illustrative TCor sketch follows this claim);
Based on the above process, the joint probability distribution modeling the entire multi-modal document collection is finally constructed as follows:

P(DW^V, DW^T, L^VT, z, θ, Φ | α, β) = ∏_{t ∈ T^V ∪ T^T} P(φ_t | β) × ∏_{d ∈ D^V} P(θ^v_d | α^v) ∏_{n=1..N_d} P(z^v_{d,n} | θ^v_d) P(w^v_{d,n} | φ^v_{z^v_{d,n}}) × ∏_{d ∈ D^T} P(θ^t_d | α^t) ∏_{n=1..N_d} P(z^t_{d,n} | θ^t_d) P(w^t_{d,n} | φ^t_{z^t_{d,n}}) × ∏_{l_vt ∈ L^VT} TCor(l_vt | z̄^v, z̄^t, M^v, M^t)

where the first factor corresponds to the topic-deep-word generative process, the middle two to the generation of the deep visual words and the deep text words, and the last to the generation of the image-description links.
6. The method according to claim 1, characterized in that step (6) applies the relational topic model established in step (5) to cross-media information retrieval; cross-media retrieval falls into two classes, text-query-image and image-query-text: text-query-image ranks all images by the relevance of each image to a given query text, computed with the relational topic model, while image-query-text ranks all text documents by their relevance to a given query image;
for a given query image, the corresponding topic feature is derived with the relational topic model, the correlation to the documents of the other modality is computed with the topic-feature correlation method of step (5), and the text documents are ranked by this correlation score, so that the text documents most relevant to the query image are returned; the same procedure likewise applies to cross-media retrieval with a text query against images.
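A minimal ranking sketch for the image-query-text direction follows (text-query-image is symmetric); the dot-product score merely stands in for the TCor computation of claim 5, and all topic features are toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: one query image's topic feature and 5 candidate text documents',
# assumed to be already projected into the common subspace of claim 5
query_img = rng.dirichlet(np.ones(6))
texts = rng.dirichlet(np.ones(6), size=5)

# score each text against the query and rank in descending order of relevance
scores = texts @ query_img
ranking = np.argsort(-scores)
print("texts ranked by relevance:", ranking, scores[ranking].round(3))
```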
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610099438.9A CN105760507B (en) | 2016-02-23 | 2016-02-23 | Cross-module state topic relativity modeling method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105760507A CN105760507A (en) | 2016-07-13 |
CN105760507B true CN105760507B (en) | 2019-05-03 |
Family
ID=56330274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610099438.9A Expired - Fee Related CN105760507B (en) | 2016-02-23 | 2016-02-23 | Cross-module state topic relativity modeling method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105760507B (en) |
Families Citing this family (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018046452A1 (en) | 2016-09-07 | 2018-03-15 | Koninklijke Philips N.V. | Systems, methods, and apparatus for diagnostic inferencing with a multimodal deep memory network |
CN106156374A (en) * | 2016-09-13 | 2016-11-23 | 华侨大学 | A kind of view-based access control model dictionary optimizes and the image search method of query expansion |
US11068652B2 (en) * | 2016-11-04 | 2021-07-20 | Mitsubishi Electric Corporation | Information processing device |
CN108073576A (en) * | 2016-11-09 | 2018-05-25 | 上海诺悦智能科技有限公司 | Intelligent search method, searcher and search engine system |
CN108198625B (en) * | 2016-12-08 | 2021-07-20 | 推想医疗科技股份有限公司 | Deep learning method and device for analyzing high-dimensional medical data |
CN106777050B (en) * | 2016-12-09 | 2019-09-06 | 大连海事大学 | It is a kind of based on bag of words and to take into account the shoes stamp line expression and system of semantic dependency |
CN106778880B (en) * | 2016-12-23 | 2020-04-07 | 南开大学 | Microblog topic representation and topic discovery method based on multi-mode deep Boltzmann machine |
CN106650756B (en) * | 2016-12-28 | 2019-12-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | knowledge migration-based image text description method of multi-mode recurrent neural network |
CN106886783B (en) * | 2017-01-20 | 2020-11-10 | 清华大学 | Image retrieval method and system based on regional characteristics |
CN107145910A (en) * | 2017-05-08 | 2017-09-08 | 京东方科技集团股份有限公司 | Performance generation system, its training method and the performance generation method of medical image |
CN107273517B (en) * | 2017-06-21 | 2021-07-23 | 复旦大学 | Graph-text cross-modal retrieval method based on graph embedding learning |
CN109213988B (en) * | 2017-06-29 | 2022-06-21 | 武汉斗鱼网络科技有限公司 | Barrage theme extraction method, medium, equipment and system based on N-gram model |
TWI636404B (en) * | 2017-07-31 | 2018-09-21 | 財團法人工業技術研究院 | Deep neural network and method for using the same and computer readable media |
CN107480289B (en) * | 2017-08-24 | 2020-06-30 | 成都澳海川科技有限公司 | User attribute acquisition method and device |
CN108305296B (en) * | 2017-08-30 | 2021-02-26 | 深圳市腾讯计算机系统有限公司 | Image description generation method, model training method, device and storage medium |
CN107870992A (en) * | 2017-10-27 | 2018-04-03 | 上海交通大学 | Editable image of clothing searching method based on multichannel topic model |
CN107798624B (en) * | 2017-10-30 | 2021-09-28 | 北京航空航天大学 | Technical label recommendation method in software question-and-answer community |
CN108256549B (en) * | 2017-12-13 | 2019-03-15 | 北京达佳互联信息技术有限公司 | Image classification method, device and terminal |
CN108399409B (en) | 2018-01-19 | 2019-06-18 | 北京达佳互联信息技术有限公司 | Image classification method, device and terminal |
CN110119505A (en) | 2018-02-05 | 2019-08-13 | 阿里巴巴集团控股有限公司 | Term vector generation method, device and equipment |
CN108595636A (en) * | 2018-04-25 | 2018-09-28 | 复旦大学 | The image search method of cartographical sketching based on depth cross-module state correlation study |
CN108830903B (en) * | 2018-04-28 | 2021-11-05 | 杨晓春 | Billet position detection method based on CNN |
CN109145936B (en) * | 2018-06-20 | 2019-07-09 | 北京达佳互联信息技术有限公司 | A kind of model optimization method and device |
CN110110122A (en) * | 2018-06-22 | 2019-08-09 | 北京交通大学 | Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval |
CN109214412A (en) * | 2018-07-12 | 2019-01-15 | 北京达佳互联信息技术有限公司 | A kind of training method and device of disaggregated model |
CN109213853B (en) * | 2018-08-16 | 2022-04-12 | 昆明理工大学 | CCA algorithm-based Chinese community question-answer cross-modal retrieval method |
EP3644616A1 (en) * | 2018-10-22 | 2020-04-29 | Samsung Electronics Co., Ltd. | Display apparatus and operating method of the same |
CN109472232B (en) * | 2018-10-31 | 2020-09-29 | 山东师范大学 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
CN110442721B (en) * | 2018-11-28 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Neural network language model, training method, device and storage medium |
CN111464881B (en) * | 2019-01-18 | 2021-08-13 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN109886326B (en) * | 2019-01-31 | 2022-01-04 | 深圳市商汤科技有限公司 | Cross-modal information retrieval method and device and storage medium |
CN109816039B (en) * | 2019-01-31 | 2021-04-20 | 深圳市商汤科技有限公司 | Cross-modal information retrieval method and device and storage medium |
CN110209822B (en) * | 2019-06-11 | 2021-12-21 | 中译语通科技股份有限公司 | Academic field data correlation prediction method based on deep learning and computer |
CN110337016B (en) * | 2019-06-13 | 2020-08-14 | 山东大学 | Short video personalized recommendation method and system based on multimodal graph convolution network, readable storage medium and computer equipment |
CN110647632B (en) * | 2019-08-06 | 2020-09-04 | 上海孚典智能科技有限公司 | Image and text mapping technology based on machine learning |
CN110503147B (en) * | 2019-08-22 | 2022-04-08 | 山东大学 | Multi-mode image classification system based on correlation learning |
CN111310453B (en) * | 2019-11-05 | 2023-04-25 | 上海金融期货信息技术有限公司 | User theme vectorization representation method and system based on deep learning |
CN111259152A (en) * | 2020-01-20 | 2020-06-09 | 刘秀萍 | Deep multilayer network driven feature aggregation category divider |
CN112257445B (en) * | 2020-10-19 | 2024-01-26 | 浙大城市学院 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
CN112507064B (en) * | 2020-11-09 | 2022-05-24 | 国网天津市电力公司 | Cross-modal sequence-to-sequence generation method based on topic perception |
CN114547259B (en) * | 2020-11-26 | 2024-05-24 | 北京大学 | Automatic formula description generation method and system based on topic relation graph |
CN112632969B (en) * | 2020-12-13 | 2022-06-21 | 复旦大学 | Incremental industry dictionary updating method and system |
CN113157959B (en) * | 2020-12-17 | 2024-05-31 | 云知声智能科技股份有限公司 | Cross-modal retrieval method, device and system based on multi-modal topic supplementation |
CN112836746B (en) * | 2021-02-02 | 2022-09-09 | 中国科学技术大学 | Semantic correspondence method based on consistency graph modeling |
CN115017911A (en) * | 2021-03-05 | 2022-09-06 | 微软技术许可有限责任公司 | Cross-modal processing for vision and language |
CN113051932B (en) * | 2021-04-06 | 2023-11-03 | 合肥工业大学 | Category detection method for network media event of semantic and knowledge expansion theme model |
CN113139468B (en) * | 2021-04-24 | 2023-04-11 | 西安交通大学 | Video abstract generation method fusing local target features and global features |
CN113298265B (en) * | 2021-05-22 | 2024-01-09 | 西北工业大学 | Heterogeneous sensor potential correlation learning method based on deep learning |
CN113297485B (en) * | 2021-05-24 | 2023-01-24 | 中国科学院计算技术研究所 | Method for generating cross-modal representation vector and cross-modal recommendation method |
CN113392196B (en) * | 2021-06-04 | 2023-04-21 | 北京师范大学 | Question retrieval method and system based on multi-mode cross comparison |
CN113343679B (en) * | 2021-07-06 | 2024-02-13 | 合肥工业大学 | Multi-mode subject mining method based on label constraint |
CN113516118B (en) * | 2021-07-29 | 2023-06-16 | 西北大学 | Multi-mode cultural resource processing method for joint embedding of images and texts |
CN113408282B (en) * | 2021-08-06 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for topic model training and topic prediction |
CN114880527B (en) * | 2022-06-09 | 2023-03-24 | 哈尔滨工业大学(威海) | Multi-modal knowledge graph representation method based on multi-prediction task |
CN118378168B (en) * | 2024-06-25 | 2024-09-06 | 北京联合永道软件股份有限公司 | Unstructured data modeling method and system |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559193A (en) * | 2013-09-10 | 2014-02-05 | 浙江大学 | Topic modeling method based on selected cell |
CN103559192A (en) * | 2013-09-10 | 2014-02-05 | 浙江大学 | Media-crossed retrieval method based on modal-crossed sparse topic modeling |
CN104317837A (en) * | 2014-10-10 | 2015-01-28 | 浙江大学 | Cross-modal searching method based on topic model |
CN104899253A (en) * | 2015-05-13 | 2015-09-09 | 复旦大学 | Cross-modality image-label relevance learning method facing social image |
Non-Patent Citations (1)
Title |
---|
"跨媒体组合语义深度学习";吴飞等;《浙江省信号处理学会2015年年会——信号处理在大数据》;20151031;第1-5页 |
Also Published As
Publication number | Publication date |
---|---|
CN105760507A (en) | 2016-07-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105760507B (en) | Cross-module state topic relativity modeling method based on deep learning | |
Zhang et al. | A quantum-inspired multimodal sentiment analysis framework | |
Liu et al. | A survey of sentiment analysis based on transfer learning | |
Peng et al. | An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges | |
Liu et al. | Image annotation via graph learning | |
Park et al. | Efficient extraction of domain specific sentiment lexicon with active learning | |
Gao et al. | Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework | |
Ranjan et al. | LFNN: Lion fuzzy neural network-based evolutionary model for text classification using context and sense based features | |
Li et al. | Modeling continuous visual features for semantic image annotation and retrieval | |
Niu et al. | Knowledge-based topic model for unsupervised object discovery and localization | |
Papadopoulos et al. | Image clustering through community detection on hybrid image similarity graphs | |
Sumathi et al. | An overview of automated image annotation approaches | |
Li et al. | Fusing semantic aspects for image annotation and retrieval | |
Tian et al. | Automatic image annotation based on Gaussian mixture model considering cross-modal correlations | |
Xie et al. | A semantic model for cross-modal and multi-modal retrieval | |
Wang et al. | Rare-aware attention network for image–text matching | |
Long et al. | Bi-calibration networks for weakly-supervised video representation learning | |
Wu et al. | Multiple hypergraph clustering of web images by miningword2image correlations | |
Chen et al. | An annotation rule extraction algorithm for image retrieval | |
Papapanagiotou et al. | Improving concept-based image retrieval with training weights computed from tags | |
Tian et al. | Scene graph generation by multi-level semantic tasks | |
CN105677830B (en) | A kind of dissimilar medium similarity calculation method and search method based on entity mapping | |
Guo | [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning | |
Xiao et al. | Research on multimodal emotion analysis algorithm based on deep learning | |
Xue et al. | Few-shot node classification via local adaptive discriminant structure learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190503 |