CN105760507B - Cross-modal topic correlation modeling method based on deep learning - Google Patents


Info

Publication number
CN105760507B
CN105760507B
Authority
CN
China
Prior art keywords
text
theme
image
document
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610099438.9A
Other languages
Chinese (zh)
Other versions
CN105760507A (en)
Inventor
张玥杰
程勇
刘志鑫
金城
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201610099438.9A
Publication of CN105760507A
Application granted
Publication of CN105760507B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G06F16/94 Hypermedia
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of cross-media correlation learning, and specifically discloses a cross-modal topic correlation modeling method based on deep learning. The method comprises two main algorithms: multi-modal document representation based on deep words, and relational topic modeling fused with cross-modal topic correlation learning. Deep learning techniques are used to construct deep semantic words and deep visual words, which describe, respectively, the semantic-description part and the image part of a multi-modal document. On the basis of this multi-modal document representation, the entire multi-modal document collection is modeled by building a cross-modal relational topic model, thereby describing both the generative process of multi-modal documents and the associations between different modalities. The method is highly accurate and adaptable. Considering multi-modal semantic information over large-scale multi-modal documents (text plus image) is of great importance for efficient cross-media information retrieval; the method improves retrieval relevance, enhances user experience, and has broad application prospects in the field of cross-media information retrieval.

Description

Cross-modal topic correlation modeling method based on deep learning
Technical field
The invention belongs to the field of cross-media correlation learning, and in particular relates to a deep-learning-based method for learning cross-modal image-text topic correlation.
Background technique
With the development of Internet technology and the maturation of Web 2.0, massive numbers of multi-modal documents have accumulated on the Internet. How to analyze and process the complex structure of these documents, and thereby provide theoretical support for practical applications such as cross-media retrieval, has become a very important research topic. Typically, a multi-modal document exists as a co-occurrence of several modalities: many web images come with user-defined descriptions or annotations, and many web documents contain illustrations. Although these multi-modal data are usually associated with each other, the semantic gap between the visual information of an image and its textual description leaves large differences between the modalities [1], which makes it very difficult to fully exploit the semantic associations between them. It is therefore essential to mine the implicit relationships behind data of different modalities and to fuse multi-modal information when modeling multi-modal documents [2,3]. Modeling multi-modal documents with topic models, and then mining the associations between modalities, is a key strategy. In research on cross-modal topic modeling, three interrelated problems need to be solved simultaneously:
1. Finding and constructing more representative and more valuable document elements to express, respectively, the image content and the text content of a multi-modal document.
2. Establishing a more reasonable topic correlation model to better describe the associations between the data of different modalities in a multi-modal document, i.e., the associations between visual images and textual descriptions.
3. Establishing an objective measuring mechanism for the intrinsic associations between images and text content through cross-modal topic correlation learning.
To solve the first problem, the key is to explore and establish a set of optimized document elements, so that the visual and semantic features of a multi-modal document can be expressed more accurately and more comprehensively.
To solve the second problem, the key is to establish a more robust probabilistic topic model that mines the implicit topic information behind the documents and maximizes the likelihood of the observed multi-modal documents.
To solve the third problem, the most effective approach is to map the attribute features of different modalities into a common embedding subspace, so as to maximize the correlation information between the modalities.
A number of researchers have proposed methods for modeling multi-modal data. From a modeling perspective, these methods fall roughly into two categories: statistical dependence modeling methods, and methods that build a joint probabilistic generative model.
(1) Statistical dependence modeling
The core idea of statistical dependence modeling is to map the data features of different modalities into the same latent space, in the expectation of maximally exposing the statistical correlations between them. Taking image and text as an example, heterogeneous image features and text features are mapped into the same common subspace via corresponding mapping matrices, and the correlation between an image and a text is computed in that subspace: the more relevant an image and a text are, the closer they lie, while a large distance implies low correlation. Canonical Correlation Analysis (CCA) is the most typical statistical dependence method. It obtains a pair of basis matrices by maximizing the correlation between the visual feature matrix and the semantic feature matrix; these basis matrices maximally preserve the correlation between visual and semantic features and provide the mappings into an isomorphic subspace. Visual feature vectors and semantic feature vectors are thereby mapped to the same dimensionality in the isomorphic subspace, where a cross-modal fused feature is constructed, realizing a unified representation of the different modalities of media data. Later work, such as Kernel CCA (KCCA) and Deep CCA (DCCA), explored the dependencies between image and text at a deeper level.
The work in [4] combines statistical modeling with topic models: it first uses latent Dirichlet allocation to extract visual topic features of images and textual topic features of texts, and then uses CCA to map the visual and textual topic features into an isomorphic subspace where their correlation is computed. [5] extends this work and computes the correlation using KCCA.
(2) Building a joint probabilistic generative model
Multi-modal topic models are the typical representatives of joint probabilistic generative models, and many recent works perform probabilistic topic modeling of the visual content and semantic descriptions in multi-modal documents [6,7,8,9,10]. In 2003, [Blei 2003] established a series of increasingly complex topic models [11], among which Correspondence Latent Dirichlet Allocation (Corr-LDA) is the best cross-modal topic model. Corr-LDA assumes a correspondence between the implicit topics of different modalities: the implicit topic behind an annotation comes from the implicit topics of the image's visual information. This assumption establishes a one-directional mapping, i.e., the generation of text words depends on the visual content of the image. Later, [Wang 2009] proposed a supervised topic model to learn the latent relations between images and annotation words [12], and [Putthividhya 2010] proposed a topic-regression multi-modal latent Dirichlet allocation model [13]. [Rasiwasia 2010] studied joint modeling of the text and image content of multi-modal documents [3]. [Nguyen 2013] proposed an image annotation method based on the joint distributions of features and words and of words and topics [9]. [Niu 2014] proposed a semi-supervised relational topic model to explicitly model the relationships between image content and images [14]. [Wang 2014] proposed a semi-supervised multi-modal mutual topic reinforcement model that explores how the topics of different modalities promote each other [15]. [Zheng 2014] proposed a supervised variant of DocNADE to model the joint distribution of the visual words, annotation words, and class label of an image [16]. [Chen 2015] addressed the modeling gap between image and text by constructing a visual-emotional LDA model [17].
As the above analysis shows, current methods have made progress in multi-modal document modeling; however, none of them fully considers the influence of the following three aspects:
(1) Deep information mining in multi-modal documents. Most existing image-annotation correlation learning methods only explore the associations between modalities using traditional visual feature representations and annotation features, without considering the deep features contained in the different modalities. For constructing the overall visual semantics and the internal semantic associations, this leads to a series of serious information-loss problems. Deep exploration of multi-modal documents can make up for this deficiency, so that the obtained feature elements represent multi-modal documents better.
(2) Relational topic correlation modeling based on deep analysis. When constructing the topic correlations of different modalities, most existing topic modeling methods rest on the assumption that the topics hidden behind the different modalities are identical. Such an assumption is usually too absolute and introduces unnecessary noise while constructing the topic correlation. It is therefore particularly important to construct a more reasonable assumption, fuse the deep feature information, and form a better relational topic correlation modeling mechanism.
(3) Cross-modal correlation learning via deep topic features. When computing the correlation between modalities, most existing multi-modal topic models directly match the topic distribution features hidden behind the different modalities in order to capture the intrinsic association between visual images and textual descriptions. Such direct matching, however, does not account well for the heterogeneity of images and text; mapping the deep topic features into a common space and learning their correlation there can mine the correlation well and solve the problem raised above.
It is therefore highly desirable to draw on existing mature techniques while taking the above problems into account, and to analyze and compute the topic correlation between different modalities more comprehensively. The present invention is motivated exactly by this. From parts to whole, it designs a novel technical framework (comprising three main algorithms) covering deep word construction in multi-modal documents, relational topic model construction, and heterogeneous topic correlation learning, so as to establish an effective cross-modal topic correlation computation method and ultimately improve cross-media image retrieval performance.
Summary of the invention
The object of the invention is to propose a cross-modal topic correlation modeling method based on deep learning, so as to improve cross-media social image retrieval performance.
The invention first proposes a novel deep cross-modal topic correlation model. The model is used to model a large-scale multi-modal corpus and can deeply analyze and understand the correlation information between images and text in multi-modal documents; with the constructed model, cross-media retrieval performance can be effectively promoted. The model mainly comprises the following components:
(1) Deep Word Construction. For a multi-modal document, deep learning techniques are used to construct deep words as basic representation elements. Deep words comprise deep visual words and deep text words: deep visual words better describe the visual content of the images in a document, while deep text words serve as the basic elements for describing its text content. Compared with traditional visual words and text words, deep words can mine the semantic information of a document at a deeper level. Constructed in this way, a multi-modal document can be better represented by deep words.
(2) Multimodal Topic Information Generation. On the basis of the constructed deep words, the topic information hidden behind the data of each modality is further mined using the LDA topic model. The topic model assumes that a common topic set lies behind the document collection and that each word in a document corresponds to one topic; under this assumption, the topic features behind each document are derived to represent the document further.
(3) Cross-modal Topic Correlation Analysis. The topics hidden behind documents of different modalities are assumed to be heterogeneous but correlated; for example, the topic corresponding to "wedding" behind a text document may be highly associated with the "white" topic behind an image. The topic features of the different modalities are therefore mapped into a common subspace, so as to find the correlation information between modalities.
(4) Relational Topic Modeling. When generating the topic features of different modalities, the relational topic model simultaneously considers the correlation information between image and document: when constructing the topics of a given document, it considers not only the information of the same modality but also the correlation information with the other modality, so that the final topics fuse the multi-modal information, ultimately yielding the topic distributions behind the multi-modal documents and the cross-modal correlation information.
Compared with existing multi-modal topic modeling methods, the proposed method has two major advantages in application. First, high accuracy: the constructed deep words replace traditional words, mine the deep information of each modality more thoroughly, and alleviate the problems caused by the semantic gap, thereby better promoting the efficiency of cross-media retrieval. Second, strong adaptability: because the constructed model models the associations between modalities well, it is suitable for bidirectional cross-media retrieval, i.e., both retrieving text with images and retrieving images with text, and it can easily be extended to cross-media retrieval involving other modalities (such as audio).
The specific steps of the deep-learning-based cross-modal topic correlation modeling method provided by the invention are as follows:
(1) Data preprocessing: collect image data of different modalities from a multimedia data set, obtain images and image description data, and remove annotation words from the image annotation data set that appear rarely or are useless.
(2) Multi-modal deep feature extraction: use deep learning methods to extract the visual features of the images and the semantic features of the image descriptions. Specifically, a Region-CNN (Convolutional Neural Network) model and a Skip-gram model are used to extract region features of the images and word features of the text, respectively. Region-CNN first detects a representative set of candidate regions in an image and then extracts the feature of each region with a pre-trained convolutional neural network; the Skip-gram model trains word feature vectors directly from the co-occurrence information between text words.
(3) Deep bag-of-words construction: first, cluster the image region features and text word features obtained in step (2) with the K-means clustering algorithm to obtain a deep visual dictionary and a deep text dictionary of limited size; then map all region features of each image onto the visual dictionary, thereby constructing deep visual bags-of-words; similarly, the words of all texts are mapped onto the text dictionary to obtain deep text bags-of-words.
(4) Multi-modal topic generation: simulate the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, and derive the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words.
(5) Relational topic model modeling fused with cross-modal topic correlation analysis: construct the corresponding relational topic model, which considers the correlation between the topic features of different modalities while constructing the topic model. The multi-modal topic features obtained in step (4) serve as initial values; the correlation between images and texts is computed using the correlation information between them; the computed correlation is used to update the topic information of the multi-modal documents; and correlation computation and topic distribution updating are iterated alternately until the final relational topic model is obtained.
(6) Cross-media information retrieval based on topic correlation: apply the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, use the correlation computation to obtain the data of the other modality most relevant to the query.
Each of the above steps is described in detail below:
(1) Data preprocessing
This step performs preliminary preprocessing of the collected data of different modalities. Specifically, because the annotations attached to images contain noise caused by the randomness of user annotation, this noise is removed by word-frequency filtering: words whose frequency falls below a threshold are filtered out to obtain a new dictionary.
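By way of illustration, a minimal sketch of this word-frequency filtering in Python; the threshold value and all names are assumptions, since the patent fixes no concrete value:

    from collections import Counter

    def filter_annotations(annotations, min_freq=5):  # min_freq is illustrative
        """Drop annotation words whose frequency over the whole data set is below min_freq."""
        freq = Counter(w for tags in annotations for w in tags)
        dictionary = {w for w, n in freq.items() if n >= min_freq}
        filtered = [[w for w in tags if w in dictionary] for tags in annotations]
        return filtered, dictionary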
(2) Multi-modal deep feature extraction
In the invention, the Region-CNN and Skip-gram models are used to extract the region features of images and the word features of text, respectively. They are described separately below.
Given an image, Region-CNN first uses selective search to select positions where objects are likely to appear as a candidate set (usually around 2,000), each existing in the form of a region, and then extracts a CNN feature for each region. In the concrete implementation, Region-CNN converts each image region to a fixed size of 227*227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers. Compared with traditional visual features, the main advantage of extracting visual features with Region-CNN is that the deep features extracted by the CNN are closer to the semantics of the image itself, which alleviates the semantic gap to a certain extent.
Given a text document, the feature vector of each word occurring in it is obtained by training a Skip-gram model. Skip-gram is a very effective method for learning distributed representations of text words; it was first proposed by Mikolov et al. in 2013 and has since been widely used in different natural language processing tasks. Compared with more traditional word-vector learning methods, the model captures the syntactic and semantic relationships between text words well and clusters semantically similar words together. An important advantage of Skip-gram is its high training efficiency on massive data, because no dense matrix operations are involved. Let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words occurring in TD, and TV the dictionary corresponding to the text words. For each word tw in TW, iv_tw and ov_tw are the input feature vector and output feature vector of tw, and Context(tw) is the set of words occurring in the context of tw; the context window size is set to 5 in the invention. All input and output vectors of the entire text data set are represented uniformly by one long parameter vector W ∈ R^(2*|TV|*dim), where dim is the dimensionality of the input and output vectors. The objective function of the whole Skip-gram model can then be described as follows:
For Skip-gram training, using the traditional softmax would incur a very high computational cost; the negative-sampling method is therefore used to approximate log P(tw_j | tw_i), with the following formula:
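In the standard negative-sampling form, matching the quantities explained below:

\log P(tw_j \mid tw_i) \approx \log \sigma(ov_{tw_j}^{\top} iv_{tw_i}) + \sum_{k=1}^{m} \mathbb{E}_{tw_k \sim P(tw)} \left[ \log \sigma(-ov_{tw_k}^{\top} iv_{tw_i}) \right]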
where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(tw).
(3) Deep bag-of-words construction
On the basis of the deep words obtained in step (2), deep bags-of-words are further constructed by the method of vector quantization [25]. Specifically, for the region candidate sets and corresponding features extracted with R-CNN, the region features contained in all images of the multi-modal document data set are first clustered with K-means into a fixed number of classes; the center of each cluster serves as the representative element of its class, and all classes together constitute a dictionary. Each candidate region of an image is then mapped onto a class: the mapping computes the Euclidean distance between the region feature and each class center, finds the nearest class, and increments the position of the vector corresponding to that class. In this way every image in the data set is represented in the form of a deep visual bag-of-words: each image corresponds to one vector whose dimensionality is the number of classes and whose elements count how often each class occurs in the image, denoted VT ∈ R^C, where C is the number of clusters. Similarly, the word vectors of the text documents are clustered to obtain a deep text dictionary, and the same mapping finally represents each text in the form of a deep text bag-of-words.
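A minimal sketch of this vector-quantization step, assuming the region features have already been extracted as numpy arrays; the dictionary size and function names are illustrative, not part of the patent:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_visual_bows(all_region_features, per_image_features, dict_size=100):
        # Learn the deep visual dictionary: each cluster center is one dictionary entry.
        km = KMeans(n_clusters=dict_size, n_init=10).fit(all_region_features)
        bows = []
        for feats in per_image_features:
            # Map every region to its nearest dictionary entry (Euclidean distance)
            # and accumulate counts, giving the image's deep visual bag-of-words.
            words = km.predict(feats)
            bows.append(np.bincount(words, minlength=dict_size))
        return np.stack(bows), km

The deep text bags-of-words follow the same recipe applied to the word vectors.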
(4) Multi-modal topic generation
Multi-modal information, i.e., the combination of the visual information of images with their semantic descriptions, is a very important way of expressing the content of a multi-modal document. Therefore, to better compute the cross-modal correlation between visual images and text annotations, it is particularly important to extract representative multi-modal features more accurately; such multi-modal feature representations can better explore the associations between the perceptual attributes of images and their semantic expression features.
Latent Dirichlet allocation (LDA) is a generative probabilistic model for discrete data that has received great attention in image/text research. LDA represents each document by a set of probability distributions, with every word in a document generated from an individual topic. Its advantage is that it considers the inherent statistical structure of the documents, such as the co-occurrence information of different words across the whole collection: it assumes that each word of each document is generated from an individual topic, and that the topic itself is generated from a Dirichlet distribution over all topics. LDA represents each document as a set of probability distribution vectors over the topic set, and these vectors are used to represent the visual and textual features of social images.
In step (4), images and text collections are probabilistically modeled separately using latent Dirichlet allocation. LDA assumes that a common topic set lies hidden behind a document collection, that behind each individual document lies a probability distribution over this topic set, and that each word of the document corresponds to a topic generated by this distribution; the distributions of all documents are not unrelated but are all generated from one common Dirichlet distribution. Under this model assumption, the deep visual bags-of-words and deep text bags-of-words obtained in step (3) serve as input, and the LDA model is used to derive the probabilistic topic distributions hidden behind the documents of each modality (text documents and visual documents), laying the foundation for establishing, in the next step, the relational topic model that fuses cross-modal correlation information.
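A sketch of this per-modality topic derivation with an off-the-shelf LDA implementation; the topic counts are illustrative, and the patent later refines these distributions with its own relational model in step (5):

    from sklearn.decomposition import LatentDirichletAllocation

    def derive_topic_features(visual_bows, text_bows, n_topics_v=50, n_topics_t=50):
        # Fit one LDA per modality on its deep bag-of-words counts.
        lda_v = LatentDirichletAllocation(n_components=n_topics_v).fit(visual_bows)
        lda_t = LatentDirichletAllocation(n_components=n_topics_t).fit(text_bows)
        # Per-document topic distributions, used as initial values in step (5).
        return lda_v.transform(visual_bows), lda_t.transform(text_bows)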
(5) Relational topic model modeling fused with cross-modal topic correlation analysis
Building the relational topic model integrates the correlation information between modalities into the topic model construction process. Specifically, the topic distributions of the different modalities obtained in step (4) serve as initial values; the correlation between the topic features of the modalities is computed by mapping them into a common subspace; and this correlation computation is integrated into the topic model, so that when deriving the topics hidden behind a document of one modality, the correlation information with the other modality is also considered. The final topic information thus considers not only the distribution information within the same modality but also the relations with the other modality.
The main goal of this step is to construct a joint probability distribution that maximizes the likelihood of the observed multi-modal documents. During model construction, the multi-modal document collection D_M is divided into three parts: the first part is the visual image collection D_V, the second part is the text description collection D_T, and the third part is the link set L_VT (representing the correlation information between images and texts). D_V consists of the deep visual word set DW_V, with DV_V the deep visual dictionary; D_T consists of the deep text word set DW_T, with DV_T the deep text dictionary. For l_vt ∈ L_VT, l_vt = 1 means that visual image d_v ∈ D_V and text description d_t ∈ D_T are relevant, while l_vt = 0 means that visual image d_v and text description d_t are irrelevant. Based on the above, the relational topic model is formalized as follows. Let TS_V be the visual topic set and TS_T the text topic set; α and β are two hyperparameters, where α is the Dirichlet prior over topics and β the Dirichlet prior over topic-deep-word distributions; θ_v is the topic distribution behind visual image d_v, and θ_t is the topic distribution behind text description d_t; Φ holds, for each topic, the multinomial distribution over all deep words; z is the topic behind each word actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet and multinomial distributions; N_d is the number of deep words in document d, and n indexes the n-th deep word. The generative process of the whole relational topic model is as follows:
(1) For each topic tv ∈ TS_V in the visual topic set:
(a) sample from the topic-visual-word Dirichlet distribution the multinomial distribution of tv over all visual words, i.e.: φ^v_tv ~ Dir(β_v).
(2) For each topic tt ∈ TS_T in the text topic set:
(a) sample from the topic-text-word Dirichlet distribution the multinomial distribution of tt over all text words, i.e.: φ^t_tt ~ Dir(β_t).
(3) For each visual document d ∈ D_V:
(a) sample from the Dirichlet distribution over the topic set the topic distribution behind d, i.e.: θ^v_d ~ Dir(α_v);
(b) for each deep visual word w^v_{d,n} in d:
i. sample the topic of the word from the topic distribution behind document d, i.e.: z^v_{d,n} ~ Mult(θ^v_d);
ii. sample the word at this position of the document from the topic-visual-word distribution, i.e.: w^v_{d,n} ~ Mult(φ^v_{z_{d,n}}).
(4) For each text document d ∈ D_T:
(a) sample from the Dirichlet distribution over the topic set the topic distribution behind d, i.e.: θ^t_d ~ Dir(α_t);
(b) for each deep text word w^t_{d,n} in d:
i. sample the topic of the word from the topic distribution behind document d, i.e.: z^t_{d,n} ~ Mult(θ^t_d);
ii. sample the word at this position of the document from the topic-text-word distribution, i.e.: w^t_{d,n} ~ Mult(φ^t_{z_{d,n}}).
(5) For each link l_vt ∈ L_VT, representing the correlation information between visual document d_v and text document d_t:
(a) compute the correlation from the topic features of d_v and d_t and sample l_vt accordingly, i.e.: l_vt ~ TCor(· | θ̄^v_{d_v}, θ̄^t_{d_t}; M_v, M_t), where θ̄^v_{d_v} and θ̄^t_{d_t} are the empirical topic distributions of documents d_v and d_t, and M_v and M_t are two mapping matrices that map the visual and text topic features into the common subspace, whose dimensionality is dim; TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation.
Based on the above process, the joint probability distribution finally constructed to model the entire multi-modal document collection is as follows:
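A joint distribution consistent with generative steps (1)-(5) above can be written as:

p(D_V, D_T, L_{VT} \mid \alpha, \beta) =
\prod_{tv \in TS_V} p(\phi^v_{tv} \mid \beta_v)
\prod_{tt \in TS_T} p(\phi^t_{tt} \mid \beta_t)
\cdot \prod_{d \in D_V} p(\theta^v_d \mid \alpha_v) \prod_{n=1}^{N_d} p(z^v_{d,n} \mid \theta^v_d) \, p(w^v_{d,n} \mid \phi^v_{z_{d,n}})
\cdot \prod_{d \in D_T} p(\theta^t_d \mid \alpha_t) \prod_{n=1}^{N_d} p(z^t_{d,n} \mid \theta^t_d) \, p(w^t_{d,n} \mid \phi^t_{z_{d,n}})
\cdot \prod_{l_{vt} \in L_{VT}} TCor(l_{vt} \mid \bar\theta^v_{d_v}, \bar\theta^t_{d_t}; M_v, M_t)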
where the leading terms correspond to the topic-deep-word generative process, the middle two products correspond to the generation of the deep visual words and the deep text words, and the last term corresponds to the image-description link generative process.
(6) Cross-media information retrieval (application of the relational topic model)
Step (6) applies the relational topic model established in step (5) to cross-media information retrieval. Taking image and text as an example, cross-media retrieval divides into two classes, text-query-image and image-query-text: text-query-image ranks all images by the relevance of each image to a given query text, computed with the relational topic model, while image-query-text ranks all text documents by their relevance to a given query image.
For a given query (e.g., querying text with an image), the relational topic model derives the corresponding topic features, and the topic-feature correlation computation of step (5) calculates the correlation information to the documents of the other modality (e.g., text documents); the text documents are then ranked by this correlation, returning the text documents most relevant to the query image. The same process applies to cross-media retrieval of images with a text query.
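A sketch of this ranking for the image-query-text direction, assuming the topic features and the mapping matrices M_v and M_t from step (5) are available; the names and the use of the dot-product/sigmoid mode of TCor are illustrative:

    import numpy as np

    def rank_texts_for_image(theta_v_query, theta_t_all, M_v, M_t):
        # Map both modalities into the common subspace learned in step (5).
        q = M_v @ theta_v_query          # mapped topic feature of the query image
        c = theta_t_all @ M_t.T          # mapped topic features of all candidate texts
        # Dot-product/sigmoid mode of TCor: higher score means more relevant.
        scores = 1.0 / (1.0 + np.exp(-(c @ q)))
        return np.argsort(-scores)       # candidate indices, most relevant first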
In conclusion the present invention is proposed for content isomerism and relevance between different modalities in multi-modal document A kind of cross-module state topic relativity modeling method based on deep learning, and then can be with the form of probabilistic model to entire multimode The generating process of state document is described, and the correlation between the document of different modalities is quantified.The method of the present invention can Effectively to apply to improve retrieval relevance across in media information retrieval for large-scale image, enhance user experience.
Brief description of the drawings
Fig. 1 is the flow chart of the invention.
Fig. 2 is a schematic diagram of constructing the deep-word representation of a multi-modal document.
Fig. 3 is a schematic diagram of the cross-modal relational topic correlation modeling process.
Fig. 4 compares the proposed relational topic model with traditional multi-modal topic models.
Fig. 5 shows the effect of cross-media information retrieval with the constructed relational topic model.
Specific embodiments
With reference to the drawings, the cross-modal correlation computation method of the invention for social images is described in detail below.
(1) Collecting the data objects
Data objects are collected to obtain images and image annotation data, and annotation words that appear rarely in the whole data set or are useless are removed. The acquired data set generally contains much noisy data, so appropriate processing and filtering must be performed before feature extraction. The obtained images are all in unified JPG format and need no transformation. The obtained image annotations, however, contain many meaningless words, such as words mixed with digits that carry no meaning. Some images carry dozens of annotations; for the annotations to describe the main information of an image well, useless and meaningless annotations should be discarded. The processing steps adopted are as follows:
Step 1: count the frequency of every word occurring in the annotations of the data set;
Step 2: filter out the meaningless words and the words containing digits;
Step 3: treat the words that occur rarely across the image annotations of the whole data set as minor information in the images and delete them.
Through the above steps, the processed image annotations are obtained. The reason for removing low-frequency words in Step 3 is that, within a cluster of images of the same class, the annotations contain many identical or semantically close words; filtering by occurrence frequency is therefore entirely reasonable.
(2) Multi-modal feature extraction
Fig. 2 shows the process of extracting features by means of deep learning and constructing deep words. In the invention, Region-CNN detects image regions and extracts the corresponding CNN features, whose dimensionality is 4,096. Usually Region-CNN selects around 2,000 regions per image as candidates, so the feature matrix of one image has 2,000*4,096 dimensions. If all regions of all images were clustered directly, the data volume would be M*2,000*4,096, where M is the number of images; obviously the space-time cost of such a volume is enormous. To solve this practical problem, the concrete operation combines internal and external clustering: first, all regions contained in each image are clustered internally once (into 10 classes), and then all internal centers are clustered externally once (into 100 classes), so the data volume actually entering the final external clustering is only M*10*4,096, greatly reducing the space-time cost of clustering. Another point to note is that both the Region-CNN visual features and the Skip-gram word features are extracted with pre-trained models: Region-CNN uses AlexNet pre-trained on ImageNet, while Skip-gram uses a model trained on Wikipedia documents containing 6 billion words. This is mainly because training deep neural networks requires large amounts of data; to avoid overfitting, models trained on large-scale data sets are used to extract the corresponding features from the real data.
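A minimal sketch of this internal-external clustering, assuming the per-image region features are available as numpy arrays; the class counts 10 and 100 follow the text above, everything else is illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    def two_stage_clustering(per_image_region_features, n_inner=10, n_outer=100):
        # Internal clustering: compress each image's ~2,000 regions to 10 centers,
        # so the external stage sees M*10 vectors instead of M*2,000.
        inner_centers = [
            KMeans(n_clusters=n_inner, n_init=5).fit(feats).cluster_centers_
            for feats in per_image_region_features
        ]
        # External clustering over all internal centers yields the visual dictionary.
        return KMeans(n_clusters=n_outer, n_init=5).fit(np.vstack(inner_centers))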
(3) Cross-modal topic correlation computation
Fig. 3 shows the cross-modal relational topic correlation modeling process. Using the notation introduced above, the correlation between a visual document d_v and a text document d_t is computed; M_v and M_t are the mapping matrices for the visual topic features and the text topic features, TCor(l_vt = 1) denotes the topic correlation of documents d_t and d_v, and TCor(l_vt = 0) their topic non-correlation. TCor(·) is defined as follows:
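A definition consistent with the two modes described next can be written as:

TCor(l_{vt}=1 \mid \bar\theta^v_{d_v}, \bar\theta^t_{d_t}) =
\sigma\big( (M_v \bar\theta^v_{d_v})^{\top} (M_t \bar\theta^t_{d_t}) \big)
\quad\text{or}\quad
\frac{(M_v \bar\theta^v_{d_v})^{\top} (M_t \bar\theta^t_{d_t})}{\|M_v \bar\theta^v_{d_v}\| \, \|M_t \bar\theta^t_{d_t}\|},
\qquad
TCor(l_{vt}=0 \mid \cdot) = 1 - TCor(l_{vt}=1 \mid \cdot)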
Two modes are used here for different data types: the first maps the dot product into the range [0,1] with the sigmoid function, while the second computes the topic correlation as the normalized cosine similarity of the two vectors. Meanwhile, based on the generated multi-modal topic distributions, the parameters M_v and M_t are trained by maximum likelihood estimation (MLE), i.e., by maximizing the log-likelihood of formula (4); the objective function is defined as follows:
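Consistent with the maximum-likelihood training described here, the objective can be written as:

(M_v^*, M_t^*) = \arg\max_{M_v, M_t} \sum_{l_{vt} \in L_{VT}} \log TCor\big( l_{vt} \mid \bar\theta^v_{d_v}, \bar\theta^t_{d_t}; M_v, M_t \big)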
Based on this objective function, the mapping matrices M_v and M_t are computed by gradient descent. Note that in the actual training process the number of multi-modal documents is |D_M|; since each multi-modal document normally contains one group of image and text, the number of image documents and the number of text documents are essentially the same and equal to the number of multi-modal documents, i.e., |D_V| = |D_T| = |D_M|. If a text and an image occurring in the same multi-modal document are treated as relevant and otherwise as irrelevant, the ratio of positive samples (relevant image-text pairs) to negative samples (irrelevant image-text pairs) in the resulting training data is about 1/|D_M|. Such a ratio would make the negative samples severely outnumber the positive ones; moreover, an image and a text not in the same multi-modal document are not necessarily completely irrelevant (they may belong to the same category). In practice the ratio of negative to positive samples is therefore set to 1:1, and negative samples are randomly selected under the constraint that the image and the text must not come from the same category.
(4) Multi-modal relational topic model derivation
Formula (3) shows the relational topic model constructed in the invention; its parameters are derived with the Gibbs sampling method [26]. The purpose of Gibbs sampling is to obtain the implicit topic behind every word of the multi-modal documents. During sampling, the marginal distribution over the deep words, the topic information corresponding to the words, and the corresponding cross-modal association links is first derived, as follows:
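In the standard collapsed form, per modality (the link terms factor in analogously), this marginal reads:

p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) \propto
\prod_{d} \frac{\prod_{tt} \Gamma(m_{d,tt} + \alpha)}{\Gamma\big( \sum_{tt} (m_{d,tt} + \alpha) \big)}
\cdot \prod_{tt} \frac{\prod_{w} \Gamma(n_{tt,w} + \beta)}{\Gamma\big( \sum_{w} (n_{tt,w} + \beta) \big)}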
where m_{d,tt} is the number of times topic tt occurs in document d, and n_{tt,w} is the number of words generated by topic tt in the whole document collection. From formula (6), the univariate probability distribution of the topic information z can be further derived, yielding the sampling rule for the topic behind each word of a document, as shown in formula (7):
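In the standard collapsed Gibbs form this sampling rule is:

p(z_{d,n} = tt \mid \mathbf{z}_{\neg(d,n)}, \mathbf{w}) \propto
\big( m_{d,tt}^{\neg} + \alpha \big) \cdot
\frac{n_{tt,w_{d,n}}^{\neg} + \beta}{\sum_{w} \big( n_{tt,w}^{\neg} + \beta \big)}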
where m^¬_{d,tt} denotes the number of occurrences of topic tt in document d after removing the current word, and n^¬_{tt,w} denotes the number of words assigned to topic tt after removing the current word. Based on this sampling rule, the implicit topic information behind every word of the whole document collection can be sampled. After each sampling pass, formula (5) is used to compute the mapping matrices M_t and M_v on the basis of the topic distribution obtained in the current pass, and the M_t and M_v obtained in the current pass serve as input to the next sampling pass; this alternation continues until the iteration termination condition is reached, yielding the final topic information and the mapping matrices M_t and M_v. Correspondingly, the other parameters of the relational topic model, such as Φ, θ_V and θ_T, are finally obtained by computing formula (8):
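These are the standard LDA posterior estimates:

\phi_{tt,w} = \frac{n_{tt,w} + \beta}{\sum_{w'} (n_{tt,w'} + \beta)}, \qquad
\theta_{d,tt} = \frac{m_{d,tt} + \alpha}{\sum_{tt'} (m_{d,tt'} + \alpha)}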
(5) Application example
Fig. 5 shows the effect of cross-media information retrieval with the constructed relational topic model, in two modes: one retrieves text with an image query (Image Query-to-Text), the other retrieves images with a text query (Text Query-to-Image); the relevance score is computed as shown in formula (9).
Bibliography
[1]Fan,J.P.;He,X.F.;Zhou,N.;Peng,J.Y.;and Jain,R.2012.Quantitative Characterization of Semantic Gaps for Learning Complexity Estimation and Inference Model Selection.IEEE Transactions on Multimedia 14(5):1414-1428.
[2]Datta,R.;Joshi,D.;Li,J.;and Wang,J.Z.2008.Image Retrieval:Ideas,Influences,and Trends of the New Age.ACM Computing Surveys(CSUR)40(2),Article 5.
[3]Rasiwasia,N.;Pereira,J.C.;Coviello,E.;Doyle,G.;Lanckriet,G.R.G.; Levy,R.;and Vasconcelos,N.2010.A New Approach to Cross-modal Multimedia Retrieval.In Proceedings of MM 2010,251-260.
[4]Pereira,J,C.;Coviello,E.;Doyle,G.;Rasiwasia,N.;Lanckriet,G.R.G.; Levy,R.;and Vasconcelos,N.2014.On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI)36(3):521-535.
[5]Barnard,K.;Duygulu,P.;Forsyth,D.;Freitas,N.;Blei,D.M.;and Jordan, M.I.2003.Matching Words and Pictures.Journal of Machine Learning Research.3: 1107-1135.
[6]Wang,X.;Liu,Y.;Wang,D.;and Wu,F.2013.Cross-media Topic Mining on Wikipedia.In Proceedings of MM 2013,689-692.
[7]Frome,A.;Corrado,G.S.;Shlens,J.;Bengio,S.;Dean,J.;Ranzato,M.A.;and Mikolov,T.2013.DeViSE:A Deep Visual-Semantic Embedding Model.In Proceedings of NIPS 2013.
[8]Feng,F.X.;Wang,X.J.;and Li,R.F.2014.Cross-modal Retrieval with Correspondence Autoencoder.In Proceedings of MM 2014,7-16.
[9]Nguyen,C.T.;Kaothanthong,N.;Tokuyama,T.;and Phan X.H.2013.A Feature-Word-Topic Model for Image Annotation and Retrieval.ACM Transactions on the Web 7(3),Article 12.
[10]Ramage,D.;Heymann,P.;Manning,C.D.;and Molina,H.G.2009.Clustering the Tagged Web.In Proceedings of WSDM 2009,54-63.
[11]Blei,D.M.;and Jordan,M.I.2003.Modeling Annotated Data.In Proceedings of SIGIR 2003,127-134.
[12]Wang,C.;Blei,D.;and Fei-Fei L.2009.Simultaneous Image Classification and Annotation.In Proceedings of CVPR 2009,1903-1910.
[13]Putthividhya,D.;Attias,H.T.;and Nagarajan,S.S.2010.Topic Regression Multi-Modal Latent Dirichlet Allocation for Image Annotation.In Proceedings of CVPR 2010,3408-3415.
[14]Niu,Z.X.;Hua,G.;Gao,X.B.;and Tian,Q.2014.Semi-supervised Relational Topic Model for Weakly Annotated Image Recognition in Social Media.In Proceedings of CVPR 2014,4233-4240.
[15]Wang,Y.F.;Wu,F.;Song,J.;Li,X.;and Zhuang,Y.T.2014.Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval.In Proceedings of MM 2014,307-316.
[16]Zheng,Y.;Zhang,Y.J.;and Larochelle,H.2014.Topic Modeling of Multimodal Data:an Autoregressive Approach.In Proceedings of CVPR 2014,1370-1377.
[17]Chen,T.;SalahEldeen,H.M.;He,X.N.;Kan,M.Y.;and Lu,D.Y.2015.VELDA: Relating an Image Tweet’s Text and Images.In Proceedings of AAAI 2015.
[18]Girshick,R.;Donahue,J.;Darrell,T.;and Malik,J.2014.Rich feature hierarchies for accurate object detection and semantic segmentation.In Proceedings of CVPR 2014,580-587.
[19]Hariharan,B.;Arbelaez,P.;Girshick,R.;and Malik, J.2014.Simultaneous Detection and Segmentation.In Proceedings of ECCV 2014, 297-312.
[20]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[21]Zhang,N.;Donahue,J.;Girshick,R.;and Darrell,T.2014.Part-Based R- CNNs for Fine-Grained Category Detection.In Proceedings of ECCV 2014,834-849.
[22]Mikolov,T.;Sutskever,I.;Chen,K.;Corrado,G.;and Dean, J.2013.Distributed Representations of Words and Phrases and their Compositionality.In Proceedings of NIPS 2013.
[23]Tang,D.Y.;Wei,F.R.;Qin,B.;Zhou,M.;and Liu,T.2014.Building Large- Scale Twitter-Specific Sentiment Lexicon:A Representation Learning Approach.In Proceedings of COLING 2014,172-182.
[24]Karpathy,A.;Joulin,A.;and Fei-Fei,L.2014.Deep Fragment Embeddings for Bidirectional Image Sentence Mapping.In Proceedings of NIPS 2014.
[25]Sivic,J.,and Zisserman,A.2003.Video Google:A Text Retrieval Approach to Object Matching in Videos.In Proceedings of ICCV 2003,2:1470- 1477.
[26]Griffiths,T.L.;and Steyvers,M.2004.Finding Scientific Topics.Proceedings of the National Academy of Sciences of the United States of America,101(1):5228-5235.

Claims (6)

1. A cross-modal topic correlation modeling method based on deep learning, characterized in that the specific steps are as follows:
(1) data preprocessing: collecting image data of different modalities from a multimedia data set, obtaining images and image description data, and removing annotation words from the image annotation data set that appear rarely or are useless;
(2) extracting multi-modal deep features: extracting the visual features of the images and the semantic features of the image descriptions with deep learning methods; specifically, a Region-CNN model and a Skip-gram model are used to extract region features of the images and word features of the text, respectively; wherein Region-CNN first detects a representative set of candidate regions in an image and then extracts the feature of each region with a pre-trained convolutional neural network, and the Skip-gram model trains word feature vectors directly from the co-occurrence information between text words;
(3) constructing deep bags-of-words: first clustering the image region features and text word features obtained in step (2) with the K-means clustering algorithm to obtain a deep visual dictionary and a deep text dictionary of limited size, then mapping all region features of each image onto the visual dictionary, thereby constructing deep visual bags-of-words; similarly, the words of all texts are mapped onto the text dictionary to obtain deep text bags-of-words;
(4) multi-modal topic generation: simulating the generative process of the entire multi-modal data set using the assumptions of the latent Dirichlet allocation model, and deriving the topic distribution features hidden behind the text collection and the image collection, making full use of the co-occurrence information between words;
(5) relational topic model modeling fused with cross-modal topic correlation analysis: constructing the corresponding relational topic model, i.e., considering the correlation between the topic features of different modalities while constructing the topic model; taking the multi-modal topic features obtained in step (4) as initial values, computing the correlation between images and texts using the correlation information between them, using the computed correlation to update the topic information of the multi-modal documents, and iterating correlation computation and topic distribution updating alternately to construct the final relational topic model;
(6) cross-media information retrieval based on topic correlation: applying the obtained cross-modal topic correlation to cross-media information retrieval; given a query of one modality, the correlation computation obtains the data of the other modality most relevant to the query.
2. The method according to claim 1, characterized in that in step (2), the Region-CNN and Skip-gram models are used to extract the region features of images and the word features of text respectively, the detailed process being as follows:
given an image, Region-CNN first uses selective search to select positions where objects are likely to appear as a candidate set, each existing in the form of a region, and then extracts a CNN feature for each region; in the concrete implementation, Region-CNN converts each image region to a fixed size of 227*227 pixels, and the convolutional network used for feature extraction consists of 5 convolutional layers and 2 fully connected layers;
given a text document, the feature vector of each word occurring in it is obtained by training a Skip-gram model; let TD denote the text description part of the entire multi-modal document data set, TW the set of all text words occurring in TD, and TV the dictionary corresponding to the text words; for each word tw in TW, iv_tw and ov_tw are the input feature vector and output feature vector of tw, and Context(tw) is the set of words occurring in the context of tw; the context window size is set to 5, and all input and output vectors of the entire text data set are represented uniformly by one long parameter vector W ∈ R^(2*|TV|*dim), where dim is the dimensionality of the input and output vectors; the objective function of the whole Skip-gram model is described as follows:
the negative-sampling method is used to approximate log P(tw_j | tw_i), with the following formula:
where σ(·) is the sigmoid function, m is the number of negative samples, and each negative sample is generated from a word-frequency-based noise distribution P(tw).
3. The method according to claim 1, characterized in that step (3), on the basis of the deep words obtained in step (2), further constructs deep bags-of-words by the method of vector quantization, the detailed process being: for the region candidate sets and corresponding features extracted with R-CNN, the region features contained in all images of the multi-modal document data set are first clustered with K-means into a fixed number of classes, the center of each cluster serving as the representative element of its class and all classes together constituting a dictionary; each candidate region of an image is then mapped onto a class, the mapping computing the Euclidean distance between the region feature and each class center, finding the nearest class, and incrementing the position of the vector corresponding to that class, so that every image in the data set is represented in the form of a deep visual bag-of-words, i.e., each image corresponds to one vector whose dimensionality is the number of classes and whose elements count how often each class occurs in the image, denoted VT ∈ R^C, where C is the number of clusters; similarly, the word vectors of the text documents are clustered to obtain a deep text dictionary, and the same mapping finally represents each text in the form of a deep text bag-of-words.
4. The method according to claim 1, characterized in that: in step (4), the latent Dirichlet allocation (LDA) model is used to carry out probabilistic modeling on the image and text collections respectively. LDA assumes that behind a document set lies one common set of topics, that each individual document corresponds to its own probability distribution over this topic set, and that each word in a document is generated from a topic drawn from that underlying distribution; furthermore, the distributions of the different documents are not unrelated to one another, but are all generated from one common Dirichlet distribution. On the basis of this model assumption, the deep visual bag-of-words and deep text bag-of-words obtained in step (3) are taken as input, and the LDA model is used to derive the hidden probabilistic topic distribution behind the documents of each modality.
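A minimal sketch of this per-modality topic inference, using scikit-learn's LatentDirichletAllocation as a stand-in; the topic count and the α/β prior values are assumptions:

```python
# Sketch of inferring per-document topic distributions from bag-of-words.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 20
visual_bow = np.random.randint(0, 5, size=(200, 100))  # images x visual words

lda = LatentDirichletAllocation(
    n_components=n_topics,
    doc_topic_prior=0.1,    # alpha: Dirichlet prior over document topics
    topic_word_prior=0.01,  # beta: Dirichlet prior over topic-word weights
)
theta_v = lda.fit_transform(visual_bow)  # per-image topic distribution
```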
5. The method according to claim 1, characterized in that: in step (5), during model construction, the multi-modal document set D_M is divided into three parts: the first part is the visual image set D_V, the second part is the text description set D_T, and the third part is the link set L_VT, which represents the association information between images and texts. D_V consists of the deep visual vocabulary set DW_V, with DV_V the deep visual dictionary, while the text description set D_T consists of the deep text vocabulary set DW_T, with DV_T the deep text dictionary. For a link l_vt ∈ L_VT, l_vt = 1 means that the visual image d_v ∈ D_V and the text description d_t ∈ D_T are related, while l_vt = 0 means that d_v and d_t are unrelated. Based on the above description, the relational topic model is formalized as follows: let DT_V be the visual topic set and DT_T the text topic set; α and β are two hyperparameters, where α is used for the Dirichlet distribution over topics and β for the topic-deep-vocabulary Dirichlet distribution; θ_v is the topic distribution behind the visual image d_v, and θ_t is the topic distribution behind the text description d_t; Φ is the multinomial distribution over all deep vocabulary corresponding to each topic; z is the underlying topic assignment of every word, actually generated from θ; Dir(·) and Mult(·) denote the Dirichlet distribution and the multinomial distribution respectively; N_d denotes the number of deep vocabulary items in document d, and n indexes the n-th deep vocabulary item. The generating process of the whole relational topic model is as follows:
(1) For each topic tv ∈ DT_V in the visual topic set:
Sample the multinomial distribution of tv over all visual words from the topic-visual-vocabulary Dirichlet distribution, i.e. φ^v_tv ~ Dir(β_v);
(2) For each topic tt ∈ DT_T in the text topic set:
Sample the multinomial distribution of tt over all text words from the topic-text-vocabulary Dirichlet distribution, i.e. φ^t_tt ~ Dir(β_t);
(3) For each visual document d ∈ D_V:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e. θ^v_d ~ Dir(α_v);
(b) For each deep visual word w^v_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e. z^v_{d,n} ~ Mult(θ^v_d);
ii. Sample the word itself from the topic-visual-vocabulary distribution, i.e. w^v_{d,n} ~ Mult(φ^v_{z_{d,n}});
(4) For each text document d ∈ D_T:
(a) Sample the topic distribution behind d from the Dirichlet distribution over the topic set, i.e. θ^t_d ~ Dir(α_t);
(b) For each deep text word w^t_{d,n} in d:
i. Sample the topic of the word from the topic distribution behind document d, i.e. z^t_{d,n} ~ Mult(θ^t_d);
ii. Sample the word itself from the topic-text-vocabulary distribution, i.e. w^t_{d,n} ~ Mult(φ^t_{z_{d,n}});
(5) For each link l_vt ∈ L_VT, representing the association information between visual document d_v and text document d_t:
(a) Compute the correlation of d_v and d_t from their topic features and sample l_vt accordingly, i.e. l_vt ~ TCor(l_vt | θ̄_v, θ̄_t, M_v, M_t), where θ̄_v and θ̄_t are the empirical topic distributions of documents d_v and d_t respectively, and M_v and M_t are two mapping matrices that map the visual and text topic features respectively into a common subspace of dimensionality dim; TCor(l_vt = 1) indicates topic correlation between documents d_t and d_v, and TCor(l_vt = 0) indicates topic non-correlation;
Based on the above process, a joint probability distribution is finally constructed to model the entire multi-modal document collection, which factorizes as follows:

p(D_M | α, β) = p(Φ | β) · p(w^v, z^v, θ^v | α_v, Φ^v) · p(w^t, z^t, θ^t | α_t, Φ^t) · p(L_VT | θ̄^v, θ̄^t, M_v, M_t)

where the first term corresponds to the topic-deep-vocabulary generating process, the middle two terms correspond to the generating processes of the deep visual vocabulary and the deep text vocabulary, and the last term represents the generating process of the image-description links.
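To make the generative process concrete, a minimal sketch that samples one modality's documents and one link; the topic counts, vocabulary sizes, and the sigmoid-of-cosine form of link_probability are illustrative choices for TCor, not the formula fixed by the claim:

```python
# Sketch of the relational topic model's generative process (one modality).
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 10, 50, 30          # topics, vocabulary size, words per document
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(V, beta), size=K)    # topic-word distributions

def generate_document():
    theta = rng.dirichlet(np.full(K, alpha))     # document's topic mixture
    z = rng.choice(K, size=N_d, p=theta)         # per-word topic assignments
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    return theta, z, w

def link_probability(theta_v, theta_t, M_v, M_t):
    """TCor(l_vt = 1): correlation of topic features in the common subspace."""
    a, b = M_v @ theta_v, M_t @ theta_t
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 / (1.0 + np.exp(-cos))            # squash to a probability

theta_v, _, _ = generate_document()
theta_t, _, _ = generate_document()
dim = 8                                          # common-subspace dimension
M_v, M_t = rng.normal(size=(dim, K)), rng.normal(size=(dim, K))
l_vt = rng.random() < link_probability(theta_v, theta_t, M_v, M_t)
```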
6. The method according to claim 1, characterized in that: step (6) applies the relational topic model established in step (5) to cross-media information retrieval. Cross-media information retrieval falls into two classes, namely text-query-image and image-query-text: text-query-image uses the relational topic model to compute, for a given query text, the degree of relevance of each image to that text and ranks all images accordingly, while image-query-text ranks all text documents according to their degree of relevance to a given query image;
For a given query image, the relational topic model is used to derive its corresponding topic features, and the correlation computation method for topic features obtained in step (5) is used to compute the correlation between the query and the documents of the other modality; the text documents are then ranked by the magnitude of this correlation, so as to return the text documents most relevant to the query image. Similarly, the above process also applies to the cross-media retrieval process that queries images with a text.
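A minimal sketch of this ranking step, reusing link_probability from the sketch above as the correlation computation; rank_texts and top_k are hypothetical names:

```python
# Sketch of image-query-text ranking with learned topic features.
import numpy as np

def rank_texts(query_theta_v, text_thetas, M_v, M_t, top_k=5):
    """Rank text documents by topic correlation with a query image."""
    scores = np.array([
        link_probability(query_theta_v, theta_t, M_v, M_t)
        for theta_t in text_thetas
    ])
    order = np.argsort(-scores)          # highest correlation first
    return order[:top_k], scores[order[:top_k]]
```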
CN201610099438.9A 2016-02-23 2016-02-23 Cross-module state topic relativity modeling method based on deep learning Expired - Fee Related CN105760507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610099438.9A CN105760507B (en) 2016-02-23 2016-02-23 Cross-module state topic relativity modeling method based on deep learning


Publications (2)

Publication Number Publication Date
CN105760507A CN105760507A (en) 2016-07-13
CN105760507B true CN105760507B (en) 2019-05-03

Family

ID=56330274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610099438.9A Expired - Fee Related CN105760507B (en) 2016-02-23 2016-02-23 Cross-module state topic relativity modeling method based on deep learning

Country Status (1)

Country Link
CN (1) CN105760507B (en)

Families Citing this family (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018046452A1 (en) 2016-09-07 2018-03-15 Koninklijke Philips N.V. Systems, methods, and apparatus for diagnostic inferencing with a multimodal deep memory network
CN106156374A (en) * 2016-09-13 2016-11-23 华侨大学 A kind of view-based access control model dictionary optimizes and the image search method of query expansion
US11068652B2 (en) * 2016-11-04 2021-07-20 Mitsubishi Electric Corporation Information processing device
CN108073576A (en) * 2016-11-09 2018-05-25 上海诺悦智能科技有限公司 Intelligent search method, searcher and search engine system
CN108198625B (en) * 2016-12-08 2021-07-20 推想医疗科技股份有限公司 Deep learning method and device for analyzing high-dimensional medical data
CN106777050B (en) * 2016-12-09 2019-09-06 大连海事大学 It is a kind of based on bag of words and to take into account the shoes stamp line expression and system of semantic dependency
CN106778880B (en) * 2016-12-23 2020-04-07 南开大学 Microblog topic representation and topic discovery method based on multi-mode deep Boltzmann machine
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106886783B (en) * 2017-01-20 2020-11-10 清华大学 Image retrieval method and system based on regional characteristics
CN107145910A (en) * 2017-05-08 2017-09-08 京东方科技集团股份有限公司 Performance generation system, its training method and the performance generation method of medical image
CN107273517B (en) * 2017-06-21 2021-07-23 复旦大学 Graph-text cross-modal retrieval method based on graph embedding learning
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
TWI636404B (en) * 2017-07-31 2018-09-21 財團法人工業技術研究院 Deep neural network and method for using the same and computer readable media
CN107480289B (en) * 2017-08-24 2020-06-30 成都澳海川科技有限公司 User attribute acquisition method and device
CN108305296B (en) * 2017-08-30 2021-02-26 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN107870992A (en) * 2017-10-27 2018-04-03 上海交通大学 Editable image of clothing searching method based on multichannel topic model
CN107798624B (en) * 2017-10-30 2021-09-28 北京航空航天大学 Technical label recommendation method in software question-and-answer community
CN108256549B (en) * 2017-12-13 2019-03-15 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108399409B (en) 2018-01-19 2019-06-18 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN110119505A (en) 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study
CN108830903B (en) * 2018-04-28 2021-11-05 杨晓春 Billet position detection method based on CNN
CN109145936B (en) * 2018-06-20 2019-07-09 北京达佳互联信息技术有限公司 A kind of model optimization method and device
CN110110122A (en) * 2018-06-22 2019-08-09 北京交通大学 Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN109214412A (en) * 2018-07-12 2019-01-15 北京达佳互联信息技术有限公司 A kind of training method and device of disaggregated model
CN109213853B (en) * 2018-08-16 2022-04-12 昆明理工大学 CCA algorithm-based Chinese community question-answer cross-modal retrieval method
EP3644616A1 (en) * 2018-10-22 2020-04-29 Samsung Electronics Co., Ltd. Display apparatus and operating method of the same
CN109472232B (en) * 2018-10-31 2020-09-29 山东师范大学 Video semantic representation method, system and medium based on multi-mode fusion mechanism
CN110442721B (en) * 2018-11-28 2023-01-06 腾讯科技(深圳)有限公司 Neural network language model, training method, device and storage medium
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109886326B (en) * 2019-01-31 2022-01-04 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
CN109816039B (en) * 2019-01-31 2021-04-20 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
CN110209822B (en) * 2019-06-11 2021-12-21 中译语通科技股份有限公司 Academic field data correlation prediction method based on deep learning and computer
CN110337016B (en) * 2019-06-13 2020-08-14 山东大学 Short video personalized recommendation method and system based on multimodal graph convolution network, readable storage medium and computer equipment
CN110647632B (en) * 2019-08-06 2020-09-04 上海孚典智能科技有限公司 Image and text mapping technology based on machine learning
CN110503147B (en) * 2019-08-22 2022-04-08 山东大学 Multi-mode image classification system based on correlation learning
CN111310453B (en) * 2019-11-05 2023-04-25 上海金融期货信息技术有限公司 User theme vectorization representation method and system based on deep learning
CN111259152A (en) * 2020-01-20 2020-06-09 刘秀萍 Deep multilayer network driven feature aggregation category divider
CN112257445B (en) * 2020-10-19 2024-01-26 浙大城市学院 Multi-mode push text named entity recognition method based on text-picture relation pre-training
CN112507064B (en) * 2020-11-09 2022-05-24 国网天津市电力公司 Cross-modal sequence-to-sequence generation method based on topic perception
CN114547259B (en) * 2020-11-26 2024-05-24 北京大学 Automatic formula description generation method and system based on topic relation graph
CN112632969B (en) * 2020-12-13 2022-06-21 复旦大学 Incremental industry dictionary updating method and system
CN113157959B (en) * 2020-12-17 2024-05-31 云知声智能科技股份有限公司 Cross-modal retrieval method, device and system based on multi-modal topic supplementation
CN112836746B (en) * 2021-02-02 2022-09-09 中国科学技术大学 Semantic correspondence method based on consistency graph modeling
CN115017911A (en) * 2021-03-05 2022-09-06 微软技术许可有限责任公司 Cross-modal processing for vision and language
CN113051932B (en) * 2021-04-06 2023-11-03 合肥工业大学 Category detection method for network media event of semantic and knowledge expansion theme model
CN113139468B (en) * 2021-04-24 2023-04-11 西安交通大学 Video abstract generation method fusing local target features and global features
CN113298265B (en) * 2021-05-22 2024-01-09 西北工业大学 Heterogeneous sensor potential correlation learning method based on deep learning
CN113297485B (en) * 2021-05-24 2023-01-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113392196B (en) * 2021-06-04 2023-04-21 北京师范大学 Question retrieval method and system based on multi-mode cross comparison
CN113343679B (en) * 2021-07-06 2024-02-13 合肥工业大学 Multi-mode subject mining method based on label constraint
CN113516118B (en) * 2021-07-29 2023-06-16 西北大学 Multi-mode cultural resource processing method for joint embedding of images and texts
CN113408282B (en) * 2021-08-06 2021-11-09 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN114880527B (en) * 2022-06-09 2023-03-24 哈尔滨工业大学(威海) Multi-modal knowledge graph representation method based on multi-prediction task
CN118378168B (en) * 2024-06-25 2024-09-06 北京联合永道软件股份有限公司 Unstructured data modeling method and system


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559193A (en) * 2013-09-10 2014-02-05 浙江大学 Topic modeling method based on selected cell
CN103559192A (en) * 2013-09-10 2014-02-05 浙江大学 Media-crossed retrieval method based on modal-crossed sparse topic modeling
CN104317837A (en) * 2014-10-10 2015-01-28 浙江大学 Cross-modal searching method based on topic model
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"跨媒体组合语义深度学习";吴飞等;《浙江省信号处理学会2015年年会——信号处理在大数据》;20151031;第1-5页

Also Published As

Publication number Publication date
CN105760507A (en) 2016-07-13

Similar Documents

Publication Publication Date Title
CN105760507B (en) Cross-module state topic relativity modeling method based on deep learning
Zhang et al. A quantum-inspired multimodal sentiment analysis framework
Liu et al. A survey of sentiment analysis based on transfer learning
Peng et al. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges
Liu et al. Image annotation via graph learning
Park et al. Efficient extraction of domain specific sentiment lexicon with active learning
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
Ranjan et al. LFNN: Lion fuzzy neural network-based evolutionary model for text classification using context and sense based features
Li et al. Modeling continuous visual features for semantic image annotation and retrieval
Niu et al. Knowledge-based topic model for unsupervised object discovery and localization
Papadopoulos et al. Image clustering through community detection on hybrid image similarity graphs
Sumathi et al. An overview of automated image annotation approaches
Li et al. Fusing semantic aspects for image annotation and retrieval
Tian et al. Automatic image annotation based on Gaussian mixture model considering cross-modal correlations
Xie et al. A semantic model for cross-modal and multi-modal retrieval
Wang et al. Rare-aware attention network for image–text matching
Long et al. Bi-calibration networks for weakly-supervised video representation learning
Wu et al. Multiple hypergraph clustering of web images by miningword2image correlations
Chen et al. An annotation rule extraction algorithm for image retrieval
Papapanagiotou et al. Improving concept-based image retrieval with training weights computed from tags
Tian et al. Scene graph generation by multi-level semantic tasks
CN105677830B (en) A kind of dissimilar medium similarity calculation method and search method based on entity mapping
Guo [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning
Xiao et al. Research on multimodal emotion analysis algorithm based on deep learning
Xue et al. Few-shot node classification via local adaptive discriminant structure learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20190503)