CN103559191B

CN103559191B - Cross-media ranking method based on latent space learning and bidirectional learning to rank

Info

Publication number: CN103559191B
Application number: CN201310410565.2A
Authority: CN
Inventors: 吴飞; 汤斯亮; 卢鑫炎; 邵健; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2013-09-10
Filing date: 2013-09-10
Publication date: 2016-09-14
Anticipated expiration: 2033-09-10
Also published as: CN103559191A

Abstract

The invention discloses a cross-media ranking method based on latent space learning and bidirectional learning to rank. The method comprises the following steps: 1) the ranking samples for text-to-image retrieval and the ranking samples for image-to-text retrieval are combined into a unified set of training samples; 2) cross-media ranking learning based on latent space learning and bidirectional learning to rank is performed on the constructed training samples, yielding a multimedia semantic space and a cross-media ranking model; 3) the learned cross-media ranking model is used to perform cross-media ranking. The invention applies to both text-to-image and image-to-text retrieval. Because both retrieval directions are modeled simultaneously, the resulting retrieval model has stronger semantic understanding and achieves higher retrieval precision than methods that learn to rank in only one direction.

Description

Cross-media ranking method based on latent space learning and bidirectional learning to rank

Technical field

The present invention relates to cross-media retrieval, and in particular to a cross-media ranking method based on latent space learning and bidirectional learning to rank.

Background technology

Images are among the most common file types and carry semantic content. An image is made up of individual pixels, and a computer cannot directly understand the semantic information the image contains. With the development of multimedia and network technology, ever more images are appearing. Retrieval technology helps users quickly find content of interest in massive amounts of data and has become one of the most important fields in applied computer technology. Traditional retrieval techniques, whether keyword-based or content-based, cannot adequately meet users' need to retrieve images with text or text with images. A keyword-based retrieval system requires images to be annotated in advance. Because the number of existing images is enormous, annotation is a huge and laborious task. Annotation is also inevitably influenced by the annotator's subjective judgment: different annotators may assign different keywords to the same image, so keywords often fail to reflect objectively the full semantics of an image. A content-based retrieval system does not require annotation; the user submits a query example and images are retrieved against it. Traditional content-based retrieval, however, has two weaknesses. First, the user can only retrieve media objects of the same modality as the query example, that is, images can only be retrieved with images. Second, there is a semantic gap between the low-level features of an image and its high-level semantics, which limits retrieval performance. To bridge the semantic gap between data of different modalities, to understand multimedia semantics better, and to meet users' need for cross-media queries, a semantics-based cross-media ranking method is highly desirable.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a cross-media ranking method based on latent space learning and bidirectional learning to rank.

The cross-media ranking method based on latent space learning and bidirectional learning to rank comprises the following steps:

1) the ranking samples for text-to-image retrieval and the ranking samples for image-to-text retrieval are combined into a unified set of training samples;

2) cross-media ranking learning based on latent space learning and bidirectional learning to rank is performed on the constructed training samples, yielding a multimedia semantic space and a cross-media ranking model;

3) the learned cross-media ranking model is used to perform cross-media ranking: after the user submits a query example, the coordinates of the query example in the multimedia semantic space are found first; then, using the coordinates of the cross-media objects in the multimedia semantic space, the similarity between the query example and every other cross-media object in that space is computed, and all cross-media objects are ranked by this similarity.

Step 1) comprises:

1) all text documents in the training samples are represented with the bag-of-words model, and each word is weighted with the TF-IDF method, so that a text is finally represented as t ∈ R^m, where m is the dimension of the text space;

2) SIFT local feature points are extracted from all image documents in the training samples and clustered with K-Means, and the cluster centers are used to build a codebook of visual words. Then, for every picture, the visual word in the codebook to which each local feature point belongs is determined by Euclidean nearest-neighbor search. Finally, as with text documents, the bag-of-words model and the TF-IDF method are used for feature representation, so that an image is finally represented as p ∈ R^n, where n is the dimension of the image space;

3) for the text-to-image direction, a ranked list of images is built for each query text, in which each image is labeled as semantically relevant or semantically irrelevant to the query. Each text-to-image training sample is then represented as a triple (t_i, p_i, y_i^*), i = 1, ..., N, where N is the number of training samples, t_i is the query text, p_i is the image set, y_i^* ∈ 𝒴 is the ranking on the image set, and 𝒴 denotes the whole ranking space;

4) for the image-to-text direction, a ranked list of text documents is built for each query image, in which each text document is labeled as semantically relevant or semantically irrelevant to the query. Each image-to-text training sample is represented as a triple (p_j, t_j, y_j^*), j = N + 1, ..., N + M, where M is the number of training samples, p_j is the query image, t_j is the text document set, and y_j^* is the ranking on the text document set;

5) the query lists of the two directions are combined to obtain the unified training samples.
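The feature-extraction steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patent's implementation: the function names `tfidf` and `visual_word_histogram` are hypothetical, the codebook is assumed to have been produced already by K-Means over SIFT descriptors, and the smoothed IDF formula is one common variant chosen here because the source does not specify which TF-IDF weighting is used.

```python
import numpy as np

def tfidf(counts):
    """Weight a (documents x vocabulary) count matrix with TF-IDF."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: counts divided by document length (empty documents are guarded).
    lengths = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    tf = counts / lengths
    # Smoothed inverse document frequency (an assumed variant, not taken from the source).
    df = (counts > 0).sum(axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return tf * idf

def visual_word_histogram(descriptors, codebook):
    """Count how many local descriptors fall on each codeword (Euclidean nearest neighbor)."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=codebook.shape[0])

# Toy data: two documents over a two-word vocabulary, three descriptors, two codewords.
text_features = tfidf([[1, 0], [1, 1]])
hist = visual_word_histogram([[0.0, 0.0], [0.9, 1.1], [1.0, 1.0]],
                             [[0.0, 0.0], [1.0, 1.0]])
```

Stacking one histogram per image into a matrix and passing it to `tfidf` gives the image vectors p ∈ R^n in the same way the word counts give the text vectors t ∈ R^m.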

Step 2) comprises:

1) a structural support vector machine is used to formulate an optimization problem whose objective function trades off the structural risk and the empirical risk of the mapping functions:

$$
\begin{aligned}
\min_{U, V, \xi_{1}, \xi_{2}} \quad & \frac{\lambda}{2} \|U\|_{F}^{2} + \frac{\lambda}{2} \|V\|_{F}^{2} + \frac{1}{N} \sum_{i=1}^{N} \xi_{1,i} + \frac{1}{M} \sum_{j=N+1}^{N+M} \xi_{2,j} \\
\text{s.t.} \quad & \forall i \in \{1, \dots, N\},\ \forall y \in \mathcal{Y}: \quad \delta F(t_{i}, p_{i}, y) \ge \Delta(y_{i}^{*}, y) - \xi_{1,i} \\
& \forall j \in \{N+1, \dots, N+M\},\ \forall y \in \mathcal{Y}: \quad \delta F(p_{j}, t_{j}, y) \ge \Delta(y_{j}^{*}, y) - \xi_{2,j}
\end{aligned}
\tag{1}
$$

Here U ∈ R^{k×m} is the matrix that maps text into the latent space, V ∈ R^{k×n} is the matrix that maps images into the latent space, k is the dimension of the latent space, and ξ_{1,i} and ξ_{2,j} are slack variables. The function F is defined as follows:

$$
F(t, p, y) = \sum_{i \in p^{+}} \sum_{j \in p^{-}} y_{ij} \frac{(Ut)^{T} V (p_{i} - p_{j})}{|p^{+}| \cdot |p^{-}|} \tag{2}
$$

$$
\delta F(t_{i}, p_{i}, y) = F(t_{i}, p_{i}, y_{i}^{*}) - F(t_{i}, p_{i}, y) \tag{3}
$$

$$
F(p, t, y) = \sum_{i \in t^{+}} \sum_{j \in t^{-}} y_{ij} \frac{(Vp)^{T} U (t_{i} - t_{j})}{|t^{+}| \cdot |t^{-}|} \tag{4}
$$

$$
\delta F(p_{j}, t_{j}, y) = F(p_{j}, t_{j}, y_{j}^{*}) - F(p_{j}, t_{j}, y) \tag{5}
$$

Here p⁺ and p⁻ denote the set of images relevant to the query text t and the set of images irrelevant to it, and t⁺ and t⁻ denote the set of texts relevant to the query image p and the set of texts irrelevant to it. The value of y_ij is determined by the ranking y: y_ij = 1 if document i is ranked ahead of document j, and y_ij = -1 otherwise. In addition, the loss function is defined as Δ(y^*, y) = 1 - MAP(y^*, y), where MAP is mean average precision, a performance measure commonly used in information retrieval. The larger the MAP value, the better the ranking performance and the smaller the value of the loss function;

2) the bidirectional ranking samples are fed in as training samples for the optimization problem, which is solved to obtain the parameters U and V.
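To make the quantities in the constraints concrete, the sketch below evaluates F from Eq. (2), the margin δF from Eq. (3), and the loss Δ for one text query. It is an illustration under stated assumptions, not the patent's solver: the names `score_F` and `map_loss` are hypothetical, a ranking y is encoded as an array `order` listing item indices from best to worst, and for a single query MAP reduces to average precision.

```python
import numpy as np

def score_F(U, V, t, P, relevant, order):
    """F(t, p, y) of Eq. (2) for one query text t and candidate images P (one row each)."""
    relevant = np.asarray(relevant, dtype=bool)
    order = np.asarray(order)
    # rank_pos[i] is the position of item i in the ranking (0 = top).
    rank_pos = np.empty(order.size, dtype=int)
    rank_pos[order] = np.arange(order.size)
    # s[i] = (U t)^T V p_i, the latent-space similarity of image i to the query.
    s = P @ V.T @ (U @ t)
    pos, neg = s[relevant], s[~relevant]
    # y_ij = +1 if relevant item i is ranked ahead of irrelevant item j, else -1.
    y = np.where(rank_pos[relevant][:, None] < rank_pos[~relevant][None, :], 1.0, -1.0)
    return float((y * (pos[:, None] - neg[None, :])).sum() / (pos.size * neg.size))

def map_loss(relevant, order):
    """Delta = 1 - average precision of the ranking given by `order`."""
    rel = np.asarray(relevant, dtype=bool)[np.asarray(order)]
    if not rel.any():
        return 1.0
    hits = np.cumsum(rel)
    ranks = np.arange(1, rel.size + 1)
    return 1.0 - float((hits[rel] / ranks[rel]).mean())

# Toy example: identity mappings, one relevant image (index 0) out of three.
U = np.eye(2)
V = np.eye(2)
t = np.array([1.0, 0.0])
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])
relevant = [True, False, False]
ideal, worst = [0, 1, 2], [1, 2, 0]

margin = score_F(U, V, t, P, relevant, ideal) - score_F(U, V, t, P, relevant, worst)
loss = map_loss(relevant, worst)
```

In this toy case the margin exceeds the loss, so the corresponding constraint in Eq. (1) holds with zero slack.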

Step 3) comprises:

1) when the input is a text query sample t, the similarity between every image p_i and the query sample is computed as f(t, p_i) = (Ut)^T V p_i, and the images are then ranked by similarity in descending order;

2) when the input is an image query sample p, the similarity between every text document t_i and the query sample is computed as f(t_i, p) = (Ut_i)^T V p, and the text documents are then ranked by similarity in descending order.
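The two ranking rules above amount to a pair of matrix products followed by a sort. The following sketch assumes that U and V have already been learned and that the candidate images and texts are stored as rows of the matrices `P` and `T`; the function names are hypothetical.

```python
import numpy as np

def rank_images(U, V, t, P):
    """Rank image rows of P for a text query t by f(t, p_i) = (U t)^T V p_i."""
    scores = P @ V.T @ (U @ t)
    # Stable sort on negated scores gives descending order with ties kept in input order.
    return np.argsort(-scores, kind="stable"), scores

def rank_texts(U, V, p, T):
    """Rank text rows of T for an image query p by f(t_i, p) = (U t_i)^T V p."""
    scores = T @ U.T @ (V @ p)
    return np.argsort(-scores, kind="stable"), scores

# Toy mappings: text space m = 3, image space n = 2, latent space k = 2.
U = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
V = np.eye(2)

image_order, image_scores = rank_images(
    U, V, np.array([1.0, 0.0, 0.0]),
    np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 0.0]]))
text_order, text_scores = rank_texts(
    U, V, np.array([0.0, 1.0]),
    np.array([[1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 1.0, 5.0]]))
```

Because both directions share the same U and V, a single learned model serves text-to-image and image-to-text queries.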

Compared with the background art, the invention has the following beneficial effects:

The invention proposes a new semantics-based retrieval method designed for bidirectional ranking training samples. Because the method combines two mechanisms, latent space learning and bidirectional learning to rank, it makes full use of the bidirectional ranking training samples and optimizes ranking performance directly, and therefore achieves better ranking performance.

Brief description of the drawings

Fig. 1 is a schematic diagram of the cross-media ranking method based on latent space learning and bidirectional learning to rank;

Fig. 2 shows examples of query results obtained with the invention.

Detailed description of the invention

The invention performs semantic understanding of multimedia documents by combining latent space learning and bidirectional learning to rank. All multimedia documents (text documents and images) are mapped into a unified multimedia semantic latent space, which enables cross-media ranking and retrieval.

Step 1) comprises:

Step 2) comprises:

$$
\begin{aligned}
\min_{U, V, \xi_{1}, \xi_{2}} \quad & \frac{\lambda}{2} \|U\|_{F}^{2} + \frac{\lambda}{2} \|V\|_{F}^{2} + \frac{1}{N} \sum_{i=1}^{N} \xi_{1,i} + \frac{1}{M} \sum_{j=N+1}^{N+M} \xi_{2,j} \\
\text{s.t.} \quad & \forall i \in \{1, \dots, N\},\ \forall y \in \mathcal{Y}: \quad \delta F(t_{i}, p_{i}, y) \ge \Delta(y_{i}^{*}, y) - \xi_{1,i} \\
& \forall j \in \{N+1, \dots, N+M\},\ \forall y \in \mathcal{Y}: \quad \delta F(p_{j}, t_{j}, y) \ge \Delta(y_{j}^{*}, y) - \xi_{2,j}
\end{aligned}
\tag{6}
$$

$$
F(t, p, y) = \sum_{i \in p^{+}} \sum_{j \in p^{-}} y_{ij} \frac{(Ut)^{T} V (p_{i} - p_{j})}{|p^{+}| \cdot |p^{-}|} \tag{7}
$$

$$
\delta F(t_{i}, p_{i}, y) = F(t_{i}, p_{i}, y_{i}^{*}) - F(t_{i}, p_{i}, y) \tag{8}
$$

$$
F(p, t, y) = \sum_{i \in t^{+}} \sum_{j \in t^{-}} y_{ij} \frac{(Vp)^{T} U (t_{i} - t_{j})}{|t^{+}| \cdot |t^{-}|} \tag{9}
$$

$$
\delta F(p_{j}, t_{j}, y) = F(p_{j}, t_{j}, y_{j}^{*}) - F(p_{j}, t_{j}, y) \tag{10}
$$

2) the bidirectional ranking samples are fed in as training samples for the optimization problem, which is solved to obtain the parameters U and V. The specific solving algorithm is as follows:

The search for the optimal y in step 3 and step 5 can be solved with the SVM-MAP method. The U and V finally obtained are, respectively, the linear mapping from the text space to the latent space and the linear mapping from the image space to the latent space.

Step 3) comprises:

Embodiment

To verify the effect of the invention, about 2,900 web pages were crawled from Wikipedia's "Picture of the day" pages and divided into 10 broad categories. Each web page contains one image and several paragraphs of related descriptive text, and these pages serve as the experimental data set. An image and a text are considered relevant if they belong to the same one of the 10 categories, and irrelevant otherwise. The data set is split into a training set and a test set; the method is trained on the training set and then evaluated independently on the test set. Features are extracted following the steps described in the invention; after common words and rare words are removed, the text space is set to 5000 dimensions and the image space to 1000 dimensions. To evaluate the performance of the algorithm objectively, the inventors use mean average precision (MAP). The MAP results are shown in Table 1:

Query direction	MAP@50	MAP@all
Text query image	0.3981	0.2123
Image query text	0.2599	0.2528

Table 1

Here MAP@50 is the MAP value computed over the top 50 returned results, and MAP@all is the MAP value computed over all returned results.
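The two columns of Table 1 differ only in the cutoff applied before averaging. The sketch below shows one way to compute such a metric; it is not the evaluation code behind the table. The function names are hypothetical, and normalizing by the number of relevant items found within the cutoff is an assumption, since the source does not say which AP@k convention was used.

```python
import numpy as np

def average_precision(ranked_relevance, k=None):
    """AP of one ranked list of 0/1 relevance labels, optionally truncated at k."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    if k is not None:
        rel = rel[:k]
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                 # relevant items seen so far
    ranks = np.arange(1, rel.size + 1)    # 1-based rank positions
    # Mean of precision@rank taken at each relevant position (assumed normalization).
    return float((hits[rel] / ranks[rel]).mean())

def mean_average_precision(ranked_lists, k=None):
    """MAP over several queries; k=None corresponds to MAP@all."""
    return float(np.mean([average_precision(r, k) for r in ranked_lists]))

# Two toy queries with relevance labels in ranked order.
queries = [[1, 0, 1, 0], [0, 1]]
map_all = mean_average_precision(queries)
map_at_1 = mean_average_precision(queries, k=1)
```

Calling `mean_average_precision(queries, k=50)` would give the MAP@50 analogue for real result lists.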

To better illustrate the results of the invention on cross-media retrieval, Fig. 2 presents examples of some query results. Fig. 2 shows the results of two retrievals with the invention, one text-to-image and one image-to-text. When image-to-text results are displayed, each returned text is represented by its corresponding image. The results show that, whether text is queried with an image or images are queried with text, the method of the invention performs well and can return results that traditional single-modality retrieval cannot achieve.

Claims

1. A cross-media ranking method based on latent space learning and bidirectional learning to rank, characterized by comprising the following steps:

3) the learned cross-media ranking model is used to perform cross-media ranking: after the user submits a query example, the coordinates of the query example in the multimedia semantic space are found first; then, using the coordinates of the cross-media objects in the multimedia semantic space, the similarity between the query example and every other cross-media object in that space is computed, and all cross-media objects are ranked by this similarity.

2. The cross-media ranking method based on latent space learning and bidirectional learning to rank according to claim 1, characterized in that step 1) comprises:

1) all text documents in the training samples are represented with the bag-of-words model, and each word is weighted with the TF-IDF method, so that a text is finally represented as t ∈ R^m, where m is the dimension of the text space;

2) SIFT local feature points are extracted from all image documents in the training samples and clustered with K-Means, and the cluster centers are used to build a codebook of visual words; then, for every picture, the visual word in the codebook to which each local feature point belongs is determined by Euclidean nearest-neighbor search; finally, as with text documents, the bag-of-words model and the TF-IDF method are used for feature representation, so that an image is finally represented as p ∈ R^n, where n is the dimension of the image space;

3) for the text-to-image direction, a ranked list of images is built for each query text, in which each image is labeled as semantically relevant or semantically irrelevant to the query; each text-to-image training sample is then represented as a triple (t_i, p_i, y_i^*), i = 1, ..., N, where N is the number of training samples, t_i is the query text, p_i is the image set, y_i^* ∈ 𝒴 is the ranking on the image set, and 𝒴 denotes the whole ranking space;

4) for the image-to-text direction, a ranked list of text documents is built for each query image, in which each text document is labeled as semantically relevant or semantically irrelevant to the query; each image-to-text training sample is represented as a triple (p_j, t_j, y_j^*), j = N + 1, ..., N + M, where M is the number of training samples, p_j is the query image, t_j is the text document set, and y_j^* is the ranking on the text document set;