CN103559191A

CN103559191A - Cross-media sorting method based on hidden space learning and two-way sorting learning

Info

Publication number: CN103559191A
Application number: CN201310410565.2A
Authority: CN
Inventors: 吴飞; 汤斯亮; 卢鑫炎; 邵健; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2013-09-10
Filing date: 2013-09-10
Publication date: 2014-02-05
Anticipated expiration: 2033-09-10
Also published as: CN103559191B

Abstract

The invention discloses a cross-media ranking method based on hidden space learning and bidirectional ranking learning. The method comprises: 1) combining ranking samples for text-to-image retrieval and ranking samples for image-to-text retrieval into a unified training set; 2) performing cross-media ranking learning, based on hidden space learning and bidirectional ranking learning, on that training set to obtain a multimedia semantic space and a cross-media ranking model; 3) using the learned cross-media ranking model to perform cross-media ranking. The method applies to both text-to-image and image-to-text retrieval. Because both retrieval directions are modeled simultaneously, the resulting retrieval model has stronger semantic understanding and higher retrieval accuracy than methods that learn to rank in one direction only.

Description

Cross-media ranking method based on hidden space learning and bidirectional ranking learning

Technical field

The present invention relates to cross-media retrieval, and in particular to a cross-media ranking method based on hidden space learning and bidirectional ranking learning.

Background technology

Images are now a very common type of file, and every image carries some semantic content. An image is made up of individual pixels, however, so a computer cannot directly understand the semantic information it contains. With the development of multimedia and network technology, the number of images keeps growing. Retrieval technology helps users quickly find the content they are interested in within massive amounts of data, and it has become an increasingly important field of applied computer technology.

Traditional retrieval techniques, whether keyword-based or content-based, cannot adequately meet users' need to retrieve images with text or text with images. A keyword-based retrieval system requires images to be annotated in advance. Because the number of existing images is enormous, annotation is extremely labor-intensive. Annotation is also inevitably influenced by the annotator's subjective judgment: different annotators may assign different keywords to the same image, so keywords often fail to reflect objectively all of the semantics an image contains. A content-based retrieval system needs no annotation; the user submits a query example to retrieve images. Traditional content-based retrieval has two weaknesses, however. First, the user can only retrieve media objects of the same modality as the query example, that is, images can only be retrieved with images. Second, there is a semantic gap between the low-level features of an image and its high-level semantics, which limits retrieval performance. To bridge the semantic gap between data of different modalities, to understand multimedia semantics better, and to meet users' need for cross-media queries, a semantics-based cross-media ranking method is of considerable value.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a cross-media ranking method based on hidden space learning and bidirectional ranking learning.

The cross-media ranking method based on hidden space learning and bidirectional ranking learning comprises the following steps:

1) Combine the ranking samples for text-to-image retrieval and the ranking samples for image-to-text retrieval into a unified set of training samples;

2) Perform cross-media ranking learning, based on hidden space learning and bidirectional ranking learning, on the constructed training samples to obtain a multimedia semantic space and a cross-media ranking model;

3) Use the learned cross-media ranking model to perform cross-media ranking: after a query example is submitted, first find the coordinates of the query example in the multimedia semantic space; then, using the coordinates of the cross-media objects in that space, compute the similarity between the query example and every other cross-media object in the multimedia semantic space, and rank all cross-media objects by this similarity.

Step 1) comprises:

1) Represent every text document in the training samples with a bag-of-words model and weight each word with the TF-IDF method, so that a text is finally represented as $t \in \mathbb{R}^m$, where $m$ is the dimension of the text space;

2) Extract SIFT local feature points from every image document in the training samples and cluster these local feature points with K-Means, using the cluster centers to build a codebook of visual words. Then, for each image, use Euclidean nearest-neighbor search to determine which visual word in the codebook each of its local feature points belongs to. Finally, as with text documents, represent the image with a bag-of-words model and the TF-IDF method, so that an image is finally represented as $p \in \mathbb{R}^n$, where $n$ is the dimension of the image space;

3) For the text-to-image direction, build a ranked list of images for each query text, in which every image is labeled as semantically relevant or semantically irrelevant to the query. The training samples for text-to-image retrieval are therefore represented as triples $(t_i, p_i, y_i^*)$, $i = 1, \dots, N$, where $N$ is the number of training samples, $t_i$ is the query text, $p_i$ is the image set, $y_i^* \in \mathcal{Y}$ is the ranking over the image set, and $\mathcal{Y}$ denotes the whole ranking space;

4) For the image-to-text direction, build a ranked list of text documents for each query image, in which every text document is labeled as semantically relevant or semantically irrelevant to the query. The training samples for image-to-text retrieval are represented as triples $(p_j, t_j, y_j^*)$, $j = N + 1, \dots, N + M$, where $M$ is the number of training samples, $p_j$ is the query image, $t_j$ is the text document set, and $y_j^* \in \mathcal{Y}$ is the ranking over the text document set;

5) Combine the query lists of the two directions to obtain the unified training samples.
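The feature representation in sub-steps 1) and 2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that SIFT descriptors and a K-Means codebook have already been computed elsewhere, and the function names `tfidf` and `visual_word_histogram` as well as the smoothed IDF variant are assumptions, since the text does not specify the exact TF-IDF formula.

```python
import numpy as np

def tfidf(counts):
    """Weight a (documents x vocabulary) count matrix with TF-IDF."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: counts normalised by document length (guard empty docs).
    lengths = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(lengths, 1.0)
    # Smoothed inverse document frequency (one common variant, an assumption).
    df = (counts > 0).sum(axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return tf * idf

def visual_word_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest code word and count them."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every descriptor to every code word.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)
    return np.bincount(nearest, minlength=codebook.shape[0])

# Toy example: a 2-word codebook in a 2-dimensional descriptor space.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
image_a = visual_word_histogram([[0.1, 0.2], [9.0, 9.5], [0.3, 0.1]], codebook)
image_b = visual_word_histogram([[9.9, 10.1]], codebook)
features = tfidf(np.vstack([image_a, image_b]))  # rows play the role of p in R^n
```

The same `tfidf` function can be applied to a word-count matrix of text documents to obtain the vectors $t \in \mathbb{R}^m$.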

Step 2) comprises:

1) Use a structural support vector machine to formulate an optimization problem whose objective makes the mapping functions strike a compromise between structural risk and empirical risk:

$$
\begin{aligned}
\min_{U, V, \xi_1, \xi_2} \quad & \frac{\lambda}{2} \|U\|_F^2 + \frac{\lambda}{2} \|V\|_F^2 + \frac{1}{N} \sum_{i=1}^{N} \xi_{1,i} + \frac{1}{M} \sum_{j=N+1}^{N+M} \xi_{2,j} \\
\text{s.t.} \quad & \forall i \in \{1, \dots, N\},\ \forall y \in \mathcal{Y}: \quad \delta F(t_i, p_i, y) \ge \Delta(y_i^*, y) - \xi_{1,i}, \\
& \forall j \in \{N+1, \dots, N+M\},\ \forall y \in \mathcal{Y}: \quad \delta F(p_j, t_j, y) \ge \Delta(y_j^*, y) - \xi_{2,j}.
\end{aligned}
\qquad (1)
$$

Here $U \in \mathbb{R}^{k \times m}$ is the matrix that maps text into the hidden space, $V \in \mathbb{R}^{k \times n}$ is the matrix that maps images into the hidden space, $k$ is the dimension of the hidden space, and $\xi_{1,i}$ and $\xi_{2,j}$ are slack variables. The function $F$ is defined as follows:

$$F(t, p, y) = \sum_{i \in p^+} \sum_{j \in p^-} y_{ij} \frac{(Ut)^T V (p_i - p_j)}{|p^+| \cdot |p^-|} \qquad (2)$$

$$\delta F(t_i, p_i, y) = F(t_i, p_i, y_i^*) - F(t_i, p_i, y) \qquad (3)$$

$$F(p, t, y) = \sum_{i \in t^+} \sum_{j \in t^-} y_{ij} \frac{(Vp)^T U (t_i - t_j)}{|t^+| \cdot |t^-|} \qquad (4)$$

$$\delta F(p_j, t_j, y) = F(p_j, t_j, y_j^*) - F(p_j, t_j, y) \qquad (5)$$

Here $p^+$ and $p^-$ denote the set of images relevant to the query text $t$ and the set of images irrelevant to it, and $t^+$ and $t^-$ denote the set of texts relevant to the query image $p$ and the set of texts irrelevant to it. The value of $y_{ij}$ is determined by the ranking $y$: if document $i$ is ranked ahead of document $j$, then $y_{ij} = 1$; otherwise $y_{ij} = -1$. In addition, the loss function is defined as $\Delta(y^*, y) = 1 - \mathrm{MAP}(y^*, y)$, where MAP is mean average precision, a performance measure commonly used in information retrieval. The larger the MAP value, the better the ranking performance and the smaller the value of the loss function;

2) Input the bidirectional ranking samples as the training samples of the optimization problem and solve it to obtain the parameters $U$ and $V$.
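To make the quantities in the constraints concrete, the sketch below evaluates $F(t, p, y)$ from equation (2), $\delta F$ from equation (3), and the loss $\Delta = 1 - \mathrm{MAP}$ for a single text query (for one query, MAP reduces to average precision). It is an illustration under assumptions rather than the patent's code: the function names are hypothetical, a ranking is encoded as a list of image indices from best to worst, and the toy matrices $U$ and $V$ are identities chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def pair_signs(ranking, relevant):
    """y_ij for relevant i and irrelevant j: +1 if i is ranked ahead of j, else -1."""
    relevant = np.asarray(relevant, dtype=bool)
    position = np.empty(len(ranking), dtype=int)
    position[np.asarray(ranking)] = np.arange(len(ranking))
    pos_idx = np.flatnonzero(relevant)
    neg_idx = np.flatnonzero(~relevant)
    ahead = position[pos_idx][:, None] < position[neg_idx][None, :]
    return np.where(ahead, 1.0, -1.0), pos_idx, neg_idx

def joint_score(U, V, t, P, relevant, ranking):
    """F(t, p, y) of equation (2); the rows of P are image feature vectors."""
    y, pos_idx, neg_idx = pair_signs(ranking, relevant)
    s = P @ V.T @ (U @ t)  # (Ut)^T V p_i for every image
    diffs = s[pos_idx][:, None] - s[neg_idx][None, :]
    return float((y * diffs).sum() / (len(pos_idx) * len(neg_idx)))

def average_precision(ranking, relevant):
    """Average precision of one ranking against binary relevance labels."""
    hits = np.asarray(relevant, dtype=bool)[np.asarray(ranking)]
    if not hits.any():
        return 0.0
    return float((np.cumsum(hits)[hits] / (np.flatnonzero(hits) + 1)).mean())

# Toy data: k = m = n = 2, three candidate images, images 0 and 2 are relevant.
U = np.eye(2)
V = np.eye(2)
t = np.array([1.0, 0.0])
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])
relevant = np.array([True, False, True])

ideal = np.argsort(~relevant, kind="stable")  # y*: relevant images first
wrong = np.array([1, 0, 2])                   # a ranking y with the irrelevant image on top

F_star = joint_score(U, V, t, P, relevant, ideal)
delta_F = F_star - joint_score(U, V, t, P, relevant, wrong)  # equation (3)
loss = 1.0 - average_precision(wrong, relevant)              # Delta(y*, y)
```

In this toy case `delta_F` exceeds `loss`, so the corresponding constraint in (1) holds with a slack of zero. The image-to-text direction of equations (4) and (5) follows by swapping the roles of $U$, $V$ and of texts and images.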

Step 3) comprises:

1) When the input is a text query $t$, compute for every image $p_i$ its similarity to the query with the formula $f(t, p_i) = (Ut)^T V p_i$, then rank the images in descending order of similarity;

2) When the input is an image query $p$, compute for every text document $t_i$ its similarity to the query with the formula $f(t_i, p) = (Ut_i)^T V p$, then rank the text documents in descending order of similarity.
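The two ranking directions of step 3) can be sketched as below. The sketch assumes that $U$ and $V$ have already been learned; the function names `rank_images` and `rank_texts` and the small example matrices are illustrative assumptions, not values from the patent.

```python
import numpy as np

def rank_images(U, V, t, images):
    """Rank images (rows of `images`) for a text query t by f(t, p_i) = (Ut)^T V p_i."""
    scores = images @ V.T @ (U @ t)
    return np.argsort(-scores, kind="stable"), scores

def rank_texts(U, V, p, texts):
    """Rank texts (rows of `texts`) for an image query p by f(t_i, p) = (Ut_i)^T V p."""
    scores = texts @ U.T @ (V @ p)
    return np.argsort(-scores, kind="stable"), scores

# Toy setting: text space m = 3, image space n = 2, hidden space k = 2.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
V = np.eye(2)

images = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
image_order, image_scores = rank_images(U, V, np.array([1.0, 0.0, 0.0]), images)

texts = np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 5.0]])
text_order, text_scores = rank_texts(U, V, np.array([0.0, 1.0]), texts)
```

Both directions use the same pair of matrices, which is what lets a single learned model serve text-to-image and image-to-text queries.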

Compared with the background art, the invention has the following beneficial effects:

The invention proposes a new retrieval method based on semantic content for bidirectional ranking training samples. Because the method combines two mechanisms, hidden space learning and bidirectional ranking learning, it makes full use of the bidirectional ranking training samples while optimizing ranking performance directly, and it therefore achieves better ranking performance.

Brief description of the drawings

Fig. 1 is a schematic diagram of the cross-media ranking method based on hidden space learning and bidirectional ranking learning;

Fig. 2 shows examples of query results of the invention.

Detailed description

The invention achieves semantic understanding of multimedia documents by combining hidden space learning with bidirectional ranking learning. All multimedia documents (text documents and images) are mapped into a unified hidden space of multimedia semantics, which makes cross-media ranking and retrieval possible.

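Step 1) comprises:

3) For the text-to-image direction, build a ranked list of images for each query text, in which every image is labeled as semantically relevant or semantically irrelevant to the query. The training samples for text-to-image retrieval are therefore represented as triples $(t_i, p_i, y_i^*)$, $i = 1, \dots, N$, where $N$ is the number of training samples, $t_i$ is the query text, $p_i$ is the image set, $y_i^* \in \mathcal{Y}$ is the ranking over the image set, and $\mathcal{Y}$ denotes the whole ranking space;

4) For the image-to-text direction, build a ranked list of text documents for each query image in the same way. The training samples for image-to-text retrieval are represented as triples $(p_j, t_j, y_j^*)$, $j = N + 1, \dots, N + M$, where $M$ is the number of training samples, $p_j$ is the query image, $t_j$ is the text document set, and $y_j^* \in \mathcal{Y}$ is the ranking over the text document set;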
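One way to assemble the bidirectional training samples is sketched below. The sketch assumes that relevance is derived from category labels (an image and a text are relevant when they share a category), and it stores the ground-truth ranking implicitly as a boolean relevance mask. The `RankingSample` container and the function name are hypothetical.

```python
from collections import namedtuple
import numpy as np

# direction: "text2image" or "image2text"
# query: feature vector of the query
# candidates: one row per candidate document of the other modality
# relevant: boolean mask over the candidates, which determines the ideal ranking y*
RankingSample = namedtuple("RankingSample", "direction query candidates relevant")

def build_bidirectional_samples(texts, text_labels, images, image_labels):
    """Build text-to-image and image-to-text samples and merge them into one list."""
    texts = np.asarray(texts, dtype=float)
    images = np.asarray(images, dtype=float)
    text_labels = np.asarray(text_labels)
    image_labels = np.asarray(image_labels)
    samples = []
    # Text-to-image direction: each text queries the whole image set.
    for t, label in zip(texts, text_labels):
        samples.append(RankingSample("text2image", t, images, image_labels == label))
    # Image-to-text direction: each image queries the whole text set.
    for p, label in zip(images, image_labels):
        samples.append(RankingSample("image2text", p, texts, text_labels == label))
    return samples

# Toy data: three texts (m = 2) and two images (n = 3) in two categories.
texts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
images = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
samples = build_bidirectional_samples(texts, [0, 1, 0], images, [0, 1])
```

The first $N$ entries of the list are text-to-image samples and the remaining $M$ are image-to-text samples, matching the index ranges used in the optimization problem.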

Step 2) comprises:

1) Use a structural support vector machine to formulate the following optimization problem:

$$
\begin{aligned}
\min_{U, V, \xi_1, \xi_2} \quad & \frac{\lambda}{2} \|U\|_F^2 + \frac{\lambda}{2} \|V\|_F^2 + \frac{1}{N} \sum_{i=1}^{N} \xi_{1,i} + \frac{1}{M} \sum_{j=N+1}^{N+M} \xi_{2,j} \\
\text{s.t.} \quad & \forall i \in \{1, \dots, N\},\ \forall y \in \mathcal{Y}: \quad \delta F(t_i, p_i, y) \ge \Delta(y_i^*, y) - \xi_{1,i}, \\
& \forall j \in \{N+1, \dots, N+M\},\ \forall y \in \mathcal{Y}: \quad \delta F(p_j, t_j, y) \ge \Delta(y_j^*, y) - \xi_{2,j}.
\end{aligned}
\qquad (6)
$$

Here $U \in \mathbb{R}^{k \times m}$ is the matrix that maps text into the hidden space, $V \in \mathbb{R}^{k \times n}$ is the matrix that maps images into the hidden space, and $k$ is the dimension of the hidden space. The function $F$ is defined as follows:

$$F(t, p, y) = \sum_{i \in p^+} \sum_{j \in p^-} y_{ij} \frac{(Ut)^T V (p_i - p_j)}{|p^+| \cdot |p^-|} \qquad (7)$$

$$\delta F(t_i, p_i, y) = F(t_i, p_i, y_i^*) - F(t_i, p_i, y) \qquad (8)$$

$$F(p, t, y) = \sum_{i \in t^+} \sum_{j \in t^-} y_{ij} \frac{(Vp)^T U (t_i - t_j)}{|t^+| \cdot |t^-|} \qquad (9)$$

$$\delta F(p_j, t_j, y) = F(p_j, t_j, y_j^*) - F(p_j, t_j, y) \qquad (10)$$

2) Input the bidirectional ranking samples as the training samples of the optimization problem and solve it to obtain the parameters $U$ and $V$. The specific solution algorithm is as follows:

[Algorithm listing not reproduced in this text.]

The search for the optimal $y$ in steps 3 and 5 of the algorithm can be carried out with the SVM-MAP method. The $U$ and $V$ finally obtained are, respectively, the linear mapping from the text space to the hidden space and the linear mapping from the image space to the hidden space.
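Since the steps of the solver are not listed here, the following worked equations are offered only as background that any gradient-based or cutting-plane solver for this problem would likely need. They follow directly from the definition of $F(t, p, y)$; the symbol $d(y)$ is introduced here for convenience and does not appear in the source.

```latex
% Collect the pairwise image differences into a single vector d(y) in R^n:
d(y) = \frac{1}{|p^+| \cdot |p^-|} \sum_{i \in p^+} \sum_{j \in p^-} y_{ij} \, (p_i - p_j),
\qquad
F(t, p, y) = (Ut)^T V \, d(y).

% F is bilinear in U and V, so its partial derivatives are
\frac{\partial F}{\partial U} = V \, d(y) \, t^T \in \mathbb{R}^{k \times m},
\qquad
\frac{\partial F}{\partial V} = U t \, d(y)^T \in \mathbb{R}^{k \times n}.

% For the ground-truth ranking y^*, every y_{ij} = 1, and d(y^*) reduces to a difference of means:
d(y^*) = \frac{1}{|p^+|} \sum_{i \in p^+} p_i - \frac{1}{|p^-|} \sum_{j \in p^-} p_j.
```

The image-to-text direction is symmetric: exchange $U$ with $V$ and the text vectors with the image vectors.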

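Step 3) comprises:

1) When the input is a text query $t$, compute for every image $p_i$ its similarity to the query, $f(t, p_i) = (Ut)^T V p_i$, and rank the images in descending order of similarity;

2) When the input is an image query $p$, compute for every text document $t_i$ its similarity to the query, $f(t_i, p) = (Ut_i)^T V p$, and rank the text documents in descending order of similarity.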

Example

To verify the effect of the invention, about 2,900 pages were crawled from Wikipedia's "Picture of the day" pages and divided into 10 broad categories. Each page contains one image and several paragraphs of related descriptive text, and these pages serve as the experimental data set. An image and a text are considered relevant if they belong to the same one of the 10 categories, and irrelevant otherwise. The data set is split into a training set and a test set; the invention is trained on the training set and then evaluated independently on the test set. Feature extraction follows the steps described above: after common and rare words are removed, the text space is set to 5,000 dimensions, and the image space is set to 1,000 dimensions. To evaluate the performance of the algorithm objectively, the inventors use mean average precision (MAP). The MAP results are shown in Table 1:

Query direction	MAP@50	MAP@all
Text query image	0.3981	0.2123
Image query text	0.2599	0.2528

Table 1

Here MAP@50 is the MAP value computed over the first 50 returned results, and MAP@all is the MAP value computed over all returned results.
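The two measures can be sketched as follows. Conventions for truncated average precision differ; this sketch normalizes by the number of relevant results found within the cutoff, which is an assumption, because the text does not state which normalization was used. The function names are illustrative.

```python
import numpy as np

def average_precision_at_k(relevant_in_rank_order, k=None):
    """Average precision over the top-k results; k=None uses the full list."""
    hits = np.asarray(relevant_in_rank_order, dtype=bool)
    if k is not None:
        hits = hits[:k]
    if not hits.any():
        return 0.0
    ranks = np.flatnonzero(hits) + 1  # 1-based ranks of the relevant results
    # Precision at each relevant result is (relevant seen so far) / rank.
    return float((np.arange(1, len(ranks) + 1) / ranks).mean())

def mean_average_precision(queries, k=None):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision_at_k(q, k) for q in queries]))

# Two toy queries; each list marks, in rank order, whether a result is relevant.
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_all = mean_average_precision(queries)      # counterpart of MAP@all
map_at_2 = mean_average_precision(queries, 2)  # counterpart of MAP@50 with a cutoff of 2
```

With a cutoff, relevant results ranked below position k are ignored, which is why MAP@50 and MAP@all can differ noticeably for the same model.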

To better illustrate the results of the invention in cross-media retrieval, Fig. 2 presents examples of some query results. Fig. 2 shows two retrieval results of the invention, one for text-to-image retrieval and one for image-to-text retrieval. For image-to-text retrieval, each returned text is displayed through its corresponding image. The results show that the method works well both when querying text with an image and when querying images with text, and that it returns semantically close results that traditional single-modality retrieval cannot achieve.

Claims

A cross-media ranking method based on hidden space learning and bidirectional ranking learning, characterized by comprising the following steps:

1) Combine the ranking samples for text-to-image retrieval and the ranking samples for image-to-text retrieval into a unified set of training samples;

2) Perform cross-media ranking learning, based on hidden space learning and bidirectional ranking learning, on the constructed training samples to obtain a multimedia semantic space and a cross-media ranking model;

3) Use the learned cross-media ranking model to perform cross-media ranking: after a query example is submitted, first find the coordinates of the query example in the multimedia semantic space; then, using the coordinates of the cross-media objects in that space, compute the similarity between the query example and every other cross-media object in the multimedia semantic space, and rank all cross-media objects by this similarity.
The cross-media ranking method based on hidden space learning and bidirectional ranking learning according to claim 1, characterized in that step 1) comprises:

1) Represent every text document in the training samples with a bag-of-words model and weight each word with the TF-IDF method, so that a text is finally represented as $t \in \mathbb{R}^m$, where $m$ is the dimension of the text space;

2) Extract SIFT local feature points from every image document in the training samples and cluster these local feature points with K-Means, using the cluster centers to build a codebook of visual words. Then, for each image, use Euclidean nearest-neighbor search to determine which visual word in the codebook each of its local feature points belongs to. Finally, as with text documents, represent the image with a bag-of-words model and the TF-IDF method, so that an image is finally represented as $p \in \mathbb{R}^n$, where $n$ is the dimension of the image space;

3) For the text-to-image direction, build a ranked list of images for each query text, in which every image is labeled as semantically relevant or semantically irrelevant to the query. The training samples for text-to-image retrieval are therefore represented as triples $(t_i, p_i, y_i^*)$, $i = 1, \dots, N$, where $N$ is the number of training samples, $t_i$ is the query text, $p_i$ is the image set, $y_i^* \in \mathcal{Y}$ is the ranking over the image set, and $\mathcal{Y}$ denotes the whole ranking space;

4) For the image-to-text direction, build a ranked list of text documents for each query image, in which every text document is labeled as semantically relevant or semantically irrelevant to the query. The training samples for image-to-text retrieval are represented as triples $(p_j, t_j, y_j^*)$, $j = N + 1, \dots, N + M$, where $M$ is the number of training samples, $p_j$ is the query image, $t_j$ is the text document set, and $y_j^* \in \mathcal{Y}$ is the ranking over the text document set;

5) Combine the query lists of the two directions to obtain the unified training samples.
The cross-media ranking method based on hidden space learning and bidirectional ranking learning according to claim 1, characterized in that step 2) comprises:

1) Use a structural support vector machine to formulate an optimization problem whose objective makes the mapping functions strike a compromise between structural risk and empirical risk:

$$
\begin{aligned}
\min_{U, V, \xi_1, \xi_2} \quad & \frac{\lambda}{2} \|U\|_F^2 + \frac{\lambda}{2} \|V\|_F^2 + \frac{1}{N} \sum_{i=1}^{N} \xi_{1,i} + \frac{1}{M} \sum_{j=N+1}^{N+M} \xi_{2,j} \\
\text{s.t.} \quad & \forall i \in \{1, \dots, N\},\ \forall y \in \mathcal{Y}: \quad \delta F(t_i, p_i, y) \ge \Delta(y_i^*, y) - \xi_{1,i}, \\
& \forall j \in \{N+1, \dots, N+M\},\ \forall y \in \mathcal{Y}: \quad \delta F(p_j, t_j, y) \ge \Delta(y_j^*, y) - \xi_{2,j}.
\end{aligned}
\qquad (1)
$$

Here $U \in \mathbb{R}^{k \times m}$ is the matrix that maps text into the hidden space, $V \in \mathbb{R}^{k \times n}$ is the matrix that maps images into the hidden space, $k$ is the dimension of the hidden space, and $\xi_{1,i}$ and $\xi_{2,j}$ are slack variables. The function $F$ is defined as follows:

$$F(t, p, y) = \sum_{i \in p^+} \sum_{j \in p^-} y_{ij} \frac{(Ut)^T V (p_i - p_j)}{|p^+| \cdot |p^-|} \qquad (2)$$

$$\delta F(t_i, p_i, y) = F(t_i, p_i, y_i^*) - F(t_i, p_i, y) \qquad (3)$$

$$F(p, t, y) = \sum_{i \in t^+} \sum_{j \in t^-} y_{ij} \frac{(Vp)^T U (t_i - t_j)}{|t^+| \cdot |t^-|} \qquad (4)$$

$$\delta F(p_j, t_j, y) = F(p_j, t_j, y_j^*) - F(p_j, t_j, y) \qquad (5)$$

Here $p^+$ and $p^-$ denote the set of images relevant to the query text $t$ and the set of images irrelevant to it, and $t^+$ and $t^-$ denote the set of texts relevant to the query image $p$ and the set of texts irrelevant to it. The value of $y_{ij}$ is determined by the ranking $y$: if document $i$ is ranked ahead of document $j$, then $y_{ij} = 1$; otherwise $y_{ij} = -1$. In addition, the loss function is defined as $\Delta(y^*, y) = 1 - \mathrm{MAP}(y^*, y)$, where MAP is mean average precision, a performance measure commonly used in information retrieval. The larger the MAP value, the better the ranking performance and the smaller the value of the loss function;

2) Input the bidirectional ranking samples as the training samples of the optimization problem and solve it to obtain the parameters $U$ and $V$.
The cross-media ranking method based on hidden space learning and bidirectional ranking learning according to claim 1, characterized in that step 3) comprises:

1) When the input is a text query $t$, compute for every image $p_i$ its similarity to the query with the formula $f(t, p_i) = (Ut)^T V p_i$, then rank the images in descending order of similarity;

2) When the input is an image query $p$, compute for every text document $t_i$ its similarity to the query with the formula $f(t_i, p) = (Ut_i)^T V p$, then rank the text documents in descending order of similarity.