CN107885854A - A semi-supervised cross-media retrieval method based on feature selection and virtual data generation - Google Patents
A semi-supervised cross-media retrieval method based on feature selection and virtual data generation
- Publication number
- CN107885854A (application CN201711124618.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a semi-supervised cross-media retrieval method based on feature selection and virtual data generation. The method generates virtual data points from the features of the training data to enlarge the training set, and applies l2,1-norm regularization for feature selection while learning two pairs of projection matrices. Specifically, the class center of each class of images and texts is computed first, and new data points are generated at random around each center to form an enlarged training set; two pairs of projection matrices are then learned from the new data, with an l2,1-norm constraint performing feature selection at the same time; finally, the retrieval results are evaluated with the mAP metric. The method not only generates random data points to increase the diversity of the training data, but also selects the more discriminative and informative features while the two pairs of projection matrices are learned. Experimental results on three different data sets demonstrate the superiority of the method.
Description
Technical field
The present invention relates to cross-media retrieval methods, and more specifically to a semi-supervised cross-media retrieval method based on feature selection and virtual data generation.
Background technology
With the development of multimedia technology, more and more data can be expressed in different modalities, and data of different modalities may carry the same semantic information. Exploring the relationship between data that share the same semantics but appear in different modalities has therefore become particularly important, and in recent years cross-media retrieval has attracted increasing attention from researchers. Cross-media retrieval uses data of one modality as the query to retrieve data of other modalities with the same semantic information. Taking image-text retrieval as an example, an image can be used to retrieve texts with the corresponding semantic information (I2T for short), or a text can be used to retrieve images with the corresponding semantic information (T2I for short). The present invention is analyzed and tested on retrieval between images and texts, but the method can be extended to retrieval between other modalities.
The central problem in cross-media retrieval is that data of different modalities have different feature representations living in different feature spaces, so the similarity between such heterogeneous data cannot be compared directly. The field is therefore mainly concerned with how to bridge this semantic gap. A popular solution is subspace learning, which aims to learn a latent semantic space in which the similarity of heterogeneous data can be measured directly. Traditional subspace learning methods learn one pair of projection matrices; the data of different modalities are mapped into a common latent semantic space through these projection matrices, after which the similarity of the heterogeneous data can be measured. One popular method is Canonical Correlation Analysis (CCA), which learns a pair of projection matrices that maximize the correlation between the heterogeneous data when the features of the different modalities are mapped into the semantic space. Building on CCA, Semantic Correlation Matching (SCM) obtains a semantic space using logistic regression. Another popular method is Partial Least Squares (PLS), which learns two latent semantic spaces by maximizing the correlation between the heterogeneous data. In addition, Generalized Multi-view Analysis (GMA) and its instantiations GMLDA and GMMFA exploit label information to obtain multi-view features and achieve better results.
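The single-pair subspace learning described above can be sketched in a few lines of NumPy. This is an illustrative classical-CCA implementation under our own assumptions (the function name cca_projections and the 1e-6 ridge term are ours), not code from the patent:

```python
import numpy as np

def cca_projections(X, T, k):
    """Classical CCA sketch: learn one pair of projection matrices
    (Wx, Wt) so that X @ Wx and T @ Wt are maximally correlated.
    Rows are samples; a small ridge term keeps the covariances
    invertible (an assumption of this sketch, not of the patent)."""
    Xc, Tc = X - X.mean(0), T - T.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + 1e-6 * np.eye(X.shape[1])
    Ctt = Tc.T @ Tc / n + 1e-6 * np.eye(T.shape[1])
    Cxt = Xc.T @ Tc / n
    Lx = np.linalg.cholesky(Cxx)              # whitening factors
    Lt = np.linalg.cholesky(Ctt)
    Lx_inv, Lt_inv = np.linalg.inv(Lx), np.linalg.inv(Lt)
    # canonical directions = SVD of the whitened cross-covariance
    U, s, Vt = np.linalg.svd(Lx_inv @ Cxt @ Lt_inv.T)
    return Lx_inv.T @ U[:, :k], Lt_inv.T @ Vt.T[:, :k]
```

Mapping both modalities through these matrices places them in one shared space where similarity can be measured directly, which is exactly the role CCA plays as a baseline here.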
However, common cross-media retrieval tasks are directional, i.e., image-to-text retrieval (I2T) and text-to-image retrieval (T2I). The above methods learn only one pair of projection matrices and do not emphasize the importance of the query data. Specifically, in the I2T task the images are more decisive for learning the projection matrices, while in the T2I task the importance of the texts is emphasized. A method that learns a single pair of projection matrices therefore hardly reaches the optimal effect. To emphasize the importance of the query data in different tasks, the Modality-dependent Cross-media Retrieval (MDCR) method learns two pairs of projection matrices, i.e., one pair each for the I2T and T2I tasks; the importance of the query data is thus fully considered, and the retrieval precision is greatly improved.
However, the above methods are all purely supervised: they train only on labeled data, ignore unlabeled data, and cannot enlarge the underlying data set. Moreover, current methods only consider how to measure the similarity between heterogeneous data, aiming to learn more effective projection matrices so as to obtain more accurate comparisons in the semantic space; they all ignore the selection of more informative and discriminative features while learning the projection matrices. We therefore invented a semi-supervised method, based on MDCR, that randomly generates virtual data points and uses the l2,1 norm for feature selection.
Summary of the invention
The invention provides a semi-supervised cross-media retrieval technique based on feature selection and pseudo-random data generation. Traditional cross-media retrieval methods are either supervised methods trained only on labeled data, or semi-supervised methods that add a portion of unlabeled data to the training set. The present invention proposes to generate correlated pseudo-random virtual data points on the basis of the labeled data; in this way, not only is unlabeled data taken into account, but correlated virtual data points can also be added to improve the training precision. Meanwhile, unlike traditional retrieval methods, our method learns different projection matrices for different tasks and uses the l2,1 norm for feature selection while learning them. Overall, our method simultaneously considers the diversity of the training data and the selection of effective features.
The concrete technical scheme of the present invention is as follows:
A semi-supervised cross-media retrieval technique based on feature selection and virtual data generation comprises the following steps:
Step 1: Given a data set G = {(x_i, t_i)}_{i=1}^n, where n denotes the number of data pairs, x_i denotes an image feature and t_i denotes a text feature, the image and text feature matrices are expressed as X_G = [x_1, x_2, ..., x_{n-1}, x_n] and T_G = [t_1, t_2, t_3, ..., t_{n-1}, t_n];
Step 2: Generate pseudo-random virtual data points to enlarge the original data set, specifically: compute the class center of each class in X_G and T_G, i.e., for each class, compute the mean of the data in every dimension, and take the vector formed by the per-dimension means as the class center; then, taking the mean of each dimension as the center, randomly generate n' values around it, and combine the random values over all dimensions into n' pseudo-random virtual data points G' = {(x'_i, t'_i)}_{i=1}^{n'}; adding the pseudo-random virtual data points to the original data set gives the enlarged data set G_all = {G, G'}, and the enlarged image and text feature matrices are expressed as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}];
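The virtual-data step above can be sketched as follows. The patent does not specify the sampling distribution around each class center, so the Gaussian perturbation, the noise_scale knob, and the function name generate_virtual_points are assumptions of this sketch:

```python
import numpy as np

def generate_virtual_points(features, labels, n_virtual, noise_scale=0.1, seed=0):
    """For each class, compute the per-dimension mean (class center)
    and sample n_virtual points at random around it. The Gaussian
    noise scaled by the per-dimension std is an assumed choice."""
    rng = np.random.default_rng(seed)
    new_feats, new_labels = [], []
    for c in np.unique(labels):
        class_feats = features[labels == c]      # rows = samples
        center = class_feats.mean(axis=0)        # per-dimension mean
        std = class_feats.std(axis=0) * noise_scale
        virtual = center + rng.normal(size=(n_virtual, features.shape[1])) * std
        new_feats.append(virtual)
        new_labels.extend([c] * n_virtual)
    return np.vstack(new_feats), np.array(new_labels)
```

Stacking the returned points under the original feature matrices yields the enlarged X and T used in the later steps.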
Step 3: Build the objective function:
min_{U,V} f(U, V) = C(U, V) + L(U, V) + N(U, V)   (1)
where U, V denote the pair of projection matrices to be learned; C(U, V) is the correlation analysis term, which keeps multi-modal data pairs in a neighbor relationship in the latent semantic space; L(U, V) is the linear regression term from the image or text feature space to the semantic space, which preserves the neighbor relationship of different-modality data with the same semantics; and N(U, V) is the regularization term used for feature selection;
From formula (1), the objective functions of the image-to-text (I2T) and text-to-image (T2I) retrieval tasks are obtained as follows:
(1) The objective function of I2T is:
min_{U1,V1} f(U1, V1) = β||X^T U1 − Y||_F^2 + (1 − β)||X^T U1 − T^T V1||_F^2 + N(U1, V1)   (2)
where U1, V1 are the projection matrices to be learned in the I2T task, corresponding to U, V in formula (1); β is a balance coefficient with 0 ≤ β ≤ 1; and Y is the semantic matrix;
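To make the I2T objective concrete, a direct evaluation of β||X^T U1 − Y||_F^2 + (1 − β)||X^T U1 − T^T V1||_F^2 plus the l2,1 penalties can be sketched as follows (NumPy assumed; X and T store samples as columns, following the notation above; the function names are ours):

```python
import numpy as np

def l21_norm(M):
    """l2,1 norm: sum of the Euclidean norms of the rows of M."""
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def i2t_objective(U1, V1, X, T, Y, beta, lam1, lam2):
    """Value of the I2T objective with N(U1, V1) expanded:
    beta*||X^T U1 - Y||_F^2 + (1-beta)*||X^T U1 - T^T V1||_F^2
    + lam1*||U1||_{2,1} + lam2*||V1||_{2,1}."""
    reg = beta * np.linalg.norm(X.T @ U1 - Y, 'fro') ** 2
    corr = (1 - beta) * np.linalg.norm(X.T @ U1 - T.T @ V1, 'fro') ** 2
    return reg + corr + lam1 * l21_norm(U1) + lam2 * l21_norm(V1)
```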
(2) The objective function of T2I is:
min_{U2,V2} f(U2, V2) = β||T^T V2 − Y||_F^2 + (1 − β)||X^T U2 − T^T V2||_F^2 + N(U2, V2)   (3)
where U2, V2 are the projection matrices to be learned in the T2I task, corresponding to U, V in formula (1);
Step 4: Obtain the optimal projection matrices by iteration:
Since formulas (2) and (3) are non-convex, they are solved by alternating over the variables: take the partial derivative with respect to U and V respectively and set it to zero to obtain the values of the projection matrices U and V; then iterate until convergence to obtain the optimal values of U and V.
In particular, in step 3, N(U, V) = λ1||U||_{2,1} + λ2||V||_{2,1}, where λ1, λ2 balance the two regularization terms and are both positive. This constraint term selects the more discriminative and informative features while the projection matrices are learned.
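One way to see the effect of this constraint: after learning, rows of the projection matrix with near-zero l2 norm correspond to feature dimensions that were effectively discarded. A sketch (the keep_ratio knob and function name are illustrative assumptions, not part of the patent):

```python
import numpy as np

def select_features(U, keep_ratio=0.5):
    """The l2,1 penalty drives whole rows of the projection matrix
    toward zero; rows with large l2 norm mark informative feature
    dimensions. Returns the indices of the top rows by norm."""
    row_norms = np.sqrt((U ** 2).sum(axis=1))
    k = max(1, int(len(row_norms) * keep_ratio))
    return np.argsort(row_norms)[::-1][:k]
```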
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Embodiment
1. Data set processing:
Wikipedia contains 10 classes and 2866 image-text pairs in total. We select 2173 image-text pairs as the initial training data; the remainder is test data. The image features are 4096-dimensional CNN features and the text features are 100-dimensional LDA features.
Pascal Sentence contains 20 classes with 50 image-text pairs per class. We select 30 image-text pairs from each class as the initial training data; the rest is test data. The image features are 4096-dimensional CNN features and the text features are 100-dimensional LDA features.
INRIA-Websearch contains 353 classes and 71478 image-text pairs in total. We randomly select 70% of them as the initial training data; the rest is test data. The image features are 4096-dimensional CNN features and the text features are 1000-dimensional LDA features.
2. Specific implementation steps of the invention:
Step 1: Given a data set G = {(x_i, t_i)}_{i=1}^n, where n denotes the number of data pairs, x_i denotes an image feature and t_i denotes a text feature, the image and text feature matrices are expressed as X_G = [x_1, x_2, ..., x_{n-1}, x_n] and T_G = [t_1, t_2, t_3, ..., t_{n-1}, t_n].
Step 2: Generate pseudo-random virtual data points to enlarge the original data set, specifically: compute the class center of each class in X_G and T_G, i.e., for each class, compute the mean of the data in every dimension, and take the vector formed by the per-dimension means as the class center; then, taking the mean of each dimension as the center, randomly generate n' values around it, and combine the random values over all dimensions into n' pseudo-random virtual data points G' = {(x'_i, t'_i)}_{i=1}^{n'}; adding the pseudo-random virtual data points to the original data set gives the enlarged data set G_all = {G, G'}, and the enlarged image and text feature matrices are expressed as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}].
Step 3: Build the objective function:
min_{U,V} f(U, V) = C(U, V) + L(U, V) + N(U, V)   (1)
where U, V denote the pair of projection matrices to be learned; C(U, V) is the correlation analysis term, which keeps multi-modal data pairs in a neighbor relationship in the latent semantic space; L(U, V) is the linear regression term from the image or text feature space to the semantic space, which preserves the neighbor relationship of different-modality data with the same semantics; and N(U, V) is the regularization term used for feature selection;
From formula (1), the objective functions of the image-to-text (I2T) and text-to-image (T2I) retrieval tasks are obtained as follows:
(1) The objective function of I2T is:
min_{U1,V1} f(U1, V1) = β||X^T U1 − Y||_F^2 + (1 − β)||X^T U1 − T^T V1||_F^2 + N(U1, V1)   (2)
where U1, V1 are the projection matrices to be learned in the I2T task; β is a balance coefficient with 0 ≤ β ≤ 1; Y is the semantic matrix; and N(U1, V1) = λ1||U1||_{2,1} + λ2||V1||_{2,1}, where λ1, λ2 balance the two regularization terms and are both positive;
(2) The objective function of T2I is:
min_{U2,V2} f(U2, V2) = β||T^T V2 − Y||_F^2 + (1 − β)||X^T U2 − T^T V2||_F^2 + N(U2, V2)   (3)
where U2, V2 are the projection matrices to be learned in the T2I task, and N(U2, V2) = λ1||U2||_{2,1} + λ2||V2||_{2,1};
Step 4: Obtain the optimal projection matrices by iteration:
Since formulas (2) and (3) are non-convex, they are solved by alternating over the variables: take the partial derivative with respect to U and V respectively and set it to zero to obtain the values of the projection matrices U and V; then iterate until convergence to obtain the optimal values of U and V.
In particular, the l2,1 norm can be differentiated through a trace formulation: for a matrix U, ||U||_{2,1} is handled via Tr(U^T R U), where R is a diagonal matrix with entries R_ii = 1/(2||u^i||_2 + ε), u^i denotes the i-th row of U, and ε is a very small real number.
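This reweighting can be sketched as follows; the exact placement of ε in the denominator is our assumption, chosen to be consistent with the closed-form updates used later in the embodiment:

```python
import numpy as np

def reweight_matrix(U, eps=1e-8):
    """Diagonal matrix R with R_ii = 1 / (2*||u_i||_2 + eps), u_i the
    i-th row of U. Then 2*R @ U approximates the (sub)gradient of
    ||U||_{2,1}, and 2*Tr(U.T @ R @ U) recovers ||U||_{2,1} as eps -> 0;
    eps guards against zero rows."""
    row_norms = np.sqrt((U ** 2).sum(axis=1))
    return np.diag(1.0 / (2.0 * row_norms + eps))
```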
3. Evaluation criterion (mAP)
We use the mean average precision (mAP) criterion to evaluate the final retrieval effect. We first define the average precision (AP) of each query as AP = (1/R) Σ_{i=1}^{N} P(i) · rel(i), where N denotes the number of samples in the test data, R denotes the number of samples relevant to the query, rel(i) = 1 when the i-th result in the ranking has the same label as the query and rel(i) = 0 otherwise, and P(i) denotes the precision of the top-i retrieval results. The mAP is then the mean of the AP values over all queries.
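The mAP computation can be sketched as follows, assuming label-based relevance as described above (the function names are ours):

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query: ranked_labels holds the label of each
    retrieved item in rank order; rel(i)=1 when it matches the
    query label, and P(i) is the precision of the top-i results."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def mean_average_precision(all_ranked, query_labels):
    """mAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r, q)
                          for r, q in zip(all_ranked, query_labels)]))
```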
4. Algorithm implementation
(1) I2T:
Input: image feature matrix X_G, text feature matrix T_G, sample label matrix Y, and parameters λ1, λ2, β.
Generate virtual data: for each class, first compute the mean of every dimension as the class center; then, taking the mean of each dimension as the center, randomly generate n' values around it, and combine the random values over all dimensions into n' virtual data points. Finally, add the generated virtual data to the input image and text feature matrices to obtain the new training image feature matrix X and text feature matrix T.
Initialization: the initial projection matrices U1, V1 are identity matrices.
Solve for the optimal solution: according to
U1 = (XX^T + λ1 R11)^{-1} [βXY + (1 − β) X T^T V1] and
V1 = [(1 − β) T T^T + λ2 R12]^{-1} (1 − β) T X^T U1,
iterate until the result converges to the optimal U1, V1.
The pseudocode of this process iterates the two closed-form updates above until convergence.
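A runnable sketch of the alternating I2T procedure, combining the two closed-form updates with the l2,1 reweighting; the ε guard, iteration count, and the function name solve_i2t are assumptions of this sketch:

```python
import numpy as np

def solve_i2t(X, T, Y, beta, lam1, lam2, n_iter=50, eps=1e-8):
    """Alternating closed-form updates for the I2T task.
    X (d_x x n) and T (d_t x n) hold samples as columns, Y (n x c) is
    the semantic label matrix. R11, R12 are the diagonal l2,1
    reweighting matrices, rebuilt from the current U1, V1 each pass."""
    U1 = np.eye(X.shape[0], Y.shape[1])      # identity initialization
    V1 = np.eye(T.shape[0], Y.shape[1])
    for _ in range(n_iter):
        R11 = np.diag(1.0 / (2.0 * np.sqrt((U1 ** 2).sum(1)) + eps))
        R12 = np.diag(1.0 / (2.0 * np.sqrt((V1 ** 2).sum(1)) + eps))
        # U1 = (X X^T + lam1 R11)^{-1} [beta X Y + (1-beta) X T^T V1]
        U1 = np.linalg.solve(X @ X.T + lam1 * R11,
                             beta * X @ Y + (1 - beta) * X @ T.T @ V1)
        # V1 = [(1-beta) T T^T + lam2 R12]^{-1} (1-beta) T X^T U1
        V1 = np.linalg.solve((1 - beta) * T @ T.T + lam2 * R12,
                             (1 - beta) * T @ X.T @ U1)
    return U1, V1
```

Each pass exactly minimizes the objective in one variable given the other and the current reweighting, so the objective value decreases across iterations.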
(2) T2I:
Similar to the I2T task; the optimal projection matrices U2, V2 are finally obtained.
5. Comparison of results
We ran experiments on the three data sets and compared against 7 other currently popular methods (PLS, CCA, SM, SCM, GMMFA, GMLDA, MDCR). The table below shows that the method of the invention achieves better retrieval results on all the data sets.
Claims (2)
1. A semi-supervised cross-media retrieval technique based on feature selection and virtual data generation, comprising the following steps:
Step 1: Given a data set G = {(x_i, t_i)}_{i=1}^n, where n denotes the number of data pairs, x_i denotes an image feature and t_i denotes a text feature, the image and text feature matrices are expressed as X_G = [x_1, x_2, ..., x_{n-1}, x_n] and T_G = [t_1, t_2, t_3, ..., t_{n-1}, t_n];
Step 2: Generate pseudo-random virtual data points to enlarge the original data set, specifically: compute the class center of each class in X_G and T_G, i.e., for each class, compute the mean of the data in every dimension, and take the vector formed by the per-dimension means as the class center; then, taking the mean of each dimension as the center, randomly generate n' values around it, and combine the random values over all dimensions into n' pseudo-random virtual data points G' = {(x'_i, t'_i)}_{i=1}^{n'}; adding the pseudo-random virtual data points to the original data set gives the enlarged data set G_all = {G, G'}, and the enlarged image and text feature matrices are expressed as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}];
Step 3: Build the objective function:
min_{U,V} f(U, V) = C(U, V) + L(U, V) + N(U, V)   (1)
where U, V denote the pair of projection matrices to be learned; C(U, V) is the correlation analysis term, which keeps multi-modal data pairs in a neighbor relationship in the latent semantic space; L(U, V) is the linear regression term from the image or text feature space to the semantic space, which preserves the neighbor relationship of different-modality data with the same semantics; and N(U, V) is the regularization term used for feature selection;
From formula (1), the objective functions of the image-to-text (I2T) and text-to-image (T2I) retrieval tasks are obtained as follows:
(1) The objective function of I2T is:
min_{U1,V1} f(U1, V1) = β||X^T U1 − Y||_F^2 + (1 − β)||X^T U1 − T^T V1||_F^2 + N(U1, V1)   (2)
where U1, V1 are the projection matrices to be learned in the I2T task, corresponding to U, V in formula (1); β is a balance coefficient with 0 ≤ β ≤ 1; and Y is the semantic matrix;
(2) The objective function of T2I is:
min_{U2,V2} f(U2, V2) = β||T^T V2 − Y||_F^2 + (1 − β)||X^T U2 − T^T V2||_F^2 + N(U2, V2)   (3)
where U2, V2 are the projection matrices to be learned in the T2I task, corresponding to U, V in formula (1);
Step 4: Obtain the optimal projection matrices by iteration:
Since formulas (2) and (3) are non-convex, they are solved by alternating over the variables: take the partial derivative with respect to U and V respectively and set it to zero to obtain the values of the projection matrices U and V; then iterate until convergence to obtain the optimal values of U and V.
2. The semi-supervised cross-media retrieval technique based on feature selection and virtual data generation of claim 1, characterized in that: in step 3, N(U, V) = λ1||U||_{2,1} + λ2||V||_{2,1}, where λ1, λ2 balance the two regularization terms and are both positive; this constraint term selects the more discriminative and informative features while the projection matrices are learned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711124618.9A CN107885854B (en) | 2017-11-14 | 2017-11-14 | semi-supervised cross-media retrieval method based on feature selection and virtual data generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711124618.9A CN107885854B (en) | 2017-11-14 | 2017-11-14 | semi-supervised cross-media retrieval method based on feature selection and virtual data generation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107885854A true CN107885854A (en) | 2018-04-06 |
CN107885854B CN107885854B (en) | 2020-01-31 |
Family
ID=61776703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711124618.9A Active CN107885854B (en) | 2017-11-14 | 2017-11-14 | semi-supervised cross-media retrieval method based on feature selection and virtual data generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107885854B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784405A * | 2019-01-16 | 2019-05-21 | 山东建筑大学 | Cross-modal retrieval method and system based on pseudo-label learning and semantic consistency |
CN109857892A * | 2018-12-29 | 2019-06-07 | 西安电子科技大学 | Semi-supervised cross-modal hash retrieval method based on class label transfer |
CN112419324A (en) * | 2020-11-24 | 2021-02-26 | 山西三友和智慧信息技术股份有限公司 | Medical image data expansion method based on semi-supervised task driving |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105069173A (en) * | 2015-09-10 | 2015-11-18 | 天津中科智能识别产业技术研究院有限公司 | Rapid image retrieval method based on supervised topology keeping hash |
US9332137B2 (en) * | 2012-09-28 | 2016-05-03 | Interactive Memories Inc. | Method for form filling an address on a mobile computing device based on zip code lookup |
CN106462642A (en) * | 2014-06-24 | 2017-02-22 | 谷歌公司 | Methods, Systems And Media For Performing Personalized Actions On Mobile Devices Associated With A Media Presentation Device |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9332137B2 (en) * | 2012-09-28 | 2016-05-03 | Interactive Memories Inc. | Method for form filling an address on a mobile computing device based on zip code lookup |
CN106462642A (en) * | 2014-06-24 | 2017-02-22 | 谷歌公司 | Methods, Systems And Media For Performing Personalized Actions On Mobile Devices Associated With A Media Presentation Device |
CN105069173A (en) * | 2015-09-10 | 2015-11-18 | 天津中科智能识别产业技术研究院有限公司 | Rapid image retrieval method based on supervised topology keeping hash |
Non-Patent Citations (1)
Title |
---|
YUNCHAO WEI等: "Modality-Dependent Cross-Media Retrieval", 《ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109857892A * | 2018-12-29 | 2019-06-07 | 西安电子科技大学 | Semi-supervised cross-modal hash retrieval method based on class label transfer |
CN109857892B (en) * | 2018-12-29 | 2022-12-02 | 西安电子科技大学 | Semi-supervised cross-modal Hash retrieval method based on class label transfer |
CN109784405A * | 2019-01-16 | 2019-05-21 | 山东建筑大学 | Cross-modal retrieval method and system based on pseudo-label learning and semantic consistency |
CN109784405B (en) * | 2019-01-16 | 2020-09-08 | 山东建筑大学 | Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency |
CN112419324A (en) * | 2020-11-24 | 2021-02-26 | 山西三友和智慧信息技术股份有限公司 | Medical image data expansion method based on semi-supervised task driving |
CN112419324B (en) * | 2020-11-24 | 2022-04-19 | 山西三友和智慧信息技术股份有限公司 | Medical image data expansion method based on semi-supervised task driving |
Also Published As
Publication number | Publication date |
---|---|
CN107885854B (en) | 2020-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Msdn: Mutually semantic distillation network for zero-shot learning | |
CN107402993B | Cross-modal retrieval method based on discriminative correlation-maximized hashing | |
CN111639679B (en) | Small sample learning method based on multi-scale metric learning | |
CN106649715B | A cross-media retrieval method based on locality-sensitive hashing and neural networks | |
CN106339756A (en) | Training data generation method and device and searching method and device | |
CN104573669A (en) | Image object detection method | |
CN110097095B (en) | Zero sample classification method based on multi-view generation countermeasure network | |
CN109871454B (en) | Robust discrete supervision cross-media hash retrieval method | |
CN111666427A (en) | Entity relationship joint extraction method, device, equipment and medium | |
CN112418351B (en) | Zero sample learning image classification method based on global and local context sensing | |
CN107885854A | A semi-supervised cross-media retrieval method based on feature selection and virtual data generation | |
CN114186084B (en) | Online multi-mode Hash retrieval method, system, storage medium and equipment | |
CN104424254A (en) | Method and device for obtaining similar object set and providing similar object set | |
Nascimento et al. | A robust indoor scene recognition method based on sparse representation | |
CN114117153A (en) | Online cross-modal retrieval method and system based on similarity relearning | |
Chen et al. | An Improved Deep Fusion CNN for Image Recognition. | |
Liu et al. | Incorporating domain and sentiment supervision in representation learning for domain adaptation | |
CN105740917B (en) | The semi-supervised multiple view feature selection approach of remote sensing images with label study | |
Ren et al. | Improving cross-domain recommendation through probabilistic cluster-level latent factor model | |
CN105701225A | Cross-media retrieval method based on a unified association hypergraph | |
CN114780723B (en) | Portrayal generation method, system and medium based on guide network text classification | |
Xu et al. | DeMT: Deformable mixer transformer for multi-task learning of dense prediction | |
Dou et al. | Function-words enhanced attention networks for few-shot inverse relation classification | |
CN104036021A (en) | Method for semantically annotating images on basis of hybrid generative and discriminative learning models | |
CN106250918A | A Gaussian mixture model matching method based on an improved earth mover's distance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||