CN102663447A

CN102663447A - Cross-media searching method based on discrimination correlation analysis

Info

Publication number: CN102663447A
Application number: CN2012101334886A
Authority: CN
Inventors: 谭铁牛; 王亮; 王威
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2012-04-28
Filing date: 2012-04-28
Publication date: 2012-09-12
Anticipated expiration: 2032-04-28
Also published as: CN102663447B

Abstract

The invention discloses a cross-media searching method based on discrimination correlation analysis. The method comprises the following steps: establishing a cross-media training database; sequentially carrying out feature extraction, mean-value preprocessing and linear projection transformation on samples of different modalities, and setting an objective function according to the projection space; solving the objective function to obtain linear projection vectors; establishing a cross-media test database; sequentially performing feature extraction and mean-value preprocessing on the object to be searched; using the linear projection vectors to perform the linear projection transformation on the feature data after mean-value preprocessing; and calculating the Euclidean distances between the projections of the data of the two modalities, sorting the Euclidean distances in ascending order, and obtaining the cross-media searching result. The method can effectively reduce the dimensionality of the feature data, so it can be widely applied to other multi-modal tasks, for example multi-modal biometric recognition.

Description

Cross-media retrieval method based on discriminant correlation analysis

Technical field

The present invention relates to the field of pattern recognition and machine learning, and in particular to a cross-media retrieval method based on discriminant correlation analysis.

Background technology

In recent years, the large volume of multimedia data that has emerged exhibits two notable characteristics: high dimensionality and heterogeneity. For example, the same semantic concept can be represented by several kinds of content, such as text, pictures, and videos on the web. Moreover, Internet users still search for the information they need mainly through text keywords, chiefly because search engines cannot understand the relationships among media of different modalities, which has limited the development of search engines. Feature dimensionality reduction reveals, in a low-dimensional space, the manifold structure of high-dimensional data and the correlation between data of different modalities, and it plays an important role in fields such as information retrieval, pattern classification, and information visualization.

Many feature dimensionality reduction methods exist for single-modality data. Principal component analysis (PCA) projects the raw data onto the principal directions of maximum variance. Linear discriminant analysis (LDA) is a supervised dimensionality reduction method that makes full use of class information to find a projection subspace in which features of different classes are optimally discriminable. Locally linear embedding (LLE) is the earliest nonlinear locality-preserving method; it preserves, in the projection space, the linear relationship between each data point and its nearest neighbors. Laplacian eigenmaps (LE) preserve the distances between pairs of neighboring data points in the projection space, and locality preserving projection (LPP) is their linear approximation. The multilayer autoencoder network is a nonlinear extension of principal component analysis. Research has pointed out that although nonlinear methods perform well on simulated data, they are not necessarily better than traditional principal component analysis on real data. Moreover, none of the methods mentioned above can be applied directly to multimodal cross-media retrieval.

Comparatively little research has addressed feature dimensionality reduction for multimodal data. Canonical correlation analysis (CCA) is the best known of these multivariate data analysis methods; it linearly projects the data of each modality into a common subspace so that the multimodal variables have maximum correlation. Unlike canonical correlation analysis, partial least squares (PLS) makes the multimodal variables have maximum covariance in the projection space. Inspired by multilayer autoencoder networks, multimodal deep learning networks have been proposed to learn a common representation for data of different modalities. In short, the above methods mostly seek a projection space with the goal of maximizing the correlation of the multimodal variables, and they neglect maximizing the discriminability of the different classes of data within the multimodal data, even though discriminability is often very important in multimodal data retrieval and classification tasks.

Summary of the invention

Existing multimodal data analysis methods generally do not consider the discriminability of the data. The present invention provides a method based on discriminant correlation analysis (DCA), which combines the ideas of canonical correlation analysis and linear discriminant analysis and simultaneously optimizes the correlation between data of multiple modalities and the discriminability of data of different classes.

The cross-media retrieval method based on discriminant correlation analysis proposed by the present invention is characterized in that the method comprises the following steps:

Step 1: establish a cross-media training database containing one-to-one image-text pairs, and extract the feature vectors of the samples of the different modalities in this database to obtain the corresponding feature point sets;

Step 2: apply mean preprocessing separately to the feature point sets of the image and text modalities so that the feature point set of each modality has zero mean;

Step 3: apply a linear projection transformation to the mean-preprocessed feature point sets, and set an objective function in terms of the linear projection variables according to the resulting projection space;

Step 4: solve the objective function by an eigenvalue method to obtain the linear projection vectors a and b;

Step 5: establish a cross-media test database containing one-to-one image-text pairs;

Step 6: input the object to be retrieved, and extract both the feature vector of the object to be retrieved and the feature point set of the object set in the cross-media test database whose modality differs from that of the object to be retrieved;

Step 7: apply the mean preprocessing separately to the feature vector and the feature point set obtained in step 6;

Step 8: use the linear projection vectors a and b obtained in step 4 to apply the linear projection transformation separately to the mean-preprocessed feature vector and feature point set;

Step 9: compute the Euclidean distances between the projection variable of the object to be retrieved and the projection variables of the object set, and sort all the Euclidean distances in ascending order; the objects corresponding to the first n Euclidean distances are the retrieved objects of the other modality in the cross-media test database that are related to the object to be retrieved.

The method of the present invention can effectively reduce the dimensionality of the feature data, so it can be widely applied to many other multimodal tasks, such as multimodal biometric recognition. Experiments show that, in cross-media retrieval, the method of the present invention performs better than both canonical correlation analysis and simple combinations of canonical correlation analysis and linear discriminant analysis.

Description of drawings

Fig. 1 is a flowchart of the implementation of the method of the present invention;

Fig. 2 shows a comparison between the method of the present invention and other related methods on a simulated data set.

Embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

Fig. 1 is a flowchart of the implementation of the method of the present invention. As shown in Fig. 1, the proposed cross-media retrieval method based on discriminant correlation analysis comprises a training process (Fig. 1(a)) and a test process (Fig. 1(b) and (c)). Specifically, Fig. 1(a) is a flowchart of learning the projection vectors a and b from the image-text pairs in the training database. As shown in Fig. 1(a), the training process of the present invention comprises the following steps:

Step 1: establish a cross-media training database containing one-to-one image-text pairs, and extract the feature vectors of the samples of the different modalities in this database to obtain the corresponding feature point sets.

The present invention first establishes a cross-media training database of one-to-one image-text pairs, and then extracts features from the images and the text using the scale-invariant feature transform (SIFT) algorithm and the latent Dirichlet allocation (LDA) algorithm, respectively.

Step 2: apply mean preprocessing separately to the feature point sets of the image and text modalities so that the feature point set of each modality has zero mean:

x ← x - E(x) (1)

y ← y - E(y)

where x and y are the feature point sets of the two given modalities, for example the feature data sets corresponding to images and text. Their data point sets are \{x_{1}, \ldots, x_{n}\} and \{y_{1}, \ldots, y_{n}\} respectively, and the data in each data point set belong to k common classes.

E(x) and E(y) are the means of the original data point sets.
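The mean preprocessing of formula (1) can be illustrated with a minimal sketch. The helper name `center` and the toy data are hypothetical and are not taken from the patent; the sketch only shows that subtracting the per-dimension mean leaves a zero-mean point set.

```python
def center(points):
    """Subtract the per-dimension mean, as in formula (1).

    `points` is a list of equal-length feature vectors (hypothetical toy input).
    Returns the centered points together with the mean that was removed.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    return centered, mean


# Toy feature point set with three 2-dimensional points.
x = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
x_centered, x_mean = center(x)
print(x_mean)      # [3.0, 4.0]
print(x_centered)  # [[-2.0, -2.0], [0.0, 2.0], [2.0, 0.0]]
```

The same function would be applied independently to the image feature set and to the text feature set.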

Step 3: apply a linear projection transformation to the mean-preprocessed image and text feature point sets to obtain a projection space, and set an objective function according to this projection space; the objective function is expressed in terms of the linear projection variables used to carry out the linear projection transformation.

Given projection vectors a and b, the variable sets x and y corresponding to the image and text feature point sets are linearly projected to obtain the corresponding projection variables u and v:

u = a^{T} x (2)

v = b^{T} y

Setting the objective function according to the projection space obtained by the linear projection transformation further comprises the following steps:

Step 3.1: compute the covariance cov(u, v) of the projection variables u and v in the projection space:

cov(u, v) = a^{T} E(x y^{T}) b

= \frac{1}{2} a^{T} E(x y^{T}) b + \frac{1}{2} b^{T} E(y x^{T}) a

= \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} \begin{bmatrix} 0 & \frac{1}{2} E(x y^{T}) \\ \frac{1}{2} E(y x^{T}) & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} \quad (3)

= \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} \Sigma \begin{bmatrix} a \\ b \end{bmatrix}

where \Sigma is the block matrix that characterizes this covariance, as defined by the expression above.
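The identity in formula (3) can be checked numerically. The following is a minimal sketch using NumPy on randomly generated, centered data; the sizes `n`, `p`, `q` and the random vectors are assumptions chosen only for illustration. It builds the block matrix Σ from the cross-covariance E(x y^T) and compares the quadratic form with the covariance of the projected variables computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3                      # hypothetical sample count and feature sizes

# Synthetic paired data, centered as in formula (1).
X = rng.normal(size=(n, p))
Y = X[:, :q] + 0.1 * rng.normal(size=(n, q))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

a = rng.normal(size=p)                   # arbitrary projection vectors
b = rng.normal(size=q)

Cxy = X.T @ Y / n                        # sample estimate of E(x y^T)
Sigma = np.block([[np.zeros((p, p)), 0.5 * Cxy],
                  [0.5 * Cxy.T, np.zeros((q, q))]])
f = np.concatenate([a, b])               # stacked vector [a; b]

u, v = X @ a, Y @ b                      # projections of formula (2)
cov_direct = np.mean(u * v)              # cov(u, v) for zero-mean u and v
cov_quadratic = f @ Sigma @ f            # right-hand side of formula (3)
print(np.isclose(cov_direct, cov_quadratic))
```

Writing the covariance as a quadratic form in the stacked vector is what later allows the objective to be expressed in a single variable.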

Step 3.2: compute the between-class variance \sigma_{B} and the within-class variance \sigma_{W} of the image and text feature point sets in the projection space:

\sigma_{B} = \sum_{m=1}^{k} \frac{n_{m}}{n} \omega_{m} \omega_{m}^{T} \quad (4)

\sigma_{W} = \frac{1}{2n} \sum_{m=1}^{k} \sum_{i \in C_{m}} \left( (u_{i} - \omega_{m}) (u_{i} - \omega_{m})^{T} + (v_{i} - \omega_{m}) (v_{i} - \omega_{m})^{T} \right) \quad (5)

where n is the number of data points in each data point set, n_{m} is the number of data points of the m-th class in each data point set, k is the number of classes, and \omega_{m} is the mean of the m-th class data over the two data point sets:

\omega_{m} = \frac{1}{2} \left( \frac{1}{n_{m}} \sum_{i \in C_{m}} u_{i} + \frac{1}{n_{m}} \sum_{i \in C_{m}} v_{i} \right) \quad (6)

Substituting the projection formula (2) into formulas (4) and (5), \sigma_{B} and \sigma_{W} can be rewritten as:

\sigma_{B} = \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} S_{B} \begin{bmatrix} a \\ b \end{bmatrix} \quad (7)

\sigma_{W} = \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} S_{W} \begin{bmatrix} a \\ b \end{bmatrix} \quad (8)

where S_{B} and S_{W} are called the "between-class scatter matrix" and the "within-class scatter matrix" of the multimodal data, respectively:

S_{B} = \frac{1}{2n} \sum_{m=1}^{k} n_{m} \begin{bmatrix} E_{m} x x^{T} & E_{m} x y^{T} \\ E_{m} y x^{T} & E_{m} y y^{T} \end{bmatrix} \quad (9)

S_{W} = \frac{1}{2n} \sum_{m=1}^{k} n_{m} \begin{bmatrix} E_{m}(x x^{T}) - \frac{1}{2} E_{m} x x^{T} & -\frac{1}{2} E_{m} x y^{T} \\ -\frac{1}{2} E_{m} y x^{T} & E_{m}(y y^{T}) - \frac{1}{2} E_{m} y y^{T} \end{bmatrix} \quad (10)

where E_{m}(x) and E_{m}(y) are the means of the m-th class data in the original data point sets, C_{m} denotes the m-th class data set, and:

E_{m}(x x^{T}) = \frac{1}{n_{m}} \sum_{i \in C_{m}} x_{i} x_{i}^{T}, \quad E_{m}(y y^{T}) = \frac{1}{n_{m}} \sum_{i \in C_{m}} y_{i} y_{i}^{T}

E_{m} x x^{T} = E_{m}(x) E_{m}^{T}(x), \quad E_{m} x y^{T} = E_{m}(x) E_{m}^{T}(y) \quad (11)

E_{m} y y^{T} = E_{m}(y) E_{m}^{T}(y), \quad E_{m} y x^{T} = E_{m}(y) E_{m}^{T}(x)
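As an illustration of formulas (9) to (11), the following NumPy sketch assembles the two scatter matrices from labeled, paired data and checks them against the scalar definitions (4) to (6). The function name `scatter_matrices` and the synthetic data are hypothetical. One observation from expanding (4) with (6) directly: it yields a prefactor of 1/(4n) for the between-class term, so the quadratic form built from (9) as printed equals twice the value of (4); this constant can be absorbed by the weight in the objective function. The within-class matrix (10) matches (5) exactly.

```python
import numpy as np


def scatter_matrices(X, Y, labels):
    """Build S_B and S_W following formulas (9)-(11) as printed.

    X: (n, p) centered image features, Y: (n, q) centered text features,
    labels: (n,) integer class labels shared by each image-text pair.
    """
    n, p = X.shape
    q = Y.shape[1]
    S_B = np.zeros((p + q, p + q))
    S_W = np.zeros((p + q, p + q))
    for m in np.unique(labels):
        idx = labels == m
        n_m = idx.sum()
        Xm, Ym = X[idx], Y[idx]
        # Stacked class mean [E_m(x); E_m(y)]; its outer product gives the
        # four blocks E_m(x)E_m(x)^T, E_m(x)E_m(y)^T, and so on.
        mean = np.concatenate([Xm.mean(axis=0), Ym.mean(axis=0)])
        B_m = np.outer(mean, mean)
        # Block-diagonal second moments E_m(x x^T) and E_m(y y^T).
        second = np.zeros((p + q, p + q))
        second[:p, :p] = Xm.T @ Xm / n_m
        second[p:, p:] = Ym.T @ Ym / n_m
        S_B += n_m * B_m
        S_W += n_m * (second - 0.5 * B_m)
    return S_B / (2 * n), S_W / (2 * n)


# Synthetic check against the scalar definitions (4)-(6).
rng = np.random.default_rng(1)
n, p, q, k = 60, 3, 2, 3
labels = np.arange(n) % k
X = rng.normal(size=(n, p)) + labels[:, None]
Y = rng.normal(size=(n, q)) - labels[:, None]
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
S_B, S_W = scatter_matrices(X, Y, labels)

f = rng.normal(size=p + q)               # arbitrary stacked vector [a; b]
u, v = X @ f[:p], Y @ f[p:]
sigma_B = sigma_W = 0.0
for m in range(k):
    idx = labels == m
    omega = 0.5 * (u[idx].mean() + v[idx].mean())
    sigma_B += idx.sum() / n * omega ** 2
    sigma_W += np.sum((u[idx] - omega) ** 2 + (v[idx] - omega) ** 2)
sigma_W /= 2 * n
print(np.isclose(f @ S_W @ f, sigma_W), np.isclose(f @ S_B @ f, 2 * sigma_B))
```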

Step 3.3: set the objective function according to the computed covariance cov(u, v), between-class variance \sigma_{B}, and within-class variance \sigma_{W}.

The objective function of the discriminant correlation analysis of the present invention is defined as:

a^{*}, b^{*} = \arg\max_{a, b} \frac{\mu \sigma_{B} + (1 - \mu) cov(u, v)}{\sigma_{W}} \quad (12)

where \sigma_{B} and \sigma_{W} are the "between-class variance" and the "within-class variance" of the two data point sets in the projection space, cov(u, v) is the covariance of the variables u and v in the projection space, and \mu is a tuning parameter that controls the relative weight of \sigma_{B} and cov(u, v).

Step 4: solve the objective function by an eigenvalue method to obtain the finally learned linear projection vectors a and b.

To solve the objective function, it is converted into a generalized eigenvalue problem.

First define f = (a, b), the concatenation of a and b; the objective function (12) can then be rewritten as:

f^{*} = \arg\max_{f} \frac{f^{T} (\mu S_{B} + (1 - \mu) \Sigma) f}{f^{T} S_{W} f} \quad (13)

The objective function (13) is very similar to the objective function of linear discriminant analysis. Applying the method of Lagrange multipliers converts (13) into a generalized eigenvalue problem, as shown below:

(\mu S_{B} + (1 - \mu) \Sigma) f = \lambda S_{W} f \quad (14)

Solve for the eigenvalues and eigenvectors of (14), reorder the eigenvectors in descending order of their eigenvalues, and take the eigenvectors corresponding to the largest eigenvalues as the finally learned linear projection vectors a and b. Applying the learned linear projection vectors a and b to the multimodal feature point sets by the linear projection transformation then reduces the dimensionality of those feature point sets.
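One possible way to solve a problem of the form (14) numerically is sketched below with NumPy. It is not the patent's implementation: the function name `solve_dca`, the Cholesky whitening approach, and the optional `ridge` term (added so that S_W is positive definite) are all assumptions made for illustration, and the toy matrices stand in for S_B, S_W, and Σ.

```python
import numpy as np


def solve_dca(S_B, S_W, Sigma, mu=0.5, d=1, ridge=1e-6):
    """Solve (mu*S_B + (1-mu)*Sigma) f = lambda * S_W f for the top-d pairs.

    Hypothetical helper: whitens with the Cholesky factor of S_W so that a
    symmetric eigen-solver can be used. `ridge` is an assumed regularizer.
    """
    M = mu * S_B + (1.0 - mu) * Sigma
    M = 0.5 * (M + M.T)                          # enforce exact symmetry
    S = S_W + ridge * np.eye(S_W.shape[0])
    L = np.linalg.cholesky(S)                    # S = L L^T
    L_inv = np.linalg.inv(L)
    K = L_inv @ M @ L_inv.T                      # symmetric standard problem
    K = 0.5 * (K + K.T)
    eigvals, Z = np.linalg.eigh(K)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]        # keep the d largest
    F = L_inv.T @ Z[:, order]                    # map back: f = L^{-T} z
    return eigvals[order], F


# Toy matrices standing in for S_B, S_W and Sigma (dimension p + q = 5).
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
S_W = A @ A.T + np.eye(5)                        # symmetric positive definite
B = rng.normal(size=(5, 2))
S_B = B @ B.T                                    # symmetric positive semidefinite
C = rng.normal(size=(5, 5))
Sigma = 0.5 * (C + C.T)                          # symmetric

lam, F = solve_dca(S_B, S_W, Sigma, mu=0.5, d=2, ridge=0.0)
f = F[:, 0]                                      # eigenvector of the largest eigenvalue
M = 0.5 * S_B + 0.5 * Sigma
print(np.allclose(M @ f, lam[0] * (S_W @ f)))
```

With image features of dimension p, the first p entries of each column of `F` would play the role of a and the remaining entries the role of b; keeping d columns gives a d-dimensional projection.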

Step 5: establish a cross-media test database containing one-to-one image-text pairs.

Fig. 1(b) is a flowchart of retrieving, from the text data set, the text related to an image; Fig. 1(c) is a flowchart of retrieving, from the image data set, the images related to a text. As shown in Fig. 1(b) and Fig. 1(c), the test process of the present invention comprises the following steps:

Step 6: input the object to be retrieved, and extract both the feature vector of the object to be retrieved and the feature point set of the object set in the cross-media test database whose modality differs from that of the object to be retrieved.

In this step, as in step 1, features are extracted from the images and the text using the scale-invariant feature transform (SIFT) algorithm and the latent Dirichlet allocation (LDA) algorithm, respectively.

For example, when a series of text objects related to a certain image is to be retrieved, the object to be retrieved is an image; the SIFT feature vector x_{i} of the image and the LDA feature point set \{y_{i}\}_{i=1}^{N} of the text data set in the test database are extracted,

where N is the number of text data items in the test database.

Step 7: as in step 2, apply the mean preprocessing separately to the feature vector and the feature point set obtained in step 6.

Step 8: use the linear projection vectors a and b obtained in step 4 to apply the linear projection transformation separately to the mean-preprocessed feature vector and feature point set, thereby reducing the dimensionality of the mean-preprocessed features.

Using the linear projection vectors a and b obtained in step 4, the SIFT feature vector x_{i} of the image and the LDA feature set \{y_{i}\}_{i=1}^{N} of the text data set in the test database are linearly projected to obtain the corresponding projection variables u_{i} and \{v_{i}\}_{i=1}^{N}:

u_{i} = a^{T} x_{i}

(15)

\{v_{i}\}_{i=1}^{N} = b^{T} \{y_{i}\}_{i=1}^{N}

Step 9: compute the Euclidean distances between the projection variable of the object to be retrieved and the projection variables of the object set, and sort them in ascending order. If the object to be retrieved is an image, this step first computes the Euclidean distance between the projection variable of the image and the projection variable of each text data item in the test database, and then sorts all the Euclidean distances in ascending order; the text data corresponding to the first n Euclidean distances are the retrieved text objects related to the image to be retrieved. The number n of retrieval results can be set by the user as required.
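The ranking by Euclidean distance can be sketched in a few lines of Python. The function name `rank_by_distance` and the toy projections are hypothetical; the sketch assumes the query and the candidates have already been projected into the common space.

```python
import math


def rank_by_distance(query_proj, candidate_projs, top_n):
    """Return the indices of the top_n candidates closest to the query.

    Distances are Euclidean and sorted in ascending order; ties are broken
    by candidate index because the sort key is a (distance, index) pair.
    """
    dists = [(math.dist(query_proj, c), i) for i, c in enumerate(candidate_projs)]
    dists.sort()
    return [i for _, i in dists[:top_n]]


# Toy example: one projected image query and four projected text candidates.
query = [0.0, 0.0]
candidates = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
top = rank_by_distance(query, candidates, top_n=2)
print(top)  # [1, 2]
```

The returned indices identify the items of the other modality to present as retrieval results, with n chosen by the user.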

It should be noted that, in addition to cross-modal retrieval, the method of the present invention can also be applied to any other field in which multimodal data must undergo dimensionality reduction for feature recognition, such as multimodal biometric recognition.

The test results on a simulated data set and on real data presented below show that the method of the present invention outperforms canonical correlation analysis, linear discriminant analysis, and the combinations of canonical correlation analysis and linear discriminant analysis.

The simulated data set example is shown in Fig. 2. Fig. 2(a) shows two generated two-dimensional point sets: the asterisks (class 1) and crosses (class 2) form one point set, and the squares (class 1) and diamonds (class 2) form the other; each point set contains data from the 2 classes. (b) gives the projection result of canonical correlation analysis (CCA) on the simulated data: although the two point sets are highly correlated, the classes overlap heavily in the low-dimensional projection space (here the data are projected onto the horizontal axis), so the projection direction obtained by canonical correlation analysis is not discriminative. (c) gives the projection result of linear discriminant analysis (LDA) on the simulated data: although the two classes are well separated after projection, the correlation between the two projected point sets is poor. (d) gives the result of one combination of linear discriminant analysis (LDA) and canonical correlation analysis (CCA), in which linear discriminant analysis is first applied to each point set and canonical correlation analysis is applied afterwards; the result is very similar to that of applying canonical correlation analysis directly, shown in (b). (e) gives the result of the other combination, in which canonical correlation analysis is first applied to the two point sets and linear discriminant analysis is applied afterwards; its result looks similar to the result (g) of the method of the present invention (DCA). However, after a logarithmic transformation of the horizontal-axis coordinates of results (e) and (g), which yields (f) and (h), it can be seen that the result of the combination of canonical correlation analysis and linear discriminant analysis is not linearly separable along the horizontal axis, as with data points P and Q in (f), whereas the result of the method of the present invention is linearly separable, which shows that the method of the present invention is more discriminative.

In the real data set example, the performance of discriminant correlation analysis was tested on an image-text data set. This data set contains 2866 image-text pairs, of which 2173 pairs form the training set and 693 pairs form the test set. Each image-text pair has a class label and belongs to one of the following 10 classes: art, biology, geography, history, literature, media, music, royalty, sports, and warfare. The images are represented by 128-dimensional SIFT features, and the text by 10-dimensional LDA semantic features. The features of these two kinds of data are then projected into a 9-dimensional space using discriminant correlation analysis, canonical correlation analysis, and the two combinations of canonical correlation analysis and linear discriminant analysis, respectively. Cross-modal retrieval tasks are carried out in this 9-dimensional space, namely retrieving from the image data set the images related to a given text, or retrieving from the text data set the text related to a given image. The cross-modal retrieval results are measured by mean average precision (MAP), where higher is better; here the mean average precision is the mean of the retrieval precision over all queries. Table 1 gives the results of the four algorithms, and it can be seen that discriminant correlation analysis outperforms the other methods.

Table 1

Method	Image as test data (MAP)	Text as test data (MAP)
DCA	0.2108	0.2482
CCA	0.2032	0.2032
CCA+LDA	0.2020	0.2011
LDA+CCA	0.2031	0.2034
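For reference, mean average precision can be computed as in the following sketch. It uses the standard information-retrieval definition, which is an assumption here because the patent does not spell out the formula; treating a retrieved item as relevant when it shares the query's class label is likewise an assumption. The function names and toy relevance lists are hypothetical and are unrelated to the values in Table 1.

```python
def average_precision(relevance):
    """Average precision of one ranked result list.

    `relevance` holds 1 (relevant) or 0 (not relevant) for each ranked item,
    in rank order. Precision is accumulated at each relevant position.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
ap_first = average_precision([1, 0, 1])          # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
print(round(ap_first, 4), round(map_score, 4))   # 0.8333 0.6667
```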

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A cross-media retrieval method based on discriminant correlation analysis, characterized in that the method comprises the following steps:

2. The method according to claim 1, characterized in that, in steps 1 and 6, the scale-invariant feature transform algorithm and the latent Dirichlet allocation algorithm are used to extract features from the images and the text, respectively.

3. The method according to claim 1, characterized in that the linear projection transformation in step 3 is expressed as:

u = a^{T} x,

v = b^{T} y,

where x and y are the variable sets corresponding to the image and text feature point sets, respectively, a and b are the corresponding projection vectors, and u and v are the projection variables obtained by the linear projection transformation.

4. The method according to claim 3, characterized in that setting the objective function according to the resulting projection space further comprises the following steps:

Step 3.1: compute the covariance cov(u, v) of the projection variables u and v in the projection space;

Step 3.2: compute the between-class variance \sigma_{B} and the within-class variance \sigma_{W} of the image and text feature point sets in the projection space.

5. The method according to claim 4, characterized in that, in step 3.1, the covariance cov(u, v) of the projection variables u and v is expressed as:

cov(u, v) = \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} \Sigma \begin{bmatrix} a \\ b \end{bmatrix},

where \Sigma is the matrix that characterizes this covariance.

6. The method according to claim 4, characterized in that, in step 3.2, the between-class variance \sigma_{B} and the within-class variance \sigma_{W} are expressed as:

\sigma_{B} = \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} S_{B} \begin{bmatrix} a \\ b \end{bmatrix},

\sigma_{W} = \begin{bmatrix} a^{T} & b^{T} \end{bmatrix} S_{W} \begin{bmatrix} a \\ b \end{bmatrix},

where S_{B} and S_{W} are called the "between-class scatter matrix" and the "within-class scatter matrix" of the multimodal data:

S_{B} = \frac{1}{2n} \sum_{m=1}^{k} n_{m} \begin{bmatrix} E_{m} x x^{T} & E_{m} x y^{T} \\ E_{m} y x^{T} & E_{m} y y^{T} \end{bmatrix},

S_{W} = \frac{1}{2n} \sum_{m=1}^{k} n_{m} \begin{bmatrix} E_{m}(x x^{T}) - \frac{1}{2} E_{m} x x^{T} & -\frac{1}{2} E_{m} x y^{T} \\ -\frac{1}{2} E_{m} y x^{T} & E_{m}(y y^{T}) - \frac{1}{2} E_{m} y y^{T} \end{bmatrix},

where n is the number of data points in each data point set, n_{m} is the number of data points of the m-th class in each data point set, k is the number of classes,

E_{m}(x x^{T}) = \frac{1}{n_{m}} \sum_{i \in C_{m}} x_{i} x_{i}^{T},

E_{m}(y y^{T}) = \frac{1}{n_{m}} \sum_{i \in C_{m}} y_{i} y_{i}^{T},

E_{m} x x^{T} = E_{m}(x) E_{m}^{T}(x), E_{m} x y^{T} = E_{m}(x) E_{m}^{T}(y), E_{m} y y^{T} = E_{m}(y) E_{m}^{T}(y), E_{m} y x^{T} = E_{m}(y) E_{m}^{T}(x),

C_{m} denotes the m-th class data set, and E_{m}(x) and E_{m}(y) are the means of the m-th class data in the original data point sets.

7. The method according to claim 4, characterized in that the objective function is defined as:

a^{*}, b^{*} = \arg\max_{a, b} \frac{\mu \sigma_{B} + (1 - \mu) cov(u, v)}{\sigma_{W}},

where \mu is a tuning parameter that controls the relative weight of \sigma_{B} and cov(u, v).

8. The method according to claim 1, characterized in that, in step 4, solving the objective function by an eigenvalue method further comprises the following steps:

first, define f = (a, b) and rewrite the objective function;

then, apply the method of Lagrange multipliers to convert the rewritten objective function into an equation from which generalized eigenvalues can be obtained;

finally, solve for the eigenvalues and eigenvectors of this equation, reorder the eigenvectors in descending order of their eigenvalues, and take the eigenvectors corresponding to the largest eigenvalues as the finally learned linear projection vectors a and b.

9. The method according to claim 1, characterized in that the object to be retrieved in step 6 is an image or a text.

10. The method according to claim 1, characterized in that the number n of retrieval results in step 9 is set by the user as required.