CN104537124A - Multi-view metric learning method - Google Patents

Multi-view metric learning method

Info

Publication number
CN104537124A
Authority
CN
China
Prior art keywords
learning method
frame
metric
metric learning
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510042581.XA
Other languages
Chinese (zh)
Other versions
CN104537124B (en)
Inventor
张驰
付彦伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU DEWO INTELLIGENT SYSTEM Co Ltd
Original Assignee
SUZHOU DEWO INTELLIGENT SYSTEM Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU DEWO INTELLIGENT SYSTEM Co Ltd
Priority to CN201510042581.XA
Publication of CN104537124A
Application granted
Publication of CN104537124B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention discloses a multi-view metric learning method for multi-view video summarization. The method comprises the following steps: decomposing a video into a set of frames; learning a unified metric space; clustering in the unified metric space; and selecting specific frames to output as a summary. The method can find the metric that best separates the data while forcing the learned metric to preserve the original intrinsic information between data points.

Description

Multi-view metric learning method
Technical field
The present invention relates to a multi-view metric learning method, and in particular to a multi-view metric learning method for multi-view video summarization.
Background technology
Supervised learning builds a model from a large number of labeled training examples in order to predict the labels of unseen examples. Here a "label" is the output associated with an example: in a classification task it is the class of the example, and in a regression task it is a real-valued output. As the human capacity to collect and store data has grown, large quantities of unlabeled data can be obtained easily in many practical tasks, while labeling them often requires considerable manpower and resources. Semi-supervised learning tries to let the learner exploit large amounts of unlabeled data automatically, assisted by a small amount of labeled data. Unsupervised learning requires no manually supplied labels at all.
Multi-view learning has received a great deal of attention over the past decade, but most existing methods are devoted to semi-supervised learning.
In many real-world applications, unlabeled data are naturally represented by a large number of highly correlated views. In video processing, for example, different cameras may focus on essentially the same field of view. Similarly, in QR-code scanning, multiple fixed scanners at different positions capture the same QR-code target from multiple angles, and a hand-held scanner produces images of the target at multiple angles because of shake. In such cases one wishes to exploit the correlation to help understand and characterize the data, and to find an "optimal" metric that reflects the intrinsic structure of the input data.
Suppose that complex human motion, in a locally fixed coordinate system, is a function of time that is sampled simultaneously by multiple cameras. To uncover the structure of this underlying space, classical methods typically extract a high-dimensional feature vector from each view's video independently, under several assumptions, and then apply a dimensionality-reduction method.
Conventional video summarization methods are designed to summarize a single-view video recording. They therefore cannot fully exploit the redundancy in multi-view recordings, and they ignore the distinctive and complementary information that the different views contribute to the raw data set.
Summary of the invention
The invention provides a multi-view metric learning method for multi-view video summarization. Multi-view videos simultaneously capture different visual projections of the same spatio-temporal scene in real life. The multi-view metric learning of the invention projects all multi-view videos into a new metric space that best models the diverse real-world scene and is used to reveal the intrinsic characteristics of object motion. By preserving the maximal intrinsic characteristics across the different views, this greatly simplifies video summarization. In the learned metric space, the visual data are summarized by clustering and extracting a key frame from each cluster.
The method of the invention combines the advantages of maximum-margin clustering and the disagreement minimization criterion; it can therefore find the metric that best separates the data while forcing the learned metric to preserve the original intrinsic information between data points.
The method of the invention is particularly suitable for the unsupervised setting.
The invention provides a multi-view metric learning method for multi-view video summarization, comprising the following steps:
(1) decompose the video recording into a set of frames, denoted X^(1), ..., X^(K), where X^(k) ∈ R^(n×d_k) holds the d_k-dimensional features of the n frames of the k-th view, R denotes the real numbers, d_k is the dimension of the k-th original space, and n is the number of frames;
(2) from the information in X^(1), ..., X^(K), learn a unified metric space X ∈ R^(n×d), where d is the dimension of the mapped space;
(3) perform clustering on X and take the cluster centers as representatives, denoted F = {f_{i_1}, ..., f_{i_C}}, where F is the summary set and i_1, ..., i_C are frame indices;
(4) for each f_{i_c}, select its corresponding frame from each of the K views, and output these frames as the final summary.
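As an illustration of steps (1) to (4), the sketch below runs the pipeline with a plain k-means in place of the learned metric and simple feature concatenation in place of the learned unified space; the function names and the concatenation shortcut are illustrative assumptions, not the patented method.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means; stands in for any off-the-shelf clusterer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each frame to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

def summarize(views, n_clusters):
    """Steps (1)-(4): views[k] is the (n, d_k) feature matrix of view k.
    Concatenation stands in for the learned unified metric space X."""
    X = np.hstack(views)                        # placeholder for learned X
    centers, labels = lloyd_kmeans(X, n_clusters)
    summary = []
    for c in range(n_clusters):                 # frame nearest each center
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        d = np.linalg.norm(X[members] - centers[c], axis=1)
        summary.append(int(members[np.argmin(d)]))
    # Step (4): for each index, the matching frame of every view is output.
    return sorted(set(summary))
```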
In the learning step, a unified coordinate matrix X ∈ R^(n×d) is found that minimizes
R(X) = R_emp(X) + γ_1 R_struct(X) + γ_2 R_diff(X),
where R_emp(X), R_struct(X) and R_diff(X) are the empirical loss, structural loss and disagreement loss of X respectively, and γ_1, γ_2 are parameters that balance the objectives; and
the empirical loss R_emp(X) is 0;
the structural loss R_struct(X) is Σ_{i=1..c} λ_i(L~_{G_X}), where G_X is the similarity matrix induced by the metric X, L~_{G_X} is its normalized Laplacian, λ_i is the i-th smallest eigenvalue, and c is the number of predefined clusters;
the disagreement loss R_diff(X) is Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), where tr is the trace.
According to a preferred embodiment of the invention, since R_emp(X) = 0, the minimized R(X) is
R(X) = γ_1 Σ_{i=1..c} λ_i(L~_{G_X}) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²).
According to another preferred embodiment of the invention, the minimized R(X) is optimized as
min_P γ_1 tr(P^T L~_{G_X} P) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), subject to P^T P = I,
where P = [p_1, ..., p_c] and p_i is the eigenvector of L~_{G_X} corresponding to its i-th smallest eigenvalue.
According to yet another preferred embodiment of the invention, the minimized R(X) is further optimized as
min_{P,μ} γ_1 tr(P^T (Σ_k μ_k L~_{G^(k)}) P) + γ_2 Σ_k tr((Σ_j μ_j L~_{G^(j)} − L~_{G^(k)})²), subject to P^T P = I, μ_k ≥ 0 and Σ_k μ_k = 1,
where L~_{G_X} = Σ_k μ_k L~_{G^(k)}, μ = (μ_1, ..., μ_K) is the vector of weight factors, and μ_k is the weight factor of the k-th view.
Brief description of the drawings
Fig. 1 shows a flowchart of the method according to some embodiments of the invention.
Detailed description of the embodiments
Specific embodiments of the invention are set forth below with reference to the accompanying drawings. It should be understood that these embodiments are illustrative rather than restrictive.
The aim of maximum-margin clustering is to find the clustering with the largest margin between clusters. The method of the invention optimizes a graph-theoretic measure to find the kernel matrix that allows a larger margin between clusters.
Suppose the low-level features of K different views are given as X^(1), ..., X^(K), where each X^(k) ∈ R^(n×d_k) is a coordinate matrix. The invention seeks a unified coordinate matrix X ∈ R^(n×d) that minimizes
R(X) = R_emp(X) + γ_1 R_struct(X) + γ_2 R_diff(X),
where R_emp(X), R_struct(X) and R_diff(X) are the empirical, structural and disagreement losses of X respectively, and γ_1, γ_2 are parameters that balance the objectives.
The empirical loss R_emp(X) is usually defined through label information. In supervised multiple kernel learning, for example, R_emp(X) is typically defined as the minimum hinge loss attainable under the metric defined by X. The structural loss R_struct(X) can be defined as the complexity of the classifier, or used to ensure that similar examples have similar labels. The disagreement loss R_diff(X) measures how much X differs from the individual X^(k). The disagreement minimization criterion (DMC) is incorporated through R_diff(X).
In the learning step, the invention adopts unsupervised multi-view metric learning.
First, let G^(1), ..., G^(K) ∈ R^(n×n) be the similarity matrices defined on the metric spaces X^(1), ..., X^(K) respectively, where G^(k)(i, j) = G^(k)(j, i) is the similarity of data points x_i and x_j in the k-th view, and n is the number of frames.
Write L~_G for the normalized Laplacian, where the normalized Laplacian of a similarity matrix G is defined as
L~_G = I − D^(−1/2) G D^(−1/2),
where D is the diagonal degree matrix with D(i, i) = Σ_j G(i, j), and I is the identity matrix.
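The normalized Laplacian defined above is straightforward to compute; a minimal sketch, in which giving isolated zero-degree vertices zero weight is an added assumption for robustness:

```python
import numpy as np

def normalized_laplacian(G):
    """L~_G = I - D^{-1/2} G D^{-1/2} for a symmetric similarity matrix G,
    where D is the diagonal degree matrix with D(i, i) = sum_j G(i, j)."""
    d = G.sum(axis=1)
    # Guard against zero-degree vertices (assumption: treat them as isolated).
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    return np.eye(len(G)) - d_inv_sqrt[:, None] * G * d_inv_sqrt[None, :]
```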
A good video summary should have a well-coordinated R_diff and should be invariant to metric transformations of the synchronized frames X, such as rotation, translation and scaling. To this end, the disagreement loss is
R_diff(X) = Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), (3)
where G_X is the similarity matrix induced by the metric X. This equation is invariant to such metric transformations as rotation, translation and scaling, and coordinates the different views well. Moreover, it introduces no additional optimization variables.
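Under the reading of the disagreement loss used here, a trace of the squared difference of normalized Laplacians (equivalently, a squared Frobenius distance for symmetric matrices), it can be evaluated as:

```python
import numpy as np

def disagreement_loss(L_X, view_laplacians):
    """Sum over views of tr((L_X - L_k)^2): the squared Frobenius distance
    between the normalized Laplacian of the learned metric and that of each
    view.  This form is a hedged reconstruction of equation (3)."""
    return float(sum(np.trace((L_X - Lk) @ (L_X - Lk)) for Lk in view_laplacians))
```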
The definition of R_struct(X) is inspired by the following results from spectral graph theory.
First, the multiplicity c of the eigenvalue 0 of the normalized Laplacian equals the number of connected components of the graph.
Second, the remaining small eigenvalues bound how well the graph separates: for any pair of vertices there is an inequality relating the shortest path between them to the eigenvalues λ_i of the normalized Laplacian.
These results show that the c smallest eigenvalues of L~_{G_X} (where G_X is a transformation of the metric X) implicitly determine the quality of a c-way clustering under that metric. The structural loss is therefore defined as
R_struct(X) = Σ_{i=1..c} λ_i(L~_{G_X}).
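Following the definition above, the structural loss is the sum of the c smallest eigenvalues of the normalized Laplacian; a sketch, with the zero-degree guard as an added assumption:

```python
import numpy as np

def structural_loss(G, c):
    """Sum of the c smallest eigenvalues of the normalized Laplacian of G.
    Small values indicate that G separates well into c clusters, since the
    multiplicity of eigenvalue 0 equals the number of connected components."""
    d = G.sum(axis=1)
    d_is = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L = np.eye(len(G)) - d_is[:, None] * G * d_is[None, :]
    return float(np.sort(np.linalg.eigvalsh(L))[:c].sum())
```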
Finally, since the unsupervised setting provides no label information, we simply set R_emp(X) = 0.
Combining the above definitions gives the formulation for unsupervised multi-view metric learning:
min_X γ_1 Σ_{i=1..c} λ_i(L~_{G_X}) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²). (5)
In some alternative embodiments, because R_diff is a disagreement measure between metric spaces, canonical correlation analysis (CCA) might seem a natural choice. However, computing CCA involves optimizing a transformation matrix, which introduces additional optimization variables and makes the optimization intractable.
Simplifying the CCA measure leads to a prediction-based disagreement measure, in which f_X and f_{X^(k)} denote the predictions of classifiers learned under the metrics X and X^(k) respectively.
This definition is recommended when classification results can easily be derived from the learned metric. A problem arises, however, for clustering tasks, where the disagreement between different clusterings is difficult to compute.
In some embodiments, equation (5) is optimized further.
Let P = [p_1, ..., p_c] with P^T P = I, where p_i is the eigenvector of L~_{G_X} corresponding to the i-th smallest eigenvalue.
Once P is found, the metric space is implicitly defined: given L~_{G_X}, P is the coordinate matrix of the metric space, and the k-means algorithm can be used to cluster in this space. For the purpose of clustering it therefore suffices to compute P itself. (Note that a rescaled P has the same eigenvectors and therefore yields the same clustering result.) The optimization problem now becomes
min_P γ_1 tr(P^T L~_{G_X} P) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), subject to P^T P = I.
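The coordinate matrix described above can be sketched directly: its columns are the eigenvectors of the normalized Laplacian for the c smallest eigenvalues, and its rows are the coordinates handed to k-means. The function name is illustrative:

```python
import numpy as np

def spectral_coordinates(L, c):
    """Rows of the returned (n, c) matrix are the implicit coordinates:
    column i is the eigenvector of L for the i-th smallest eigenvalue."""
    vals, vecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return vecs[:, :c]
```

After this step, running k-means on the rows of the returned matrix groups frames that belong to the same cluster in the learned metric.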
For efficiency, it is further assumed that L~_{G_X} = Σ_{k=1..K} μ_k L~_{G^(k)}, with μ_k ≥ 0 and Σ_k μ_k = 1.
First, with μ fixed, P can be obtained by eigendecomposition of L~_{G_X}; then, with P fixed, μ is obtained by quadratic programming as in equation (8) below, iterating until convergence. The quadratic program can be solved efficiently with the Mosek software, and since K is small in practice the cost is essentially constant.
min_μ γ_1 tr(P^T (Σ_k μ_k L~_{G^(k)}) P) + γ_2 Σ_k tr((Σ_j μ_j L~_{G^(j)} − L~_{G^(k)})²), subject to μ_k ≥ 0 and Σ_k μ_k = 1. (8)
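The alternation described here can be sketched as follows. To keep the example self-contained, the quadratic program over μ is replaced by a simpler surrogate (a linear term plus a ||μ||² regularizer) whose simplex-constrained minimizer has a closed form via Euclidean projection, rather than calling Mosek; the surrogate, `lam`, and the function names are assumptions for illustration.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def alternating_opt(laplacians, c, lam=1.0, iters=20):
    """Fix mu -> P from the eigendecomposition of sum_k mu_k * L_k
    (its c smallest eigenvectors); fix P -> mu by minimizing
    sum_k mu_k * tr(P^T L_k P) + lam * ||mu||^2 over the simplex,
    a closed-form surrogate for the quadratic program in the text."""
    mu = np.full(len(laplacians), 1.0 / len(laplacians))
    P = None
    for _ in range(iters):
        Lmix = sum(m * L for m, L in zip(mu, laplacians))
        vals, vecs = np.linalg.eigh(Lmix)       # ascending eigenvalues
        P = vecs[:, :c]
        a = np.array([np.trace(P.T @ L @ P) for L in laplacians])
        mu = project_simplex(-a / (2.0 * lam))  # smaller a_k -> larger mu_k
    return mu, P
```

Views whose Laplacian agrees well with the current embedding (small tr(P^T L_k P)) receive larger weights, which matches the intent of the weighted combination in the text.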
According to certain embodiments of the invention, for generating a video summary, it is assumed that each real-world event E_i corresponds to a distribution D_i centered on a small region of a "latent" semantic space. Each "instance" of event E_i is a data point x_ij sampled from D_i in the latent semantic space.
The invention processes videos of the same scene taken from different viewing angles, so the high-dimensional low-level features of each view are embedded into the same low-dimensional space. This is what motivates applying the disagreement minimization criterion in the learning.
Fig. 1 shows the process flow diagram of the method according to some embodiments of the present invention.
Referring to Fig. 1, multi-view video summarization is carried out as follows:
(1) decompose the video recording into a set of frames, denoted X^(1), ..., X^(K), where X^(k) ∈ R^(n×d_k) holds the d_k-dimensional features of the n frames of the k-th view;
(2) from the information in X^(1), ..., X^(K), learn a unified metric space X ∈ R^(n×d);
(3) perform clustering on X and take the cluster centers as representatives, denoted F = {f_{i_1}, ..., f_{i_C}};
(4) for each f_{i_c}, select the corresponding frame from each of the K views, and output these frames as the final summary.
The method according to the invention was compared experimentally with existing methods on a previously collected office data set, and was found to outperform them substantially. The experimental results are shown in the following table:
Method | Events | Precision (%) | Recall (%)
Uniform / Random | 10 / 5 | 70 / 60 | 26.9 / 11.5
ED / DM | 10 / 13 | 80 / 76.9 | 30.8 / 38.5
The invention | 20 | 100 | 76.9
where "Uniform" denotes uniform sampling of the video and "Random" denotes random sampling of frames; "ED" denotes the Euclidean distance method and "DM" denotes the diffusion metric method.
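Assuming precision and recall are computed over detected events in the usual way, the table's figures are internally consistent with a ground truth of 26 events (for the invention's row, 20 correct detections give 100% precision and 20/26 ≈ 76.9% recall). The helper below is a generic sketch, not the evaluation code of the invention:

```python
def precision_recall(detected, ground_truth):
    """Precision = fraction of detected events that are true events;
    recall = fraction of true events covered by the detections."""
    detected, ground_truth = set(detected), set(ground_truth)
    tp = len(detected & ground_truth)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```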
The invention has been described above through specific embodiments; it should be appreciated that the description is illustrative rather than restrictive. For example, features of the embodiments may be combined with one another without departing from the scope of the invention, and the embodiments may be modified to suit particular circumstances without departing from its spirit. Although specific elements and procedures are defined in the embodiments described herein, they are merely exemplary and in no way limiting; many other embodiments will be apparent to those skilled in the art from the above description. The scope of the invention should therefore be determined with reference to the claims, together with the full scope of their equivalents.

Claims (4)

1. A multi-view metric learning method for multi-view video summarization, comprising the following steps:
(1) decompose the video recording into a set of frames, denoted X^(1), ..., X^(K), where X^(k) ∈ R^(n×d_k) holds the d_k-dimensional features of the n frames of the k-th view, R denotes the real numbers, d_k is the dimension of the k-th view space, and n is the number of frames;
(2) from the information in X^(1), ..., X^(K), learn a unified metric space X ∈ R^(n×d), where d is the dimension of the mapped space;
(3) perform clustering on X and take the cluster centers as representatives, denoted F = {f_{i_1}, ..., f_{i_C}}, where F is the summary set and i_1, ..., i_C are frame indices;
(4) for each f_{i_c}, select its corresponding frame from each of the K views, and output these frames as the final summary;
wherein, in the learning step, a unified coordinate matrix X ∈ R^(n×d) is found that minimizes
R(X) = R_emp(X) + γ_1 R_struct(X) + γ_2 R_diff(X),
where R_emp(X), R_struct(X) and R_diff(X) are the empirical loss, structural loss and disagreement loss of X respectively, and γ_1, γ_2 are parameters that balance the objectives; and
the empirical loss R_emp(X) is 0;
the structural loss R_struct(X) is Σ_{i=1..c} λ_i(L~_{G_X}), where G_X is the similarity matrix induced by the metric X, L~_{G_X} is its normalized Laplacian, λ_i is the i-th smallest eigenvalue, and c is the number of predefined classes;
the disagreement loss R_diff(X) is Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), where tr is the trace.
2. The multi-view metric learning method according to claim 1, wherein, since R_emp(X) = 0, the minimized R(X) is
R(X) = γ_1 Σ_{i=1..c} λ_i(L~_{G_X}) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²).
3. The multi-view metric learning method according to claim 2, wherein the minimized R(X) is optimized as
min_P γ_1 tr(P^T L~_{G_X} P) + γ_2 Σ_{k=1..K} tr((L~_{G_X} − L~_{G^(k)})²), subject to P^T P = I,
where P = [p_1, ..., p_c] and p_i is the eigenvector of L~_{G_X} corresponding to its i-th smallest eigenvalue.
4. The multi-view metric learning method according to claim 3, wherein the minimized R(X) is further optimized as
min_{P,μ} γ_1 tr(P^T (Σ_k μ_k L~_{G^(k)}) P) + γ_2 Σ_k tr((Σ_j μ_j L~_{G^(j)} − L~_{G^(k)})²), subject to P^T P = I, μ_k ≥ 0 and Σ_k μ_k = 1,
where L~_{G_X} = Σ_k μ_k L~_{G^(k)}, μ represents the vector of weight factors, and μ_k represents the weight factor of the k-th view.
CN201510042581.XA 2015-01-28 2015-01-28 Multiple view metric learning method Active CN104537124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510042581.XA CN104537124B (en) 2015-01-28 2015-01-28 Multiple view metric learning method


Publications (2)

Publication Number Publication Date
CN104537124A true CN104537124A (en) 2015-04-22
CN104537124B CN104537124B (en) 2018-08-07

Family

ID=52852652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510042581.XA Active CN104537124B (en) 2015-01-28 2015-01-28 Multiple view metric learning method

Country Status (1)

Country Link
CN (1) CN104537124B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011123644A (en) * 2009-12-10 2011-06-23 Nec Corp Data processing apparatus, data processing method and data processing program
CN102938070A (en) * 2012-09-11 2013-02-20 广西工学院 Behavior recognition method based on action subspace and weight behavior recognition model
US20130238664A1 (en) * 2012-03-08 2013-09-12 eBizprise Inc. Large-scale data processing system, method, and non-transitory tangible machine-readable medium thereof
CN103577841A (en) * 2013-11-11 2014-02-12 浙江大学 Human body behavior identification method adopting non-supervision multiple-view feature selection


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292341A (en) * 2017-06-20 2017-10-24 西安电子科技大学 Adaptive multi views clustering method based on paired collaboration regularization and NMF
CN107292341B (en) * 2017-06-20 2019-12-10 西安电子科技大学 self-adaptive multi-view clustering method based on pair-wise collaborative regularization and NMF
CN107563403A (en) * 2017-07-17 2018-01-09 西南交通大学 A kind of recognition methods of bullet train operating condition
CN107886109A (en) * 2017-10-13 2018-04-06 天津大学 It is a kind of based on have supervision Video segmentation video summarization method
CN107886109B (en) * 2017-10-13 2021-06-25 天津大学 Video abstraction method based on supervised video segmentation
CN110472484A (en) * 2019-07-02 2019-11-19 山东师范大学 Video key frame extracting method, system and equipment based on multiple view feature
CN110472484B (en) * 2019-07-02 2021-11-09 山东师范大学 Method, system and equipment for extracting video key frame based on multi-view characteristics

Also Published As

Publication number Publication date
CN104537124B (en) 2018-08-07


Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Multi view metric learning method

Effective date of registration: 20230721

Granted publication date: 20180807

Pledgee: Societe Generale Bank Co.,Ltd. Suzhou Branch

Pledgor: Suzhou Dewo Smart System Co.,Ltd.

Registration number: Y2023980049259
