CN105095863A - Human behavior recognition method based on similarity-weighted semi-supervised dictionary learning - Google Patents


Info

Publication number: CN105095863A (application CN201510414039.2A; granted as CN105095863B)
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Inventors: 张向荣, 焦李成, 孙志豪, 马文萍, 侯彪, 白静, 马晶晶, 冯婕
Assignee (original and current): Xidian University (the listed assignees may be inaccurate)
Application filed by Xidian University; priority to CN201510414039.2A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/23: Recognition of whole body movements, e.g. for sport training


Abstract

The invention discloses a human behavior recognition method based on similarity-weighted semi-supervised dictionary learning, which addresses the low recognition rates of existing supervised methods. The method: (1) divides the input data set into test samples and training samples; (2) performs local feature detection on all samples and randomly samples the local features of the labelled samples to obtain an initial dictionary; (3) starting from the initial dictionary, learns a dictionary with a semi-supervised method; (4) applies group sparse coding to all samples with the learned dictionary to obtain a coding matrix for each sample; (5) vectorises each coding matrix to obtain the final representation; and (6) classifies the test samples using these representations and a sparse-representation classification method, completing human behavior recognition on the test samples. The approach strengthens the discriminative power of dictionary learning and improves the human behavior recognition rate; it can be used for target detection in video.

Description

Human behavior recognition method based on similarity-weighted semi-supervised dictionary learning
Technical field
The invention belongs to the field of pattern recognition, and in particular relates to a method for recognising the behavior of target persons in video; it can be used for target detection in video.
Background
Human behavior recognition means identifying the behavior of a target in a video sequence in preparation for subsequent processing. It comprises detecting the relevant visual information of the target in the video sequence, expressing it in a suitable form, and finally interpreting this information so that the behavior of people can be learned and recognised.
In recent years, unsupervised and supervised dictionary learning have been applied successfully to image classification and activity recognition. In human behavior recognition, the two differ in whether the labels of the video sequences are used: unsupervised dictionary learning does not use the label information of the videos, while supervised dictionary learning does. Recognition and other follow-up work are then carried out with the learned dictionary. Supervised dictionary learning divides into the following steps:
Step 1, obtain local features: a local feature detector, such as the Harris3D, Hessian, or Cuboid detector, automatically detects regions of interest in the video, which are then described with a corresponding descriptor;
Step 2, obtain the initial dictionary: all the local feature descriptors of the videos are clustered with K-means; the resulting cluster centres are the so-called visual keywords, and the number of cluster centres, also called the bag-of-words size, can be set manually in advance;
Step 3, obtain the dictionary: the objective function is solved, generally by two repeated steps, solving for the coding coefficients and updating the dictionary alternately, until the stopping condition is reached.
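The codebook step above (Step 2) can be sketched in code. Below is a minimal NumPy K-means for clustering pooled local descriptors into visual keywords; the feature dimension, cluster count, and toy data are illustrative only, not values from the patent:

```python
import numpy as np

def kmeans_codebook(features, m, iters=20, seed=0):
    """Cluster pooled local descriptors into m visual keywords.

    features: (n, d) array of local feature descriptors from all videos.
    Returns a (d, m) codebook whose columns are the cluster centres.
    """
    rng = np.random.default_rng(seed)
    centres = features[rng.choice(len(features), m, replace=False)].copy()
    for _ in range(iters):
        # assign every descriptor to its nearest centre
        d2 = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # move each centre to the mean of its members (keep it if empty)
        for k in range(m):
            members = features[labels == k]
            if len(members):
                centres[k] = members.mean(axis=0)
    return centres.T

rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(c, 0.05, (60, 8)) for c in (0.0, 1.0, 2.0)])
D0 = kmeans_codebook(feats, m=3)
print(D0.shape)  # (8, 3)
```

In practice the bag-of-words size m is set in advance, as the text notes, and the clustering is run over the descriptors of all training videos.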
It can be seen that, compared with unsupervised dictionary learning, supervised dictionary learning uses the label information of the video sequences, and the various supervised dictionary learning methods differ precisely in how they use that information. But because obtaining labels for targets costs considerable manpower and resources, videos in real life are often unlabelled, and supervised dictionary learning methods do not consider the information of unlabelled samples.
In 2014, Y. Sun et al. built on group sparsity by introducing a weighted group sparse constraint whose purpose is to make the dictionary atoms participating in the coding of a video come from the same class as far as possible, thereby proposing a more discriminative supervised dictionary learning method. The method makes full use of the information of labelled samples but does not use the information of unlabelled ones; see Sun Y, Liu Q, Tang J, et al. Learning discriminative dictionary for group sparse representation [J]. IEEE Transactions on Image Processing, 2014, 23(9): 3816-3828.
Although this method can learn a more discriminative dictionary and improve recognition accuracy, its shortcoming is also obvious: it considers only labelled samples, ignores the information of unlabelled samples, and so does not make full use of the data. In practice labelled samples are often very hard to obtain, while unlabelled samples are easy to collect and exist in large numbers; how to fully extract and exploit the information of large numbers of unlabelled samples has therefore become a key issue in this field.
Summary of the invention
The object of the invention is to propose a human behavior recognition method based on semi-supervised dictionary learning with similarity weights, which improves human behavior recognition accuracy by extracting the information of unlabelled videos.
The technical idea of the invention is: introduce unlabelled videos, learn a more discriminative dictionary and thereby obtain the coding of each video, and apply this to human behavior recognition. The implementation steps are as follows:
(1) Input a video data set containing c behavior classes, comprising a training data set and a test data set; the training data set consists of n_l videos with class labels and n_u unlabelled videos, the test data set of n_t test videos; each video contains exactly one behavior and serves as one sample;
(2) Extract the local features of each video: apply the spatio-temporal Harris corner detection method to detect local feature regions in each video, extract histogram-of-gradients and histogram-of-optical-flow features at the detected regions, and concatenate the two features to obtain the local features of the behavior in each video;
(3) From the training set, obtain the initial dictionary D^(0) ∈ R^{d×m} by randomly sampling the local features of each class of video samples, where d is the dimension of a sample's local features and m is the number of dictionary atoms:
3a) Let X_i denote the local features of the i-th class of video samples in the training set, where n_i is the number of labelled samples of class i, i = 1, 2, ..., c, and c is the number of classes of video samples;
3b) Randomly sample the local features X_i of the i-th class to obtain the class-i initial sub-dictionary D_i^(0); concatenating all initial class sub-dictionaries gives the initial dictionary D^(0) = [D_1^(0), ..., D_c^(0)], where d is the dimension of a local feature, b is the number of atoms of each class sub-dictionary, and m = c*b is the number of atoms of the initial dictionary.
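Step (3) amounts to sampling b feature columns per class and concatenating the class blocks. A small NumPy sketch, with illustrative sizes (d, b, and the class count are not the patent's values):

```python
import numpy as np

def init_dictionary(class_features, b, seed=0):
    """Build D^(0) by randomly sampling b local features from each class.

    class_features: list of (d, n_i) arrays, one per class, holding the local
    features of that class's labelled videos. Returns D^(0) of shape (d, c*b):
    the c class sub-dictionaries D_i^(0) laid side by side.
    """
    rng = np.random.default_rng(seed)
    blocks = []
    for X_i in class_features:
        cols = rng.choice(X_i.shape[1], size=b, replace=False)
        blocks.append(X_i[:, cols])          # class sub-dictionary D_i^(0)
    return np.hstack(blocks)

d, b = 16, 5
feats = [np.random.rand(d, 40) for _ in range(3)]  # 3 classes of pooled features
D0 = init_dictionary(feats, b)
print(D0.shape)  # (16, 15)
```

Keeping the class blocks contiguous matters later: the group sparse constraint and the similarity constraint both operate on these per-class blocks of atoms.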
(4) Construct the coding weight matrix A^(t) ∈ R^{m×n}, where n = n_l + n_u is the number of all training samples, t = 0, 1, ..., T_max, T_max is the maximum number of iterations, and each column of the weight matrix is the weight vector of the corresponding sample;
(5) Using the dictionary D^(t) obtained at the t-th iteration, encode the local features of the l-th video sample by optimising the objective below, obtaining the coding matrix B_l^(t) of the l-th video sample at iteration t:

    min_{B_l^(t)} (1/2)||Y_l - D^(t)B_l^(t)||_F^2 + λ1||B_l^(t)||_{1,1} + λ2||diag(A_·l^(t))B_l^(t)||_{2,1}

where Y_l is the local feature matrix of the l-th video sample, l = 1, 2, ..., n; A_·l^(t) is the l-th column of the weight matrix; ||·||_F is the Frobenius norm; ||·||_{1,1} is the 1,1 matrix norm, i.e. the sum over the rows p of the coding matrix of their 1-norms ||·||_1; ||·||_{2,1} is the 2,1 matrix norm. The first term is the reconstruction error of the coding of the video sample; the second is the sparsity constraint on the coding matrix; the third is the group sparse constraint, which pushes the dictionary atoms participating in the coding to come from the same class sub-dictionary. λ1 is the parameter of the sparsity constraint and λ2 the parameter of the group sparse constraint;
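To make the roles of the three terms concrete, the objective of step (5) can be evaluated directly with NumPy; this is only the objective value, not the solver, and all sizes are illustrative:

```python
import numpy as np

def coding_objective(Y, D, B, a_l, lam1, lam2):
    """(1/2)||Y - D B||_F^2 + lam1*||B||_{1,1} + lam2*||diag(a_l) B||_{2,1}.

    a_l is the sample's column of the weight matrix A: rows of B belonging to
    atoms with a large weight contribute heavily to the 2,1 term, so the
    optimiser drives those rows to zero and keeps the coding within one class
    sub-dictionary.
    """
    recon = 0.5 * np.linalg.norm(Y - D @ B, 'fro') ** 2
    l11 = lam1 * np.abs(B).sum()                                 # sum of row 1-norms
    l21 = lam2 * np.linalg.norm(a_l[:, None] * B, axis=1).sum()  # sum of row 2-norms
    return recon + l11 + l21

# tiny check: identity dictionary, perfect reconstruction
Y = np.eye(2); D = np.eye(2); B = np.eye(2)
print(coding_objective(Y, D, B, np.array([1.0, 2.0]), 1.0, 1.0))  # 5.0
```

In the toy call the reconstruction term is 0, the 1,1 term is 2, and the weighted 2,1 term is 1 + 2 = 3, illustrating how a larger weight on an atom's row raises its penalty.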
(6) Update the dictionary by optimising the objective below, obtaining the dictionary D^(t+1) of iteration t+1:

    min_{D^(t+1)} Σ_{l=1}^{n} (1/2)||Y_l - D^(t+1)B_l^(t)||_F^2 + λ3 Σ_{i<j≤c} ||(D_i^(t+1))^T D_j^(t+1)||_F^2

where the second term is the similarity constraint between class sub-dictionaries, introduced to increase their mutual discriminability; (·)^T is the transpose; λ3 is the parameter of the similarity constraint;
(7) Repeat steps (4)-(6) until the objective converges or the maximum number of iterations is reached, obtaining the final dictionary D;
(8) With the final dictionary D, obtain the coding matrix B_g of each video sample by optimising:

    min_{B_g} (1/2)||Y_g - D·B_g||_F^2 + γ||B_g||_{2,1},  g = 1, 2, ..., h,

where ||·||_F is the Frobenius norm and ||·||_{2,1} the 2,1 norm; the first term is the reconstruction error of the coding of the video sample; ||B_g||_{2,1} is the group sparse constraint on the coding matrix B_g; h = n_l + n_u + n_t is the number of all video samples; γ is the parameter of the group sparse constraint;
(9) For the local features of all video samples, apply the max-pooling algorithm to the coding matrix B_g obtained in step (8), expressing each video sample as an m-dimensional coding vector z_g:

    z_g = [ẑ_1, ẑ_2, ..., ẑ_k, ..., ẑ_m]^T,  k = 1, 2, ..., m,

where ẑ_k = max(|B_g|k1|, |B_g|k2|, ..., |B_g|kq|, ..., |B_g|kK|), g = 1, 2, ..., h, q = 1, 2, ..., K; B_g|kq is the entry in row k, column q of the coding matrix B_g of the g-th video sample, and K is the number of local features of that video;
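Step (9)'s max pooling is a one-liner over the rows of the coding matrix; a sketch with a hand-made 2x3 coding matrix:

```python
import numpy as np

def max_pool(B):
    """Collapse an m x K coding matrix into an m-vector: z_k = max_q |B[k, q]|."""
    return np.abs(B).max(axis=1)

B = np.array([[0.1, -0.9, 0.2],
              [0.0,  0.3, -0.4]])
print(max_pool(B))  # [0.9 0.4]
```

Pooling over the local features makes the final representation independent of K, the per-video feature count, so videos of different lengths all map to m-dimensional vectors.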
(10) Form the sparse-representation classification dictionary D̂ = [D̂_1, ..., D̂_c] ∈ R^{m×n_l} from the coding vectors of all labelled training samples, where D̂_i consists of the coding vectors of the training samples whose class label is i, i = 1, 2, ..., c, c is the total number of classes, n_l is the number of labelled training samples, and the number of labelled samples of class i determines the number of columns of D̂_i;
(11) With the classification dictionary D̂, sparsely encode the coding vector ŷ of each test sample obtained in step (9), obtaining the coding coefficients β of the test sample on the classification dictionary by:

    min_β { ||ŷ - D̂β||_2^2 + η||β||_1 },

where ||·||_2 is the vector 2-norm, ||·||_1 the vector 1-norm, and η is the parameter balancing the fitting error against the sparsity of the coding; η ranges over 0 to 1;
(12) With the coding coefficients β, compute in turn the residual of each test sample on every class sub-dictionary D̂_i:

    r_i(ŷ) = ||ŷ - D̂_i β_i||_2^2 / ||β_i||_2,  i = 1, ..., c,

where β_i is the coding coefficients of the current test sample on the i-th class sub-dictionary;
(13) Classify each test sample by the size of these residuals: find the class sub-dictionary producing the smallest residual and take its class label i as the label of the current test sample, completing the classification of all test samples in turn.
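Steps (12)-(13) together form the classic sparse-representation classification rule; a small NumPy sketch (the two one-atom sub-dictionaries are toy data, and the zero-coefficient guard is an added safety measure not spelled out in the patent):

```python
import numpy as np

def src_classify(y, sub_dicts, beta):
    """Assign y to the class sub-dictionary with the smallest residual
    r_i = ||y - D_i beta_i||_2^2 / ||beta_i||_2.

    sub_dicts: list of per-class dictionaries D_i (same row dimension as y);
    beta: the full sparse code over [D_1 ... D_c], concatenated in class order.
    """
    residuals, start = [], 0
    for D_i in sub_dicts:
        k = D_i.shape[1]
        b_i = beta[start:start + k]
        start += k
        n = np.linalg.norm(b_i)
        # a class whose coefficients are all zero cannot explain y at all
        r = np.inf if n == 0 else np.linalg.norm(y - D_i @ b_i) ** 2 / n
        residuals.append(r)
    return int(np.argmin(residuals)), residuals

D1 = np.array([[1.0], [0.0]])
D2 = np.array([[0.0], [1.0]])
label, res = src_classify(np.array([0.0, 1.0]), [D1, D2], np.array([0.0, 1.0]))
print(label)  # 1
```

Dividing by ||β_i||_2 favours classes whose coefficients carry most of the code's energy, not just classes that happen to reconstruct y moderately well.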
Compared with the prior art, the present invention has the following advantages:
1. The semi-supervised dictionary learning used by the invention, unlike supervised and unsupervised dictionary learning methods, takes full account of the large numbers of unlabelled samples that exist. Its advantage over both is most pronounced when labelled samples are few, which better matches real applications.
2. The invention uses the k-nearest-neighbour method to obtain the weight vectors of the unlabelled samples; the weight vectors introduce the local spatial information of the features and enhance the discriminability with which the final dictionary represents the video samples.
Brief description of the drawings
Fig. 1 is a schematic diagram of the implementation of the invention;
Fig. 2 shows sample frames taken from the Weizmann data set used in the experiments of the invention;
Fig. 3 shows sample frames taken from the KTH data set used in the experiments of the invention;
Fig. 4 is the classification confusion matrix of the invention on the Weizmann data set;
Fig. 5 is the classification confusion matrix of the invention on the KTH data set.
Embodiment
With reference to Fig. 1, the invention comprises three main parts: dictionary learning, video representation, and video classification. The implementation steps of these three parts are described in turn below:
Part 1: Dictionary learning
Step 1: divide all video samples into training samples and test samples.
1a) Input all video samples of the human behavior recognition data set and their true labels i, where i ∈ {1, 2, ..., c} is the class label of a video sample, c is the total number of class labels, and h is the number of all video samples. Following the split suggested by the authors of the data set, choose n video samples as training samples; the remaining h - n video samples of the data set are the test samples;
1b) According to the true labels of the training samples, choose w video samples of each true label i as samples of known label, i.e. labelled samples; the remaining training video samples are treated as samples of unknown label, i.e. unlabelled samples. The number of labelled samples is then w*c and the number of unlabelled samples n - w*c.
Step 2: input all training samples, the test samples, and the true labels i of the labelled training samples, and obtain the local features of each video sample.
Each video sample contains exactly one human behavior. The spatio-temporal Harris corner detection method is used to detect local feature regions of the behavior in the video; the histogram-of-gradients and histogram-of-optical-flow features of the behavior are extracted at the detected regions and concatenated, giving the local feature set of a video sample:

    X_a^i = [x_1, x_2, ..., x_q, ..., x_{b_a^i}] ∈ R^{d×b_a^i},

where X_a^i is the local feature set of the a-th labelled video sample of class i in the training set, a = 1, 2, ..., n_i; n_i is the number of labelled samples of class i; x_q is the q-th local feature of this video sample; b_a^i is the number of local features of this sample; and d is the dimension of a local feature.
Step 3: build the initial dictionary D^(0) from the local features of all labelled video samples in the training set.
3a) Let the local feature set of the i-th class of training video samples be X_i;
3b) Randomly sample the local feature set X_i of the i-th class to obtain the class-i initial sub-dictionary D_i^(0); concatenating all initial class sub-dictionaries gives the initial dictionary D^(0) = [D_1^(0), ..., D_c^(0)] ∈ R^{d×m}, where i = 1, 2, ..., c, d is the dimension of a local feature, b is the number of atoms of each class sub-dictionary, and m = c*b is the number of atoms of the initial dictionary.
Step 4: construct the weight matrix A^(t) of the t-th iteration.
4a) For each labelled sample in the training set, obtain its weight vector as follows:
4a1) Compute the p-th element A_pl^(t) of this video sample's weight vector (the defining formula is rendered as an image in the original), where p = 1, 2, ..., m, l = 1, 2, ..., n;
4a2) Computing every element of the weight vector gives the weight vector A_·l^(t) of this video sample.
4b) Compute the weight vector of each unlabelled video sample in the training set:
4b1) With the k-nearest-neighbour method, find for each local feature of this video sample its k nearest dictionary atoms in the iteration-t dictionary D^(t), and record them in the sample's neighbour matrix L ∈ R^{m×K}, whose entry L_ps (row p, column s) indicates whether atom p is among the k neighbours of the s-th local feature, where p = 1, 2, ..., m, s = 1, 2, ..., K, and K is the number of local features in this video;
4b2) Computing every element L_ps gives the neighbour matrix L of this video sample;
4b3) Sum each row of L to obtain a column vector, denoted LL;
4b4) From LL, compute the p-th element of this video sample's weight vector (the defining formula is rendered as an image in the original), where p = 1, 2, ..., m, δ is a scale parameter, LL_p is the p-th element of LL, and max(LL) is the largest element of LL;
4b5) Computing every element of the weight vector gives the weight vector of this video sample.
4c) Computing the weight vector corresponding to each column of A^(t) ∈ R^{m×n} for every training sample gives the weight matrix A^(t) of all training samples, where n is the number of all training samples and t = 0, 1, ..., T_max, T_max being the maximum number of iterations; each column of the weight matrix is the weight vector of the corresponding training sample.
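Steps 4b1)-4b5) can be sketched with NumPy. Because the two weight formulas appear only as images in this text, the exponential mapping from neighbour counts to weights below is an assumed stand-in: it uses the scale parameter δ and gives frequently neighbouring atoms small weights, matching the role the weights play in the coding objective:

```python
import numpy as np

def unlabeled_weights(Y, D, k, delta):
    """Weight vector for an unlabelled video via k-nearest-neighbour counts.

    Y: (d, K) local features of the video; D: (d, m) current dictionary.
    """
    d2 = ((D[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)   # (m, K) squared distances
    L = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=0)[:k]                      # k nearest atoms per feature
    np.put_along_axis(L, nearest, 1.0, axis=0)                # neighbour indicator matrix
    LL = L.sum(axis=1)                                        # per-atom neighbour counts
    # assumed weight form: frequent neighbours -> small weight (the patent's
    # formula is an image in this text; this exponential is a stand-in)
    return np.exp(-LL / (delta * max(LL.max(), 1.0)))

D = np.array([[0.0, 10.0], [0.0, 10.0], [0.0, 10.0]])  # two atoms, d = 3
Y = np.zeros((3, 4))                                   # four features near atom 0
w = unlabeled_weights(Y, D, k=1, delta=1.0)
print(w[0] < w[1])  # True
```

Atoms that the video's features cluster around receive small weights, so the group sparse penalty leaves their rows free to participate in the coding.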
Step 5: encode each training sample with the dictionary D^(t) obtained at the t-th iteration.
5a) For the l-th video sample Y_l in the training set, the objective for its iteration-t coding matrix B_l^(t) is formula <1>:

    min_{B_l^(t)} (1/2)||Y_l - D^(t)B_l^(t)||_F^2 + λ1||B_l^(t)||_{1,1} + λ2||diag(A_·l^(t))B_l^(t)||_{2,1},   <1>

where l = 1, 2, ..., n; A_·l^(t) is the l-th column of the weight matrix; ||·||_F is the Frobenius norm, ||·||_{1,1} the 1,1 norm, and ||·||_{2,1} the 2,1 norm. The first term is the reconstruction error of the coding of the video sample, the second the sparsity constraint on the coding matrix, and the third the group sparse constraint, which pushes the dictionary atoms participating in the coding to come from the same class sub-dictionary; λ1 is the parameter of the sparsity constraint and λ2 that of the group sparse constraint.
5b) Optimise formula <1> to obtain the iteration-t coding matrix B_l^(t) of this video sample:
5b1) Differentiate formula <1> with respect to the entry B_l|rq^(t) in row r, column q of the coding matrix of the l-th video sample, obtaining formula <2>:

    ∂f/∂B_l|rq^(t) = Σ_{j≠r} B_l|jq^(t-1) (d_j^(t)·d_r^(t)) - Y_·q^l·d_r^(t) + ||d_r^(t)||_2^2 B_l|rq^(t) + λ1 ∂||B_l|rq^(t)||_1/∂B_l|rq^(t) + λ2 A_rl^(t) B_l|rq^(t) / ||B_l|r·^(t)||_2,   <2>

where f = (1/2)||Y_l - D^(t)B_l^(t)||_F^2 + λ1||B_l^(t)||_{1,1} + λ2||diag(A_·l^(t))B_l^(t)||_{2,1}; ||·||_2 is the vector 2-norm and ||·||_2^2 its square; d_j^(t)·d_r^(t) is the inner product of two vectors; B_l|rq^(t) is the entry in row r, column q of the iteration-t coding matrix of the l-th video sample and B_l|r·^(t) its r-th row; q indexes the q-th local feature of the video sample; d_r^(t) is the r-th column of the dictionary D^(t), r = 1, 2, ..., m;
5b2) Setting formula <2> to zero gives formula <3>:

    B_l|rq^(t) = (1 - λ2 A_rl^(t)/||B_l|r·^(t)||_2) (v'_q - λ1) / ||d_r^(t)||_2^2  if v'_q > λ1, and 0 if v'_q < λ1,   <3>

where v'_q = max(v_q, 0) and v_q = Y_·q^l·d_r^(t) - Σ_{j≠r} B_l|jq^(t-1) (d_j^(t)·d_r^(t));
5b3) Computing every entry of the iteration-t coding matrix gives the coding matrix B_l^(t) of this video sample.
Step 6: update the dictionary, obtaining the dictionary of each iteration.
6a) The objective for the iteration-(t+1) dictionary D^(t+1) is formula <4>:

    min_{D^(t+1)} Σ_{l=1}^{n} (1/2)||Y_l - D^(t+1)B_l^(t)||_F^2 + λ3 Σ_{i<j≤c} ||(D_i^(t+1))^T D_j^(t+1)||_F^2,   <4>

where the second term is the similarity constraint between class sub-dictionaries, introduced to increase their mutual discriminability; (·)^T is the transpose; D_i^(t+1) is the class-i sub-dictionary at iteration t+1; λ3 is the parameter of the similarity constraint;
6b) Differentiating formula <4> with respect to the r-th atom d_r^(t+1) of the iteration-(t+1) dictionary and setting the result to zero gives formula <5>:

    d_r^(t+1) = (v(r, r)·I + λ3 M·M^T)^{-1} u(:, r),   <5>

where r ∈ {1, 2, ..., m} and i ∈ {1, 2, ..., c}; the partial dictionary M is formed from D^(t) by removing the class-i sub-dictionary D_i^(t) to which atom d_r belongs; (·)^T is the transpose and (·)^{-1} the matrix inverse; u(:, r) = vv(:, r) - D^(t)·v(:, r) + v(r, r)·d_r^(t), with v = Σ_l B_l^(t)·(B_l^(t))^T and vv = Σ_l Y_l·(B_l^(t))^T;
6c) Computing every atom of the iteration-(t+1) dictionary gives the dictionary D^(t+1).
Step 7: repeat steps 4-6 until the objective converges or the maximum number of iterations is reached, obtaining the final dictionary D.
Part 2: Video coding
Step 8: with the final dictionary D, obtain the coding matrix B_g of each video sample by optimising:

    min_{B_g} (1/2)||Y_g - D·B_g||_F^2 + γ||B_g||_{2,1},  g = 1, 2, ..., h,

where ||·||_F is the Frobenius norm and ||·||_{2,1} the 2,1 norm; the first term is the reconstruction error of the coding of the video sample; ||B_g||_{2,1} is the group sparse constraint on the coding matrix B_g; γ is the parameter of the group sparse constraint.
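The patent does not spell out the solver used for step 8's convex objective; proximal gradient (ISTA) with the row-wise group-shrinkage operator is a standard stand-in and can serve as a sketch (all sizes and data are illustrative):

```python
import numpy as np

def rowshrink(B, tau):
    """Proximal operator of tau*||.||_{2,1}: shrink each row's 2-norm by tau."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * B

def group_sparse_code(Y, D, gamma, iters=200):
    """min_B 0.5*||Y - D B||_F^2 + gamma*||B||_{2,1} by proximal gradient."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L, L = squared largest singular value
    B = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(iters):
        B = rowshrink(B - step * (D.T @ (D @ B - Y)), gamma * step)
    return B

def objective(Y, D, B, gamma):
    return 0.5 * np.linalg.norm(Y - D @ B, 'fro') ** 2 \
        + gamma * np.linalg.norm(B, axis=1).sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 12)); D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
Y = rng.normal(size=(8, 5))
B = group_sparse_code(Y, D, gamma=0.5)
print(objective(Y, D, B, 0.5) <= objective(Y, D, np.zeros_like(B), 0.5))  # True
```

The row-wise shrinkage zeroes whole rows of B at once, which is exactly the behaviour the 2,1 norm is chosen for: a dictionary atom is either used across the video's features or dropped entirely.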
Step 9: vectorise each coding matrix to obtain the final coding vector of each sample.
9a) Apply the max-pooling algorithm to the coding matrix B_g of each video sample obtained in step 8, taking the maximum absolute value of each row:

    ẑ_k = max(|B_g|k1|, |B_g|k2|, ..., |B_g|ki|, ..., |B_g|kK|),

where g = 1, 2, ..., h; k = 1, 2, ..., m; B_g|ki is the entry in row k, column i of the coding matrix B_g of the g-th video sample; K is the number of local features of that video;
9b) Stacking the row maxima into a column vector z_g = [ẑ_1, ..., ẑ_k, ..., ẑ_m]^T, k = 1, 2, ..., m, expresses each video sample as an m-dimensional coding vector.
Part 3: Video classification
Step 10: build the classification dictionary D̂ from the training samples.
With n_l = w*c labelled samples in the training set, the coding vectors of all labelled training samples form the classification dictionary D̂ = [D̂_1, ..., D̂_c], where D̂_i is the class-i classification sub-dictionary, i = 1, 2, ..., c; m is the number of dictionary atoms and c the total number of dictionary classes.
Step 11: with the classification dictionary D̂, sparsely encode in turn the coding vector ŷ of each test sample obtained in step 9, obtaining the coding coefficients β of the test sample on the classification dictionary:

    min_β { ||ŷ - D̂β||_2^2 + η||β||_1 },

where ||·||_2 is the vector 2-norm, ||·||_1 the vector 1-norm, and η is the parameter balancing the fitting error against the sparsity of the coding; η ranges over 0 to 1.
Step 12: with the coding coefficients, compute in turn the residual of each test sample on each class sub-dictionary:

    r_i(ŷ) = ||ŷ - D̂_i β_i||_2^2 / ||β_i||_2,  i = 1, ..., c,

where β_i is the coding coefficients of the current test sample on the i-th class sub-dictionary D̂_i.
Step 13: classify each test sample according to its residuals on the class sub-dictionaries.
From the residuals r_i(ŷ) of the test sample on the class sub-dictionaries, find the class sub-dictionary producing the smallest residual and take its class label i, i ∈ {1, 2, ..., c}, as the class label of the test sample.
The effect of the invention is further illustrated by the following simulation experiments.
1. Simulation conditions
The experiments were run in MATLAB 7.14 on a Windows 7 platform with an AMD A6-6310 CPU (1.80 GHz) and 4 GB of memory. The method of the invention was tested on the Weizmann and KTH data sets and compared with the supervised dictionary learning method of Y. Sun, Q. Liu, J. Tang, D. Tao, "Learning Discriminative Dictionary for Group Sparse Representation", IEEE Transactions on Image Processing. The two data sets:
The Weizmann data set contains 93 videos of 9 different people, each demonstrating 10 behaviors, i.e. c = 10; sample frames from the data set are shown in Fig. 2. The actions are walk, run, jump, side, bend, wave-one, wave-two, pjump, jack, and skip. Because one person demonstrates each of walk, run, and skip twice, one video is removed from each of these three behaviors, and the remaining 90 videos are used in the experiments. The behaviors of 5 people are selected as training samples, n = 50; the remaining videos serve as test samples, h - n = 40.
The KTH data set contains 600 videos; sample frames are shown in Fig. 3. It was recorded by 25 people under 4 different scenarios and comprises 6 behaviors, i.e. c = 6: walk, jog, run, box, hand-wave, and hand-clap. The background is fixed, and only a small fraction of the videos show slight changes of viewpoint. Following the authors' suggestion, the behaviors of 8 people (persons 11-18) are chosen as training samples, n = 192, and the behaviors of 9 people (persons 2, 3, 5-10, and 22) as test samples, h - n = 216.
2. Simulation content and results
Simulation 1: recognition tests with the method of the invention on the Weizmann data set.
As the number w of labelled samples per class in the training set varies, the Weizmann data set is recognised with the method of the invention and with the existing supervised method; the results are given in Table 1.
Table 1. Comparison of the classification results of the invention and the existing supervised method on the Weizmann data set
As Table 1 shows, the recognition performance of the invention is better overall than the existing supervised method. During dictionary learning the existing supervised method introduces only the reconstruction error and label information of the labelled samples; the method of the invention not only introduces the reconstruction error of the labelled samples but also adds the sparsity constraint and the class sub-dictionary similarity constraint, while at the same time introducing the information of the unlabelled samples, which raises the recognition accuracy on the test samples. The experimental results show that the method learns a more discriminative dictionary, can therefore represent human behaviors effectively, and on the basis of this effective representation achieves good human behavior recognition.
For w = 4, the confusion matrix of the classification results of the method on the Weizmann data set is shown in Fig. 4. As can be seen, the method achieves good recognition rates for all human behaviors in the Weizmann data set.
Simulation 2: as the number w of labelled samples per class in the training set varies, the KTH data set is recognised with the method of the invention and with the existing supervised method; the results are given in Table 2.
Table 2. Comparison of the classification results of the invention and the existing supervised method on the KTH data set
As Table 2 shows, the recognition accuracy of the invention on the KTH data set is better than the existing supervised method, improving accuracy by nearly 1%. This further demonstrates that the dictionary learning method used in the invention effectively ensures correct recognition of the test samples.
For w = 8, the confusion matrix of the classification results of the method on the KTH data set is shown in Fig. 5. As can be seen, the method achieves good recognition rates for most human behaviors in the KTH data set; only the rate for run is not very high, because the run and jog behaviors are rather similar. Because a semi-supervised dictionary learning method is used to learn the dictionary, more discriminative sample information is introduced, and the local features of the videos are encoded, so the final video representation is more discriminative, ensuring a high recognition capability for human behavior.

Claims (2)

1. A human behavior recognition method based on similarity-weighted semi-supervised dictionary learning, comprising the following steps:
(1) input a video data set containing c classes of behavior, comprising a training data set and a test data set, the training data set consisting of n_l video samples with class labels and n_u unlabeled video samples, and the test data set consisting of n_t test video samples, each video containing exactly one behavior and serving as one sample;
(2) extract the local features of each video: detect local feature regions in each video with the spatio-temporal Harris corner detection method, extract the histogram-of-gradients and histogram-of-optical-flow features of the video at the detected regions, and concatenate the two kinds of features to obtain the local features of the behavior in each video;
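The concatenation in step (2) is straightforward; a minimal NumPy sketch follows (the descriptor dimensions 72 and 90 are illustrative assumptions, not values from the claim):

```python
import numpy as np

def concat_hog_hof(hog_feats, hof_feats):
    """Concatenate per-region HOG and HOF descriptors along the feature
    axis, giving one local descriptor per detected spatio-temporal region."""
    hog_feats = np.asarray(hog_feats, dtype=float)  # shape (K, d_hog)
    hof_feats = np.asarray(hof_feats, dtype=float)  # shape (K, d_hof)
    assert hog_feats.shape[0] == hof_feats.shape[0], "one HOG and one HOF per region"
    return np.hstack([hog_feats, hof_feats])        # shape (K, d_hog + d_hof)

# toy example: 5 detected regions with 72-d HOG and 90-d HOF descriptors
feats = concat_hog_hof(np.zeros((5, 72)), np.ones((5, 90)))
print(feats.shape)  # (5, 162)
```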
(3) obtain the initial dictionary D^(0) ∈ R^(d×m) by randomly sampling the local features of each class of video samples in the training set, where d is the dimension of a local feature and m is the number of dictionary atoms:
3a) let X_i denote the local features of the i-th class of video samples in the training set, where n_i is the number of labeled samples of the i-th class, i = 1, 2, ..., c, and c is the number of classes of video samples;
3b) randomly sample the local features X_i of the i-th class of video samples to obtain the initial class dictionary D_i^(0) of the i-th class; concatenate all the initial class dictionaries to obtain the initial dictionary D^(0) = [D_1^(0), D_2^(0), ..., D_c^(0)], where d is the dimension of a local feature, b is the number of atoms of each initial class dictionary, and m = c*b is the number of atoms of the initial dictionary;
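Steps 3a)-3b) can be sketched in NumPy as follows (class counts and dimensions are toy values for illustration):

```python
import numpy as np

def init_dictionary(class_features, b, seed=0):
    """Build D^(0) in R^{d x (c*b)} by randomly sampling b local features
    (columns) from each class and concatenating the per-class
    sub-dictionaries, as in steps 3a)-3b)."""
    rng = np.random.default_rng(seed)
    class_dicts = []
    for X_i in class_features:                      # X_i has shape (d, n_i)
        idx = rng.choice(X_i.shape[1], size=b, replace=False)
        class_dicts.append(X_i[:, idx])             # D_i^(0), shape (d, b)
    return np.hstack(class_dicts)                   # D^(0), shape (d, c*b)

# toy data: c=3 classes, d=10, 40 local features per class
X = [np.random.randn(10, 40) for _ in range(3)]
D0 = init_dictionary(X, b=5)
print(D0.shape)  # (10, 15)
```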
(4) construct the coding weight matrix A^(t) ∈ R^(m×n), where n = n_l + n_u is the number of all training samples, t = 0, 1, ..., T_max, T_max is the maximum number of iterations, and each column of the weight matrix is the weight vector of the corresponding sample;
(5) using the dictionary D^(t) obtained at the t-th iteration, encode the local features of the l-th video sample by optimizing the objective function below, obtaining the coding matrix B_l^(t) of the l-th video sample at the t-th iteration:
$$\min_{B_l^{(t)}}\; \frac{1}{2}\big\|Y_l - D^{(t)}B_l^{(t)}\big\|_F^2 + \lambda_1\big\|B_l^{(t)}\big\|_{1,1} + \lambda_2\big\|\mathrm{diag}\big(A_{\cdot l}^{(t)}\big)B_l^{(t)}\big\|_{2,1}$$
wherein Y_l denotes the local features of the l-th video sample, l = 1, 2, ..., n; A_{·l}^(t) is the l-th column of the weight matrix A^(t); ||·||_F denotes the Frobenius norm; ||·||_{1,1} denotes the 1,1-norm of a matrix, i.e. the sum of the 1-norms of its rows; ||·||_1 denotes the 1-norm of a vector; ||·||_{2,1} denotes the 2,1-norm of a matrix; the first term above is the reconstruction error of the video sample's coding, the second term is a sparsity constraint on the coding matrix, and the third term is a group-sparsity constraint, which forces the dictionary atoms participating in the coding to come from the class dictionary of a single class; λ_1 is the sparsity-constraint parameter and λ_2 the group-sparsity-constraint parameter;
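For clarity, the value of the step-(5) objective can be sketched as follows, taking the 1,1-norm as the entrywise absolute sum and the 2,1-norm as the sum of row 2-norms, consistent with the claim's description:

```python
import numpy as np

def coding_objective(Y, D, B, a, lam1, lam2):
    """Value of the step-(5) objective for one video sample:
    0.5*||Y - D@B||_F^2 + lam1*||B||_{1,1} + lam2*||diag(a)@B||_{2,1},
    with a = A^(t)_{.l}, ||.||_{1,1} the entrywise absolute sum and
    ||.||_{2,1} the sum of the 2-norms of the rows."""
    recon = 0.5 * np.linalg.norm(Y - D @ B, 'fro') ** 2
    l11 = np.abs(B).sum()
    l21 = np.linalg.norm(np.diag(a) @ B, axis=1).sum()
    return recon + lam1 * l11 + lam2 * l21

# toy check: perfect reconstruction leaves only the two penalty terms
val = coding_objective(np.eye(2), np.eye(2), np.eye(2), np.array([1.0, 2.0]), 1.0, 1.0)
print(val)  # 0 + 2 + (1 + 2) = 5.0
```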
(6) update the dictionary by optimizing the objective function below, obtaining the dictionary D^(t+1) of the (t+1)-th iteration:
$$\min_{D^{(t+1)}}\; \sum_{l=1}^{n}\frac{1}{2}\big\|Y_l - D^{(t+1)}B_l^{(t)}\big\|_F^2 + \lambda_3\sum_{j=1}^{c}\sum_{i<j}\big\|\big(D_i^{(t+1)}\big)^T D_j^{(t+1)}\big\|_F^2$$
wherein the second term is the similarity constraint on the class dictionaries, introduced to increase the discrimination between them; (·)^T denotes matrix transposition; λ_3 is the similarity-constraint parameter;
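The cross-class similarity penalty in step (6) can be sketched as:

```python
import numpy as np

def incoherence_penalty(class_dicts, lam3):
    """Step-(6) similarity term: lam3 * sum_{i<j} ||D_i^T @ D_j||_F^2.
    Driving the cross-class Gram blocks toward zero makes the class
    sub-dictionaries mutually incoherent, i.e. more discriminative."""
    total = 0.0
    for i in range(len(class_dicts)):
        for j in range(i + 1, len(class_dicts)):
            total += np.linalg.norm(class_dicts[i].T @ class_dicts[j], 'fro') ** 2
    return lam3 * total

# orthogonal class dictionaries incur no penalty
D1 = np.array([[1.0], [0.0]])
D2 = np.array([[0.0], [1.0]])
print(incoherence_penalty([D1, D2], 1.0))  # 0.0
```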
(7) repeat steps (4)-(6) until the objective function converges or the maximum number of iterations is reached, obtaining the final dictionary D;
(8) using the final dictionary D, obtain the coding matrix B_g of each video sample by optimizing the following objective function:
$$\min_{B_g}\; \frac{1}{2}\big\|Y_g - DB_g\big\|_F^2 + \gamma\big\|B_g\big\|_{2,1}, \quad g = 1, 2, \ldots, h,$$
wherein ||·||_F denotes the Frobenius norm and ||·||_{2,1} the 2,1-norm; the first term above is the reconstruction error of the video sample's coding, and ||B_g||_{2,1} is a group-sparsity constraint on the coding matrix B_g; h = n_l + n_u + n_t is the number of all video samples; γ is the group-sparsity-constraint parameter;
(9) for the local features of all video samples, apply the max-pooling algorithm to the coding matrices B_g obtained in step (8), expressing each video sample as an m-dimensional coding vector z_g:
$$z_g = \big[\hat z_1, \hat z_2, \ldots, \hat z_k, \ldots, \hat z_m\big]^T, \quad \hat z_k = \max_{q}\big|(B_g)_{kq}\big|, \quad k = 1, 2, \ldots, m$$
wherein g = 1, 2, ..., h; q = 1, 2, ..., K; (B_g)_{kq} denotes the element in row k, column q of the coding matrix B_g of the g-th video sample; K denotes the number of local features of the video;
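A sketch of the step-(9) max pooling; pooling on coefficient magnitudes is an assumption here, since the claim's formula for ẑ_k is not reproduced in the extracted text:

```python
import numpy as np

def max_pool(B):
    """Step-(9) max pooling: collapse the (m x K) coding matrix of one
    video into an m-dimensional vector by keeping, for each dictionary
    atom (row), the largest coefficient magnitude over the K local
    features."""
    return np.max(np.abs(B), axis=1)

B = np.array([[0.2, -0.9, 0.1],
              [0.0,  0.3, 0.5]])
print(max_pool(B))  # [0.9 0.5]
```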
(10) form the sparse-representation classification dictionary D̂ from the coding vectors of all labeled training samples, where the sub-dictionary D̂_i consists of the coding vectors of all training samples with class label i; i = 1, 2, ..., c is the class label and c the total number of classes; n_l is the total number of labeled training samples, and n_i denotes the number of labeled samples of the i-th class;
(11) according to the classification dictionary D̂, sparsely encode the coding vector ŷ of each test sample obtained in step (9), obtaining the coding coefficients β of the test sample on the classification dictionary by the following formula:
$$\min_{\beta}\,\big\{\big\|\hat y - \hat D\beta\big\|_2^2 + \eta\big\|\beta\big\|_1\big\},$$
wherein ||·||_2 denotes the 2-norm of a vector and ||·||_1 the 1-norm of a vector; η is a parameter balancing the fitting error against the sparsity of the coding, with values in the range 0 to 1;
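Step (11) is a standard l1-regularized least-squares (lasso) problem; one common solver is ISTA (iterative soft-thresholding). A sketch follows — ISTA is an illustrative choice, since the claim does not prescribe a particular optimizer:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(y, D, eta, n_iter=500):
    """Solve min_beta ||y - D@beta||_2^2 + eta*||beta||_1 by ISTA."""
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)  # 1/Lipschitz constant of the gradient
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ beta - y)
        beta = soft_threshold(beta - step * grad, step * eta)
    return beta

# toy check with an orthonormal dictionary: the minimizer is a soft-threshold of y
beta = ista_lasso(np.array([3.0, 0.1]), np.eye(2), eta=1.0)
print(beta)  # [2.5 0. ]
```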
(12) using the coding coefficients β, compute in turn the residual of each test sample on each class sub-dictionary D̂_i:
$$r_i(\hat y) = \big\|\hat y - \hat D_i\beta_i\big\|_2^2 \,/\, \big\|\beta_i\big\|_2, \quad i = 1, \ldots, c$$
wherein β_i is the coding coefficient vector of the current test sample on the i-th class sub-dictionary D̂_i;
(13) classify the test sample ŷ according to the magnitudes of the residuals r_i(ŷ), i = 1, ..., c: find the class sub-dictionary producing the minimum residual and take its class label i as the label of the current test sample, thereby completing the classification of all test samples in turn.
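Steps (12)-(13) can be sketched as:

```python
import numpy as np

def classify_by_residual(y_hat, class_dicts, betas):
    """Steps (12)-(13): compute r_i = ||y_hat - D_i@beta_i||_2^2 / ||beta_i||_2
    for every class and return the label (1..c) with the minimum residual."""
    residuals = [np.linalg.norm(y_hat - D_i @ b_i) ** 2 / np.linalg.norm(b_i)
                 for D_i, b_i in zip(class_dicts, betas)]
    return int(np.argmin(residuals)) + 1  # class labels are 1-based

# toy example: y_hat is reconstructed exactly by class 1's sub-dictionary
y_hat = np.array([1.0, 0.0])
dicts = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
betas = [np.array([1.0]), np.array([1.0])]
print(classify_by_residual(y_hat, dicts, betas))  # 1
```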
2. The human behavior recognition method based on similarity-weighted semi-supervised dictionary learning according to claim 1, wherein the weight matrix A^(t) ∈ R^(m×n) described in step (4) is constructed as follows:
4a) compute the weight vector of each labeled video sample in the training set:
wherein the formula is evaluated for the l-th video sample, which is a labeled sample, and involves the p-th element of the coding vector of this sample; p = 1, 2, ..., m; b denotes the number of atoms of each class dictionary; l = 1, 2, ..., n; i ∈ {1, 2, ..., c};
4b) compute the weight vector of each unlabeled video sample in the training set:
4b1) find, by the k-nearest-neighbor method, the k nearest dictionary atoms in the t-th-iteration dictionary D^(t) for each local feature of this video sample, and form the neighbor matrix L ∈ R^(m×K) of this video sample, whose element L_ps in row p, column s is:
wherein p = 1, 2, ..., m; s = 1, 2, ..., K; K denotes the number of local features in this video;
4b2) sum each row of the neighbor matrix L, obtaining a column vector LL ∈ R^m;
4b3) from the column vector LL, obtain by the following formula the p-th element of the weight vector of the l-th video sample, which is an unlabeled sample:
wherein p = 1, 2, ..., m; δ is a scale parameter; LL_p denotes the p-th element of the column vector LL; max(LL) denotes the largest element of the column vector LL;
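A sketch of steps 4b1)-4b3) for one unlabeled video. The 0/1 indicator form of the neighbor matrix and the exponential mapping in the last line are assumptions made for illustration, since the claim's formulas are not reproduced in the extracted text:

```python
import numpy as np

def unlabeled_weight_vector(features, D, k, delta):
    """Illustrative reconstruction of steps 4b1)-4b3).
    features: (d, K) local features of one unlabeled video;
    D: (d, m) dictionary of the t-th iteration.
    ASSUMED: L_ps = 1 iff atom p is among the k nearest atoms of
    feature s, and the final mapping is exponential in LL_p."""
    d2 = ((features[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)  # (m, K) squared distances
    L = np.zeros_like(d2)
    nn = np.argsort(d2, axis=0)[:k, :]           # 4b1) k nearest atoms per local feature
    L[nn, np.arange(d2.shape[1])] = 1.0          # assumed indicator neighbor matrix
    LL = L.sum(axis=1)                           # 4b2) row sums -> LL in R^m
    return np.exp(-(np.max(LL) - LL) / delta)    # 4b3) assumed mapping with scale delta

# toy example: 2 atoms, 2 local features both closest to atom 0
D = np.eye(2)
F = np.array([[1.0, 1.0],
              [0.0, 0.0]])
w = unlabeled_weight_vector(F, D, k=1, delta=1.0)
print(w)  # atom 0 gets weight 1.0; atom 1 gets exp(-2)
```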
4c) compute the weight vector of the training sample corresponding to each column of the weight matrix A^(t), thereby obtaining the weight matrix A^(t).
CN201510414039.2A 2015-07-14 2015-07-14 Human behavior recognition method based on similarity-weighted semi-supervised dictionary learning Active CN105095863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510414039.2A CN105095863B (en) 2015-07-14 2015-07-14 Human behavior recognition method based on similarity-weighted semi-supervised dictionary learning

Publications (2)

Publication Number Publication Date
CN105095863A true CN105095863A (en) 2015-11-25
CN105095863B CN105095863B (en) 2018-05-25

Family

ID=54576252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510414039.2A Active CN105095863B (en) 2015-07-14 2015-07-14 Human behavior recognition method based on similarity-weighted semi-supervised dictionary learning

Country Status (1)

Country Link
CN (1) CN105095863B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605952A (en) * 2013-10-27 2014-02-26 西安电子科技大学 Human-behavior identification method based on Laplacian-regularization group sparse
WO2014056819A1 (en) * 2012-10-12 2014-04-17 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method of classifying a multimodal object
CN104392251A (en) * 2014-11-28 2015-03-04 西安电子科技大学 Hyperspectral image classification method based on semi-supervised dictionary learning
US9292797B2 (en) * 2012-12-14 2016-03-22 International Business Machines Corporation Semi-supervised data integration model for named entity classification

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105827250A (en) * 2016-03-16 2016-08-03 江苏大学 Electric-energy quality data compression and reconstruction method based on self-adaptive dictionary learning
CN105938544A (en) * 2016-04-05 2016-09-14 大连理工大学 Behavior identification method based on integrated linear classifier and analytic dictionary
CN105938544B (en) * 2016-04-05 2020-05-19 大连理工大学 Behavior recognition method based on comprehensive linear classifier and analytic dictionary
CN106056135A (en) * 2016-05-20 2016-10-26 北京九艺同兴科技有限公司 Human body motion classification method based on compression perception
CN106960225B (en) * 2017-03-31 2020-01-31 哈尔滨理工大学 sparse image classification method based on low-rank supervision
CN106960225A (en) * 2017-03-31 2017-07-18 哈尔滨理工大学 A kind of sparse image classification method supervised based on low-rank
CN107229944A (en) * 2017-05-04 2017-10-03 青岛科技大学 Semi-supervised active identification method based on cognitive information particle
CN107229944B (en) * 2017-05-04 2021-05-07 青岛科技大学 Semi-supervised active identification method based on cognitive information particles
CN107832772A (en) * 2017-09-20 2018-03-23 深圳大学 A kind of image-recognizing method and device based on semi-supervised dictionary learning
CN107862302A (en) * 2017-11-29 2018-03-30 合肥赑歌数据科技有限公司 A kind of human motion detecting system and method based on semi-supervised learning
CN108133232A (en) * 2017-12-15 2018-06-08 南京航空航天大学 A kind of Radar High Range Resolution target identification method based on statistics dictionary learning
CN110580488A (en) * 2018-06-08 2019-12-17 中南大学 Multi-working-condition industrial monitoring method, device, equipment and medium based on dictionary learning
CN110580488B (en) * 2018-06-08 2022-04-01 中南大学 Multi-working-condition industrial monitoring method, device, equipment and medium based on dictionary learning
CN109034200A (en) * 2018-06-22 2018-12-18 广东工业大学 A kind of learning method indicated based on joint sparse with multiple view dictionary learning
CN109376802A (en) * 2018-12-12 2019-02-22 浙江工业大学 A kind of gastroscope organ classes method dictionary-based learning
CN109376802B (en) * 2018-12-12 2021-08-03 浙江工业大学 Gastroscope organ classification method based on dictionary learning
CN110472576A (en) * 2019-08-15 2019-11-19 西安邮电大学 A kind of method and device for realizing mobile human body Activity recognition
CN111414827A (en) * 2020-03-13 2020-07-14 四川长虹电器股份有限公司 Depth image human body detection method and system based on sparse coding features
CN111414827B (en) * 2020-03-13 2022-02-08 四川长虹电器股份有限公司 Depth image human body detection method and system based on sparse coding features

Also Published As

Publication number Publication date
CN105095863B (en) 2018-05-25

Similar Documents

Publication Publication Date Title
CN105095863A (en) Similarity-weight-semi-supervised-dictionary-learning-based human behavior identification method
CN109492099B (en) Cross-domain text emotion classification method based on domain impedance self-adaption
CN110135459B (en) Zero sample classification method based on double-triple depth measurement learning network
CN104966105A (en) Robust machine error retrieving method and system
CN110717431A (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN105913025A (en) Deep learning face identification method based on multiple-characteristic fusion
CN112149421A (en) Software programming field entity identification method based on BERT embedding
CN102422324B (en) Age estimation device and method
CN109190472B (en) Pedestrian attribute identification method based on image and attribute combined guidance
CN111931061B (en) Label mapping method and device, computer equipment and storage medium
CN105205501A (en) Multi-classifier combined weak annotation image object detection method
CN112732921B (en) False user comment detection method and system
CN105334504A (en) Radar target identification method based on large-boundary nonlinear discrimination projection model
CN104298977A (en) Low-order representing human body behavior identification method based on irrelevance constraint
CN106203483A (en) A kind of zero sample image sorting technique of multi-modal mapping method of being correlated with based on semanteme
CN106934055B (en) Semi-supervised webpage automatic classification method based on insufficient modal information
CN111369535B (en) Cell detection method
CN103745233B (en) The hyperspectral image classification method migrated based on spatial information
CN104750875A (en) Machine error data classification method and system
CN109492230A (en) A method of insurance contract key message is extracted based on textview field convolutional neural networks interested
CN104778482A (en) Hyperspectral image classifying method based on tensor semi-supervised scale cutting dimension reduction
CN116415581A (en) Teaching data analysis system based on intelligent education
CN104616005A (en) Domain-self-adaptive facial expression analysis method
CN103942214B (en) Natural image classification method and device on basis of multi-modal matrix filling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant