CN107301382B

CN107301382B - Behavior identification method based on deep nonnegative matrix factorization under time dependence constraint

Info

Publication number: CN107301382B
Application number: CN201710418471.8A
Authority: CN
Inventors: 同鸣; 汪雷; 李海龙
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2017-06-06
Filing date: 2017-06-06
Publication date: 2020-05-19
Anticipated expiration: 2037-06-06
Also published as: CN107301382A

Abstract

The invention discloses a behavior recognition method based on deep non-negative matrix factorization under a time-dependence constraint. It mainly addresses the insufficient expressiveness of the features extracted by existing methods and the resulting low behavior recognition rate. The method is implemented as follows: 1) extract the motion salient region of the original video and, segment by segment, construct the corresponding set of non-negative matrices; 2) add a time-dependence constraint to construct a time-dependence-constrained non-negative matrix factorization; 3) use this factorization to construct a deep non-negative matrix factorization framework of depth L under the time-dependence constraint, and use the framework to factorize the data in the non-negative matrix set; 4) normalize the coefficient matrix output by each layer and concatenate the results as the spatio-temporal feature; 5) build a bag-of-words model on the spatio-temporal features and classify with an SVM classifier. The method yields spatio-temporal features with higher discriminability and expressiveness and can be applied where high behavior recognition accuracy is required, such as video surveillance and motion analysis.

Description

Behavior identification method based on deep nonnegative matrix factorization under time dependence constraint

Technical Field

The invention belongs to the technical field of image processing and relates to a human behavior recognition method that can be used for intelligent video surveillance and human-computer interaction.

Background

Human behavior recognition has broad application prospects and considerable economic value; its main application fields include video surveillance, motion analysis and virtual reality. Researchers have studied the related techniques extensively and accumulated a large body of results, but on the whole the field is still at the basic research stage, and many key problems and technical difficulties remain to be solved, for example finding a behavior representation that is simple, robust and yields a high recognition rate. Some scholars have found that the spatio-temporal information of a video helps improve the behavior recognition rate, so how to acquire spatio-temporal information effectively from video data has become a research focus in behavior recognition.

(1) Luo J, Wang W, Qi H. Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recognition Letters, 2014, 50(C): 139-148. This method proposes a center-symmetric motion local ternary pattern (CS-Mltp) for describing gradient features in time and space. The extracted features preserve spatial and temporal information well and reduce approximation errors, but for noisy videos the feature extraction process produces many noise points, which seriously degrades the accuracy of video feature extraction.

(2) Ben Aoun N, Mejdoub M, Ben Amar C. Graph-based approach for human action recognition using spatio-temporal features. Journal of Visual Communication & Image Representation, 2014, 25(2): 329-338. This method combines a graph representation of the feature structure with the bag-of-words model to model the spatio-temporal relationships among features, and it effectively suppresses the influence of video noise and occlusion. However, it considers only exact matching of subgraphs in order to find the frequently occurring subgraphs, and the resulting spatio-temporal features are only weakly discriminative.

Non-negative matrix factorization (NMF) is a matrix factorization method in which all elements of the matrices are constrained to be non-negative. It can greatly reduce the dimensionality of data features, its factorization agrees with human visual perception, and its results are interpretable and have a clear physical meaning. NMF has attracted wide attention since it was proposed and has been applied successfully in many fields such as pattern recognition, computer vision and image engineering.

The basic non-negative matrix factorization method proposed so far is:

(3) Lee D D, Seung H S. Learning the parts of objects by non-negative matrix factorization. Nature, 1999, 401(6755): 788-791. This work proposes a new matrix factorization method, non-negative matrix factorization, which decomposes a matrix whose elements are all non-negative into the product of two non-negative matrices while achieving nonlinear dimensionality reduction. However, when the basic method is applied to video feature extraction, only the spatial features of each video frame are considered and the spatio-temporal features of the video are ignored.

Disclosure of Invention

To address the above shortcomings of the prior art, the invention provides a behavior recognition method based on deep non-negative matrix factorization under a time-dependence constraint, which extracts the spatio-temporal features of a video and improves the accuracy of behavior recognition.

The technical key of the invention is to add a time-dependence constraint to construct a time-dependence-constrained non-negative matrix factorization and, taking this factorization as the algorithmic unit, to construct a deep non-negative matrix factorization framework under the time-dependence constraint for extracting the spatio-temporal features of a video. The implementation steps are as follows:

(1) for an original video O, extract the motion saliency region of each frame to form the video motion saliency region V = {v_1, v_2, …, v_i, …, v_Z}, where v_i denotes the motion saliency region of the i-th frame, i = 1, 2, …, Z, and Z is the number of frames of the video;

(2) divide the video motion saliency region V into segments of s frames each and convert them in turn into the non-negative matrix set X = {X_1, X_2, …, X_q, …, X_Ns}, where X_q denotes the non-negative matrix formed by the q-th segment of the saliency region, q = 1, 2, …, Ns, and Ns is the number of segments of the video;

(3) add a time-dependence constraint and construct the objective function D of the time-dependence-constrained non-negative matrix factorization:

where G is the non-negative matrix, F is the base matrix, H is the coefficient matrix, and λ and η are the adjusting parameters of the time-dependence term and the sparsity term, respectively; w_u is the weight column vector corresponding to an element u of the interval frame number set U, u ∈ U, so that the weight matrix for U is W = [w_1, w_2, ..., w_u, ..., w_g], whose weights can be computed row by row with a vector autoregression method; g is the maximum interval frame number, g = max(U); diag(w_u) turns the weight column vector into a diagonal matrix; (·)^T denotes the transpose of a vector or matrix; ||·||_{2,1} denotes the L_{2,1} norm; P_u = P^g - P^u ∈ R^{n×(n-g-1)}, where P^g and P^u are horizontal shift matrix operators; I_{(n-g-1)×(n-g-1)} is the (n-g-1) × (n-g-1) identity matrix and 0_{(g+1)×(n-g-1)} is the (g+1) × (n-g-1) all-zero matrix;

(4) use the time-dependence-constrained non-negative matrix factorization to construct a deep non-negative matrix factorization framework of depth L under the time-dependence constraint, and use the framework to factorize the non-negative matrix X_q of the q-th video segment, obtaining L coefficient matrices H^(l), l = 1, 2, …, L, where l is the layer index;

(5) normalize each coefficient matrix H^(l) row by row and concatenate the normalized rows to obtain the spatio-temporal feature output of the whole input data, where k = 1, 2, …, r_l indexes the rows of the l-th layer coefficient matrix and r_l is the non-negative matrix factorization dimension of the l-th layer;

(6) factorize the non-negative matrices of the set X one by one, that is, apply the operations of steps (4) and (5) to each non-negative matrix, to obtain the spatio-temporal feature output of the whole video, where Feat_q is the spatio-temporal feature of the q-th video segment, (·)^T denotes the transpose of a vector or matrix, and q = 1, 2, …, Ns;

(7) extract spatio-temporal features from all sample videos following steps (4) to (6), divide the sample videos into a training set D_tr and a test set D_te, and use the bag-of-words model to obtain the histogram vector N_tr of the training set D_tr and the histogram vector N_te of the test set D_te;

(8) train an SVM classifier with the histogram vector N_tr of the training set, input the histogram vector N_te of the test set into the trained SVM, and output the behavior class of each test sample in the test set D_te.

Compared with the prior art, the invention has the following advantages:

1) because the invention constructs a time-dependence-constrained non-negative matrix factorization, the temporal characteristics of the video are preserved along with its spatial characteristics;

2) the invention adopts deep NMF, which learns more expressive spatio-temporal features by supplementing and refining them layer by layer, further improving the expressiveness of the extracted spatio-temporal features.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

Referring to FIG. 1, the invention is implemented in the following steps:

Step 1: extract the motion saliency region V of the original video O.

(1a) construct a Gaussian filter of size 5 × 5 and apply it to the original video O = {o_1, o_2, …, o_i, …, o_Z} to obtain the filtered video B = {b_1, b_2, …, b_i, …, b_Z}, where b_i denotes the i-th filtered video frame, i = 1, 2, …, Z;

(1b) compute the motion saliency v_i of the i-th video frame o_i with the following formula:

v_i = |mo_i - b_i|,

where mo_i is the geometric mean of the pixels of the i-th video frame o_i;

(1c) repeat the operation of step (1b) for all frames of the video O to obtain the whole video motion saliency region V = {v_1, v_2, …, v_i, …, v_Z}.

The saliency extraction method in this step is taken from "Frequency-tuned Salient Region Detection", published by Radhakrishna Achanta et al. in 2009. The invention is not limited to this method; other saliency extraction methods can be used, such as "Global Contrast based Salient Region Detection", published by Ming-Ming Cheng et al. in 2015.
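Step 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the Gaussian standard deviation `sigma` and the edge-replicating border handling are hypothetical choices the text does not specify. The text calls mo_i a geometric mean; the sketch uses the arithmetic mean of the frame, which is the mean used in the cited frequency-tuned method, and that substitution is an assumption.

```python
import numpy as np

def gaussian_kernel_1d(size=5, sigma=1.0):
    """Normalized 1-D Gaussian; the 5 x 5 filter is its outer product."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(frame, size=5, sigma=1.0):
    """Separable size x size Gaussian filtering with edge replication."""
    k = gaussian_kernel_1d(size, sigma)
    pad = size // 2
    padded = np.pad(frame.astype(float), pad, mode="edge")
    # Filter along rows, then along columns (equivalent to the 2-D kernel).
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def motion_saliency(video, size=5, sigma=1.0):
    """Compute v_i = |mo_i - b_i| for every frame of a (Z, height, width) array."""
    out = np.empty(video.shape, dtype=float)
    for i, frame in enumerate(video):
        b_i = gaussian_blur(frame, size, sigma)   # step (1a)
        mo_i = frame.mean()                       # assumption: arithmetic mean
        out[i] = np.abs(mo_i - b_i)               # step (1b)
    return out
```

A grayscale video is assumed here; for color input the same operations would be applied per channel or in a color space of the reader's choosing.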

Step 2: divide the video motion saliency region V into segments of s frames each and convert them in turn into the non-negative matrix set X = {X_1, X_2, …, X_q, …, X_Ns}, where X_q denotes the non-negative matrix formed by the q-th segment of the saliency region, q = 1, 2, …, Ns, and Ns is the number of segments of the video.
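A possible reading of Step 2 is sketched below. It assumes that each saliency frame is flattened into one column of X_q, which matches the column-vector description of frames in claim 2, and that trailing frames which do not fill a whole segment are discarded; the function name and the handling of leftover frames are hypothetical.

```python
import numpy as np

def build_nonnegative_matrices(saliency, s):
    """Split a (Z, height, width) saliency video into Ns = Z // s segments.

    Each frame becomes one column, so every X_q has shape (height*width, s).
    Frames left over after the last full segment are dropped (assumption).
    """
    z = saliency.shape[0]
    ns = z // s
    matrices = []
    for q in range(ns):
        segment = saliency[q * s:(q + 1) * s]      # (s, height, width)
        x_q = segment.reshape(s, -1).T             # (height*width, s)
        matrices.append(np.ascontiguousarray(x_q))
    return matrices
```

Because saliency values are absolute differences, every X_q built this way is non-negative by construction.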

Step 3: add a time-dependence constraint and construct the objective function D of the time-dependence-constrained non-negative matrix factorization.

In adding the time-dependent constraint term, the invention mainly considers the following three aspects:

1) not only the relation between two adjacent frames is considered but also the relation between frames separated by one or more frames, so the invention defines an interval frame number set U; because two frames at different intervals contribute differently to feature extraction, they are given different weight coefficients;

2) to retain more motion detail of the video behavior, the original data are used more fully by means of projection;

3) the coefficient matrix vectors are differenced with the projection vectors to reduce the reconstruction error, and an L_{2,1} norm constraint is applied to the coefficient matrix so that the factorization result is more expressive while remaining sparse. This yields the objective function D of the time-dependence-constrained non-negative matrix factorization:

where G is the non-negative matrix, F is the base matrix, H is the coefficient matrix, and λ and η are the adjusting parameters of the time-dependence term and the sparsity term, respectively; w_u is the weight column vector corresponding to an element u of the interval frame number set U, u ∈ U, so that the weight matrix for U is W = [w_1, w_2, ..., w_u, ..., w_g], whose weights can be computed row by row with a vector autoregression method; g is the maximum interval frame number, g = max(U); diag(w_u) turns the weight column vector into a diagonal matrix; (·)^T denotes the transpose of a vector or matrix; ||·||_{2,1} denotes the L_{2,1} norm; P_u = P^g - P^u ∈ R^{n×(n-g-1)}, where P^g and P^u are horizontal shift matrix operators.
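The block definitions of P^g and P^u are not reproduced in this text, so the following sketch only illustrates one layout that is consistent with the stated size n × (n-g-1). It assumes that right-multiplying by P^g keeps the last n-g-1 columns of a matrix and that P^u keeps the same number of columns shifted back by u frames, so that H P_u differences columns that are u frames apart. The function names and this layout are assumptions. The L_{2,1} norm is computed in the usual row-wise convention, as the sum of the Euclidean norms of the rows.

```python
import numpy as np

def shift_operator(n, g, u):
    """n x (n-g-1) selector: H @ P keeps columns g+1-u .. n-1-u (0-based)."""
    k = n - g - 1
    top = np.zeros((g + 1 - u, k))
    bottom = np.zeros((u, k))
    return np.vstack([top, np.eye(k), bottom])

def difference_operator(n, g, u):
    """P_u = P^g - P^u, with P^g taken as the unshifted selector (assumption)."""
    return shift_operator(n, g, 0) - shift_operator(n, g, u)

def l21_norm(h):
    """L_{2,1} norm: sum of the Euclidean norms of the rows of h."""
    return float(np.sqrt((h ** 2).sum(axis=1)).sum())
```

Under these assumptions, `H @ difference_operator(n, g, u)` equals `H[:, g+1:] - H[:, g+1-u:n-u]`, the difference between each retained column and the column u frames earlier.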

Step 4: use the time-dependence-constrained non-negative matrix factorization to construct a deep non-negative matrix factorization framework of depth L under the time-dependence constraint.

(4a) solve the optimization problem for the objective function D of the time-dependence-constrained non-negative matrix factorization:

(4a1) determine the sizes of the base matrix F and the coefficient matrix H from the non-negative matrix G and the factorization dimension r: G is of size m × n, F is of size m × r, and H is of size r × n;

(4a2) randomly initialize the base matrix F and the coefficient matrix H so that every element f_ap of F lies in [0,1], a = 1, 2, ..., m, p = 1, 2, ..., r, and every element h_pc of H lies in [0,1], c = 1, 2, ..., n, where f_ap denotes the element in row a, column p of the base matrix F and h_pc denotes the element in row p, column c of the coefficient matrix H;

(4a3) update the elements of the base matrix F and the elements of the coefficient matrix H by their respective update rules, which are expressed in terms of the following quantities: the element in row a, column p of the base matrix F^{t-1} obtained after t-1 iterations, with t ∈ [1, iter] and iter a predefined maximum number of iterations; the element in row p, column c of the coefficient matrix H^{t-1} obtained after t-1 iterations; an intermediate quantity arising from differentiating ||H^{t-1}||_{2,1} with respect to the coefficient matrix H^{t-1}, defined through the r-th row of H^{t-1}; and (·)^T, the transpose of a vector or matrix;

(4a4) stop iterating once the iteration count t reaches iter and output the resulting base matrix F and coefficient matrix H; otherwise return to step (4a3);

(4b) stack L layers of time-dependence-constrained non-negative matrix factorization to construct the deep factorization framework: the first layer takes the non-negative matrix G as input and yields the base matrix F^(1) and the coefficient matrix H^(1); from the second layer on, the base matrix F^(l-1) obtained by the previous layer is the input of the next layer, which outputs F^(l) and H^(l), where l is the layer index, F^(l) is the base matrix of the l-th layer and H^(l) is the coefficient matrix of the l-th layer.
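The structure of Step 4 can be sketched as follows. The patent's update rules for the time-dependence-constrained objective are not reproduced in this text, so the single-layer solver below is a stand-in: it uses the multiplicative updates of the basic NMF of Lee and Seung (reference (3)) for the unconstrained objective ||G - FH||_F^2, without the time-dependence and sparsity terms. The function names, the `eps` guard against division by zero and the per-layer seeds are hypothetical.

```python
import numpy as np

def basic_nmf(G, r, iters=200, seed=0, eps=1e-10):
    """Lee-Seung multiplicative updates for min ||G - F H||_F^2 with F, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    F = rng.random((m, r))       # step (4a2): elements drawn from [0, 1)
    H = rng.random((r, n))
    for _ in range(iters):       # step (4a4): stop after a fixed number of iterations
        H *= (F.T @ G) / (F.T @ F @ H + eps)
        F *= (G @ H.T) / (F @ H @ H.T + eps)
    return F, H

def deep_nmf(G, dims, iters=200):
    """Step (4b): layer 1 factorizes G; layer l factorizes the base matrix F^(l-1)."""
    bases, coeffs = [], []
    layer_input = G
    for layer, r_l in enumerate(dims, start=1):
        F_l, H_l = basic_nmf(layer_input, r_l, iters=iters, seed=layer)
        bases.append(F_l)
        coeffs.append(H_l)
        layer_input = F_l        # the base matrix feeds the next layer
    return bases, coeffs
```

With an input of size m × n and layer dimensions r_1, …, r_L, the coefficient matrix H^(1) has size r_1 × n and each later H^(l) has size r_l × r_(l-1), since every layer after the first factorizes an m × r_(l-1) base matrix.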

Step 5: use the framework constructed in Step 4 to factorize the non-negative matrix X_q of the q-th video segment, obtaining L coefficient matrices H^(l), l = 1, 2, …, L, where l is the layer index.

Step 6: normalize each coefficient matrix H^(l) row by row and concatenate the normalized rows to obtain the spatio-temporal feature output of the whole input data, where k indexes the rows of the l-th layer coefficient matrix.
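Step 6 can be illustrated with the short sketch below. The normalization formula is not reproduced in this text, so dividing each row by its Euclidean norm is an assumption, as are the function name and the `eps` guard for all-zero rows.

```python
import numpy as np

def spatiotemporal_feature(coeffs, eps=1e-12):
    """Row-normalize every H^(l) and concatenate all rows into one vector."""
    parts = []
    for H_l in coeffs:
        norms = np.linalg.norm(H_l, axis=1, keepdims=True)   # one norm per row
        parts.append((H_l / np.maximum(norms, eps)).ravel()) # rows laid end to end
    return np.concatenate(parts)
```

The resulting vector has one entry per element of every coefficient matrix, so its length is the sum of the sizes of H^(1), …, H^(L).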

Step 7: factorize the non-negative matrices of the set X one by one, that is, apply the operations of Steps 5 and 6 to each non-negative matrix, to obtain the spatio-temporal feature output of the whole video, with q = 1, 2, …, Ns.

Step 8: extract the features of all sample videos, divide the sample videos into a training set D_tr and a test set D_te, and use the bag-of-words model to obtain the histogram vector N_tr of the training set D_tr and the histogram vector N_te of the test set D_te.

(8a) apply the K-means clustering method to the training set D_tr to generate the dictionary DI_{De×Ce};

(8b) use the dictionary DI_{De×Ce} to quantize and encode the training set D_tr and the test set D_te, obtaining the histogram vector N_tr of the training set D_tr and the histogram vector N_te of the test set D_te, where De is the feature dimension and Ce is the number of cluster centers.
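The bag-of-words encoding of Step 8 can be sketched as follows, assuming that each video is described by its Ns segment features of dimension De and that each feature is assigned to its nearest cluster center. The plain Lloyd iteration, the normalization of the histogram to unit sum and all function names are illustrative assumptions.

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(data, ce, iters=50, seed=0):
    """Plain Lloyd K-means; returns a (Ce, De) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=ce, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign(data, centers)
        for c in range(ce):
            members = data[labels == c]
            if len(members):          # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """Normalized histogram of codeword counts for the features of one video."""
    counts = np.bincount(assign(features, centers), minlength=len(centers))
    return counts / counts.sum()
```

In this sketch the dictionary DI_{De×Ce} corresponds to `centers.T`, learned from the training features only and then reused to encode both training and test videos.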

Step 9: train an SVM classifier with the histogram vector N_tr of the training set, input the histogram vector N_te of the test set into the trained SVM, and output the behavior class of each test sample in the test set D_te.
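Step 9 could be carried out with any SVM implementation. The sketch below uses scikit-learn's `SVC` as one option; the choice of library, the linear kernel and the function name are assumptions, since the text does not name a kernel or a toolkit.

```python
import numpy as np
from sklearn.svm import SVC

def train_and_classify(n_tr, labels_tr, n_te, kernel="linear"):
    """Fit an SVM on training histograms and predict the test behavior classes."""
    clf = SVC(kernel=kernel)      # kernel choice is an assumption
    clf.fit(n_tr, labels_tr)      # rows of n_tr are per-video histogram vectors
    return clf.predict(n_te)
```

Each row of `n_tr` and `n_te` is the bag-of-words histogram of one video, and `labels_tr` holds the behavior class of each training video.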

To verify the effectiveness of the invention, 6 and 10 behavior classes were selected from the commonly used human behavior databases KTH and UCF-Sports, respectively, and human behavior recognition was performed with the invention. The correct recognition rate was 97.79% on KTH and 96.67% on UCF-Sports.

The foregoing description is only an example of the invention and should not be construed as limiting it. It will be apparent to those skilled in the art that various modifications and changes in form and detail can be made without departing from the principles and structure of the invention, but such modifications and changes remain within the scope of the invention as defined by the appended claims.

Claims

1. A behavior recognition method based on deep non-negative matrix factorization under a time-dependence constraint, comprising the following steps:

wherein G is the non-negative matrix, F is the base matrix, H is the coefficient matrix, and λ and η are the adjusting parameters of the time-dependence term and the sparsity term, respectively; w_u is the weight column vector corresponding to an element u of the interval frame number set U, u ∈ U, so that the weight matrix for U is W = [w_1, w_2, ..., w_u, ..., w_g], whose weights can be computed row by row with a vector autoregression method; g is the maximum interval frame number, g = max(U); diag(w_u) turns the weight column vector into a diagonal matrix; (·)^T denotes the transpose of a vector or matrix; ||·||_{2,1} denotes the L_{2,1} norm; P_u = P^g - P^u ∈ R^{Z×(Z-g-1)}, where P^g is the first horizontal shift matrix operator and P^u is the second horizontal shift matrix operator; I_{(Z-g-1)×(Z-g-1)} is the (Z-g-1) × (Z-g-1) identity matrix and 0_{(g+1)×(Z-g-1)} is the (g+1) × (Z-g-1) all-zero matrix;

(4) using the time-dependence-constrained non-negative matrix factorization to construct a deep non-negative matrix factorization framework of depth L under the time-dependence constraint, and using the framework to factorize the non-negative matrix X_q of the q-th video segment to obtain L coefficient matrices H^(l), l = 1, 2, …, L, wherein l is the layer index;

wherein r_l is the non-negative matrix factorization dimension of the l-th layer and k indexes the rows of the l-th layer coefficient matrix;

wherein Feat_q is the spatio-temporal feature of the q-th video segment, (·)^T denotes the transpose of a vector or matrix, and q = 1, 2, …, Ns;

2. The method of claim 1, wherein the video motion saliency region is extracted in step (1) by the following steps:

(1a) constructing a Gaussian filter of size 5 × 5 and filtering the video O = {o_1, o_2, …, o_i, …, o_Z} to obtain the filtered video B = {b_1, b_2, …, b_i, …, b_Z}, wherein b_i denotes the column vector into which the filtered i-th video frame is converted, i = 1, 2, …, Z;

v_i = |mo_i - b_i|,

wherein mo_i is a column vector with the same number of rows as b_i, each element of which equals the geometric mean of the pixels of the i-th video frame o_i;

3. The method of claim 1, wherein constructing, in step (4), the deep non-negative matrix factorization framework of depth L under the time-dependence constraint from the time-dependence-constrained non-negative matrix factorization comprises the following steps:

(4a1) determining the sizes of the base matrix F and the coefficient matrix H from the non-negative matrix G and the factorization dimension r, wherein G is of size m × n, F is of size m × r, and H is of size r × n;

(4a2) randomly initializing the base matrix F and the coefficient matrix H so that every element f_ap of F lies in [0,1], a = 1, 2, ..., m, p = 1, 2, ..., r, and every element h_pc of H lies in [0,1], c = 1, 2, ..., n, wherein f_ap denotes the element in row a, column p of the base matrix F and h_pc denotes the element in row p, column c of the coefficient matrix H;

(4a3) updating the elements of the base matrix F and the elements of the coefficient matrix H by their respective update rules, wherein the update rules use the positive part and the negative part of the elements of the matrix P_u;

(4b) stacking L layers of time-dependence-constrained non-negative matrix factorization to construct the deep factorization framework, wherein the first layer takes the non-negative matrix G as input and yields the base matrix F^(1) and the coefficient matrix H^(1), and, from the second layer on, the base matrix F^(l-1) obtained by the previous layer is the input of the next layer, which outputs F^(l) and H^(l), wherein l is the layer index, F^(l) is the base matrix of the l-th layer and H^(l) is the coefficient matrix of the l-th layer.

4. The method of claim 1, wherein obtaining, in step (7), the histogram vector N_tr of the training set D_tr and the histogram vector N_te of the test set D_te with the bag-of-words model comprises: first applying the K-means clustering method to the training set D_tr to generate the dictionary DI_{De×Ce}; then using the dictionary DI_{De×Ce} to quantize and encode the training set D_tr and the test set D_te, obtaining the histogram vector N_tr of the training set D_tr and the histogram vector N_te of the test set D_te, wherein De is the feature dimension and Ce is the number of cluster centers.