CN107423697B - Behavior identification method based on nonlinear fusion depth 3D convolution descriptor


Info

Publication number
CN107423697B
Authority
CN
China
Prior art keywords
sample
kernel
feature set
matrix
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710568540.3A
Other languages
Chinese (zh)
Other versions
CN107423697A (en)
Inventor
同鸣
赵梦傲
李明阳
汪厚峄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201710568540.3A priority Critical patent/CN107423697B/en
Publication of CN107423697A publication Critical patent/CN107423697A/en
Application granted granted Critical
Publication of CN107423697B publication Critical patent/CN107423697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior identification method based on a nonlinear fusion depth 3D convolution descriptor, which mainly addresses the low recognition accuracy of the prior art. The scheme is as follows: 1. input each sample into a C3D network to obtain the activation values of every layer; 2. process each layer of the C3D network to obtain one feature vector per layer; 3. fuse the feature vectors of different layers to obtain a global feature set and a local feature set; 4. perform discriminant nonlinear fusion on the global feature set and the local feature set to obtain the depth 3D convolution descriptor; 5. obtain the depth features of the training samples and use them to train a linear SVM classifier; 6. obtain the depth features of the test samples and input them into the linear SVM classifier for recognition. The invention improves the accuracy of behavior recognition, achieves a 94.67% recognition rate on the UCF-Sports dataset, and can be applied to human-computer interaction, video surveillance and video retrieval.

Description

Behavior identification method based on nonlinear fusion depth 3D convolution descriptor
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a behavior identification method which can be applied to man-machine interaction, video monitoring and video retrieval.
Background
At present, behavior recognition methods in the field of video processing fall into two main categories: hand-crafted features and deep learning. Hand-crafted features are usually designed from domain knowledge of a controlled environment, yet video data from real scenes cannot always be modeled correctly, so the generalization ability of hand-crafted features is limited. Because video contains rich semantic information, using traditional hand-crafted features directly for behavior recognition lacks semantic information and sufficient discriminative power, and easily causes confusion between behaviors.
In recent years, behavior recognition methods based on deep learning have made great progress. Deep learning generally uses a deep convolutional neural network for behavior recognition; the main models are 2D convolutional networks, 3D convolutional networks, and C3D networks. The 3D convolutional network model outperforms the conventional 2D convolutional network model, but it requires a human-body detector and a head-tracking algorithm to segment the video, and the segmented clips serve as the input of the 3D convolutional neural network, which is a serious limitation. Compared with a 3D convolutional network, the C3D network can learn the spatio-temporal information in the video, takes the complete video directly as input, does not depend on any preprocessing, and scales easily to large data sets. However, when performing behavior recognition, the C3D network uses only the global features of its top layer, while the lower-layer features, which carry important local information, are not fully exploited.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides a behavior identification method based on a nonlinear fusion depth 3D convolution descriptor, which obtains a more discriminative feature representation by fusing features from different layers of a C3D network and thereby improves the behavior recognition rate.
The technical key point of the invention is to construct a discriminant nonlinear fusion method, use it to fuse the global features and the local features extracted from the C3D network into a depth 3D convolution descriptor, and classify the data with an SVM. The implementation steps are as follows:
(1) acquiring the L feature vectors u of each sample with a C3D network, where L is the number of layers of the C3D network;
(2) obtaining a global feature vector x and a local feature vector y of each sample from the feature vectors u, so as to obtain a global feature set X and a local feature set Y;
(3) obtaining a depth 3D convolution descriptor D_C3D from the global feature set X and the local feature set Y;
(4) obtaining a depth feature vector z_train of each training sample and a depth feature vector z_test of each test sample from the depth 3D convolution descriptor D_C3D;
(5) training a linear SVM classifier with the depth feature vectors z_train of the training samples;
(6) classifying the depth feature vector z_test of each test sample with the linear SVM classifier to obtain the classification result of each test sample.
Compared with the prior art, the invention has the following advantages:
The method extracts the global features and the local features of the data with a C3D network, obtains a more discriminative depth 3D convolution descriptor through nonlinear fusion, trains an SVM classifier on this descriptor, and thereby improves the accuracy of behavior recognition.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
Referring to fig. 1, the behavior recognition method based on the nonlinear fusion depth 3D convolution descriptor of the present invention includes the following steps:
Step 1: obtain a training data set and a test data set.
(1a) Acquiring a human behavior video set V, wherein the category number of the human behavior video set V is C, and the total number of samples is N;
(1b) selecting a samples from each category of the human behavior video set V as test samples to obtain a test data set V_test, and taking the remaining samples of V as training samples to obtain a training data set V_train, where a ∈ {1, 2, ..., N_k - 1}, N_k is the number of samples in class k, and k = 1, 2, ..., C (an illustrative per-class split is sketched below).
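As a concrete illustration of step (1b), the following Python sketch selects a test samples from every class and keeps the rest for training; the function name, the use of NumPy, and the random selection are assumptions for illustration, not part of the patent.

import numpy as np

def split_per_class(labels, a, seed=0):
    """Pick `a` test samples from every class and keep the rest for training
    (step 1b); returns boolean masks over the N samples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    test_mask = np.zeros(labels.shape[0], dtype=bool)
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        test_mask[rng.choice(idx, size=a, replace=False)] = True
    train_mask = ~test_mask
    return train_mask, test_mask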
Step 2: obtain the feature vectors u of each sample.
(2a) Dividing each sample into a plurality of continuous video segments, wherein the length of each video segment is the same;
(2b) inputting the video clips obtained in the step (2a) into a C3D network, and obtaining activation values of each layer of each video clip in the C3D network, wherein the number of the layers of the C3D network is L;
(2c) according to the activation values of all layers of each video clip obtained in the step (2b), summing the activation values of the same layer of all the video clips, and averaging to obtain an average activation value of each layer;
(2d) performing dimensionality reduction on the average activation value of each layer obtained in step (2c) by principal component analysis, to obtain the L feature vectors u of each sample (an illustrative sketch of step 2 is given below).
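A minimal Python sketch of steps (2a) to (2d) follows. It assumes the per-clip, per-layer C3D activations have already been extracted with some deep-learning toolkit; the function names, the flattening of activations, and the number of PCA components are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

def mean_layer_activations(clip_activations):
    """clip_activations: list over clips of one sample, each element a list of
    L flattened per-layer activation arrays.  Returns the L averaged
    activation vectors of that sample (steps 2b-2c)."""
    n_layers = len(clip_activations[0])
    return [np.stack([np.ravel(c[l]) for c in clip_activations]).mean(axis=0)
            for l in range(n_layers)]

def pca_reduce_per_layer(mean_acts_per_sample, n_components=64):
    """mean_acts_per_sample: list over samples, each a list of L averaged
    activation vectors.  Fits one PCA per layer over all samples (step 2d)
    and returns, per sample, the L reduced feature vectors u."""
    n_samples = len(mean_acts_per_sample)
    n_layers = len(mean_acts_per_sample[0])
    reduced_per_layer = []
    for l in range(n_layers):
        mat = np.stack([s[l] for s in mean_acts_per_sample])   # (N, dim_l)
        k = min(n_components, mat.shape[0], mat.shape[1])
        reduced_per_layer.append(PCA(n_components=k).fit_transform(mat))
    return [[reduced_per_layer[l][n] for l in range(n_layers)]
            for n in range(n_samples)]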
Step 3: obtain the global feature set X.
(3a) from the L feature vectors u of each training sample obtained in step 2, selecting b feature vectors u from the upper layers of the C3D network (from an intermediate layer determined by a floor expression up to the L-th layer), and concatenating the different feature vectors u to obtain a global feature vector x of dimension q, where ⌊·⌋ denotes rounding down;
(3b) repeating step (3a) for every sample to obtain the global feature set X = [x_1, x_2, ..., x_n, ..., x_N], where X ∈ R^(q×N), R^(q×N) is the vector space of dimension q × N, x_n is the global feature vector of the n-th sample, and n = 1, 2, ..., N.
Step 4: obtain the local feature set Y.
(4a) from the L feature vectors u of each training sample obtained in step 2, selecting e feature vectors u from the lower layers of the C3D network (from the 1st layer up to an intermediate layer determined by a floor expression), and concatenating the different feature vectors u to obtain a local feature vector y of dimension p;
(4b) repeating step (4a) for every sample to obtain the local feature set Y = [y_1, y_2, ..., y_m, ..., y_N], where Y ∈ R^(p×N), R^(p×N) is the vector space of dimension p × N, y_m is the local feature vector of the m-th sample, and m = 1, 2, ..., N (an illustrative construction of X and Y is sketched below).
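The concatenation into global and local feature sets (steps 3 and 4) can be sketched as follows. The exact boundary between "lower" and "upper" layers is specified in the patent by a floor expression, so the L // 2 split used here is only an assumption.

import numpy as np

def build_feature_sets(layer_feats_per_sample):
    """layer_feats_per_sample: list over samples, each a list of L per-layer
    feature vectors u from step 2.  Upper-layer vectors are concatenated into
    the global feature x, lower-layer vectors into the local feature y;
    the L // 2 boundary is an assumed stand-in for the patent's floor
    expression.  Returns X of shape (q, N) and Y of shape (p, N)."""
    L = len(layer_feats_per_sample[0])
    split = L // 2
    X = np.stack([np.concatenate(f[split:]) for f in layer_feats_per_sample], axis=1)
    Y = np.stack([np.concatenate(f[:split]) for f in layer_feats_per_sample], axis=1)
    return X, Y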
Step 5: calculate the depth 3D convolution descriptor D_C3D.
(5a) computing the kernel matrix K_X of the global feature set X and the kernel matrix K_Y of the local feature set Y with a kernel function; the kernel function may be a polynomial kernel, a Gaussian kernel, a Laplacian kernel, a power-exponent kernel, or another type of kernel function; a polynomial kernel is used in this example, but the method is not limited to it;
(5a1) from the global feature set X, computing each element of the kernel matrix K_X with the polynomial kernel function:
(K_X)_ij = G_X(x_i, x_j),
where i = 1, 2, ..., N, j = 1, 2, ..., N, (K_X)_ij is the element in row i and column j of the kernel matrix K_X of the global feature set X, G_X(·,·) is the polynomial kernel function, ⟨·,·⟩ denotes the inner product, x_i is the global feature vector of the i-th sample in X, x_j is the global feature vector of the j-th sample in X, and θ_1 is the kernel parameter of the polynomial kernel function;
(5a2) from the local feature set Y, computing each element of the kernel matrix K_Y with the polynomial kernel function:
(K_Y)_ηξ = G_Y(y_η, y_ξ),
where η = 1, 2, ..., N, ξ = 1, 2, ..., N, (K_Y)_ηξ is the element in row η and column ξ of the kernel matrix K_Y of the local feature set Y, G_Y(·,·) is the polynomial kernel function, y_η is the local feature vector of the η-th sample in Y, y_ξ is the local feature vector of the ξ-th sample in Y, and θ_2 is the kernel parameter of the polynomial kernel function (an illustrative kernel computation is sketched below);
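For step (5a), a polynomial kernel matrix can be computed as below; the kernel degree and offset used here are illustrative choices, since the patent gives the exact kernel expression only as a formula image.

import numpy as np

def polynomial_kernel_matrix(F, theta, degree=2):
    """F: feature set of shape (dim, N), one column per sample.
    Returns the N x N kernel matrix with entries (<f_i, f_j> + theta) ** degree
    (an illustrative polynomial kernel, not the patent's exact expression)."""
    inner = F.T @ F                      # matrix of pairwise inner products
    return (inner + theta) ** degree

# K_X = polynomial_kernel_matrix(X, theta=1.0)   # kernel matrix of the global set
# K_Y = polynomial_kernel_matrix(Y, theta=1.0)   # kernel matrix of the local set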
(5b) performing discriminant nonlinear fusion on the kernel matrix K_X of the global feature set X and the kernel matrix K_Y of the local feature set Y to obtain the depth 3D convolution descriptor D_C3D:
(5b1) computing the within-class divergence matrix and the between-class divergence matrix of the global feature set X in the kernel space (formulas shown as images), where φ(x_u^k) is the nonlinear mapping of x_u^k, the global feature vector of the u-th sample of the k-th class, u = 1, 2, ..., N_k, and T denotes matrix transposition;
(5b2) computing the within-class divergence matrix and the between-class divergence matrix of the local feature set Y in the kernel space (formulas shown as images), where φ(y_g^k) is the nonlinear mapping of y_g^k, the local feature vector of the g-th sample of the k-th class, and g = 1, 2, ..., N_k;
(5b3) from the kernel-space divergence matrices of the global feature set X obtained in step (5b1) and of the local feature set Y obtained in step (5b2), obtaining the cross-covariance matrices K_xy and K_yx (formulas shown as images), where cov(·) denotes the covariance;
(5b4) constructing an objective function (shown as an image) and using it to compute a global projection vector α for each global feature vector x and a local projection vector β for each local feature vector y;
(5b5) solving the objective function obtained in step (5b4) by the Lagrange multiplier method, i.e. converting the problem of solving the objective function into a generalized eigenvalue problem (the eigenvalue equation is shown as an image), where λ is a generalized eigenvalue, the global projection vector α consists of the first N elements of the eigenvector corresponding to λ, and the local projection vector β consists of the last N elements of that eigenvector;
(5b6) solving the generalized eigenvalue problem of step (5b5) and keeping the s largest eigenvalues to obtain the projection matrix W_X = [α_1, α_2, ..., α_s] of the global feature set X and the projection matrix W_Y = [β_1, β_2, ..., β_s] of the local feature set Y, where s = min(q, p), min(·) denotes the minimum, α_1, α_2, ..., α_s are the global projection vectors corresponding to the s largest generalized eigenvalues, and β_1, β_2, ..., β_s are the local projection vectors corresponding to the s largest generalized eigenvalues;
(5b7) from the kernel matrix K_X of the global feature set X, the kernel matrix K_Y of the local feature set Y, the projection matrix W_X of the global feature set X and the projection matrix W_Y of the local feature set Y, obtaining the depth 3D convolution descriptor D_C3D (formula shown as an image; an illustrative sketch of steps (5b5) to (5b7) is given below).
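Steps (5b5) to (5b7) reduce to a generalized eigenvalue problem followed by a projection of the two kernel matrices. The sketch below assumes the matrices A and B of that eigenproblem have already been assembled from the kernel scatter and cross-covariance matrices as specified by the patent's formula images, and it stacks the two projected kernel matrices as one plausible way to form the descriptor.

import numpy as np
from scipy.linalg import eig

def fuse_descriptor(K_X, K_Y, A, B, s):
    """Solve A w = lambda * B w, split the top-s eigenvectors into the global
    projections (first N entries) and local projections (last N entries), and
    project the kernel matrices.  A and B stand in for the matrices the patent
    builds in steps (5b3)-(5b5); stacking the projections is an assumed way of
    forming D_C3D."""
    N = K_X.shape[0]
    vals, vecs = eig(A, B)                     # generalized eigen-decomposition
    order = np.argsort(-vals.real)[:s]         # indices of the s largest eigenvalues
    W = vecs[:, order].real                    # (2N, s) stacked [alpha; beta] vectors
    W_X, W_Y = W[:N, :], W[N:, :]              # projection matrices of X and Y
    D_C3D = np.vstack([W_X.T @ K_X, W_Y.T @ K_Y])   # one column per sample
    return D_C3D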
and 6, training a linear SVM classifier.
(6a) from the depth 3D convolution descriptor D_C3D obtained in step 5, obtaining the depth feature vector z_train of each training sample, where each column of D_C3D corresponds to the depth feature vector of one sample;
(6b) training the linear SVM classifier with the depth feature vectors z_train of the training samples.
Step 7: obtain the classification results of the test samples.
(7a) from the depth 3D convolution descriptor D_C3D obtained in step 5, obtaining the depth feature vector z_test of each test sample;
(7b) inputting the depth feature vector z_test of each test sample into the linear SVM classifier to obtain the recognition result of each test sample (an illustrative sketch of steps 6 and 7 is given below).
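Steps 6 and 7 amount to training a linear SVM on the training columns of the descriptor and classifying the test columns; a minimal scikit-learn sketch follows, assuming the descriptor columns are split into training and test parts according to the sample order of step 1.

from sklearn.svm import LinearSVC

def train_and_classify(D_train, labels_train, D_test):
    """Each column of the descriptor matrix is one sample's depth feature
    vector, so the matrices are transposed to the (samples, features) layout
    scikit-learn expects."""
    clf = LinearSVC()
    clf.fit(D_train.T, labels_train)
    return clf.predict(D_test.T)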
The foregoing description is only an example of the present invention and should not be construed as limiting it. It will be apparent to those skilled in the art that, after understanding the present disclosure and its principles, various modifications and variations in form and detail can be made without departing from the principle and structure of the invention; such modifications and variations remain within the scope of the appended claims.

Claims (1)

1. A behavior identification method based on a nonlinear fusion depth 3D convolution descriptor, comprising the following steps:
(1) acquiring the L feature vectors v of each sample with a C3D network, wherein L is the number of layers of the C3D network;
(1a) dividing each sample into a plurality of continuous video segments, wherein the length of each video segment is the same;
(1b) inputting the video clips obtained in the step (1a) into a C3D network, and obtaining activation values of each layer of each video clip in the C3D network, wherein the number of the layers of the C3D network is L;
(1c) according to the activation values of all layers of each video clip obtained in the step (1b), summing the activation values of the same layer of all the video clips, and averaging to obtain an average activation value of each layer;
(1d) performing dimensionality reduction on the average activation value of each layer obtained in step (1c) by principal component analysis, to obtain the L feature vectors v of each sample;
(2) obtaining a global feature vector x and a local feature vector y of each sample from the feature vectors v, so as to obtain a global feature set X and a local feature set Y, implemented by the following steps:
(2a) from the L feature vectors v of each training sample obtained in step (1), selecting b feature vectors v from the upper layers of the C3D network (from an intermediate layer determined by a floor expression up to the L-th layer), and concatenating the different feature vectors v to obtain a global feature vector x of dimension q, wherein ⌊·⌋ denotes rounding down;
(2b) repeating step (2a) for every sample to obtain the global feature set X = [x_1, x_2, ..., x_n, ..., x_N], wherein X ∈ R^(q×N), R^(q×N) is the vector space of dimension q × N, x_n is the global feature vector of the n-th sample, and n = 1, 2, ..., N;
(2c) from the L feature vectors v of each training sample obtained in step (1), selecting e feature vectors v from the lower layers of the C3D network (from the 1st layer up to an intermediate layer determined by a floor expression), and concatenating the different feature vectors v to obtain a local feature vector y of dimension p, wherein ⌊·⌋ denotes rounding down;
(2d) repeating step (2c) for every sample to obtain the local feature set Y = [y_1, y_2, ..., y_m, ..., y_N], wherein Y ∈ R^(p×N), R^(p×N) is the vector space of dimension p × N, y_m is the local feature vector of the m-th sample, and m = 1, 2, ..., N;
(3) obtaining the depth 3D convolution descriptor D_C3D from the global feature set X and the local feature set Y, by the following concrete steps:
(3a) computing the kernel matrix K_X of the global feature set X and the kernel matrix K_Y of the local feature set Y with a kernel function:
(3a1) from the global feature set X, computing each element of the kernel matrix K_X with a polynomial kernel function:
(K_X)_ij = G_X(x_i, x_j),
wherein i = 1, 2, ..., N, j = 1, 2, ..., N, (K_X)_ij is the element in row i and column j of the kernel matrix K_X, G_X(·,·) is the polynomial kernel function, ⟨·,·⟩ denotes the inner product, x_i is the global feature vector of the i-th sample in the feature set X, x_j is the global feature vector of the j-th sample in the feature set X, and θ_1 is the kernel parameter of the polynomial kernel function;
(3a2) from the local feature set Y, computing each element of the kernel matrix K_Y with a polynomial kernel function:
(K_Y)_ηξ = G_Y(y_η, y_ξ),
wherein η = 1, 2, ..., N, ξ = 1, 2, ..., N, (K_Y)_ηξ is the element in row η and column ξ of the kernel matrix K_Y, G_Y(·,·) is the polynomial kernel function, ⟨·,·⟩ denotes the inner product, y_η is the local feature vector of the η-th sample in the local feature set Y, y_ξ is the local feature vector of the ξ-th sample in the local feature set Y, and θ_2 is the kernel parameter of the polynomial kernel function;
(3b) from the kernel matrix K_X of the global feature set X and the kernel matrix K_Y of the local feature set Y obtained in step (3a), calculating the depth 3D convolution descriptor D_C3D:
(3b1) computing the within-class divergence matrix and the between-class divergence matrix of the global feature set X in the kernel space (formulas shown as images), wherein φ(x_u^k) is the nonlinear mapping of x_u^k, the global feature vector of the u-th sample of the k-th class, N_k is the number of samples in class k, u = 1, 2, ..., N_k, k = 1, 2, ..., C, and T denotes matrix transposition;
(3b2) computing the within-class divergence matrix and the between-class divergence matrix of the local feature set Y in the kernel space (formulas shown as images), wherein φ(y_g^k) is the nonlinear mapping of y_g^k, the local feature vector of the g-th sample of the k-th class, and g = 1, 2, ..., N_k;
(3b3) from the kernel-space divergence matrices of the global feature set X and of the local feature set Y, obtaining the cross-covariance matrices K_xy and K_yx (formulas shown as images), wherein cov(·) denotes the covariance;
(3b4) constructing an objective function (shown as an image) and using it to compute a projection vector α for each global feature vector x and a projection vector β for each local feature vector y;
(3b5) solving the objective function obtained in step (3b4) by the Lagrange multiplier method, i.e. converting the problem of solving the objective function into a generalized eigenvalue problem (the eigenvalue equation is shown as an image), wherein λ is a generalized eigenvalue, the global projection vector α consists of the first N elements of the eigenvector corresponding to λ, and the local projection vector β consists of the last N elements of that eigenvector;
(3b6) solving the generalized eigenvalue problem of step (3b5) and keeping the s largest eigenvalues to obtain the projection matrix W_X = [α_1, α_2, ..., α_s] of the global feature set X and the projection matrix W_Y = [β_1, β_2, ..., β_s] of the local feature set Y, wherein s = min(q, p), min(·) denotes the minimum, α_1, α_2, ..., α_s are the global projection vectors corresponding to the s largest generalized eigenvalues, and β_1, β_2, ..., β_s are the local projection vectors corresponding to the s largest generalized eigenvalues;
(3b7) from the kernel matrix K_X of the global feature set X, the kernel matrix K_Y of the local feature set Y, the projection matrix W_X of the global feature set X and the projection matrix W_Y of the local feature set Y, obtaining the depth 3D convolution descriptor D_C3D (formula shown as an image);
(4) obtaining the depth feature vector z_train of each training sample and the depth feature vector z_test of each test sample from the depth 3D convolution descriptor D_C3D;
(5) training a linear SVM classifier with the depth feature vectors z_train of the training samples;
(6) classifying the depth feature vector z_test of each test sample with the linear SVM classifier to obtain the identification result of each test sample.
CN201710568540.3A 2017-07-13 2017-07-13 Behavior identification method based on nonlinear fusion depth 3D convolution descriptor Active CN107423697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710568540.3A CN107423697B (en) 2017-07-13 2017-07-13 Behavior identification method based on nonlinear fusion depth 3D convolution descriptor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710568540.3A CN107423697B (en) 2017-07-13 2017-07-13 Behavior identification method based on nonlinear fusion depth 3D convolution descriptor

Publications (2)

Publication Number Publication Date
CN107423697A CN107423697A (en) 2017-12-01
CN107423697B true CN107423697B (en) 2020-09-08

Family

ID=60427184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710568540.3A Active CN107423697B (en) 2017-07-13 2017-07-13 Behavior identification method based on nonlinear fusion depth 3D convolution descriptor

Country Status (1)

Country Link
CN (1) CN107423697B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108298393A (en) * 2017-12-20 2018-07-20 浙江新再灵科技股份有限公司 Method based on the wrong report of depth network filtering elevator malfunction
CN108828533B (en) * 2018-04-26 2021-12-31 电子科技大学 Method for extracting similar structure-preserving nonlinear projection features of similar samples
CN109104728B (en) * 2018-07-11 2021-09-14 浙江理工大学 ELM classification intrusion detection method based on improved LDA dimension reduction
CN110610145B (en) * 2019-08-28 2022-11-08 电子科技大学 Behavior identification method combined with global motion parameters

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542067A (en) * 2012-01-06 2012-07-04 上海交通大学 Automatic image semantic annotation method based on scale learning and correlated label dissemination
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
CN103699578A (en) * 2013-12-01 2014-04-02 北京航空航天大学 Image retrieval method based on spectrum analysis
WO2015008279A1 (en) * 2013-07-15 2015-01-22 Tel Hashomer Medical Research Infrastructure And Services Ltd. Mri image fusion methods and uses thereof
CN104715254A (en) * 2015-03-17 2015-06-17 东南大学 Ordinary object recognizing method based on 2D and 3D SIFT feature fusion
CN105912991A (en) * 2016-04-05 2016-08-31 湖南大学 Behavior identification method based on 3D point cloud and key bone nodes

Also Published As

Publication number Publication date
CN107423697A (en) 2017-12-01

Similar Documents

Publication Publication Date Title
Li et al. Shapenet: A shapelet-neural network approach for multivariate time series classification
CN107423697B (en) Behavior identification method based on nonlinear fusion depth 3D convolution descriptor
Kozerawski et al. Clear: Cumulative learning for one-shot one-class image recognition
JP2015052832A (en) Weight setting device and method
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN108537137B (en) Multi-modal biological characteristic fusion recognition method based on label identification correlation analysis
Reza et al. ICA and PCA integrated feature extraction for classification
CN105023006B (en) Face identification method based on enhanced nonparametric maximal margin criterion
CN108154156B (en) Image set classification method and device based on neural topic model
CN104268507A (en) Manual alphabet identification method based on RGB-D image
Chergui et al. Kinship verification through facial images using cnn-based features
Najar et al. A new hybrid discriminative/generative model using the full-covariance multivariate generalized Gaussian mixture models
CN109190471B (en) Attention model method for video monitoring pedestrian search based on natural language description
Tamura et al. Time series classification using macd-histogram-based sax and its performance evaluation
Hashida et al. Multi-channel mhlf: Lstm-fcn using macd-histogram with multi-channel input for time series classification
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
Novakovic et al. Classification accuracy of neural networks with pca in emotion recognition
Li et al. Multiple instance discriminative dictionary learning for action recognition
CN113887509A (en) Rapid multi-modal video face recognition method based on image set
CN107451537B (en) Face recognition method based on deep learning multi-layer non-negative matrix decomposition
CN112465054A (en) Multivariate time series data classification method based on FCN
CN112241922A (en) Power grid asset comprehensive value evaluation method based on improved naive Bayes classification
Sahoo et al. Vision-based static hand gesture recognition using dense-block features and svm classifier
CN111507243A (en) Human behavior recognition method based on Grassmann manifold analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant