CN113591797A - Deep video behavior identification method - Google Patents
- Publication number
- CN113591797A (application CN202110967362.8A / CN202110967362A)
- Authority
- CN
- China
- Prior art keywords
- projection
- depth
- layer
- convolution
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a depth video behavior recognition method comprising the following steps: project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences; compute the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample; input each dynamic image of each behavior sample into a feature extraction module to extract features; concatenate the features extracted from the dynamic images of each behavior sample and input the result into a fully connected layer; construct a four-stream human behavior recognition network; compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence; compute the dynamic images of each behavior sample to be tested and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a depth video behavior recognition method.
Background
At present, human behavior recognition is an important topic in the field of computer vision. It is widely applied in video surveillance, human-computer interaction and other fields.
Traditional methods focus on hand-crafted feature extraction of spatio-temporal information in depth video, followed by classification with classifiers such as support vector machines. However, these methods extract only shallow features, and the experimental results are not ideal. With the development of computing hardware, more and more researchers apply deep neural networks to human behavior recognition. Convolutional neural networks have a strong ability to learn from images and videos, so using them to analyze depth videos and recognize human behavior is a natural choice. Some researchers propose three-dimensional convolutional neural networks to extract deep spatio-temporal features from depth behavior videos, but feeding the depth video directly into the network cannot make good use of its three-dimensional information. Moreover, compared with two-dimensional convolutional neural networks, three-dimensional ones have more parameters and need more training data to converge, so they generally perform poorly on smaller datasets.
Therefore, to address these problems of existing behavior recognition algorithms, a depth video behavior recognition method is provided.
Disclosure of Invention
The invention is provided to solve the problems in the prior art, and aims to provide a depth video behavior recognition method that addresses the inability of the features extracted by existing recognition methods to make full use of the three-dimensional information in depth behavior videos.
A depth video behavior recognition method comprises the following steps:
1) project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences;
2) compute the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample;
3) input each dynamic image of each behavior sample into a feature extraction module to extract features;
4) concatenate the features extracted from the dynamic images of each behavior sample and input the result into a fully connected layer;
5) construct a four-stream human behavior recognition network;
6) compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence;
7) compute the dynamic images of the behavior sample to be tested and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
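The seven steps above can be orchestrated as a short end-to-end sketch. All function bodies below are toy stand-ins (hypothetical names, not the patent's actual projections, dynamic images, or network) so that the control flow runs:

```python
import numpy as np

def project(video, plane):
    # stand-in: a real implementation would project onto the named plane
    return video

def dynamic_image(seq):
    # stand-in: a real implementation would use RankSVM-based rank pooling
    return seq.mean(axis=0)

def four_stream_net(dynamic_images, num_classes=3):
    # stand-in: a real implementation is the trained four-stream network
    return np.full(num_classes, 1.0 / num_classes)

def recognize(depth_video):
    views = [project(depth_video, p) for p in ("front", "right", "left", "top")]
    dis = [dynamic_image(v) for v in views]   # 4 dynamic images per sample
    probs = four_stream_net(dis)
    return int(np.argmax(probs)), probs       # predicted class = max probability
```

Only the data flow (one projection sequence per plane, one dynamic image per sequence, four streams into one classifier) reflects the method itself.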
Preferably, the projection sequences in step 1) are obtained as follows:
Each behavior sample consists of all frames of its depth video. The depth video of any behavior sample is
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of V, R and C being its numbers of rows and columns, and ℝ denoting that the matrix is real; I_t(x_i, y_i) = d_i is the depth of the point p_i at coordinate (x_i, y_i) of the t-th frame depth image, i.e. the distance of p_i from the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of a behavior sample can then be represented as a set of projection sequences:
V = {V_front, V_right, V_left, V_top},
where V_front, V_right, V_left and V_top are the projection sequences obtained by projecting the depth video V of the behavior sample onto the front, right-side, left-side and top planes, respectively.
Preferably, the dynamic image in step 2) is computed as follows:
Take the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example. First vectorize F_t, i.e. concatenate the rows of F_t into a new row vector i_t.
Take the arithmetic square root of each element of i_t to obtain a new vector w_t:
w_t = sqrt(i_t),
where the square root is applied element-wise; w_t is the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the feature vector v_t of the t-th frame image of V_front as
v_t = Σ_{j=1}^{t} w_j,
i.e. the sum of the frame vectors of frames 1 through t of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the score B_t of the t-th frame image F_t of V_front as
B_t = u^T · v_t,
where u is a vector of dimension a, a = R × C; u^T denotes the transpose of u; and u^T · v_t is the dot product of the transposed vector u and the feature vector v_t.
Compute u such that, with the frame images of V_front ordered from front to back, the scores increase, i.e. the larger t is, the higher the score B_t. u can be computed with RankSVM as follows:
u* = argmin_u E(u),   E(u) = λ · ||u||² + Σ_{c>j} max{0, 1 − B_c + B_j},
where u* denotes the u that minimizes E(u); λ is a constant; ||u||² is the sum of the squares of the elements of u; B_c and B_j are the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample; and max{0, 1 − B_c + B_j} selects the larger of 0 and 1 − B_c + B_j.
After computing the vector u with RankSVM, reshape u into an image of the same size as F_t to obtain u′ ∈ ℝ^(R×C); u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
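The dynamic-image computation above can be sketched in NumPy. Plain subgradient descent on the hinge ranking objective is used here as a simple stand-in for a full RankSVM solver, so this is an illustrative approximation rather than the patent's exact procedure:

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-4, iters=100, seed=0):
    """Compute a dynamic image u' from a projection sequence.

    frames: array of shape (N, R, C) with non-negative pixel values.
    Subgradient descent on the hinge ranking objective stands in for a
    proper RankSVM solver.
    """
    N, R, C = frames.shape
    i = frames.reshape(N, -1).astype(float)   # i_t: each frame flattened to a row vector
    w = np.sqrt(i)                            # w_t: element-wise arithmetic square root
    v = np.cumsum(w, axis=0)                  # v_t: sum of frame vectors 1..t
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(R * C) * 1e-3
    for _ in range(iters):
        B = v @ u                             # scores B_t = u^T . v_t
        grad = 2.0 * lam * u                  # gradient of lam * ||u||^2
        for c in range(1, N):
            for j in range(c):                # enforce B_c > B_j for c > j
                if 1.0 - B[c] + B[j] > 0.0:
                    grad += v[j] - v[c]       # subgradient of max{0, 1 - B_c + B_j}
        u -= lr * grad
    return u.reshape(R, C)                    # u': reshaped to the frame size
```

The same function applies unchanged to the right-side, left-side and top projection sequences.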
Preferably, the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2. The outputs of convolution units 1 through 5 are fed into the multi-feature fusion unit; the output M_6 of the multi-feature fusion unit is fed into the average pooling layer; and the output S of the average pooling layer is fed into fully connected layer 1, which has D_1 neurons. The output Q_1 of fully connected layer 1 is computed as
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1, and θ_1 is the bias vector of fully connected layer 1.
The output Q_1 of fully connected layer 1 is fed into fully connected layer 2, which has D_2 neurons. The output Q_2 of fully connected layer 2 is computed as
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right-side, left-side and top projection sequences of the depth video V of the behavior sample are each input into the feature extraction module to extract the features Q_2^front, Q_2^right, Q_2^left and Q_2^top.
Preferably, the features in step 4) are concatenated as follows: the features Q_2 extracted from the four dynamic images are concatenated into one vector and input into fully connected layer 3, whose activation function is softmax. The output Q_3 of fully connected layer 3 is computed as
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of fully connected layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the extracted features concatenated into one vector, and θ_3 is the bias vector of fully connected layer 3.
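The fully connected layers and the four-stream concatenation can be illustrated with a tiny NumPy example. The dimensions below (feature length 8, D1 = 6, D2 = 4, K = 3 classes) are arbitrary toy values, not the patent's:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())               # shift by max for numerical stability
    return e / e.sum()

def fc(x, W, theta, act):
    """One fully connected layer: act(W . x + theta)."""
    return act(W @ x + theta)

rng = np.random.default_rng(0)
S = rng.standard_normal(8)                                    # pooled feature S
Q1 = fc(S, rng.standard_normal((6, 8)), np.zeros(6), relu)    # fully connected layer 1
Q2 = fc(Q1, rng.standard_normal((4, 6)), np.zeros(4), relu)   # fully connected layer 2
# the four per-stream features are concatenated before fully connected layer 3;
# here one Q2 stands in for all four streams
concat = np.concatenate([Q2, Q2, Q2, Q2])
Q3 = fc(concat, rng.standard_normal((3, 16)), np.zeros(3), softmax)
```

Because of the softmax, Q3 is a probability vector over the K behavior classes.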
Preferably, the four-stream human behavior recognition network in step 5) is constructed as follows: the inputs of the network are the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3. The loss function L of the network is
L = − Σ_{g=1}^{G} l_g · log(ŷ_g),
where G is the total number of training samples, K is the number of behavior classes, ŷ_g is the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, defined as the one-hot vector whose k-th component is 1 if the g-th sample belongs to class k and 0 otherwise, i.e. l_g encodes the label of the g-th sample.
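The cross-entropy loss over one-hot labels can be computed as below. The averaging over G is an assumption; the patent's exact normalization is not recoverable from the garbled formula:

```python
import numpy as np

def cross_entropy_loss(y_hat, labels):
    """Cross-entropy over G samples.

    y_hat: array (G, K) of softmax outputs; labels: array (G,) of class indices.
    """
    G, K = y_hat.shape
    l = np.zeros((G, K))
    l[np.arange(G), labels] = 1.0             # l_g: one-hot expected output
    return -(l * np.log(y_hat + 1e-12)).sum() / G
```

For a perfectly confident correct prediction the loss approaches 0; for a uniform prediction over two classes it equals ln 2.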
Preferably, the behavior recognition in step 7) is performed as follows: compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample to be tested and input them into the trained four-stream behavior recognition network to obtain the predicted probability of each behavior class; the behavior class with the maximum probability is the class finally predicted for the current test behavior video sample.
Preferably, the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. A point p_i in the depth image with abscissa x_i, ordinate y_i and depth value d_i is projected onto F_t at abscissa x_i and ordinate y_i with pixel value f_1(d_i), which can be formulated as
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection image, i.e. points closer to the depth camera appear brighter on the front projection image.
The projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, what is seen is the point closest to the observer, i.e. the point farthest from the projection plane. The abscissa value of the point farthest from the projection plane is therefore retained and used to compute the pixel value at that position of the projection image. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto R_t at abscissa d_i and ordinate y_i with pixel value f_2(x_i), formulated as
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]. As x increases, if a new point is projected to the same position as a previously projected point, the newest point is kept; that is, the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa: R_t(d_i, y_i) = f_2(x_m), where x_m = max{x_i : x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
The projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image. When several points are projected to the same position of the left-side projection image, the point farthest from the projection plane is kept. The points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto L_t at abscissa d_i and ordinate y_i; for points projected to the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min{x_i : x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
The projection sequence V_top is obtained as follows:
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) is the projection image obtained by projecting the t-th frame depth image from the top. When several points are projected to the same position of the top projection image, the point farthest from the projection plane is kept. The points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto T_t at abscissa x_i and ordinate d_i; for points projected to the same coordinate of the top projection image, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, formulated as
T_t(x_i, d_i) = f_4(y_q),
where f_4 is a linear function mapping the ordinate value y_q to the interval [0, 255], y_q = max{y_i : y_i ∈ Y_T}, and Y_T is the set of ordinates of all points in the depth image with abscissa x_i and depth value d_i.
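The four projections of a single depth frame can be sketched as follows. The exact linear maps f1 through f4 are not fully specified in the text, so the concrete forms below (e.g. nearer points brighter in the front view) are illustrative assumptions:

```python
import numpy as np

def project_views(I, D):
    """Project one depth frame I (shape (R, C), integer depths in [0, D])
    onto the front, right-side, left-side and top planes.

    The linear maps f1..f4 used here are assumptions, chosen only to
    satisfy the stated brightness conventions.
    """
    R, C = I.shape
    front = 255.0 * (D - I) / D                      # f1: smaller depth -> brighter
    right = np.zeros((R, D + 1))
    left = np.zeros((R, D + 1))
    top = np.zeros((D + 1, C))
    for y in range(R):
        for x in range(C):
            d = int(I[y, x])
            # right view: keep the point with the largest x at (y, d)
            right[y, d] = max(right[y, d], 255.0 * x / max(C - 1, 1))
            # left view: keep the smallest x (its pixel value is largest here)
            left[y, d] = max(left[y, d], 255.0 * (C - 1 - x) / max(C - 1, 1))
            # top view: keep the point with the largest y at (d, x)
            top[d, x] = max(top[d, x], 255.0 * y / max(R - 1, 1))
    return {"front": front, "right": right, "left": left, "top": top}
```

The explicit loops mirror the column-by-column and row-by-row traversals described above; a production implementation would vectorize them.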
Preferably, convolution unit 1 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 64 convolution kernels of size 3 × 3, the pooling kernel of the max pooling layer has size 2 × 2, and the output of convolution unit 1 is C_1.
Convolution unit 2 comprises 2 convolutional layers, each with 128 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 2 is C_1 and its output is C_2.
Convolution unit 3 comprises 3 convolutional layers, each with 256 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 3 is C_2 and its output is C_3.
Convolution unit 4 comprises 3 convolutional layers, each with 512 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 4 is C_3 and its output is C_4.
Convolution unit 5 comprises 3 convolutional layers, each with 512 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 5 is C_4 and its output is C_5.
The inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5.
The output C_1 of convolution unit 1 is fed into max pooling layer 1 and convolutional layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M_1.
The output C_2 of convolution unit 2 is fed into max pooling layer 2 and convolutional layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M_2.
The output C_3 of convolution unit 3 is fed into convolutional layer 3 of the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M_3.
The output C_4 of convolution unit 4 is fed into upsampling layer 1 and convolutional layer 4 of the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M_4.
The output C_5 of convolution unit 5 is fed into upsampling layer 2 and convolutional layer 5 of the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M_5.
The outputs M_1, M_2, M_3, M_4 and M_5 of convolutional layers 1 through 5 are concatenated along the channel dimension and fed into convolutional layer 6, which has 512 convolution kernels of size 1 × 1; the output of convolutional layer 6 is M_6.
The output of the multi-feature fusion unit is M_6, the output of convolutional layer 6.
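The channel and spatial bookkeeping of the five convolution units can be sketched as below. It assumes the 3 × 3 convolutions are zero-padded so that only each unit's 2 × 2 max pooling halves the height and width; the text does not state the padding, so this is an assumption:

```python
def feature_shapes(h, w):
    """Return (channels, height, width) after convolution units 1-5,
    assuming 'same'-padded 3x3 convolutions and one 2x2 max pooling
    (spatial halving) per unit."""
    channels = [64, 128, 256, 512, 512]   # kernels per layer in units 1-5
    shapes = []
    for ch in channels:
        h, w = h // 2, w // 2             # each unit's 2x2 max pooling
        shapes.append((ch, h, w))
    return shapes
```

For a 224 × 224 dynamic image this yields the familiar VGG-style progression, ending at 512 channels of size 7 × 7 before the multi-feature fusion unit.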
The invention has the following beneficial effects: 1) behavior recognition based on depth video cannot capture information such as a person's appearance, which protects personal privacy; at the same time, depth video is not easily affected by illumination and provides richer three-dimensional information about the behavior;
2) projecting the depth video onto different planes yields information about different dimensions of the behavior, and combining this information makes human behavior easier to recognize; when training the network, only 4 dynamic images are used as a compact representation of each video, so the demands on computer hardware are modest.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow diagram of the feature extraction module.
Fig. 3 is a flow chart of a four-stream human behavior recognition network.
FIG. 4 is a schematic plane projection diagram of a hand waving behavior in the embodiment.
Fig. 5 is a front projection dynamic image of the waving behavior in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention, referring to fig. 1 to 5, is a depth video behavior recognition method, including the following steps:
1) project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain 4 projection sequences;
2) compute the dynamic images of the 4 projection sequences of each behavior sample to obtain 4 dynamic images per sample;
3) input the 4 dynamic images separately into the feature extraction module to extract features;
4) concatenate the features extracted from the 4 dynamic images and input the result into a fully connected layer;
5) construct the four-stream human behavior recognition network;
6) compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence;
7) compute the 4 dynamic images of each test behavior sample and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
The projection sequences in step 1) are obtained as follows:
Each behavior sample consists of all frames of its depth video. For the depth video V of any behavior sample:
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of V, R and C being its numbers of rows and columns, and ℝ denoting that the matrix is real; I_t(x_i, y_i) = d_i is the depth of the point p_i at coordinate (x_i, y_i) of the t-th frame depth image, i.e. the distance of p_i from the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of the behavior sample is projected onto four planes: the front, right side, left side and top. The depth video V of the behavior sample can then be expressed as a set of four projection image sequences:
V = {V_front, V_right, V_left, V_top},
where V_front, V_right, V_left and V_top are the projection sequences obtained by projecting the depth video V of the behavior sample onto the front, right-side, left-side and top planes, respectively.
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. A point p_i in the depth image with abscissa x_i, ordinate y_i and depth value d_i is projected onto F_t at abscissa x_i and ordinate y_i with pixel value f_1(d_i), which can be formulated as
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection image, i.e. points closer to the depth camera appear brighter on the front projection image.
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, what is seen is the point closest to the observer, i.e. the point farthest from the projection plane. The abscissa value of the point farthest from the projection plane should therefore be retained and used to compute the pixel value at that position of the projection image. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto R_t at abscissa d_i and ordinate y_i with pixel value f_2(x_i), formulated as
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]. As x increases, if a new point is projected to the same position as a previously projected point, the newest point is kept; that is, the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa: R_t(d_i, y_i) = f_2(x_m), where x_m = max{x_i : x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image. As with the right-side projection image, when several points are projected to the same position of the left-side projection image, the point farthest from the projection plane should be kept. To this end, the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto L_t at abscissa d_i and ordinate y_i; for points projected to the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min{x_i : x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
Vtop = {Tt | t ∈ [1, N]}, where Tt ∈ D×C denotes the projection map obtained by projecting the t-th frame depth image from the top surface. When several points are projected to the same position of the top projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and moving in the direction of increasing y, and projected onto the top projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the pixel value and the ordinate of the corresponding point in the projection map Tt. For points projected onto the same coordinates of the top projection map, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, expressed as:
where f4 is a linear function mapping the ordinate value yq to the interval [0, 255], yq = max{yi | yi ∈ YO}, YO is the set of ordinates of all points in the depth image having the given abscissa and depth value, and max{yi | yi ∈ YO} denotes the maximum ordinate in the set YO.
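To make the four projections above concrete, the sketch below projects a single depth frame onto the front, right, left and top planes with NumPy. The scaling functions f1–f4 are simplified here to fixed linear maps, and zero depth is treated as background; both are assumptions of this sketch, not the patent's exact formulas.

```python
import numpy as np

def project_views(depth, d_max=255):
    """Project one R x C depth frame onto front, right, left and top planes.

    depth[y, x] = d is the distance of point (x, y) from the camera;
    d == 0 is treated as background (an assumption of this sketch).
    """
    rows, cols = depth.shape
    # Front view (f1): nearer points get larger pixel values.
    front = np.where(depth > 0, d_max - depth, 0)

    right = np.zeros((rows, d_max + 1))  # y-d plane, keep largest x
    left = np.zeros((rows, d_max + 1))   # y-d plane, keep smallest x
    top = np.zeros((d_max + 1, cols))    # d-x plane, keep largest y
    for y in range(rows):
        for x in range(cols):
            d = int(depth[y, x])
            if d == 0:
                continue
            # f2: the largest abscissa wins on the right view.
            right[y, d] = max(right[y, d], 255.0 * x / max(cols - 1, 1))
            # f3: the smallest abscissa wins on the left view (encoded so
            # that points nearer the left plane are brighter).
            left[y, d] = max(left[y, d],
                             255.0 * (cols - 1 - x) / max(cols - 1, 1))
            # f4: the largest ordinate wins on the top view.
            top[d, x] = max(top[d, x], 255.0 * y / max(rows - 1, 1))
    return front, right, left, top
```

Because the traversal uses `max` per target cell, the scan order does not matter in this sketch; the patent's directional traversal achieves the same "keep the extreme point" effect.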
Step 2) acquiring dynamic images:
Taking the front projection sequence Vfront = {Ft | t ∈ [1, N]} of the depth video V of a behavior sample as an example, the dynamic image is computed as follows:
First, Ft is vectorized, i.e. the rows of Ft are concatenated into a new row vector it;
for the row vector it, the arithmetic square root of each element is taken to obtain a new vector wt, namely:
where √(it) denotes taking the arithmetic square root of each element of the row vector it; wt is called the frame vector of the t-th frame of the front projection sequence Vfront of the depth video V of the behavior sample;
the feature vector vt of the t-th frame image of the front projection sequence Vfront of the depth video V of the behavior sample is then computed as follows:
where the sum runs over the frame vectors of the 1st through t-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample;
the score Bt of the t-th frame image Ft of the front projection sequence Vfront of the depth video V of the behavior sample is computed by the formula:
Bt=uT·vt,
where u is a vector of dimension a, a = R × C; uT denotes the transpose of the vector u; uT·vt denotes the dot product of the transposed vector u and the feature vector vt;
u is computed such that the later a frame image appears in the front projection sequence Vfront, the higher its score, i.e. the larger t is, the higher the score Bt; u can be computed with RankSVM as follows:
where u* denotes the u that minimizes E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; Bc and Bj denote the scores of the c-th and j-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample, respectively, and max{0, 1−Bc+Bj} selects the larger of 0 and 1−Bc+Bj;
after the vector u is computed with RankSVM, u is rearranged into an image of the same size as Ft, giving u′ ∈ R×C; u′ is called the dynamic image of the front projection sequence Vfront of the depth video V of the behavior sample.
The dynamic images of the right side, left side, and top projection sequences of the depth video V of the behavior sample are calculated in the same manner as the dynamic images of the front projection sequence.
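The dynamic-image computation above can be sketched end to end: frames are flattened to row vectors it, square-rooted to frame vectors wt, and averaged into feature vectors vt. Instead of solving the RankSVM problem of the text, this sketch substitutes the simple closed-form rank-pooling coefficients αt = 2t − N − 1 as a stand-in for the learned u; these coefficients, and the use of the running mean for vt, are assumptions made for brevity, not the patent's exact procedure.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate dynamic image of a projection sequence.

    frames: (N, R, C) array of projection maps F_t.
    Returns an R x C image summarizing the temporal evolution.
    """
    frames = np.asarray(frames, dtype=float)
    n, r, c = frames.shape
    w = np.sqrt(frames.reshape(n, -1))            # frame vectors w_t = sqrt(i_t)
    v = np.cumsum(w, axis=0) / np.arange(1, n + 1)[:, None]  # feature vectors v_t
    alpha = 2.0 * np.arange(1, n + 1) - n - 1     # stand-in ranking weights
    u = (alpha[:, None] * v).sum(axis=0)          # surrogate for the RankSVM u
    return u.reshape(r, c)
```

With these weights, later frames contribute positively and earlier frames negatively, so the resulting image emphasizes how the projection maps change over time, which is the role the learned u plays in the patent.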
Step 3) feature extraction by the feature extraction module:
as shown in fig. 2, the dynamic images of the front, right, left, and top projection sequences of the depth video of the behavior sample are respectively input to the feature extraction module to extract features. The feature extraction module comprises a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5, a multi-feature fusion unit, an average pooling layer, a full connection layer 1 and a full connection layer 2.
Convolution unit 1 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, and the output of convolution unit 1 is C1.
Convolution unit 2 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 2 is C1, and the output is C2.
Convolution unit 3 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 3 is C2, and the output is C3.
Convolution unit 4 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 4 is C3, and the output is C4.
Convolution unit 5 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 5 is C4, and the output is C5.
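Under the common assumption that each 3 × 3 convolution uses stride 1 and padding 1 (the patent does not state the padding), only the 2 × 2 poolings change the spatial size, so the shapes of C1–C5 can be traced with simple arithmetic; the five units follow the familiar VGG-16 layout.

```python
def unit_shapes(h, w):
    """Channel count and spatial size of C1..C5 for an h x w input.

    Assumes size-preserving 3x3 convolutions and a single 2x2/stride-2
    max pooling at the end of each unit (flooring odd sizes).
    """
    shapes = []
    for channels in (64, 128, 256, 512, 512):
        h, w = h // 2, w // 2  # only the pooling changes spatial size
        shapes.append((channels, h, w))
    return shapes
```

For the 240 × 240 dynamic images of the embodiment, this gives C1 at 120 × 120 down through C5 at 7 × 7.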
The inputs of the multi-feature fusion unit are the outputs C1, C2, C3, C4 and C5 of convolution units 1–5.
The output C1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M1.
The output C2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M2.
The output C3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M3.
The output C4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M4.
The output C5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M5.
The outputs M1, M2, M3, M4 and M5 of convolutional layers 1–5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and its output is M6.
The output of the multi-feature fusion unit is the output M6 of convolutional layer 6.
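The branch operations (4 × 4 pooling on C1, 2 × 2 pooling on C2, upsampling on C4 and C5) serve to bring all five outputs to a common spatial size before concatenation, and the 1 × 1 convolutions mix channels without touching that size. The sketch below shows the channel concatenation and the 1 × 1 fusion convolution in NumPy; the 30 × 30 spatial size (C3's resolution for a 240 × 240 input) and the random weights are assumptions for illustration only.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.

    x: (c_in, h, w) feature map; w: (c_out, c_in) kernel matrix.
    """
    return np.tensordot(w, x, axes=([1], [0]))

rng = np.random.default_rng(0)
# Five branches M1..M5, already aligned to a common 30 x 30 grid.
branches = [rng.standard_normal((512, 30, 30)) for _ in range(5)]
m_cat = np.concatenate(branches, axis=0)      # channel concat: (2560, 30, 30)
w6 = rng.standard_normal((512, 2560)) * 0.01  # convolutional layer 6 kernels
m6 = conv1x1(m_cat, w6)                       # fused output M6: (512, 30, 30)
```

A 1 × 1 convolution is exactly a matrix multiply over the channel axis, which is why `tensordot` over the channel dimension reproduces it here.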
The output M6 of the multi-feature fusion unit is input to the average pooling layer, whose output is S; S is input to fully connected layer 1, which has D1 neurons, and the output Q1 of fully connected layer 1 is computed as follows:
Q1=φrelu(W1·S+θ1),
where φrelu is the ReLU activation function, W1 is the weight of fully connected layer 1, and θ1 is the bias vector of fully connected layer 1;
the output Q1 of fully connected layer 1 is input to fully connected layer 2, which has D2 neurons; the output Q2 of fully connected layer 2 is computed as follows:
Q2=φrelu(W2·Q1+θ2),
where W2 is the weight of fully connected layer 2 and θ2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module;
the dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module respectively, and the corresponding features are extracted.
Step 4) connecting the features extracted in step 3):
The features obtained by inputting the dynamic images of the four projection sequences of the depth video of each behavior sample into the feature extraction module are concatenated and input to fully connected layer 3, whose activation function is softmax; the output Q3 of fully connected layer 3 is computed as follows:
where φsoftmax denotes the softmax activation function, W3 is the weight of fully connected layer 3, the concatenation of the four extracted features forms the input vector, and θ3 is the bias vector of fully connected layer 3.
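A minimal NumPy sketch of this classification step: the four per-view features are concatenated and passed through fully connected layer 3 with softmax. The dimensions (four 1000-dimensional features and K = 8 classes, matching the embodiment) and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(features, w3, theta3):
    """Concatenate per-view features and apply fully connected layer 3."""
    q = np.concatenate(features)    # the four stream features joined end to end
    return softmax(w3 @ q + theta3)

rng = np.random.default_rng(1)
features = [rng.standard_normal(1000) for _ in range(4)]  # four streams
w3 = rng.standard_normal((8, 4000)) * 0.01                # K = 8 classes
theta3 = np.zeros(8)
q3 = classify(features, w3, theta3)                       # class probabilities
```

The output is a probability vector over the K behavior classes, which is what the four-stream network in step 5) produces.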
Step 5) constructing a four-stream human behavior recognition network:
as shown in FIG. 3, the inputs of the network are the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior samples, and the output is the probability that the corresponding behavior sample belongs to each behavior class, i.e. the output Q3 of fully connected layer 3; the loss function L of the network is:
where G is the total number of training samples and K is the number of behavior classes; the loss compares the network output of the g-th behavior sample with lg, the expected output of the g-th behavior sample, where lg is defined as:
where lg is the label value of the g-th sample.
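The loss described above is the standard cross-entropy between the network's softmax output and the one-hot expected output lg. A small sketch, assuming the loss is averaged over the G training samples (the normalization is not shown explicitly in the text):

```python
import numpy as np

def cross_entropy(outputs, labels, num_classes):
    """Mean cross-entropy loss over a batch.

    outputs: (G, K) predicted class probabilities for each sample.
    labels:  (G,) integer tag values l_g, expanded to one-hot targets.
    """
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    # Small epsilon guards against log(0).
    return -(one_hot * np.log(outputs + 1e-12)).sum() / len(labels)
```

A perfect prediction yields a loss near 0, while a uniform prediction over K classes yields log K, which is why minimizing this loss pushes the probability mass onto the correct class.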
Step 6) the dynamic images of the front, right side, left side and top projection sequences of the depth video of each training behavior sample are computed and input to the four-stream human behavior recognition network, and the network is trained until convergence.
Step 7) the dynamic images of the front, right side, left side and top projection sequences of the depth video of each test behavior sample are computed and input to the trained four-stream behavior recognition network, which yields the predicted probability of the current test behavior video sample belonging to each behavior class; the behavior class with the largest probability value is the finally predicted behavior class of the current test behavior video sample, thereby achieving behavior recognition.
Embodiment:
As shown in FIGS. 4-5:
1) The behavior sample set has 2400 samples in total, with 8 behavior classes and 300 samples per class. Two thirds of the samples in each behavior class are randomly selected for the training set and the remaining one third for the test set, giving 1600 training samples and 800 test samples.
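The per-class 2/3–1/3 split above can be sketched as follows; the random seed and rounding rule are arbitrary choices of this sketch, not specified by the source.

```python
import numpy as np

def split_per_class(labels, train_frac=2 / 3, seed=0):
    """Randomly split sample indices into train/test within each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, test = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_frac * len(idx)))
        train.extend(idx[:n_train].tolist())
        test.extend(idx[n_train:].tolist())
    return train, test
```

Splitting within each class keeps the class proportions identical in the training and test sets, which the described 2/3 per-class selection guarantees.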
Each behavior sample consists of all frames in the sample depth video. Take the depth video V of any behavior sample as an example:
V={It|t∈[1,50]},
where t denotes the time index; the behavior sample has 50 frames in total. It ∈ 240×240 is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample; the numbers of rows and columns of each frame depth image are both 240, and the representation matrix is a real matrix. It(xi, yi) = di denotes the depth of the point pi with coordinates (xi, yi) on the t-th frame depth image, i.e. the distance of the point pi from the depth camera.
And respectively projecting the depth video V of the behavior sample to four planes, namely a front surface, a right side surface, a left side surface and a top surface. At this time, the depth video V of the behavior sample can be expressed as a set of four projection graph sequences, which are formulated as follows:
V={Vfront,Vright,Vleft,Vtop},
wherein, VfrontProjection sequence obtained by front projection of a depth video V representing a behavior sample, VrightA projection sequence obtained by right side projection of a depth video V representing a behavior sample, VleftA projection sequence obtained by left side projection of a depth video V representing a behavior sample, VtopA sequence of projections of the depth video V representing the behavior sample onto the top surface.
Vfront = {Ft | t ∈ [1, 50]}, where Ft ∈ 240×240 is the projection map obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the ordinate and the pixel value of the corresponding point in the projection map Ft, which can be formulated as:
where f1 is a linear function mapping the depth value di to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection map, i.e. points closer to the depth camera appear brighter on the front projection map.
Vright = {Rt | t ∈ [1, 50]}, where Rt ∈ 240×240 denotes the projection map obtained by right side projection of the t-th frame depth image. When the depth image is projected to the right side, more than one point may be projected to the same location on the projection map. Viewing the behavior from the right side, only the point closest to the viewer, i.e. the point farthest from the projection plane, can be seen. Therefore, the abscissa of the point farthest from the projection plane on the depth image should be retained and used to compute the pixel value at that position of the projection map. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and moving in the direction of increasing x, and projected onto the projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the pixel value, the ordinate and the abscissa of the corresponding point in the projection map Rt, formulated as:
where f2 is a linear function mapping the abscissa value xi to the interval [0, 255]. As x increases, a new point may be projected to the same position of the projection map as previously projected points; the latest point is kept, i.e. the pixel value at that position is computed from the point with the largest abscissa, using f2(xm), where xm = max{xi | xi ∈ XR}, XR is the set of abscissas of all points in the depth image having the given ordinate and depth value, and max{xi | xi ∈ XR} denotes the maximum abscissa in the set XR.
Vleft = {Lt | t ∈ [1, 50]}, where Lt ∈ 240×240 denotes the projection map obtained by projecting the t-th frame depth image onto the left side surface. As with the right side projection, when several points are projected to the same position of the left side projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and moving in the direction of decreasing x, and projected onto the left side projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the pixel value, the ordinate and the abscissa of the corresponding point in the projection map Lt. For points projected onto the same coordinates of the left side projection map, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, expressed as:
where f3 is a linear function mapping the abscissa value xn to the interval [0, 255], xn = min{xi | xi ∈ XL}, XL is the set of abscissas of all points in the depth image having the given ordinate and depth value, and min{xi | xi ∈ XL} denotes the minimum abscissa in the set XL.
Vtop = {Tt | t ∈ [1, 50]}, where Tt ∈ 240×240 denotes the projection map obtained by projecting the t-th frame depth image from the top surface. When several points are projected to the same position of the top projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and moving in the direction of increasing y, and projected onto the top projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the pixel value and the ordinate of the corresponding point in the projection map Tt. For points projected onto the same coordinates of the top projection map, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, expressed as:
where f4 is a linear function mapping the ordinate value yq to the interval [0, 255], yq = max{yi | yi ∈ YO}, YO is the set of ordinates of all points in the depth image having the given abscissa and depth value, and max{yi | yi ∈ YO} denotes the maximum ordinate in the set YO.
2) The dynamic images of the 4 projection sequences of the depth video of each behavior sample are computed, giving 4 dynamic images per behavior sample. Taking the front projection sequence Vfront = {Ft | t ∈ [1, 50]} of the depth video V of a behavior sample as an example, the dynamic image is computed as follows:
First, Ft is vectorized, i.e. the rows of Ft are concatenated into a new row vector it.
For the row vector it, the arithmetic square root of each element is taken to obtain a new vector wt, namely:
where √(it) denotes taking the arithmetic square root of each element of the row vector it; wt is called the frame vector of the t-th frame of the front projection sequence Vfront of the depth video V of the behavior sample.
The feature vector vt of the t-th frame image of the front projection sequence Vfront of the depth video V of the behavior sample is computed as follows:
where the sum runs over the frame vectors of the 1st through t-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample;
the score Bt of the t-th frame image Ft of the front projection sequence Vfront of the depth video V of the behavior sample is computed by the formula:
Bt=uT·vt,
where u is a vector of dimension 57600; uT denotes the transpose of the vector u; uT·vt denotes the dot product of the transposed vector u and the feature vector vt;
u is computed such that the later a frame image appears in the front projection sequence Vfront, the higher its score, i.e. the larger t is, the higher the score Bt; u can be computed with RankSVM as follows:
where u* denotes the u that minimizes E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; Bc and Bj denote the scores of the c-th and j-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample, respectively, and max{0, 1−Bc+Bj} selects the larger of 0 and 1−Bc+Bj;
after the vector u is computed with RankSVM, u is rearranged into an image of the same size as Ft, giving u′ ∈ 240×240; u′ is called the dynamic image of the front projection sequence Vfront of the depth video V of the behavior sample. FIG. 4 shows the front projection dynamic image of the hand-waving behavior.
The motion images of the right, left, and top projection sequences of the depth video V of the behavior sample are calculated in the same manner as the motion images of the front projection sequence.
3) And respectively inputting the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior sample into a feature extraction module to extract features. The feature extraction module comprises a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5, a multi-feature fusion unit, an average pooling layer, a full connection layer 1 and a full connection layer 2.
Convolution unit 1 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The output of convolution unit 1 is C1.
Convolution unit 2 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 2 is C1, the output is C2.
Convolution unit 3 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 3 is C2, the output is C3.
Convolution unit 4 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 4 is C3, the output is C4.
Convolution unit 5 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 5 is C4, the output is C5.
The inputs of the multi-feature fusion unit are the outputs C1, C2, C3, C4 and C5 of convolution units 1–5.
The output C1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M1.
The output C2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M2.
The output C3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M3.
The output C4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M4.
The output C5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M5.
The outputs M1, M2, M3, M4 and M5 of convolutional layers 1–5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and its output is M6.
The output of the multi-feature fusion unit is the output M6 of convolutional layer 6.
The output M6 of the multi-feature fusion unit is input to the average pooling layer, whose output is S; S is input to fully connected layer 1, which has 4096 neurons, and the output Q1 of fully connected layer 1 is computed as follows:
Q1=φrelu(W1·S+θ1),
where φrelu is the ReLU activation function, W1 is the weight of fully connected layer 1, and θ1 is the bias vector of fully connected layer 1.
The output Q1 of fully connected layer 1 is input to fully connected layer 2, which has 1000 neurons; the output Q2 of fully connected layer 2 is computed as follows:
Q2=φrelu(W2·Q1+θ2),
where W2 is the weight of fully connected layer 2 and θ2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module respectively, and the corresponding features are extracted.
4) The dynamic images of the four projection sequences of the depth video of each behavior sample are input to the feature extraction module; the obtained features are concatenated and input to fully connected layer 3, whose activation function is softmax. The output Q3 of fully connected layer 3 is computed as follows:
where φsoftmax denotes the softmax activation function, W3 is the weight of fully connected layer 3, the concatenation of the four extracted features forms the input vector, and θ3 is the bias vector of fully connected layer 3.
5) A four-stream human behavior recognition network is constructed; the inputs of the network are the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior samples, and the output is the probability that the corresponding behavior sample belongs to each behavior class, i.e. the output Q3 of fully connected layer 3. The loss function L of the network is:
where the loss compares the network output of the g-th behavior sample with lg, the expected output of the g-th behavior sample, which is defined as:
where lg is the label value of the g-th sample.
6) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each training behavior sample are computed and input to the four-stream human behavior recognition network, and the network is trained until convergence.
7) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each test behavior sample are computed and input to the trained four-stream human behavior recognition network, which yields the predicted probability of the current test behavior video sample belonging to each behavior class; the behavior class with the largest probability value is the finally predicted behavior class of the current test behavior video sample, thereby achieving behavior recognition.
The ReLU activation function has the formula f(x) = max(0, x); the input of the function is x and the output is the larger of x and 0.
The softmax activation function has the formula Si = e^(zi) / Σ(j=1..n) e^(zj), where zi denotes the output of the i-th neuron of the fully connected layer, zj denotes the output of the j-th neuron of the fully connected layer, n is the number of neurons of the fully connected layer, and Si denotes the output of the i-th neuron of the fully connected layer after the softmax activation function.
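Both activation definitions above can be checked numerically; this sketch simply evaluates the formulas with NumPy.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(z):
    """S_i = exp(z_i) / sum_j exp(z_j); shifting by max(z) keeps exp stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

Negative inputs are zeroed by ReLU, while softmax turns arbitrary scores into a probability distribution that preserves their ordering.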
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. A depth video behavior recognition method is characterized by comprising the following steps:
1) carrying out front, right side, left side and top projection on the depth video of each behavior sample to obtain a corresponding projection sequence;
2) obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence;
3) inputting the dynamic image of each behavior sample into a feature extraction module and extracting features;
4) connecting the features extracted from the dynamic images of each behavior sample, and inputting the connected features into a full connection layer;
5) constructing a four-flow human behavior recognition network;
6) calculating dynamic images of front, right side, left side and top projection sequences of the depth video of each training behavior sample, inputting the dynamic images into a four-stream human behavior recognition network, and training the four-stream human behavior recognition network until convergence;
7) and calculating each dynamic image of the behavior sample to be tested, and inputting each calculated dynamic image into the trained four-flow human behavior recognition network to realize behavior recognition.
2. The method for identifying deep video behaviors as claimed in claim 1, wherein the projection sequence in step 1) is obtained by:
each behavior sample is composed of all frames in the depth video of the sample; for the depth video of any behavior sample,
V={It|t∈[1,N]},
where t denotes the time index, and N is the total number of frames of the depth video V of the behavior sample; It ∈ R×C is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample, where R and C are respectively the numbers of rows and columns of the matrix, and the representation matrix is a real matrix; It(xi, yi) = di denotes the depth of the point pi with coordinates (xi, yi) on the t-th frame depth image, i.e. the distance of the point pi from the depth camera, di ∈ [0, D], where D denotes the farthest distance the depth camera can detect;
the depth video V of a behavior sample may be represented as a set of projection sequences, formulated as follows:
V={Vfront,Vright,Vleft,Vtop},
wherein, VfrontProjection sequence obtained by front projection of a depth video V representing a behavior sample, VrightA projection sequence obtained by right side projection of a depth video V representing a behavior sample, VleftA projection sequence obtained by left side projection of a depth video V representing a behavior sample, VtopAnd the projection sequence obtained by top surface projection of the depth video V representing the behavior sample.
3. The method according to claim 1, wherein the dynamic image in step 2) is computed as follows:
taking the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example, first vectorize F_t, i.e., concatenate the rows of F_t into a new row vector i_t;
take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:

w_t = √(i_t),

where √(i_t) denotes taking the arithmetic square root of each element of the row vector i_t; w_t serves as the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample;
compute the feature vector v_t of the t-th frame image of the front projection sequence V_front of the depth video V of the behavior sample as follows:

v_t = (1/t) · Σ_{τ=1}^{t} w_τ,

where Σ_{τ=1}^{t} w_τ denotes summing the frame vectors of the 1st through t-th frame images of the front projection sequence V_front of the depth video V of the behavior sample;
compute the score B_t of the t-th frame image F_t of the front projection sequence V_front of the depth video V of the behavior sample according to:

B_t = u^T · v_t,

where u is a vector of dimension a, a = R × C; u^T denotes the transpose of the vector u; u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t;
compute the value of u such that, when the frame images of the front projection sequence V_front are ordered from first to last, the scores increase monotonically, i.e., the larger t is, the higher the score B_t; u can be computed with RankSVM as follows:

u* = argmin_u E(u), E(u) = (λ/2) · ||u||² + (2 / (N(N−1))) · Σ_{c>j} max{0, 1 − B_c + B_j},

where u* denotes the u that minimizes the value of E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; B_c and B_j are the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample, and max{0, 1 − B_c + B_j} denotes choosing the larger of 0 and 1 − B_c + B_j;
after computing the vector u with RankSVM, reshape u into an image of the same size as F_t to obtain u′ ∈ ℝ^(R×C); u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
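The rank-pooling computation described in claim 3 can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the RankSVM objective E(u) is minimized here by plain gradient descent as a stand-in for a dedicated RankSVM solver, and the learning rate, step count and λ are arbitrary choices.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-2, steps=300):
    """Compute a dynamic image from a list of N (R, C) float frames
    scaled to [0, 1], following the claim-3 recipe: vectorize, take
    element-wise square roots (w_t), average (v_t), then fit u so that
    the scores B_t = u . v_t increase with t."""
    N = len(frames)
    R, C = frames[0].shape
    # w_t: element-wise arithmetic square root of the vectorized frame
    w = np.array([np.sqrt(f.reshape(-1)) for f in frames])
    # v_t: running mean of the frame vectors w_1..w_t
    v = np.cumsum(w, axis=0) / np.arange(1, N + 1)[:, None]

    u = np.zeros(R * C)
    pairs = [(c, j) for c in range(N) for j in range(N) if c > j]
    for _ in range(steps):
        B = v @ u                        # scores B_t = u^T v_t
        grad = lam * u                   # gradient of (lam/2)*||u||^2
        for c, j in pairs:
            if 1.0 - B[c] + B[j] > 0.0:  # active hinge term of E(u)
                grad += (2.0 / (N * (N - 1))) * (v[j] - v[c])
        u -= lr * grad
    return u.reshape(R, C)               # u' in the claim's notation

# Demo on a synthetic ramp sequence (brightness grows over time).
frames = [np.full((4, 4), (t + 1) / 5.0) for t in range(5)]
di = dynamic_image(frames)
```

On this toy input the hinge gradients push every element of u upward, so the learned dynamic image is positive everywhere; real projection sequences would of course produce structured images.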
4. The depth video behavior recognition method according to claim 1, wherein the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2; first, the outputs of convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4 and convolution unit 5 are input in sequence to the multi-feature fusion unit; the output M_6 of the multi-feature fusion unit is then input to the average pooling layer, and the output S of the average pooling layer is input to fully connected layer 1; the number of neurons of fully connected layer 1 is D_1, and its output Q_1 is computed as follows:

Q_1 = φ_relu(W_1 · S + θ_1),

where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1, and θ_1 is the bias vector of fully connected layer 1;
the output Q_1 of fully connected layer 1 is input to fully connected layer 2; the number of neurons of fully connected layer 2 is D_2, and its output Q_2 is computed as follows:

Q_2 = φ_relu(W_2 · Q_1 + θ_2),

where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2; the output of fully connected layer 2 is the feature extracted by the feature extraction module.
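The two fully connected layers of claim 4 amount to two affine maps with ReLU activations. A minimal NumPy sketch follows; the feature length 512 and the neuron counts D_1 = 256, D_2 = 128 are illustrative assumptions, not values fixed by the claims, and the random weights merely stand in for trained parameters.

```python
import numpy as np

def relu(x):
    """phi_relu in the claim's notation."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
S = rng.standard_normal(512)                    # pooled fusion feature S
D1, D2 = 256, 128                               # assumed layer widths
W1, b1 = rng.standard_normal((D1, 512)) * 0.01, np.zeros(D1)
W2, b2 = rng.standard_normal((D2, D1)) * 0.01, np.zeros(D2)

Q1 = relu(W1 @ S + b1)                          # Q1 = phi_relu(W1.S + theta1)
Q2 = relu(W2 @ Q1 + b2)                         # Q2 = phi_relu(W2.Q1 + theta2)
```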
5. The method according to claim 1, wherein the features in step 4) are connected as follows: the features extracted from the dynamic images are concatenated into a single vector Q, which is input to fully connected layer 3 with the softmax activation function; the output Q_3 of fully connected layer 3 is computed as follows:

Q_3 = φ_softmax(W_3 · Q + θ_3),

where φ_softmax is the softmax activation function, W_3 is the weight of fully connected layer 3, and θ_3 is the bias vector of fully connected layer 3.
6. The depth video behavior recognition method according to claim 1, wherein step 5) constructs the four-stream human behavior recognition network as follows: the input of the network is the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3; the loss function L of the network is

L = −(1/G) · Σ_{g=1}^{G} Σ_{k=1}^{K} l_g(k) · log(Q_3^g(k)),

where G is the total number of training samples, K is the number of behavior sample classes, Q_3^g is the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, defined as:

l_g(k) = 1 if k equals the tag value of the g-th sample, and l_g(k) = 0 otherwise.
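The claim-6 loss is the standard mean cross-entropy between the softmax output Q_3 and one-hot labels l_g. A small NumPy sketch, with illustrative logits and the assumption that labels are stored as integer class indices:

```python
import numpy as np

def softmax(z):
    """Numerically stabilized row-wise softmax (phi_softmax)."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Q3, labels, K):
    """Mean cross-entropy over G samples, as in the claim-6 loss L.
    `labels` holds the integer tag value of each sample, expanded
    here into the one-hot vectors l_g."""
    G = Q3.shape[0]
    onehot = np.eye(K)[labels]                  # l_g as one-hot rows
    return -np.sum(onehot * np.log(Q3 + 1e-12)) / G

logits = np.array([[2.0, 0.5, 0.1],             # illustrative Q before softmax
                   [0.2, 1.5, 0.3]])
Q3 = softmax(logits)
loss = cross_entropy(Q3, np.array([0, 1]), K=3)
```

Mismatched labels yield a strictly larger loss, which is what drives the network toward the tagged classes during training.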
7. The depth video behavior recognition method according to claim 1, wherein behavior recognition in step 7) is performed as follows: compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample to be tested, and input them into the trained four-stream human behavior recognition network to obtain the predicted probability of each behavior class for the current test behavior video sample; the behavior class with the maximum probability is the behavior class finally predicted for the current test behavior video sample.
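The claim-7 decision rule reduces the network's class-probability output to a single label by argmax. A trivial sketch; the class names and probability values are purely illustrative:

```python
import numpy as np

class_names = ["wave", "sit", "walk"]       # hypothetical behavior classes
probs = np.array([0.12, 0.70, 0.18])        # example softmax output Q3
pred = class_names[int(np.argmax(probs))]   # class with maximum probability
```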
8. The depth video behavior recognition method according to claim 2, wherein the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa x_i^F, ordinate y_i^F and pixel value p_i^F of the projection of that point onto F_t, which can be formulated as:

x_i^F = x_i, y_i^F = y_i, p_i^F = f_1(d_i),

where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that a point with a smaller depth value has a larger pixel value on the projection image, i.e., a point closer to the depth camera appears brighter on the front projection image;
the projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image; during right-side projection of the depth image, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, only the point closest to the observer, i.e., farthest from the projection plane, can be seen; the abscissa value of the point farthest from the projection plane on the depth image is therefore kept, and the pixel value at that position of the projection image is computed from that abscissa value; the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine, respectively, the ordinate y_i^R, abscissa x_i^R and pixel value p_i^R of the point in the projection image R_t, formulated as:

y_i^R = y_i, x_i^R = d_i, p_i^R = f_2(x_i),

where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]; as x keeps increasing, if a new point is projected to the same position of the projection image as a previously projected point, the newest point is kept, i.e., the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa:

p_i^R = f_2(x_m), where x_m = max x_i, x_i ∈ X_R,

and X_R is the set of abscissas of all points in the depth image whose ordinate is y_i^R and whose depth value is x_i^R; max x_i, x_i ∈ X_R denotes the maximum abscissa in the set X_R;
the projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image; when multiple points are projected to the same position of the left-side projection image, the point farthest from the projection plane is kept; the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine, respectively, the ordinate y_i^L, abscissa x_i^L and pixel value p_i^L of the point in the projection image L_t; for points projected onto the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is selected to compute the pixel value of the projection image at that coordinate, formulated as:

y_i^L = y_i, x_i^L = d_i, p_i^L = f_3(x_n),

where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min x_i, x_i ∈ X_L, and X_L is the set of abscissas of all points in the depth image whose ordinate is y_i^L and whose depth value is x_i^L; min x_i, x_i ∈ X_L denotes the minimum abscissa in the set X_L;
the projection sequence V_top is obtained as follows:
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) is the projection image obtained by top projection of the t-th frame depth image; when multiple points are projected to the same position of the top projection image, the point farthest from the projection plane is kept; the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa x_i^T, ordinate y_i^T and pixel value p_i^T of the projection of that point onto T_t; for points projected onto the same coordinate of the projection image, the ordinate of the point with the largest ordinate is selected to compute the pixel value of the projection image at that coordinate, formulated as:

x_i^T = x_i, y_i^T = d_i, p_i^T = f_4(y_m), where y_m = max y_i, y_i ∈ X_T,

where f_4 is a linear function mapping the ordinate value y_m to the interval [0, 255], and X_T is the set of ordinates of all points in the depth image whose abscissa is x_i^T and whose depth value is y_i^T.
9. The depth video behavior recognition method according to claim 4, wherein convolution unit 1 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 64 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the output of convolution unit 1 is C_1;
convolution unit 2 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 128 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 2 is C_1 and its output is C_2;
convolution unit 3 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 256 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 3 is C_2 and its output is C_3;
convolution unit 4 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 512 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 4 is C_3 and its output is C_4;
convolution unit 5 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 512 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 5 is C_4 and its output is C_5;
the inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5;
the output C_1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4×4; convolutional layer 1 has 512 convolution kernels of size 1×1, and its output is M_1;
the output C_2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2×2; convolutional layer 2 has 512 convolution kernels of size 1×1, and its output is M_2;
the output C_3 of convolution unit 3 is input to convolutional layer 3 of the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1×1, and its output is M_3;
the output C_4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 of the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1×1, and its output is M_4;
the output C_5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 of the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1×1, and its output is M_5;
the output M_1 of convolutional layer 1, the output M_2 of convolutional layer 2, the output M_3 of convolutional layer 3, the output M_4 of convolutional layer 4 and the output M_5 of convolutional layer 5 are concatenated along the channel dimension and fed into convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1×1, and its output is M_6;
the output of the multi-feature fusion unit is the output M_6 of convolutional layer 6.
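Channel concatenation in the fusion unit of claim 9 requires all five branches to share one spatial size; the pooling and upsampling choices align every branch to the resolution of C_3. A small arithmetic sketch, assuming a 224×224 input (VGG-style) and ×2 / ×4 upsampling factors for the two upsampling layers, neither of which is fixed by the claims:

```python
def out_size(s, pools):
    """Spatial side length after `pools` successive 2x2 max pools,
    i.e. after the first `pools` convolution units of claim 9."""
    for _ in range(pools):
        s //= 2
    return s

inp = 224                                        # assumed input side length
C = {i: out_size(inp, i) for i in range(1, 6)}   # C1..C5 spatial sizes

M1 = C[1] // 4        # max pooling layer 1 (4x4) on C1
M2 = C[2] // 2        # max pooling layer 2 (2x2) on C2
M3 = C[3]             # C3 passes through at its native resolution
M4 = C[4] * 2         # upsampling layer 1, assumed factor 2, on C4
M5 = C[5] * 4         # upsampling layer 2, assumed factor 4, on C5
```

With these factors every branch lands at 28×28, so the five 512-channel maps can be concatenated into a 2560-channel tensor before the 1×1 convolutional layer 6.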
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110967362.8A CN113591797B (en) | 2021-08-23 | 2021-08-23 | Depth video behavior recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113591797A true CN113591797A (en) | 2021-11-02 |
CN113591797B CN113591797B (en) | 2023-07-28 |
Family
ID=78238846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110967362.8A Active CN113591797B (en) | 2021-08-23 | 2021-08-23 | Depth video behavior recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113591797B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023024658A1 (en) * | 2021-08-23 | 2023-03-02 | 苏州大学 | Deep video linkage feature-based behavior recognition method |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740833A (en) * | 2016-02-03 | 2016-07-06 | 北京工业大学 | Human body behavior identification method based on depth sequence |
CN107066979A (en) * | 2017-04-18 | 2017-08-18 | 重庆邮电大学 | A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks |
CN108280421A (en) * | 2018-01-22 | 2018-07-13 | 湘潭大学 | Human bodys' response method based on multiple features Depth Motion figure |
CN108537196A (en) * | 2018-04-17 | 2018-09-14 | 中国民航大学 | Human bodys' response method based on the time-space distribution graph that motion history point cloud generates |
CN108805093A (en) * | 2018-06-19 | 2018-11-13 | 华南理工大学 | Escalator passenger based on deep learning falls down detection algorithm |
CN109460734A (en) * | 2018-11-08 | 2019-03-12 | 山东大学 | The video behavior recognition methods and system shown based on level dynamic depth projection difference image table |
CN110084211A (en) * | 2019-04-30 | 2019-08-02 | 苏州大学 | A kind of action identification method |
CN113221694A (en) * | 2021-04-29 | 2021-08-06 | 苏州大学 | Action recognition method |
Non-Patent Citations (2)
Title |
---|
XIAOFENG ZHAO ET AL.: "Discriminative Pose Analysis for Human Action Recognition", 《2020 IEEE 6TH WORLD FORUM ON INTERNET OF THINGS (WF-IOT)》, pages 1 - 6 * |
LIU Tingting: "Human Behavior Recognition Based on Depth Data", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN113591797B (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259850B (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN107341452B (en) | Human behavior identification method based on quaternion space-time convolution neural network | |
CN108520535B (en) | Object classification method based on depth recovery information | |
CN109543602B (en) | Pedestrian re-identification method based on multi-view image feature decomposition | |
CN112446476A (en) | Neural network model compression method, device, storage medium and chip | |
WO2019227479A1 (en) | Method and apparatus for generating face rotation image | |
CN113610046B (en) | Behavior recognition method based on depth video linkage characteristics | |
CN109740539B (en) | 3D object identification method based on ultralimit learning machine and fusion convolution network | |
US20240046700A1 (en) | Action recognition method | |
CN111783748A (en) | Face recognition method and device, electronic equipment and storage medium | |
CN108596256B (en) | Object recognition classifier construction method based on RGB-D | |
CN111476806A (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN113011253B (en) | Facial expression recognition method, device, equipment and storage medium based on ResNeXt network | |
CN113128424A (en) | Attention mechanism-based graph convolution neural network action identification method | |
CN112580458A (en) | Facial expression recognition method, device, equipment and storage medium | |
CN111488951B (en) | Method for generating countermeasure metric learning model for RGB-D image classification | |
CN109886281A (en) | One kind is transfinited learning machine color image recognition method based on quaternary number | |
Wang et al. | Bikers are like tobacco shops, formal dressers are like suits: Recognizing urban tribes with caffe | |
CN112800979B (en) | Dynamic expression recognition method and system based on characterization flow embedded network | |
CN114612709A (en) | Multi-scale target detection method guided by image pyramid characteristics | |
CN114882537A (en) | Finger new visual angle image generation method based on nerve radiation field | |
CN113591797A (en) | Deep video behavior identification method | |
CN109886160A (en) | It is a kind of it is non-limiting under the conditions of face identification method | |
CN117037244A (en) | Face security detection method, device, computer equipment and storage medium | |
CN112560824B (en) | Facial expression recognition method based on multi-feature adaptive fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||