CN113591797A - Deep video behavior identification method - Google Patents
- Publication number
- CN113591797A (application CN202110967362.8A / CN202110967362A)
- Authority
- CN
- China
- Prior art keywords
- projection
- depth
- layer
- convolution
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a depth video behavior recognition method comprising the following steps: project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences; compute the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample; input each dynamic image of each behavior sample into a feature extraction module to extract features; concatenate the features extracted from the dynamic images of each behavior sample and input the result into a fully connected layer; construct a four-stream human behavior recognition network; compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence; compute the dynamic images of each behavior sample to be tested and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a depth video behavior recognition method.
Background
At present, human behavior recognition is an important topic in the field of computer vision. It is widely applied in video surveillance, human-computer interaction and other fields.
Traditional methods focus on hand-crafted feature extraction of spatio-temporal information in depth video, followed by classification with classifiers such as support vector machines. However, these methods extract only shallow features, and the experimental results are not ideal. With the development of computing hardware, more and more researchers apply deep neural networks to human behavior recognition. Convolutional neural networks have a strong ability to learn from images and videos, so using them to analyze depth videos and recognize human behavior is a natural choice. Some researchers propose three-dimensional convolutional neural networks to extract deep spatio-temporal features from depth behavior videos, but feeding the depth video directly into the network cannot make good use of its three-dimensional information. Moreover, compared with two-dimensional convolutional neural networks, three-dimensional ones have more parameters and need more training data to converge, so they generally perform poorly on smaller datasets.
Therefore, to address these problems of existing behavior recognition algorithms, a depth video behavior recognition method is provided.
Disclosure of Invention
The invention is provided to solve the problems in the prior art, and aims to provide a depth video behavior recognition method that addresses the inability of the features extracted by existing recognition methods to make full use of the three-dimensional information in depth behavior videos.
A depth video behavior recognition method comprises the following steps:
1) project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences;
2) compute the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample;
3) input each dynamic image of each behavior sample into a feature extraction module to extract features;
4) concatenate the features extracted from the dynamic images of each behavior sample and input the result into a fully connected layer;
5) construct a four-stream human behavior recognition network;
6) compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence;
7) compute the dynamic images of the behavior sample to be tested and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
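The seven steps above can be orchestrated as a short end-to-end sketch. All function bodies below are toy stand-ins (hypothetical names, not the patent's actual projections, dynamic images, or network) so that the control flow runs:

```python
import numpy as np

def project(video, plane):
    # stand-in: a real implementation would project onto the named plane
    return video

def dynamic_image(seq):
    # stand-in: a real implementation would use RankSVM-based rank pooling
    return seq.mean(axis=0)

def four_stream_net(dynamic_images, num_classes=3):
    # stand-in: a real implementation is the trained four-stream network
    return np.full(num_classes, 1.0 / num_classes)

def recognize(depth_video):
    views = [project(depth_video, p) for p in ("front", "right", "left", "top")]
    dis = [dynamic_image(v) for v in views]   # 4 dynamic images per sample
    probs = four_stream_net(dis)
    return int(np.argmax(probs)), probs       # predicted class = max probability
```

Only the data flow (one projection sequence per plane, one dynamic image per sequence, four streams into one classifier) reflects the method itself.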
Preferably, the projection sequences in step 1) are obtained as follows:
Each behavior sample consists of all frames of its depth video. The depth video of any behavior sample is
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of V, R and C being its numbers of rows and columns, and ℝ denoting that the matrix is real; I_t(x_i, y_i) = d_i is the depth of the point p_i at coordinate (x_i, y_i) of the t-th frame depth image, i.e. the distance of p_i from the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of a behavior sample can then be represented as a set of projection sequences:
V = {V_front, V_right, V_left, V_top},
where V_front, V_right, V_left and V_top are the projection sequences obtained by projecting the depth video V of the behavior sample onto the front, right-side, left-side and top planes, respectively.
Preferably, the dynamic image in step 2) is computed as follows:
Take the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example. First vectorize F_t, i.e. concatenate the rows of F_t into a new row vector i_t.
Take the arithmetic square root of each element of i_t to obtain a new vector w_t:
w_t = sqrt(i_t),
where the square root is applied element-wise; w_t is the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the feature vector v_t of the t-th frame image of V_front as
v_t = Σ_{j=1}^{t} w_j,
i.e. the sum of the frame vectors of frames 1 through t of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the score B_t of the t-th frame image F_t of V_front as
B_t = u^T · v_t,
where u is a vector of dimension a, a = R × C; u^T denotes the transpose of u; and u^T · v_t is the dot product of the transposed vector u and the feature vector v_t.
Compute u such that, with the frame images of V_front ordered from front to back, the scores increase, i.e. the larger t is, the higher the score B_t. u can be computed with RankSVM as follows:
u* = argmin_u E(u),   E(u) = λ · ||u||² + Σ_{c>j} max{0, 1 − B_c + B_j},
where u* denotes the u that minimizes E(u); λ is a constant; ||u||² is the sum of the squares of the elements of u; B_c and B_j are the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample; and max{0, 1 − B_c + B_j} selects the larger of 0 and 1 − B_c + B_j.
After computing the vector u with RankSVM, reshape u into an image of the same size as F_t to obtain u′ ∈ ℝ^(R×C); u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
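The dynamic-image computation above can be sketched in NumPy. Plain subgradient descent on the hinge ranking objective is used here as a simple stand-in for a full RankSVM solver, so this is an illustrative approximation rather than the patent's exact procedure:

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-4, iters=100, seed=0):
    """Compute a dynamic image u' from a projection sequence.

    frames: array of shape (N, R, C) with non-negative pixel values.
    Subgradient descent on the hinge ranking objective stands in for a
    proper RankSVM solver.
    """
    N, R, C = frames.shape
    i = frames.reshape(N, -1).astype(float)   # i_t: each frame flattened to a row vector
    w = np.sqrt(i)                            # w_t: element-wise arithmetic square root
    v = np.cumsum(w, axis=0)                  # v_t: sum of frame vectors 1..t
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(R * C) * 1e-3
    for _ in range(iters):
        B = v @ u                             # scores B_t = u^T . v_t
        grad = 2.0 * lam * u                  # gradient of lam * ||u||^2
        for c in range(1, N):
            for j in range(c):                # enforce B_c > B_j for c > j
                if 1.0 - B[c] + B[j] > 0.0:
                    grad += v[j] - v[c]       # subgradient of max{0, 1 - B_c + B_j}
        u -= lr * grad
    return u.reshape(R, C)                    # u': reshaped to the frame size
```

The same function applies unchanged to the right-side, left-side and top projection sequences.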
Preferably, the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2. The outputs of convolution units 1 through 5 are fed into the multi-feature fusion unit; the output M_6 of the multi-feature fusion unit is fed into the average pooling layer; and the output S of the average pooling layer is fed into fully connected layer 1, which has D_1 neurons. The output Q_1 of fully connected layer 1 is computed as
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1, and θ_1 is the bias vector of fully connected layer 1.
The output Q_1 of fully connected layer 1 is fed into fully connected layer 2, which has D_2 neurons. The output Q_2 of fully connected layer 2 is computed as
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right-side, left-side and top projection sequences of the depth video V of the behavior sample are each input into the feature extraction module to extract the features Q_2^front, Q_2^right, Q_2^left and Q_2^top.
Preferably, the features in step 4) are concatenated as follows: the features Q_2 extracted from the four dynamic images are concatenated into one vector and input into fully connected layer 3, whose activation function is softmax. The output Q_3 of fully connected layer 3 is computed as
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of fully connected layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the extracted features concatenated into one vector, and θ_3 is the bias vector of fully connected layer 3.
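The fully connected layers and the four-stream concatenation can be illustrated with a tiny NumPy example. The dimensions below (feature length 8, D1 = 6, D2 = 4, K = 3 classes) are arbitrary toy values, not the patent's:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())               # shift by max for numerical stability
    return e / e.sum()

def fc(x, W, theta, act):
    """One fully connected layer: act(W . x + theta)."""
    return act(W @ x + theta)

rng = np.random.default_rng(0)
S = rng.standard_normal(8)                                    # pooled feature S
Q1 = fc(S, rng.standard_normal((6, 8)), np.zeros(6), relu)    # fully connected layer 1
Q2 = fc(Q1, rng.standard_normal((4, 6)), np.zeros(4), relu)   # fully connected layer 2
# the four per-stream features are concatenated before fully connected layer 3;
# here one Q2 stands in for all four streams
concat = np.concatenate([Q2, Q2, Q2, Q2])
Q3 = fc(concat, rng.standard_normal((3, 16)), np.zeros(3), softmax)
```

Because of the softmax, Q3 is a probability vector over the K behavior classes.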
Preferably, the four-stream human behavior recognition network in step 5) is constructed as follows: the inputs of the network are the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3. The loss function L of the network is
L = − Σ_{g=1}^{G} l_g · log(ŷ_g),
where G is the total number of training samples, K is the number of behavior classes, ŷ_g is the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, defined as the one-hot vector whose k-th component is 1 if the g-th sample belongs to class k and 0 otherwise, i.e. l_g encodes the label of the g-th sample.
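The cross-entropy loss over one-hot labels can be computed as below. The averaging over G is an assumption; the patent's exact normalization is not recoverable from the garbled formula:

```python
import numpy as np

def cross_entropy_loss(y_hat, labels):
    """Cross-entropy over G samples.

    y_hat: array (G, K) of softmax outputs; labels: array (G,) of class indices.
    """
    G, K = y_hat.shape
    l = np.zeros((G, K))
    l[np.arange(G), labels] = 1.0             # l_g: one-hot expected output
    return -(l * np.log(y_hat + 1e-12)).sum() / G
```

For a perfectly confident correct prediction the loss approaches 0; for a uniform prediction over two classes it equals ln 2.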
Preferably, the behavior recognition in step 7) is performed as follows: compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample to be tested and input them into the trained four-stream behavior recognition network to obtain the predicted probability of each behavior class; the behavior class with the maximum probability is the class finally predicted for the current test behavior video sample.
Preferably, the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. A point p_i in the depth image with abscissa x_i, ordinate y_i and depth value d_i is projected onto F_t at abscissa x_i and ordinate y_i with pixel value f_1(d_i), which can be formulated as
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection image, i.e. points closer to the depth camera appear brighter on the front projection image.
The projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, what is seen is the point closest to the observer, i.e. the point farthest from the projection plane. The abscissa value of the point farthest from the projection plane is therefore retained and used to compute the pixel value at that position of the projection image. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto R_t at abscissa d_i and ordinate y_i with pixel value f_2(x_i), formulated as
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]. As x increases, if a new point is projected to the same position as a previously projected point, the newest point is kept; that is, the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa: R_t(d_i, y_i) = f_2(x_m), where x_m = max{x_i : x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
The projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image. When several points are projected to the same position of the left-side projection image, the point farthest from the projection plane is kept. The points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto L_t at abscissa d_i and ordinate y_i; for points projected to the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min{x_i : x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
The projection sequence V_top is obtained as follows:
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) is the projection image obtained by projecting the t-th frame depth image from the top. When several points are projected to the same position of the top projection image, the point farthest from the projection plane is kept. The points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto T_t at abscissa x_i and ordinate d_i; for points projected to the same coordinate of the top projection image, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, formulated as
T_t(x_i, d_i) = f_4(y_q),
where f_4 is a linear function mapping the ordinate value y_q to the interval [0, 255], y_q = max{y_i : y_i ∈ Y_T}, and Y_T is the set of ordinates of all points in the depth image with abscissa x_i and depth value d_i.
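The four projections of a single depth frame can be sketched as follows. The exact linear maps f1 through f4 are not fully specified in the text, so the concrete forms below (e.g. nearer points brighter in the front view) are illustrative assumptions:

```python
import numpy as np

def project_views(I, D):
    """Project one depth frame I (shape (R, C), integer depths in [0, D])
    onto the front, right-side, left-side and top planes.

    The linear maps f1..f4 used here are assumptions, chosen only to
    satisfy the stated brightness conventions.
    """
    R, C = I.shape
    front = 255.0 * (D - I) / D                      # f1: smaller depth -> brighter
    right = np.zeros((R, D + 1))
    left = np.zeros((R, D + 1))
    top = np.zeros((D + 1, C))
    for y in range(R):
        for x in range(C):
            d = int(I[y, x])
            # right view: keep the point with the largest x at (y, d)
            right[y, d] = max(right[y, d], 255.0 * x / max(C - 1, 1))
            # left view: keep the smallest x (its pixel value is largest here)
            left[y, d] = max(left[y, d], 255.0 * (C - 1 - x) / max(C - 1, 1))
            # top view: keep the point with the largest y at (d, x)
            top[d, x] = max(top[d, x], 255.0 * y / max(R - 1, 1))
    return {"front": front, "right": right, "left": left, "top": top}
```

The explicit loops mirror the column-by-column and row-by-row traversals described above; a production implementation would vectorize them.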
Preferably, convolution unit 1 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 64 convolution kernels of size 3 × 3, the pooling kernel of the max pooling layer has size 2 × 2, and the output of convolution unit 1 is C_1.
Convolution unit 2 comprises 2 convolutional layers, each with 128 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 2 is C_1 and its output is C_2.
Convolution unit 3 comprises 3 convolutional layers, each with 256 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 3 is C_2 and its output is C_3.
Convolution unit 4 comprises 3 convolutional layers, each with 512 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 4 is C_3 and its output is C_4.
Convolution unit 5 comprises 3 convolutional layers, each with 512 convolution kernels of size 3 × 3, and 1 max pooling layer with a 2 × 2 pooling kernel; the input of convolution unit 5 is C_4 and its output is C_5.
The inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5.
The output C_1 of convolution unit 1 is fed into max pooling layer 1 and convolutional layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M_1.
The output C_2 of convolution unit 2 is fed into max pooling layer 2 and convolutional layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M_2.
The output C_3 of convolution unit 3 is fed into convolutional layer 3 of the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M_3.
The output C_4 of convolution unit 4 is fed into upsampling layer 1 and convolutional layer 4 of the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M_4.
The output C_5 of convolution unit 5 is fed into upsampling layer 2 and convolutional layer 5 of the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M_5.
The outputs M_1, M_2, M_3, M_4 and M_5 of convolutional layers 1 through 5 are concatenated along the channel dimension and fed into convolutional layer 6, which has 512 convolution kernels of size 1 × 1; the output of convolutional layer 6 is M_6.
The output of the multi-feature fusion unit is M_6, the output of convolutional layer 6.
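The channel and spatial bookkeeping of the five convolution units can be sketched as below. It assumes the 3 × 3 convolutions are zero-padded so that only each unit's 2 × 2 max pooling halves the height and width; the text does not state the padding, so this is an assumption:

```python
def feature_shapes(h, w):
    """Return (channels, height, width) after convolution units 1-5,
    assuming 'same'-padded 3x3 convolutions and one 2x2 max pooling
    (spatial halving) per unit."""
    channels = [64, 128, 256, 512, 512]   # kernels per layer in units 1-5
    shapes = []
    for ch in channels:
        h, w = h // 2, w // 2             # each unit's 2x2 max pooling
        shapes.append((ch, h, w))
    return shapes
```

For a 224 × 224 dynamic image this yields the familiar VGG-style progression, ending at 512 channels of size 7 × 7 before the multi-feature fusion unit.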
The invention has the following beneficial effects: 1) behavior recognition based on depth video cannot capture information such as a person's appearance, which protects personal privacy; at the same time, depth video is not easily affected by illumination and provides richer three-dimensional information about the behavior;
2) projecting the depth video onto different planes yields information about different dimensions of the behavior, and combining this information makes human behavior easier to recognize; when training the network, only 4 dynamic images are used as a compact representation of each video, so the demands on computer hardware are modest.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow diagram of the feature extraction module.
Fig. 3 is a flow chart of a four-stream human behavior recognition network.
FIG. 4 is a schematic plane projection diagram of a hand waving behavior in the embodiment.
Fig. 5 is a front projection dynamic image of the waving behavior in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention, referring to fig. 1 to 5, is a depth video behavior recognition method, including the following steps:
1) project the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain 4 projection sequences;
2) compute the dynamic images of the 4 projection sequences of each behavior sample to obtain 4 dynamic images per sample;
3) input the 4 dynamic images separately into the feature extraction module to extract features;
4) concatenate the features extracted from the 4 dynamic images and input the result into a fully connected layer;
5) construct the four-stream human behavior recognition network;
6) compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence;
7) compute the 4 dynamic images of each test behavior sample and input them into the trained four-stream human behavior recognition network to perform behavior recognition.
The projection sequences in step 1) are obtained as follows:
Each behavior sample consists of all frames of its depth video. For the depth video V of any behavior sample:
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of V, R and C being its numbers of rows and columns, and ℝ denoting that the matrix is real; I_t(x_i, y_i) = d_i is the depth of the point p_i at coordinate (x_i, y_i) of the t-th frame depth image, i.e. the distance of p_i from the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of the behavior sample is projected onto four planes: the front, right side, left side and top. The depth video V of the behavior sample can then be expressed as a set of four projection image sequences:
V = {V_front, V_right, V_left, V_top},
where V_front, V_right, V_left and V_top are the projection sequences obtained by projecting the depth video V of the behavior sample onto the front, right-side, left-side and top planes, respectively.
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. A point p_i in the depth image with abscissa x_i, ordinate y_i and depth value d_i is projected onto F_t at abscissa x_i and ordinate y_i with pixel value f_1(d_i), which can be formulated as
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection image, i.e. points closer to the depth camera appear brighter on the front projection image.
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, what is seen is the point closest to the observer, i.e. the point farthest from the projection plane. The abscissa value of the point farthest from the projection plane should therefore be retained and used to compute the pixel value at that position of the projection image. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto R_t at abscissa d_i and ordinate y_i with pixel value f_2(x_i), formulated as
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]. As x increases, if a new point is projected to the same position as a previously projected point, the newest point is kept; that is, the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa: R_t(d_i, y_i) = f_2(x_m), where x_m = max{x_i : x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image. As with the right-side projection image, when several points are projected to the same position of the left-side projection image, the point farthest from the projection plane should be kept. To this end, the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image. A point p_i with abscissa x_i, ordinate y_i and depth value d_i is projected onto L_t at abscissa d_i and ordinate y_i; for points projected to the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min{x_i : x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image with ordinate y_i and depth value d_i.
Vtop = {Tt | t ∈ [1, N]}, where Tt ∈ D×C denotes the projection map obtained by projecting the t-th frame depth image from the top surface. When several points are projected to the same position of the top projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and moving in the direction of increasing y, and projected onto the top projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the pixel value and the ordinate of the corresponding point in the projection map Tt. For points projected onto the same coordinates of the top projection map, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, expressed as:
where f4 is a linear function mapping the ordinate value yq to the interval [0, 255], yq = max{yi | yi ∈ YO}, YO is the set of ordinates of all points in the depth image having the given abscissa and depth value, and max{yi | yi ∈ YO} denotes the maximum ordinate in the set YO.
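To make the four projections above concrete, the sketch below projects a single depth frame onto the front, right, left and top planes with NumPy. The scaling functions f1–f4 are simplified here to fixed linear maps, and zero depth is treated as background; both are assumptions of this sketch, not the patent's exact formulas.

```python
import numpy as np

def project_views(depth, d_max=255):
    """Project one R x C depth frame onto front, right, left and top planes.

    depth[y, x] = d is the distance of point (x, y) from the camera;
    d == 0 is treated as background (an assumption of this sketch).
    """
    rows, cols = depth.shape
    # Front view (f1): nearer points get larger pixel values.
    front = np.where(depth > 0, d_max - depth, 0)

    right = np.zeros((rows, d_max + 1))  # y-d plane, keep largest x
    left = np.zeros((rows, d_max + 1))   # y-d plane, keep smallest x
    top = np.zeros((d_max + 1, cols))    # d-x plane, keep largest y
    for y in range(rows):
        for x in range(cols):
            d = int(depth[y, x])
            if d == 0:
                continue
            # f2: the largest abscissa wins on the right view.
            right[y, d] = max(right[y, d], 255.0 * x / max(cols - 1, 1))
            # f3: the smallest abscissa wins on the left view (encoded so
            # that points nearer the left plane are brighter).
            left[y, d] = max(left[y, d],
                             255.0 * (cols - 1 - x) / max(cols - 1, 1))
            # f4: the largest ordinate wins on the top view.
            top[d, x] = max(top[d, x], 255.0 * y / max(rows - 1, 1))
    return front, right, left, top
```

Because the traversal uses `max` per target cell, the scan order does not matter in this sketch; the patent's directional traversal achieves the same "keep the extreme point" effect.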
Step 2) acquiring dynamic images:
Taking the front projection sequence Vfront = {Ft | t ∈ [1, N]} of the depth video V of a behavior sample as an example, the dynamic image is computed as follows:
First, Ft is vectorized, i.e. the rows of Ft are concatenated into a new row vector it;
for the row vector it, the arithmetic square root of each element is taken to obtain a new vector wt, namely:
where √(it) denotes taking the arithmetic square root of each element of the row vector it; wt is called the frame vector of the t-th frame of the front projection sequence Vfront of the depth video V of the behavior sample;
the feature vector vt of the t-th frame image of the front projection sequence Vfront of the depth video V of the behavior sample is then computed as follows:
where the sum runs over the frame vectors of the 1st through t-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample;
the score Bt of the t-th frame image Ft of the front projection sequence Vfront of the depth video V of the behavior sample is computed by the formula:
Bt=uT·vt,
where u is a vector of dimension a, a = R × C; uT denotes the transpose of the vector u; uT·vt denotes the dot product of the transposed vector u and the feature vector vt;
u is computed such that the later a frame image appears in the front projection sequence Vfront, the higher its score, i.e. the larger t is, the higher the score Bt; u can be computed with RankSVM as follows:
where u* denotes the u that minimizes E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; Bc and Bj denote the scores of the c-th and j-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample, respectively, and max{0, 1−Bc+Bj} selects the larger of 0 and 1−Bc+Bj;
after the vector u is computed with RankSVM, u is rearranged into an image of the same size as Ft, giving u′ ∈ R×C; u′ is called the dynamic image of the front projection sequence Vfront of the depth video V of the behavior sample.
The dynamic images of the right side, left side, and top projection sequences of the depth video V of the behavior sample are calculated in the same manner as the dynamic images of the front projection sequence.
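The dynamic-image computation above can be sketched end to end: frames are flattened to row vectors it, square-rooted to frame vectors wt, and averaged into feature vectors vt. Instead of solving the RankSVM problem of the text, this sketch substitutes the simple closed-form rank-pooling coefficients αt = 2t − N − 1 as a stand-in for the learned u; these coefficients, and the use of the running mean for vt, are assumptions made for brevity, not the patent's exact procedure.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate dynamic image of a projection sequence.

    frames: (N, R, C) array of projection maps F_t.
    Returns an R x C image summarizing the temporal evolution.
    """
    frames = np.asarray(frames, dtype=float)
    n, r, c = frames.shape
    w = np.sqrt(frames.reshape(n, -1))            # frame vectors w_t = sqrt(i_t)
    v = np.cumsum(w, axis=0) / np.arange(1, n + 1)[:, None]  # feature vectors v_t
    alpha = 2.0 * np.arange(1, n + 1) - n - 1     # stand-in ranking weights
    u = (alpha[:, None] * v).sum(axis=0)          # surrogate for the RankSVM u
    return u.reshape(r, c)
```

With these weights, later frames contribute positively and earlier frames negatively, so the resulting image emphasizes how the projection maps change over time, which is the role the learned u plays in the patent.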
Step 3) feature extraction by the feature extraction module:
as shown in fig. 2, the dynamic images of the front, right, left, and top projection sequences of the depth video of the behavior sample are respectively input to the feature extraction module to extract features. The feature extraction module comprises a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5, a multi-feature fusion unit, an average pooling layer, a full connection layer 1 and a full connection layer 2.
Convolution unit 1 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, and the output of convolution unit 1 is C1.
Convolution unit 2 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 2 is C1, and the output is C2.
Convolution unit 3 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 3 is C2, and the output is C3.
Convolution unit 4 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 4 is C3, and the output is C4.
Convolution unit 5 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, the pooling kernel of the max pooling layer has a size of 2 × 2, the input of convolution unit 5 is C4, and the output is C5.
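Under the common assumption that each 3 × 3 convolution uses stride 1 and padding 1 (the patent does not state the padding), only the 2 × 2 poolings change the spatial size, so the shapes of C1–C5 can be traced with simple arithmetic; the five units follow the familiar VGG-16 layout.

```python
def unit_shapes(h, w):
    """Channel count and spatial size of C1..C5 for an h x w input.

    Assumes size-preserving 3x3 convolutions and a single 2x2/stride-2
    max pooling at the end of each unit (flooring odd sizes).
    """
    shapes = []
    for channels in (64, 128, 256, 512, 512):
        h, w = h // 2, w // 2  # only the pooling changes spatial size
        shapes.append((channels, h, w))
    return shapes
```

For the 240 × 240 dynamic images of the embodiment, this gives C1 at 120 × 120 down through C5 at 7 × 7.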
The inputs of the multi-feature fusion unit are the outputs C1, C2, C3, C4 and C5 of convolution units 1–5.
The output C1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M1.
The output C2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M2.
The output C3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M3.
The output C4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M4.
The output C5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M5.
The outputs M1, M2, M3, M4 and M5 of convolutional layers 1–5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and its output is M6.
The output of the multi-feature fusion unit is the output M6 of convolutional layer 6.
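The branch operations (4 × 4 pooling on C1, 2 × 2 pooling on C2, upsampling on C4 and C5) serve to bring all five outputs to a common spatial size before concatenation, and the 1 × 1 convolutions mix channels without touching that size. The sketch below shows the channel concatenation and the 1 × 1 fusion convolution in NumPy; the 30 × 30 spatial size (C3's resolution for a 240 × 240 input) and the random weights are assumptions for illustration only.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.

    x: (c_in, h, w) feature map; w: (c_out, c_in) kernel matrix.
    """
    return np.tensordot(w, x, axes=([1], [0]))

rng = np.random.default_rng(0)
# Five branches M1..M5, already aligned to a common 30 x 30 grid.
branches = [rng.standard_normal((512, 30, 30)) for _ in range(5)]
m_cat = np.concatenate(branches, axis=0)      # channel concat: (2560, 30, 30)
w6 = rng.standard_normal((512, 2560)) * 0.01  # convolutional layer 6 kernels
m6 = conv1x1(m_cat, w6)                       # fused output M6: (512, 30, 30)
```

A 1 × 1 convolution is exactly a matrix multiply over the channel axis, which is why `tensordot` over the channel dimension reproduces it here.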
The output M6 of the multi-feature fusion unit is input to the average pooling layer, whose output is S; S is input to fully connected layer 1, which has D1 neurons, and the output Q1 of fully connected layer 1 is computed as follows:
Q1=φrelu(W1·S+θ1),
where φrelu is the ReLU activation function, W1 is the weight of fully connected layer 1, and θ1 is the bias vector of fully connected layer 1;
the output Q1 of fully connected layer 1 is input to fully connected layer 2, which has D2 neurons; the output Q2 of fully connected layer 2 is computed as follows:
Q2=φrelu(W2·Q1+θ2),
where W2 is the weight of fully connected layer 2 and θ2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module;
the dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module respectively, and the corresponding features are extracted.
Step 4) connecting the features extracted in step 3):
The features obtained by inputting the dynamic images of the four projection sequences of the depth video of each behavior sample into the feature extraction module are concatenated and input to fully connected layer 3, whose activation function is softmax; the output Q3 of fully connected layer 3 is computed as follows:
where φsoftmax denotes the softmax activation function, W3 is the weight of fully connected layer 3, the concatenation of the four extracted features forms the input vector, and θ3 is the bias vector of fully connected layer 3.
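A minimal NumPy sketch of this classification step: the four per-view features are concatenated and passed through fully connected layer 3 with softmax. The dimensions (four 1000-dimensional features and K = 8 classes, matching the embodiment) and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(features, w3, theta3):
    """Concatenate per-view features and apply fully connected layer 3."""
    q = np.concatenate(features)    # the four stream features joined end to end
    return softmax(w3 @ q + theta3)

rng = np.random.default_rng(1)
features = [rng.standard_normal(1000) for _ in range(4)]  # four streams
w3 = rng.standard_normal((8, 4000)) * 0.01                # K = 8 classes
theta3 = np.zeros(8)
q3 = classify(features, w3, theta3)                       # class probabilities
```

The output is a probability vector over the K behavior classes, which is what the four-stream network in step 5) produces.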
Step 5) constructing a four-stream human behavior recognition network:
as shown in FIG. 3, the inputs of the network are the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior samples, and the output is the probability that the corresponding behavior sample belongs to each behavior class, i.e. the output Q3 of fully connected layer 3; the loss function L of the network is:
where G is the total number of training samples and K is the number of behavior classes; the loss compares the network output of the g-th behavior sample with lg, the expected output of the g-th behavior sample, where lg is defined as:
where lg is the label value of the g-th sample.
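The loss described above is the standard cross-entropy between the network's softmax output and the one-hot expected output lg. A small sketch, assuming the loss is averaged over the G training samples (the normalization is not shown explicitly in the text):

```python
import numpy as np

def cross_entropy(outputs, labels, num_classes):
    """Mean cross-entropy loss over a batch.

    outputs: (G, K) predicted class probabilities for each sample.
    labels:  (G,) integer tag values l_g, expanded to one-hot targets.
    """
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    # Small epsilon guards against log(0).
    return -(one_hot * np.log(outputs + 1e-12)).sum() / len(labels)
```

A perfect prediction yields a loss near 0, while a uniform prediction over K classes yields log K, which is why minimizing this loss pushes the probability mass onto the correct class.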
Step 6) the dynamic images of the front, right side, left side and top projection sequences of the depth video of each training behavior sample are computed and input to the four-stream human behavior recognition network, and the network is trained until convergence.
Step 7) the dynamic images of the front, right side, left side and top projection sequences of the depth video of each test behavior sample are computed and input to the trained four-stream behavior recognition network, which yields the predicted probability of the current test behavior video sample belonging to each behavior class; the behavior class with the largest probability value is the finally predicted behavior class of the current test behavior video sample, thereby achieving behavior recognition.
Embodiment:
As shown in FIGS. 4-5:
1) The behavior sample set has 2400 samples in total, with 8 behavior classes and 300 samples per class. Two thirds of the samples in each behavior class are randomly selected for the training set and the remaining one third for the test set, giving 1600 training samples and 800 test samples.
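The per-class 2/3–1/3 split above can be sketched as follows; the random seed and rounding rule are arbitrary choices of this sketch, not specified by the source.

```python
import numpy as np

def split_per_class(labels, train_frac=2 / 3, seed=0):
    """Randomly split sample indices into train/test within each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, test = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_frac * len(idx)))
        train.extend(idx[:n_train].tolist())
        test.extend(idx[n_train:].tolist())
    return train, test
```

Splitting within each class keeps the class proportions identical in the training and test sets, which the described 2/3 per-class selection guarantees.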
Each behavior sample consists of all frames in the sample depth video. Take the depth video V of any behavior sample as an example:
V={It|t∈[1,50]},
where t denotes the time index; the behavior sample has 50 frames in total. It ∈ 240×240 is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample; the numbers of rows and columns of each frame depth image are both 240, and the representation matrix is a real matrix. It(xi, yi) = di denotes the depth of the point pi with coordinates (xi, yi) on the t-th frame depth image, i.e. the distance of the point pi from the depth camera.
And respectively projecting the depth video V of the behavior sample to four planes, namely a front surface, a right side surface, a left side surface and a top surface. At this time, the depth video V of the behavior sample can be expressed as a set of four projection graph sequences, which are formulated as follows:
V={Vfront,Vright,Vleft,Vtop},
wherein, VfrontProjection sequence obtained by front projection of a depth video V representing a behavior sample, VrightA projection sequence obtained by right side projection of a depth video V representing a behavior sample, VleftA projection sequence obtained by left side projection of a depth video V representing a behavior sample, VtopA sequence of projections of the depth video V representing the behavior sample onto the top surface.
Vfront = {Ft | t ∈ [1, 50]}, where Ft ∈ 240×240 is the projection map obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the ordinate and the pixel value of the corresponding point in the projection map Ft, which can be formulated as:
where f1 is a linear function mapping the depth value di to the interval [0, 255] such that points with smaller depth values have larger pixel values on the projection map, i.e. points closer to the depth camera appear brighter on the front projection map.
Vright = {Rt | t ∈ [1, 50]}, where Rt ∈ 240×240 denotes the projection map obtained by right side projection of the t-th frame depth image. When the depth image is projected to the right side, more than one point may be projected to the same location on the projection map. Viewing the behavior from the right side, only the point closest to the viewer, i.e. the point farthest from the projection plane, can be seen. Therefore, the abscissa of the point farthest from the projection plane on the depth image should be retained and used to compute the pixel value at that position of the projection map. To this end, the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and moving in the direction of increasing x, and projected onto the projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the pixel value, the ordinate and the abscissa of the corresponding point in the projection map Rt, formulated as:
where f2 is a linear function mapping the abscissa value xi to the interval [0, 255]. As x increases, a new point may be projected to the same position of the projection map as previously projected points; the latest point is kept, i.e. the pixel value at that position is computed from the point with the largest abscissa, using f2(xm), where xm = max{xi | xi ∈ XR}, XR is the set of abscissas of all points in the depth image having the given ordinate and depth value, and max{xi | xi ∈ XR} denotes the maximum abscissa in the set XR.
Vleft = {Lt | t ∈ [1, 50]}, where Lt ∈ 240×240 denotes the projection map obtained by projecting the t-th frame depth image onto the left side surface. As with the right side projection, when several points are projected to the same position of the left side projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and moving in the direction of decreasing x, and projected onto the left side projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the pixel value, the ordinate and the abscissa of the corresponding point in the projection map Lt. For points projected onto the same coordinates of the left side projection map, the abscissa of the point with the smallest abscissa is used to compute the pixel value at that coordinate, expressed as:
where f3 is a linear function mapping the abscissa value xn to the interval [0, 255], xn = min{xi | xi ∈ XL}, XL is the set of abscissas of all points in the depth image having the given ordinate and depth value, and min{xi | xi ∈ XL} denotes the minimum abscissa in the set XL.
Vtop = {Tt | t ∈ [1, 50]}, where Tt ∈ 240×240 denotes the projection map obtained by projecting the t-th frame depth image from the top surface. When several points are projected to the same position of the top projection map, the point farthest from the projection plane is kept. To this end, the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and moving in the direction of increasing y, and projected onto the top projection map. The abscissa xi, ordinate yi and depth value di of a point pi in the depth image determine, respectively, the abscissa, the pixel value and the ordinate of the corresponding point in the projection map Tt. For points projected onto the same coordinates of the top projection map, the ordinate of the point with the largest ordinate is used as the pixel value at that coordinate, expressed as:
where f4 is a linear function mapping the ordinate value yq to the interval [0, 255], yq = max{yi | yi ∈ YO}, YO is the set of ordinates of all points in the depth image having the given abscissa and depth value, and max{yi | yi ∈ YO} denotes the maximum ordinate in the set YO.
2) The dynamic images of the 4 projection sequences of the depth video of each behavior sample are computed, giving 4 dynamic images per behavior sample. Taking the front projection sequence Vfront = {Ft | t ∈ [1, 50]} of the depth video V of a behavior sample as an example, the dynamic image is computed as follows:
First, Ft is vectorized, i.e. the rows of Ft are concatenated into a new row vector it.
For the row vector it, the arithmetic square root of each element is taken to obtain a new vector wt, namely:
where √(it) denotes taking the arithmetic square root of each element of the row vector it; wt is called the frame vector of the t-th frame of the front projection sequence Vfront of the depth video V of the behavior sample.
The feature vector vt of the t-th frame image of the front projection sequence Vfront of the depth video V of the behavior sample is computed as follows:
where the sum runs over the frame vectors of the 1st through t-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample;
the score Bt of the t-th frame image Ft of the front projection sequence Vfront of the depth video V of the behavior sample is computed by the formula:
Bt=uT·vt,
where u is a vector of dimension 57600; uT denotes the transpose of the vector u; uT·vt denotes the dot product of the transposed vector u and the feature vector vt;
u is computed such that the later a frame image appears in the front projection sequence Vfront, the higher its score, i.e. the larger t is, the higher the score Bt; u can be computed with RankSVM as follows:
where u* denotes the u that minimizes E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; Bc and Bj denote the scores of the c-th and j-th frame images of the front projection sequence Vfront of the depth video V of the behavior sample, respectively, and max{0, 1−Bc+Bj} selects the larger of 0 and 1−Bc+Bj;
after the vector u is computed with RankSVM, u is rearranged into an image of the same size as Ft, giving u′ ∈ 240×240; u′ is called the dynamic image of the front projection sequence Vfront of the depth video V of the behavior sample. FIG. 4 shows the front projection dynamic image of the hand-waving behavior.
The motion images of the right, left, and top projection sequences of the depth video V of the behavior sample are calculated in the same manner as the motion images of the front projection sequence.
3) And respectively inputting the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior sample into a feature extraction module to extract features. The feature extraction module comprises a convolution unit 1, a convolution unit 2, a convolution unit 3, a convolution unit 4, a convolution unit 5, a multi-feature fusion unit, an average pooling layer, a full connection layer 1 and a full connection layer 2.
Convolution unit 1 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The output of convolution unit 1 is C1.
Convolution unit 2 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 2 is C1, the output is C2.
Convolution unit 3 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 3 is C2, the output is C3.
Convolution unit 4 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 4 is C3, the output is C4.
Convolution unit 5 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel has a size of 3 × 3, and the pooling kernel of the max pooling layer has a size of 2 × 2. The input of convolution unit 5 is C4, the output is C5.
The inputs of the multi-feature fusion unit are the outputs C1, C2, C3, C4 and C5 of convolution units 1–5.
The output C1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M1.
The output C2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M2.
The output C3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M3.
The output C4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M4.
The output C5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M5.
The outputs M1, M2, M3, M4 and M5 of convolutional layers 1–5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and its output is M6.
The output of the multi-feature fusion unit is the output M6 of convolutional layer 6.
The output M6 of the multi-feature fusion unit is input to the average pooling layer, whose output is S; S is input to fully connected layer 1, which has 4096 neurons, and the output Q1 of fully connected layer 1 is computed as follows:
Q1=φrelu(W1·S+θ1),
where φrelu is the ReLU activation function, W1 is the weight of fully connected layer 1, and θ1 is the bias vector of fully connected layer 1.
The output Q1 of fully connected layer 1 is input to fully connected layer 2, which has 1000 neurons; the output Q2 of fully connected layer 2 is computed as follows:
Q2=φrelu(W2·Q1+θ2),
where W2 is the weight of fully connected layer 2 and θ2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module respectively, and the corresponding features are extracted.
4) The dynamic images of the four projection sequences of the depth video of each behavior sample are input to the feature extraction module; the obtained features are concatenated and input to fully connected layer 3, whose activation function is softmax. The output Q3 of fully connected layer 3 is computed as follows:
where φsoftmax denotes the softmax activation function, W3 is the weight of fully connected layer 3, the concatenation of the four extracted features forms the input vector, and θ3 is the bias vector of fully connected layer 3.
5) A four-stream human behavior recognition network is constructed; the inputs of the network are the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior samples, and the output is the probability that the corresponding behavior sample belongs to each behavior class, i.e. the output Q3 of fully connected layer 3. The loss function L of the network is:
where the loss compares the network output of the g-th behavior sample with lg, the expected output of the g-th behavior sample, which is defined as:
where lg is the label value of the g-th sample.
6) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each training behavior sample are computed and input to the four-stream human behavior recognition network, and the network is trained until convergence.
7) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each test behavior sample are computed and input to the trained four-stream human behavior recognition network, which yields the predicted probability of the current test behavior video sample belonging to each behavior class; the behavior class with the largest probability value is the finally predicted behavior class of the current test behavior video sample, thereby achieving behavior recognition.
The ReLU activation function has the formula f(x) = max(0, x); the input of the function is x and the output is the larger of x and 0.
The softmax activation function has the formula Si = e^(zi) / Σ(j=1..n) e^(zj), where zi denotes the output of the i-th neuron of the fully connected layer, zj denotes the output of the j-th neuron of the fully connected layer, n is the number of neurons of the fully connected layer, and Si denotes the output of the i-th neuron of the fully connected layer after the softmax activation function.
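Both activation definitions above can be checked numerically; this sketch simply evaluates the formulas with NumPy.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(z):
    """S_i = exp(z_i) / sum_j exp(z_j); shifting by max(z) keeps exp stable."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

Negative inputs are zeroed by ReLU, while softmax turns arbitrary scores into a probability distribution that preserves their ordering.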
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. A depth video behavior recognition method is characterized by comprising the following steps:
1) carrying out front, right side, left side and top projection on the depth video of each behavior sample to obtain a corresponding projection sequence;
2) obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence;
3) inputting the dynamic image of each behavior sample into a feature extraction module and extracting features;
4) connecting the features extracted from the dynamic images of each behavior sample, and inputting the connected features into a full connection layer;
5) constructing a four-flow human behavior recognition network;
6) calculating dynamic images of front, right side, left side and top projection sequences of the depth video of each training behavior sample, inputting the dynamic images into a four-stream human behavior recognition network, and training the four-stream human behavior recognition network until convergence;
7) and calculating each dynamic image of the behavior sample to be tested, and inputting each calculated dynamic image into the trained four-flow human behavior recognition network to realize behavior recognition.
2. The method for identifying deep video behaviors as claimed in claim 1, wherein the projection sequence in step 1) is obtained by:
each behavior sample is composed of all frames in the depth video of the sample; for the depth video of any behavior sample,
V={It|t∈[1,N]},
where t denotes the time index, and N is the total number of frames of the depth video V of the behavior sample; It ∈ R×C is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample, where R and C are respectively the numbers of rows and columns of the matrix, and the representation matrix is a real matrix; It(xi, yi) = di denotes the depth of the point pi with coordinates (xi, yi) on the t-th frame depth image, i.e. the distance of the point pi from the depth camera, di ∈ [0, D], where D denotes the farthest distance the depth camera can detect;
the depth video V of a behavior sample may be represented as a set of projection sequences, formulated as follows:
V={Vfront,Vright,Vleft,Vtop},
wherein, VfrontProjection sequence obtained by front projection of a depth video V representing a behavior sample, VrightA projection sequence obtained by right side projection of a depth video V representing a behavior sample, VleftA projection sequence obtained by left side projection of a depth video V representing a behavior sample, VtopAnd the projection sequence obtained by top surface projection of the depth video V representing the behavior sample.
3. The method according to claim 1, wherein the dynamic image in step 2) is computed as follows:
taking the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example, first vectorize F_t, i.e., concatenate the rows of F_t into a new row vector i_t;
take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:

w_t = √(i_t),

where √(i_t) denotes taking the arithmetic square root of each element of the row vector i_t; w_t serves as the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample;
compute the feature vector v_t of the t-th frame image of the front projection sequence V_front of the depth video V of the behavior sample as follows:

v_t = (1/t) · Σ_{τ=1}^{t} w_τ,

where Σ_{τ=1}^{t} w_τ denotes summing the frame vectors of the 1st through t-th frame images of the front projection sequence V_front of the depth video V of the behavior sample;
compute the score B_t of the t-th frame image F_t of the front projection sequence V_front of the depth video V of the behavior sample according to:

B_t = u^T · v_t,

where u is a vector of dimension a, a = R × C; u^T denotes the transpose of the vector u; u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t;
compute the value of u such that, when the frame images of the front projection sequence V_front are ordered from first to last, the scores increase monotonically, i.e., the larger t is, the higher the score B_t; u can be computed with RankSVM as follows:

u* = argmin_u E(u), E(u) = (λ/2) · ||u||² + (2 / (N(N−1))) · Σ_{c>j} max{0, 1 − B_c + B_j},

where u* denotes the u that minimizes the value of E(u), λ is a constant, and ||u||² denotes the sum of the squares of the elements of the vector u; B_c and B_j are the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample, and max{0, 1 − B_c + B_j} denotes choosing the larger of 0 and 1 − B_c + B_j;
after computing the vector u with RankSVM, reshape u into an image of the same size as F_t to obtain u′ ∈ ℝ^(R×C); u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
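The rank-pooling computation described in claim 3 can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the RankSVM objective E(u) is minimized here by plain gradient descent as a stand-in for a dedicated RankSVM solver, and the learning rate, step count and λ are arbitrary choices.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-2, steps=300):
    """Compute a dynamic image from a list of N (R, C) float frames
    scaled to [0, 1], following the claim-3 recipe: vectorize, take
    element-wise square roots (w_t), average (v_t), then fit u so that
    the scores B_t = u . v_t increase with t."""
    N = len(frames)
    R, C = frames[0].shape
    # w_t: element-wise arithmetic square root of the vectorized frame
    w = np.array([np.sqrt(f.reshape(-1)) for f in frames])
    # v_t: running mean of the frame vectors w_1..w_t
    v = np.cumsum(w, axis=0) / np.arange(1, N + 1)[:, None]

    u = np.zeros(R * C)
    pairs = [(c, j) for c in range(N) for j in range(N) if c > j]
    for _ in range(steps):
        B = v @ u                        # scores B_t = u^T v_t
        grad = lam * u                   # gradient of (lam/2)*||u||^2
        for c, j in pairs:
            if 1.0 - B[c] + B[j] > 0.0:  # active hinge term of E(u)
                grad += (2.0 / (N * (N - 1))) * (v[j] - v[c])
        u -= lr * grad
    return u.reshape(R, C)               # u' in the claim's notation

# Demo on a synthetic ramp sequence (brightness grows over time).
frames = [np.full((4, 4), (t + 1) / 5.0) for t in range(5)]
di = dynamic_image(frames)
```

On this toy input the hinge gradients push every element of u upward, so the learned dynamic image is positive everywhere; real projection sequences would of course produce structured images.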
4. The depth video behavior recognition method according to claim 1, wherein the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2; first, the outputs of convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4 and convolution unit 5 are input in sequence to the multi-feature fusion unit; the output M_6 of the multi-feature fusion unit is then input to the average pooling layer, and the output S of the average pooling layer is input to fully connected layer 1; the number of neurons of fully connected layer 1 is D_1, and its output Q_1 is computed as follows:

Q_1 = φ_relu(W_1 · S + θ_1),

where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1, and θ_1 is the bias vector of fully connected layer 1;
the output Q_1 of fully connected layer 1 is input to fully connected layer 2; the number of neurons of fully connected layer 2 is D_2, and its output Q_2 is computed as follows:

Q_2 = φ_relu(W_2 · Q_1 + θ_2),

where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2; the output of fully connected layer 2 is the feature extracted by the feature extraction module.
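The two fully connected layers of claim 4 amount to two affine maps with ReLU activations. A minimal NumPy sketch follows; the feature length 512 and the neuron counts D_1 = 256, D_2 = 128 are illustrative assumptions, not values fixed by the claims, and the random weights merely stand in for trained parameters.

```python
import numpy as np

def relu(x):
    """phi_relu in the claim's notation."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
S = rng.standard_normal(512)                    # pooled fusion feature S
D1, D2 = 256, 128                               # assumed layer widths
W1, b1 = rng.standard_normal((D1, 512)) * 0.01, np.zeros(D1)
W2, b2 = rng.standard_normal((D2, D1)) * 0.01, np.zeros(D2)

Q1 = relu(W1 @ S + b1)                          # Q1 = phi_relu(W1.S + theta1)
Q2 = relu(W2 @ Q1 + b2)                         # Q2 = phi_relu(W2.Q1 + theta2)
```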
5. The method according to claim 1, wherein the features in step 4) are connected as follows: the features extracted from the dynamic images are concatenated into a single vector Q, which is input to fully connected layer 3 with the softmax activation function; the output Q_3 of fully connected layer 3 is computed as follows:

Q_3 = φ_softmax(W_3 · Q + θ_3),

where φ_softmax is the softmax activation function, W_3 is the weight of fully connected layer 3, and θ_3 is the bias vector of fully connected layer 3.
6. The depth video behavior recognition method according to claim 1, wherein step 5) constructs the four-stream human behavior recognition network as follows: the input of the network is the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3; the loss function L of the network is

L = −(1/G) · Σ_{g=1}^{G} Σ_{k=1}^{K} l_g(k) · log(Q_3^g(k)),

where G is the total number of training samples, K is the number of behavior sample classes, Q_3^g is the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, defined as:

l_g(k) = 1 if k equals the tag value of the g-th sample, and l_g(k) = 0 otherwise.
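The claim-6 loss is the standard mean cross-entropy between the softmax output Q_3 and one-hot labels l_g. A small NumPy sketch, with illustrative logits and the assumption that labels are stored as integer class indices:

```python
import numpy as np

def softmax(z):
    """Numerically stabilized row-wise softmax (phi_softmax)."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Q3, labels, K):
    """Mean cross-entropy over G samples, as in the claim-6 loss L.
    `labels` holds the integer tag value of each sample, expanded
    here into the one-hot vectors l_g."""
    G = Q3.shape[0]
    onehot = np.eye(K)[labels]                  # l_g as one-hot rows
    return -np.sum(onehot * np.log(Q3 + 1e-12)) / G

logits = np.array([[2.0, 0.5, 0.1],             # illustrative Q before softmax
                   [0.2, 1.5, 0.3]])
Q3 = softmax(logits)
loss = cross_entropy(Q3, np.array([0, 1]), K=3)
```

Mismatched labels yield a strictly larger loss, which is what drives the network toward the tagged classes during training.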
7. The depth video behavior recognition method according to claim 1, wherein behavior recognition in step 7) is performed as follows: compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample to be tested, and input them into the trained four-stream human behavior recognition network to obtain the predicted probability of each behavior class for the current test behavior video sample; the behavior class with the maximum probability is the behavior class finally predicted for the current test behavior video sample.
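The claim-7 decision rule reduces the network's class-probability output to a single label by argmax. A trivial sketch; the class names and probability values are purely illustrative:

```python
import numpy as np

class_names = ["wave", "sit", "walk"]       # hypothetical behavior classes
probs = np.array([0.12, 0.70, 0.18])        # example softmax output Q3
pred = class_names[int(np.argmax(probs))]   # class with maximum probability
```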
8. The depth video behavior recognition method according to claim 2, wherein the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) is the projection image obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa x_i^F, ordinate y_i^F and pixel value p_i^F of the projection of that point onto F_t, which can be formulated as:

x_i^F = x_i, y_i^F = y_i, p_i^F = f_1(d_i),

where f_1 is a linear function mapping the depth value d_i to the interval [0, 255] such that a point with a smaller depth value has a larger pixel value on the projection image, i.e., a point closer to the depth camera appears brighter on the front projection image;
the projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) is the projection image obtained by right-side projection of the t-th frame depth image; during right-side projection of the depth image, more than one point may be projected to the same position on the projection image; observing the behavior from the right side, only the point closest to the observer, i.e., farthest from the projection plane, can be seen; the abscissa value of the point farthest from the projection plane on the depth image is therefore kept, and the pixel value at that position of the projection image is computed from that abscissa value; the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine, respectively, the ordinate y_i^R, abscissa x_i^R and pixel value p_i^R of the point in the projection image R_t, formulated as:

y_i^R = y_i, x_i^R = d_i, p_i^R = f_2(x_i),

where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]; as x keeps increasing, if a new point is projected to the same position of the projection image as a previously projected point, the newest point is kept, i.e., the pixel value at that position of the projection image is computed from the abscissa of the point with the largest abscissa:

p_i^R = f_2(x_m), where x_m = max x_i, x_i ∈ X_R,

and X_R is the set of abscissas of all points in the depth image whose ordinate is y_i^R and whose depth value is x_i^R; max x_i, x_i ∈ X_R denotes the maximum abscissa in the set X_R;
the projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) is the projection image obtained by left-side projection of the t-th frame depth image; when multiple points are projected to the same position of the left-side projection image, the point farthest from the projection plane is kept; the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine, respectively, the ordinate y_i^L, abscissa x_i^L and pixel value p_i^L of the point in the projection image L_t; for points projected onto the same coordinate of the left-side projection image, the abscissa of the point with the smallest abscissa is selected to compute the pixel value of the projection image at that coordinate, formulated as:

y_i^L = y_i, x_i^L = d_i, p_i^L = f_3(x_n),

where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min x_i, x_i ∈ X_L, and X_L is the set of abscissas of all points in the depth image whose ordinate is y_i^L and whose depth value is x_i^L; min x_i, x_i ∈ X_L denotes the minimum abscissa in the set X_L;
the projection sequence V_top is obtained as follows:
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) is the projection image obtained by top projection of the t-th frame depth image; when multiple points are projected to the same position of the top projection image, the point farthest from the projection plane is kept; the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection image; the abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa x_i^T, ordinate y_i^T and pixel value p_i^T of the projection of that point onto T_t; for points projected onto the same coordinate of the projection image, the ordinate of the point with the largest ordinate is selected to compute the pixel value of the projection image at that coordinate, formulated as:

x_i^T = x_i, y_i^T = d_i, p_i^T = f_4(y_m), where y_m = max y_i, y_i ∈ X_T,

where f_4 is a linear function mapping the ordinate value y_m to the interval [0, 255], and X_T is the set of ordinates of all points in the depth image whose abscissa is x_i^T and whose depth value is y_i^T.
9. The depth video behavior recognition method according to claim 4, wherein convolution unit 1 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 64 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the output of convolution unit 1 is C_1;
convolution unit 2 comprises 2 convolutional layers and 1 max pooling layer; each convolutional layer has 128 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 2 is C_1 and its output is C_2;
convolution unit 3 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 256 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 3 is C_2 and its output is C_3;
convolution unit 4 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 512 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 4 is C_3 and its output is C_4;
convolution unit 5 comprises 3 convolutional layers and 1 max pooling layer; each convolutional layer has 512 convolution kernels, each of size 3×3; the pooling kernel of the max pooling layer has size 2×2; the input of convolution unit 5 is C_4 and its output is C_5;
the inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5;
the output C_1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 has size 4×4; convolutional layer 1 has 512 convolution kernels of size 1×1, and its output is M_1;
the output C_2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 has size 2×2; convolutional layer 2 has 512 convolution kernels of size 1×1, and its output is M_2;
the output C_3 of convolution unit 3 is input to convolutional layer 3 of the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1×1, and its output is M_3;
the output C_4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 of the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1×1, and its output is M_4;
the output C_5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 of the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1×1, and its output is M_5;
the output M_1 of convolutional layer 1, the output M_2 of convolutional layer 2, the output M_3 of convolutional layer 3, the output M_4 of convolutional layer 4 and the output M_5 of convolutional layer 5 are concatenated along the channel dimension and fed into convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1×1, and its output is M_6;
the output of the multi-feature fusion unit is the output M_6 of convolutional layer 6.
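Channel concatenation in the fusion unit of claim 9 requires all five branches to share one spatial size; the pooling and upsampling choices align every branch to the resolution of C_3. A small arithmetic sketch, assuming a 224×224 input (VGG-style) and ×2 / ×4 upsampling factors for the two upsampling layers, neither of which is fixed by the claims:

```python
def out_size(s, pools):
    """Spatial side length after `pools` successive 2x2 max pools,
    i.e. after the first `pools` convolution units of claim 9."""
    for _ in range(pools):
        s //= 2
    return s

inp = 224                                        # assumed input side length
C = {i: out_size(inp, i) for i in range(1, 6)}   # C1..C5 spatial sizes

M1 = C[1] // 4        # max pooling layer 1 (4x4) on C1
M2 = C[2] // 2        # max pooling layer 2 (2x2) on C2
M3 = C[3]             # C3 passes through at its native resolution
M4 = C[4] * 2         # upsampling layer 1, assumed factor 2, on C4
M5 = C[5] * 4         # upsampling layer 2, assumed factor 4, on C5
```

With these factors every branch lands at 28×28, so the five 512-channel maps can be concatenated into a 2560-channel tensor before the 1×1 convolutional layer 6.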
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110967362.8A CN113591797B (en) | 2021-08-23 | 2021-08-23 | Depth video behavior recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113591797A true CN113591797A (en) | 2021-11-02 |
CN113591797B CN113591797B (en) | 2023-07-28 |
Family
ID=78238846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110967362.8A Active CN113591797B (en) | 2021-08-23 | 2021-08-23 | Depth video behavior recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113591797B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023024658A1 (en) * | 2021-08-23 | 2023-03-02 | 苏州大学 | Deep video linkage feature-based behavior recognition method |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740833A (en) * | 2016-02-03 | 2016-07-06 | 北京工业大学 | Human body behavior identification method based on depth sequence |
CN107066979A (en) * | 2017-04-18 | 2017-08-18 | 重庆邮电大学 | A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks |
CN108280421A (en) * | 2018-01-22 | 2018-07-13 | 湘潭大学 | Human bodys' response method based on multiple features Depth Motion figure |
CN108537196A (en) * | 2018-04-17 | 2018-09-14 | 中国民航大学 | Human bodys' response method based on the time-space distribution graph that motion history point cloud generates |
CN108805093A (en) * | 2018-06-19 | 2018-11-13 | 华南理工大学 | Escalator passenger based on deep learning falls down detection algorithm |
CN109460734A (en) * | 2018-11-08 | 2019-03-12 | 山东大学 | The video behavior recognition methods and system shown based on level dynamic depth projection difference image table |
CN110084211A (en) * | 2019-04-30 | 2019-08-02 | 苏州大学 | A kind of action identification method |
CN113221694A (en) * | 2021-04-29 | 2021-08-06 | 苏州大学 | Action recognition method |
Non-Patent Citations (2)
Title |
---|
XIAOFENG ZHAO ET AL.: "Discriminative Pose Analysis for Human Action Recognition", 《2020 IEEE 6TH WORLD FORUM ON INTERNET OF THINGS (WF-IOT)》, pages 1 - 6 * |
LIU Tingting: "Human Behavior Recognition Based on Depth Data", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN113591797B (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259850B (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN107341452B (en) | Human behavior identification method based on quaternion space-time convolution neural network | |
CN108520535B (en) | Object classification method based on depth recovery information | |
CN109543602B (en) | Pedestrian re-identification method based on multi-view image feature decomposition | |
CN112446476A (en) | Neural network model compression method, device, storage medium and chip | |
WO2019227479A1 (en) | Method and apparatus for generating face rotation image | |
CN113610046B (en) | Behavior recognition method based on depth video linkage characteristics | |
CN109740539B (en) | 3D object identification method based on ultralimit learning machine and fusion convolution network | |
US20240046700A1 (en) | Action recognition method | |
CN111783748A (en) | Face recognition method and device, electronic equipment and storage medium | |
CN108596256B (en) | Object recognition classifier construction method based on RGB-D | |
CN111476806A (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN113011253B (en) | Facial expression recognition method, device, equipment and storage medium based on ResNeXt network | |
CN113128424A (en) | Attention mechanism-based graph convolution neural network action identification method | |
CN112580458A (en) | Facial expression recognition method, device, equipment and storage medium | |
CN111488951B (en) | Method for generating countermeasure metric learning model for RGB-D image classification | |
CN109886281A (en) | One kind is transfinited learning machine color image recognition method based on quaternary number | |
Wang et al. | Bikers are like tobacco shops, formal dressers are like suits: Recognizing urban tribes with caffe | |
CN112800979B (en) | Dynamic expression recognition method and system based on characterization flow embedded network | |
CN114612709A (en) | Multi-scale target detection method guided by image pyramid characteristics | |
CN114882537A (en) | Finger new visual angle image generation method based on nerve radiation field | |
CN113591797A (en) | Deep video behavior identification method | |
CN109886160A (en) | It is a kind of it is non-limiting under the conditions of face identification method | |
CN117037244A (en) | Face security detection method, device, computer equipment and storage medium | |
CN112560824B (en) | Facial expression recognition method based on multi-feature adaptive fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||