CN113591797A - Depth video behavior recognition method - Google Patents

Depth video behavior recognition method

Info

Publication number
CN113591797A
Authority
CN
China
Prior art keywords
projection
depth
layer
convolution
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110967362.8A
Other languages
Chinese (zh)
Other versions
CN113591797B (en)
Inventor
杨剑宇 (Yang Jianyu)
黄瑶 (Huang Yao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202110967362.8A
Publication of CN113591797A
Application granted
Publication of CN113591797B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a depth video behavior recognition method, which comprises the following steps: projecting the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences; computing the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample; inputting each dynamic image of each behavior sample into a feature extraction module to extract features; concatenating the features extracted from the dynamic images of each behavior sample and inputting them into a fully connected layer; constructing a four-stream human behavior recognition network; computing the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, inputting them into the four-stream human behavior recognition network, and training the network until convergence; and computing the dynamic images of each behavior sample to be tested and inputting them into the trained four-stream human behavior recognition network to realize behavior recognition.

Description

Depth video behavior recognition method
Technical Field
The invention relates to the technical field of computer vision, and in particular to a depth video behavior recognition method.
Background
Human behavior recognition is currently an important subject in the field of computer vision. It is widely applied in fields such as video surveillance and human-computer interaction.
Traditional methods focus on hand-crafted extraction of spatio-temporal features from depth video, followed by classification with classifiers such as support vector machines. However, these methods extract only shallow features, and the experimental results are not ideal. With the development of computing hardware, more and more researchers use deep neural networks for human behavior recognition. Convolutional neural networks have a strong ability to learn from images and videos, so analyzing depth videos with convolutional neural networks to recognize human behaviors is a natural choice. Some researchers propose using a three-dimensional convolutional neural network to extract deep spatio-temporal features from depth behavior videos, but directly feeding the depth behavior video into the convolutional neural network does not make good use of its three-dimensional information. Compared with a two-dimensional convolutional neural network, a three-dimensional convolutional neural network has more parameters and needs more training data to converge, so it generally performs poorly on smaller datasets.
Therefore, aiming at these problems of existing behavior recognition algorithms, a depth video behavior recognition method is provided.
Disclosure of Invention
The invention is provided to solve the problems in the prior art, and aims to provide a depth video behavior recognition method that addresses the problem that the deep features extracted by existing recognition methods cannot fully utilize the three-dimensional information in a depth behavior video.
A depth video behavior recognition method comprises the following steps:
1) projecting the depth video of each behavior sample onto the front, right-side, left-side and top planes to obtain the corresponding projection sequences;
2) computing the dynamic image of each projection sequence to obtain the dynamic images of each behavior sample;
3) inputting each dynamic image of each behavior sample into a feature extraction module to extract features;
4) concatenating the features extracted from the dynamic images of each behavior sample and inputting them into a fully connected layer;
5) constructing a four-stream human behavior recognition network;
6) computing the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, inputting them into the four-stream human behavior recognition network, and training the network until convergence;
7) computing each dynamic image of the behavior sample to be tested and inputting them into the trained four-stream human behavior recognition network to realize behavior recognition.
Preferably, the projection sequences in step 1) are obtained as follows:
Each behavior sample consists of all frames of its depth video. For the depth video of any behavior sample,
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample, R and C being the numbers of rows and columns of the matrix and ℝ denoting a real matrix; I_t(x_i, y_i) = d_i is the depth of the point p_i whose coordinates on the t-th frame depth image are (x_i, y_i), i.e. the distance from p_i to the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of a behavior sample can thus be represented as a set of projection sequences, formulated as follows:
V = {V_front, V_right, V_left, V_top},
where V_front denotes the projection sequence obtained by front projection of the depth video V of the behavior sample, V_right the projection sequence obtained by right-side projection, V_left the projection sequence obtained by left-side projection, and V_top the projection sequence obtained by top projection.
Preferably, the dynamic images in step 2) are computed as follows:
Taking the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example, first vectorize F_t, i.e. connect the rows of F_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely
w_t = √(i_t),
where √(i_t) denotes taking the arithmetic square root of each element of the row vector i_t; w_t is taken as the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the feature vector v_t of the t-th frame image of the front projection sequence V_front of the depth video V of the behavior sample as
v_t = (1/t) · Σ_{τ=1}^{t} w_τ,
where Σ_{τ=1}^{t} w_τ denotes summing the frame vectors of the 1st to the t-th frame images of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the score B_t of the t-th frame image F_t of the front projection sequence V_front of the depth video V of the behavior sample as
B_t = u^T · v_t,
where u is a vector of dimension a, with a = R × C; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
Compute the value of u such that, with the frame images of the front projection sequence V_front ordered from front to back, the scores increase, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ · ‖u‖² + Σ_{c>j} max{0, 1 − B_c + B_j},
where argmin_u E(u) denotes the u that minimizes E(u); λ is a constant; ‖u‖² denotes the sum of the squares of the elements of the vector u; B_c and B_j denote the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample, respectively; and max{0, 1 − B_c + B_j} means taking the larger of 0 and 1 − B_c + B_j.
After the vector u is computed with RankSVM, it is reshaped into an image of the same size as F_t, giving u′ ∈ ℝ^(R×C); u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
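The computation above is a rank-pooling construction. Below is a minimal NumPy sketch of it, assuming the standard dynamic-image formulation (time-averaged square-rooted frame vectors and a hinge ranking objective) and using plain sub-gradient descent instead of an off-the-shelf RankSVM solver; the function name, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-6, iters=100):
    """Compute the dynamic image u' of a projection sequence.

    frames: (N, R, C) array, one projection map per frame.
    Returns an (R, C) array summarizing the whole sequence.
    """
    N, R, C = frames.shape
    i_vecs = frames.reshape(N, -1).astype(np.float64)        # i_t: rows of F_t concatenated
    w = np.sqrt(i_vecs)                                       # w_t = sqrt(i_t), element-wise
    v = np.cumsum(w, axis=0) / np.arange(1, N + 1)[:, None]   # v_t: mean of w_1..w_t
    u = np.zeros(R * C)
    for _ in range(iters):
        scores = v @ u                                        # B_t = u^T v_t
        grad = 2.0 * lam * u                                  # gradient of lam * ||u||^2
        for c in range(1, N):
            for j in range(c):
                if 1.0 - scores[c] + scores[j] > 0.0:         # violated pair: hinge is active
                    grad += v[j] - v[c]                       # sub-gradient of the hinge term
        u -= lr * grad
    return u.reshape(R, C)                                    # u' reshaped into image form
```

The same function can be applied to the right-side, left-side and top projection sequences to obtain the other three dynamic images.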
Preferably, the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2. Convolution units 1 to 5 are connected in sequence, and the outputs of convolution units 1 to 5 are all input to the multi-feature fusion unit; the output M_6 of the multi-feature fusion unit is input to the average pooling layer, and the output S of the average pooling layer is input to fully connected layer 1. The number of neurons of fully connected layer 1 is D_1, and its output Q_1 is computed as follows:
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1 and θ_1 is the bias vector of fully connected layer 1.
The output Q_1 of fully connected layer 1 is input to fully connected layer 2, whose number of neurons is D_2; the output Q_2 of fully connected layer 2 is computed as follows:
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right-side, left-side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module separately, extracting the features Q_2^front, Q_2^right, Q_2^left and Q_2^top, respectively.
Preferably, the features in step 4) are connected as follows: the extracted features Q_2^front, Q_2^right, Q_2^left and Q_2^top are concatenated into one vector and input to fully connected layer 3, whose activation function is softmax. The output Q_3 of fully connected layer 3 is computed as follows:
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of fully connected layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the features Q_2^front, Q_2^right, Q_2^left and Q_2^top concatenated into one vector, and θ_3 is the bias vector of fully connected layer 3.
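A minimal NumPy sketch of fully connected layer 3 as described above; the function name and the assumption that the four extracted features and the parameters W_3, θ_3 are already available as arrays are illustrative.

```python
import numpy as np

def fuse_and_classify(q2_front, q2_right, q2_left, q2_top, W3, theta3):
    """Concatenate the four stream features and apply the softmax layer:
    Q3 = softmax(W3 . [Q2_front, Q2_right, Q2_left, Q2_top] + theta3)."""
    q = np.concatenate([q2_front, q2_right, q2_left, q2_top])  # joint feature vector
    z = W3 @ q + theta3                                         # pre-activation of layer 3
    z -= z.max()                                                # for numerical stability
    e = np.exp(z)
    return e / e.sum()                                          # class probabilities Q3
```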
Preferably, the four-stream human behavior recognition network in step 5) is constructed as follows: the input of the network is the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3. The loss function L of the network is
L = −Σ_{g=1}^{G} Σ_{k=1}^{K} l_g^k · log(Q_3^{g,k}),
where G is the total number of training samples and K is the number of behavior classes; Q_3^{g,k} is the k-th component of the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, with its k-th component l_g^k defined as
l_g^k = 1 if k = c_g, and l_g^k = 0 otherwise,
where c_g is the label value of the g-th sample.
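A minimal NumPy sketch of this loss, assuming the cross-entropy form reconstructed above with one-hot expected outputs; whether the sum is additionally normalized by G is not fixed by the text, so no normalization is applied here.

```python
import numpy as np

def four_stream_loss(q3, labels, num_classes):
    """q3: (G, K) array of predicted class probabilities (the Q3 outputs).
    labels: (G,) integer class labels c_g; the expected outputs l_g are one-hot."""
    one_hot = np.eye(num_classes)[labels]           # l_g with l_g^k = 1 iff k == c_g
    return -np.sum(one_hot * np.log(q3 + 1e-12))    # L = -sum_g sum_k l_g^k log Q3^{g,k}
```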
Preferably, the behavior recognition in step 7) is performed as follows: compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample to be tested and input them into the trained four-stream human behavior recognition network to obtain the predicted probability of each behavior class for the current test behavior video sample; the behavior class with the largest probability is the behavior class finally predicted for the current test behavior video sample.
Preferably, the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^(R×C) denotes the projection map obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa, ordinate and pixel value of the point projected onto the projection map F_t: the projected point has abscissa x_i and ordinate y_i, and its pixel value is
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function that maps the depth value d_i to the interval [0, 255] such that a point with a smaller depth value has a larger pixel value on the projection map, i.e. a point closer to the depth camera is brighter on the front projection map.
The projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) denotes the projection map obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection map; observing the behavior from the right side, what can be seen is the point closest to the observer, i.e. the point farthest from the projection plane. The abscissa of the point farthest from the projection plane on the depth image is therefore kept, and the pixel value at that position of the projection map is computed from this abscissa. To this end, the points in the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map R_t: the projected point has abscissa d_i and ordinate y_i, and its pixel value is
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function that maps the abscissa x_i to the interval [0, 255]. As x keeps increasing, if a new point is projected to the same position of the projection map as a previously projected point, the latest point is kept, i.e. the pixel value at that position of the projection map is computed from the abscissa of the point with the largest abscissa:
R_t(d_i, y_i) = f_2(x_m),
where x_m = max{x_i | x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; max{x_i | x_i ∈ X_R} denotes the maximum abscissa in the set X_R.
The projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) denotes the projection map obtained by left-side projection of the t-th frame depth image. When several points are projected to the same position of the left-side projection map, the point farthest from the projection plane is kept. The points in the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map L_t: the projected point has abscissa d_i and ordinate y_i. For points projected to the same coordinates (d_i, y_i) of the left-side projection map, the abscissa of the point with the smallest abscissa is selected to compute the pixel value of the projection map at those coordinates, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function that maps the abscissa x_n to the interval [0, 255], x_n = min{x_i | x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; min{x_i | x_i ∈ X_L} denotes the minimum abscissa in the set X_L.
The projection sequence V_top is obtained as follows:
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) denotes the projection map obtained by projecting the t-th frame depth image from the top. When several points are projected to the same position of the top projection map, the point farthest from the projection plane is kept. The points in the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa, pixel value and ordinate of the point on the projection map T_t: the projected point has abscissa x_i and ordinate d_i. For points projected to the same coordinates (x_i, d_i) of the top projection map, the ordinate of the point with the largest ordinate is selected to compute the pixel value of the projection map at those coordinates, formulated as
T_t(x_i, d_i) = f_4(y_q),
where f_4 is a linear function that maps the ordinate y_q to the interval [0, 255], y_q = max{y_i | y_i ∈ Y_T}, and Y_T is the set of ordinates of all points in the depth image whose abscissa is x_i and whose depth value is d_i; max{y_i | y_i ∈ Y_T} denotes the maximum ordinate in the set Y_T.
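A minimal NumPy sketch of the four projections of a single depth frame. The exact linear maps f_1–f_4 are not specified above, so the scalings used here (and the treatment of zero depth as background) are illustrative assumptions.

```python
import numpy as np

def project_frame(depth, D):
    """depth: (R, C) array of depth values in [0, D]; zeros are treated as background.
    Returns the front (R, C), right (R, D), left (R, D) and top (D, C) projection maps."""
    R, C = depth.shape
    ys, xs = np.nonzero(depth)                      # foreground points p_i
    ds = np.clip(depth[ys, xs].astype(int), 0, D - 1)

    front = np.zeros((R, C))
    right = np.zeros((R, D))
    left = np.zeros((R, D))
    top = np.zeros((D, C))

    # f_1: closer points (smaller depth) get larger pixel values
    front[ys, xs] = 255.0 * (D - depth[ys, xs]) / D
    # Right view: traverse with increasing x; later writes keep the largest x.
    for x, y, d in sorted(zip(xs, ys, ds)):
        right[y, d] = 255.0 * x / (C - 1)           # f_2: illustrative linear map of x
    # Left view: traverse with decreasing x; later writes keep the smallest x.
    for x, y, d in sorted(zip(xs, ys, ds), reverse=True):
        left[y, d] = 255.0 * (C - 1 - x) / (C - 1)  # f_3: illustrative linear map of x
    # Top view: traverse with increasing y; later writes keep the largest y.
    for y, x, d in sorted(zip(ys, xs, ds)):
        top[d, x] = 255.0 * y / (R - 1)             # f_4: illustrative linear map of y
    return front, right, left, top
```

Applying project_frame to every frame of a depth video gives the four projection sequences V_front, V_right, V_left and V_top.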
Preferably, convolution unit 1 comprises 2 convolution layers and 1 max pooling layer; each convolution layer has 64 convolution kernels, each convolution kernel is of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, and the output of convolution unit 1 is C_1.
Convolution unit 2 comprises 2 convolution layers and 1 max pooling layer; each convolution layer has 128 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 2 is C_1 and its output is C_2.
Convolution unit 3 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 256 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 3 is C_2 and its output is C_3.
Convolution unit 4 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 512 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 4 is C_3 and its output is C_4.
Convolution unit 5 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 512 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 5 is C_4 and its output is C_5.
The inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5.
The output C_1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 is of size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M_1.
The output C_2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 is of size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M_2.
The output C_3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M_3.
The output C_4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M_4.
The output C_5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M_5.
The outputs M_1, M_2, M_3, M_4 and M_5 of convolutional layers 1 to 5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 6 is M_6.
The output of the multi-feature fusion unit is the output M_6 of convolutional layer 6.
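A PyTorch sketch of the feature-extraction module described above (VGG-16-style convolution units followed by the multi-feature fusion unit, average pooling and fully connected layers 1–2). The single-channel input, the ReLU after every convolution, the upsampling of C_4 and C_5 to the spatial size of C_3, the global average pooling and the sizes D_1, D_2 are assumptions needed to make the tensor shapes line up; they are not fixed by the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_unit(in_ch, out_ch, n_convs):
    """One convolution unit: n_convs 3x3 conv layers (with ReLU) and a 2x2 max pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class FeatureExtractor(nn.Module):
    """Convolution units 1-5, the multi-feature fusion unit, average pooling
    and fully connected layers 1-2 (outputs Q1 and Q2)."""
    def __init__(self, d1=1024, d2=512):
        super().__init__()
        self.unit1 = conv_unit(1, 64, 2)     # dynamic image assumed single-channel
        self.unit2 = conv_unit(64, 128, 2)
        self.unit3 = conv_unit(128, 256, 3)
        self.unit4 = conv_unit(256, 512, 3)
        self.unit5 = conv_unit(512, 512, 3)
        self.pool1 = nn.MaxPool2d(4)         # max pooling layer 1 (4x4) for C1
        self.pool2 = nn.MaxPool2d(2)         # max pooling layer 2 (2x2) for C2
        self.fuse = nn.ModuleList(           # convolutional layers 1-5 (512 kernels, 1x1)
            [nn.Conv2d(c, 512, 1) for c in (64, 128, 256, 512, 512)])
        self.conv6 = nn.Conv2d(5 * 512, 512, 1)
        self.fc1 = nn.Linear(512, d1)        # fully connected layer 1 (D1 neurons)
        self.fc2 = nn.Linear(d1, d2)         # fully connected layer 2 (D2 neurons)

    def forward(self, x):                    # x: (B, 1, H, W) dynamic image
        c1 = self.unit1(x)
        c2 = self.unit2(c1)
        c3 = self.unit3(c2)
        c4 = self.unit4(c3)
        c5 = self.unit5(c4)
        size = c3.shape[-2:]                 # common spatial size of the fused maps
        m1 = self.fuse[0](self.pool1(c1))
        m2 = self.fuse[1](self.pool2(c2))
        m3 = self.fuse[2](c3)
        m4 = self.fuse[3](F.interpolate(c4, size=size))   # upsampling layer 1
        m5 = self.fuse[4](F.interpolate(c5, size=size))   # upsampling layer 2
        m6 = self.conv6(torch.cat([m1, m2, m3, m4, m5], dim=1))
        s = F.adaptive_avg_pool2d(m6, 1).flatten(1)       # average pooling layer -> S
        q1 = F.relu(self.fc1(s))                          # Q1
        return F.relu(self.fc2(q1))                       # Q2, the extracted feature
```

One such extractor per projection stream, with the softmax fully connected layer 3 sketched earlier placed on top of the four Q_2 features, gives a four-stream network; whether the four streams share weights is not stated in the text.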
The invention has the following beneficial effects: 1) behavior recognition based on depth video cannot acquire information such as a person's appearance, which protects personal privacy; at the same time, depth video is not easily affected by illumination and can provide richer three-dimensional information about behaviors;
2) projecting the depth video onto different planes yields information about different dimensions of the behavior, and combining this information makes human behaviors easier to recognize; when training the network, only 4 dynamic images are used as a compact representation of each video, so the requirements on computer hardware are low.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the feature extraction module.
FIG. 3 is a flow chart of the four-stream human behavior recognition network.
FIG. 4 is a schematic diagram of the plane projections of a hand-waving behavior in the embodiment.
FIG. 5 is the front projection dynamic image of the hand-waving behavior in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the present invention, referring to FIGS. 1 to 5, a depth video behavior recognition method comprises the following steps:
1) performing front, right-side, left-side and top projection on the depth video of each behavior sample to obtain 4 projection sequences;
2) computing the dynamic images of the 4 projection sequences of each behavior sample to obtain the 4 dynamic images of each behavior sample;
3) respectively inputting the 4 dynamic images into a feature extraction module to extract features;
4) concatenating the features extracted from the 4 dynamic images and inputting them into a fully connected layer;
5) constructing a four-stream human behavior recognition network;
6) computing the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, inputting them into the four-stream human behavior recognition network, and training the network until convergence;
7) computing the 4 dynamic images of each test behavior sample and inputting them into the trained four-stream human behavior recognition network to realize behavior recognition.
Obtaining the projection sequences in step 1):
Each behavior sample consists of all frames of its depth video. For the depth video V of any behavior sample:
V = {I_t | t ∈ [1, N]},
where t is the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^(R×C) is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample, R and C being the numbers of rows and columns of the matrix and ℝ denoting a real matrix; I_t(x_i, y_i) = d_i is the depth of the point p_i whose coordinates on the t-th frame depth image are (x_i, y_i), i.e. the distance from p_i to the depth camera, with d_i ∈ [0, D], where D is the farthest distance the depth camera can detect.
The depth video V of the behavior sample is projected onto four planes: the front, the right side, the left side and the top. The depth video V of the behavior sample can then be expressed as a set of four projection map sequences, formulated as follows:
V = {V_front, V_right, V_left, V_top},
where V_front denotes the projection sequence obtained by front projection of the depth video V of the behavior sample, V_right the projection sequence obtained by right-side projection, V_left the projection sequence obtained by left-side projection, and V_top the projection sequence obtained by top projection.
Vfront={Ft|t∈[1,N]in which FtR×CThe method comprises the steps that a projection diagram obtained by front projection is carried out on a tth frame depth image of a depth video V of a behavior sample; point p in depth imageiX of the abscissaiY longitudinal coordinate valueiDepth value diDetermine the projection of the point onto the projection FtThe abscissa value of the point
Figure BDA0003224425560000101
Ordinate value
Figure BDA0003224425560000102
Pixel value
Figure BDA0003224425560000103
Can be formulated as:
Figure BDA0003224425560000104
Figure BDA0003224425560000105
wherein f is1To be depth value diMapping to [0,255]The linear function of the interval is such that points with smaller depth values have larger pixel values on the projection view, i.e. points closer to the depth camera are brighter on the front projection view.
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^(R×D) denotes the projection map obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection map; observing the behavior from the right side, what can be seen is the point closest to the observer, i.e. the point farthest from the projection plane. Therefore the abscissa of the point farthest from the projection plane on the depth image should be kept, and the pixel value at that position of the projection map should be computed from this abscissa. For this purpose, the points in the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map R_t: the projected point has abscissa d_i and ordinate y_i, and its pixel value is
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function that maps the abscissa x_i to the interval [0, 255]. As x keeps increasing, a new point may be projected to the same position of the projection map as a previously projected point; the latest point is kept, i.e. the pixel value at that position of the projection map is computed from the abscissa of the point with the largest abscissa:
R_t(d_i, y_i) = f_2(x_m),
where x_m = max{x_i | x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; max{x_i | x_i ∈ X_R} denotes the maximum abscissa in the set X_R.
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^(R×D) denotes the projection map obtained by left-side projection of the t-th frame depth image. Similarly to the right-side projection map, when several points are projected to the same position of the left-side projection map, the point farthest from the projection plane should be kept. For this purpose, the points in the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map L_t: the projected point has abscissa d_i and ordinate y_i. For points projected to the same coordinates (d_i, y_i) of the left-side projection map, the abscissa of the point with the smallest abscissa is selected to compute the pixel value of the projection map at those coordinates, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function that maps the abscissa x_n to the interval [0, 255], x_n = min{x_i | x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; min{x_i | x_i ∈ X_L} denotes the minimum abscissa in the set X_L.
V_top = {T_t | t ∈ [1, N]}, where T_t ∈ ℝ^(D×C) denotes the projection map obtained by projecting the t-th frame depth image from the top. When several points are projected to the same position of the top projection map, the point farthest from the projection plane should be kept. For this purpose, the points in the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa, pixel value and ordinate of the point on the projection map T_t: the projected point has abscissa x_i and ordinate d_i. For points projected to the same coordinates (x_i, d_i) of the top projection map, the ordinate of the point with the largest ordinate is selected to compute the pixel value of the projection map at those coordinates, formulated as
T_t(x_i, d_i) = f_4(y_q),
where f_4 is a linear function that maps the ordinate y_q to the interval [0, 255], y_q = max{y_i | y_i ∈ Y_T}, and Y_T is the set of ordinates of all points in the depth image whose abscissa is x_i and whose depth value is d_i; max{y_i | y_i ∈ Y_T} denotes the maximum ordinate in the set Y_T.
Obtaining the dynamic images in step 2):
Taking the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example, the dynamic image is computed as follows:
First vectorize F_t, i.e. connect the rows of F_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely
w_t = √(i_t),
where √(i_t) denotes taking the arithmetic square root of each element of the row vector i_t. Denote w_t as the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the feature vector v_t of the t-th frame image of the front projection sequence V_front of the depth video V of the behavior sample as
v_t = (1/t) · Σ_{τ=1}^{t} w_τ,
where Σ_{τ=1}^{t} w_τ denotes summing the frame vectors of the 1st to the t-th frame images of the front projection sequence V_front of the depth video V of the behavior sample.
Compute the score B_t of the t-th frame image F_t of the front projection sequence V_front of the depth video V of the behavior sample as
B_t = u^T · v_t,
where u is a vector of dimension a, with a = R × C; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
Compute the value of u such that the later a frame image is ranked in the front projection sequence V_front, the higher its score, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ · ‖u‖² + Σ_{c>j} max{0, 1 − B_c + B_j},
where argmin_u E(u) denotes the u that minimizes E(u); λ is a constant; ‖u‖² denotes the sum of the squares of the elements of the vector u; B_c and B_j denote the scores of the c-th and j-th frame images of the front projection sequence V_front of the depth video V of the behavior sample, respectively; and max{0, 1 − B_c + B_j} means taking the larger of 0 and 1 − B_c + B_j.
After the vector u is computed with RankSVM, it is reshaped into an image of the same size as F_t, giving u′ ∈ ℝ^(R×C); u′ is called the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
The dynamic images of the right side, left side, and top projection sequences of the depth video V of the behavior sample are calculated in the same manner as the dynamic images of the front projection sequence.
Extracting features with the feature extraction module:
As shown in FIG. 2, the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of the behavior sample are respectively input to the feature extraction module to extract features. The feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, fully connected layer 1 and fully connected layer 2.
Convolution unit 1 comprises 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each convolution kernel is of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, and the output of convolution unit 1 is C_1.
Convolution unit 2 comprises 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 2 is C_1 and its output is C_2.
Convolution unit 3 comprises 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 3 is C_2 and its output is C_3.
Convolution unit 4 comprises 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 4 is C_3 and its output is C_4.
Convolution unit 5 comprises 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each convolution kernel is of size 3 × 3, and the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 5 is C_4 and its output is C_5.
The inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5.
The output C_1 of convolution unit 1 is input to max pooling layer 1 and convolutional layer 1 in the multi-feature fusion unit; the pooling kernel of max pooling layer 1 is of size 4 × 4, convolutional layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 1 is M_1.
The output C_2 of convolution unit 2 is input to max pooling layer 2 and convolutional layer 2 in the multi-feature fusion unit; the pooling kernel of max pooling layer 2 is of size 2 × 2, convolutional layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 2 is M_2.
The output C_3 of convolution unit 3 is input to convolutional layer 3 in the multi-feature fusion unit; convolutional layer 3 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 3 is M_3.
The output C_4 of convolution unit 4 is input to upsampling layer 1 and convolutional layer 4 in the multi-feature fusion unit; convolutional layer 4 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 4 is M_4.
The output C_5 of convolution unit 5 is input to upsampling layer 2 and convolutional layer 5 in the multi-feature fusion unit; convolutional layer 5 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 5 is M_5.
The outputs M_1, M_2, M_3, M_4 and M_5 of convolutional layers 1 to 5 are concatenated along the channel dimension and input to convolutional layer 6; convolutional layer 6 has 512 convolution kernels of size 1 × 1, and the output of convolutional layer 6 is M_6.
The output of the multi-feature fusion unit is the output M_6 of convolutional layer 6.
The output M_6 of the multi-feature fusion unit is input to the average pooling layer, whose output is S. The output S of the average pooling layer is input to fully connected layer 1; the number of neurons of fully connected layer 1 is D_1, and its output Q_1 is computed as follows:
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the ReLU activation function, W_1 is the weight of fully connected layer 1 and θ_1 is the bias vector of fully connected layer 1.
The output Q_1 of fully connected layer 1 is input to fully connected layer 2, whose number of neurons is D_2; the output Q_2 of fully connected layer 2 is computed as follows:
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of fully connected layer 2 and θ_2 is the bias vector of fully connected layer 2. The output of fully connected layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right-side, left-side and top projection sequences of the depth video V of the behavior sample are input to the feature extraction module separately, extracting the features Q_2^front, Q_2^right, Q_2^left and Q_2^top, respectively.
Connecting the features extracted in step 3) as described in step 4):
The features Q_2^front, Q_2^right, Q_2^left and Q_2^top, obtained by inputting the dynamic images of the four projection sequences of the depth video of each behavior sample into the feature extraction module, are concatenated and input to fully connected layer 3, whose activation function is softmax. The output Q_3 of fully connected layer 3 is computed as follows:
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of fully connected layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the features Q_2^front, Q_2^right, Q_2^left and Q_2^top concatenated into one vector, and θ_3 is the bias vector of fully connected layer 3.
Step 5), constructing the four-stream human behavior recognition network:
As shown in FIG. 3, the input of the network is the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of a behavior sample, and the output is the probability that the behavior sample belongs to each behavior class, i.e. the output Q_3 of fully connected layer 3. The loss function L of the network is
L = −Σ_{g=1}^{G} Σ_{k=1}^{K} l_g^k · log(Q_3^{g,k}),
where G is the total number of training samples and K is the number of behavior classes; Q_3^{g,k} is the k-th component of the network output for the g-th behavior sample, and l_g is the expected output of the g-th behavior sample, with its k-th component l_g^k defined as
l_g^k = 1 if k = c_g, and l_g^k = 0 otherwise,
where c_g is the label value of the g-th sample.
Step 6): compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each training behavior sample, input them into the four-stream human behavior recognition network, and train the network until convergence.
Step 7): compute the dynamic images of the front, right-side, left-side and top projection sequences of the depth video of each test behavior sample and input them into the trained four-stream human behavior recognition network to obtain the predicted probability of each behavior class for the current test behavior video sample; the behavior class with the largest probability is the class finally predicted for the current test behavior video sample, thereby realizing behavior recognition.
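A hedged PyTorch sketch of the training and testing procedure in steps 6) and 7). The optimizer, learning rate, epoch count and the use of nn.CrossEntropyLoss (which applies softmax internally, so the model here is assumed to return pre-softmax class scores) are illustrative choices rather than values from the text; model stands for any four-stream network such as the FeatureExtractor sketch above with a fusion head.

```python
import torch
import torch.nn as nn

def train_and_test(model, train_loader, test_loader, num_epochs=50, lr=1e-4):
    """Each loader yields ((front, right, left, top), label) batches of dynamic images.
    Trains the four-stream network, then predicts the max-probability class per test sample."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                 # cross-entropy, as in the loss L above
    for _ in range(num_epochs):                     # step 6): train until convergence
        model.train()
        for views, labels in train_loader:
            scores = model(*views)                  # four dynamic images in, class scores out
            loss = loss_fn(scores, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()                                    # step 7): behavior recognition
    preds = []
    with torch.no_grad():
        for views, _ in test_loader:
            preds.append(model(*views).argmax(dim=1))   # class with the largest probability
    return torch.cat(preds)
```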
Example:
As shown in FIGS. 4-5:
1) The behavior sample set contains 2400 samples in total, covering 8 behavior classes with 300 samples per class. Two thirds of the samples of each behavior class are randomly selected as the training set and the remaining one third as the test set, giving 1600 training samples and 800 test samples.
Each behavior sample consists of all frames of its depth video. Take the depth video V of any behavior sample as an example:
V = {I_t | t ∈ [1, 50]},
where t is the time index; this behavior sample has 50 frames in total. I_t ∈ ℝ^(240×240) is the matrix representation of the t-th frame depth image of the depth video V of the behavior sample; the numbers of rows and columns of each frame depth image are both 240, and ℝ denotes a real matrix. I_t(x_i, y_i) = d_i is the depth of the point p_i whose coordinates on the t-th frame depth image are (x_i, y_i), i.e. the distance from p_i to the depth camera.
The depth video V of the behavior sample is projected onto four planes: the front, the right side, the left side and the top. The depth video V of the behavior sample can then be expressed as a set of four projection map sequences, formulated as follows:
V = {V_front, V_right, V_left, V_top},
where V_front denotes the projection sequence obtained by front projection of the depth video V of the behavior sample, V_right the projection sequence obtained by right-side projection, V_left the projection sequence obtained by left-side projection, and V_top the projection sequence obtained by top projection.
V_front = {F_t | t ∈ [1, 50]}, where F_t ∈ ℝ^(240×240) is the projection map obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa, ordinate and pixel value of the point projected onto the projection map F_t: the projected point has abscissa x_i and ordinate y_i, and its pixel value is
F_t(x_i, y_i) = f_1(d_i),
where f_1 is a linear function that maps the depth value d_i to the interval [0, 255] such that a point with a smaller depth value has a larger pixel value on the projection map, i.e. a point closer to the depth camera is brighter on the front projection map.
V_right = {R_t | t ∈ [1, 50]}, where R_t ∈ ℝ^(240×240) is the projection map obtained by right-side projection of the t-th frame depth image. When the depth image is projected onto the right side, more than one point may be projected to the same position on the projection map; observing the behavior from the right side, what can be seen is the point closest to the observer, i.e. the point farthest from the projection plane. Therefore the abscissa of the point farthest from the projection plane on the depth image should be kept, and the pixel value at that position of the projection map should be computed from this abscissa. For this purpose, the points in the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map R_t: the projected point has abscissa d_i and ordinate y_i, and its pixel value is
R_t(d_i, y_i) = f_2(x_i),
where f_2 is a linear function that maps the abscissa x_i to the interval [0, 255]. As x keeps increasing, a new point may be projected to the same position of the projection map as a previously projected point; the latest point is kept, i.e. the pixel value at that position of the projection map is computed from the abscissa of the point with the largest abscissa:
R_t(d_i, y_i) = f_2(x_m),
where x_m = max{x_i | x_i ∈ X_R} and X_R is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; max{x_i | x_i ∈ X_R} denotes the maximum abscissa in the set X_R.
V_left = {L_t | t ∈ [1, 50]}, where L_t ∈ ℝ^(240×240) is the projection map obtained by left-side projection of the t-th frame depth image. Similarly to the right-side projection map, when several points are projected to the same position of the left-side projection map, the point farthest from the projection plane should be kept. For this purpose, the points in the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the pixel value, ordinate and abscissa of the point on the projection map L_t: the projected point has abscissa d_i and ordinate y_i. For points projected to the same coordinates (d_i, y_i) of the left-side projection map, the abscissa of the point with the smallest abscissa is selected to compute the pixel value of the projection map at those coordinates, formulated as
L_t(d_i, y_i) = f_3(x_n),
where f_3 is a linear function that maps the abscissa x_n to the interval [0, 255], x_n = min{x_i | x_i ∈ X_L}, and X_L is the set of abscissas of all points in the depth image whose ordinate is y_i and whose depth value is d_i; min{x_i | x_i ∈ X_L} denotes the minimum abscissa in the set X_L.
V_top = {T_t | t ∈ [1, 50]}, where T_t ∈ ℝ^(240×240) is the projection map obtained by projecting the t-th frame depth image from the top. When several points are projected to the same position of the top projection map, the point farthest from the projection plane should be kept. For this purpose, the points in the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection map. The abscissa x_i, ordinate y_i and depth value d_i of a point p_i in the depth image determine the abscissa, pixel value and ordinate of the point on the projection map T_t: the projected point has abscissa x_i and ordinate d_i. For points projected to the same coordinates (x_i, d_i) of the top projection map, the ordinate of the point with the largest ordinate is selected to compute the pixel value of the projection map at those coordinates, formulated as
T_t(x_i, d_i) = f_4(y_q),
where f_4 is a linear function that maps the ordinate y_q to the interval [0, 255], y_q = max{y_i | y_i ∈ Y_T}, and Y_T is the set of ordinates of all points in the depth image whose abscissa is x_i and whose depth value is d_i; max{y_i | y_i ∈ Y_T} denotes the maximum ordinate in the set Y_T.
2) And calculating the dynamic images of the 4 projection sequences of the depth video of each behavior sample to obtain 4 dynamic images of each behavior sample. Front projection sequence V of depth video V with behavior samplesfront={Ft|t∈[1,50]For example, the motion image is calculated as follows:
first to FtVectorization, i.e. FtIs connected into a new row vector it
For row vector itEach element in (a) is used for calculating the arithmetic square root to obtain a new vector wtNamely:
Figure BDA0003224425560000191
wherein the content of the first and second substances,
Figure BDA0003224425560000192
representing a row vector itEach element in (a) takes the arithmetic square root. Note wtFront projection sequence V of depth video V as behavior samplefrontThe frame vector of the t-th frame.
Calculating a front projection sequence V of a depth video V of a behavior samplefrontThe feature vector v of the t frame imagetThe calculation method is as follows:
Figure BDA0003224425560000193
wherein the content of the first and second substances,
Figure BDA0003224425560000194
front projection sequence V representing depth video V on a behavior samplefrontSumming frame vectors of the 1 st frame image to the t th frame image;
calculating a front projection sequence V of a depth video V of a behavior samplefrontT frame image FtScore B oftThe calculation formula is as follows:
Bt=uT·vt
where u is a vector with dimension 57600. u. ofTRepresents transposing the vector u; u. ofT·vtRepresenting the vector obtained by transposing the calculation pair vector u and the feature vector vtDot product of (2);
The value of u is computed so that the later a frame image is ranked in the front projection sequence V_front, the higher its score, i.e. the larger t, the higher the score B_t. u can be computed with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ‖u‖² + Σ_{c>j} max{0, 1 − B_c + B_j},
where argmin_u E(u) denotes the u that minimizes the value of E(u), λ is a constant, and ‖u‖² denotes the sum of the squares of the elements of the vector u; B_c and B_j respectively denote the scores of the c-th frame image and the j-th frame image of the front projection sequence V_front of the depth video V of the behavior sample, and max{0, 1 − B_c + B_j} denotes the larger of 0 and 1 − B_c + B_j.
After the vector u is computed with RankSVM, u is rearranged into an image of the same size as F_t, giving u′ ∈ ℝ^{240×240}; u′ is called the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample. Fig. 4 shows the dynamic image of the front projection of the hand-waving behavior.
The dynamic images of the right side, left side and top projection sequences of the depth video V of the behavior sample are computed in the same way as the dynamic image of the front projection sequence.
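As a rough sketch of the rank-pooling step above, the objective E(u) can be minimized with plain gradient descent in NumPy instead of a dedicated RankSVM solver; the learning rate, iteration count and regularization weight below are illustrative assumptions, and the frame preprocessing (vectorize, element-wise square root, running mean) follows the formulas above.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, lr=1e-4, iters=100):
    """frames: sequence of T projection maps, each 240x240.

    Returns u' reshaped to the frame size, a simple stand-in for the dynamic
    image above (hinge objective minimized by gradient descent, not the
    original RankSVM solver).
    """
    T = len(frames)
    I = np.stack([f.reshape(-1).astype(np.float64) for f in frames])   # i_t, shape (T, 57600)
    W = np.sqrt(I)                                                     # w_t: element-wise square root
    V = np.cumsum(W, axis=0) / np.arange(1, T + 1)[:, None]            # v_t: running mean of frame vectors

    u = np.zeros(V.shape[1])
    for _ in range(iters):
        B = V @ u                                  # scores B_t = u^T v_t
        grad = 2.0 * lam * u                       # gradient of lambda * ||u||^2
        for c in range(T):                         # hinge terms over pairs c > j
            for j in range(c):
                if 1.0 - B[c] + B[j] > 0.0:        # violated ranking constraint
                    grad += V[j] - V[c]
        u -= lr * grad
    return u.reshape(frames[0].shape)
```

For 50-frame sequences this toy loop is slow but shows the structure of the computation; in practice a proper RankSVM implementation, or the approximate rank pooling used in the dynamic image literature, would be substituted.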
3) The dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior sample are respectively input into a feature extraction module to extract features. The feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, full connection layer 1 and full connection layer 2.
Convolution unit 1 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 64 convolution kernels, each of size 3 × 3; the pooling kernel of the max pooling layer is of size 2 × 2. The output of convolution unit 1 is C_1.
Convolution unit 2 contains 2 convolution layers and 1 max pooling layer. Each convolution layer has 128 convolution kernels, each of size 3 × 3; the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 2 is C_1 and its output is C_2.
Convolution unit 3 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 256 convolution kernels, each of size 3 × 3; the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 3 is C_2 and its output is C_3.
Convolution unit 4 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each of size 3 × 3; the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 4 is C_3 and its output is C_4.
Convolution unit 5 contains 3 convolution layers and 1 max pooling layer. Each convolution layer has 512 convolution kernels, each of size 3 × 3; the pooling kernel of the max pooling layer is of size 2 × 2. The input of convolution unit 5 is C_4 and its output is C_5.
The inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5.
The output C_1 of convolution unit 1 is input into max pooling layer 1 and convolution layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 is of size 4 × 4, convolution layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolution layer 1 is M_1.
The output C_2 of convolution unit 2 is input into max pooling layer 2 and convolution layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 is of size 2 × 2, convolution layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolution layer 2 is M_2.
The output C_3 of convolution unit 3 is input into convolution layer 3 of the multi-feature fusion unit; convolution layer 3 has 512 convolution kernels of size 1 × 1, and its output is M_3.
The output C_4 of convolution unit 4 is input into upsampling layer 1 and convolution layer 4 of the multi-feature fusion unit; convolution layer 4 has 512 convolution kernels of size 1 × 1, and its output is M_4.
The output C_5 of convolution unit 5 is input into upsampling layer 2 and convolution layer 5 of the multi-feature fusion unit; convolution layer 5 has 512 convolution kernels of size 1 × 1, and its output is M_5.
The output M_1 of convolution layer 1, the output M_2 of convolution layer 2, the output M_3 of convolution layer 3, the output M_4 of convolution layer 4 and the output M_5 of convolution layer 5 are concatenated along the channel dimension and input into convolution layer 6; convolution layer 6 has 512 convolution kernels of size 1 × 1, and its output is M_6.
The output of the multi-feature fusion unit is the output M_6 of convolution layer 6.
The output M_6 of the multi-feature fusion unit is input into the average pooling layer, whose output is S. The output S of the average pooling layer is input into full connection layer 1, which has 4096 neurons. The output Q_1 of full connection layer 1 is computed as:
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the relu activation function, W_1 is the weight of full connection layer 1, and θ_1 is the offset vector of full connection layer 1.
The output Q_1 of full connection layer 1 is input into full connection layer 2, which has 1000 neurons. The output Q_2 of full connection layer 2 is computed as:
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of full connection layer 2 and θ_2 is the offset vector of full connection layer 2. The output of full connection layer 2 is the feature extracted by the feature extraction module.
The dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are respectively input into the feature extraction module, yielding four features, one per projection view, denoted Q_2^front, Q_2^right, Q_2^left and Q_2^top.
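The feature extraction module described above can be sketched in PyTorch as follows. The layer widths (64/128/256/512 kernels, 512-channel 1 × 1 fusion convolutions, 4096- and 1000-neuron full connection layers) follow the text; the padding, the use of global average pooling for the average pooling layer, single-channel dynamic-image inputs, and interpolation of C_4 and C_5 to the spatial size of C_3 are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_unit(in_ch, out_ch, n_convs):
    """VGG-style unit: n_convs 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction module: 5 convolution units,
    a multi-feature fusion unit, average pooling and two full connection layers."""
    def __init__(self):
        super().__init__()
        self.unit1 = conv_unit(1, 64, 2)      # dynamic images treated as single-channel (assumption)
        self.unit2 = conv_unit(64, 128, 2)
        self.unit3 = conv_unit(128, 256, 3)
        self.unit4 = conv_unit(256, 512, 3)
        self.unit5 = conv_unit(512, 512, 3)
        # multi-feature fusion unit: pool/upsample each C_i to a common size, then 1x1 convolutions
        self.pool1 = nn.MaxPool2d(4)
        self.pool2 = nn.MaxPool2d(2)
        self.conv1x1 = nn.ModuleList([nn.Conv2d(c, 512, 1) for c in (64, 128, 256, 512, 512)])
        self.fuse = nn.Conv2d(5 * 512, 512, 1)
        self.fc1 = nn.Linear(512, 4096)
        self.fc2 = nn.Linear(4096, 1000)

    def forward(self, x):
        c1 = self.unit1(x); c2 = self.unit2(c1); c3 = self.unit3(c2)
        c4 = self.unit4(c3); c5 = self.unit5(c4)
        size = c3.shape[-2:]                   # fuse at the spatial size of C3 (assumption)
        m1 = self.conv1x1[0](self.pool1(c1))
        m2 = self.conv1x1[1](self.pool2(c2))
        m3 = self.conv1x1[2](c3)
        m4 = self.conv1x1[3](F.interpolate(c4, size=size))   # upsampling layer 1
        m5 = self.conv1x1[4](F.interpolate(c5, size=size))   # upsampling layer 2
        m6 = self.fuse(torch.cat([m1, m2, m3, m4, m5], dim=1))
        s = F.adaptive_avg_pool2d(m6, 1).flatten(1)           # average pooling layer (global, assumption)
        q1 = F.relu(self.fc1(s))
        return F.relu(self.fc2(q1))            # Q2: the extracted 1000-dim feature
```

With a 240 × 240 dynamic image as input, the five fused maps M_1 to M_5 all come out at 30 × 30, so the channel-wise concatenation described above is well defined.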
4) For each behavior sample, the features extracted by the feature extraction module from the dynamic images of the four projection sequences of its depth video are concatenated and input into full connection layer 3, whose activation function is softmax. The output Q_3 of full connection layer 3 is computed as:
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of full connection layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the features Q_2^front, Q_2^right, Q_2^left and Q_2^top concatenated into one vector, and θ_3 is the offset vector of full connection layer 3.
5) A four-stream human behavior recognition network is constructed. The inputs of the network are the dynamic images of the front, right side, left side and top projection sequences of the depth video of a behavior sample, and the output is the probability that the behavior sample belongs to each behavior class, i.e. the output Q_3 of full connection layer 3. The loss function L of the network is:
L = −Σ_{g=1}^{G} Σ_{k=1}^{K} l_{g,k} log(Q_{3,k}^{g}),
where G is the total number of training samples, K is the number of behavior classes, Q_{3,k}^{g} is the k-th component of the network output Q_3 for the g-th behavior sample, and l_g = (l_{g,1}, …, l_{g,K}) is the expected output of the g-th behavior sample, defined as:
l_{g,k} = 1 if k equals the label value of the g-th sample, and l_{g,k} = 0 otherwise.
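Continuing the sketch above, the four-stream network and its loss could be assembled as follows; the number of behavior classes num_classes, the use of four independent (non-weight-shared) feature extractors, and nn.CrossEntropyLoss as a stand-in for the softmax layer plus the loss L are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FourStreamNet(nn.Module):
    """Sketch: one feature extractor per projection view, concatenation of the
    four features, and a full connection layer scored with softmax."""
    def __init__(self, make_stream, num_classes, feat_dim=1000):
        super().__init__()
        # make_stream() builds one per-view feature extractor, e.g. the
        # FeatureExtractor sketched earlier; streams are not weight-shared here (assumption)
        self.streams = nn.ModuleList([make_stream() for _ in range(4)])
        self.fc3 = nn.Linear(4 * feat_dim, num_classes)

    def forward(self, front, right, left, top):
        feats = [s(v) for s, v in zip(self.streams, (front, right, left, top))]
        return self.fc3(torch.cat(feats, dim=1))   # class scores; softmax is applied inside the loss

# cross-entropy over the K behavior classes; combines the softmax of full
# connection layer 3 with the log term of the loss L (batch-averaged)
criterion = nn.CrossEntropyLoss()
```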
6) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each training behavior sample are computed and input into the four-stream human behavior recognition network, and the network is trained until convergence.
7) The dynamic images of the front, right side, left side and top projection sequences of the depth video of each test behavior sample are computed and input into the trained four-stream human behavior recognition network, which outputs the predicted probability that the current test behavior video sample belongs to each behavior class; the behavior class with the largest probability value is taken as the finally predicted behavior class of the current test behavior video sample, thereby realizing behavior recognition.
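A minimal training and testing loop for such a network might look like the following; the optimizer, learning rate, epoch count and the data loader yielding the four dynamic images with a class label are placeholders, not details taken from the patent.

```python
import torch

def train(model, loader, criterion, epochs=30, lr=1e-3, device="cpu"):
    # simple SGD training loop over batches of (front, right, left, top, label)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for front, right, left, top, label in loader:
            opt.zero_grad()
            out = model(front.to(device), right.to(device),
                        left.to(device), top.to(device))
            loss = criterion(out, label.to(device))
            loss.backward()
            opt.step()

@torch.no_grad()
def predict(model, front, right, left, top):
    # behavior class with the largest predicted probability
    model.eval()
    probs = torch.softmax(model(front, right, left, top), dim=1)
    return probs.argmax(dim=1)
```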
The relu activation function is defined as f(x) = max(0, x); its input is x and its output is the larger of x and 0.
The softmax activation function is defined as:
S_i = e^{z_i} / Σ_{j=1}^{n} e^{z_j},
where z_i denotes the output of the i-th neuron of the full connection layer, z_j denotes the output of the j-th neuron of the full connection layer, n is the number of neurons of the full connection layer, and S_i denotes the output of the i-th neuron of the full connection layer after the softmax activation function.
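For instance, a quick NumPy check of these two activation functions for a toy vector z:

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])
relu = np.maximum(0.0, z)                 # f(x) = max(0, x) -> [2.0, 0.0, 0.5]
softmax = np.exp(z) / np.exp(z).sum()     # S_i = e^{z_i} / sum_j e^{z_j}
print(relu, softmax, softmax.sum())       # softmax entries are positive and sum to 1
```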
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A depth video behavior recognition method is characterized by comprising the following steps:
1) carrying out front, right side, left side and top projection on the depth video of each behavior sample to obtain a corresponding projection sequence;
2) obtaining a dynamic image of each behavior sample by calculating a dynamic image of each projection sequence;
3) inputting the dynamic image of each behavior sample into a feature extraction module and extracting features;
4) connecting the features extracted from the dynamic images of each behavior sample, and inputting the connected features into a full connection layer;
5) constructing a four-stream human behavior recognition network;
6) calculating dynamic images of front, right side, left side and top projection sequences of the depth video of each training behavior sample, inputting the dynamic images into a four-stream human behavior recognition network, and training the four-stream human behavior recognition network until convergence;
7) and calculating each dynamic image of the behavior sample to be tested, and inputting each calculated dynamic image into the trained four-stream human behavior recognition network to realize behavior recognition.
2. The method for identifying deep video behaviors as claimed in claim 1, wherein the projection sequence in step 1) is obtained by:
each behavior sample consists of all frames of its depth video; the depth video of any behavior sample is denoted
V = {I_t | t ∈ [1, N]},
wherein t denotes the time index and N is the total number of frames of the depth video V of the behavior sample; I_t ∈ ℝ^{R×C} denotes the matrix representation of the t-th frame depth image of the depth video V of the behavior sample, R and C being respectively the number of rows and the number of columns of the matrix, and ℝ denoting a real matrix; I_t(x_i, y_i) = d_i denotes the depth of the point p_i with coordinates (x_i, y_i) on the t-th frame depth image, i.e. the distance of the point p_i from the depth camera, d_i ∈ [0, D], where D denotes the farthest distance that the depth camera can detect;
the depth video V of a behavior sample may be represented as a set of projection sequences, formulated as follows:
V={Vfront,Vright,Vleft,Vtop},
wherein V_front denotes the projection sequence obtained by front projection of the depth video V of the behavior sample, V_right denotes the projection sequence obtained by right-side projection of the depth video V of the behavior sample, V_left denotes the projection sequence obtained by left-side projection of the depth video V of the behavior sample, and V_top denotes the projection sequence obtained by top projection of the depth video V of the behavior sample.
3. The method according to claim 1, wherein the dynamic image in step 2) is calculated as follows:
taking the front projection sequence V_front = {F_t | t ∈ [1, N]} of the depth video V of a behavior sample as an example, F_t is first vectorized, i.e. the rows of F_t are concatenated into a new row vector i_t;
the arithmetic square root of each element of the row vector i_t is taken, giving a new vector w_t, namely:
w_t = √(i_t),
where √(i_t) denotes taking the arithmetic square root of each element of the row vector i_t, and w_t is taken as the frame vector of the t-th frame of the front projection sequence V_front of the depth video V of the behavior sample;
the feature vector v_t of the t-th frame image of the front projection sequence V_front of the depth video V of the behavior sample is computed as:
v_t = (1/t) Σ_{τ=1}^{t} w_τ,
where Σ_{τ=1}^{t} w_τ denotes the sum of the frame vectors of the 1st to the t-th frame images of the front projection sequence V_front of the depth video V of the behavior sample;
the score B_t of the t-th frame image F_t of the front projection sequence V_front of the depth video V of the behavior sample is computed as:
B_t = u^T · v_t,
where u is a vector of dimension a, a = R × C; u^T denotes the transpose of the vector u; u^T · v_t denotes the dot product of the transposed vector u with the feature vector v_t;
the value of u is computed such that the scores of the frame images of the front projection sequence V_front increase from front to back, i.e. the larger t, the higher the score B_t; u can be computed with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ‖u‖² + Σ_{c>j} max{0, 1 − B_c + B_j},
where argmin_u E(u) denotes the u that minimizes the value of E(u), λ is a constant, and ‖u‖² denotes the sum of the squares of the elements of the vector u; B_c and B_j respectively denote the scores of the c-th frame image and the j-th frame image of the front projection sequence V_front of the depth video V of the behavior sample, and max{0, 1 − B_c + B_j} denotes the larger of 0 and 1 − B_c + B_j;
after the vector u is computed with RankSVM, u is rearranged into an image of the same size as F_t, giving u′ ∈ ℝ^{R×C}; u′ is the dynamic image of the front projection sequence V_front of the depth video V of the behavior sample.
4. The deep video behavior recognition method according to claim 1, wherein the feature extraction module comprises convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4, convolution unit 5, a multi-feature fusion unit, an average pooling layer, full connection layer 1 and full connection layer 2; the outputs of convolution unit 1, convolution unit 2, convolution unit 3, convolution unit 4 and convolution unit 5 are first input into the multi-feature fusion unit, the output M_6 of the multi-feature fusion unit is then input into the average pooling layer, and the output S of the average pooling layer is input into full connection layer 1, the number of neurons of full connection layer 1 being D_1; the output Q_1 of full connection layer 1 is computed as:
Q_1 = φ_relu(W_1 · S + θ_1),
where φ_relu is the relu activation function, W_1 is the weight of full connection layer 1, and θ_1 is the offset vector of full connection layer 1;
the output Q_1 of full connection layer 1 is input into full connection layer 2, the number of neurons of full connection layer 2 being D_2; the output Q_2 of full connection layer 2 is computed as:
Q_2 = φ_relu(W_2 · Q_1 + θ_2),
where W_2 is the weight of full connection layer 2 and θ_2 is the offset vector of full connection layer 2, the output of full connection layer 2 being the feature extracted by the feature extraction module;
the dynamic images of the front, right side, left side and top projection sequences of the depth video V of the behavior sample are respectively input into the feature extraction module to extract the features Q_2^front, Q_2^right, Q_2^left and Q_2^top.
5. The method according to claim 1, wherein the features in step 4) are connected in such a way that the extracted features Q_2^front, Q_2^right, Q_2^left and Q_2^top are concatenated into one vector and input into full connection layer 3, whose activation function is softmax; the output Q_3 of full connection layer 3 is computed as:
Q_3 = φ_softmax(W_3 · [Q_2^front, Q_2^right, Q_2^left, Q_2^top] + θ_3),
where φ_softmax denotes the softmax activation function, W_3 is the weight of full connection layer 3, [Q_2^front, Q_2^right, Q_2^left, Q_2^top] denotes the features Q_2^front, Q_2^right, Q_2^left and Q_2^top concatenated into one vector, and θ_3 is the offset vector of full connection layer 3.
6. The deep video behavior recognition method according to claim 1, wherein the four-stream human behavior recognition network constructed in step 5) is as follows: the inputs of the network are the dynamic images of the projection sequences of the depth video of a behavior sample, and the output is Q_3; the loss function L of the network is
L = −Σ_{g=1}^{G} Σ_{k=1}^{K} l_{g,k} log(Q_{3,k}^{g}),
wherein G is the total number of training samples, K is the number of behavior classes, Q_{3,k}^{g} is the k-th component of the network output Q_3 for the g-th behavior sample, and l_g = (l_{g,1}, …, l_{g,K}) is the expected output of the g-th behavior sample, defined as:
l_{g,k} = 1 if k equals the label value of the g-th sample, and l_{g,k} = 0 otherwise.
7. The depth video behavior recognition method according to claim 1, wherein the behavior recognition in step 7) is performed as follows: the dynamic images of the front, right side, left side and top projection sequences of the depth video of the behavior sample to be tested are computed and input into the trained four-stream human behavior recognition network, which outputs the predicted probability that the current test behavior video sample belongs to each behavior class; the behavior class with the largest probability value is the finally predicted behavior class of the current test behavior video sample.
8. The depth video behavior recognition method according to claim 2, wherein the projection sequence V_front is obtained as follows:
V_front = {F_t | t ∈ [1, N]}, where F_t ∈ ℝ^{R×C} denotes the projection obtained by front projection of the t-th frame depth image of the depth video V of the behavior sample; a point p_i of the depth image, with abscissa x_i, ordinate y_i and depth value d_i, determines the abscissa x_i^{F_t}, the ordinate y_i^{F_t} and the pixel value v_i^{F_t} of its projection on the projection F_t, which can be formulated as:
x_i^{F_t} = x_i, y_i^{F_t} = y_i, v_i^{F_t} = f_1(d_i),
where f_1 is a linear function mapping the depth value d_i to the interval [0, 255], such that a point with a smaller depth value has a larger pixel value on the projection, i.e. a point closer to the depth camera is brighter on the front projection;
the projection sequence V_right is obtained as follows:
V_right = {R_t | t ∈ [1, N]}, where R_t ∈ ℝ^{R×D} denotes the projection obtained by right-side projection of the t-th frame depth image; when the depth image is projected onto the right side, more than one point may be projected onto the same position of the projection; observing the behavior from the right side, the point closest to the observer, i.e. the point farthest from the projection plane, is the one seen; therefore the abscissa value of the point of the depth image farthest from the projection plane is kept, and the pixel value at that position of the projection is computed from this abscissa value; the points of the depth image are traversed column by column, starting from the column with the smallest abscissa x and proceeding in the direction of increasing x, and projected onto the projection; a point p_i of the depth image, with abscissa x_i, ordinate y_i and depth value d_i, respectively determines the pixel value v_i^{R_t}, the ordinate y_i^{R_t} and the abscissa x_i^{R_t} of its projection on the projection R_t, formulated as:
v_i^{R_t} = f_2(x_i), y_i^{R_t} = y_i, x_i^{R_t} = d_i,
where f_2 is a linear function mapping the abscissa value x_i to the interval [0, 255]; as x increases, if a new point is projected onto the same position of the projection as a previously projected point, the newest point is kept, i.e. the pixel value at that position of the projection is computed from the abscissa value of the point with the largest abscissa:
R_t(y^{R_t}, x^{R_t}) = f_2(x_m),
where x_m = max x_i, x_i ∈ X_R, X_R being the set of abscissas of all points of the depth image whose ordinate is y^{R_t} and whose depth value is x^{R_t}; max x_i, x_i ∈ X_R, denotes the maximum abscissa in X_R;
the projection sequence V_left is obtained as follows:
V_left = {L_t | t ∈ [1, N]}, where L_t ∈ ℝ^{R×D} denotes the projection obtained by left-side projection of the t-th frame depth image; when several points are projected onto the same position of the left-side projection, the point farthest from the projection plane is kept; the points of the depth image are traversed column by column, starting from the column with the largest abscissa x and proceeding in the direction of decreasing x, and projected onto the left-side projection; a point p_i of the depth image, with abscissa x_i, ordinate y_i and depth value d_i, respectively determines the pixel value v_i^{L_t}, the ordinate y_i^{L_t} and the abscissa x_i^{L_t} of its projection on the projection L_t; for points projected onto the same coordinates (x^{L_t}, y^{L_t}) of the left-side projection, the abscissa value of the point with the smallest abscissa is used to compute the pixel value at those coordinates, formulated as:
v_i^{L_t} = f_3(x_i), y_i^{L_t} = y_i, x_i^{L_t} = d_i,
L_t(y^{L_t}, x^{L_t}) = f_3(x_n),
where f_3 is a linear function mapping the abscissa value x_n to the interval [0, 255], x_n = min x_i, x_i ∈ X_L, X_L being the set of abscissas of all points of the depth image whose ordinate is y^{L_t} and whose depth value is x^{L_t}; min x_i, x_i ∈ X_L, denotes the minimum abscissa in X_L;
the projection sequence V_top is obtained as follows:
V_top = {O_t | t ∈ [1, N]}, where O_t ∈ ℝ^{D×C} denotes the projection obtained by projecting the t-th frame depth image onto the top plane; when several points are projected onto the same position of the top projection, the point farthest from the projection plane is kept; the points of the depth image are traversed row by row, starting from the row with the smallest ordinate y and proceeding in the direction of increasing y, and projected onto the top projection; a point p_i of the depth image, with abscissa x_i, ordinate y_i and depth value d_i, determines the abscissa x_i^{O_t}, the pixel value v_i^{O_t} and the ordinate y_i^{O_t} of its projection on the projection O_t; for points projected onto the same coordinates (x^{O_t}, y^{O_t}) of the projection, the ordinate value of the point with the largest ordinate is used as the pixel value at those coordinates, formulated as:
x_i^{O_t} = x_i, y_i^{O_t} = d_i, v_i^{O_t} = f_4(y_i),
O_t(y^{O_t}, x^{O_t}) = f_4(y_q),
where f_4 is a linear function mapping the ordinate value y_q to the interval [0, 255], y_q = max y_i, y_i ∈ Y_O, Y_O being the set of ordinates of all points of the depth image whose abscissa is x^{O_t} and whose depth value is y^{O_t}; max y_i, y_i ∈ Y_O, denotes the maximum ordinate in Y_O.
9. The method according to claim 4, wherein convolution unit 1 comprises 2 convolution layers and 1 max pooling layer; each convolution layer has 64 convolution kernels, each of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, and the output of convolution unit 1 is C_1;
convolution unit 2 comprises 2 convolution layers and 1 max pooling layer; each convolution layer has 128 convolution kernels, each of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, the input of convolution unit 2 is C_1 and its output is C_2;
convolution unit 3 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 256 convolution kernels, each of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, the input of convolution unit 3 is C_2 and its output is C_3;
convolution unit 4 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 512 convolution kernels, each of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, the input of convolution unit 4 is C_3 and its output is C_4;
convolution unit 5 comprises 3 convolution layers and 1 max pooling layer; each convolution layer has 512 convolution kernels, each of size 3 × 3, the pooling kernel of the max pooling layer is of size 2 × 2, the input of convolution unit 5 is C_4 and its output is C_5;
the inputs of the multi-feature fusion unit are the output C_1 of convolution unit 1, the output C_2 of convolution unit 2, the output C_3 of convolution unit 3, the output C_4 of convolution unit 4 and the output C_5 of convolution unit 5;
the output C_1 of convolution unit 1 is input into max pooling layer 1 and convolution layer 1 of the multi-feature fusion unit; the pooling kernel of max pooling layer 1 is of size 4 × 4, convolution layer 1 has 512 convolution kernels of size 1 × 1, and the output of convolution layer 1 is M_1;
the output C_2 of convolution unit 2 is input into max pooling layer 2 and convolution layer 2 of the multi-feature fusion unit; the pooling kernel of max pooling layer 2 is of size 2 × 2, convolution layer 2 has 512 convolution kernels of size 1 × 1, and the output of convolution layer 2 is M_2;
the output C_3 of convolution unit 3 is input into convolution layer 3 of the multi-feature fusion unit; convolution layer 3 has 512 convolution kernels of size 1 × 1, and its output is M_3;
the output C_4 of convolution unit 4 is input into upsampling layer 1 and convolution layer 4 of the multi-feature fusion unit; convolution layer 4 has 512 convolution kernels of size 1 × 1, and its output is M_4;
the output C_5 of convolution unit 5 is input into upsampling layer 2 and convolution layer 5 of the multi-feature fusion unit; convolution layer 5 has 512 convolution kernels of size 1 × 1, and its output is M_5;
the output M_1 of convolution layer 1, the output M_2 of convolution layer 2, the output M_3 of convolution layer 3, the output M_4 of convolution layer 4 and the output M_5 of convolution layer 5 are concatenated along the channel dimension and input into convolution layer 6; convolution layer 6 has 512 convolution kernels of size 1 × 1, and its output is M_6;
the output of the multi-feature fusion unit is the output M_6 of convolution layer 6.
CN202110967362.8A 2021-08-23 2021-08-23 Depth video behavior recognition method Active CN113591797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110967362.8A CN113591797B (en) 2021-08-23 2021-08-23 Depth video behavior recognition method


Publications (2)

Publication Number Publication Date
CN113591797A true CN113591797A (en) 2021-11-02
CN113591797B CN113591797B (en) 2023-07-28

Family

ID=78238846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110967362.8A Active CN113591797B (en) 2021-08-23 2021-08-23 Depth video behavior recognition method

Country Status (1)

Country Link
CN (1) CN113591797B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023024658A1 (en) * 2021-08-23 2023-03-02 苏州大学 Deep video linkage feature-based behavior recognition method


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740833A (en) * 2016-02-03 2016-07-06 北京工业大学 Human body behavior identification method based on depth sequence
CN107066979A (en) * 2017-04-18 2017-08-18 重庆邮电大学 A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks
CN108280421A (en) * 2018-01-22 2018-07-13 湘潭大学 Human bodys' response method based on multiple features Depth Motion figure
CN108537196A (en) * 2018-04-17 2018-09-14 中国民航大学 Human bodys' response method based on the time-space distribution graph that motion history point cloud generates
CN108805093A (en) * 2018-06-19 2018-11-13 华南理工大学 Escalator passenger based on deep learning falls down detection algorithm
CN109460734A (en) * 2018-11-08 2019-03-12 山东大学 The video behavior recognition methods and system shown based on level dynamic depth projection difference image table
CN110084211A (en) * 2019-04-30 2019-08-02 苏州大学 A kind of action identification method
CN113221694A (en) * 2021-04-29 2021-08-06 苏州大学 Action recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOFENG ZHAO ET AL.: "Discriminative Pose Analysis for Human Action Recognition", 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), pages 1-6 *
LIU TINGTING: "Human Behavior Recognition Based on Depth Data", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113591797B (en) 2023-07-28


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant