CN108647599B - Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network - Google Patents

Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Info

Publication number
CN108647599B
CN108647599B (application CN201810394571.6A)
Authority
CN
China
Prior art keywords
neural network
video
dimensional
frames
recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810394571.6A
Other languages
Chinese (zh)
Other versions
CN108647599A (en)
Inventor
宋佳蓉
杨忠
胡国雄
韩佳明
徐浩
陈聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN201810394571.6A
Publication of CN108647599A
Application granted
Publication of CN108647599B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Abstract

The invention discloses a human behavior recognition method combining 3D (three-dimensional) jump layer connection and a recurrent neural network, which comprises the following steps: step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers; step 2, performing space-time feature extraction on the video with a trained 3D convolutional neural network, and connecting the space-time features of different layers in series to obtain a high-dimensional feature vector; step 3, normalizing the high-dimensional feature vector obtained in step 2; step 4, sending the normalized high-dimensional feature vector into a recurrent neural network for feature fusion; and step 5, classifying the fused features to obtain the action category corresponding to the video. Compared with hand-designed motion features, the method has better robustness and can effectively process longer video sequences.

Description

Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
Technical Field
The invention belongs to the technical field of computer vision recognition, and particularly relates to a human behavior recognition method combining 3D convolutional jump layer (skip-layer) connections and a recurrent neural network.
Background
Human behavior recognition has important application prospects and market value in fields such as video surveillance, human-computer interaction and virtual reality, so video-based human behavior recognition has become one of the research hotspots in computer vision. Meanwhile, as deep learning, and convolutional neural networks in particular, has achieved effective results in computer vision, human behavior recognition based on convolutional neural networks has attracted the attention of a large number of researchers.
The behavior recognition method based on trajectory and convolutional neural network feature extraction, with patent number CN201611117772.9, first extracts trajectories from the input video data, then extracts convolutional features with a convolutional neural network, then combines the trajectory and convolutional-layer features to obtain trajectory-constrained convolutional features and stacked local Fisher vector features, and finally trains a support vector machine model for classification.
In the human behavior recognition method based on a 3D convolutional neural network, with patent number CN201510527937.9, images with obvious human behavior characteristics are first screened and stored; five channels of information in total, comprising the gray scale, the gradients in the x and y directions, and the optical flow, are then extracted from the stored images; the convolutional features of the five channels are extracted with a convolutional neural network; and finally classification is performed.
Both methods need to extract low-level motion information from the video data in advance and cannot feed the raw video data directly into a network, so end-to-end classification and prediction cannot be realized.
The method for recognizing behaviors based on deep learning and multi-scale information, with patent number CN201610047682.0, first splits a deep video into a plurality of video segments, then learns each video segment with a branch neural network, then simply fuses and concatenates the high-level representations learned by the parallel neural network branches, and finally sends the fused high-level representation into a fully connected layer and a classification layer for classification and recognition. When a long video is input, the dimensionality of the fused features becomes too high, making the network difficult to train.
In summary, although there are many studies on action recognition based on convolutional neural networks at home and abroad, these methods either require manual extraction of motion information from the video data in advance or cannot process videos of long duration.
Disclosure of Invention
The invention aims to provide a human behavior recognition method combining 3D (three-dimensional) jump layer connection and a recurrent neural network that does not require low-level motion information to be extracted manually.
In order to achieve the above purpose, the solution of the invention is:
a human behavior recognition method combining 3D jump layer connection and a recurrent neural network comprises the following steps:
step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers;
step 2, performing space-time feature extraction on the video by using the trained 3D convolutional neural network, and performing series connection on the space-time features of different layers to obtain a high-dimensional feature vector;
step 3, carrying out normalization processing on the high-dimensional feature vector obtained in the step 2;
step 4, sending the high-dimensional feature vector subjected to the normalization processing in the step 3 into a recurrent neural network for feature fusion;
and 5, classifying the features fused in the step 4 to obtain action categories corresponding to the videos.
In step 1, if the total number of frames in the video is less than 48, the video is discarded, and if the total number of frames is not divisible by L, the trailing frames are discarded.
In step 1, each video is divided into N parts and L frames of pictures are extracted from each part as follows: one video is divided into N = 3 parts evenly by frame count, each part containing the same number of frames, and L = 16 frames of pictures are extracted from each part at equal intervals.
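As a purely illustrative aid, the following Python sketch shows one way to select the frame indices for this segmentation: N = 3 parts, L = 16 frames per part at equal intervals, a 48-frame minimum, and trailing frames dropped so that all parts contain the same number of frames. The function name and the exact spacing formula are assumptions, since the description only states that frames are taken at equal intervals.

```python
import numpy as np

def sample_clip_indices(total_frames, n_parts=3, frames_per_part=16):
    """Split a video into n_parts equal segments by frame count and pick
    frames_per_part frame indices at (approximately) equal intervals from
    each segment.  Returns None when the video is shorter than
    n_parts * frames_per_part frames, mirroring the 48-frame minimum."""
    if total_frames < n_parts * frames_per_part:
        return None                                   # video is discarded
    usable = total_frames - total_frames % n_parts    # drop trailing frames
    part_len = usable // n_parts
    indices = []
    for p in range(n_parts):
        start = p * part_len
        # equally spaced indices inside this part
        idx = np.linspace(start, start + part_len - 1, frames_per_part).astype(int)
        indices.append(idx)
    return np.stack(indices)                          # shape (3, 16)

# e.g. a 100-frame video yields a (3, 16) array of frame indices
print(sample_clip_indices(100).shape)
```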
The specific process of step 2 is as follows:
transfer learning: the trained convolution and pooling layers of a C3D network are used as the feature extractor to perform space-time feature extraction on each 16-frame input from step 1, producing a pool5num-dimensional output vector; the space-time feature extraction is performed over the whole video, and the extracted result is expressed as a two-dimensional tensor (3, pool5num), where pool5num denotes the output dimension of pooling layer 5 of the feature extractor;
series connection: for each 16-frame input, the outputs of pooling layers 1, 2, 3 and 5 of the feature extractor are connected in series to obtain a pool_num-dimensional feature vector; this feature series-connection operation is performed over the entire video, and the concatenated result is represented by a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num denote the output dimensions of pooling layers 1, 2 and 3 of the feature extractor, respectively.
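The series connection can be illustrated with the following PyTorch sketch. It is a simplified stand-in for the C3D feature extractor (one convolution per block instead of the original two or three, and randomly initialized rather than pre-trained weights), intended only to show how the pool1, pool2, pool3 and pool5 outputs are flattened and concatenated into one pool_num-dimensional vector per 16-frame input; the channel sizes follow the common C3D layout and are assumptions, not values given in this description.

```python
import torch
import torch.nn as nn

class SkipC3DExtractor(nn.Module):
    """Simplified C3D-style extractor that exposes pooling layers 1, 2, 3, 5
    and concatenates their flattened outputs (the 3D jump layer connection)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool3d(pool))
        self.b1 = block(3,   64,  (1, 2, 2))   # -> pool1
        self.b2 = block(64,  128, (2, 2, 2))   # -> pool2
        self.b3 = block(128, 256, (2, 2, 2))   # -> pool3
        self.b4 = block(256, 512, (2, 2, 2))   # -> pool4 (not concatenated)
        self.b5 = block(512, 512, (2, 2, 2))   # -> pool5

    def forward(self, clip):                   # clip: (B, 3, 16, 112, 112)
        p1 = self.b1(clip)
        p2 = self.b2(p1)
        p3 = self.b3(p2)
        p5 = self.b5(self.b4(p3))
        # flatten each pooled feature map and connect them in series:
        # pool_num = pool1num + pool2num + pool3num + pool5num
        feats = [torch.flatten(p, 1) for p in (p1, p2, p3, p5)]
        return torch.cat(feats, dim=1)         # (B, pool_num)

# one 16-frame segment: batch of 1, 3 RGB channels, 16 frames of 112 x 112 pixels
vec = SkipC3DExtractor()(torch.randn(1, 3, 16, 112, 112))
```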
In step 3, the normalization processing is performed as follows:
The mean E[x^(k)] and variance Var[x^(k)] of each dimension k of the high-dimensional feature vector from step 2 are computed over the whole training set, and each dimension of the feature vector is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where x^(k) denotes the activation value and x̂^(k) denotes the value after normalization.
Then x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after the normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are parameters of the recurrent neural network, obtained by network learning.
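For clarity, a minimal sketch of this normalization is given below; it is essentially the standard batch-normalization transform. The small epsilon term is a numerical-stability detail added for the sketch and is not stated in the text above; the function name is an assumption.

```python
import torch

def normalize_features(x, gamma, beta, eps=1e-5):
    """Per-dimension normalization of the concatenated features:
    x_hat = (x - E[x]) / sqrt(Var[x]) followed by y = gamma * x_hat + beta.
    x: (num_samples, pool_num) features over the whole training set;
    gamma, beta: (pool_num,) parameters learned together with the network."""
    mean = x.mean(dim=0)                 # E[x^(k)] over the training set
    var = x.var(dim=0, unbiased=False)   # Var[x^(k)]
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

# in a framework implementation this is typically realized with
# torch.nn.BatchNorm1d(pool_num), whose weight and bias play the roles of
# gamma^(k) and beta^(k)
```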
In step 4, the high-dimensional feature vector normalized in step 3 is sent into a recurrent neural network for feature fusion as follows: the normalized two-dimensional tensor (3, pool_num) is fed into a recurrent neural network whose time step is 3 and which contains one hidden layer with 256 neurons.
In step 5, the output of the recurrent neural network from step 4 is linearly classified using a multi-class Softmax classifier.
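Steps 3 to 5 together might be sketched as the following PyTorch module. The choice of a vanilla RNN cell (rather than an LSTM or GRU) and the names pool_num and num_classes are assumptions; the description above only specifies a recurrent network with time step 3, one hidden layer of 256 neurons, a preceding normalization layer, and a multi-class Softmax classifier on its output.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Normalization + temporal feature fusion + classification head: the three
    normalized feature vectors of one video (time step = 3) pass through a
    single-hidden-layer recurrent network with 256 units, and the last hidden
    state is mapped to class scores."""
    def __init__(self, pool_num, num_classes, hidden=256):
        super().__init__()
        self.bn  = nn.BatchNorm1d(pool_num)        # normalization layer (step 3)
        self.rnn = nn.RNN(pool_num, hidden, num_layers=1, batch_first=True)
        self.fc  = nn.Linear(hidden, num_classes)  # linear layer of the Softmax classifier

    def forward(self, feats):                      # feats: (B, 3, pool_num)
        b, t, d = feats.shape
        x = self.bn(feats.reshape(b * t, d)).reshape(b, t, d)
        _, h_n = self.rnn(x)                       # h_n: (1, B, 256)
        return self.fc(h_n[-1])                    # class scores (logits)

# prediction: class probabilities via Softmax over the scores, e.g.
# probs = torch.softmax(FusionClassifier(4096, 10)(torch.randn(2, 3, 4096)), dim=1)
```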
After the above scheme is adopted, the invention has the following beneficial effects:
(1) the space-time information of the video is extracted directly with a C3D network, and no motion information needs to be extracted from the video data in advance, so an end-to-end recognition mode is realized;
(2) the feature information of different levels extracted by the convolution kernels is connected in series; compared with manually designed low-level motion features, the low-level space-time information output by the convolution kernels is more robust and more comprehensive;
(3) the features of different levels in the feature extractor are connected in series to obtain high-order feature vectors containing information from different levels, which noticeably improves the recognition accuracy;
(4) the high-dimensional feature vector is normalized to accelerate network convergence;
(5) the recurrent neural network performs further temporal fusion of the normalized feature vectors, so the whole network structure can process long video inputs.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network architecture diagram of the present invention;
FIG. 3 is a detailed view of the recurrent neural network.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a human behavior recognition method combining 3D jump layer connection and a recurrent neural network, and the specific process is embodied in the following steps:
and (3) video segmentation, namely dividing a video into 3 parts according to the average frame number, extracting 16 frames of pictures from each part at equal intervals to form a segment, wherein if the total frame number of the video is less than 48 frames, the video is discarded, and if the total frame number of the video cannot be divided by 3, the last frames are discarded.
After the video segmentation is finished, one video can be represented as 5-dimensional tensors (3, 16, H, W, 3), and each 16-frame segment can be represented as 4-dimensional tensors (16, H, W, 3), where 3 represents that the video is uniformly divided into 3 parts, 16 represents that 16 frames of pictures are extracted from each part, H and W represent the length and width of the pictures respectively, and 3 represents the number of channels of the pictures, here representing RGB pictures.
The videos of the training set are divided according to the principle, each video in the divided training set is represented as a 5-dimensional tensor (3, 16, H, W, 3), each video is scaled to a size of 3 × 16 × 128 × 171 × 3, each video can be represented as a 5-dimensional tensor (3, 16, 128, 171, 3), 16 represents the number of frames of each segment, and 128, 171, 3 represent the length, width and channel number of each frame of the picture respectively.
For a video, 5-dimensional tensors (3, 16, 128, 171, 3) are converted into 3 representations of 4-dimensional tensors (16, 128, 171, 3).
Following the previous step, all training set data is transformed into (16, 128, 171, 3) form, where each video includes 3 consecutive 4-dimensional tensors (16, 128, 171, 3).
All 4-dimensional tensors (16, 128, 171, 3) of the training set are averaged, and the obtained average value is represented by (16, 128, 171, 3) which is a 4-dimensional tensor mean.
Subtracting mean from all segments in the training set (16, 128, 171, 3) so that each pixel value in the training set is distributed around zero, this step can eliminate the effect of noise on the classification.
For a video, 3 consecutive 4-dimensional tensors (16, 128, 171, 3) with the mean subtracted are converted into 5-dimensional tensors (3, 16, 128, 171, 3).
And converting all video data in the training set into a 5-dimensional tensor (3, 16, 128, 171 and 3) representation form according to the previous step, and cutting the 5-dimensional tensor after the mean value reduction processing into a size of (3, 16, 112, 112 and 3).
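A possible NumPy sketch of this preprocessing (mean subtraction followed by cropping to 112 × 112) is shown here. Whether the training-time crop is random or centered is not specified above, so the random crop used for training in this sketch is an assumption borrowed from common C3D practice; at prediction time the description below uses a center crop.

```python
import numpy as np

def preprocess_video(video, mean, crop=112, train=True):
    """Preprocess one segmented video of shape (3, 16, 128, 171, 3):
    subtract the training-set mean clip and crop every frame to crop x crop.
    'mean' is the (16, 128, 171, 3) average over all training clips."""
    video = video.astype(np.float32) - mean            # broadcasts over the 3 clips
    h, w = video.shape[2], video.shape[3]
    if train:                                          # assumed random crop for training
        top  = np.random.randint(0, h - crop + 1)
        left = np.random.randint(0, w - crop + 1)
    else:                                              # center crop at prediction time
        top, left = (h - crop) // 2, (w - crop) // 2
    return video[:, :, top:top + crop, left:left + crop, :]   # (3, 16, 112, 112, 3)

# training-set mean: average of all (16, 128, 171, 3) clips
# mean = np.mean(np.stack(all_training_clips, axis=0), axis=0)
```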
The processed videos are fed into the C3D feature extractor: for each video, the three 16-frame segments are fed in succession, i.e. three 4-dimensional tensors (16, 112, 112, 3), each producing a pool5num-dimensional vector, so each video's features are finally expressed as a 2-dimensional tensor (3, pool5num), where pool5num denotes the output dimension of pooling layer 5 of the feature extractor.
For each video, the outputs of pooling layers 1, 2, 3 and 5 of the feature extractor are connected in series, as shown in FIG. 2; the concatenated high-dimensional features are represented by a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num are the output dimensions of pooling layers 1, 2 and 3 of the feature extractor, respectively.
The whole training set is passed through the feature extractor and the series-connection operation to obtain the high-dimensional feature training data.
The obtained high-dimensional feature training data is then input into a recurrent neural network; as shown in FIG. 2, a normalization layer is applied before the data enters the recurrent neural network to speed up and improve network convergence.
The normalization operation consists of two steps. First, the features are standardized: the mean E[x^(k)] and variance Var[x^(k)] of each dimension of the high-dimensional features are computed over the whole training set, and each activation input x^(k) is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

Second, in order not to change the expressive power of the feature vector, x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are obtained by network learning.
The recurrent neural network parameters and the parameters γ^(k), β^(k) are trained using back-propagation to obtain the trained network.
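A minimal back-propagation sketch follows, reusing the FusionClassifier module from the earlier sketch: only the normalization parameters γ^(k), β^(k), the recurrent layer and the classifier are updated, while the C3D extractor stays frozen. The optimizer, learning rate, batch size and random placeholder data are assumptions and not part of this description.

```python
import torch
import torch.nn as nn

pool_num, num_classes = 4096, 10                     # illustrative sizes only
model = FusionClassifier(pool_num, num_classes)      # defined in the earlier sketch
criterion = nn.CrossEntropyLoss()                    # Softmax cross-entropy over class scores
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

feats  = torch.randn(8, 3, pool_num)                 # placeholder fused-feature batch
labels = torch.randint(0, num_classes, (8,))         # placeholder action labels
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()                                  # gradients for RNN, BN (gamma, beta), FC
    optimizer.step()
```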
When an input video is to be predicted, it is divided into 3 equal parts by frame count, and 16 frames of pictures are extracted at equal intervals from each part to form a segment, so the video can be expressed as a 5-dimensional tensor (3, 16, H, W, 3).
The video to be predicted (3, 16, H, W, 3) is scaled to a size of (3, 16, 128, 171, 3), the tensor mean is subtracted from each 16-frame segment (16, 128, 171, 3), and each frame of the picture is cropped at its center, so the processed video to be predicted can be represented as a 5-dimensional tensor (3, 16, 112, 112, 3).
The processed video to be predicted (3, 16, 112, 112, 3) is converted into 3 4-dimensional tensors (16, 112, 112, 3), which are fed into the network in sequence to obtain the concatenated high-dimensional features (3, pool_num).
The high-dimensional features (3, pool_num) of the video to be predicted are fed into the trained BN layer and the recurrent neural network to obtain the prediction output.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (3)

1. A human behavior recognition method combining 3D jump layer connection and a recurrent neural network is characterized by comprising the following steps:
step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers; the L frames are extracted from each part as follows: one video is divided into N = 3 parts evenly by frame count, each part containing the same number of frames, and L = 16 frames are extracted from each part at equal intervals;
step 2, performing space-time feature extraction on the video by using the trained 3D convolutional neural network, and performing series connection on the space-time features of different layers to obtain a high-dimensional feature vector; the specific process is as follows:
transfer learning: the trained convolution and pooling layers of a C3D network are used as the feature extractor to perform space-time feature extraction on each 16-frame input of step 1, producing a pool5num-dimensional output vector; the space-time feature extraction is performed on the whole video, and the extracted result is expressed as a two-dimensional tensor (3, pool5num), where pool5num is the output dimension of pooling layer 5 of the feature extractor;
series connection: for each 16-frame input, the outputs of pooling layer 1, pooling layer 2, pooling layer 3 and pooling layer 5 of the feature extractor are connected in series to obtain a pool_num-dimensional feature vector; the feature series-connection operation is performed on the whole video, and the concatenated result is expressed as a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num are the output dimensions of pooling layer 1, pooling layer 2 and pooling layer 3 of the feature extractor, respectively;
step 3, carrying out normalization processing on the high-dimensional feature vector obtained in step 2; the normalization processing is performed as follows:
the mean E[x^(k)] and variance Var[x^(k)] of each dimension k of the high-dimensional feature vector from step 2 are computed over the whole training set, and each dimension of the feature vector is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where x^(k) denotes the activation value and x̂^(k) denotes the value after normalization;
then x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after the normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are parameters of the recurrent neural network, obtained by network learning;
step 4, sending the high-dimensional feature vector normalized in step 3 into a recurrent neural network for feature fusion, specifically: the normalized two-dimensional tensor (3, pool_num) is fed into a recurrent neural network whose time step is 3 and which contains one hidden layer with 256 neurons;
and 5, classifying the features fused in the step 4 to obtain action categories corresponding to the videos.
2. The human behavior recognition method combining 3D jump layer connection and a recurrent neural network of claim 1, wherein: in step 1, if the total number of frames of the video is less than 48, the video is discarded, and if the total number of frames of the video is not divisible by L, the trailing frames are discarded.
3. The human behavior recognition method combining 3D jump layer connection and a recurrent neural network of claim 1, wherein: in step 5, the output of the recurrent neural network in step 4 is linearly classified using a multi-class Softmax classifier.
CN201810394571.6A 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network Active CN108647599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810394571.6A CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810394571.6A CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Publications (2)

Publication Number Publication Date
CN108647599A CN108647599A (en) 2018-10-12
CN108647599B (en) 2022-04-15

Family

ID=63747937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810394571.6A Active CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Country Status (1)

Country Link
CN (1) CN108647599B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961037A (en) * 2019-03-20 2019-07-02 中共中央办公厅电子科技学院(北京电子科技学院) A kind of examination hall video monitoring abnormal behavior recognition methods
CN109977854A (en) * 2019-03-25 2019-07-05 浙江新再灵科技股份有限公司 Unusual checking analysis system under a kind of elevator monitoring environment
CN110839156A (en) * 2019-11-08 2020-02-25 北京邮电大学 Future frame prediction method and model based on video image
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN112449155A (en) * 2020-10-21 2021-03-05 苏州怡林城信息科技有限公司 Video monitoring method and system for protecting privacy of personnel
CN112863482B (en) * 2020-12-31 2022-09-27 思必驰科技股份有限公司 Speech synthesis method and system with rhythm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3465532A1 (en) * 2016-06-07 2019-04-10 Toyota Motor Europe Control device, system and method for determining the perceptual load of a visual and dynamic driving scene
CN106599907B (en) * 2016-11-29 2019-11-29 北京航空航天大学 The dynamic scene classification method and device of multiple features fusion
CN107506712B (en) * 2017-08-15 2021-05-18 成都考拉悠然科技有限公司 Human behavior identification method based on 3D deep convolutional network
CN107811626A (en) * 2017-09-10 2018-03-20 天津大学 A kind of arrhythmia classification method based on one-dimensional convolutional neural networks and S-transformation

Also Published As

Publication number Publication date
CN108647599A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647599B (en) Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN110119757B (en) Model training method, video category detection method, device, electronic equipment and computer readable medium
CN109829443B (en) Video behavior identification method based on image enhancement and 3D convolution neural network
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
Sixt et al. Rendergan: Generating realistic labeled data
CN110097131B (en) Semi-supervised medical image segmentation method based on countermeasure cooperative training
CN103971137B (en) Based on the three-dimensional dynamic human face expression recognition method that structural sparse features learn
CN109784280A (en) Human bodys' response method based on Bi-LSTM-Attention model
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN109508669A (en) A kind of facial expression recognizing method based on production confrontation network
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN110135386B (en) Human body action recognition method and system based on deep learning
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN109948721A (en) A kind of video scene classification method based on video presentation
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
Yang et al. Deeplab_v3_plus-net for image semantic segmentation with channel compression
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112633153A (en) Facial expression motion unit identification method based on space-time graph convolutional network
CN113420703B (en) Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN105956604B (en) Action identification method based on two-layer space-time neighborhood characteristics
CN112990340B (en) Self-learning migration method based on feature sharing
CN113255464A (en) Airplane action recognition method and system
CN113705384A (en) Facial expression recognition method considering local space-time characteristics and global time sequence clues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant