CN110956085A - Human behavior recognition method based on deep learning - Google Patents

Human behavior recognition method based on deep learning

Info

Publication number
CN110956085A
Authority
CN
China
Prior art keywords
video
training
model
network
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911007261.5A
Other languages
Chinese (zh)
Inventor
朱艺
衣杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University
Priority to CN201911007261.5A
Publication of CN110956085A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human behavior recognition method based on deep learning, which comprises an improved model M based on 3D ResNet, where the improved model M is a deep-learning-based video feature extraction model; a method for preprocessing the input data using a TSN (Temporal Segment Network) method; and an application method that extracts the temporal and spatial features of a video and uses these features to recognize human behavior in the video. When a user supplies a video as input, the behavior occurring in the video can be identified from it.

Description

Human behavior recognition method based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a human behavior recognition method based on deep learning.
Background
With the rapid growth of the internet in recent years, the network has become the primary means by which people entertain themselves and obtain information, and in the process the internet has accumulated an enormous amount of video data. Statistics show that as much as 35 hours of video are uploaded to YouTube every minute. How to process such large amounts of video data has become a challenge. Computer vision has therefore come to the fore, and human behavior recognition in particular has attracted extensive attention in both academia and industry.
Human action recognition in video is a classic subject in computer vision and image processing because of its wide application in video surveillance, human-computer interfaces and similar fields. However, many challenges remain in action recognition, such as efficient extraction of multi-range spatio-temporal features. Recently proposed spatio-temporal feature extractors fall roughly into two categories: long-term features and short-term features.
Trajectory-based feature extraction from video is essentially short-term. The technique is robust and simple because its short-term extraction is local and repeatable. Long-term features have more discriminative power than short-term features because they have the opposite properties: they are more discriminative between classes, although they remain sensitive to within-class variability.
More specifically, when a framework captures only short-term spatio-temporal information, it is difficult to distinguish front crawl from breaststroke. Conversely, extracting short-term spatio-temporal features is more effective for recognizing an action such as walking a dog. A powerful action recognition system should therefore be able to distinguish different classes of actions across multiple contexts, so capturing information over multiple spatio-temporal ranges is very important and beneficial.
Deep-learning-based methods such as TSN and I3D have obtained good results in computer vision, and in particular have markedly improved the overall accuracy on large-scale datasets with complex behaviors. However, further improving the recognition rate on complex video datasets by capturing information over multiple spatio-temporal ranges remains a challenge.
Disclosure of Invention
The invention provides a human behavior recognition method based on deep learning; the algorithm enables the extracted information to combine the repeatability of short-term features with the discriminability of long-term features.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a back-end interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
Further, the specific process of step S1 is:
s11: decomposing the whole video file into RGB images, and expressing X ═ X1, X2 …, xn by a vector matrix, wherein n is the video frame number, and the dimension of the vector matrix X is the preprocessing size 224X 224 of the photos;
s12: for data in the vector matrix, a new vector matrix XT is extracted [ XT1, XT2 …, XTs ] according to a rule that one of K RGB images is selected at intervals of K, where s is the number of extracted video frames, and the dimension of the vector matrix XT is the same as that of the vector matrix X, and is 224 × 224.
Further, the specific process of step S2 is:
s21: establishing a new space-time conversion layer T, wherein the new space-time conversion layer T consists of a Non-Local module and a plurality of convolutions with small and large dimensions, and a model is made to learn time-space information in the layer;
s22: establishing an improved network M based on ResNet3D, embedding the constructed new conversion layer into ResNet3D to complete the improved network M, and inputting the preprocessed matrix vector XT into the network to obtain a group of output eigenvectors with the length of n bits.
Further, the specific process of step S3 is:
s31: the selection of the dataset is the reference dataset HMDB51, and UCF101, and the data in the dataset is represented by 4: 1, dividing the ratio into a training set and a testing set;
s32: training the improved network M, wherein the training steps are as follows: video features are extracted by the M network, and an M network model is trained by minimizing a loss function L1, so that the model is optimal as much as possible;
s33: the improved network M is tested.
Further, the specific process of step S33 is:
s331: pre-training an improved model M by using a Kinetics video data set, finely adjusting the model by using an HMDB51 and UCF101 data set, generating a group of k-sized feature vectors by using the trained M model of each video, wherein k is the classification number of the data set, and converting the k-sized feature vectors into the self-defined n-length feature numbers by using a Linear layer;
s332: in the training of the network, cross entropy loss and triplet loss are adopted as loss functions, and the size of the loss value is obtained according to the weighted sum of the cross entropy loss and the triplet loss. The optimization function is a stochastic gradient descent algorithm in the training process, dropout is used for preventing the network from being over-fitted, the initial learning rate is set to be 0.01, and in the subsequent training, the epoch learning rate is reduced to 0.1 time after every 60 training.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
1. The method provides a multi-stage feature extraction technique that effectively captures multi-range spatio-temporal information; it strikes a balance between short-term and long-term feature extraction, so that the extracted information combines the repeatability of short-term features with the discriminability of long-term features;
2. The invention defines a new feature fusion mode: when spatial and temporal information are fused, weighted splicing is used instead of plain dimension splicing, so that the captured features are enhanced and a more effective video representation is formed (a sketch of such weighted fusion is given after this list);
3. The invention preprocesses the input data, which reduces redundant computation over repeated consecutive frames, reduces video data storage and accelerates the training of the model.
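The weighted splicing mentioned in item 2 could, under the assumption of learnable scalar weights (an interpretation by the editor, not a detail given in the patent), look like the following PyTorch sketch:

    import torch
    import torch.nn as nn

    class WeightedFusion(nn.Module):
        """Fuse spatial and temporal features by weighted splicing rather than
        plain dimension splicing; the weights are learned during training."""
        def __init__(self):
            super().__init__()
            self.w_spatial = nn.Parameter(torch.tensor(1.0))
            self.w_temporal = nn.Parameter(torch.tensor(1.0))

        def forward(self, spatial_feat, temporal_feat):
            return torch.cat([self.w_spatial * spatial_feat,
                              self.w_temporal * temporal_feat], dim=1)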
Drawings
FIG. 1 is a schematic flow diagram of the network M of the present invention;
FIG. 2 is a flow chart of the conversion layer T of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1 and FIG. 2, a human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a back-end interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
The specific process of step S1 is:
s11: decomposing the whole video file into RGB images, and expressing X ═ X1, X2 …, xn by a vector matrix, wherein n is the video frame number, and the dimension of the vector matrix X is the preprocessing size 224X 224 of the photos;
s12: for data in the vector matrix, a new vector matrix XT is extracted [ XT1, XT2 …, XTs ] according to a rule that one of K RGB images is selected at intervals of K, where s is the number of extracted video frames, and the dimension of the vector matrix XT is the same as that of the vector matrix X, and is 224 × 224.
The specific process of step S2 is:
s21: establishing a new space-time conversion layer T, wherein the new space-time conversion layer T consists of a Non-Local module and a plurality of convolutions with small and large dimensions, and a model is made to learn time-space information in the layer;
s22: establishing an improved network M based on ResNet3D, embedding the constructed new conversion layer into ResNet3D to complete the improved network M, and inputting the preprocessed matrix vector XT into the network to obtain a group of output eigenvectors with the length of n bits.
The specific process of step S3 is:
s31: the selection of the dataset is the reference dataset HMDB51, and UCF101, and the data in the dataset is represented by 4: 1, dividing the ratio into a training set and a testing set;
s32: training the improved network M, wherein the training steps are as follows: video features are extracted by the M network, and an M network model is trained by minimizing a loss function L1, so that the model is optimal as much as possible;
s33: the improved network M is tested.
The specific process of step S33 is:
s331: pre-training an improved model M by using a Kinetics video data set, finely adjusting the model by using an HMDB51 and UCF101 data set, generating a group of k-sized feature vectors by using the trained M model of each video, wherein k is the classification number of the data set, and converting the k-sized feature vectors into the self-defined n-length feature numbers by using a Linear layer;
s332: in the training of the network, cross entropy loss and triplet loss are adopted as loss functions, and the size of the loss value is obtained according to the weighted sum of the cross entropy loss and the triplet loss. The optimization function is a stochastic gradient descent algorithm in the training process, dropout is used for preventing the network from being over-fitted, the initial learning rate is set to be 0.01, and in the subsequent training, the epoch learning rate is reduced to 0.1 time after every 60 training.
Example 2
An experiment on the recognition effect of the video human behavior recognition method based on a multi-range spatio-temporal network comprises the following steps:
1. Experimental datasets: the Kinetics video dataset (400 action categories in total), the UCF101 dataset (101 action categories in total) and the HMDB51 dataset (51 action categories in total). The Kinetics dataset is used as the pre-training dataset for the model M, while the other two, HMDB51 and UCF101, are the benchmark datasets for human behavior recognition; all three datasets are sourced from YouTube. The basic statistics of the datasets are shown in the following table:
[Table not reproduced in the text: basic statistics of the Kinetics, UCF101 and HMDB51 datasets.]
2. Experimental environment: 4 NVIDIA GTX 1080 Ti GPUs;
3. The experimental method comprises the following steps:
① Preprocess the input video clip: extract one frame every K frames, resize each frame to 224 × 224, and splice all processed frames into a matrix of size s × 224 × 224.
② Construct the network M; the flow of M is shown in FIG. 1 and the conversion layer T in FIG. 2.
③ Train the network M: the s × 224 × 224 matrix from the first step is fed into the network M, the network computes the cross-entropy loss, and training stops when the loss function reaches its minimum. The network finally maps the input to an output h1, ..., hn, where n is the number of classes in the dataset (for HMDB51, n = 51 and the output has length 51; for UCF101, n = 101 and the output has length 101). Here hi is the probability that the input video belongs to class i, and the class i with the largest hi is taken as the predicted class.
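The final prediction in step ③, taking the class with the largest hi, corresponds to an argmax over the network output; a small illustrative snippet (model and clip are assumed to exist):

    import torch

    with torch.no_grad():
        h = torch.softmax(model(clip.unsqueeze(0)), dim=1)   # h1 ... hn as probabilities
        predicted_class = h.argmax(dim=1).item()             # class i with the largest hi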
4. Evaluation criteria:
average recognition rate: the formula is as follows:
[Formula image not reproduced in the text: the average recognition rate, defined in terms of h(Vk), Ci, |V| and NC as described below.]
where Vk is a video sequence, Ci is the set of video sequences belonging to category i, h(Vk) is the predicted category of sequence Vk, |V| is the total number of video sequences, and NC is the number of action categories.
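Since the formula image is not reproduced here, the metric is illustrated below under the mean per-class interpretation of the description (an assumption of the editor): for each category i, the fraction of sequences in Ci whose prediction h(Vk) equals i, averaged over the NC categories.

    from collections import defaultdict

    def average_recognition_rate(predictions, labels, num_classes):
        """Mean per-class recognition rate: per-class accuracy averaged over classes."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for pred, label in zip(predictions, labels):
            total[label] += 1
            correct[label] += int(pred == label)
        rates = [correct[i] / total[i] for i in range(num_classes) if total[i] > 0]
        return sum(rates) / len(rates)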
5. The experimental results are as follows:
[Tables not reproduced in the text: recognition accuracy of model M compared with prior methods on the UCF101 and HMDB51 datasets.]
Trained on all three splits of the UCF101 and HMDB51 datasets, our network M reaches the state of the art, obtaining 95.7% accuracy on UCF101 and 71.5% accuracy on HMDB51. Notably, model M outperforms the previous state-of-the-art models on both datasets.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A human behavior recognition method based on deep learning, characterized by comprising the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a back-end interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
2. The deep learning based human behavior recognition method according to claim 1, wherein the specific process of step S1 is:
s11: decomposing the whole video file into RGB images, and expressing X ═ X1, X2 …, xn by a vector matrix, wherein n is the video frame number, and the dimension of the vector matrix X is the preprocessing size 224X 224 of the photos;
s12: for data in the vector matrix, a new vector matrix XT is extracted [ XT1, XT2 …, XTs ] according to a rule that one of K RGB images is selected at intervals of K, where s is the number of extracted video frames, and the dimension of the vector matrix XT is the same as that of the vector matrix X, and is 224 × 224.
3. The deep learning based human behavior recognition method according to claim 2, wherein the specific process of step S2 is:
s21: establishing a new space-time conversion layer T, wherein the new space-time conversion layer T consists of a Non-Local module and a plurality of convolutions with small and large dimensions, and a model is made to learn time-space information in the layer;
s22: establishing an improved network M based on ResNet3D, embedding the constructed new conversion layer into ResNet3D to complete the improved network M, and inputting the preprocessed matrix vector XT into the network to obtain a group of output eigenvectors with the length of n bits.
4. The deep learning based human behavior recognition method according to claim 3, wherein the specific process of step S3 is:
s31: the selection of the dataset is the reference dataset HMDB51, and UCF101, and the data in the dataset is represented by 4: 1, dividing the ratio into a training set and a testing set;
s32: training the improved network M, wherein the training steps are as follows: video features are extracted by the M network, and an M network model is trained by minimizing a loss function L1, so that the model is optimal as much as possible;
s33: the improved network M is tested.
5. The deep learning based human behavior recognition method according to claim 4, wherein the specific process of step S33 is:
s331: pre-training an improved model M by using a Kinetics video data set, finely adjusting the model by using an HMDB51 and UCF101 data set, generating a group of k-sized feature vectors by using the trained M model of each video, wherein k is the classification number of the data set, and converting the k-sized feature vectors into the self-defined n-length feature numbers by using a Linear layer;
s332: in the training of the network, cross entropy loss and triplet loss are adopted as loss functions, and the size of the loss value is obtained according to the weighted sum of the cross entropy loss and the triplet loss. The optimization function is a stochastic gradient descent algorithm in the training process, dropout is used for preventing the network from being over-fitted, the initial learning rate is set to be 0.01, and in the subsequent training, the epoch learning rate is reduced to 0.1 time after every 60 training.
CN201911007261.5A 2019-10-22 2019-10-22 Human behavior recognition method based on deep learning Pending CN110956085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911007261.5A CN110956085A (en) 2019-10-22 2019-10-22 Human behavior recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911007261.5A CN110956085A (en) 2019-10-22 2019-10-22 Human behavior recognition method based on deep learning

Publications (1)

Publication Number Publication Date
CN110956085A 2020-04-03

Family

ID=69975735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911007261.5A Pending CN110956085A (en) 2019-10-22 2019-10-22 Human behavior recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN110956085A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning
CN110348381A (en) * 2019-07-11 2019-10-18 电子科技大学 A kind of video behavior recognition methods based on deep learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220058396A1 (en) * 2019-11-19 2022-02-24 Tencent Technology (Shenzhen) Company Limited Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium
US11967152B2 (en) * 2019-11-19 2024-04-23 Tencent Technology (Shenzhen) Company Limited Video classification model construction method and apparatus, video classification method and apparatus, device, and medium
CN111860278A (en) * 2020-07-14 2020-10-30 陕西理工大学 Human behavior recognition algorithm based on deep learning
CN111860278B (en) * 2020-07-14 2024-05-14 陕西理工大学 Human behavior recognition algorithm based on deep learning
CN112949406A (en) * 2021-02-02 2021-06-11 西北农林科技大学 Sheep individual identity recognition method based on deep learning algorithm

Similar Documents

Publication Publication Date Title
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
CN110956085A (en) Human behavior recognition method based on deep learning
CN111091045A (en) Sign language identification method based on space-time attention mechanism
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN107862300A (en) A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks
CN113158723B (en) End-to-end video motion detection positioning system
CN112836646B (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN107818307B (en) Multi-label video event detection method based on LSTM network
CN108647599B (en) Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN112084891B (en) Cross-domain human body action recognition method based on multi-modal characteristics and countermeasure learning
CN114596520A (en) First visual angle video action identification method and device
CN110427831B (en) Human body action classification method based on fusion features
CN110956059B (en) Dynamic gesture recognition method and device and electronic equipment
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN110889335B (en) Human skeleton double interaction behavior identification method based on multichannel space-time fusion network
CN114708649A (en) Behavior identification method based on integrated learning method and time attention diagram convolution
CN112906520A (en) Gesture coding-based action recognition method and device
CN115908896A (en) Image identification system based on impulse neural network with self-attention mechanism
CN113076905B (en) Emotion recognition method based on context interaction relation
Kumar et al. Content based movie scene retrieval using spatio-temporal features
CN114202787A (en) Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism
CN111339888B (en) Double interaction behavior recognition method based on joint point motion diagram
CN112508121A (en) Method and system for sensing outside by industrial robot
CN112528077A (en) Video face retrieval method and system based on video embedding
CN112417989A (en) Invigilator violation identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
Effective date of abandoning: 20231215