CN110956085A - Human behavior recognition method based on deep learning - Google Patents
- Publication number
- CN110956085A (application number CN201911007261.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- training
- model
- network
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention provides a human behavior recognition method based on deep learning, comprising an improved model M based on 3D ResNet, where the improved model M is a deep-learning-based video feature extraction model; a method for preprocessing the input data using the TSN (Temporal Segment Network) sampling strategy; and an application method that describes how to extract the temporal and spatial features of a video and use these features to recognize the human behaviors in it. When a user supplies a video as input, the behaviors in that video can be recognized.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a human behavior recognition method based on deep learning.
Background
With the rapid growth of the internet in recent years, networks have become the primary means by which people entertain themselves and obtain information, and in the process the internet has accumulated a large amount of video data. Statistics show that up to 35 hours of video are uploaded to YouTube every minute. Handling these large amounts of video data is now a challenge. Computer vision has therefore come to the fore, and human behavior recognition in particular has attracted extensive attention in academia and industry.
Owing to its wide application in video surveillance, human-computer interfaces, and related fields, human action recognition in video is a classic topic in computer vision and image processing. However, many challenges still exist in action recognition, such as efficient multi-range spatiotemporal feature extraction. Recently proposed spatiotemporal feature extractors fall roughly into two categories: long-term and short-term features.
Short-term features are the key to trajectory-based video feature extraction. Because its short-term extraction function is local and repetitive, the technique is robust and simple. Long-term features have the opposite properties and therefore greater discriminative power than short-term features, but they remain sensitive to within-class variability.
More specifically, when a framework captures only short-term spatiotemporal information, it is difficult to distinguish between the front crawl and the breaststroke. Conversely, extracting short-term spatiotemporal features is more effective at recognizing an action such as walking a dog. Therefore, a powerful action recognition system should be able to distinguish different action classes across multiple contexts, and capturing information over multiple spatiotemporal ranges is both important and beneficial.
Deep-learning-based methods such as TSN and I3D have achieved good results in computer vision, and in particular have markedly improved overall accuracy on large-scale datasets with complex behaviors. However, further improving the recognition rate on complex video datasets by capturing information over multiple spatiotemporal ranges remains a challenge.
Disclosure of Invention
The invention provides a human behavior recognition method based on deep learning whose algorithm allows the extracted information to combine the repeatability of short-term features with the discriminability of long-term features.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
Further, the specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
Further, the specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
Further, the specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
Further, the specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The method provides a multi-stage feature extraction technique that effectively captures multi-range spatiotemporal information; it strikes a compromise between short-term and long-term feature extraction, so that the extracted information combines the repeatability of short-term features with the discriminability of long-term features;
2. The invention defines a new feature fusion mode: when fusing spatial and temporal information, weighted concatenation is used instead of simple dimension-wise concatenation, which strengthens the captured features and forms a more effective video representation;
3. The invention preprocesses the input data, reducing the redundant computation over repeated consecutive frames, reducing video data storage, and accelerating model training.
Drawings
FIG. 1 is a schematic flow diagram of the M network of the present invention;
FIG. 2 is a flow chart of the T conversion layer of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1 and FIG. 2, a human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
The specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
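The sampling rule in S11 and S12 can be sketched as follows. This is a minimal NumPy illustration only; the exact indexing is an assumption, since the description states only that one frame out of every K is kept, and frame decoding plus resizing to 224 × 224 are assumed to have been done already:

```python
import numpy as np

def sample_frames(video: np.ndarray, k: int) -> np.ndarray:
    """Sparse sampling as in S12: keep one frame out of every k consecutive
    RGB frames, reducing redundancy between near-duplicate frames.

    video: the preprocessed frame matrix X, shape (n, 224, 224, 3).
    Returns XT with shape (s, 224, 224, 3), where s = ceil(n / k).
    """
    return video[::k]

# A 40-frame clip sampled with k = 5 yields s = 8 frames.
clip = np.zeros((40, 224, 224, 3), dtype=np.uint8)
xt = sample_frames(clip, 5)
print(xt.shape)  # (8, 224, 224, 3)
```

Sampling this way is what realizes beneficial effect 3: fewer near-duplicate frames are stored and pushed through the network.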
The specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
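S21 names a Non-Local module without giving its internals. One common form, offered here only as a plausible reading and not as the patent's confirmed design, is the embedded-Gaussian Non-Local operation, sketched in NumPy with hypothetical projection matrices standing in for learned 1×1×1 convolutions:

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian Non-Local operation on a flattened spatiotemporal
    feature map x of shape (N, C), where N = T*H*W positions: every position
    is updated with an attention-weighted sum over all positions, which is
    what lets the conversion layer capture long-range (multi-range)
    dependencies. w_theta, w_phi, w_g are (C, C) projection matrices,
    learned in a real model."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    att = theta @ phi.T                          # (N, N) pairwise similarity
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # row-wise softmax
    return x + att @ g                           # residual connection
```

Because of the residual connection, the block can fall back to the identity mapping when the learned projection w_g is near zero, which is why it can be embedded into ResNet3D without disturbing the pretrained backbone.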
The specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
The specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
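The training recipe in S332 can be sketched as follows. The weighting coefficient `alpha` and the triplet `margin` are hypothetical values, since the description states only that the loss is a weighted sum of cross-entropy and triplet loss; the learning-rate schedule follows the stated 0.01 start with a ×0.1 decay every 60 epochs:

```python
import numpy as np

def combined_loss(logits, labels, anchor, pos, neg, alpha=0.5, margin=1.0):
    """Weighted sum of cross-entropy and triplet loss (S332).
    alpha and margin are hypothetical; the patent gives no values."""
    # numerically stable log-softmax + cross-entropy
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # triplet loss on embedding vectors: pull anchor toward pos, away from neg
    d_pos = np.linalg.norm(anchor - pos, axis=1)
    d_neg = np.linalg.norm(anchor - neg, axis=1)
    tri = np.maximum(d_pos - d_neg + margin, 0.0).mean()
    return alpha * ce + (1 - alpha) * tri

def learning_rate(epoch, base_lr=0.01):
    """Initial rate 0.01, multiplied by 0.1 every 60 epochs (S332)."""
    return base_lr * (0.1 ** (epoch // 60))
```

In a real SGD loop the schedule is applied per epoch, e.g. `lr = learning_rate(epoch)` before each parameter update.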
Example 2
An experiment on the recognition performance of the disclosed video human behavior recognition method based on a multi-spatiotemporal network is as follows:
1. Experimental datasets: the Kinetics video dataset (400 action categories in total), the UCF101 dataset (101 action categories in total), and the HMDB51 dataset (51 action categories in total). The Kinetics video dataset is used to pre-train the model M, and the other two, HMDB51 and UCF101, are the benchmark datasets for human behavior recognition; all three datasets are sourced from YouTube. The basic characteristics of the datasets are shown in the following table:
2. Experimental environment: 4 NVIDIA GTX 1080 Ti GPUs;
3. The experimental method comprises the following steps:
① Preprocess the input video clip: extract one frame every K frames, resize each picture to 224 × 224, and stack all processed pictures to construct a matrix of size s × 224 × 224.
② Construct the network M; the flow of M is shown in FIG. 1, and the conversion layer T is shown in FIG. 2.
③ Train the network M: take the s × 224 × 224 matrix from the first step as input to the network M; the network computes the cross-entropy loss, and training stops when the loss function reaches its minimum. Finally, the network maps the data to an output h1, …, hn, where n is the number of classes in the dataset: for example, n = 51 for HMDB51, so the output has length 51, and n = 101 for UCF101, so the output has length 101. Here hi represents the probability that the input video belongs to class i, and the class i with the largest hi over all i is the predicted class.
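The prediction rule at the end of step ③, picking the class i with the largest hi, amounts to an argmax over the output vector; a minimal sketch:

```python
import numpy as np

def predict_class(h):
    """h = (h1, ..., hn): hi is the probability that the input video belongs
    to class i. The predicted class is the i with the largest hi
    (1-indexed, matching the description in step ③)."""
    return int(np.argmax(h)) + 1

print(predict_class([0.1, 0.7, 0.2]))  # 2
```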
4. Evaluation criteria:
Average recognition rate, computed by the following formula:
where Vk is a video sequence, Ci is the set of video sequences belonging to category i, h(Vk) is the predicted category of sequence Vk, |V| is the total number of video sequences, and NC is the number of action categories.
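The formula itself did not survive the page extraction. From the symbol definitions above, one reconstruction consistent with an "average recognition rate" is the mean of the per-class recognition rates; this exact form is an assumption:

```latex
\overline{R} \;=\; \frac{1}{N_C} \sum_{i=1}^{N_C}
\frac{\bigl|\{\, V_k \in C_i \;:\; h(V_k) = i \,\}\bigr|}{\lvert C_i \rvert}
```

If the per-class terms are instead weighted by class size, the expression collapses to the overall accuracy, a fraction over the total count |V|, which would account for the |V| symbol defined above.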
5. The experimental results are as follows:
By training on all three splits of the UCF101 and HMDB51 datasets, our network M reaches state-of-the-art performance, obtaining 95.7% accuracy on UCF101 and 71.5% on HMDB51. Notably, model M outperforms the most advanced prior models on both datasets.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (5)
1. A human behavior recognition method based on deep learning, characterized by comprising the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
2. The deep learning based human behavior recognition method according to claim 1, wherein the specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
3. The deep learning based human behavior recognition method according to claim 2, wherein the specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
4. The deep learning based human behavior recognition method according to claim 3, wherein the specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
5. The deep learning based human behavior recognition method according to claim 4, wherein the specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007261.5A CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007261.5A CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110956085A true CN110956085A (en) | 2020-04-03 |
Family
ID=69975735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911007261.5A Pending CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110956085A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860278A (en) * | 2020-07-14 | 2020-10-30 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN112949406A (en) * | 2021-02-02 | 2021-06-11 | 西北农林科技大学 | Sheep individual identity recognition method based on deep learning algorithm |
US20220058396A1 (en) * | 2019-11-19 | 2022-02-24 | Tencent Technology (Shenzhen) Company Limited | Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188637A (en) * | 2019-05-17 | 2019-08-30 | 西安电子科技大学 | A kind of Activity recognition technical method based on deep learning |
CN110348381A (en) * | 2019-07-11 | 2019-10-18 | 电子科技大学 | A kind of video behavior recognition methods based on deep learning |
- 2019-10-22 CN CN201911007261.5A patent/CN110956085A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220058396A1 (en) * | 2019-11-19 | 2022-02-24 | Tencent Technology (Shenzhen) Company Limited | Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium |
US11967152B2 (en) * | 2019-11-19 | 2024-04-23 | Tencent Technology (Shenzhen) Company Limited | Video classification model construction method and apparatus, video classification method and apparatus, device, and medium |
CN111860278A (en) * | 2020-07-14 | 2020-10-30 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN111860278B (en) * | 2020-07-14 | 2024-05-14 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN112949406A (en) * | 2021-02-02 | 2021-06-11 | 西北农林科技大学 | Sheep individual identity recognition method based on deep learning algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021093468A1 (en) | Video classification method and apparatus, model training method and apparatus, device and storage medium | |
CN110956085A (en) | Human behavior recognition method based on deep learning | |
CN111091045A (en) | Sign language identification method based on space-time attention mechanism | |
CN111144448A (en) | Video barrage emotion analysis method based on multi-scale attention convolutional coding network | |
CN107862300A (en) | A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks | |
CN113158723B (en) | End-to-end video motion detection positioning system | |
CN112836646B (en) | Video pedestrian re-identification method based on channel attention mechanism and application | |
CN107818307B (en) | Multi-label video event detection method based on LSTM network | |
CN108647599B (en) | Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network | |
CN112084891B (en) | Cross-domain human body action recognition method based on multi-modal characteristics and countermeasure learning | |
CN114596520A (en) | First visual angle video action identification method and device | |
CN110427831B (en) | Human body action classification method based on fusion features | |
CN110956059B (en) | Dynamic gesture recognition method and device and electronic equipment | |
CN112668638A (en) | Image aesthetic quality evaluation and semantic recognition combined classification method and system | |
CN110889335B (en) | Human skeleton double interaction behavior identification method based on multichannel space-time fusion network | |
CN114708649A (en) | Behavior identification method based on integrated learning method and time attention diagram convolution | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN115908896A (en) | Image identification system based on impulse neural network with self-attention mechanism | |
CN113076905B (en) | Emotion recognition method based on context interaction relation | |
Kumar et al. | Content based movie scene retrieval using spatio-temporal features | |
CN114202787A (en) | Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism | |
CN111339888B (en) | Double interaction behavior recognition method based on joint point motion diagram | |
CN112508121A (en) | Method and system for sensing outside by industrial robot | |
CN112528077A (en) | Video face retrieval method and system based on video embedding | |
CN112417989A (en) | Invigilator violation identification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20231215 |