CN110956085A - Human behavior recognition method based on deep learning - Google Patents
- Publication number
- CN110956085A (application number CN201911007261.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- training
- model
- network
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention provides a human behavior recognition method based on deep learning, comprising an improved model M based on 3D ResNet, where the improved model M is a deep-learning-based video feature extraction model; a method for preprocessing the input data using the TSN (Temporal Segment Network) sampling strategy; and an application method that describes how to extract the temporal and spatial features of a video and use these features to recognize the human behaviors in it. When a user supplies a video as input, the behaviors in that video can be recognized.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a human behavior recognition method based on deep learning.
Background
With the rapid growth of the internet in recent years, networks have become the primary means by which people entertain themselves and obtain information, and in the process the internet has accumulated a large amount of video data. Statistics show that up to 35 hours of video are uploaded to YouTube every minute. Handling these large amounts of video data is now a challenge. Computer vision has therefore come to the fore, and human behavior recognition in particular has attracted extensive attention in academia and industry.
Owing to its wide application in video surveillance, human-computer interfaces, and related fields, human action recognition in video is a classic topic in computer vision and image processing. However, many challenges still exist in action recognition, such as efficient multi-range spatiotemporal feature extraction. Recently proposed spatiotemporal feature extractors fall roughly into two categories: long-term and short-term features.
Short-term features are the key to trajectory-based video feature extraction. Because its short-term extraction function is local and repetitive, the technique is robust and simple. Long-term features have the opposite properties and therefore greater discriminative power than short-term features, but they remain sensitive to within-class variability.
More specifically, when a framework captures only short-term spatiotemporal information, it is difficult to distinguish between the front crawl and the breaststroke. Conversely, extracting short-term spatiotemporal features is more effective at recognizing an action such as walking a dog. Therefore, a powerful action recognition system should be able to distinguish different action classes across multiple contexts, and capturing information over multiple spatiotemporal ranges is both important and beneficial.
Deep-learning-based methods such as TSN and I3D have achieved good results in computer vision, and in particular have markedly improved overall accuracy on large-scale datasets with complex behaviors. However, further improving the recognition rate on complex video datasets by capturing information over multiple spatiotemporal ranges remains a challenge.
Disclosure of Invention
The invention provides a human behavior recognition method based on deep learning whose algorithm allows the extracted information to combine the repeatability of short-term features with the discriminability of long-term features.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
Further, the specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
Further, the specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
Further, the specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
Further, the specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The method provides a multi-stage feature extraction technique that effectively captures multi-range spatiotemporal information; it strikes a compromise between short-term and long-term feature extraction, so that the extracted information combines the repeatability of short-term features with the discriminability of long-term features;
2. The invention defines a new feature fusion mode: when fusing spatial and temporal information, weighted concatenation is used instead of simple dimension-wise concatenation, which strengthens the captured features and forms a more effective video representation;
3. The invention preprocesses the input data, reducing the redundant computation over repeated consecutive frames, reducing video data storage, and accelerating model training.
Drawings
FIG. 1 is a schematic flow diagram of the M network of the present invention;
FIG. 2 is a flow chart of the T conversion layer of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1 and FIG. 2, a human behavior recognition method based on deep learning comprises the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
The specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
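The sampling rule in S11 and S12 can be sketched as follows. This is a minimal NumPy illustration only; the exact indexing is an assumption, since the description states only that one frame out of every K is kept, and frame decoding plus resizing to 224 × 224 are assumed to have been done already:

```python
import numpy as np

def sample_frames(video: np.ndarray, k: int) -> np.ndarray:
    """Sparse sampling as in S12: keep one frame out of every k consecutive
    RGB frames, reducing redundancy between near-duplicate frames.

    video: the preprocessed frame matrix X, shape (n, 224, 224, 3).
    Returns XT with shape (s, 224, 224, 3), where s = ceil(n / k).
    """
    return video[::k]

# A 40-frame clip sampled with k = 5 yields s = 8 frames.
clip = np.zeros((40, 224, 224, 3), dtype=np.uint8)
xt = sample_frames(clip, 5)
print(xt.shape)  # (8, 224, 224, 3)
```

Sampling this way is what realizes beneficial effect 3: fewer near-duplicate frames are stored and pushed through the network.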
The specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
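S21 names a Non-Local module without giving its internals. One common form, offered here only as a plausible reading and not as the patent's confirmed design, is the embedded-Gaussian Non-Local operation, sketched in NumPy with hypothetical projection matrices standing in for learned 1×1×1 convolutions:

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian Non-Local operation on a flattened spatiotemporal
    feature map x of shape (N, C), where N = T*H*W positions: every position
    is updated with an attention-weighted sum over all positions, which is
    what lets the conversion layer capture long-range (multi-range)
    dependencies. w_theta, w_phi, w_g are (C, C) projection matrices,
    learned in a real model."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    att = theta @ phi.T                          # (N, N) pairwise similarity
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # row-wise softmax
    return x + att @ g                           # residual connection
```

Because of the residual connection, the block can fall back to the identity mapping when the learned projection w_g is near zero, which is why it can be embedded into ResNet3D without disturbing the pretrained backbone.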
The specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
The specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
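The training recipe in S332 can be sketched as follows. The weighting coefficient `alpha` and the triplet `margin` are hypothetical values, since the description states only that the loss is a weighted sum of cross-entropy and triplet loss; the learning-rate schedule follows the stated 0.01 start with a ×0.1 decay every 60 epochs:

```python
import numpy as np

def combined_loss(logits, labels, anchor, pos, neg, alpha=0.5, margin=1.0):
    """Weighted sum of cross-entropy and triplet loss (S332).
    alpha and margin are hypothetical; the patent gives no values."""
    # numerically stable log-softmax + cross-entropy
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # triplet loss on embedding vectors: pull anchor toward pos, away from neg
    d_pos = np.linalg.norm(anchor - pos, axis=1)
    d_neg = np.linalg.norm(anchor - neg, axis=1)
    tri = np.maximum(d_pos - d_neg + margin, 0.0).mean()
    return alpha * ce + (1 - alpha) * tri

def learning_rate(epoch, base_lr=0.01):
    """Initial rate 0.01, multiplied by 0.1 every 60 epochs (S332)."""
    return base_lr * (0.1 ** (epoch // 60))
```

In a real SGD loop the schedule is applied per epoch, e.g. `lr = learning_rate(epoch)` before each parameter update.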
Example 2
An experiment on the recognition performance of the disclosed video human behavior recognition method based on a multi-spatiotemporal network is as follows:
1. Experimental datasets: the Kinetics video dataset (400 action categories in total), the UCF101 dataset (101 action categories in total), and the HMDB51 dataset (51 action categories in total). The Kinetics video dataset is used to pre-train the model M, and the other two, HMDB51 and UCF101, are the benchmark datasets for human behavior recognition; all three datasets are sourced from YouTube. The basic characteristics of the datasets are shown in the following table:
2. Experimental environment: 4 NVIDIA GTX 1080 Ti GPUs;
3. The experimental method comprises the following steps:
① Preprocess the input video clip: extract one frame every K frames, resize each picture to 224 × 224, and stack all processed pictures to construct a matrix of size s × 224 × 224.
② Construct the network M; the flow of M is shown in FIG. 1, and the conversion layer T is shown in FIG. 2.
③ Train the network M: take the s × 224 × 224 matrix from the first step as input to the network M; the network computes the cross-entropy loss, and training stops when the loss function reaches its minimum. Finally, the network maps the data to an output h1, …, hn, where n is the number of classes in the dataset: for example, n = 51 for HMDB51, so the output has length 51, and n = 101 for UCF101, so the output has length 101. Here hi represents the probability that the input video belongs to class i, and the class i with the largest hi over all i is the predicted class.
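The prediction rule at the end of step ③, picking the class i with the largest hi, amounts to an argmax over the output vector; a minimal sketch:

```python
import numpy as np

def predict_class(h):
    """h = (h1, ..., hn): hi is the probability that the input video belongs
    to class i. The predicted class is the i with the largest hi
    (1-indexed, matching the description in step ③)."""
    return int(np.argmax(h)) + 1

print(predict_class([0.1, 0.7, 0.2]))  # 2
```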
4. Evaluation criteria:
Average recognition rate, computed by the following formula:
where Vk is a video sequence, Ci is the set of video sequences belonging to category i, h(Vk) is the predicted category of sequence Vk, |V| is the total number of video sequences, and NC is the number of action categories.
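The formula itself did not survive the page extraction. From the symbol definitions above, one reconstruction consistent with an "average recognition rate" is the mean of the per-class recognition rates; this exact form is an assumption:

```latex
\overline{R} \;=\; \frac{1}{N_C} \sum_{i=1}^{N_C}
\frac{\bigl|\{\, V_k \in C_i \;:\; h(V_k) = i \,\}\bigr|}{\lvert C_i \rvert}
```

If the per-class terms are instead weighted by class size, the expression collapses to the overall accuracy, a fraction over the total count |V|, which would account for the |V| symbol defined above.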
5. The experimental results are as follows:
By training on all three splits of the UCF101 and HMDB51 datasets, our network M reaches state-of-the-art performance, obtaining 95.7% accuracy on UCF101 and 71.5% on HMDB51. Notably, model M outperforms the most advanced prior models on both datasets.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (5)
1. A human behavior recognition method based on deep learning, characterized by comprising the following steps:
S1: preprocessing an input video clip;
S2: constructing an improved network model M based on ResNet3D;
S3: training and testing the model;
S4: establishing a background-interface process that provides a recognition entry point and prediction feedback;
where ResNet3D is a deep-learning-based video feature extraction model.
2. The deep learning based human behavior recognition method according to claim 1, wherein the specific process of step S1 is:
S11: decomposing the whole video file into RGB frames and representing it as a vector matrix X = [x1, x2, …, xn], where n is the number of video frames and each frame in X is resized to the preprocessing size of 224 × 224;
S12: extracting from the vector matrix a new vector matrix XT = [xt1, xt2, …, xts] by selecting one frame out of every K consecutive RGB frames, where s is the number of extracted video frames; the frames in XT have the same dimensions as those in X, namely 224 × 224.
3. The deep learning based human behavior recognition method according to claim 2, wherein the specific process of step S2 is:
S21: establishing a new spatio-temporal conversion layer T, composed of a Non-Local module and several convolutions of different kernel sizes, so that the model learns spatio-temporal information within this layer;
S22: establishing the improved network M based on ResNet3D by embedding the newly constructed conversion layer into ResNet3D; inputting the preprocessed vector matrix XT into the network yields an output feature vector of length n.
4. The deep learning based human behavior recognition method according to claim 3, wherein the specific process of step S3 is:
S31: selecting the benchmark datasets HMDB51 and UCF101 and splitting the data in each dataset into a training set and a test set at a 4:1 ratio;
S32: training the improved network M as follows: video features are extracted by the network M, and the model is trained by minimizing a loss function L1 so that it is as close to optimal as possible;
S33: testing the improved network M.
5. The deep learning based human behavior recognition method according to claim 4, wherein the specific process of step S33 is:
S331: pre-training the improved model M on the Kinetics video dataset and fine-tuning it on the HMDB51 and UCF101 datasets; for each video, the trained model M generates a feature vector of size k, where k is the number of classes in the dataset, and a Linear layer converts this k-sized vector into a custom feature vector of length n;
S332: during training, cross-entropy loss and triplet loss are adopted as the loss functions, and the total loss is their weighted sum; the optimizer is stochastic gradient descent, dropout is used to prevent overfitting, the initial learning rate is set to 0.01, and the learning rate is multiplied by 0.1 every 60 epochs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007261.5A CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007261.5A CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110956085A true CN110956085A (en) | 2020-04-03 |
Family
ID=69975735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911007261.5A Pending CN110956085A (en) | 2019-10-22 | 2019-10-22 | Human behavior recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110956085A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860278A (en) * | 2020-07-14 | 2020-10-30 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN112949406A (en) * | 2021-02-02 | 2021-06-11 | 西北农林科技大学 | Sheep individual identity recognition method based on deep learning algorithm |
US20220058396A1 (en) * | 2019-11-19 | 2022-02-24 | Tencent Technology (Shenzhen) Company Limited | Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188637A (en) * | 2019-05-17 | 2019-08-30 | 西安电子科技大学 | A kind of Activity recognition technical method based on deep learning |
CN110348381A (en) * | 2019-07-11 | 2019-10-18 | 电子科技大学 | A kind of video behavior recognition methods based on deep learning |
- 2019-10-22 CN CN201911007261.5A patent/CN110956085A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220058396A1 (en) * | 2019-11-19 | 2022-02-24 | Tencent Technology (Shenzhen) Company Limited | Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium |
US11967152B2 (en) * | 2019-11-19 | 2024-04-23 | Tencent Technology (Shenzhen) Company Limited | Video classification model construction method and apparatus, video classification method and apparatus, device, and medium |
CN111860278A (en) * | 2020-07-14 | 2020-10-30 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN111860278B (en) * | 2020-07-14 | 2024-05-14 | 陕西理工大学 | Human behavior recognition algorithm based on deep learning |
CN112949406A (en) * | 2021-02-02 | 2021-06-11 | 西北农林科技大学 | Sheep individual identity recognition method based on deep learning algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021093468A1 (en) | Video classification method and apparatus, model training method and apparatus, device and storage medium | |
CN110956085A (en) | Human behavior recognition method based on deep learning | |
CN111091045A (en) | Sign language identification method based on space-time attention mechanism | |
CN111144448A (en) | Video barrage emotion analysis method based on multi-scale attention convolutional coding network | |
CN107862300A (en) | A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks | |
CN113158723B (en) | End-to-end video motion detection positioning system | |
CN112836646B (en) | Video pedestrian re-identification method based on channel attention mechanism and application | |
CN107818307B (en) | Multi-label video event detection method based on LSTM network | |
CN108647599B (en) | Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network | |
CN112084891B (en) | Cross-domain human body action recognition method based on multi-modal characteristics and countermeasure learning | |
CN114596520A (en) | First visual angle video action identification method and device | |
CN110427831B (en) | Human body action classification method based on fusion features | |
CN110956059B (en) | Dynamic gesture recognition method and device and electronic equipment | |
CN112668638A (en) | Image aesthetic quality evaluation and semantic recognition combined classification method and system | |
CN110889335B (en) | Human skeleton double interaction behavior identification method based on multichannel space-time fusion network | |
CN114708649A (en) | Behavior identification method based on integrated learning method and time attention diagram convolution | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN115908896A (en) | Image identification system based on impulse neural network with self-attention mechanism | |
CN113076905B (en) | Emotion recognition method based on context interaction relation | |
Kumar et al. | Content based movie scene retrieval using spatio-temporal features | |
CN114202787A (en) | Multiframe micro-expression emotion recognition method based on deep learning and two-dimensional attention mechanism | |
CN111339888B (en) | Double interaction behavior recognition method based on joint point motion diagram | |
CN112508121A (en) | Method and system for sensing outside by industrial robot | |
CN112528077A (en) | Video face retrieval method and system based on video embedding | |
CN112417989A (en) | Invigilator violation identification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20231215 |