CN111860117A - Human behavior recognition method based on deep learning - Google Patents

Human behavior recognition method based on deep learning

Info

Publication number
CN111860117A
Authority
CN
China
Prior art keywords
axis
neural network
angular velocity
deep learning
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010494768.4A
Other languages
Chinese (zh)
Inventor
胡二琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Bigeng Software Co ltd
Original Assignee
Anhui Bigeng Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Bigeng Software Co ltd filed Critical Anhui Bigeng Software Co ltd
Priority to CN202010494768.4A priority Critical patent/CN111860117A/en
Publication of CN111860117A publication Critical patent/CN111860117A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a human behavior recognition method based on deep learning and relates to the technical field of computer vision. The method comprises the following steps: acquiring a large number of videos and uniformly sampling them to obtain videos with a fixed number of frames, which serve as a training set; installing inertial sensors at the joints of the human body and collecting the human behavior data from each inertial sensor as a test set; training a convolutional neural network with the training set and the test set; and establishing a background-service process that provides a recognition entry point and prediction feedback. The invention trains the convolutional neural network on massive videos as a training set, obtains the joint angular velocity and acceleration of human behaviors in real time with the inertial sensors to build an accurate test set for further training, and thereby improves the accuracy of the convolutional neural network's output.

Description

Human behavior recognition method based on deep learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a human behavior recognition method based on deep learning.
Background
With the rapid growth of the internet in recent years, the network has become a primary means for people to entertain themselves and obtain information, and in the process the internet has accumulated a large amount of video data. Statistics show that up to 35 hours of video are uploaded to YouTube every minute. How to process this flood of video data is now a challenge. Computer vision has therefore come to the fore, and human behavior recognition in particular has attracted extensive attention in academia and industry.
Human action recognition in video is a classic topic in the fields of computer vision and image processing because of its wide application in video surveillance, human-computer interface devices, and other areas. However, many challenges remain in action recognition, such as efficient multi-range spatiotemporal feature extraction. Recently proposed spatiotemporal feature extractors fall roughly into two categories: long-term and short-term features.
Trajectory-based video feature extraction relies mainly on short-term features. The technique is robust and simple because its extraction function is local and repetitive over short time spans. Long-term features have more discriminative power than short-term features because they exhibit the opposite properties: they are global and discriminative, although they remain sensitive to within-class variability.
More specifically, when a framework captures only short-term spatiotemporal information, it is difficult to distinguish between front crawl and breaststroke. Conversely, extracting short-term spatiotemporal features is more effective for recognizing the action of walking a dog. A powerful action recognition system should therefore be able to distinguish different classes of actions across multiple contexts, so capturing information over multiple spatiotemporal ranges is very important and beneficial.
Deep-learning-based methods such as TSN and I3D have achieved good results in computer vision and, in particular, have markedly improved overall accuracy on large-scale datasets with complex behaviors. However, further improving the recognition rate on complex video datasets by capturing information over multiple spatiotemporal ranges remains a challenge.
Disclosure of Invention
The invention aims to provide a human behavior recognition method based on deep learning, which trains a convolutional neural network on massive videos as a training set, acquires the joint angular velocity and acceleration of human behaviors in real time with inertial sensors to build an accurate test set for training the convolutional neural network, and thereby addresses the low recognition rate and insufficient accuracy of existing video datasets.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a human behavior recognition method based on deep learning, which comprises the following steps:
step S1: acquiring a large number of videos and uniformly sampling them to obtain videos with a fixed number of frames, which serve as a training set;
step S2: installing inertial sensors at the joints of the human body and collecting the human behavior data from each inertial sensor as a test set;
step S3: training the convolutional neural network with the training set and the test set;
step S4: establishing a background-service process that provides a recognition entry point and prediction feedback.
Preferably, in step S1, the specific steps of obtaining a video with a fixed frame number are as follows:
step S11: inputting the whole video file into a Gaussian mixture model for motion detection;
step S12: detecting a person in a motion region using a gradient histogram;
step S13: classifying the persons detected in the motion region with a softmax classifier.
Preferably, in step S11, the time series $\{x_1, x_2, \ldots, x_\gamma\}$ of each pixel in the image is modeled by a Gaussian mixture model, and the probability of the pixel value at the current observation point is:

$$P(x_t) = \sum_{i=1}^{k} \omega_{i,t} \cdot \eta\left(x_t;\, \mu_{i,t},\, \Sigma_{i,t}\right)$$

where $k$ is the number of Gaussian components, $\omega_{i,t}$ is the weight of the $i$-th Gaussian component at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and variance of the $i$-th Gaussian component at time $t$, and $\eta$ is the Gaussian probability density function.
Preferably, in step S13, the person image detected in the motion region by the gradient histogram is mapped onto a corresponding label through a softmax classifier. A classification result is obtained while the gradient histogram is processed and compared with the corresponding label data to compute a relative error; the weights of the convolution windows in the convolutional neural network are then trained over a number of iterations so that the relative error keeps decreasing until the network converges, after which the final gradient-histogram result is input into the softmax classifier network for test classification.
Preferably, in step S2, sliding-window segmentation is performed on the human behavior data from each inertial sensor to obtain the tri-axial acceleration and angular velocity of each observation window.
Preferably, the inertial sensor performs feature extraction on the tri-axial acceleration and angular velocity to obtain a feature vector for each sensor node. Using time-domain analysis and time-frequency analysis methods from signal theory, the feature extraction computes, for the acceleration and angular velocity data on the x, y, and z axes: the mean on each axis; the variance on each axis; the kurtosis on each axis; the covariance between the x, y, and z axes; and the energy feature set of the intrinsic mode functions obtained by ensemble empirical mode decomposition on each axis.
Preferably, in step S3, the convolutional neural network comprises an embedding layer, an LSTM, a fully connected layer, and a softmax layer.
Preferably, in the training or predicting process of the convolutional neural network, the transmission process of the signal is as follows:
inputting the (x, y, z) signals of a training sample into the embedding layer, which converts x, y, and z into corresponding m-dimensional vectors and splices them into one 3m-dimensional vector; inputting the 3m-dimensional vectors into the LSTM neural network in time order, which outputs the 3m×L-dimensional representation vector of the track to the fully connected layer; and outputting, through the softmax layer, the judgment of whether the track is a human behavior.
The invention has the following beneficial effects:
(1) the method trains the convolutional neural network on massive videos as a training set, obtains the joint angular velocity and acceleration of human behaviors in real time with the inertial sensors to build an accurate test set for further training, and thereby improves the accuracy of the convolutional neural network's output;
(2) according to the invention, a softmax classifier is placed at the bottom layer of the convolutional neural network, so that the image of the person is mapped onto the corresponding label through the softmax classifier; images are preliminarily classified into sitting, standing, walking, and running postures before prediction, monitoring is performed directly according to the labels during prediction, and detection efficiency is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a step diagram of a human behavior recognition method based on deep learning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention is a method for recognizing human behavior based on deep learning, including the following steps:
Step S1: acquiring mass videos, uniformly sampling to obtain videos with fixed frame numbers, and using the videos as a training set;
step S2: installing inertial sensors at human joints, and collecting human behavior data of each inertial sensor to serve as a test set;
step S3: training the convolutional neural network by utilizing a training set and a test set;
step S4: establishing a process for providing a background interface, providing an identification entry and prediction feedback.
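As an illustration of the uniform sampling in step S1, the following Python sketch (an assumption for illustration, not part of the patent; the frame count of 16 and the use of OpenCV are arbitrary choices) picks evenly spaced frames so that every clip in the training set has the same number of frames:

```python
import cv2
import numpy as np

def sample_fixed_frames(video_path, num_frames=16):
    """Uniformly sample num_frames frames from a video (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)
```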
In step S1, the specific steps of obtaining a video with a fixed frame number are as follows:
step S11: inputting the whole video file into a Gaussian mixture model for motion detection;
the Gaussian Mixture Model (GMM) is a classic adaptive background modeling method, supposing that unit pixels accord with normal distribution in a time domain, setting a threshold range to judge that the pixels are background and serving as a basis for updating the model, describing the background into a plurality of Gaussian distributions by the Gaussian mixture model, and taking the pixels which accord with one of the distribution models as background pixels.
Step S12: detecting a person in a motion region using a gradient histogram;
gradient Histogram (HOG) is an operator, which is described based on shape edge features, and is generally used for object detection, and the basic idea is to calculate pixel value gradients to express edge information of an object and extract features of local appearance and shape of an image by local gradient values.
Step S13: classifying the persons detected in the motion region with a softmax classifier. In actual computation, factors such as the choice of gradient directions and the parameter templates at different scales influence the final result; finally, the pedestrian target is detected with the softmax classifier.
In step S11, the time series $\{x_1, x_2, \ldots, x_\gamma\}$ of each pixel in the image is modeled by a Gaussian mixture model, and the probability of the pixel value at the current observation point is:

$$P(x_t) = \sum_{i=1}^{k} \omega_{i,t} \cdot \eta\left(x_t;\, \mu_{i,t},\, \Sigma_{i,t}\right)$$

where $k$ is the number of Gaussian components, $\omega_{i,t}$ is the weight of the $i$-th Gaussian component at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and variance of the $i$-th Gaussian component at time $t$, and $\eta$ is the Gaussian probability density function.
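To make the formula concrete, the following sketch (with purely illustrative component values and k = 3) evaluates the mixture probability of a scalar pixel value, using SciPy's normal density for η:

```python
import numpy as np
from scipy.stats import norm

k = 3
weights = np.array([0.6, 0.3, 0.1])       # omega_{i,t}, sum to 1
means   = np.array([90.0, 130.0, 200.0])  # mu_{i,t}
stds    = np.array([8.0, 12.0, 20.0])     # sqrt of Sigma_{i,t}

x_t = 95.0  # current pixel value
# P(x_t) = sum_i omega_{i,t} * eta(x_t; mu_{i,t}, Sigma_{i,t})
p = np.sum(weights * norm.pdf(x_t, loc=means, scale=stds))
print(p)
```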
In step S13, the person image detected in the motion region by the gradient histogram is mapped onto a corresponding label through a softmax classifier. A classification result is obtained while the gradient histogram is processed and compared with the corresponding label data to compute a relative error; the weights of the convolution windows in the convolutional neural network are then trained over a number of iterations so that the relative error keeps decreasing until the network converges, after which the final gradient-histogram result is input into the softmax classifier network for test classification.
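A minimal sketch of the softmax classification stage, assuming the HOG feature vector has already been computed; the feature length (3780, a typical HOG size), the learning rate, and the use of a cross-entropy loss in place of the patent's loosely specified relative error are all assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

classes = ["sitting", "standing", "walking", "running"]
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(len(classes), 3780))
b = np.zeros(len(classes))

def train_step(hog_feature, label_idx, lr=0.01):
    """One gradient-descent step on the cross-entropy loss (illustrative)."""
    global W, b
    probs = softmax(W @ hog_feature + b)
    grad = probs.copy()
    grad[label_idx] -= 1.0                 # d(loss)/d(logits) for cross-entropy
    W -= lr * np.outer(grad, hog_feature)
    b -= lr * grad
    return -np.log(probs[label_idx])       # loss; shrinks as training converges

loss = train_step(rng.normal(size=3780), label_idx=2)
```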
In step S2, sliding-window segmentation is performed on the human behavior data from each inertial sensor to obtain the tri-axial acceleration and angular velocity of each observation window.
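A minimal sketch of the sliding-window segmentation, assuming each sensor streams rows of tri-axial acceleration followed by tri-axial angular velocity; the window length of 128 samples and the 50% overlap are assumptions, not specified in the patent:

```python
import numpy as np

def sliding_windows(samples, window=128, overlap=0.5):
    """Split an (N, 6) array of tri-axial acceleration + angular velocity
    into overlapping observation windows of shape (window, 6)."""
    step = int(window * (1.0 - overlap))
    return np.array([samples[s:s + window]
                     for s in range(0, len(samples) - window + 1, step)])

data = np.random.randn(1000, 6)      # stand-in for one sensor's recording
windows = sliding_windows(data)      # shape: (num_windows, 128, 6)
accel, gyro = windows[..., :3], windows[..., 3:]
```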
The inertial sensor performs feature extraction on the tri-axial acceleration and angular velocity to obtain a feature vector for each sensor node. Using time-domain analysis and time-frequency analysis methods from signal theory, the feature extraction computes, for the acceleration and angular velocity data on the x, y, and z axes: the mean on each axis; the variance on each axis; the kurtosis on each axis; the covariance between the x, y, and z axes; and the energy feature set of the intrinsic mode functions obtained by ensemble empirical mode decomposition (EEMD) on each axis.
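A sketch of the per-window feature vector, assuming SciPy for the kurtosis and the third-party PyEMD package for the ensemble empirical mode decomposition (both are assumptions; any EEMD implementation would serve, and max_imf is capped only to keep the feature length roughly constant):

```python
import numpy as np
from scipy.stats import kurtosis
from PyEMD import EEMD  # assumed third-party EEMD implementation

def window_features(window):
    """window: (T, 6) array -> 1-D feature vector for one observation window."""
    feats = [window.mean(axis=0),        # mean per axis (6 values)
             window.var(axis=0),         # variance per axis
             kurtosis(window, axis=0)]   # kurtosis per axis
    # Covariances between the x, y, z axes of acceleration and of angular velocity.
    feats.append(np.cov(window[:, :3], rowvar=False)[np.triu_indices(3, k=1)])
    feats.append(np.cov(window[:, 3:], rowvar=False)[np.triu_indices(3, k=1)])
    eemd = EEMD()
    for axis in range(6):                # EEMD energy of each IMF, per axis
        imfs = eemd.eemd(window[:, axis], max_imf=4)
        feats.append((imfs ** 2).sum(axis=1))
    return np.concatenate(feats)
```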
In step S3, the convolutional neural network comprises an embedding layer, an LSTM, a fully connected layer, and a softmax layer.
In the training or predicting process of the convolutional neural network, the transmission process of signals is as follows:
inputting the (x, y, z) signals of a training sample into the embedding layer, which converts x, y, and z into corresponding m-dimensional vectors and splices them into one 3m-dimensional vector; inputting the 3m-dimensional vectors into the LSTM neural network in time order, which outputs the 3m×L-dimensional representation vector of the track to the fully connected layer; and outputting, through the softmax layer, the judgment of whether the track is a human behavior.
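A minimal PyTorch sketch of the described embedding → LSTM → fully connected → softmax pipeline. The patent calls the network convolutional although the layers listed are recurrent; m, the 3m concatenation, and the binary human-behavior output come from the text, while the linear embedding, hidden size, and sequence length are assumptions:

```python
import torch
import torch.nn as nn

class TrackClassifier(nn.Module):
    def __init__(self, m=32, hidden=64, num_classes=2):
        super().__init__()
        # One linear "embedding" per axis: scalar -> m-dimensional vector.
        self.embed_x = nn.Linear(1, m)
        self.embed_y = nn.Linear(1, m)
        self.embed_z = nn.Linear(1, m)
        self.lstm = nn.LSTM(3 * m, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, xyz):                   # xyz: (batch, T, 3)
        x, y, z = xyz[..., :1], xyz[..., 1:2], xyz[..., 2:3]
        # Concatenate the three m-dim embeddings into a 3m-dim vector per step.
        e = torch.cat([self.embed_x(x), self.embed_y(y), self.embed_z(z)], dim=-1)
        out, _ = self.lstm(e)                 # (batch, T, hidden)
        logits = self.fc(out[:, -1])          # last time step -> fully connected
        return torch.softmax(logits, dim=-1)  # human behavior vs. not

model = TrackClassifier()
probs = model(torch.randn(4, 128, 3))         # 4 tracks, 128 time steps each
```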
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (8)

1. A human behavior recognition method based on deep learning is characterized by comprising the following steps:
step S1: acquiring a large number of videos and uniformly sampling them to obtain videos with a fixed number of frames, which serve as a training set;
step S2: installing inertial sensors at the joints of the human body and collecting the human behavior data from each inertial sensor as a test set;
step S3: training the convolutional neural network with the training set and the test set;
step S4: establishing a background-service process that provides a recognition entry point and prediction feedback.
2. The method for recognizing human body behaviors based on deep learning of claim 1, wherein in the step S1, the specific steps of obtaining the video with fixed frame number are as follows:
step S11: inputting the whole video file into a Gaussian mixture model for motion detection;
step S12: detecting a person in a motion region using a gradient histogram;
step S13: classifying the persons detected in the motion region with a softmax classifier.
3. The method for human behavior recognition based on deep learning of claim 2, wherein in step S11, the time series $\{x_1, x_2, \ldots, x_\gamma\}$ of each pixel in the image is modeled by a Gaussian mixture model, and the probability of the pixel value at the current observation point is:

$$P(x_t) = \sum_{i=1}^{k} \omega_{i,t} \cdot \eta\left(x_t;\, \mu_{i,t},\, \Sigma_{i,t}\right)$$

where $k$ is the number of Gaussian components, $\omega_{i,t}$ is the weight of the $i$-th Gaussian component at time $t$, $\mu_{i,t}$ and $\Sigma_{i,t}$ are the mean and variance of the $i$-th Gaussian component at time $t$, and $\eta$ is the Gaussian probability density function.
4. The method for human behavior recognition based on deep learning of claim 2, wherein in step S13, the person image detected in the motion region by the gradient histogram is mapped onto a corresponding label through a softmax classifier; a classification result is obtained while the gradient histogram is processed and compared with the corresponding label data to compute a relative error; the weights of the convolution windows in the convolutional neural network are trained over a number of iterations so that the relative error keeps decreasing until the network converges, after which the final gradient-histogram result is input into the softmax classifier network for test classification.
5. The method for recognizing human body behavior based on deep learning of claim 1, wherein in step S2, sliding-window segmentation is performed on the human body behavior data from each inertial sensor to obtain the tri-axial acceleration and angular velocity of each observation window.
6. The human behavior recognition method based on deep learning of claim 1 or 5, wherein the inertial sensor performs feature extraction on the tri-axial acceleration and angular velocity to obtain a feature vector for each sensor node; wherein, using time-domain analysis and time-frequency analysis methods from signal theory, the feature extraction computes, for the acceleration and angular velocity data on the x, y, and z axes: the mean on each axis; the variance on each axis; the kurtosis on each axis; the covariance between the x, y, and z axes; and the energy feature set of the intrinsic mode functions obtained by ensemble empirical mode decomposition on each axis.
7. The deep-learning-based human behavior recognition method of claim 1, wherein in step S3, the convolutional neural network comprises an embedding layer, an LSTM, a fully connected layer, and a softmax layer.
8. The human behavior recognition method based on deep learning as claimed in claim 1 or 7, wherein in the training or prediction process of the convolutional neural network, the transmission process of signals is as follows:
inputting the (x, y, z) signals of a training sample into the embedding layer, which converts x, y, and z into corresponding m-dimensional vectors and splices them into one 3m-dimensional vector; inputting the 3m-dimensional vectors into the LSTM neural network in time order, which outputs the 3m×L-dimensional representation vector of the track to the fully connected layer; and outputting, through the softmax layer, the judgment of whether the track is a human behavior.
CN202010494768.4A 2020-06-03 2020-06-03 Human behavior recognition method based on deep learning Withdrawn CN111860117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010494768.4A CN111860117A (en) 2020-06-03 2020-06-03 Human behavior recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010494768.4A CN111860117A (en) 2020-06-03 2020-06-03 Human behavior recognition method based on deep learning

Publications (1)

Publication Number Publication Date
CN111860117A true CN111860117A (en) 2020-10-30

Family

ID=72985499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010494768.4A Withdrawn CN111860117A (en) 2020-06-03 2020-06-03 Human behavior recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111860117A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112596024A (en) * 2020-12-04 2021-04-02 华中科技大学 Motion identification method based on environment background wireless radio frequency signal
CN112766420A (en) * 2021-03-12 2021-05-07 合肥共达职业技术学院 Human behavior identification method based on time-frequency domain information
CN112766420B (en) * 2021-03-12 2022-10-21 合肥共达职业技术学院 Human behavior identification method based on time-frequency domain information
CN116758479A (en) * 2023-06-27 2023-09-15 汇鲲化鹏(海南)科技有限公司 Coding deep learning-based intelligent agent activity recognition method and system
CN116758479B (en) * 2023-06-27 2024-02-02 汇鲲化鹏(海南)科技有限公司 Coding deep learning-based intelligent agent activity recognition method and system

Similar Documents

Publication Publication Date Title
CN109522793B (en) Method for detecting and identifying abnormal behaviors of multiple persons based on machine vision
CN106897670B (en) Express violence sorting identification method based on computer vision
WO2021184619A1 (en) Human body motion attitude identification and evaluation method and system therefor
CN108256433B (en) Motion attitude assessment method and system
Jalal et al. Shape and motion features approach for activity tracking and recognition from kinect video camera
CN110287844B (en) Traffic police gesture recognition method based on convolution gesture machine and long-and-short-term memory network
CN109241829B (en) Behavior identification method and device based on space-time attention convolutional neural network
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN108647644B (en) Coal mine blasting unsafe action identification and judgment method based on GMM representation
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN110070029B (en) Gait recognition method and device
CN108230291B (en) Object recognition system training method, object recognition method, device and electronic equipment
CN111860117A (en) Human behavior recognition method based on deep learning
CN102509085A (en) Pig walking posture identification system and method based on outline invariant moment features
CN111738218B (en) Human body abnormal behavior recognition system and method
CN106648078A (en) Multimode interaction method and system applied to intelligent robot
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN110458022A (en) It is a kind of based on domain adapt to can autonomous learning object detection method
CN114332911A (en) Head posture detection method and device and computer equipment
CN115761537A (en) Power transmission line foreign matter intrusion identification method oriented to dynamic characteristic supplement mechanism
CN113705445B (en) Method and equipment for recognizing human body posture based on event camera
CN105160285A (en) Method and system for recognizing human body tumble automatically based on stereoscopic vision
CN112926522B (en) Behavior recognition method based on skeleton gesture and space-time diagram convolution network
CN110163142B (en) Real-time gesture recognition method and system
CN111694980A (en) Robust family child learning state visual supervision method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201030)