CN110222634B - Human body posture recognition method based on convolutional neural network - Google Patents

Human body posture recognition method based on convolutional neural network

Info

Publication number
CN110222634B
CN110222634B (application CN201910481323.XA)
Authority
CN
China
Prior art keywords
neural network
human body
convolutional neural
body posture
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910481323.XA
Other languages
Chinese (zh)
Other versions
CN110222634A (en)
Inventor
李建
张袁
罗颖
张亦昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN201910481323.XA priority Critical patent/CN110222634B/en
Publication of CN110222634A publication Critical patent/CN110222634A/en
Application granted granted Critical
Publication of CN110222634B publication Critical patent/CN110222634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body posture recognition method based on a convolutional neural network. The method first acquires a human body posture data set and preprocesses it by cutting the videos into image frames; a convolutional neural network model is then built, and sparsity is introduced at the input of the RELU excitation function to reduce unnecessary excitation-function input; the traditional target loss function is then optimized in combination with a sparse term, and the network is trained by iteratively updating parameters to obtain an optimal solution; finally, according to the trained network model, the human body posture is recognized and the posture category is output. The beneficial effects of the invention are that the adopted method accelerates convergence and improves the generalization capability of the network while maintaining a high posture recognition rate.

Description

Human body posture recognition method based on convolutional neural network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a human body posture identification method based on a convolutional neural network.
Background
In recent years, with the development of information technology and the popularization of artificial intelligence technology, human body posture recognition technology has come into wide use. Relevant researchers have attempted to explore valid features and classify using collected body posture data sets. The traditional gesture recognition method mainly comprises two steps: (1) Extracting complex artificial features from an original input image; and (2) training a classifier from the acquired features.
In the conventional gesture recognition process, complex artificial features need to be extracted from the original input image. Although effective in terms of recognition accuracy, the features extracted from skeleton keypoints and depth image information tend to be high-dimensional because of the high complexity of the human body. Most depth images also require preprocessing, which makes feature extraction difficult, recognition inefficient, and convergence slow. The conventional method is therefore not optimal.
However, compared with the traditional local combined modeling method, the artificial neural network has nonlinear modeling and self-adaptive capacity, and can perform posture recognition by training a large number of static images and mining deep-level information of the images; the neural network method is used for representing the local characteristics of the objects in the image, and the classification is more robust. In addition, by introducing sparse regularization, unnecessary increase of excitation function input is reduced, the complexity of the model is reduced, and the generalization capability of the convolutional neural network is improved, so that the convergence speed of the model is improved while the higher recognition rate is ensured.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a human body posture recognition method based on a convolutional neural network, and through designing a convolutional neural network model introducing sparse regularization, the high recognition rate is ensured, the convergence rate is improved, and the generalization capability of the model is enhanced.
In order to achieve the purpose, the invention is realized by the following technical scheme:
in the training stage, a data-driven, self-learning complex network structure model is constructed, the original human body posture images and the corresponding posture labels are used as the input and output of the network, and the network is trained in a supervised manner; in the verification stage, given an unknown raw input image, posture recognition is performed.
A human body posture identification method based on a convolutional neural network comprises the following steps: s01, acquiring a human body posture video data set, preprocessing the human body posture video data set by cutting a video into image frames, and dividing the image data set cut into the image frames into a training set and a verification set;
s02, constructing a neural network model, introducing sparsity at the input of the RELU excitation function, inputting the convolutional neural network into the image preprocessed in the step S01, and outputting the image as a human body posture category; training the convolutional neural network;
s03, recognizing the human body posture by adopting the neural network model in the S02, and performing model training and performance testing on a public human body posture data set KTH; when an unknown video is input, firstly calling the step S01 to carry out preprocessing, and then carrying out posture recognition by using the neural network model in the step S02 to obtain the human body posture category.
In the above human body posture identification method based on the convolutional neural network, in the step S01, the acquiring of the human body posture video data set specifically includes the following steps:
s11, acquiring a public KTH human posture video data set;
s12, cutting the video into frames and storing the images of each frame;
s13, screening out, from the images, those showing a complete human body performing an action corresponding to a label, deleting blank images and images without a complete human body or a label-corresponding action, and classifying and marking the screened images; the postures corresponding to the labels comprise boxing, hand waving, hand clapping, walking, jogging and running; the classification marks are made according to these 6 postures;
s14, extracting a foreground, namely a moving human body, by using a Gaussian mixture model;
s15, carrying out normalization processing on the image screened in the step S13;
and S16, randomly dividing the image set normalized in step S15 into a training set and a verification set at a ratio of 8:2.
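The screening and splitting logic of steps S13–S16 can be sketched as follows. This is a minimal illustration, not the patented implementation: the file names, the `split_dataset` helper, and the fixed random seed are assumptions added for the example; the six labels and the 8:2 split follow the text.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Randomly split (image_path, label) pairs into training and
    validation sets at the 8:2 ratio of step S16."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# The six KTH posture labels used for classification marking (step S13).
KTH_LABELS = ["boxing", "handwaving", "handclapping", "walking", "jogging", "running"]

# Hypothetical frame files produced by step S12.
samples = [("frame_%04d.png" % i, KTH_LABELS[i % 6]) for i in range(150)]
train, val = split_dataset(samples)
print(len(train), len(val))  # 120 30
```

With 150 frames the 8:2 split yields 120 training and 30 validation images, matching the per-class counts reported later in the embodiment.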
In the above human body posture recognition method based on the convolutional neural network, in the step S02, the convolutional neural network includes 4 convolutional layers, 4 pooling layers, 2 full-connection layers, and 1 classification layer; the sizes of convolution kernels in the first two convolution layers are set to be 5 multiplied by 5, and the sizes of the last two convolution kernels are set to be 3 multiplied by 3; the sizes of all the pooling layers are set to be 2 multiplied by 2, and the pooling layers adopt maximum pooling; the number of convolution kernels in the first convolution layer is 32, the number of convolution kernels in the second convolution layer is 64, and the number of convolution kernels in the third convolution layer and the fourth convolution layer is 128; the number of neurons in the first fully-connected layer is set to be 1024, and the number of neurons in the second fully-connected layer is set to be 512; in the convolutional layer and the fully-connected layer, the RELU function is used as an activation function.
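The layer sizes above fix the shape and parameter count of the network. The sketch below walks through them, assuming a 64×64 grayscale input, 'same' convolution padding, and a 6-way soft-max output; none of these three assumptions is stated in the patent, so the resulting parameter count is illustrative only.

```python
# Walk through the architecture: 4 conv layers (5x5, 5x5, 3x3, 3x3 kernels
# with 32, 64, 128, 128 filters), each followed by 2x2 max pooling, then
# two fully-connected layers (1024, 512) and a 6-way classifier.
convs = [(5, 32), (5, 64), (3, 128), (3, 128)]

size, channels, params = 64, 1, 0   # assumed: 64x64 grayscale input
for kernel, filters in convs:
    params += kernel * kernel * channels * filters + filters  # conv weights + biases
    size //= 2                                                # 2x2 max pooling halves each side
    channels = filters

flat = size * size * channels       # flattened feature vector feeding FC1
params += flat * 1024 + 1024        # FC1: 1024 neurons
params += 1024 * 512 + 512          # FC2: 512 neurons
params += 512 * 6 + 6               # soft-max layer over the 6 KTH postures
print(size, flat, params)  # 4 2048 2899590
```

Under these assumptions the spatial size shrinks 64 → 32 → 16 → 8 → 4 through the four pooling layers, and the two fully-connected layers dominate the parameter count.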
The pooling layer down-samples the feature information of each layer and partitions the feature set. The computation is complex at initialization, when the sample size is large, but the amount of computation decreases with continuous updating, and local features are progressively aggregated into global features. For each feature map the maximum value is extracted; max pooling can extract more effective features than average pooling.
The full-connection layer collects the feature information of each layer, finally obtains the global features of the whole neural network, and plays a key role in subsequent sample image classification and identification.
The classification layer, the last layer in the network model, is used for the final decision. It computes the deviation between the network's prediction and the actual label. The soft-max function, used for multi-class classification, is applied in the classification layer.
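A soft-max layer of this kind maps the 6 class scores to a probability distribution. A minimal sketch (the logit values below are made up for illustration):

```python
import math

def softmax(logits):
    """Numerically stable soft-max: shift by the max logit before
    exponentiating, then normalize to a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])  # one score per posture class
print(round(sum(probs), 6))  # 1.0
```

The predicted posture category is simply the index of the largest probability.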
In the above human body posture identification method based on the convolutional neural network, the convolutional neural network is a special deep neural network comprising convolutional layers and pooling layers. The neurons of a convolutional layer are connected to the preceding layer through local connections and weight sharing, which reduces the number of training parameters.
Convolutional layers are the core of convolutional neural networks. Low-level image features are iteratively updated, step by step, into high-level image features. Each convolutional layer is composed of neurons. Suppose the n-th node O_{m,n} of the m-th convolutional layer has input values {x_{n-1,1}, x_{n-1,2}, …, x_{n-1,k}}; then the value of the output unit is as follows:

h_{w,b}(x) = f( Σ_{i=1}^{k} w_i·x_i + b )   (1)

wherein h_{w,b}(x) is the excitation value, f(x) is the excitation function, x_i is the i-th input value of the node, w_i is the weight of the i-th input value of the node, k is the number of input values of the current node, and b is a bias term;
the modified linear unit RELU is defined as follows:
f(x)=max(0,x) (2)
the modified linear unit RELU is a standard excitation function of the convolutional layer, provides a non-linear activation capability for the convolutional neural network, and does not interfere with other convolutional layers.
The above human body posture recognition method based on the convolutional neural network obtains the parameters of the convolutional neural network through training of the convolutional neural network, and specifically includes:
if there are p samples, denoted {(x_1, y_1), (x_2, y_2), …, (x_p, y_p)}, then for each sample the per-sample loss function is defined as follows:

J(w, b; x, y) = (1/2)·‖h_{w,b}(x) − y‖²   (3)

wherein h_{w,b}(x) is the predicted value after network training, namely the excitation value, y is the actual output value, and w is the weight of an input value;

for the whole sample set, the overall loss function is defined as follows:

L = (1/p)·Σ_{q=1}^{p} J(w, b; x_q, y_q) + (a/2)·Σ_{k=1}^{N−1} Σ_{i=1}^{s_k} Σ_{j=1}^{s_{k+1}} ( w_{ji}^{(k)} )²   (4)

wherein N is the total number of layers of the convolutional neural network, s_k is the number of nodes of layer k−1, a is the regularization coefficient, i is the i-th node, j is the j-th feature convolution kernel, k is the k-th layer of the convolutional neural network, and w_{ji}^{(k)} is the connection weight between the i-th node and the j-th feature convolution kernel of the k-th layer;
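The per-sample squared-error loss and the overall loss with the weight-decay term a can be computed directly; the sketch below uses toy predictions, targets, and weights that are not from the patent:

```python
def sample_loss(prediction, target):
    """Per-sample squared-error loss: 1/2 * ||h(x) - y||^2."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(prediction, target))

def total_loss(predictions, targets, weights, a):
    """Mean per-sample loss over the p samples plus the weight-decay
    term (a/2) * sum of all squared connection weights."""
    p = len(predictions)
    data_term = sum(sample_loss(h, y) for h, y in zip(predictions, targets)) / p
    decay_term = 0.5 * a * sum(w * w for layer in weights for w in layer)
    return data_term + decay_term

preds = [[0.9, 0.1], [0.2, 0.8]]   # toy network outputs
targs = [[1.0, 0.0], [0.0, 1.0]]   # toy one-hot labels
ws = [[0.5, -0.5], [1.0]]          # toy connection weights, grouped by layer
print(total_loss(preds, targs, ws, a=0.01))  # ≈ 0.0325
```

The regularization coefficient a trades off fitting the labels against keeping the connection weights small.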
the goal of convolutional neural network model training is to minimize the value of the overall sample set loss function;
the parameter update formula is as follows:
Figure RE-BDA0002083935200000054
Figure RE-BDA0002083935200000055
where L is the objective loss function, μ is the learning rate,
Figure RE-BDA0002083935200000056
indicating that x is derived over y, the parameters are updated using a small batch approach.
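The update rule "parameter minus learning rate times gradient" can be demonstrated on a toy one-weight regression problem. The data, the 0.05 learning rate, and the iteration count below are illustration choices only; the patent itself uses a 0.001 learning rate with the Adam optimizer.

```python
def sgd_step(w, b, grad_w, grad_b, lr):
    """One application of the update rule: parameter <- parameter - mu * gradient."""
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b = b - lr * grad_b
    return w, b

# Toy problem: learn w, b so that w*x + b fits y = 2*x on a small mini-batch.
w, b = [0.0], 0.0
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for _ in range(2000):
    # Gradients of the mean squared error over the mini-batch.
    gw = sum((w[0] * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum((w[0] * x + b - y) for x, y in batch) / len(batch)
    w, b = sgd_step(w, b, [gw], gb, lr=0.05)

print(round(w[0], 3), round(b, 3))
```

After repeated mini-batch updates the weight converges to 2 and the bias to 0, the minimizer of the loss.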
In the human body posture identification method based on the convolutional neural network, the learning rate is set to 0.001, and optimization is performed with the Adam optimizer (AdamOptimizer).
In the above human body posture recognition method based on the convolutional neural network, in step S02, sparsity is introduced at the input of the RELU excitation function, that is, at the output of the linear filter, specifically:
the input h_k of the k-th layer RELU excitation function in the convolutional neural network is expressed as:

h_k = W_k·x_{k−1} + b_k   (7)

wherein h_k is the input of the k-th layer RELU excitation function (the output of the linear filter, with weight matrix W_k, previous-layer output x_{k−1}, and bias b_k), and S(h_k) is the sparsity of the input h_k of the k-th layer RELU excitation function in the convolutional neural network;
defining an optimized objective function to determine convolutional neural network parameters, the objective function being defined as follows:
E = L + λ·Σ_k S(h_k)   (8)

wherein E is the optimized objective function, namely the final training objective of the invention; L is the objective function without sparsity introduced, namely the objective loss function; and λ is a tuning parameter that controls sparsity.
By introducing sparsity at the RELU input, unnecessary increase of RELU output can be prevented, unnecessary negative input of RELU can be reduced, and the model generalization capability can be improved.
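Equation (8) simply adds a weighted sparsity penalty to the base loss. The patent does not spell out the form of S(h_k); the sketch below uses the L1 norm, a common choice for activity sparsity, purely as an assumption:

```python
def l1_sparsity(h):
    """One possible sparsity measure for a layer's pre-activation vector h_k:
    the L1 norm sum_i |h_i|. The patent leaves the exact form of S open,
    so the L1 norm here is an illustrative assumption."""
    return sum(abs(v) for v in h)

def sparse_objective(loss, layer_inputs, lam):
    """Equation (8): E = L + lambda * sum_k S(h_k)."""
    return loss + lam * sum(l1_sparsity(h) for h in layer_inputs)

# Toy pre-activation vectors for two layers, plus a toy base loss of 1.2.
h_layers = [[0.5, -1.0, 0.0], [2.0, 0.0, -0.5]]
print(round(sparse_objective(1.2, h_layers, lam=0.01), 6))  # 1.2 + 0.01*(1.5 + 2.5) = 1.24
```

A larger λ pushes the optimizer toward smaller pre-activations, which under ReLU means more exact zeros in the network's outputs.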
The invention has the beneficial effects that:
(1) The human body posture is recognized by designing a convolution neural network introducing sparsity, so that the method is applied to man-machine interaction, behavior recognition, action classification, abnormal behavior detection, automatic driving and the like, the high recognition rate is ensured, the convergence rate is improved, and the generalization capability of a model is enhanced;
(2) The invention designs a new convolutional neural network model by combining the artificial neural network method mentioned in the background with sparse regularization, drawing on the advantages of both, and uses the model to classify and identify human body postures.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a block diagram of a convolutional neural network model structure proposed by the present invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further explained by combining the specific embodiments.
As shown in fig. 1, the human body posture recognition method provided by the present invention includes the following steps:
s01, acquiring a human body posture video KTH data set, wherein the data set has 6 postures of boxing, waving hands, clapping palms, jogging, running and jogging, performing video image frame cutting pretreatment on the data set, and dividing the image data set into a training set and a verification set;
s11, acquiring a KTH human posture video data set;
s12, cutting the video into frames and storing the images of each frame;
s13, screening out, from the images, those showing a complete human body performing a posture corresponding to a label, deleting blank images and images without a complete human body or a label-corresponding posture, and classifying and marking the screened images; the postures corresponding to the labels comprise boxing, hand waving, hand clapping, walking, jogging and running, these being the 6 postures found in the KTH human posture video data set; the classification marks are made according to these 6 postures;
s14, extracting a foreground, namely a moving human body, by using a Gaussian mixture model;
s15, normalizing the image;
and S16, randomly dividing the image set into a training set and a verification set at a ratio of 8:2; each posture class has 120 valid training images and 30 valid verification images.
S02, constructing a neural network model, and introducing sparsity at the input position of the RELU excitation function; the input of the convolution neural network is a preprocessed image, and the output is a human body gesture category; and training the convolutional neural network.
S21, constructing the neural network. The convolutional neural network architecture adopted by the invention is shown in Fig. 2; the network comprises 7 layers: 4 convolutional layers (each followed by a pooling layer), 2 fully-connected layers, and 1 classification layer. The kernels of the first two convolutional layers are 5×5 in size and those of the last two are 3×3. All pooling layers are 2×2 and use max pooling. The numbers of convolution kernels in the first, second, third, and fourth convolutional layers are 32, 64, 128, and 128, respectively. The first fully-connected layer has 1024 neurons and the second has 512. The RELU function is used as the activation function in the convolutional and fully-connected layers. The learning rate is set to 0.001 and optimization uses the Adam optimizer.
Suppose the n-th node O_{m,n} of the m-th convolutional layer has input values {x_{n-1,1}, x_{n-1,2}, …, x_{n-1,k}}; then the value of the output unit is as follows:

h_{w,b}(x) = f( Σ_{i=1}^{k} w_i·x_i + b )   (1)

wherein h_{w,b}(x) is the excitation value, f(x) is the excitation function, x_i is the i-th input value of the node, w_i is the weight of the i-th input value of the node, k is the number of input values of the current node, and b is a bias term;
the modified linear unit RELU is defined as follows:
f(x)=max(0,x) (2)
the modified linear unit RELU is a standard excitation function of the convolutional layer, provides a non-linear activation capability for the convolutional neural network, and does not interfere with other convolutional layers.
Through training of the convolutional neural network, parameters of the convolutional neural network are obtained, and the parameters are specifically as follows:
if there are p samples, denoted {(x_1, y_1), (x_2, y_2), …, (x_p, y_p)}, then for each sample the per-sample loss function is defined as follows:

J(w, b; x, y) = (1/2)·‖h_{w,b}(x) − y‖²   (3)

wherein h_{w,b}(x) is the predicted value after network training, namely the excitation value, y is the actual output value, and w is the weight of an input value;

for the whole sample set, the overall loss function is defined as follows:

L = (1/p)·Σ_{q=1}^{p} J(w, b; x_q, y_q) + (a/2)·Σ_{k=1}^{N−1} Σ_{i=1}^{s_k} Σ_{j=1}^{s_{k+1}} ( w_{ji}^{(k)} )²   (4)

wherein N is the total number of layers of the convolutional neural network, s_k is the number of nodes of layer k−1, a is the regularization coefficient, i is the i-th node, j is the j-th feature convolution kernel, k is the k-th layer of the convolutional neural network, and w_{ji}^{(k)} is the connection weight between the i-th node and the j-th feature convolution kernel of the k-th layer;
the goal of convolutional neural network model training is to minimize the value of the overall sample set loss function;
the parameter update formulas are as follows:

w_i ← w_i − μ·∂L/∂w_i   (5)

b ← b − μ·∂L/∂b   (6)

wherein L is the objective loss function, μ is the learning rate, and ∂x/∂y denotes the partial derivative of x with respect to y; the parameters are updated using a mini-batch approach.
Sparsity is introduced at the input of the RELU excitation function, i.e., at the output of the linear filter; layers whose sparsity S(h_k) is greater than 0.6 are processed. Introducing sparsity at the RELU input prevents unnecessary growth of the RELU output, reduces unnecessary negative inputs to the RELU, and improves the generalization capability of the model.
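The 0.6 threshold above can be applied per layer. Since the patent does not specify how S(h_k) is measured, the sketch below scores a layer by the fraction of its pre-activation values that ReLU would zero out — an assumption for illustration only:

```python
def negative_fraction(h):
    """Fraction of pre-activation values that ReLU would zero out; used
    here as a stand-in sparsity score for S(h_k)."""
    return sum(1 for v in h if v <= 0) / len(h)

def layers_to_regularize(layer_inputs, threshold=0.6):
    """Indices of layers whose sparsity score exceeds the threshold."""
    return [k for k, h in enumerate(layer_inputs) if negative_fraction(h) > threshold]

h_layers = [
    [0.5, 0.2, -0.1, 0.9],    # mostly positive: score 0.25, below the threshold
    [-1.0, -0.3, -0.2, 0.4],  # score 0.75, exceeds 0.6
]
print(layers_to_regularize(h_layers))  # [1]
```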
At the RELU excitation function input, i.e. at the output of the linear filter, sparsity is introduced, specifically:
the input h_k of the k-th layer RELU excitation function in the convolutional neural network is expressed as:

h_k = W_k·x_{k−1} + b_k   (7)

wherein h_k is the input of the k-th layer RELU excitation function (the output of the linear filter, with weight matrix W_k, previous-layer output x_{k−1}, and bias b_k), and S(h_k) is the sparsity of the input h_k of the k-th layer RELU excitation function in the convolutional neural network;
defining an optimized objective function to determine convolutional neural network parameters, the objective function being defined as follows:
E=L+λ∑kS(hk) (8)
where E is the optimized objective function, L is the objective function without introducing sparsity, i.e. the objective loss function, and λ is the tuning parameter controlling sparsity.
And S03, recognizing the human body posture by adopting the neural network model in the S02, and performing model training and performance testing on the public human body posture data set KTH. When a new unknown video is input, preprocessing is firstly carried out through S01, and then the posture is judged through network prediction.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Those of skill would further appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that such implementation decisions may be made by those skilled in the art using various means for implementing the functions described herein without departing from the scope of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. The industry has described only the principles of the invention and is therefore intended to encompass within its spirit and scope all such changes and modifications as fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. A human body posture identification method based on a convolutional neural network is characterized by comprising the following steps:
s01, acquiring a human body posture video data set, preprocessing the human body posture video data set by cutting a video into image frames, and dividing the image data set cut into the image frames into a training set and a verification set;
s02, constructing a neural network model, introducing sparsity at the input of the RELU excitation function, inputting the convolution neural network into the image preprocessed in the step S01, and outputting the image as a human body posture category; training the convolutional neural network;
s03, recognizing the human body posture by adopting the neural network model in the S02, and performing model training and performance testing on a public human body posture data set KTH; when an unknown video is input, firstly calling the step S01 to carry out preprocessing, and then carrying out posture recognition by using the neural network model in the step S02 to obtain the human body posture category;
in said step S02, sparsity is introduced at the RELU excitation function input, i.e. at the output of the linear filter, specifically:
the input h_k of the k-th layer RELU excitation function in the convolutional neural network is expressed as:

h_k = W_k·x_{k−1} + b_k

wherein h_k is the input of the k-th layer RELU excitation function (the output of the linear filter, with weight matrix W_k, previous-layer output x_{k−1}, and bias b_k), and S(h_k) is the sparsity of the input h_k of the k-th layer RELU excitation function in the convolutional neural network;
defining an optimized objective function to determine convolutional neural network parameters, the objective function being defined as follows:
E = L + λ·Σ_k S(h_k)
where E is the optimized objective function, L is the objective function without introducing sparsity, i.e. the objective loss function, and λ is the tuning parameter controlling sparsity.
2. The method for recognizing the human body posture based on the convolutional neural network as claimed in claim 1, characterized in that: in step S01, the acquiring of the human body posture video data set specifically includes the following steps:
s11, acquiring a public KTH human posture video data set;
s12, cutting the video into frames and storing the images of each frame;
s13, screening out, from the images, those showing a complete human body performing a posture corresponding to a label, deleting blank images and images without a complete human body or a label-corresponding posture, and classifying and marking the screened images; the postures corresponding to the labels comprise boxing, hand waving, hand clapping, walking, jogging and running; the classification marks are made according to these 6 postures;
s14, extracting a foreground, namely a moving human body, by using a Gaussian mixture model;
s15, carrying out normalization processing on the image screened in the step S13;
and S16, randomly dividing the image set normalized in step S15 into a training set and a verification set at a ratio of 8:2.
3. The method for recognizing the human body posture based on the convolutional neural network as claimed in claim 1, wherein: in the step S02, the convolutional neural network includes 4 convolutional layers, 4 pooling layers, 2 fully-connected layers, and 1 classification layer; the sizes of convolution kernels in the first two convolution layers are set to be 5 multiplied by 5, and the sizes of the last two convolution kernels are set to be 3 multiplied by 3; the size of all the pooling layers is set to be 2 multiplied by 2, and the pooling layers adopt maximum pooling; the number of convolution kernels in the first convolution layer is 32, the number of convolution kernels in the second convolution layer is 64, and the number of convolution kernels in the third convolution layer and the fourth convolution layer is 128; the number of neurons in the first fully-connected layer is set to 1024, and the number of neurons in the second fully-connected layer is set to 512; in the convolutional layer and the fully-connected layer, the RELU function is used as an activation function.
4. The method for recognizing the human body posture based on the convolutional neural network as claimed in claim 3, characterized in that: suppose the n-th node O_{m,n} of the m-th convolutional layer has input values {x_{n-1,1}, x_{n-1,2}, …, x_{n-1,k}}; then the value of the output unit is as follows:

h_{w,b}(x) = f( Σ_{i=1}^{k} w_i·x_i + b )

wherein h_{w,b}(x) is the excitation value, f(x) is the excitation function, x_i is the i-th input value of the node, w_i is the weight of the i-th input value of the node, k is the number of input values of the current node, and b is a bias term;
the rectified linear unit ReLU is defined as follows:

f(x) = max(0, x)

the rectified linear unit ReLU is the standard excitation function of the convolutional layers; it provides nonlinear activation capability for the convolutional neural network and does not interfere with the other convolutional layers.
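The node computation of claim 4 can be sketched directly, with ReLU as the excitation function f; the example values are arbitrary.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

def node_output(xs, ws, b):
    """Excitation value h_{w,b}(x) = f(sum_i w_i * x_i + b), with ReLU as f."""
    return relu(sum(w * x for w, x in zip(ws, xs)) + b)

# Weighted sum 0.5*1.0 + 0.25*(-2.0) + (-1.0)*3.0 + 0.5 = -2.5, clipped to 0 by ReLU.
print(node_output([1.0, -2.0, 3.0], [0.5, 0.25, -1.0], 0.5))  # 0.0
```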
5. The human body posture recognition method based on the convolutional neural network as claimed in claim 4, characterized in that: through training of the convolutional neural network, parameters of the convolutional neural network are obtained, and the parameters are specifically as follows:
if there are p samples, denoted as {(x_1, y_1), (x_2, y_2), …, (x_p, y_p)}, the loss function for each sample is defined as follows:

J(w,b; x, y) = \frac{1}{2} \left\| h_{w,b}(x) - y \right\|^2

wherein h_{w,b}(x) is the predicted value after network training, namely the excitation value, y is the actual output value, and w is the weight of an input value;
for the whole sample set, the overall loss function is defined as follows:

J(w,b) = \frac{1}{p} \sum_{i=1}^{p} J(w,b; x_i, y_i) + \frac{a}{2} \sum_{k=1}^{N-1} \sum_{i=1}^{s_k} \sum_{j=1}^{s_{k+1}} \left( W_{ji}^{(k)} \right)^2

where N is the total number of layers of the convolutional neural network, s_k is the number of nodes of layer k, a is the regularization coefficient, i indexes the node, j indexes the feature convolution kernel, k indexes the layer, and W_{ji}^{(k)} is the connection weight between the i-th node of the k-th layer of the convolutional neural network and the j-th feature convolution kernel;
the goal of convolutional neural network model training is to minimize the value of the overall sample set loss function;
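A sketch of the two loss functions of claim 5. The value of the regularization coefficient `a` is an assumption (the claim names the coefficient but gives no value), and the example prediction/target vectors are arbitrary illustrations over the six posture classes.

```python
def sample_loss(h, y):
    """Per-sample loss J(w,b; x, y) = (1/2) * ||h_{w,b}(x) - y||^2."""
    return 0.5 * sum((hi - yi) ** 2 for hi, yi in zip(h, y))

def total_loss(preds, targets, weights, a=0.0005):
    """Mean sample loss plus the L2 weight penalty (a/2) * sum of squared weights.

    'a' = 0.0005 is an assumed value for illustration only.
    """
    data = sum(sample_loss(h, y) for h, y in zip(preds, targets)) / len(preds)
    reg = 0.5 * a * sum(w * w for w in weights)
    return data + reg

# A one-hot target vs. a softmax-like prediction over the 6 postures.
h = [0.7, 0.1, 0.05, 0.05, 0.05, 0.05]
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(sample_loss(h, y))  # 0.055 (up to floating-point rounding)
```

Training then searches for the weights and biases that minimize `total_loss` over the whole sample set.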
the parameter update formulas are as follows:

W_{ji}^{(k)} := W_{ji}^{(k)} - \mu \frac{\partial L}{\partial W_{ji}^{(k)}}

b_{i}^{(k)} := b_{i}^{(k)} - \mu \frac{\partial L}{\partial b_{i}^{(k)}}

where L is the objective loss function, μ is the learning rate, and ∂x/∂y denotes the partial derivative of x with respect to y; the parameters are updated using a mini-batch approach.
6. The human body posture recognition method based on the convolutional neural network as claimed in claim 5, characterized in that: the learning rate is set to 0.001 and the network is optimized using the Adam optimization algorithm.
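A scalar sketch of one Adam update step. The learning rate 0.001 matches claim 6; the remaining hyperparameters (beta1, beta2, eps) are Adam's customary defaults and are an assumption here, as the claim does not state them.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at time step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta=0 with gradient 1: moves by roughly the learning rate.
theta, m, v = adam_step(0.0, 1.0, 0.0, 0.0, t=1)
print(round(theta, 6))  # -0.001
```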
CN201910481323.XA 2019-06-04 2019-06-04 Human body posture recognition method based on convolutional neural network Active CN110222634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910481323.XA CN110222634B (en) 2019-06-04 2019-06-04 Human body posture recognition method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN110222634A CN110222634A (en) 2019-09-10
CN110222634B true CN110222634B (en) 2022-11-01

Family

ID=67819290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910481323.XA Active CN110222634B (en) 2019-06-04 2019-06-04 Human body posture recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN110222634B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688933A (en) * 2019-09-23 2020-01-14 中国计量大学 Novel convolutional neural network and weighted assignment human body posture estimation algorithm
CN110688980B (en) * 2019-10-12 2023-04-07 南京工程学院 Human body posture classification method based on computer vision
CN110929242B (en) * 2019-11-20 2020-07-10 上海交通大学 Method and system for carrying out attitude-independent continuous user authentication based on wireless signals
CN111178237A (en) * 2019-12-27 2020-05-19 上海工程技术大学 Pavement state recognition method
CN111753736A (en) * 2020-06-24 2020-10-09 北京软通智慧城市科技有限公司 Human body posture recognition method, device, equipment and medium based on packet convolution
CN111950496B (en) * 2020-08-20 2023-09-15 广东工业大学 Mask person identity recognition method
CN112801176B (en) * 2021-01-25 2022-12-06 西安交通大学 Deep learning identification method for helicopter flight attitude imbalance data
CN113269047B (en) * 2021-04-29 2024-03-22 江苏大学 Three-dimensional human body posture estimation method based on convolutional neural network and spark
CN113850565B (en) * 2021-09-24 2022-06-07 广东诚誉工程咨询监理有限公司 Maturity model-based overall process consultation project management monitoring system and method
CN117649631B (en) * 2024-01-29 2024-04-05 广州宇中网络科技有限公司 Client image processing method and system based on improved convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096506A (en) * 2016-05-28 2016-11-09 重庆大学 Based on the SAR target identification method differentiating doubledictionary between subclass class
CN106529442A (en) * 2016-10-26 2017-03-22 清华大学 Pedestrian identification method and apparatus
CN107169117A (en) * 2017-05-25 2017-09-15 西安工业大学 A kind of manual draw human motion search method based on autocoder and DTW
CN109731302A (en) * 2019-01-22 2019-05-10 深圳职业技术学院 Athletic posture recognition methods, device and electronic equipment


Similar Documents

Publication Publication Date Title
CN110222634B (en) Human body posture recognition method based on convolutional neural network
CN111079639B (en) Method, device, equipment and storage medium for constructing garbage image classification model
Thoma Analysis and optimization of convolutional neural network architectures
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN107526785A (en) File classification method and device
CN109993100B (en) Method for realizing facial expression recognition based on deep feature clustering
CN110321967B (en) Image classification improvement method based on convolutional neural network
CN110276248B (en) Facial expression recognition method based on sample weight distribution and deep learning
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN112734775A (en) Image annotation, image semantic segmentation and model training method and device
US20210158166A1 (en) Semi-structured learned threshold pruning for deep neural networks
CN106845510A (en) Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features
CN107223260B (en) Method for dynamically updating classifier complexity
CN111783841A (en) Garbage classification method, system and medium based on transfer learning and model fusion
US20220121949A1 (en) Personalized neural network pruning
CN111832580B (en) SAR target recognition method combining less sample learning and target attribute characteristics
KR20200071865A (en) Image object detection system and method based on reduced dimensional
CN114548256A (en) Small sample rare bird identification method based on comparative learning
CN110163206B (en) License plate recognition method, system, storage medium and device
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN111694954A (en) Image classification method and device and electronic equipment
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning
CN114067314B (en) Neural network-based peanut mildew identification method and system
CN114463574A (en) Scene classification method and device for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Jian

Inventor after: Zhang Yuan

Inventor after: Luo Ying

Inventor after: Zhang Yixin

Inventor before: Zhang Yuan

Inventor before: Li Jian

Inventor before: Luo Ying

Inventor before: Zhang Yixin

GR01 Patent grant