CN111291593A - Method for detecting human body posture - Google Patents

Method for detecting human body posture

Info

Publication number
CN111291593A
Authority
CN
China
Prior art keywords
human body
neural network
layer
image
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811492525.6A
Other languages
Chinese (zh)
Other versions
CN111291593B (en)
Inventor
黄超
徐滢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Pinguo Technology Co Ltd
Original Assignee
Chengdu Pinguo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Pinguo Technology Co Ltd filed Critical Chengdu Pinguo Technology Co Ltd
Priority to CN201811492525.6A priority Critical patent/CN111291593B/en
Publication of CN111291593A publication Critical patent/CN111291593A/en
Application granted granted Critical
Publication of CN111291593B publication Critical patent/CN111291593B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting human body posture, comprising the following steps: inputting a preprocessed human body image to be detected into a pre-trained neural network model and obtaining a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension transformation layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer, and a second convolutional neural network layer; all convolution operations in the neural network model are separable convolutions; obtaining the predetermined number of human body joint point coordinates from the predetermined number of heatmaps; and scaling each human body joint point coordinate back to the image coordinate system of the human body image to be detected to obtain the human body posture joint points of that image. The technical solution provided by the invention can detect human body posture in real time on terminals with small memory and limited CPU and GPU computing power.

Description

Method for detecting human body posture
Technical Field
The invention relates to the technical field of deep learning, in particular to a method for detecting human body postures.
Background
Human body posture detection is now applied in many fields: in the security field it can be used to recognize human behavior, and in the game and entertainment field it can make games more engaging. Ultimately, detecting human posture comes down to detecting the human posture joint points.
At present there are two main approaches to detecting human posture joint points. The first is direct joint regression, in which a network model directly outputs the human posture joint points. The second is heatmap regression, in which a network model produces a set of heatmaps that are then processed to obtain the final joint point coordinates, with one heatmap corresponding to one joint point. Because human pose, clothing, and image background vary greatly, directly regressing joint coordinates usually works poorly, and such network models are hard to train, so it is difficult to converge to a usable model. Heatmap regression performs better than direct regression, but its network structure is complex and the network model is huge, so training remains difficult and the model cannot run on terminals with small memory and limited CPU or GPU computing power, which greatly limits the application and popularization of human posture detection.
Disclosure of Invention
The invention aims to provide a method for detecting human body posture that can detect the posture in real time on terminals with small memory and limited CPU or GPU computing power.
To achieve this aim, the invention adopts the following technical solution:
a method for detecting human body posture, comprising: inputting a preprocessed human body image to be detected into a pre-trained neural network model and obtaining a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension transformation layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer, and a second convolutional neural network layer; all convolution operations in the neural network model are separable convolutions; acquiring the predetermined number of human body joint point coordinates from the predetermined number of heatmaps; and scaling each human body joint point coordinate back to the image coordinate system of the human body image to be detected, to acquire the human body posture joint points of the human body image to be detected.
Preferably, training the neural network model comprises: annotating a pre-acquired original training image with a human body bounding box and joint points; cropping the original training image according to the human body bounding box to obtain a cropped image; scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image; converting the joint point coordinates annotated in the original training image into coordinates in the training input image and generating a ground truth with a two-dimensional Gaussian distribution function; and training the neural network model with the training input image and the ground truth.
Preferably, the size of the training input image is 240 × 192 and the size of the ground truth is 60 × 48.
Preferably, the loss function of the neural network model adopts a mean square loss function:
loss(x, y) = (x - y)²
wherein x is the predicted value of the neural network model, and y is the ground truth value.
Further, the method also includes: in the process of training the neural network model, optimizing the neural network model with the Adam optimizer.
Preferably, the first upsampling layer and the second upsampling layer both use 2× upsampling, and the first convolutional neural network layer and the second convolutional neural network layer are both 3 × 3 convolutional neural networks.
Preferably, the pre-trained neural network model is run on a mobile terminal; and the human body image to be detected is acquired by the mobile terminal.
The method for detecting human body posture provided by the embodiment of the invention abandons existing complex pose-detection network models and defines a simple, efficient neural network model in which all convolutions are separable convolutions. The simplified network structure and the use of separable convolutions greatly reduce the computation required by the neural network model, greatly shrink the model itself, and make training easier. Compared with the prior art, the technical solution provided by the invention runs smoothly on mobile terminals with small memory and limited CPU and GPU computing power and achieves real-time detection of human body posture.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a neural network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the first 14-layer network structure of MobileNetV2 in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a bottleneck network of MobileNetV2 according to an embodiment of the present invention;
FIG. 5 is a visual representation of a hotspot graph in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention defines a simple and efficient deep neural network model that must run efficiently on a mobile terminal. Therefore, in the embodiment of the invention, to enable efficient forward inference, the input image height and width are set to 240 × 192 and the output heatmap height and width are defined as 60 × 48.
Much experimental work on deep neural networks shows that deeper networks can extract more specific high-dimensional features and perform better, but deeper networks are also harder to train, because vanishing gradients can prevent training from converging, and a very deep network is not suited to running on a mobile terminal. The invention therefore defines a simple and efficient CNN (Convolutional Neural Network) residual network for forward inference. A CNN extracts different features at different layers, but the higher the layer, the more heavily the features are downsampled; a residual structure fuses low-dimensional and high-dimensional features, so the multi-layer CNN design can repeatedly capture the information contained in the input image at different scales and obtain a better feature extraction result.
Training the custom neural network model includes: annotating a pre-acquired original training image with a human body bounding box and joint points; cropping the original training image according to the bounding box to obtain a cropped image; scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image; converting the joint point coordinates annotated in the original training image into coordinates in the training input image and generating a ground truth with a two-dimensional Gaussian distribution function; and training the neural network model with the training input image and the ground truth. The trained neural network model can then detect the human posture joint points in an input image. During this process, common augmentation operations such as mirroring, rotation, scaling, and color perturbation of the image (for example, increasing or reducing contrast and saturation) can be applied to the training data, together with normalization and regularization. A sketch of the ground-truth heatmap generation is given below.
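As an illustration, the following is a minimal sketch of rendering one annotated joint as a two-dimensional Gaussian on the 60 × 48 ground-truth heatmap, assuming NumPy; the function name and the Gaussian sigma value are assumptions, since the patent does not specify them.

    import numpy as np

    def joint_heatmap(joint_xy, heatmap_size=(60, 48), input_size=(240, 192), sigma=2.0):
        """Render one joint as a 2D Gaussian on a ground-truth heatmap.

        joint_xy: (x, y) coordinate of the joint in the 240x192 training input image.
        The 1/4 scale between the 240x192 input and the 60x48 heatmap follows the
        sizes stated in the text; sigma is an assumed value.
        """
        h, w = heatmap_size
        scale_y = h / input_size[0]          # 60 / 240 = 0.25
        scale_x = w / input_size[1]          # 48 / 192 = 0.25
        cx, cy = joint_xy[0] * scale_x, joint_xy[1] * scale_y
        ys, xs = np.mgrid[0:h, 0:w]
        return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))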
The loss function of the neural network model in the embodiment of the invention adopts a mean square loss function:
loss(x, y) = (x - y)²
wherein x is the predicted value of the neural network model and y is the ground truth value. That is, the squared difference between x and y is compared pixel by pixel to measure how far the prediction deviates from the ground truth; the smaller this value, the better. A small numerical check is sketched below.
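As a quick check (a sketch assuming PyTorch, which the surrounding references to PixelShuffle, Adam, and ONNX suggest but the patent does not name), nn.MSELoss with its default mean reduction is exactly the pixel-wise average of (x - y)²:

    import torch
    import torch.nn as nn

    # Random stand-ins for predicted and ground-truth heatmaps (17 joints, 60x48).
    x = torch.rand(1, 17, 60, 48)
    y = torch.rand(1, 17, 60, 48)
    # nn.MSELoss() averages the per-pixel squared differences.
    assert torch.allclose(nn.MSELoss()(x, y), ((x - y) ** 2).mean())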
In the process of training the neural network model, the Adam optimizer is used; Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure and iteratively updates the network weights based on the training data. All convolution operations in the neural network model are separable convolutions, which reduces the amount of computation and shrinks the model, as sketched below.
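A minimal sketch of a depthwise separable 3 × 3 convolution, written in PyTorch as an assumption (the patent does not name a framework); the class name and layer hyperparameters are illustrative only:

    import torch.nn as nn

    class SeparableConv2d(nn.Module):
        """Depthwise separable 3x3 convolution: a per-channel (depthwise) 3x3
        convolution followed by a 1x1 pointwise convolution across channels."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                       padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))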
If the model is to be used on a mobile terminal, it first needs to be converted into the ONNX (Open Neural Network Exchange) format, and the ONNX model is then converted into the network model format of the mobile inference framework, such as Apple's CoreML, Caffe2, or a model format supported by another third-party feed-forward inference framework. In other words, ONNX is an intermediate model format: as long as the forward inference framework provides tools that support ONNX conversion, the model can be converted into the format it requires. On the mobile phone, camera data are obtained through the camera API (Application Programming Interface) provided by the mobile terminal, and the camera video frames are scaled to the specified size. It is assumed by default that only one person appears in the camera data, so human body bounding-box detection can be omitted, which saves a large amount of time for posture detection. The camera frame is scaled directly to the input size required by the network, i.e., a height and width of 240 × 192; the image content may be slightly stretched or compressed, but this does not noticeably affect a robust neural network. The scaled camera frame is fed into the trained neural network model that has been converted to the mobile terminal to detect the human posture and obtain the predicted heatmaps. The predicted heatmaps are then processed by traversing each heatmap to find its maximum value, which gives the coordinate of the corresponding human posture joint point, as sketched below.
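The post-processing step, taking the argmax of each heatmap and scaling back to the original image, could look like the following sketch, assuming NumPy; the function name and the assumption that the whole frame was used as network input are illustrative:

    import numpy as np

    def heatmaps_to_joints(heatmaps, orig_size, heatmap_size=(60, 48)):
        """Convert predicted heatmaps (N, 60, 48) to joint coordinates in the
        original image: for each heatmap, take the location of its maximum
        value and rescale it to the original image coordinate system."""
        orig_h, orig_w = orig_size
        joints = []
        for hm in heatmaps:
            y, x = np.unravel_index(np.argmax(hm), hm.shape)
            joints.append((x * orig_w / heatmap_size[1],
                           y * orig_h / heatmap_size[0]))
        return joints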
The specific structure of the neural network model defined by the present invention is described below:
as shown in fig. 2, the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension transformation layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer, and a second convolutional neural network layer. Fig. 3 shows the structure of the first 14 layers of MobileNetV2, where t is the channel expansion factor, c is the number of output channels, and n is the number of repetitions of the bottleneck structure. There are 5 groups of bottleneck structures, and the feature map produced after each group becomes smaller, reflecting the idea that lower network layers extract abstract features while higher layers extract more specific features; s is the stride used by the filters in the CNN. The assumed layer configuration is listed below.
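For reference, the standard MobileNetV2 configuration that matches a "first 14 layers" with the (96, 15, 12) output reported later for a 240 × 192 input (total stride 16) is sketched here as an assumption; the patent itself only refers to FIG. 3:

    # Initial layer: standard 3x3 convolution, 32 output channels, stride 2.
    # Then 5 groups of bottlenecks (t = expansion factor, c = output channels,
    # n = repetitions, s = stride of the first bottleneck in each group):
    MOBILENETV2_FIRST_14_BOTTLENECKS = [
        #  t,  c, n, s
        (1, 16, 1, 1),
        (6, 24, 2, 2),
        (6, 32, 3, 2),
        (6, 64, 4, 2),
        (6, 96, 3, 1),
    ]
    # 1 conv + (1 + 2 + 3 + 4 + 3) bottlenecks = 14 layers; total stride 2*2*2*2 = 16,
    # so a 3x240x192 input yields a 96x15x12 feature map.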
Fig. 4 shows the structure of a MobileNetV2 bottleneck. A bottleneck is a bottleneck network structure: internally it first expands the dimension, then applies a CNN convolution, and finally reduces the dimension again, repeatedly extracting feature data. Whether a shortcut connection is used depends on the value of s and on whether the input and output channels are equal. Note that when n > 1, only the first bottleneck of each group uses the listed s value and the remaining repetitions use s = 1; when s is 1 and the input dimension equals the output dimension, the block has a shortcut connection, i.e., the residual network idea. A sketch of such a block is given below.
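A minimal PyTorch sketch of such an inverted-residual bottleneck; implementation details such as ReLU6 and BatchNorm placement are assumptions that follow the public MobileNetV2 design rather than anything stated in the patent:

    import torch.nn as nn

    class InvertedResidual(nn.Module):
        """MobileNetV2-style bottleneck: 1x1 expansion, 3x3 depthwise conv,
        1x1 projection, with a shortcut when stride == 1 and in_ch == out_ch."""
        def __init__(self, in_ch, out_ch, stride, expand):
            super().__init__()
            hidden = in_ch * expand
            self.use_shortcut = stride == 1 and in_ch == out_ch
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, hidden, 1, bias=False),            # expand
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, 3, stride, 1,
                          groups=hidden, bias=False),                # depthwise
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, out_ch, 1, bias=False),            # project
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            out = self.block(x)
            return x + out if self.use_shortcut else out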
After the input data passes through the first 14 layers of the MobileNetV2 network and the dimension transformation layer, it enters the pose joint point feature extraction and upsampling network. The input features of this network are the output of the previous layer, and the network first upsamples the feature height and width by a factor of 2. For example, if the input at this point has shape (r²C, H, W), upsampling by a factor of r produces (C, rH, rW), where r is the upsampling factor; with r = 2, the first upsampling layer (PixelShuffle) divides the number of channels by r² and enlarges H and W by a factor of r.
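A quick shape check of this PixelShuffle behaviour, as a PyTorch sketch matching the (512, 15, 12) to (128, 30, 24) example given later in the text:

    import torch
    import torch.nn as nn

    # PixelShuffle with r = 2: (r^2 * C, H, W) -> (C, r*H, r*W).
    x = torch.randn(1, 512, 15, 12)
    up = nn.PixelShuffle(upscale_factor=2)
    print(up(x).shape)   # torch.Size([1, 128, 30, 24])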
After the first upsampling layer, the data passes through a 3 × 3 Conv (the first convolutional neural network layer) to extract features again, followed by Batch Normalization (BN) regularization and a ReLU activation function that gives the data more expressive power, and then a second upsampling layer. This step reduces the number of channels and makes the high-level feature representation of the output more pronounced. The second convolutional neural network layer, another 3 × 3 Conv, produces the final predicted heatmap output, with its output channel count set to the number of joint points to be predicted; this completes the construction of the whole human posture joint point network. All convolution operations in the network are separable convolutions: each channel is first convolved separately with its own filter, and the resulting channel feature maps are then combined with a standard 1 × 1 cross-channel convolution. These two steps reduce the parameter count of a conventional convolution to roughly one ninth, greatly shrinking the model, and the reduced parameter count also greatly reduces the amount of computation. The upsampling head is sketched below.
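Putting the pieces together, the upsampling head described above might look like the following PyTorch sketch; channel sizes follow the worked example later in the text (512 -> 128 -> 256 -> 64 -> N), and standard convolutions are used here for brevity even though the patent specifies separable convolutions:

    import torch.nn as nn

    def pose_head(in_ch=512, num_joints=17):
        """PixelShuffle(2) -> 3x3 conv -> BN -> ReLU -> PixelShuffle(2) -> 3x3 conv,
        producing one heatmap per joint."""
        return nn.Sequential(
            nn.PixelShuffle(2),                            # (in_ch, 15, 12) -> (in_ch//4, 30, 24)
            nn.Conv2d(in_ch // 4, 256, 3, padding=1),      # first 3x3 conv layer
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.PixelShuffle(2),                            # (256, 30, 24) -> (64, 60, 48)
            nn.Conv2d(64, num_joints, 3, padding=1),       # final heatmap prediction
        )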
To better illustrate the whole network flow, the data flow of the whole network is illustrated here as an example:
The input is image data of shape (3, 240, 192) that has been cropped, scaled, padded, normalized, and augmented; 3 indicates 3 channels, 240 is the image height, and 192 is the image width. After the first 14 layers of the MobileNetV2 network, the feature output is (96, 15, 12). A dimension transformation layer then expands (96, 15, 12) to (512, 15, 12): a 1 × 1 convolution raises the dimension to 512, followed by Batch Normalization and ReLU6. The dimension is expanded to increase the expressive power of the feature data and to match the input height and width of the subsequent pose joint feature extraction network. Here 512 is the number of channel dimensions, 15 the feature height, and 12 the feature width.
The output (512, 15, 12) is fed into the pose joint feature extraction network. After the first PixelShuffle upsampling layer, the output is (128, 30, 24): the channel dimension is reduced while the height and width are enlarged. PixelShuffle is used for upsampling because, when enlarging an image from low to high resolution, the interpolation parameters are implicitly contained in the preceding convolutional layer and can be learned automatically; PixelShuffle itself only rearranges pixels, so it is very efficient.
The output (128, 30, 24) is fed into the subsequent network. After the first 3 × 3 convolutional layer, whose output dimension is set to 256, the BN regularization layer, the ReLU activation layer, and a second PixelShuffle upsampling, the network output is (64, 60, 48). A final 3 × 3 convolution with stride 1 and padding 1 then produces the feature heatmap output (N, 60, 48), where N is the number of joint points, 60 is the heatmap output height defined earlier, and 48 is the heatmap output width defined earlier. If N is defined as 17, 17 joint points are output: the second PixelShuffle layer outputs 64 channels, the final 3 × 3 convolutional layer takes 64 channels as input, and its output channel dimension is N, i.e., 17 heatmaps of size 60 × 48.
The training process of the neural network model is as follows. Using the pre-annotated human body bounding box and joint point data, the corresponding single-person region is cropped out, scaled and padded to the defined input size, and data augmentation, normalization, and regularization are applied. The annotated human joint point coordinates are converted into the coordinate system of the final 240 × 192 input image, and a ground truth is generated with a two-dimensional Gaussian distribution function. Each joint point generates one heatmap, so 17 joint points generate 17 heatmaps. The difference between the prediction and the ground truth is evaluated with the MSELoss mean-square loss function, and gradients are computed and the weights of the whole network updated with the Adam optimization algorithm. The learning rate is set to 0.001 and training runs for 100 epochs; training can be done in batches, and with a batch size of 100 the corresponding input data has shape (100, 3, 240, 192). With this method an accuracy of more than 80% can be obtained on the COCO dataset, while the model is only about 6 MB, which is sufficient for real-time human posture detection on a mobile terminal. A visualization of a heatmap is shown in fig. 5, where the white point marks the corresponding joint point. A training-loop sketch is given below.
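A minimal training-loop sketch with the settings above (MSELoss, Adam, learning rate 0.001, 100 epochs, batch size 100), assuming PyTorch; model and train_loader are assumed to exist, with the loader yielding (images, target_heatmaps) of shapes (100, 3, 240, 192) and (100, 17, 60, 48):

    import torch
    import torch.nn as nn

    def train(model, train_loader, epochs=100, lr=1e-3, device="cpu"):
        model.to(device).train()
        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):
            for images, target_heatmaps in train_loader:
                images = images.to(device)
                target_heatmaps = target_heatmaps.to(device)
                pred = model(images)                     # (B, 17, 60, 48)
                loss = criterion(pred, target_heatmaps)  # pixel-wise squared error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()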
After the trained neural network model is obtained, the source model can be converted into a target model that runs on a mobile terminal via the Open Neural Network Exchange (ONNX) intermediate model; the conversion path is source model -> ONNX -> target model, for example a CoreML model on iOS, a Caffe2 model, or a model for another third-party neural network runtime. Note that a custom layer implementation must be added if the feed-forward inference framework does not support some operator. In this embodiment, video frame data acquired from the camera are scaled directly to 240 × 192 and fed into the network model for feed-forward inference. After the 17 heatmaps of size 60 × 48 are obtained, they are processed to obtain the corresponding human joint point coordinates, and each joint point coordinate is then converted into the image coordinate system of the unscaled human body image to be detected, yielding the human posture joint points of that image. An export sketch is given below.
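A sketch of the export step, assuming PyTorch and its torch.onnx exporter (the patent does not name the training framework); model is the trained network from the earlier sketches and the file name is arbitrary:

    import torch

    # Export the trained pose network to ONNX as the intermediate format
    # (source model -> ONNX -> target framework such as CoreML or Caffe2).
    model.eval()
    dummy_input = torch.randn(1, 3, 240, 192)   # network input size from the text
    torch.onnx.export(model, dummy_input, "pose_net.onnx",
                      input_names=["image"], output_names=["heatmaps"])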
The method for detecting human body posture provided by the embodiment of the invention abandons existing complex pose-detection network models and defines a simple, efficient neural network model in which all convolutions are separable convolutions. The simplified network structure and the use of separable convolutions greatly reduce the computation required by the neural network model, greatly shrink the model itself, make training easier, and save time and cost. Compared with the prior art, the technical solution provided by the invention runs smoothly on mobile terminals with small memory and limited CPU and GPU computing power, achieves real-time detection of human body posture, and can further be applied to motion-sensing games on the mobile terminal, body shaping and slimming, decorating the body by mapping onto its joint points, or other entertaining applications.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (7)

1. A method for detecting human body posture, comprising:
inputting a preprocessed human body image to be detected into a pre-trained neural network model, and acquiring a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension transformation layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer, and a second convolutional neural network layer; all convolution operations in the neural network model are separable convolutions;
acquiring the predetermined number of human body joint point coordinates from the predetermined number of heatmaps;
and scaling the coordinates of each human body joint point to the image coordinate system of the human body image to be detected, to acquire the human body posture joint points of the human body image to be detected.
2. The method for detecting human body posture of claim 1, wherein training the neural network model comprises:
annotating a pre-acquired original training image with a human body bounding box and joint points;
cropping the original training image according to the human body bounding box to obtain a cropped image;
scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image;
converting the coordinates of the joint points annotated in the original training image into coordinates in the training input image, and generating a ground truth value with a two-dimensional Gaussian distribution function;
and training the neural network model with the training input image and the ground truth value.
3. The method for detecting human body posture of claim 2, wherein the size of the training input image is 240 × 192 and the size of the ground truth value is 60 × 48.
4. The method for detecting human body posture as claimed in claim 2, wherein the loss function of the neural network model adopts a mean square loss function:
loss(x, y) = (x - y)²
wherein x is the predicted value of the neural network model, and y is the ground truth value.
5. The method for detecting human body posture of claim 2, further comprising: in the process of training the neural network model, optimizing the neural network model with the Adam optimizer.
6. The method for detecting human body posture of claim 1, wherein the first upsampling layer and the second upsampling layer both adopt 2× upsampling; and the first convolutional neural network layer and the second convolutional neural network layer are both 3 × 3 convolutional neural networks.
7. The method for detecting human body posture as claimed in claim 1, wherein the pre-trained neural network model is run on a mobile terminal; and the human body image to be detected is acquired by the mobile terminal.
CN201811492525.6A 2018-12-06 2018-12-06 Method for detecting human body posture Active CN111291593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811492525.6A CN111291593B (en) 2018-12-06 2018-12-06 Method for detecting human body posture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811492525.6A CN111291593B (en) 2018-12-06 2018-12-06 Method for detecting human body posture

Publications (2)

Publication Number Publication Date
CN111291593A true CN111291593A (en) 2020-06-16
CN111291593B CN111291593B (en) 2023-04-18

Family

ID=71023035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811492525.6A Active CN111291593B (en) 2018-12-06 2018-12-06 Method for detecting human body posture

Country Status (1)

Country Link
CN (1) CN111291593B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
WO2018058419A1 (en) * 2016-09-29 2018-04-05 中国科学院自动化研究所 Two-dimensional image based human body joint point positioning model construction method, and positioning method
US20180247113A1 (en) * 2016-10-10 2018-08-30 Gyrfalcon Technology Inc. Image Classification Systems Based On CNN Based IC and Light-Weight Classifier
US20180137406A1 (en) * 2016-11-15 2018-05-17 Google Inc. Efficient Convolutional Neural Networks and Techniques to Reduce Associated Computational Costs
CN106650827A (en) * 2016-12-30 2017-05-10 南京大学 Human body posture estimation method and system based on structure guidance deep learning
US20180186452A1 (en) * 2017-01-04 2018-07-05 Beijing Deephi Technology Co., Ltd. Unmanned Aerial Vehicle Interactive Apparatus and Method Based on Deep Learning Posture Estimation
CN107704817A (en) * 2017-09-28 2018-02-16 成都品果科技有限公司 A kind of detection algorithm of animal face key point
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
CN108647639A (en) * 2018-05-10 2018-10-12 电子科技大学 Real-time body's skeletal joint point detecting method
CN113947784A (en) * 2021-10-28 2022-01-18 四川长虹电器股份有限公司 Lightweight real-time human body posture estimation method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
OSOKIN D: "Real-time 2D multi-person pose estimation on CPU: Lightweight OpenPose", arXiv preprint arXiv:1811 *
谢金衡 et al.: "Real-time multi-face keypoint localization algorithm based on deep residual and feature pyramid networks", 《计算机应用》 (Journal of Computer Applications) *
陈鹏飞 et al.: "A residual depthwise separable convolution algorithm for handwritten Chinese character recognition", 《软件导刊》 (Software Guide) *
韩贵金: "Human pose estimation based on improved CNN and weighted SVDD algorithm", 《计算机工程与应用》 (Computer Engineering and Applications) *

Also Published As

Publication number Publication date
CN111291593B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11798132B2 (en) Image inpainting method and apparatus, computer device, and storage medium
CN110188598B (en) Real-time hand posture estimation method based on MobileNet-v2
CN110929569B (en) Face recognition method, device, equipment and storage medium
KR101882704B1 (en) Electronic apparatus and control method thereof
CN112991171B (en) Image processing method, device, electronic equipment and storage medium
CN109544450B (en) Method and device for constructing confrontation generation network and method and device for reconstructing image
CN113159232A (en) Three-dimensional target classification and segmentation method
JP2024004444A (en) Three-dimensional face reconstruction model training, three-dimensional face image generation method, and device
CN112489164A (en) Image coloring method based on improved depth separable convolutional neural network
CN109345604B (en) Picture processing method, computer device and storage medium
CN112686225A (en) Training method of YOLO neural network, pedestrian detection method and related equipment
Steinfeld GAN loci
CN111797834A (en) Text recognition method and device, computer equipment and storage medium
CN114170231A (en) Image semantic segmentation method and device based on convolutional neural network and electronic equipment
CN109658508B (en) Multi-scale detail fusion terrain synthesis method
CN113077477B (en) Image vectorization method and device and terminal equipment
CN108961268B (en) Saliency map calculation method and related device
CN108876704B (en) Method and device for deforming human face image and computer storage medium
CN117593187A (en) Remote sensing image super-resolution reconstruction method based on meta-learning and transducer
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
JP5067882B2 (en) Image processing apparatus, image processing method, and program
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN111291593B (en) Method for detecting human body posture
CN115410182A (en) Human body posture estimation method and device, storage medium and computer equipment
CN112669426B (en) Three-dimensional geographic information model rendering method and system based on generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant