CN114519868A

CN114519868A - Real-time bone key point identification method and system based on coordinate system regression

Info

Publication number: CN114519868A
Application number: CN202210160965.1A
Authority: CN
Inventors: 顾友良; 张磊; 赵乾
Original assignee: Guangdong Xinwangpai Intelligent Information Technology Co ltd
Current assignee: Guangdong Xinwangpai Intelligent Information Technology Co ltd
Priority date: 2022-02-22
Filing date: 2022-02-22
Publication date: 2022-05-20

Abstract

The invention discloses a real-time bone key point identification system based on coordinate system regression, comprising an image acquisition module, a core calculation unit, a lightweight neural network algorithm module and a coordinate system regression output module. The lightweight neural network algorithm module adopts ShuffleNet V2 as its basic backbone network, adds two consecutive upsampling layers after the last layer of the ShuffleNet V2 backbone network, and applies a skip connection between ShuffleV2Block3 and DUC2 in the network, finally obtaining a heat map (heatmap). The coordinate system regression output module defines the heat map output by each channel of the lightweight neural network algorithm module as Z, normalizes its values to between 0 and 1 through a normalization function, and defines the normalized heat map as $\hat{Z}$, a discrete probability distribution expressed as an m × n matrix, where m and n correspond to the resolution of the heat map; the coordinate information contained in Z is then obtained through a defined formula.

Description

Real-time bone key point identification method and system based on coordinate system regression

Technical Field

The invention relates to the technical field of image recognition, in particular to a real-time bone key point recognition method and system based on coordinate system regression.

Background

Bone key point identification is one of the basic technologies of computer vision. It detects the joints and facial features of a human body in image or video data captured by a sensor (a camera, an infrared device or similar equipment) and describes the human skeleton through key points. Most existing deep-learning algorithms for bone key point identification output Gaussian heat maps, which require a large output feature map and make training and inference slow. Such algorithms are difficult to run in real time on a low-cost hardware platform and reach real-time speed only with high-cost hardware (such as a GPU or a high-end camera). Moreover, the coordinate read from a heat map is an integer, unlike the floating-point output of coordinate regression, which does not lose precision in this way; heat-map output therefore has a theoretical lower bound on its error.
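The lower bound mentioned above can be illustrated with a minimal sketch. It assumes that a heat map is decoded by taking the integer index of its hottest cell and mapping the centre of that cell back to image pixels; the function name and the 256-pixel and 64-cell sizes are hypothetical and do not come from the patent.

```python
def argmax_decode_error(true_x, image_size, heatmap_size):
    """Absolute error (in pixels) when a 1-D position is decoded from the
    integer index of the hottest heat-map cell (assumed decoding scheme)."""
    cell = image_size / heatmap_size                   # pixels covered by one cell
    idx = min(int(true_x // cell), heatmap_size - 1)   # integer cell index
    decoded = (idx + 0.5) * cell                       # centre of that cell
    return abs(decoded - true_x)


# Hypothetical sizes: a 256-pixel-wide image and a 64-cell-wide heat map.
image_size, heatmap_size = 256, 64
cell = image_size / heatmap_size

# Scan true positions in steps of 0.1 pixel and keep the largest error.
worst = max(argmax_decode_error(x / 10, image_size, heatmap_size)
            for x in range(image_size * 10))
print(f"cell size: {cell} px, worst-case decoding error: {worst} px")
```

Under this assumed decoding the worst-case error is half a cell, however well the network is trained, and shrinking it requires a larger output feature map. A floating-point coordinate output is not tied to the cell grid in this way.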

Based on these defects, the invention mainly targets bone key point identification on mobile and embedded devices. It combines a lightweight deep learning algorithm with coordinate system regression to avoid the theoretical error lower bound of heat-map output, and the hardware needs only a CPU and a monocular camera to perform low-cost real-time identification of bone key points, with no GPU or high-end camera (such as Kinect). Traditional bone key point algorithms rely on geometric priors and template matching, and their accuracy is poor. Limited by hardware performance, existing deep-learning bone key point algorithms run slowly on low-cost hardware platforms (such as mobile phones and tablets); when they are coupled with an application, the application may stutter or drop frames, which greatly degrades the user experience.

The method can realize the real-time identification of the key points of the skeleton on a low-cost hardware platform.

Disclosure of Invention

To address the above technical problems, particularly those of traditional bone key point identification, the invention realizes real-time identification of bone key points on a low-cost hardware platform.

The present invention aims to solve at least the problems of the prior art. To this end, the invention discloses a real-time bone key point identification system based on coordinate system regression, comprising an image acquisition module, a core calculation unit, a lightweight neural network algorithm module and a coordinate system regression output module. The image acquisition module adopts any monocular camera, and the core calculation unit adopts a mobile CPU. The lightweight neural network algorithm module adopts ShuffleNet V2 as its basic backbone network, adds two consecutive upsampling layers after the last layer of the ShuffleNet V2 backbone network, and applies a skip connection between ShuffleV2Block3 and DUC2 in the network, finally obtaining a heat map (heatmap);

the coordinate system regression output module defines the heat map output by each channel of the lightweight neural network algorithm module as Z, normalizes its values to between 0 and 1 through a normalization function, and defines the normalized heat map as $\hat{Z}$, a discrete probability distribution expressed as an m × n matrix, where m and n correspond to the resolution of the heat map.

Furthermore, in the lightweight neural network algorithm module adopting ShuffleNet V2 as the basic backbone network, an input image first enters the ShuffleNet V2 backbone network, which consists of two convolution layers, three ShuffleV2Block layers and a max pooling layer. The convolution layer conv1 applies 24 convolution kernels of size 3x3 with stride 2, and the convolution layer conv5 applies 1024 convolution kernels of size 1x1 with stride 1; the pooling layer Maxpool1 has size 3x3 and stride 2. The ShuffleV2Block layers share a uniform structure: the input feature map is split by channel into two branches, the left branch performs no operation, the right branch consists of consecutive 1x1 and 3x3 convolutions, and the two branches are merged by a concat operation followed immediately by a channel shuffle.

Still further, adding two consecutive upsampling layers after the last layer of the ShuffleNet V2 backbone network comprises: passing the series of convolutional feature maps output by the backbone network through consecutive DUC upsampling layers. The DUC layers share a uniform structure, a 3x3 convolution followed by PixelShuffle upsampling, which obtains a high-resolution feature map from a low-resolution feature map through convolution and recombination across channels. The feature map of ShuffleBlock3 in the ShuffleNet V2 backbone network, which has the same shape, is skip-connected to the last upsampling layer DUC2 to improve robustness during training and prevent overfitting, and a heat map is finally output.

Further, the normalization function is defined by formula 1. First, two matrices X and Y are defined, with i = 1 … m and j = 1 … n, whose entries contain the x-axis and y-axis coordinates, respectively, of the normalized heat map $\hat{Z}$.

Furthermore, the matrix X and the matrix Y are defined by formula 2 and formula 3. $\hat{Z}$ is given a probabilistic interpretation: because its values lie between 0 and 1 and sum to 1, it satisfies the conditions of a probability distribution. The matrix inner product of $\hat{Z}$ with X therefore gives the expected value on the matrix X, which is the horizontal coordinate value; likewise, the expected value on the matrix Y gives the vertical coordinate value. The intersection of the horizontal and vertical coordinate values corresponds to the position in $\hat{Z}$, which yields the coordinate point. The function P for obtaining the coordinate point information is defined by formula 4, where $\langle\cdot,\cdot\rangle_F$ denotes the matrix inner product.

The invention also discloses a real-time bone key point identification method based on coordinate system regression, comprising the following steps:

Step 1: an input image first enters the ShuffleNet V2 backbone network, which consists of two convolution layers, three ShuffleV2Block layers and a max pooling layer. The convolution layer conv1 applies 24 convolution kernels of size 3x3 with stride 2, and the convolution layer conv5 applies 1024 convolution kernels of size 1x1 with stride 1; the pooling layer Maxpool1 has size 3x3 and stride 2. The ShuffleV2Block layers share a uniform structure: the input feature map is split by channel into two branches, the left branch performs no operation, the right branch consists of consecutive 1x1 and 3x3 convolutions, and the two branches are merged by a concat operation followed by a channel shuffle. The series of convolutional feature maps output by the backbone network passes through consecutive DUC upsampling layers. The DUC layers share a uniform structure, a 3x3 convolution followed by PixelShuffle upsampling, which obtains a high-resolution feature map from a low-resolution feature map through convolution and recombination across channels. The feature map of ShuffleBlock3 in the ShuffleNet V2 backbone network, which has the same shape, is skip-connected to the last upsampling layer DUC2 to improve robustness during training and prevent overfitting, and a heat map is finally output;

Step 2: the heat map output by each channel of the lightweight neural network algorithm module is defined as Z; its values are normalized to between 0 and 1 by a normalization function, and the normalized heat map is defined as $\hat{Z}$, a discrete probability distribution expressed as an m × n matrix, where m and n correspond to the resolution of the heat map; the coordinate information contained in Z is then calculated through a defined formula.

The invention further discloses a device comprising: at least one processor and memory; the memory stores computer-executable instructions; the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the identification method as described above.

The invention further discloses a computer readable storage medium, wherein computer execution instructions are stored in the computer readable storage medium, and when a processor executes the computer execution instructions, the identification method is realized.

Compared with the prior art, the invention has the following beneficial effects. Traditional bone key point algorithms rely on geometric priors and template matching, and their accuracy is poor. Limited by hardware performance, existing deep-learning bone key point algorithms run slowly on low-cost hardware platforms (such as mobile phones and tablets); when they are coupled with an application, the application may stutter or drop frames, which greatly degrades the user experience. The output of recent deep-learning bone key point algorithms is generally a Gaussian heat map, and the coordinate read from a heat map is an integer, unlike the floating-point output of coordinate regression, which does not lose precision in this way; heat-map output therefore has a theoretical lower bound on its error. Based on these defects, the invention mainly targets bone key point identification on mobile and embedded devices. It combines a lightweight deep learning algorithm with coordinate system regression to avoid the theoretical error lower bound of heat-map output, and the hardware needs only a CPU and a monocular camera to perform low-cost real-time identification of bone key points, with no GPU or high-end camera (such as Kinect).

Drawings

The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.

FIG. 1 is a core block diagram of the present invention for a method for real-time bone key point identification based on coordinate system regression;

FIG. 2 is an overall diagram of a lightweight network according to an embodiment of the invention;

FIG. 3 is a network structure diagram of a backbone network according to an embodiment of the present invention;

FIG. 4 is a network structure diagram of a backbone network according to an embodiment of the present invention.

Detailed Description

Example one

The core module of the real-time bone key point identification method based on coordinate system regression is shown in fig. 1, and comprises an image acquisition module, a core calculation unit, a lightweight neural network algorithm module and a coordinate system regression output module. The image acquisition module adopts any monocular camera, and the core computing unit adopts a mobile end CPU. The core design of the invention is a lightweight neural network algorithm module and a coordinate system regression output module, and the two modules are adopted to ensure the real-time performance of the system on low-cost hardware.

I. Lightweight neural network algorithm module:

The lightweight neural network algorithm module adopts ShuffleNet V2 as its basic backbone network, adds two consecutive upsampling layers after the last layer of the ShuffleNet V2 backbone network, applies a skip connection between ShuffleV2Block3 and DUC2 in the network, and finally obtains a heat map (heatmap). The overall structure of the lightweight network is shown in FIG. 2.

The input image first enters the ShuffleNet V2 backbone network, which consists of two convolution layers, three ShuffleV2Block layers and a max pooling layer. The convolution layer conv1 applies 24 convolution kernels of size 3x3 with stride 2, and the convolution layer conv5 applies 1024 convolution kernels of size 1x1 with stride 1; the pooling layer Maxpool1 has size 3x3 and stride 2. The ShuffleV2Block layers share a uniform structure, shown in FIG. 3 and FIG. 4. In FIG. 3, the input feature map is split by channel into two branches: the left branch performs no operation, the right branch consists of consecutive 1x1 and 3x3 convolutions, and the two branches are merged by a concat operation followed by a channel shuffle. The structure in FIG. 4 is roughly the same as that in FIG. 3, except that the left branch consists of consecutive 3x3 and 1x1 convolutions.
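The concat-then-shuffle step of a ShuffleV2Block can be sketched in plain Python as follows. The sketch assumes two groups, one per branch, as in the usual ShuffleNet V2 formulation (the text does not state the group count). The function name `channel_shuffle` and the string labels standing in for feature-map channels are hypothetical.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: view the list as (groups, n // groups),
    transpose it, and flatten it again."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[(k % groups) * per_group + k // groups] for k in range(n)]


# Labels stand in for per-channel feature maps of the two branches.
left = ["L0", "L1", "L2"]    # identity branch
right = ["R0", "R1", "R2"]   # branch that went through the convolutions
merged = left + right        # concat along the channel axis
shuffled = channel_shuffle(merged)
print(shuffled)  # ['L0', 'R0', 'L1', 'R1', 'L2', 'R2']
```

After the shuffle, the next channel split no longer places all untouched channels in one half, so information from both branches mixes across successive blocks.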

The series of convolutional feature maps output by the backbone network then passes through consecutive DUC upsampling layers. The DUC layers share a uniform structure, a 3x3 convolution followed by PixelShuffle upsampling, which obtains a high-resolution feature map from a low-resolution feature map through convolution and recombination across channels. The feature map of ShuffleBlock3 in the ShuffleNet V2 backbone network, which has the same shape, is skip-connected to the last upsampling layer DUC2 to improve robustness during training and prevent overfitting, and a heat map is finally output.
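The channel recombination performed by PixelShuffle can be sketched on nested lists as follows. This is an illustration only: it omits the 3x3 convolution that precedes the rearrangement in a DUC layer, and the channel ordering is the common sub-pixel convention, which the text does not specify. The function name `pixel_shuffle` is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed convention: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]   # one low-resolution channel
                for hh in range(h):
                    for ww in range(w):
                        out[c][hh * r + i][ww * r + j] = src[hh][ww]
    return out


# Four channels of size 1x2 become one channel of size 2x4 (upscale factor 2).
x = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
y = pixel_shuffle(x, 2)
print(y)  # [[[1, 3, 2, 4], [5, 7, 6, 8]]]
```

Each group of r × r low-resolution channels fills the r × r sub-pixel positions of one output channel. Resolution therefore increases without interpolation, and the channel count drops by a factor of r × r.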

II. Coordinate system regression output module:

The heat map output by each channel of the lightweight neural network algorithm module is defined as Z. Its values are normalized to between 0 and 1 by a normalization function, and the normalized heat map is defined as $\hat{Z}$. The normalization function is defined by formula 1.

First, two matrices X and Y are defined, with i = 1 … m and j = 1 … n, such that their entries contain the x-axis and y-axis coordinates, respectively, of $\hat{Z}$. The matrix X and the matrix Y are defined by formula 2 and formula 3.

$\hat{Z}$ is given a probabilistic interpretation: because its values lie between 0 and 1 and sum to 1, it satisfies the conditions of a probability distribution. The expected values of $\hat{Z}$ on the matrices X and Y therefore give the position information of the x-axis coordinate and the y-axis coordinate, which yields the coordinate point. The function P for obtaining the coordinate point information is defined by formula 4, where $\langle\cdot,\cdot\rangle_F$ denotes the matrix inner product.
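Formulas 1 to 4 are not reproduced in this text. The following is one plausible reconstruction, offered as an assumption rather than as the patent's own notation. Formula 4 follows directly from the description of the inner products. Formulas 2 and 3 are chosen to be consistent with the value of -0.166 quoted for m = n = 6. Formula 1 is a simple sum normalization for a non-negative Z; a softmax would satisfy the stated conditions equally well.

```latex
% (1) assumed normalization for a non-negative heat map Z
\hat{Z}_{i,j} = \frac{Z_{i,j}}{\sum_{k=1}^{m}\sum_{l=1}^{n} Z_{k,l}}

% (2), (3) assumed coordinate matrices, i = 1..m, j = 1..n
X_{i,j} = \frac{2j - (n + 1)}{n}, \qquad
Y_{i,j} = \frac{2i - (m + 1)}{m}

% (4) coordinate read-out as two matrix (Frobenius) inner products
P(\hat{Z}) = \left[\, \langle \hat{Z}, X \rangle_F,\ \langle \hat{Z}, Y \rangle_F \,\right],
\qquad
\langle A, B \rangle_F = \sum_{i=1}^{m}\sum_{j=1}^{n} A_{i,j} B_{i,j}
```

With these assumed definitions, the entries of X and Y lie in the open interval (-1, 1), so the regressed coordinates are expressed relative to the centre of the heat map and do not depend on its resolution.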

As an example, let m = n = 6. The matrix inner product of $\hat{Z}$ with X gives an expected value of -0.166 on the matrix X; likewise, the matrix inner product of $\hat{Z}$ with Y gives an expected value of -0.166 on the matrix Y. The intersection of the two values corresponds to the x-axis and y-axis position information in $\hat{Z}$. If the normalized Gaussian heat map has only one peak, this transformation directly yields the coordinate point.
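The m = n = 6 example can be checked numerically with a minimal sketch. It rests on assumptions: the coordinate matrices are taken as X[i][j] = (2j - (n + 1)) / n and Y[i][j] = (2i - (m + 1)) / m with 1-based i and j, which reproduces the quoted -0.166, and normalization is taken as division by the sum. The function names and the peak position at row 3, column 3 are hypothetical.

```python
def normalize(z):
    """Assumed normalization: divide a non-negative heat map by its sum."""
    total = sum(sum(row) for row in z)
    return [[v / total for v in row] for row in z]


def coordinate_matrices(m, n):
    """Assumed coordinate matrices with entries in (-1, 1), 1-based indices."""
    x = [[(2 * j - (n + 1)) / n for j in range(1, n + 1)] for _ in range(m)]
    y = [[(2 * i - (m + 1)) / m for _ in range(n)] for i in range(1, m + 1)]
    return x, y


def frobenius_inner(a, b):
    """Matrix inner product: sum of element-wise products."""
    return sum(av * bv for ra, rb in zip(a, b) for av, bv in zip(ra, rb))


def regress_coordinates(z):
    """Expected x and y coordinates of a heat map under the assumptions above."""
    m, n = len(z), len(z[0])
    z_hat = normalize(z)
    x, y = coordinate_matrices(m, n)
    return frobenius_inner(z_hat, x), frobenius_inner(z_hat, y)


m = n = 6
z = [[0.0] * n for _ in range(m)]
z[2][2] = 1.0                      # single peak at row 3, column 3 (1-based)
px, py = regress_coordinates(z)
print(round(px, 4), round(py, 4))  # -0.1667 -0.1667

# A peak split evenly between columns 3 and 4 lands between the two cells.
z2 = [[0.0] * n for _ in range(m)]
z2[2][2] = z2[2][3] = 0.5
px2, py2 = regress_coordinates(z2)
```

The second case shows why the output is a floating-point value: the expectation can fall between grid cells, which an integer peak index cannot represent.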

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure in any way whatsoever. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims

1. A real-time bone key point identification system based on coordinate system regression comprises an image acquisition module, a core calculation unit, a lightweight neural network algorithm module and a coordinate system regression output module, wherein the image acquisition module adopts any monocular camera, and the core calculation unit adopts a mobile end CPU (Central processing Unit), and is characterized in that the lightweight neural network algorithm module adopts ShuffleNet V2 as a basic backbone network, adds two continuous up-sampling to the last layer of the ShuffleNet V2 backbone network, and performs skip connection (skip connection) on ShuffleV2Block3 and DUC2 in the network, and finally obtains a heatmap (heatmap);

the coordinate system regression output module defines the heat map output by each channel of the lightweight neural network algorithm module as Z, normalizes its values to between 0 and 1 through a normalization function, and defines the normalized heat map as $\hat{Z}$, a discrete probability distribution expressed as an m × n matrix, wherein m and n correspond to the resolution of the heat map, and the coordinate information of the bone key point in Z is obtained through calculation of a defined formula.

2. The coordinate system regression-based real-time bone key point identification system of claim 1, wherein said lightweight neural network algorithm module adopting ShuffleNet V2 as a basic backbone network further comprises: an input image first enters the ShuffleNet V2 backbone network, wherein the ShuffleNet V2 backbone network consists of two convolution layers, three ShuffleV2Block layers and a max pooling layer, wherein the convolution layer conv1 applies 24 convolution kernels of size 3x3 with stride 2, and the convolution layer conv5 applies 1024 convolution kernels of size 1x1 with stride 1; the pooling layer Maxpool1 has size 3x3 and stride 2; the ShuffleV2Block layers share a uniform structure, the input feature map is split by channel into two branches, the left branch performs no operation, the right branch consists of consecutive 1x1 and 3x3 convolutions, and the two branches are merged by a concat operation followed immediately by a channel shuffle.

3. The coordinate system regression-based real-time bone key point identification system of claim 2, wherein said adding two consecutive upsampling layers after the last layer of the ShuffleNet V2 backbone network further comprises: passing the series of convolutional feature maps output by the backbone network through consecutive DUC upsampling layers, wherein the DUC layers share a uniform structure, a 3x3 convolution followed by PixelShuffle upsampling, which obtains a high-resolution feature map from a low-resolution feature map through convolution and recombination across channels, and the feature map of ShuffleBlock3 in the ShuffleNet V2 backbone network, which has the same shape, is skip-connected to the last upsampling layer DUC2 to improve robustness during training and prevent overfitting, and a heat map is finally output.

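4. The coordinate system regression-based real-time bone key point identification system of claim 3, wherein said normalization function is defined by formula 1; first, two matrices X and Y are defined, with i = 1 … m and j = 1 … n, whose entries contain the x-axis coordinates and y-axis coordinates, respectively, of the normalized heat map $\hat{Z}$.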

5. The coordinate system regression-based real-time bone key point identification system of claim 4, wherein said matrix X and matrix Y are defined by formula 2 and formula 3, wherein the normalized heat map $\hat{Z}$ is given a probabilistic interpretation: because its values lie between 0 and 1 and sum to 1, it satisfies the conditions of a probability distribution; thus the matrix inner product of $\hat{Z}$ with X gives the expected value on the matrix X, which is the horizontal coordinate value, and similarly the expected value on the matrix Y gives the vertical coordinate value; the intersection of the horizontal coordinate value and the vertical coordinate value corresponds to the coordinate point in $\hat{Z}$.

6. a real-time bone key point identification method based on coordinate system regression is characterized by comprising the following steps:

Step 2: the heat map output by each channel of the lightweight neural network algorithm module is defined as Z; its values are normalized to between 0 and 1 by a normalization function, and the normalized heat map is defined as $\hat{Z}$, a discrete probability distribution expressed as an m × n matrix, wherein m and n correspond to the resolution of the heat map, and the coordinate information of the bone key point in Z is obtained through calculation of a defined formula.

7. An apparatus, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of claim 6.

8. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of claim 6.