CN115620016A - Skeleton detection model construction method and image data identification method - Google Patents


Info

Publication number
CN115620016A
CN115620016A (application CN202211592632.2A; granted publication CN115620016B)
Authority
CN
China
Prior art keywords
training
loss
image
skeleton
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211592632.2A
Other languages
Chinese (zh)
Other versions
CN115620016B (English)
Inventor
项乐宏
王翀
夏银水
李裕麒
郑瑜杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Loctek Ergonomic Technology Co Ltd
Original Assignee
Loctek Ergonomic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Loctek Ergonomic Technology Co Ltd
Priority to CN202211592632.2A
Publication of CN115620016A
Application granted
Publication of CN115620016B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/34Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a skeleton detection model construction method and an image data identification method. The construction method comprises the following steps: acquiring a training RGB image and a training depth image from a training image; inputting the training RGB image and the training depth image into a training network and obtaining a first heat map and a second heat map respectively; converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map; determining a first skeleton key point and a second skeleton key point from the first and second heat maps respectively by heat-map regression; calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error; and optimizing the parameters of the training network according to the sum of the first, second and third losses. The invention addresses the problem that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.

Description

Skeleton detection model construction method and image data identification method
Technical Field
The invention relates to the technical field of image data processing, and in particular to a skeleton detection model construction method and an image data identification method.
Background
Human body posture recognition is the process of detecting the positions of human key points in an image or video and constructing a human skeleton graph. The resulting posture information can be used for further tasks such as action recognition, human-computer interaction, and abnormal-behaviour detection. However, human limbs are flexible, posture appearance varies greatly, and recognition is easily affected by changes in viewpoint and clothing.
In the prior art, the HRNet model is often used to detect skeleton key points in human posture recognition. Traditional HRNet trains the model using only RGB images, so the accuracy and robustness of the finally trained HRNet skeleton model are insufficient, and in turn the precision of human posture detection is insufficient.
The problem in the related art is therefore that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.
Disclosure of Invention
The invention addresses the problem that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.
To solve this problem, a first object of the present invention is to provide a method for constructing a skeleton detection model based on multi-view knowledge distillation.
A second object of the present invention is to provide an image data recognition method for human body posture.
To achieve the first object of the present invention, an embodiment of the present invention provides a method for constructing a skeleton detection model based on multi-view knowledge distillation, the method comprising:
S100: acquiring a labelled training image, where labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image;
S200: acquiring a training RGB image and a training depth image from the training image;
S300: inputting the training RGB image and the training depth image into a training network, and obtaining a first heat map and a second heat map respectively;
S400: converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map;
S500: determining a first skeleton key point and a second skeleton key point from the first heat map and the second heat map respectively, using heat-map regression;
S600: calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error;
S700: optimizing the parameters of the training network according to the sum of the first loss, the second loss and the third loss;
S800: acquiring a number of labelled training images and repeating steps S100 to S700 until the loss converges; training is then complete, and the parameters of the training network are fixed to construct the skeleton detection model.
Compared with the prior art, this scheme has the following technical effect: HRNet subjected to multi-view knowledge distillation is more robust across different views of the same scene, so the disclosed construction method effectively improves the robustness of the skeleton detection model, and the constructed model effectively improves the precision of human skeleton detection.
In one embodiment of the invention, the function used to calculate the first loss and the second loss is the OHKM loss function.
Compared with the prior art, this has the following technical effect: using the OHKM loss function makes the obtained first loss and second loss more accurate.
In an embodiment of the present invention, after S400 the method further comprises:
S450: calculating a fourth loss between the first heat map and the second heat map using the mean squared error;
and S700 comprises:
optimizing the parameters of the training network according to the sum of the first loss, the second loss, the third loss and the fourth loss.
Compared with the prior art, this has the following technical effect: adding the calculation of the fourth loss makes the finally trained parameters of the training network more accurate, and the skeleton detection model correspondingly more capable and more robust.
In one embodiment of the present invention, S300 comprises:
S310: acquiring the number of input channels n of the training network;
S320: copying and converting the training RGB image and the training depth image into images with n channels, inputting them into the training network, and obtaining the first heat map and the second heat map respectively.
Compared with the prior art, this has the following technical effect: it allows the training RGB image and the training depth image to be input into the same training network at the same time, making the generation of the subsequent heat maps more stable and effectively improving the stability and reliability of the whole construction method.
In one embodiment of the present invention, S700 comprises:
optimizing the parameters of the training network by gradient descent according to the sum of the first loss, the second loss and the third loss.
Compared with the prior art, this has the following technical effect: the parameters of the training network can be accurately optimized from the losses, making the constructed skeleton detection model more accurate.
To achieve the second object of the present invention, an embodiment of the present invention provides an image data recognition method for human body posture which uses a skeleton detection model constructed by the construction method of any embodiment of the invention. The method comprises: acquiring an RGB image of a user; inputting the RGB image into the skeleton detection model and obtaining first human skeleton key-point coordinates, which are 2D skeleton key-point coordinates.
In an embodiment of the present invention, the method comprises: acquiring a depth image of a user; inputting the depth image into the skeleton detection model and obtaining second human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
In an embodiment of the present invention, the method comprises: acquiring an RGB image and a depth image of a user; inputting the RGB image and the depth image into the skeleton detection model and obtaining third human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
Compared with the prior art, the technical effect is as follows: whether the RGB image or the depth image is input alone, or both are input together, the skeleton detection model adaptively and accurately outputs human skeleton key-point coordinates, so the image data recognition method of this embodiment adapts to more situations.
Drawings
FIG. 1 is a flow chart of steps of a method for building a multi-view knowledge distillation-based skeleton detection model according to some embodiments of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
[First Embodiment]
Referring to FIG. 1, this embodiment provides a method for constructing a skeleton detection model based on multi-view knowledge distillation, comprising the following steps:
S100: acquiring a labelled training image, where labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image;
S200: acquiring a training RGB image and a training depth image from the training image;
S300: inputting the training RGB image and the training depth image into a training network, and obtaining a first heat map and a second heat map respectively;
S400: converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map;
S500: determining a first skeleton key point and a second skeleton key point from the first heat map and the second heat map respectively, using heat-map regression;
S600: calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error;
S700: optimizing the parameters of the training network according to the sum of the first loss, the second loss and the third loss;
S800: acquiring a number of labelled training images and repeating steps S100 to S700 until the loss converges; training is then complete, and the parameters of the training network are fixed to construct the skeleton detection model.
In this embodiment, a skeleton detection model constructed by the construction method of the present invention can be applied to an ergonomic smart device, so that when the device cannot capture a complete image of the user's posture, the skeleton detection model still identifies and acquires the posture in which the user is using the device.
It should be noted that ergonomic smart devices include, but are not limited to, lifting desks and lifting platforms; the user typically rests both hands on the device to work or study, and the height of the device can be adjusted by a motor.
In the prior art, HRNet is adopted as the recognition model. HRNet was proposed for the 2D human pose estimation task and mainly targets the pose estimation of a single individual, i.e. there is only one human target in the image input into the network. HRNet connects sub-networks from high to low resolution in parallel and uses repeated multi-scale fusion, so that low-resolution representations of the same depth and similar level enhance the high-resolution representation. The final output of the model comprises a number of skeleton key points of the human body.
Traditional HRNet trains the model using only RGB images. A classical assumption in self-supervised learning is that a strong representation is one that models view-invariant factors. In the scheme of the invention, the collected RGB image and depth image of the human body can be regarded as different views of the same scene, and the network's predictions for the two views are kept consistent, i.e. the mutual information between different views of the same scene is maximized. Different views of the same scene provide more information for training the model.
Further, in S100, a labelled training image is acquired; labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image. It should be noted that in the construction method of this embodiment the label may be entered by an operator from the RGB image, and comprises the coordinates of several human skeleton key points; once the label is determined, it can be converted into a correct heat map of the correct key-point coordinates. The training images comprise at least RGB images and depth images.
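The conversion from labelled key-point coordinates to a "correct heat map" is treated as prior art above; a minimal sketch of the common Gaussian-rendering approach might look like the following (the function name, heat-map size and sigma are assumptions for illustration, not values from the patent):

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, heatmap_size=(64, 48), sigma=2.0):
    """Render one Gaussian peak per labelled key point.

    keypoints: list of (x, y) pixel coordinates in heat-map space.
    Returns an array of shape (num_keypoints, H, W) with peak values of 1.
    """
    h, w = heatmap_size
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # Gaussian centred on the labelled coordinate
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

The peak of each rendered map sits exactly at the labelled coordinate, which is what the supervised losses in S400 compare against.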
Further, in S200, a training RGB image and a training depth image are acquired from the training image. The training images come from a database containing a number of training RGB images and training depth images used to train the skeleton detection model. A training RGB image is a colour image; a training depth image, also called a range image, stores the distance (depth) from the capture device to each point in the scene as pixel values, and directly reflects the geometry of the visible surfaces of the scene.
Further, in S300, the training RGB image and the training depth image are input into the same training network, i.e. the HRNet training network based on multi-view distillation, and a first heat map and a second heat map are obtained respectively.
Further, in S400, the label is converted into a first correct heat map, a first loss is calculated between the first heat map and the first correct heat map, and a second loss is calculated between the second heat map and the first correct heat map. It should be noted that converting a label into a correct heat map is prior art and is not described again here.
Further, in S500, a first skeleton key point and a second skeleton key point are determined from the first heat map and the second heat map respectively using heat-map regression. It should be noted that heat-map regression is prior art and is not described again here.
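In its simplest form, heat-map regression takes the argmax of each heat map; a sketch follows (the sub-pixel refinement that practical implementations add is omitted, and the function name is an assumption):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Recover one (x, y, confidence) triple per key point from its heat map."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, h * w)
    idx = flat.argmax(axis=1)   # flattened location of each peak
    conf = flat.max(axis=1)     # peak height doubles as confidence
    return np.stack([idx % w, idx // w, conf], axis=1)  # shape (k, 3)
```

Applied to the first and second heat maps, this yields the first and second skeleton key points that S600 compares.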
Further, in S600, a third loss is calculated between the first skeleton key point and the second skeleton key point using the mean squared error. The two output groups of skeleton key points should agree: since the training RGB image and the training depth image depict the same scene with the same human posture, whichever picture is input, the network should produce the same result, so the two groups of key points obtained from the two heat maps are constrained to be similar with a mean-squared-error loss.
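The third loss is a plain mean squared error between the two predicted key-point sets; as a sketch (function and argument names are illustrative):

```python
import numpy as np

def mse_loss(pred_a, pred_b):
    """Mean squared error between two arrays of key-point coordinates.

    Used here as the consistency term that pulls the RGB-branch and
    depth-branch predictions of the same scene toward each other.
    """
    pred_a = np.asarray(pred_a, dtype=np.float64)
    pred_b = np.asarray(pred_b, dtype=np.float64)
    return float(((pred_a - pred_b) ** 2).mean())
```

The same function also serves for the fourth loss described later, which applies it to the two heat maps rather than the decoded key points.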
Further, in S700, the parameters of the training network are optimized according to the sum of the first loss, the second loss and the third loss.
Further, in S800, further labelled training images are acquired and steps S100 to S700 are repeated until the loss converges; training is then complete and the parameters of the training network are fixed, yielding the skeleton detection model. It should be noted that each pass through S100 to S700 further optimizes the model parameters; once the loss has converged after repeated passes, training is complete.
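Putting S100 to S800 together, one pass over the data might be organized as below. The model, loss and optimizer arguments are hypothetical placeholders; only the loss bookkeeping mirrors the method described above:

```python
import numpy as np

def train_epoch(batches, forward, heatmap_loss, mse, step):
    """One S100-S700 pass: per sample, two heat maps from the shared
    network, two supervised losses against the correct heat map, one
    consistency loss between the branches, then an optimizer step on
    the sum. Returns the mean total loss for convergence checking (S800)."""
    totals = []
    for rgb, depth, correct_hm in batches:
        hm1, hm2 = forward(rgb), forward(depth)  # S300: shared weights
        l1 = heatmap_loss(hm1, correct_hm)       # S400: first loss
        l2 = heatmap_loss(hm2, correct_hm)       # S400: second loss
        l3 = mse(hm1, hm2)                       # consistency (cf. S450/S600)
        step(l1 + l2 + l3)                       # S700: optimize on the sum
        totals.append(l1 + l2 + l3)
    return float(np.mean(totals))
```

In practice `forward` would be the HRNet and `step` a real optimizer update; the sketch only shows how the three losses are combined per iteration.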
HRNet extracts features of the image, also called a representation. Provided the extracted features are good enough, an accurate skeleton can be obtained after heat-map regression. A classical assumption in self-supervised learning is that a strong representation is one that models view-invariant factors. In the scheme of the invention, HRNet extracts features from both the RGB image and the depth image; when the two sets of features agree, the mutual information between the two views is maximal and the extracted features are robust.
The benefit is that HRNet subjected to multi-view knowledge distillation is more robust across different views of the same scene: the construction method effectively improves the robustness of the skeleton detection model, and the constructed model effectively improves the precision of human skeleton detection.
Further, the function used to calculate the first loss and the second loss is the OHKM (online hard keypoint mining) loss function.
It should be noted that the OHKM loss function is prior art; this embodiment applies it to the skeleton detection model construction method, which helps the training network complete its training task efficiently.
Understandably, using the OHKM loss function makes the obtained first loss and second loss more accurate.
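OHKM computes a per-key-point loss and back-propagates only the hardest key points; a minimal sketch follows (topk=8 follows the common HRNet setting and is an assumption here, not stated in the patent):

```python
import numpy as np

def ohkm_loss(pred_hm, correct_hm, topk=8):
    """Online hard keypoint mining: per-key-point MSE against the correct
    heat map, averaged over only the topk key points with the largest
    error (the 'hard' ones)."""
    k = pred_hm.shape[0]
    # one scalar error per key-point channel
    per_kp = ((pred_hm - correct_hm) ** 2).reshape(k, -1).mean(axis=1)
    # keep the topk largest errors (all of them if topk >= k)
    hardest = np.sort(per_kp)[max(k - topk, 0):]
    return float(hardest.mean())
```

Easy key points thus stop dominating the gradient, which is why the first and second losses become more discriminative than a plain MSE over all channels.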
Further, after S400 the method further comprises:
S450: calculating a fourth loss between the first heat map and the second heat map using the mean squared error;
and S700 comprises:
optimizing the parameters of the training network according to the sum of the first, second, third and fourth losses.
Further, in S450 the two output heat maps should also agree: the RGB image and the depth image both depict the same scene with the same human posture, so whichever picture is input, the network should produce the same result, and the two heat maps are therefore constrained to be similar with a mean-squared-error loss.
Understandably, adding the fourth loss makes the finally trained parameters of the training network more accurate, and the skeleton detection model correspondingly more capable and more robust.
Further, S300 comprises:
S310: acquiring the number of input channels n of the training network;
S320: copying and converting the training RGB image and the training depth image into images with n channels, inputting them into the training network, and obtaining the first heat map and the second heat map respectively.
In this embodiment, because the number of channels of the training images does not always match the number of input channels of the training network, the training RGB image and the training depth image must be copied and converted into images with n channels before both can be input into the same training network at the same time.
Illustratively, n is 3. The training RGB image and the training depth image are each input into the same network (the network input has 3 channels; the depth image has 1 channel, so it is copied 3 times to form a 3-channel image), two groups of heat maps are obtained, and the first heat map and the second heat map are determined from them.
It can be understood that this allows the training RGB image and the training depth image to be input into the same training network at the same time, making the generation of the subsequent heat maps more stable and effectively increasing the stability and reliability of the whole construction method.
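The copy-to-3-channels step for the single-channel depth image can be sketched as follows (the function name is an assumption):

```python
import numpy as np

def depth_to_three_channels(depth):
    """Tile a (H, W) single-channel depth map into a (3, H, W) image so it
    matches the 3-channel input the training network expects (n = 3)."""
    return np.repeat(depth[np.newaxis, :, :], 3, axis=0)
```

Each of the three resulting channels is an identical copy of the depth map, so no information is added or lost by the conversion.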
Further, S700 comprises:
optimizing the parameters of the training network by gradient descent according to the sum of the first loss, the second loss and the third loss.
In the present embodiment, gradient descent is prior art and is not described in detail here.
It can be understood that, with this scheme, the parameters of the training network can be accurately optimized from the losses, making the constructed skeleton detection model more accurate.
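The gradient-descent update itself is standard; in a framework it would be an optimizer step, but the bare rule applied to the summed loss is simply the following (the learning rate is an assumed illustrative value):

```python
import numpy as np

def sgd_step(params, grads, lr=0.001):
    """One vanilla gradient-descent update: p <- p - lr * dL/dp, where the
    gradients are taken with respect to the summed (first + second + third)
    loss from S700."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Repeating this update over the looped steps S100 to S700 is what drives the loss toward convergence in S800.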
Further, the RGB branch and the depth branch share weights during training: the two "networks" are the same network, with the same structure and the same parameters, i.e. both the training RGB image and the training depth image are input into one training network for training.
It can be understood that the RGB image and the depth image can be regarded as different views of the same human image, and the network's predictions for the two views should agree, i.e. the mutual information between different views of the same scene is maximized; sharing weights during training therefore ensures the accuracy and reliability of the training result.
[Second Embodiment]
This embodiment provides an image data recognition method for human body posture which uses a skeleton detection model constructed by the construction method of any embodiment of the invention. The method comprises: acquiring an RGB image of a user; inputting the RGB image into the skeleton detection model and obtaining first human skeleton key-point coordinates, which are 2D skeleton key-point coordinates.
Further, in another variant the method comprises: acquiring a depth image of a user; inputting the depth image into the skeleton detection model and obtaining second human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
Further, in a third variant the method comprises: acquiring an RGB image and a depth image of a user; inputting the RGB image and the depth image into the skeleton detection model and obtaining third human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
In the present embodiment, an RGB image and a depth image of the upper body of the user are acquired. In this embodiment, the ergonomic smart device includes an image real-time capturing device, that is, 1 color camera and 1 depth camera are disposed right in front of the user, and are respectively used for capturing an RGB image and a depth image of the upper body of the user in real time. Depth images, also known as range images, refer to images having the distance (depth) from an image capture to each point in a scene as a pixel value, which directly reflects the geometry of the visible surface of the scene.
It should be noted that, an RGB image is input, and the skeleton detection model can obtain coordinates of key points of the first human skeleton, and since the RGB image is a 2D color image, the coordinates of the key points of the first human skeleton are 2D coordinates of the key points of the skeleton; inputting a depth image, wherein the skeleton detection model can acquire coordinates of key points of a second human skeleton, and the coordinates of the key points of the second human skeleton are 3D (three-dimensional) skeleton coordinates due to the fact that the depth image is a 3D image; and simultaneously inputting the RGB image and the depth image, wherein the skeleton detection model can obtain a third human skeleton key point coordinate, and the third human skeleton key point coordinate is a 3D skeleton key point coordinate.
It can be understood that, whether the RGB image or the depth image is input separately or both are input simultaneously, the skeleton detection model can adaptively and accurately output the human skeleton key point coordinates, so that the image data identification method of this embodiment can adapt to more scenarios.
Further, after the RGB image is input into the skeleton detection model, convolution down-sampling and convolution up-sampling operations are carried out multiple times to obtain feature maps at multiple scales; the feature maps are fused, a 1x1 convolution is applied to obtain a human body key point heat map, and the first human skeleton key point coordinates are obtained from the human body key point heat map through heat map regression.
Illustratively, for each first human skeleton key point output, the dimension is 1 × 17 × 3, where 1 represents the number of people, 17 represents the 17 key points on each person, and 3 represents the coordinates and confidence of each key point.
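As a hedged sketch (not the patent's own code), the heat map regression step described above can be illustrated with a simple argmax decoder: each key point's heat map is searched for its hottest pixel, and the peak activation is taken as the confidence value. The array shapes and the choice of peak value as confidence are illustrative assumptions.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode per-keypoint heat maps into (x, y, confidence) triples.

    heatmaps: array of shape (num_keypoints, H, W), one heat map per key point.
    Returns an array of shape (num_keypoints, 3): x, y in heat map pixels,
    plus the peak activation as a confidence score.
    """
    num_kp, h, w = heatmaps.shape
    out = np.zeros((num_kp, 3), dtype=np.float32)
    for k in range(num_kp):
        idx = np.argmax(heatmaps[k])        # flat index of the hottest pixel
        y, x = divmod(idx, w)               # convert to row/column coordinates
        out[k] = (x, y, heatmaps[k, y, x])  # confidence = peak heat value
    return out

# One person, 17 key points -> output of shape (1, 17, 3) as in the text
person_heatmaps = np.random.rand(17, 64, 48).astype(np.float32)
keypoints = decode_heatmaps(person_heatmaps)[None, ...]
print(keypoints.shape)  # (1, 17, 3)
```

In practice, heat map regression implementations often refine the argmax location with sub-pixel interpolation; the plain argmax above is the minimal form of the idea.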
In this embodiment, the high-resolution feature maps are down-sampled by one or several consecutive 3 × 3 convolutions with stride 2, and the feature maps of different resolutions are then fused by element-wise addition. Similarly, the resolution of the low-resolution feature map is raised by up-sampling, a 1 × 1 convolution is then used to make its channel count match the high-resolution feature map, and the feature fusion operation is performed. In the up-sampling operation, nearest-neighbor interpolation first aligns the width and height of the feature maps, and a 1 × 1 convolution then aligns the channel counts. For 2-fold down-sampling, a single 3 × 3 convolution with stride 2 is used; for 4-fold down-sampling, two consecutive 3 × 3 convolutions with stride 2 are used.
It can be understood that, by the method of this embodiment, the acquired multi-scale feature maps can be more accurate, so that the first human skeleton key point coordinates of the user can in turn be acquired more accurately.
Further, performing the multiple convolution down-sampling and convolution up-sampling operations includes: performing the down-sampling multiple times using at least one consecutive 3x3 convolution with stride 2, and performing the up-sampling multiple times using at least one 1x1 convolution. This allows the feature maps at multiple scales to be acquired more accurately.
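The fusion of a low-resolution branch into a high-resolution branch, as described above, can be sketched in numpy. This is a minimal illustration, not the patent's implementation: nearest-neighbor up-sampling aligns the spatial size, a 1x1 convolution (a per-pixel linear map over channels) aligns the channel count, and the branches are added element-wise. All shapes and weights are illustrative assumptions.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbor up-sampling of a (C, H, W) feature map, as used
    before the 1x1 channel-alignment convolution."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map over channels:
    weight has shape (C_out, C_in)."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(-1, h, w)

# High-resolution branch: 32 channels at 64x48.
# Low-resolution branch: 64 channels at 32x24.
high = np.random.rand(32, 64, 48).astype(np.float32)
low = np.random.rand(64, 32, 24).astype(np.float32)

w = np.random.rand(32, 64).astype(np.float32)   # illustrative 1x1 weights
low_up = conv1x1(nearest_upsample(low, 2), w)   # align resolution, then channels
fused = high + low_up                           # element-wise addition
print(fused.shape)  # (32, 64, 48)
```

The down-sampling direction (strided 3x3 convolution) is omitted here for brevity; its effect on shape is simply to halve the width and height per stride-2 convolution.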
It should be noted that the image data recognition method for human body postures according to this embodiment can be applied to an ergonomic smart device. In daily use, the optimal height of the ergonomic smart device is the height at which the desktop is flush with the user's elbows. At this height, whether the user is typing on a keyboard or writing at the desk, shrugging of the shoulders is prevented and the user's spine is protected. When using the ergonomic smart device, the user places both hands flat on the desktop, and the method further adjusts the height of the device according to the 3D human skeleton computed in real time, so that the device is maintained at the optimal height.
It can be understood that, with the method of this embodiment, the user's posture information, namely the human skeleton key point coordinates, is recognized from the RGB image and/or the depth image acquired in real time, so that the ergonomic smart device can be adjusted to a suitable height according to that posture. The user no longer needs to spend thought on adjusting the desktop height while working, can work more attentively and efficiently, and the comfort of the user experience is effectively improved.
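The height-maintenance idea above can be sketched as a trivial control rule. This is purely illustrative: the function name, units, and tolerance are assumptions, not the patent's control law, and a real device would rate-limit and smooth the motion.

```python
def adjust_desk_height(current_height_cm, elbow_height_cm, tolerance_cm=1.0):
    """Hypothetical control rule: move the desktop toward the user's elbow
    height (estimated from the 3D skeleton key points) so the two stay flush.

    Returns the new target height in centimetres.
    """
    error = elbow_height_cm - current_height_cm
    if abs(error) <= tolerance_cm:
        return current_height_cm  # already close enough, do not move
    return current_height_cm + error

print(adjust_desk_height(72.0, 75.5))  # desk below elbows: raise to 75.5
print(adjust_desk_height(75.0, 75.5))  # within tolerance: stay at 75.0
```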
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A construction method of a skeleton detection model based on multi-view knowledge distillation is characterized by comprising the following steps:
s100: acquiring a training image with a label, and marking the label on the training image, namely establishing a corresponding relation between the human skeleton key point coordinates of the training image and the training image;
s200: acquiring a training RGB image and a training depth image according to the training image;
s300: inputting the training RGB image and the training depth image into a training network, and respectively acquiring a first thermodynamic diagram and a second thermodynamic diagram;
s400: converting the label into a first correct thermodynamic diagram, calculating a first loss of the first thermodynamic diagram and the first correct thermodynamic diagram, and a second loss of the second thermodynamic diagram and the first correct thermodynamic diagram;
s500: determining a first skeleton key point and a second skeleton key point respectively through a heat map regression technology according to the first heat map and the second heat map;
s600: calculating a third loss by using a mean square error of the first skeleton key point and the second skeleton key point;
s700: optimizing parameters of the training network according to a superposition of the first loss, the second loss, and the third loss;
s800: and acquiring a plurality of training images with labels, circulating the steps from S100 to S700, iterating until loss convergence, finishing training, and fixing parameters of the training network so as to construct a skeleton detection model.
2. The building method according to claim 1, wherein the function of calculating the first loss and the second loss is an OHKM loss function.
3. The construction method according to claim 1,
after the S400, further comprising:
s450: calculating a fourth loss by using the first thermodynamic diagram and the second thermodynamic diagram by adopting a mean square error;
the S700 includes:
optimizing the parameters of the training network according to a superposition of the first loss, the second loss, the third loss, and the fourth loss.
4. The building method according to claim 1, wherein the S300 includes:
s310: acquiring the number n of target channels of the training network;
s320: and copying and converting the training RGB image and the training depth image into images with the number of channels being the number n of the target channels, inputting the images into the training network, and respectively acquiring the first thermodynamic diagram and the second thermodynamic diagram.
5. The building method according to claim 1, wherein the S700 includes:
optimizing the parameters of the training network by gradient descent according to the superposition of the first loss, the second loss, and the third loss.
6. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring an RGB image of a user;
inputting the RGB image into the skeleton detection model to obtain a first human skeleton key point coordinate;
and the first human skeleton key point coordinate is a 2D skeleton key point coordinate.
7. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring a depth image of a user;
inputting the depth image into the skeleton detection model to obtain a second human body skeleton key point coordinate;
and the second human body skeleton key point coordinate is a 3D skeleton key point coordinate.
8. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring an RGB image and a depth image of a user;
inputting the RGB image and the depth image into the skeleton detection model to obtain a third human skeleton key point coordinate;
and the third human body skeleton key point coordinate is a 3D skeleton key point coordinate.
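The loss superposition described in claims 1 to 3 can be sketched numerically as follows. This is an illustrative reading, not the patent's implementation: the OHKM (online hard keypoint mining) formulation shown here, averaging only the top-k hardest key points, follows common pose-estimation practice, and all tensor shapes and the top-k value are assumptions.

```python
import numpy as np

def ohkm_loss(pred, target, top_k=8):
    """One common OHKM formulation: compute the per-keypoint MSE between
    predicted and ground-truth heat maps, then average only the top_k
    hardest (largest-loss) key points.

    pred, target: arrays of shape (num_keypoints, H, W).
    """
    per_kp = ((pred - target) ** 2).mean(axis=(1, 2))  # MSE per key point
    hardest = np.sort(per_kp)[-top_k:]                 # keep the hardest k
    return hardest.mean()

def mse_loss(a, b):
    return ((a - b) ** 2).mean()

# Toy tensors standing in for the quantities named in claim 1.
gt_heatmap = np.random.rand(17, 64, 48)  # first correct heat map (from label)
hm_rgb = np.random.rand(17, 64, 48)      # first heat map (RGB branch)
hm_depth = np.random.rand(17, 64, 48)    # second heat map (depth branch)
kp_rgb = np.random.rand(17, 2)           # first skeleton key points
kp_depth = np.random.rand(17, 2)         # second skeleton key points

first_loss = ohkm_loss(hm_rgb, gt_heatmap)    # claim 2: OHKM vs. ground truth
second_loss = ohkm_loss(hm_depth, gt_heatmap)
third_loss = mse_loss(kp_rgb, kp_depth)       # claim 1, S600: cross-branch MSE
fourth_loss = mse_loss(hm_rgb, hm_depth)      # claim 3, S450: heat map MSE

total = first_loss + second_loss + third_loss + fourth_loss  # superposition
print(total > 0)
```

Per claim 1 (S700), the parameters of the training network would then be optimized against `total`, e.g. by the gradient descent of claim 5.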
CN202211592632.2A 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method Active CN115620016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592632.2A CN115620016B (en) 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method


Publications (2)

Publication Number Publication Date
CN115620016A true CN115620016A (en) 2023-01-17
CN115620016B CN115620016B (en) 2023-03-28

Family

ID=84879739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592632.2A Active CN115620016B (en) 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method

Country Status (1)

Country Link
CN (1) CN115620016B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984972A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Human body posture identification method based on motion video drive

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647639A (en) * 2018-05-10 2018-10-12 电子科技大学 Real-time body's skeletal joint point detecting method
US20200272888A1 (en) * 2019-02-24 2020-08-27 Microsoft Technology Licensing, Llc Neural network for skeletons from input images
CN111652047A (en) * 2020-04-17 2020-09-11 福建天泉教育科技有限公司 Human body gesture recognition method based on color image and depth image and storage medium
CN114693779A (en) * 2022-04-02 2022-07-01 蔚来汽车科技(安徽)有限公司 Method and device for determining three-dimensional key points of hand


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANGEL MARTÍNEZ-GONZÁLEZ et al.: "Real-time Convolutional Networks for Depth-based Human Pose Estimation"
LIN Yilin et al.: "Three-dimensional hand pose estimation algorithm based on cascaded features and graph convolution"

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984972A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Human body posture identification method based on motion video drive
CN115984972B (en) * 2023-03-20 2023-08-11 乐歌人体工学科技股份有限公司 Human body posture recognition method based on motion video driving

Also Published As

Publication number Publication date
CN115620016B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN108898063B (en) Human body posture recognition device and method based on full convolution neural network
Zeng et al. View-invariant gait recognition via deterministic learning
CN113283525B (en) Image matching method based on deep learning
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
WO2019035155A1 (en) Image processing system, image processing method, and program
JP2019125057A (en) Image processing apparatus, method thereof and program
CN113516693B (en) Rapid and universal image registration method
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN108154066B (en) Three-dimensional target identification method based on curvature characteristic recurrent neural network
CN110570474B (en) Pose estimation method and system of depth camera
CN105279522A (en) Scene object real-time registering method based on SIFT
CN115620016B (en) Skeleton detection model construction method and image data identification method
KR20160088814A (en) Conversion Method For A 2D Image to 3D Graphic Models
JP5027030B2 (en) Object detection method, object detection apparatus, and object detection program
CN112750198A (en) Dense correspondence prediction method based on non-rigid point cloud
CN114494594B (en) Deep learning-based astronaut operation equipment state identification method
Yin et al. Estimation of the fundamental matrix from uncalibrated stereo hand images for 3D hand gesture recognition
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Darujati et al. Facial motion capture with 3D active appearance models
KR20050063991A (en) Image matching method and apparatus using image pyramid
CN111531546B (en) Robot pose estimation method, device, equipment and storage medium
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
JP6198104B2 (en) 3D object recognition apparatus and 3D object recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant