CN115620016A - Skeleton detection model construction method and image data identification method - Google Patents


Info

Publication number
CN115620016A
CN115620016A (application CN202211592632.2A; granted publication CN115620016B)
Authority
CN
China
Prior art keywords
training
loss
image
skeleton
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211592632.2A
Other languages
Chinese (zh)
Other versions
CN115620016B (English)
Inventor
项乐宏
王翀
夏银水
李裕麒
郑瑜杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Loctek Ergonomic Technology Co Ltd
Original Assignee
Loctek Ergonomic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Loctek Ergonomic Technology Co Ltd
Priority to CN202211592632.2A
Publication of CN115620016A
Application granted
Publication of CN115620016B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/34Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a skeleton detection model construction method and an image data identification method. The construction method comprises the following steps: acquiring a training RGB image and a training depth image from a training image; inputting the training RGB image and the training depth image into a training network and obtaining a first heat map and a second heat map respectively; converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map; determining a first skeleton key point and a second skeleton key point from the first and second heat maps respectively by heat-map regression; calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error; and optimizing the parameters of the training network according to the sum of the first, second and third losses. The invention addresses the problem that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.

Description

Skeleton detection model construction method and image data identification method
Technical Field
The invention relates to the technical field of image data processing, and in particular to a skeleton detection model construction method and an image data identification method.
Background
Human body posture recognition is the process of detecting the positions of human key points in an image or video and constructing a human skeleton graph. The resulting posture information can be used for further tasks such as action recognition, human-computer interaction, and abnormal-behaviour detection. However, human limbs are flexible, posture appearance varies greatly, and recognition is easily affected by changes in viewpoint and clothing.
In the prior art, the HRNet model is often used to detect skeleton key points in human posture recognition. Traditional HRNet trains the model using only RGB images, so the accuracy and robustness of the finally trained HRNet skeleton model are insufficient, and in turn the precision of human posture detection is insufficient.
The problem in the related art is therefore that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.
Disclosure of Invention
The invention addresses the problem that the prior art cannot effectively improve the robustness of a skeleton detection model through model training.
To solve this problem, a first object of the present invention is to provide a method for constructing a skeleton detection model based on multi-view knowledge distillation.
A second object of the present invention is to provide an image data recognition method for human body posture.
To achieve the first object of the present invention, an embodiment of the present invention provides a method for constructing a skeleton detection model based on multi-view knowledge distillation, the method comprising:
S100: acquiring a labelled training image, where labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image;
S200: acquiring a training RGB image and a training depth image from the training image;
S300: inputting the training RGB image and the training depth image into a training network, and obtaining a first heat map and a second heat map respectively;
S400: converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map;
S500: determining a first skeleton key point and a second skeleton key point from the first heat map and the second heat map respectively, using heat-map regression;
S600: calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error;
S700: optimizing the parameters of the training network according to the sum of the first loss, the second loss and the third loss;
S800: acquiring a number of labelled training images and repeating steps S100 to S700 until the loss converges; training is then complete, and the parameters of the training network are fixed to construct the skeleton detection model.
Compared with the prior art, this scheme has the following technical effect: HRNet subjected to multi-view knowledge distillation is more robust across different views of the same scene, so the disclosed construction method effectively improves the robustness of the skeleton detection model, and the constructed model effectively improves the precision of human skeleton detection.
In one embodiment of the invention, the function used to calculate the first loss and the second loss is the OHKM loss function.
Compared with the prior art, this has the following technical effect: using the OHKM loss function makes the obtained first loss and second loss more accurate.
In an embodiment of the present invention, after S400 the method further comprises:
S450: calculating a fourth loss between the first heat map and the second heat map using the mean squared error;
and S700 comprises:
optimizing the parameters of the training network according to the sum of the first loss, the second loss, the third loss and the fourth loss.
Compared with the prior art, this has the following technical effect: adding the calculation of the fourth loss makes the finally trained parameters of the training network more accurate, and the skeleton detection model correspondingly more capable and more robust.
In one embodiment of the present invention, S300 comprises:
S310: acquiring the number of input channels n of the training network;
S320: copying and converting the training RGB image and the training depth image into images with n channels, inputting them into the training network, and obtaining the first heat map and the second heat map respectively.
Compared with the prior art, this has the following technical effect: it allows the training RGB image and the training depth image to be input into the same training network at the same time, making the generation of the subsequent heat maps more stable and effectively improving the stability and reliability of the whole construction method.
In one embodiment of the present invention, S700 comprises:
optimizing the parameters of the training network by gradient descent according to the sum of the first loss, the second loss and the third loss.
Compared with the prior art, this has the following technical effect: the parameters of the training network can be accurately optimized from the losses, making the constructed skeleton detection model more accurate.
To achieve the second object of the present invention, an embodiment of the present invention provides an image data recognition method for human body posture which uses a skeleton detection model constructed by the construction method of any embodiment of the invention. The method comprises: acquiring an RGB image of a user; inputting the RGB image into the skeleton detection model and obtaining first human skeleton key-point coordinates, which are 2D skeleton key-point coordinates.
In an embodiment of the present invention, the method comprises: acquiring a depth image of a user; inputting the depth image into the skeleton detection model and obtaining second human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
In an embodiment of the present invention, the method comprises: acquiring an RGB image and a depth image of a user; inputting the RGB image and the depth image into the skeleton detection model and obtaining third human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
Compared with the prior art, the technical effect is as follows: whether the RGB image or the depth image is input alone, or both are input together, the skeleton detection model adaptively and accurately outputs human skeleton key-point coordinates, so the image data recognition method of this embodiment adapts to more situations.
Drawings
FIG. 1 is a flow chart of steps of a method for building a multi-view knowledge distillation-based skeleton detection model according to some embodiments of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
[First Embodiment]
Referring to FIG. 1, this embodiment provides a method for constructing a skeleton detection model based on multi-view knowledge distillation, comprising the following steps:
S100: acquiring a labelled training image, where labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image;
S200: acquiring a training RGB image and a training depth image from the training image;
S300: inputting the training RGB image and the training depth image into a training network, and obtaining a first heat map and a second heat map respectively;
S400: converting the label into a first correct heat map, and calculating a first loss between the first heat map and the first correct heat map and a second loss between the second heat map and the first correct heat map;
S500: determining a first skeleton key point and a second skeleton key point from the first heat map and the second heat map respectively, using heat-map regression;
S600: calculating a third loss between the first skeleton key point and the second skeleton key point using the mean squared error;
S700: optimizing the parameters of the training network according to the sum of the first loss, the second loss and the third loss;
S800: acquiring a number of labelled training images and repeating steps S100 to S700 until the loss converges; training is then complete, and the parameters of the training network are fixed to construct the skeleton detection model.
In this embodiment, a skeleton detection model constructed by the construction method of the present invention can be applied to an ergonomic smart device, so that when the device cannot capture a complete image of the user's posture, the skeleton detection model still identifies and acquires the posture in which the user is using the device.
It should be noted that ergonomic smart devices include, but are not limited to, lifting desks and lifting platforms; the user typically rests both hands on the device to work or study, and the height of the device can be adjusted by a motor.
In the prior art, HRNet is adopted as the recognition model. HRNet was proposed for the 2D human pose estimation task and mainly targets the pose estimation of a single individual, i.e. there is only one human target in the image input into the network. HRNet connects sub-networks from high to low resolution in parallel and uses repeated multi-scale fusion, so that low-resolution representations of the same depth and similar level enhance the high-resolution representation. The final output of the model comprises a number of skeleton key points of the human body.
Traditional HRNet trains the model using only RGB images. A classical assumption in self-supervised learning is that a strong representation is one that models view-invariant factors. In the scheme of the invention, the collected RGB image and depth image of the human body can be regarded as different views of the same scene, and the network's predictions for the two views are kept consistent, i.e. the mutual information between different views of the same scene is maximized. Different views of the same scene provide more information for training the model.
Further, in S100, a labelled training image is acquired; labelling the training image means establishing a correspondence between the training image and the human skeleton key-point coordinates of the training image. It should be noted that in the construction method of this embodiment the label may be entered by an operator from the RGB image, and comprises the coordinates of several human skeleton key points; once the label is determined, it can be converted into a correct heat map of the correct key-point coordinates. The training images comprise at least RGB images and depth images.
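The conversion from labelled key-point coordinates to a "correct heat map" is treated as prior art above; a minimal sketch of the common Gaussian-rendering approach might look like the following (the function name, heat-map size and sigma are assumptions for illustration, not values from the patent):

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, heatmap_size=(64, 48), sigma=2.0):
    """Render one Gaussian peak per labelled key point.

    keypoints: list of (x, y) pixel coordinates in heat-map space.
    Returns an array of shape (num_keypoints, H, W) with peak values of 1.
    """
    h, w = heatmap_size
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # Gaussian centred on the labelled coordinate
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

The peak of each rendered map sits exactly at the labelled coordinate, which is what the supervised losses in S400 compare against.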
Further, in S200, a training RGB image and a training depth image are acquired from the training image. The training images come from a database containing a number of training RGB images and training depth images used to train the skeleton detection model. A training RGB image is a colour image; a training depth image, also called a range image, stores the distance (depth) from the capture device to each point in the scene as pixel values, and directly reflects the geometry of the visible surfaces of the scene.
Further, in S300, the training RGB image and the training depth image are input into the same training network, i.e. the HRNet training network based on multi-view distillation, and a first heat map and a second heat map are obtained respectively.
Further, in S400, the label is converted into a first correct heat map, a first loss is calculated between the first heat map and the first correct heat map, and a second loss is calculated between the second heat map and the first correct heat map. It should be noted that converting a label into a correct heat map is prior art and is not described again here.
Further, in S500, a first skeleton key point and a second skeleton key point are determined from the first heat map and the second heat map respectively using heat-map regression. It should be noted that heat-map regression is prior art and is not described again here.
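In its simplest form, heat-map regression takes the argmax of each heat map; a sketch follows (the sub-pixel refinement that practical implementations add is omitted, and the function name is an assumption):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Recover one (x, y, confidence) triple per key point from its heat map."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, h * w)
    idx = flat.argmax(axis=1)   # flattened location of each peak
    conf = flat.max(axis=1)     # peak height doubles as confidence
    return np.stack([idx % w, idx // w, conf], axis=1)  # shape (k, 3)
```

Applied to the first and second heat maps, this yields the first and second skeleton key points that S600 compares.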
Further, in S600, a third loss is calculated between the first skeleton key point and the second skeleton key point using the mean squared error. The two output groups of skeleton key points should agree: since the training RGB image and the training depth image depict the same scene with the same human posture, whichever picture is input, the network should produce the same result, so the two groups of key points obtained from the two heat maps are constrained to be similar with a mean-squared-error loss.
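The third loss is a plain mean squared error between the two predicted key-point sets; as a sketch (function and argument names are illustrative):

```python
import numpy as np

def mse_loss(pred_a, pred_b):
    """Mean squared error between two arrays of key-point coordinates.

    Used here as the consistency term that pulls the RGB-branch and
    depth-branch predictions of the same scene toward each other.
    """
    pred_a = np.asarray(pred_a, dtype=np.float64)
    pred_b = np.asarray(pred_b, dtype=np.float64)
    return float(((pred_a - pred_b) ** 2).mean())
```

The same function also serves for the fourth loss described later, which applies it to the two heat maps rather than the decoded key points.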
Further, in S700, the parameters of the training network are optimized according to the sum of the first loss, the second loss and the third loss.
Further, in S800, further labelled training images are acquired and steps S100 to S700 are repeated until the loss converges; training is then complete and the parameters of the training network are fixed, yielding the skeleton detection model. It should be noted that each pass through S100 to S700 further optimizes the model parameters; once the loss has converged after repeated passes, training is complete.
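Putting S100 to S800 together, one pass over the data might be organized as below. The model, loss and optimizer arguments are hypothetical placeholders; only the loss bookkeeping mirrors the method described above:

```python
import numpy as np

def train_epoch(batches, forward, heatmap_loss, mse, step):
    """One S100-S700 pass: per sample, two heat maps from the shared
    network, two supervised losses against the correct heat map, one
    consistency loss between the branches, then an optimizer step on
    the sum. Returns the mean total loss for convergence checking (S800)."""
    totals = []
    for rgb, depth, correct_hm in batches:
        hm1, hm2 = forward(rgb), forward(depth)  # S300: shared weights
        l1 = heatmap_loss(hm1, correct_hm)       # S400: first loss
        l2 = heatmap_loss(hm2, correct_hm)       # S400: second loss
        l3 = mse(hm1, hm2)                       # consistency (cf. S450/S600)
        step(l1 + l2 + l3)                       # S700: optimize on the sum
        totals.append(l1 + l2 + l3)
    return float(np.mean(totals))
```

In practice `forward` would be the HRNet and `step` a real optimizer update; the sketch only shows how the three losses are combined per iteration.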
HRNet extracts features of the image, also called a representation. Provided the extracted features are good enough, an accurate skeleton can be obtained after heat-map regression. A classical assumption in self-supervised learning is that a strong representation is one that models view-invariant factors. In the scheme of the invention, HRNet extracts features from both the RGB image and the depth image; when the two sets of features agree, the mutual information between the two views is maximal and the extracted features are robust.
The benefit is that HRNet subjected to multi-view knowledge distillation is more robust across different views of the same scene: the construction method effectively improves the robustness of the skeleton detection model, and the constructed model effectively improves the precision of human skeleton detection.
Further, the function used to calculate the first loss and the second loss is the OHKM (online hard keypoint mining) loss function.
It should be noted that the OHKM loss function is prior art; this embodiment applies it to the skeleton detection model construction method, which helps the training network complete its training task efficiently.
Understandably, using the OHKM loss function makes the obtained first loss and second loss more accurate.
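OHKM computes a per-key-point loss and back-propagates only the hardest key points; a minimal sketch follows (topk=8 follows the common HRNet setting and is an assumption here, not stated in the patent):

```python
import numpy as np

def ohkm_loss(pred_hm, correct_hm, topk=8):
    """Online hard keypoint mining: per-key-point MSE against the correct
    heat map, averaged over only the topk key points with the largest
    error (the 'hard' ones)."""
    k = pred_hm.shape[0]
    # one scalar error per key-point channel
    per_kp = ((pred_hm - correct_hm) ** 2).reshape(k, -1).mean(axis=1)
    # keep the topk largest errors (all of them if topk >= k)
    hardest = np.sort(per_kp)[max(k - topk, 0):]
    return float(hardest.mean())
```

Easy key points thus stop dominating the gradient, which is why the first and second losses become more discriminative than a plain MSE over all channels.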
Further, after S400 the method further comprises:
S450: calculating a fourth loss between the first heat map and the second heat map using the mean squared error;
and S700 comprises:
optimizing the parameters of the training network according to the sum of the first, second, third and fourth losses.
Further, in S450 the two output heat maps should also agree: the RGB image and the depth image both depict the same scene with the same human posture, so whichever picture is input, the network should produce the same result, and the two heat maps are therefore constrained to be similar with a mean-squared-error loss.
Understandably, adding the fourth loss makes the finally trained parameters of the training network more accurate, and the skeleton detection model correspondingly more capable and more robust.
Further, S300 comprises:
S310: acquiring the number of input channels n of the training network;
S320: copying and converting the training RGB image and the training depth image into images with n channels, inputting them into the training network, and obtaining the first heat map and the second heat map respectively.
In this embodiment, because the number of channels of the training images does not always match the number of input channels of the training network, the training RGB image and the training depth image must be copied and converted into images with n channels before both can be input into the same training network at the same time.
Illustratively, n is 3. The training RGB image and the training depth image are each input into the same network (the network input has 3 channels; the depth image has 1 channel, so it is copied 3 times to form a 3-channel image), two groups of heat maps are obtained, and the first heat map and the second heat map are determined from them.
It can be understood that this allows the training RGB image and the training depth image to be input into the same training network at the same time, making the generation of the subsequent heat maps more stable and effectively increasing the stability and reliability of the whole construction method.
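The copy-to-3-channels step for the single-channel depth image can be sketched as follows (the function name is an assumption):

```python
import numpy as np

def depth_to_three_channels(depth):
    """Tile a (H, W) single-channel depth map into a (3, H, W) image so it
    matches the 3-channel input the training network expects (n = 3)."""
    return np.repeat(depth[np.newaxis, :, :], 3, axis=0)
```

Each of the three resulting channels is an identical copy of the depth map, so no information is added or lost by the conversion.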
Further, S700 comprises:
optimizing the parameters of the training network by gradient descent according to the sum of the first loss, the second loss and the third loss.
In the present embodiment, gradient descent is prior art and is not described in detail here.
It can be understood that, with this scheme, the parameters of the training network can be accurately optimized from the losses, making the constructed skeleton detection model more accurate.
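The gradient-descent update itself is standard; in a framework it would be an optimizer step, but the bare rule applied to the summed loss is simply the following (the learning rate is an assumed illustrative value):

```python
import numpy as np

def sgd_step(params, grads, lr=0.001):
    """One vanilla gradient-descent update: p <- p - lr * dL/dp, where the
    gradients are taken with respect to the summed (first + second + third)
    loss from S700."""
    return [p - lr * g for p, g in zip(params, grads)]
```

Repeating this update over the looped steps S100 to S700 is what drives the loss toward convergence in S800.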
Further, the RGB branch and the depth branch share weights during training: the two "networks" are the same network, with the same structure and the same parameters, i.e. both the training RGB image and the training depth image are input into one training network for training.
It can be understood that the RGB image and the depth image can be regarded as different views of the same human image, and the network's predictions for the two views should agree, i.e. the mutual information between different views of the same scene is maximized; sharing weights during training therefore ensures the accuracy and reliability of the training result.
[Second Embodiment]
This embodiment provides an image data recognition method for human body posture which uses a skeleton detection model constructed by the construction method of any embodiment of the invention. The method comprises: acquiring an RGB image of a user; inputting the RGB image into the skeleton detection model and obtaining first human skeleton key-point coordinates, which are 2D skeleton key-point coordinates.
Further, in another variant the method comprises: acquiring a depth image of a user; inputting the depth image into the skeleton detection model and obtaining second human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
Further, in a third variant the method comprises: acquiring an RGB image and a depth image of a user; inputting the RGB image and the depth image into the skeleton detection model and obtaining third human skeleton key-point coordinates, which are 3D skeleton key-point coordinates.
In the present embodiment, an RGB image and a depth image of the upper body of the user are acquired. In this embodiment, the ergonomic smart device includes an image real-time capturing device, that is, 1 color camera and 1 depth camera are disposed right in front of the user, and are respectively used for capturing an RGB image and a depth image of the upper body of the user in real time. Depth images, also known as range images, refer to images having the distance (depth) from an image capture to each point in a scene as a pixel value, which directly reflects the geometry of the visible surface of the scene.
It should be noted that, an RGB image is input, and the skeleton detection model can obtain coordinates of key points of the first human skeleton, and since the RGB image is a 2D color image, the coordinates of the key points of the first human skeleton are 2D coordinates of the key points of the skeleton; inputting a depth image, wherein the skeleton detection model can acquire coordinates of key points of a second human skeleton, and the coordinates of the key points of the second human skeleton are 3D (three-dimensional) skeleton coordinates due to the fact that the depth image is a 3D image; and simultaneously inputting the RGB image and the depth image, wherein the skeleton detection model can obtain a third human skeleton key point coordinate, and the third human skeleton key point coordinate is a 3D skeleton key point coordinate.
It can be understood that, whether the RGB image or the depth image is input separately or both are input simultaneously, the skeleton detection model can adaptively and accurately output the human skeleton key point coordinates, so that the image data identification method of this embodiment can adapt to more scenarios.
Further, after the RGB image is input into the skeleton detection model, convolution down-sampling and convolution up-sampling operations are carried out multiple times to obtain feature maps at multiple scales; the feature maps are fused, a 1x1 convolution is applied to obtain a human body key point heat map, and the first human skeleton key point coordinates are obtained from the human body key point heat map through heat map regression.
Illustratively, for each first human skeleton key point output, the dimension is 1 × 17 × 3, where 1 represents the number of people, 17 represents the 17 key points on each person, and 3 represents the coordinates and confidence of each key point.
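As a hedged sketch (not the patent's own code), the heat map regression step described above can be illustrated with a simple argmax decoder: each key point's heat map is searched for its hottest pixel, and the peak activation is taken as the confidence value. The array shapes and the choice of peak value as confidence are illustrative assumptions.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode per-keypoint heat maps into (x, y, confidence) triples.

    heatmaps: array of shape (num_keypoints, H, W), one heat map per key point.
    Returns an array of shape (num_keypoints, 3): x, y in heat map pixels,
    plus the peak activation as a confidence score.
    """
    num_kp, h, w = heatmaps.shape
    out = np.zeros((num_kp, 3), dtype=np.float32)
    for k in range(num_kp):
        idx = np.argmax(heatmaps[k])        # flat index of the hottest pixel
        y, x = divmod(idx, w)               # convert to row/column coordinates
        out[k] = (x, y, heatmaps[k, y, x])  # confidence = peak heat value
    return out

# One person, 17 key points -> output of shape (1, 17, 3) as in the text
person_heatmaps = np.random.rand(17, 64, 48).astype(np.float32)
keypoints = decode_heatmaps(person_heatmaps)[None, ...]
print(keypoints.shape)  # (1, 17, 3)
```

In practice, heat map regression implementations often refine the argmax location with sub-pixel interpolation; the plain argmax above is the minimal form of the idea.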
In this embodiment, the high-resolution feature maps are down-sampled by one or several consecutive 3 × 3 convolutions with stride 2, and the feature maps of different resolutions are then fused by element-wise addition. Similarly, the resolution of the low-resolution feature map is raised by up-sampling, a 1 × 1 convolution is then used to make its channel count match the high-resolution feature map, and the feature fusion operation is performed. In the up-sampling operation, nearest-neighbor interpolation first aligns the width and height of the feature maps, and a 1 × 1 convolution then aligns the channel counts. For 2-fold down-sampling, a single 3 × 3 convolution with stride 2 is used; for 4-fold down-sampling, two consecutive 3 × 3 convolutions with stride 2 are used.
It can be understood that, by the method of this embodiment, the acquired multi-scale feature maps can be more accurate, so that the first human skeleton key point coordinates of the user can in turn be acquired more accurately.
Further, performing the multiple convolution down-sampling and convolution up-sampling operations includes: performing the down-sampling multiple times using at least one consecutive 3x3 convolution with stride 2, and performing the up-sampling multiple times using at least one 1x1 convolution. This allows the feature maps at multiple scales to be acquired more accurately.
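The fusion of a low-resolution branch into a high-resolution branch, as described above, can be sketched in numpy. This is a minimal illustration, not the patent's implementation: nearest-neighbor up-sampling aligns the spatial size, a 1x1 convolution (a per-pixel linear map over channels) aligns the channel count, and the branches are added element-wise. All shapes and weights are illustrative assumptions.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbor up-sampling of a (C, H, W) feature map, as used
    before the 1x1 channel-alignment convolution."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight):
    """A 1x1 convolution is a per-pixel linear map over channels:
    weight has shape (C_out, C_in)."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(-1, h, w)

# High-resolution branch: 32 channels at 64x48.
# Low-resolution branch: 64 channels at 32x24.
high = np.random.rand(32, 64, 48).astype(np.float32)
low = np.random.rand(64, 32, 24).astype(np.float32)

w = np.random.rand(32, 64).astype(np.float32)   # illustrative 1x1 weights
low_up = conv1x1(nearest_upsample(low, 2), w)   # align resolution, then channels
fused = high + low_up                           # element-wise addition
print(fused.shape)  # (32, 64, 48)
```

The down-sampling direction (strided 3x3 convolution) is omitted here for brevity; its effect on shape is simply to halve the width and height per stride-2 convolution.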
It should be noted that the image data recognition method for human body postures according to this embodiment can be applied to an ergonomic smart device. In daily use, the optimal height of the ergonomic smart device is the height at which the desktop is flush with the user's elbows. At this height, whether the user is typing on a keyboard or writing at the desk, shrugging of the shoulders is prevented and the user's spine is protected. When using the ergonomic smart device, the user places both hands flat on the desktop, and the method further adjusts the height of the device according to the 3D human skeleton computed in real time, so that the device is maintained at the optimal height.
It can be understood that, with the method of this embodiment, the user's posture information, namely the human skeleton key point coordinates, is recognized from the RGB image and/or the depth image acquired in real time, so that the ergonomic smart device can be adjusted to a suitable height according to that posture. The user no longer needs to spend thought on adjusting the desktop height while working, can work more attentively and efficiently, and the comfort of the user experience is effectively improved.
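The height-maintenance idea above can be sketched as a trivial control rule. This is purely illustrative: the function name, units, and tolerance are assumptions, not the patent's control law, and a real device would rate-limit and smooth the motion.

```python
def adjust_desk_height(current_height_cm, elbow_height_cm, tolerance_cm=1.0):
    """Hypothetical control rule: move the desktop toward the user's elbow
    height (estimated from the 3D skeleton key points) so the two stay flush.

    Returns the new target height in centimetres.
    """
    error = elbow_height_cm - current_height_cm
    if abs(error) <= tolerance_cm:
        return current_height_cm  # already close enough, do not move
    return current_height_cm + error

print(adjust_desk_height(72.0, 75.5))  # desk below elbows: raise to 75.5
print(adjust_desk_height(75.0, 75.5))  # within tolerance: stay at 75.0
```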
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A construction method of a skeleton detection model based on multi-view knowledge distillation is characterized by comprising the following steps:
s100: acquiring a training image with a label, and marking the label on the training image, namely establishing a corresponding relation between the human skeleton key point coordinates of the training image and the training image;
s200: acquiring a training RGB image and a training depth image according to the training image;
s300: inputting the training RGB image and the training depth image into a training network, and respectively acquiring a first thermodynamic diagram and a second thermodynamic diagram;
s400: converting the label into a first correct thermodynamic diagram, calculating a first loss of the first thermodynamic diagram and the first correct thermodynamic diagram, and a second loss of the second thermodynamic diagram and the first correct thermodynamic diagram;
s500: determining a first skeleton key point and a second skeleton key point respectively through a heat map regression technology according to the first heat map and the second heat map;
s600: calculating a third loss by using a mean square error of the first skeleton key point and the second skeleton key point;
s700: optimizing parameters of the training network according to a superposition of the first loss, the second loss, and the third loss;
s800: and acquiring a plurality of training images with labels, circulating the steps from S100 to S700, iterating until loss convergence, finishing training, and fixing parameters of the training network so as to construct a skeleton detection model.
2. The building method according to claim 1, wherein the function of calculating the first loss and the second loss is an OHKM loss function.
3. The construction method according to claim 1,
after the S400, further comprising:
s450: calculating a fourth loss by using the first thermodynamic diagram and the second thermodynamic diagram by adopting a mean square error;
the S700 includes:
optimizing the parameters of the training network according to a superposition of the first loss, the second loss, the third loss, and the fourth loss.
4. The building method according to claim 1, wherein the S300 includes:
s310: acquiring the number n of target channels of the training network;
s320: and copying and converting the training RGB image and the training depth image into images with the number of channels being the number n of the target channels, inputting the images into the training network, and respectively acquiring the first thermodynamic diagram and the second thermodynamic diagram.
5. The building method according to claim 1, wherein the S700 includes:
optimizing the parameters of the training network by gradient descent according to the superposition of the first loss, the second loss, and the third loss.
6. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring an RGB image of a user;
inputting the RGB image into the skeleton detection model to obtain a first human skeleton key point coordinate;
and the first human skeleton key point coordinate is a 2D skeleton key point coordinate.
7. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring a depth image of a user;
inputting the depth image into the skeleton detection model to obtain a second human body skeleton key point coordinate;
and the second human body skeleton key point coordinate is a 3D skeleton key point coordinate.
8. An image data recognition method for a human body posture, characterized in that the image data recognition method uses a skeleton detection model constructed by the construction method according to any one of claims 1 to 5, and the image data recognition method comprises:
acquiring an RGB image and a depth image of a user;
inputting the RGB image and the depth image into the skeleton detection model to obtain a third human skeleton key point coordinate;
and the third human body skeleton key point coordinate is a 3D skeleton key point coordinate.
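The loss superposition described in claims 1 to 3 can be sketched numerically as follows. This is an illustrative reading, not the patent's implementation: the OHKM (online hard keypoint mining) formulation shown here, averaging only the top-k hardest key points, follows common pose-estimation practice, and all tensor shapes and the top-k value are assumptions.

```python
import numpy as np

def ohkm_loss(pred, target, top_k=8):
    """One common OHKM formulation: compute the per-keypoint MSE between
    predicted and ground-truth heat maps, then average only the top_k
    hardest (largest-loss) key points.

    pred, target: arrays of shape (num_keypoints, H, W).
    """
    per_kp = ((pred - target) ** 2).mean(axis=(1, 2))  # MSE per key point
    hardest = np.sort(per_kp)[-top_k:]                 # keep the hardest k
    return hardest.mean()

def mse_loss(a, b):
    return ((a - b) ** 2).mean()

# Toy tensors standing in for the quantities named in claim 1.
gt_heatmap = np.random.rand(17, 64, 48)  # first correct heat map (from label)
hm_rgb = np.random.rand(17, 64, 48)      # first heat map (RGB branch)
hm_depth = np.random.rand(17, 64, 48)    # second heat map (depth branch)
kp_rgb = np.random.rand(17, 2)           # first skeleton key points
kp_depth = np.random.rand(17, 2)         # second skeleton key points

first_loss = ohkm_loss(hm_rgb, gt_heatmap)    # claim 2: OHKM vs. ground truth
second_loss = ohkm_loss(hm_depth, gt_heatmap)
third_loss = mse_loss(kp_rgb, kp_depth)       # claim 1, S600: cross-branch MSE
fourth_loss = mse_loss(hm_rgb, hm_depth)      # claim 3, S450: heat map MSE

total = first_loss + second_loss + third_loss + fourth_loss  # superposition
print(total > 0)
```

Per claim 1 (S700), the parameters of the training network would then be optimized against `total`, e.g. by the gradient descent of claim 5.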
CN202211592632.2A 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method Active CN115620016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592632.2A CN115620016B (en) 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method


Publications (2)

Publication Number Publication Date
CN115620016A true CN115620016A (en) 2023-01-17
CN115620016B CN115620016B (en) 2023-03-28

Family

ID=84879739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592632.2A Active CN115620016B (en) 2022-12-13 2022-12-13 Skeleton detection model construction method and image data identification method

Country Status (1)

Country Link
CN (1) CN115620016B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984972A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Human body posture identification method based on motion video drive

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647639A (en) * 2018-05-10 2018-10-12 电子科技大学 Real-time body's skeletal joint point detecting method
US20200272888A1 (en) * 2019-02-24 2020-08-27 Microsoft Technology Licensing, Llc Neural network for skeletons from input images
CN111652047A (en) * 2020-04-17 2020-09-11 福建天泉教育科技有限公司 Human body gesture recognition method based on color image and depth image and storage medium
CN114693779A (en) * 2022-04-02 2022-07-01 蔚来汽车科技(安徽)有限公司 Method and device for determining three-dimensional key points of hand


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANGEL MARTÍNEZ-GONZÁLEZ et al.: "Real-time Convolutional Networks for Depth-based Human Pose Estimation"
LIN Yilin et al.: "Three-dimensional hand pose estimation algorithm based on cascaded features and graph convolution"

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984972A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Human body posture identification method based on motion video drive
CN115984972B (en) * 2023-03-20 2023-08-11 乐歌人体工学科技股份有限公司 Human body posture recognition method based on motion video driving

Also Published As

Publication number Publication date
CN115620016B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN108898063B (en) Human body posture recognition device and method based on full convolution neural network
Zeng et al. View-invariant gait recognition via deterministic learning
CN113283525B (en) Image matching method based on deep learning
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
WO2019035155A1 (en) Image processing system, image processing method, and program
JP2019125057A (en) Image processing apparatus, method thereof and program
CN113516693B (en) Rapid and universal image registration method
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN108154066B (en) Three-dimensional target identification method based on curvature characteristic recurrent neural network
CN110570474B (en) Pose estimation method and system of depth camera
CN105279522A (en) Scene object real-time registering method based on SIFT
CN115620016B (en) Skeleton detection model construction method and image data identification method
KR20160088814A (en) Conversion Method For A 2D Image to 3D Graphic Models
JP5027030B2 (en) Object detection method, object detection apparatus, and object detection program
CN112750198A (en) Dense correspondence prediction method based on non-rigid point cloud
CN114494594B (en) Deep learning-based astronaut operation equipment state identification method
Yin et al. Estimation of the fundamental matrix from uncalibrated stereo hand images for 3D hand gesture recognition
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Darujati et al. Facial motion capture with 3D active appearance models
KR20050063991A (en) Image matching method and apparatus using image pyramid
CN111531546B (en) Robot pose estimation method, device, equipment and storage medium
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
JP6198104B2 (en) 3D object recognition apparatus and 3D object recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant