CN114429474A

CN114429474A - Human body key point detection method and device and computer storage medium

Info

Publication number: CN114429474A
Application number: CN202011183648.9A
Authority: CN
Inventors: 徐兆坤; 卢运西; 冀怀远
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-05-03
Also published as: CA3136990A1

Abstract

The application relates to a human body key point detection method, a human body key point detection device and a computer storage medium. The method comprises the following steps: detecting whether a human body exists in a current scene; if no human body is detected, acquiring a first depth image; if a human body is detected, acquiring a second depth image and storing a human body detection frame; performing frame difference comparison on the acquired first depth image and second depth image to obtain a depth change area map; obtaining a human body mask, and obtaining a human body region image through the human body mask; obtaining a single human body region image, inputting it into a key point detection model, and outputting a plurality of human body key points; and obtaining the confidence of each human body key point, and obtaining the final human body key points according to the confidence. The human body mask-based multi-resolution key point detection method greatly improves the accuracy of key point detection, addresses false detections on the background, reduces cases in which key points are located on the wrong person, reduces wrong and missed orders in unmanned stores, reduces economic loss and improves user experience.

Description

Human body key point detection method and device and computer storage medium

Technical Field

The present application relates to the field of computer vision technologies, and in particular, to a method and an apparatus for detecting key points of a human body, and a computer storage medium.

Background

With the continuous development of the computer vision field, human body key point detection algorithms are attracting increasing attention. Human body key point detection is important for human body action recognition, human body posture description and even human body behavior prediction. It is a cornerstone of many computer vision tasks, such as human-goods interaction in unmanned stores, abnormal human motion detection in monitoring scenes, and motion capture in movie production.

At present, human body key point detection methods based on deep learning frameworks, which mainly comprise bottom-up methods and top-down methods, achieve remarkable results for a single person against a simple background. However, no method has yet achieved an ideal effect against a complicated background or when a plurality of persons are close to one another. There are two main categories of problems:

1. In an actual scene with a complex background, existing algorithms easily misdetect human body key points on the background.

2. When a plurality of people are close to each other, key points of the target human body are easily misdetected on the limbs of nearby people, and key points of other human bodies are easily misdetected on the limbs of the target human body.

Disclosure of Invention

In order to solve the above problems, the present invention proposes the following.

A human body key point detection method comprises the following steps:

detecting whether a human body exists in a current scene; if no human body is detected, acquiring a first depth image of the current scene and overwriting the previously stored first depth image in which no human body was detected; if a human body is detected, acquiring a second depth image of the current scene, and storing a human body detection frame corresponding to the second depth image;

comparing the frame difference of the collected first depth image and the second depth image to obtain a depth change area image;

obtaining a human body mask, and obtaining a human body region image through the human body mask;

obtaining a single human body region image by using a human body detection frame, inputting the single human body region image into a key point detection model, and outputting a plurality of human body key points;

and obtaining the confidence coefficient of each human body key point, and obtaining the final human body key point according to the confidence coefficient.

In one embodiment, frame difference comparing the first depth image and the second depth image comprises:

presetting a difference threshold, wherein the difference threshold is determined by a scene and a depth camera;

and subtracting the pixels of the first depth image and the second depth image one by one, recording a difference value if the difference value of the pixels is greater than a threshold value, and recording as zero if the difference value of the pixels is not greater than the threshold value.

In one embodiment, obtaining a human mask includes:

presetting a human body connected domain limiting threshold, and judging the size relationship between each non-zero value connected domain in the depth change region graph and the limiting threshold;

if the non-zero value connected domain is larger than the limited threshold value, determining that the non-zero connected domain is a human body connected domain, and calculating the mass center of the human body connected domain;

if the non-zero value connected domain is not larger than the limited threshold value, determining the non-zero connected domain as a non-human body connected domain and abandoning the non-zero connected domain;

performing region growing by taking each centroid as a reference point to obtain a first human body region mask;

performing human body instance segmentation on the second depth image, and combining the human body masks of the instances obtained by segmentation to obtain a second human body region mask;

and fusing the first human body area mask and the second human body area mask to obtain a human body mask.

In one embodiment, obtaining the human body region image through the human body mask includes:

intercepting a human body area in the second depth image by using a human body mask, and filtering a background part in the image to obtain a human body area image;

and preprocessing the human body region image.

In one embodiment, using the human body detection frame, obtaining a single human body region image comprises:

keeping the central point of the human body detection frame unchanged, and expanding the range of the human body detection frame in equal proportion;

and intercepting the preprocessed human body area image according to the expanded human body detection frame to obtain a single human body area image, and zooming the single human body area image to a preset size.

In one embodiment, inputting the single human body region image into the key point detection model and outputting a plurality of human body key points comprises:

inputting the single human body region image into the key point detection model;

outputting, by the key point detection model, a plurality of human body key point heat maps, wherein each human body key point heat map corresponds to one human body key point.

In one embodiment, obtaining the confidence of each human body key point, and obtaining the final human body key point according to the confidence comprises:

searching for the peak position of each human body key point heat map, determining the peak position as the detection position of the human body key point corresponding to that heat map, and determining the peak value as the confidence of the human body key point;

presetting a confidence threshold, and judging the magnitude relation between the confidence of each human body key point and the confidence threshold;

if the confidence coefficient of the human body key point is larger than the confidence coefficient threshold value, outputting the coordinate and the confidence coefficient of the human body key point;

if the confidence coefficient of the human body key point is not greater than the confidence coefficient threshold value, discarding the human body key point;

and obtaining the final key points of the human body.

In one embodiment, inputting the single human body region image into the key point detection model and outputting a plurality of human body key points further comprises:

inputting the single human body region image into the key point detection model;

down-sampling the input single human body region image to obtain a first feature map;

respectively performing linear interpolation and down-sampling on the first feature map to obtain a second feature map and a third feature map with different resolutions, starting the corresponding first resolution branch, second resolution branch and third resolution branch, and performing residual block processing on each branch;

performing a first multi-resolution cross fusion on the first resolution branch, the second resolution branch and the third resolution branch, wherein each branch adds its own feature map to the feature maps of all the other branches, and each branch is processed by residual blocks again after the complete fusion;

enlarging the feature maps of the first resolution branch and the second resolution branch by linear interpolation, and performing a second multi-resolution cross fusion with the feature map of the third resolution branch;

and outputting a plurality of human body key point heat maps according to the feature map obtained after the second multi-resolution cross fusion, wherein each human body key point heat map corresponds to one human body key point.

A human body key point detection device, the device comprising:

the detection module is used for detecting whether a human body exists in the current scene;

the first acquisition module is used for acquiring, if no human body is detected, a first depth image of the current scene and overwriting the previously stored first depth image in which no human body was detected;

the second acquisition module is used for acquiring, if a human body is detected, a second depth image of the current scene and storing a human body detection frame corresponding to the second depth image;

the contrast module is used for carrying out frame difference contrast on the acquired first depth image and the acquired second depth image to acquire a depth change area map;

the intercepting module is used for obtaining a human body mask and obtaining a human body area image through the human body mask;

the key point acquisition module is used for obtaining a single human body region image by using the human body detection frame, inputting the single human body region image into the key point detection model and outputting a plurality of human body key points;

and the judging module is used for obtaining the confidence coefficient of each human body key point and obtaining the final human body key point according to the confidence coefficient.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the steps of the method being performed when the computer program is executed by the processor.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

The invention provides a human body key point detection method, a human body key point detection device and a computer storage medium, and provides a multi-resolution human body key point detection method based on a human body mask, which has the following effects:

1. The method integrates two human body mask extraction methods, depth dynamic frame difference region growing and Yolact instance segmentation, to obtain complete human body mask information. The mask is used to filter out complex background interference in the corresponding image, which solves the problem of human body key points being misdetected on a complex background and greatly improves the accuracy of the algorithm.

2. The invention innovatively uses a parallel structure of multiple network branches at multiple resolutions, so that the key point detection model takes into account both local fine-grained information and whole-person global information. While accurately detecting local key points, it can attend to the relations among all key points of the whole person, which solves the problem of key points being misdetected on surrounding persons and improves key point accuracy.

3. The multi-resolution network branches in the human body key point detection network structure are cross-fused only twice, in the middle and at the end of the whole network. Each branch network therefore analyses the image fully before cross fusion, and the interference from other branches caused by frequent multi-scale fusion is avoided. This reduces the computation required by the network model, accelerates the running speed, reduces the interference among the resolution branches and improves the accuracy of the algorithm.

4. According to the invention, the RGB color image and the depth image in the scene are obtained by using the RGBD depth camera, the depth information is introduced on the basis of the RGB color texture information, and the depth information is added into an additional image channel, so that the information range which can be sensed by a human key point detection algorithm is expanded, and the accuracy of the algorithm is increased.

Drawings

FIG. 1 is a schematic diagram illustrating the steps of a method for detecting key points in a human body according to an embodiment;

FIG. 2 is a schematic overall flowchart of a method for detecting key points in a human body according to an embodiment;

FIG. 3 is a schematic diagram of a detection model structure of a method for detecting key points in a human body according to an embodiment;

FIG. 4 is a block diagram of an embodiment of a human keypoint detection apparatus;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In an embodiment, as shown in fig. 1 to 2, a method for detecting key points of a human body includes the following steps:

S100, detecting whether a human body exists in the current scene; if no human body is detected, acquiring a first depth image of the current scene and overwriting the previously stored first depth image in which no human body was detected; if a human body is detected, acquiring a second depth image of the current scene and storing a human body detection frame corresponding to the second depth image.

In this embodiment, a preceding human body detection algorithm is used to detect human bodies in the current scene and determine whether a human body exists in the current scene.

Then, one of the following two branches is taken based on the detection result.

If no human body is detected in the current scene, the scene is treated as empty: a first depth image of the current scene is acquired and archived, overwriting the previously stored first depth image in which no human body was detected.

If a human body is detected in the current scene, the scene is treated as normally occupied: a second depth image of the current scene is acquired and archived, and the information of the human body detection frame corresponding to the second depth image is stored. Preferably, the second depth image is an RGBD image.

S200, comparing the frame difference of the collected first depth image and the second depth image to obtain a depth change area image.

In this embodiment, the first depth image without human body and the second depth image with human body acquired in step S100 are subjected to frame difference comparison to obtain a depth change area map.

Specifically, for frame difference processing, firstly, a difference threshold is preset according to the characteristics of a scene and a depth camera, then, pixels of the first depth image and pixels of the second depth image are subtracted one by one, if the difference of the pixels is greater than the threshold, the difference is recorded, and if the difference of the pixels is not greater than the threshold, the difference is recorded as zero.
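The frame difference step can be illustrated with a minimal sketch. The function name `depth_frame_difference`, the use of plain nested lists instead of an image library, and the use of the absolute difference are assumptions made for illustration; the text only states that pixels are subtracted one by one and compared with a preset threshold.

```python
def depth_frame_difference(background, current, diff_threshold):
    """Return a depth-change map between an empty-scene frame and the current frame.

    background, current: 2-D lists of depth values with identical shape.
    diff_threshold: scene- and camera-dependent threshold (hypothetical value below).
    """
    change_map = []
    for bg_row, cur_row in zip(background, current):
        out_row = []
        for bg, cur in zip(bg_row, cur_row):
            diff = abs(cur - bg)  # assumption: absolute difference
            # Keep the difference only where it exceeds the threshold, else record zero.
            out_row.append(diff if diff > diff_threshold else 0)
        change_map.append(out_row)
    return change_map


background = [[3000, 3000, 3000],
              [3000, 3000, 3000]]
current = [[3000, 1800, 2990],
           [3005, 1750, 3000]]
change_map = depth_frame_difference(background, current, diff_threshold=50)
```

Small fluctuations (here 5 and 10 depth units) fall below the threshold and are zeroed, so only the column where a person has entered the scene remains non-zero.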

And S300, obtaining a human body mask, and obtaining a human body region image through the human body mask.

In this embodiment, the human body mask is obtained by a fusion method, and a human body region image is obtained by the human body mask.

Specifically, obtaining the human body mask includes:

A human body connected domain size limiting threshold is preset, the size of each non-zero value connected domain in the depth change area map obtained in step S200 is calculated, and the size of each non-zero value connected domain is compared with the limiting threshold.

If the non-zero value connected domain is larger than the limiting threshold, the non-zero connected domain is determined to be a human body connected domain, and the centroid of each human body connected domain is calculated as follows:

$$x_c = \frac{1}{I}\sum_{i=1}^{I} x_i, \qquad y_c = \frac{1}{I}\sum_{i=1}^{I} y_i$$

where (x_c, y_c) are the coordinates of the centroid of the human body connected domain, I is the number of pixel points contained in the human body connected domain, and (x_i, y_i) are the coordinates of the i-th pixel point in the connected domain.

And if the non-zero value connected domain is not larger than the limited threshold value, determining that the non-zero connected domain is a non-human body connected domain and abandoning the non-zero connected domain.
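The connected-domain filtering and centroid calculation described above can be sketched as follows. This is an illustrative sketch only: 4-connectivity, measuring domain size as a pixel count, and the names `connected_domains` and `human_centroids` are assumptions not stated in the text.

```python
from collections import deque


def connected_domains(change_map):
    """Group non-zero pixels into 4-connected domains (connectivity is an assumption).

    Returns a list of domains, each a list of (x, y) pixel coordinates.
    """
    height, width = len(change_map), len(change_map[0])
    seen = [[False] * width for _ in range(height)]
    domains = []
    for y in range(height):
        for x in range(width):
            if change_map[y][x] == 0 or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(x, y)]), []
            while queue:
                px, py = queue.popleft()
                pixels.append((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and change_map[ny][nx] != 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            domains.append(pixels)
    return domains


def human_centroids(change_map, size_threshold):
    """Keep domains with more than size_threshold pixels and return their centroids."""
    centroids = []
    for pixels in connected_domains(change_map):
        count = len(pixels)  # I, the number of pixels in the domain
        if count > size_threshold:
            x_c = sum(x for x, _ in pixels) / count
            y_c = sum(y for _, y in pixels) / count
            centroids.append((x_c, y_c))
    return centroids


change_map = [[0, 900, 900, 0, 0],
              [0, 900, 900, 0, 70],
              [0, 0, 0, 0, 0]]
centroids = human_centroids(change_map, size_threshold=2)
```

In this toy map the 2 × 2 block is kept as a human body connected domain, while the isolated single pixel is discarded as a non-human domain.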

Taking the centroid of each human body connected domain as the starting reference point, region growing is performed with a region growing algorithm to obtain a first human body region mask.

Preferably, the region growing algorithm is as follows. First, a growing threshold is set according to the precision and characteristics of the depth camera actually used. For each of the four pixel points nearest to the reference point (above, below, left and right), it is judged whether the difference between its depth value and that of the reference point is smaller than the growing threshold. If the difference is smaller than the growing threshold, the pixel point is added to the growing region where the reference point is located; otherwise, the pixel point does not belong to that growing region. Each pixel point newly added to the growing region is then set as a reference point, and the above steps are repeated until no pixel point in the growing region has a neighbouring pixel point that can still be grown.
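A minimal sketch of this region growing procedure is given below. The breadth-first queue, the function name `region_grow` and the integer seed are assumptions made for illustration; in practice a fractional centroid would presumably be rounded to the nearest pixel before being used as the seed.

```python
from collections import deque


def region_grow(depth, seed, grow_threshold):
    """Grow a region from seed=(x, y) over the four nearest neighbours.

    A neighbour joins the region when its depth differs from the current
    reference point by less than grow_threshold. Returns a 0/1 mask.
    """
    height, width = len(depth), len(depth[0])
    mask = [[0] * width for _ in range(height)]
    sx, sy = seed
    mask[sy][sx] = 1
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()  # current reference point
        for nx, ny in ((x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)):
            if 0 <= nx < width and 0 <= ny < height and mask[ny][nx] == 0:
                if abs(depth[ny][nx] - depth[y][x]) < grow_threshold:
                    mask[ny][nx] = 1
                    queue.append((nx, ny))  # new pixel becomes a reference point
    return mask


depth = [[3000, 1800, 1810, 3000],
         [3000, 1790, 1805, 3000],
         [3000, 3000, 3000, 3000]]
first_mask = region_grow(depth, seed=(1, 0), grow_threshold=30)
```

Growth stops at the floor pixels because the depth jump of more than 1000 units far exceeds the (hypothetical) threshold of 30.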

A Yolact deep learning human body instance segmentation network is used to perform human body instance segmentation on the second depth image and extract the human body mask of each instance; the human body masks of all instances are then combined to obtain a second human body region mask, which comprises the masks of all human bodies in the second depth image.

And carrying out binarization processing on the mask of the first human body region, setting non-zero-value pixel points as 1, and setting zero-value pixel points as 0. And then fusing the binarized first human body region mask and the second human body region mask to obtain a complete human body mask.

Further, obtaining a human body region image through a human body mask, including: and multiplying the human body mask obtained by using the method with the second depth image, intercepting a human body region in the second depth image, filtering a background part in the image to obtain a human body region image, and preprocessing the human body region image.
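The binarization, fusion and masking steps might look like the sketch below. The text does not say how the two masks are fused; a pixel-wise union (logical OR) is assumed here, and the helper names `binarize`, `fuse_masks` and `apply_mask` are hypothetical.

```python
def binarize(mask):
    """Set non-zero pixels to 1 and zero pixels to 0."""
    return [[1 if value != 0 else 0 for value in row] for row in mask]


def fuse_masks(first_mask, second_mask):
    """Fuse two binary masks. Assumption: fusion is a pixel-wise union."""
    return [[1 if (a or b) else 0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first_mask, second_mask)]


def apply_mask(image, mask):
    """Multiply every channel of every pixel by the mask, zeroing the background.

    image: 2-D list of pixels, each pixel a tuple of channel values (e.g. R, G, B, D).
    """
    return [[tuple(channel * m for channel in pixel)
             for pixel, m in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


first_mask = binarize([[0, 7, 0]])   # region-growing result, binarized
second_mask = [[0, 0, 1]]            # instance-segmentation result
human_mask = fuse_masks(first_mask, second_mask)

rgbd = [[(10, 20, 30, 3000), (11, 21, 31, 1800), (12, 22, 32, 1810)]]
human_region = apply_mask(rgbd, human_mask)
```

With a union, a pixel found by either extraction method is kept, which matches the stated goal of obtaining a complete human body mask; an intersection would instead keep only pixels on which both methods agree.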

Specifically, the preprocessing includes performing numerical truncation on the depth channel image in the human body region image, truncating the depth channel image to a suitable range according to the characteristics of the scene where the depth channel image is located, and then performing unified normalization processing on the human body region image. In one embodiment, in the unmanned store overhead view scenario, because the maximum photographable depth of the unmanned store ceiling overhead camera is 3500, the depth channel value is preferably truncated to the range of [0,3500], and values greater than 3500 are considered noise. In the embodiments of different scenes, the range of depth channel truncation can be adjusted according to the actual situation of the scene.
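As a rough illustration of the depth preprocessing, the sketch below clips the depth channel to [0, 3500] and then scales it to [0, 1]. The text specifies the truncation range for the unmanned-store scene but not the normalization scheme, so dividing by the maximum depth is an assumption, as is the function name `preprocess_depth`.

```python
def preprocess_depth(depth, max_depth=3500):
    """Clip depth values to [0, max_depth] and scale them to [0, 1].

    Assumption: normalization divides by max_depth; the text only says the
    image is normalized after truncation.
    """
    return [[min(max(value, 0), max_depth) / max_depth for value in row]
            for row in depth]


depth_channel = [[0, 1750, 3500, 4200]]
normalized = preprocess_depth(depth_channel)
```

The out-of-range reading of 4200 is treated as noise and clipped to the maximum before scaling.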

S400, obtaining a single human body region image by using the human body detection frame, inputting the single human body region image into a key point detection model, and outputting a plurality of human body key points.

In this embodiment, the human body region image obtained in step S300 is processed using the human body detection frame corresponding to the second depth image stored in step S100 to obtain a single human body region image. Preferably, the single human body region image is an RGBD four-channel image. The obtained single human body region image is input into the key point detection model, which outputs a plurality of human body key points.

Specifically, using the stored human body detection frame corresponding to the current second depth image, firstly, keeping the central point of the human body detection frame unchanged, expanding the range of the human body detection frame to 1.25 times of the original detection range, then according to the expanded range of the human body detection frame, intercepting a single human body region image from the preprocessed human body region image, and scaling the single human body region image to a preset size, preferably to a size of 256 × 256 resolution.
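The detection frame expansion can be sketched as below. It is assumed here that "1.25 times" scales the width and height of the frame, that the frame is given as (x_min, y_min, x_max, y_max), and that the expanded frame is clipped to the image bounds; none of these details is stated in the text, and `expand_box` is a hypothetical name.

```python
def expand_box(box, scale=1.25, image_size=None):
    """Expand a detection frame about its centre point.

    box: (x_min, y_min, x_max, y_max).
    image_size: optional (width, height); if given, the result is clipped to it
    (clipping is an assumption).
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2  # centre stays fixed
    half_w = (x_max - x_min) * scale / 2
    half_h = (y_max - y_min) * scale / 2
    expanded = [cx - half_w, cy - half_h, cx + half_w, cy + half_h]
    if image_size is not None:
        width, height = image_size
        expanded = [max(expanded[0], 0), max(expanded[1], 0),
                    min(expanded[2], width), min(expanded[3], height)]
    return tuple(expanded)


expanded = expand_box((100, 100, 200, 300))
clipped = expand_box((100, 100, 200, 300), image_size=(210, 320))
```

The crop taken with the expanded frame would then be resized to the preset size (256 × 256 in the preferred embodiment) with an image library of choice.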

The scaled single human body region image is input into a pre-trained multi-resolution key point detection model, which outputs a plurality of human body key point heat maps, each corresponding to one human body key point. Preferably, the model outputs 10 human body key point heat maps, yielding 10 human body key points. The 10 heat maps correspond, in order, to the left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip and right hip key points of the human body in each image.

In one embodiment, the network structure of the multi-resolution key point detection model of the present invention uses the standard residual block and the Bottleneck residual block of the ResNet network as basic constituent units. As shown in fig. 3, the network processes its input as follows:

The scaled 256 × 256 × 4 single human body region image is received as input.

The input single human body region image is down-sampled to 64 × 64 using two successive 3 × 3 convolutions with a stride of 2.

The 64 × 64 feature map is input into a Bottleneck residual block, a basic unit of the ResNet network.

Linear interpolation and down-sampling are respectively applied to the feature map output by the Bottleneck block to obtain two feature maps with different resolutions, 128 × 128 and 32 × 32, which start two additional network branches with different resolutions.

The three branches with different resolutions (32 × 32, 64 × 64 and 128 × 128) are each processed by 4 ResNet standard residual blocks, with the resolution of each branch kept unchanged.

For the first multi-resolution cross fusion, three branches with different resolutions are completely fused, and each branch needs to add the feature map of the branch with the feature maps of all other branches. Before addition, the feature maps of different branches need to be scaled to the same size.

Specifically, for example, in the 128 × 128 resolution branch, the input feature map has 128 × 128 resolution, and the input features are obtained by amplifying 64 × 64 and 32 × 32 feature maps to 128 × 128 through 2-fold and 4-fold linear interpolation, and then adding the amplified features to the original 128 × 128 features. In contrast, downsampling is performed using 3 × 3 convolution when the feature map is reduced, downsampling to one-half uses one 3 × 3 downsampling convolution, and downsampling to one-quarter uses two 3 × 3 downsampling convolutions.
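The rescaling rule used before each addition can be summarised in a small sketch that, for every pair of branches, names the operation bringing the source feature map to the target resolution. The helper `rescale_plan` is hypothetical, and it assumes that each 3 × 3 down-sampling convolution has a stride of 2, which is implied by "down-sampling to one-half" but not stated outright.

```python
def rescale_plan(source_res, target_res):
    """Describe how a source-branch feature map reaches the target resolution."""
    if source_res == target_res:
        return "identity"
    if source_res < target_res:
        # Lower-resolution maps are enlarged by linear interpolation.
        return f"linear interpolation x{target_res // source_res}"
    # Higher-resolution maps are reduced; each stride-2 3x3 convolution halves them.
    steps, res = 0, source_res
    while res > target_res:
        res //= 2
        steps += 1
    return f"{steps} stride-2 3x3 convolution(s)"


branches = [128, 64, 32]
# plan[target][source] is the operation applied before the feature maps are added.
plan = {target: {source: rescale_plan(source, target) for source in branches}
        for target in branches}
```

For instance, `plan[128][32]` is a 4-fold interpolation, while `plan[32][128]` chains two down-sampling convolutions, mirroring the one-half and one-quarter cases in the text.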

After complete reciprocal fusion, each branch passes through 4 ResNet standard residual blocks again.

The 64 × 64 and 32 × 32 resolution branches are then enlarged to 128 × 128 by 2-fold and 4-fold linear interpolation, respectively, and added to the 128 × 128 branch feature map to realize the second multi-resolution cross fusion.

After fusion, a 1 × 1 convolution with 10 output channels is applied to the 128 × 128 feature map to obtain the heat map outputs corresponding to the 10 key points; finally, the heat maps are enlarged back to the input size of 256 × 256 by linear interpolation.
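The spatial sizes quoted in this walkthrough can be checked with the standard convolution output-size formula. The sketch below assumes a padding of 1 for every 3 × 3 stride-2 convolution, which the text does not specify; the function name `conv_output_size` is likewise an illustrative choice.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


# Stem: two successive 3x3 stride-2 convolutions on the 256 x 256 input.
size = 256
stem_sizes = [size]
for _ in range(2):
    size = conv_output_size(size)
    stem_sizes.append(size)

# Three parallel branches started from the 64 x 64 feature map.
branch_sizes = {
    "high": size * 2,                # 2-fold linear interpolation
    "mid": size,                     # unchanged
    "low": conv_output_size(size),   # one down-sampling convolution
}

# Ten heat maps are produced at the high resolution, then enlarged to the input size.
heatmap_shape = (10, branch_sizes["high"], branch_sizes["high"])
output_shape = (10, stem_sizes[0], stem_sizes[0])
```

Under these assumptions the sizes agree with the figures given in the description: 256 to 128 to 64 in the stem, and branches at 128, 64 and 32.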

And S500, obtaining the confidence coefficient of each human body key point, and obtaining the final human body key point according to the confidence coefficient.

In this embodiment, the confidence of each human body key point is obtained from the corresponding human body key point heat map, and the final human body key points are obtained according to the confidence.

Specifically, the coordinates of the highest response point of each human body key point heat map, namely the peak position, are searched for. The peak position found is determined as the detection position of the human body key point corresponding to that heat map, and the value at the peak is the confidence of the human body key point.

Presetting a confidence threshold, judging the magnitude relation between the confidence of each human body key point and the confidence threshold, and outputting the coordinate and the confidence of the human body key point if the confidence of the human body key point is greater than the confidence threshold; and if the confidence coefficient of the human body key point is not greater than the confidence coefficient threshold value, discarding the human body key point.

And finally outputting the human body key points, namely the obtained final human body key points.
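The peak search and confidence filtering of step S500 can be sketched as follows. The function name `decode_keypoints`, the dictionary output format and the threshold value of 0.5 are assumptions for illustration; the text only requires that key points whose peak value does not exceed a preset threshold be discarded.

```python
def decode_keypoints(heatmaps, confidence_threshold):
    """Return {keypoint_index: (x, y, confidence)} for confident key points.

    heatmaps: list of 2-D lists, one heat map per key point.
    """
    keypoints = {}
    for index, heatmap in enumerate(heatmaps):
        best_value, best_xy = None, None
        for y, row in enumerate(heatmap):
            for x, value in enumerate(row):
                if best_value is None or value > best_value:
                    best_value, best_xy = value, (x, y)  # track the peak
        # Keep the key point only when its peak exceeds the threshold.
        if best_value > confidence_threshold:
            keypoints[index] = (best_xy[0], best_xy[1], best_value)
    return keypoints


heatmaps = [
    [[0.1, 0.2], [0.9, 0.3]],     # clear peak at x=0, y=1
    [[0.05, 0.1], [0.2, 0.15]],   # weak response everywhere
]
final_keypoints = decode_keypoints(heatmaps, confidence_threshold=0.5)
```

Here the second key point is dropped because its strongest response (0.2) does not exceed the threshold, which is how occluded or out-of-frame key points would be suppressed.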

It should be understood that, although the steps in the flowchart are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 4, a human body key point detection device applying the above method is provided. The device comprises a detection module 100, a first acquisition module 200, a second acquisition module 300, a comparison module 400, an interception module 500, a key point acquisition module 600 and a judgment module 700, wherein:

The detection module 100 is configured to detect whether a human body exists in the current scene.

The first acquisition module 200 is configured to acquire, if no human body is detected, a first depth image of the current scene and overwrite the previously stored first depth image in which no human body was detected.

The second acquisition module 300 is configured to acquire, if a human body is detected, a second depth image of the current scene and store a human body detection frame corresponding to the second depth image.

The comparison module 400 is configured to perform frame difference comparison on the acquired first depth image and second depth image to obtain a depth change area map.

The interception module 500 is configured to obtain a human body mask and obtain a human body region image through the human body mask.

The key point acquisition module 600 is configured to obtain a single human body region image by using the human body detection frame, input the single human body region image into the key point detection model, and output a plurality of human body key points.

The judgment module 700 is configured to obtain the confidence of each human body key point and obtain the final human body key points according to the confidence.

For the specific limitation of the human body key point detection device, reference may be made to the above limitation on the human body key point detection method, which is not described herein again. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a data management server, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer equipment is used for communicating with an external data source terminal through network connection so as to receive data uploaded by the data source terminal. The computer program is executed by a processor to implement a human keypoint detection method.

Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the above human keypoint detection method when executing the computer program.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The invention discloses a human body key point detection method, a human body key point detection device, a computer device and a storage medium. The multi-resolution key point detection method based on the human body mask introduces depth information on top of existing key point algorithms that use only the RGB image. This expands the amount of information the algorithm model can perceive, effectively reduces the interference of complex noise in the RGB color image with the algorithm, and improves the accuracy of the key point algorithm.

To address the problem that irrelevant background information strongly interferes with human body key point algorithms, the invention provides a high-precision method for extracting a human body mask. The human body mask is used to filter background interference out of the RGBD image, which solves the problem of key points being wrongly positioned on the background in complex scenes and improves the accuracy of the key point detection algorithm.
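The background filtering step can be illustrated with a short Python sketch. It assumes, for illustration only, that the RGBD image is a 2-D list of (R, G, B, depth) tuples and that the human body mask is a binary 2-D list of the same size; the function name `apply_human_mask` and the choice of an all-zero background value are assumptions, not details from the application.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # assumed (R, G, B, depth) layout


def apply_human_mask(
    rgbd: List[List[Pixel]], mask: List[List[int]]
) -> List[List[Pixel]]:
    """Keep pixels where the mask is non-zero; set all other pixels to zero."""
    background = (0, 0, 0, 0)
    return [
        [px if m else background for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(rgbd, mask)
    ]


# Hypothetical 2x2 RGBD image and human body mask.
image = [
    [(10, 20, 30, 1500), (40, 50, 60, 900)],
    [(70, 80, 90, 880), (15, 25, 35, 1490)],
]
mask = [
    [0, 1],
    [1, 0],
]
filtered = apply_human_mask(image, mask)
```

After masking, only the two pixels marked as human body keep their color and depth values, so a downstream key point model never sees the background pixels.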

The structure of the human body key point detection model provided by the invention differs from the serial network structure commonly used in existing key point detection. The invention uses a parallel network structure comprising a plurality of parallel network branches with different resolutions, so that the network model can take into account both the global information of the human body and the local fine-grained information of the limbs. This greatly reduces cases in which key points are wrongly positioned on other limbs around the target human body, makes the detection of local limb key points more accurate, and improves the performance of the human body key point detection algorithm.

In order to improve the efficiency of the multi-resolution parallel network, the invention does not perform multi-scale fusion frequently, as most existing network models do. Instead, information fusion across the resolution branches is performed only at the middle and at the tail of the network, so that each branch can fully analyze and process its own information before being fused with the others. This reduces mutual interference among branches of different resolutions, improves the efficiency and running speed of the network model, meets the real-time requirements of many scenes with limited computing power, and also improves the accuracy of the algorithm.
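A single cross-resolution fusion step of this kind can be sketched in plain Python. The sketch assumes single-channel feature maps stored as 2-D lists and uses nearest-neighbour resizing as a stand-in for whatever resampling the actual network uses; the names `resize_nearest` and `cross_fuse` and the toy branch sizes are hypothetical. A real model would operate on multi-channel tensors with learned layers, so this shows only the "add own map to the other branches' maps" idea.

```python
from typing import List

FeatureMap = List[List[float]]  # single-channel map, fmap[y][x]


def resize_nearest(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Nearest-neighbour resize (assumed stand-in for the network's resampling)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def cross_fuse(branches: List[FeatureMap]) -> List[FeatureMap]:
    """For each branch, sum its own map with every other branch's map
    resized to that branch's resolution."""
    fused = []
    for target in branches:
        h, w = len(target), len(target[0])
        resized = [resize_nearest(b, h, w) for b in branches]
        fused.append(
            [[sum(r[y][x] for r in resized) for x in range(w)] for y in range(h)]
        )
    return fused


# Toy branches at three resolutions (values chosen only for illustration).
high = [[1.0] * 4 for _ in range(4)]  # high-resolution branch, 4x4
mid = [[2.0] * 2 for _ in range(2)]   # medium-resolution branch, 2x2
low = [[4.0]]                         # low-resolution branch, 1x1
fused_high, fused_mid, fused_low = cross_fuse([high, mid, low])
```

Each fused map keeps the resolution of its own branch, which is what allows the branches to continue in parallel after fusion. Calling a function like `cross_fuse` only twice (once mid-network, once at the tail) rather than after every stage reflects the efficiency argument made above.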

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of the technical features should be considered to fall within the scope of this specification as long as it contains no contradiction.

The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for detecting key points of a human body, characterized by comprising the following steps:

obtaining a single human body region image by using the human body detection frame, inputting the single human body region image into the key point detection model, and outputting a plurality of human body key points;

2. The method of claim 1, wherein frame difference comparing the first depth image and the second depth image comprises:

3. The method of claim 1, wherein the obtaining the human mask comprises:

presetting a human body connected domain limiting threshold value, and judging the size relationship between each non-zero value connected domain in the depth change area image and the limiting threshold value;

performing region growing by taking each centroid as a datum point to obtain a first human body region mask;

and fusing the first human body area mask and the second human body area mask to obtain the human body mask.

4. The method according to claim 3, wherein the obtaining the human body region image through the human body mask comprises:

intercepting a human body region in the second depth image by using the human body mask, and filtering a background part in the image to obtain a human body region image;

and preprocessing the human body region image.

5. The method of claim 4, wherein said using said human detection frame to obtain a single human body area image comprises:

6. The method of claim 5, wherein the inputting to a keypoint detection model, outputting a plurality of human keypoints, comprises:

inputting the single human body region image into a key point detection model;

7. The method according to claim 6, wherein the obtaining the confidence of each of the human key points and obtaining the final human key point according to the confidence comprises:

searching for the peak position of each human key point heatmap, determining the peak position as the detection position of the human key point corresponding to that heatmap, and determining the peak value as the confidence of the human key point;

if the confidence of the human body key point is greater than the confidence threshold, outputting the coordinate and the confidence of the human body key point;

if the confidence of the human body key point is not greater than the confidence threshold, discarding the human body key point;

and obtaining the final key points of the human body.

8. The method of claim 6, wherein the inputting to a keypoint detection model, outputting a plurality of human keypoints, further comprises:

down-sampling the input single human body region image to obtain a first feature map;

performing a first multi-resolution cross fusion on the first resolution branch, the second resolution branch and the third resolution branch, wherein each branch adds its own feature map to the feature maps of the remaining branches, and each fused branch is then subjected to residual block processing again;

and outputting a plurality of human key point heatmaps according to the feature map obtained after the second multi-resolution cross fusion, wherein each human key point heatmap corresponds to one human key point.

9. A human keypoint detection device, characterized in that it comprises:

the second acquisition module is used for acquiring a second depth image in the current scene if the human body is detected, and storing a human body detection frame corresponding to the second depth image;

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.