CN112232194A

CN112232194A - Single-target human body key point detection method, system, equipment and medium

Info

Publication number: CN112232194A
Application number: CN202011102253.1A
Authority: CN
Inventors: 姚志强; 周曦; 秦勤
Original assignee: Guangzhou Yuncongkaifeng Technology Co Ltd
Current assignee: Guangzhou Yuncongkaifeng Technology Co Ltd
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2021-01-15

Abstract

The invention provides a single-target human body key point detection method, system, device and medium. One or more initial single-target images are acquired; scene enhancement is performed on the acquired initial single-target images, and a single-target human body key point detection algorithm is trained on the scene-enhanced single-target images to generate a single-target human body detection model; human body key points and the corresponding visibility information in one or more initial single-target images are then obtained according to the generated single-target human body detection model. The invention not only detects the human body key points of a single-target image, but is also optimized for complex scenes such as blur, occlusion, truncation and abnormal postures, improving the precision of human body key point detection in such scenes. The invention can also adopt models and operators with fewer parameters and less computation, and uses model quantization to accelerate model inference, thereby reducing inference time.

Description

Single-target human body key point detection method, system, equipment and medium

Technical Field

The invention relates to the technical field of image recognition, in particular to a single-target human body key point detection method, a single-target human body key point detection system, single-target human body key point detection equipment and a single-target human body key point detection medium.

Background

In application scenarios such as pedestrian video structuring and pedestrian re-identification, after a human body detection result is obtained, human body posture information needs to be further obtained in order to compute related information such as human body behavior and human body integrity. Estimating the human body posture requires determining the positions of several joint parts of the human body, such as the head, trunk, hands and feet. The invention therefore describes the human body posture in the form of human body key points, where one human body key point represents one human body joint. Since the position of a single human body region can be obtained by the human body detection module, single-target human body key point detection is what is required. However, existing mainstream human body key point algorithms use large models and time-consuming post-processing, and are not suitable for actual video surveillance scenes that require a quick response; moreover, human body key points are difficult to detect under the blur, occlusion, truncation, abnormal postures and similar conditions found in actual scenes.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide a method, system, device and medium for detecting single target human body key points, which are used to solve the technical problems in the prior art.

In order to achieve the above objects and other related objects, the present invention provides a method for detecting key points of a single target human body, comprising the steps of:

acquiring one or more initial single-target images; wherein the single target image includes only a single target object;

performing scene enhancement on the one or more initial single-target images, and training a single-target human body key point detection algorithm based on the scene-enhanced single-target images to generate a single-target human body detection model; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture;

and acquiring human key points and corresponding visibility information in one or more single-target images to be detected according to the single-target human detection model.

Optionally, the specific manner of performing scene enhancement on the one or more initial single-target images includes at least one of: blur enhancement using Gaussian blur, blur enhancement using motion blur, occlusion enhancement using random erasing, occlusion enhancement using half-body occlusion, truncation enhancement using random cropping, and truncation enhancement using half-body truncation.

Optionally, if random erasing is used to perform occlusion enhancement on one or more initial single-target images, the enhancement flow includes:

randomly selecting the upper-left corner coordinate, width and height of a target area in the one or more initial single-target images, and judging whether the number of selection attempts is smaller than a first threshold value;

if the number of attempts is smaller than the first threshold value, calculating whether the number of human body key points contained in the target area is smaller than a second threshold value; if it is not smaller than the second threshold value, reselecting the target area; if it is smaller than the second threshold value, covering the target area with random pixel values and adjusting the visibility of the human body key points in the erased area to invisible; outputting the single-target image for which occlusion enhancement is completed;

and if the number of attempts is not smaller than the first threshold value, directly outputting the single-target image.

Optionally, if half-body occlusion is used to perform occlusion enhancement on one or more initial single-target images, the enhancement flow includes:

judging whether to occlude the one or more initial single-target images with an actual object;

if no actual object is used for occlusion, generating a random-pixel-value area matching the size of the lower half of the body of the target object, and covering the lower half of the body of the target object with the generated area;

if an actual object is used for occlusion, randomly selecting an occluder image, scaling the occluder image to match the size of the lower half of the body of the target object, and covering the lower half of the body of the target object with the selected occluder image;

after the lower half of the body of the target object is covered, adjusting the visibility of the human body key points in the occluded lower-body area to invisible; and outputting the single-target image for which occlusion enhancement is completed.

Optionally, if half-body truncation is used to perform truncation enhancement on one or more initial single-target images, the enhancement flow includes:

judging whether to truncate the one or more initial single-target images vertically (upper/lower) or horizontally (left/right);

if the truncation is vertical, further judging whether it is an upper-body truncation or a lower-body truncation; for an upper-body truncation, extracting the key points of the upper-body region; for a lower-body truncation, extracting the key points of the lower-body region;

if the truncation is horizontal, further judging whether it is a left-body truncation or a right-body truncation; for a left-body truncation, extracting the key points of the left-body region; for a right-body truncation, extracting the key points of the right-body region;

calculating the minimum bounding box of all extracted key points of the half-body region, and expanding the minimum bounding box;

cropping the image to the expanded bounding box to generate a truncated image; and outputting the truncated image, completing the truncation enhancement of the single-target image.

Optionally, the process of generating the single-target human detection model includes:

extracting human key points and human key point visibilities from the single-target image subjected to scene enhancement through a convolutional neural network;

performing regression on the extracted human body key points, and classifying the visibility of the human body key points;

calculating a training loss based on the regressed human body key points and the classified key point visibility, and training the single-target human body key point detection algorithm according to the calculated training loss to generate the single-target human body detection model.

Optionally, in the process of generating the single-target human body detection model, the method includes performing regression on the extracted human body key points through a regression head network, and predicting horizontal and vertical coordinates of the positions of the human body key points;

classifying the visibility of the human body key points through a classification head network, and predicting the visibility category of the key points;

wherein the regression head network and/or the classification head network comprises at least one of: fully connected network, 1x1 convolutional network.

Optionally, in the process of generating a single-target human body detection model, the method further includes:

sorting the training losses of the target samples according to a preset sorting rule, and assigning target weights to the sorted training losses according to preset weight coefficients; the target samples consist of one or more scene-enhanced single-target images;

computing a weighted average of the weighted training losses, and optimizing for the human body postures in the target samples according to the weighted-average training loss;

and training the single-target human body detection algorithm with the target samples after this optimization, to generate the single-target human body detection model.

Optionally, after the single-target human body model is generated, the method further includes packaging the single-target human body model to form a software development kit.

The invention also provides a single-target human body key point detection system, which comprises:

the image acquisition module is used for acquiring one or more initial single-target images; wherein the single target image includes only a single target object;

the image enhancement module is used for performing scene enhancement on the one or more initial single-target images; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture;

the model training module is used for learning and training a single-target human body key point detection algorithm based on the single-target image after scene enhancement to generate a single-target human body detection model;

and the key point detection module is used for acquiring human key points and corresponding visibility information in one or more single-target images to be detected according to the single-target human detection model.

Optionally, the specific manner in which the image enhancement module performs scene enhancement on the single-target image includes at least one of: blur enhancement using Gaussian blur, blur enhancement using motion blur, occlusion enhancement using random erasing, occlusion enhancement using half-body occlusion, truncation enhancement using random cropping, and truncation enhancement using half-body truncation.

Optionally, after the single-target human body model is generated, the single-target human body model is further packaged, and quantization and inference are performed with the packaged single-target human body model;

the output of the quantized model after inference is converted into output coordinates in the image coordinate system; and one or more human body key points and the corresponding visibility information in the initial single-target image are obtained according to the output coordinates.

The invention also provides single-target human body key point detection equipment, which is configured for:

acquiring one or more initial single-target images;

The present invention also provides an apparatus comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as in any one of the above.

The invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method as described in any one of the above.

As described above, the single-target human body key point detection method, system, device and medium provided by the present invention have the following beneficial effects: one or more initial single-target images are acquired; scene enhancement is performed on the acquired initial single-target images, and a single-target human body key point detection algorithm is trained on the scene-enhanced single-target images to generate a single-target human body detection model; and human body key points and the corresponding visibility information in one or more initial single-target images are obtained according to the generated single-target human body detection model. The single-target image includes only a single target object; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture. The invention not only detects the human body key points of a single-target image, but is also optimized for complex scenes such as blur, occlusion, truncation and abnormal postures, improving the precision of human body key point detection in such scenes.

Drawings

Fig. 1 is a schematic flow chart of a single-target human body key point detection method according to an embodiment;

Fig. 2 is a schematic flow chart of image enhancement using random erasing according to an embodiment;

Fig. 3 is a schematic flow chart of image enhancement using half-body occlusion according to an embodiment;

Fig. 4 is a schematic flow chart of image enhancement using half-body truncation according to an embodiment;

Fig. 5 is a schematic hardware structure diagram of a single-target human body key point detection system according to an embodiment;

Fig. 6 is a schematic hardware structure diagram of a terminal device according to an embodiment;

Fig. 7 is a schematic hardware structure diagram of a terminal device according to another embodiment.

Description of the element reference numerals

M10 image acquisition module

M20 image enhancement module

M30 model training module

M40 key point detection module

1100 input device

1101 first processor

1102 output device

1103 first memory

1104 communication bus

1200 processing assembly

1201 second processor

1202 second memory

1203 communication assembly

1204 power supply assembly

1205 multimedia assembly

1206 audio assembly

1207 input/output interface

1208 sensor assembly

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Key point visibility: indicates whether a key point is visible; three categories are included: visible and predictable, invisible but predictable, and invisible and unpredictable.

Referring to fig. 1, the present invention provides a method for detecting key points of a single target human body, comprising the following steps:

S100, acquiring one or more initial single-target images; wherein the single-target image includes only a single target object;

S200, performing scene enhancement on the one or more initial single-target images, and training a single-target human body key point detection algorithm based on the scene-enhanced single-target images to generate a single-target human body detection model; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture;

S300, obtaining human body key points and the corresponding visibility information in one or more single-target images to be detected according to the single-target human body detection model.

The method not only detects the human body key points of a single-target image, but is also optimized for complex scenes such as blur, occlusion, truncation and abnormal postures, and improves the precision of human body key point detection in such scenes.

In accordance with the above, in an exemplary embodiment, the scene enhancement of the one or more initial single-target images includes at least one of: blur enhancement using Gaussian blur, blur enhancement using motion blur, occlusion enhancement using random erasing, occlusion enhancement using half-body occlusion, truncation enhancement using random cropping, and truncation enhancement using half-body truncation.
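
The two blur enhancements listed above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the image is assumed to be a grayscale list of pixel rows, edges are handled by clamping, the motion blur is assumed to be horizontal, and the function names (gaussian_kernel, convolve_rows, gaussian_blur, motion_blur) and default kernel sizes are hypothetical.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized 1-D Gaussian kernel (hypothetical helper)."""
    half = size // 2
    values = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(values)
    return [v / total for v in values]


def convolve_rows(image, kernel):
    """Convolve every row with a 1-D kernel, clamping indices at the borders."""
    half = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - half, 0), width - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, size=5, sigma=1.0):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(size, sigma)
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))


def motion_blur(image, size=5):
    """Horizontal motion blur (assumed direction): uniform averaging along each row."""
    return convolve_rows(image, [1.0 / size] * size)
```

Neither enhancement moves any pixel, so the key point coordinates and visibility labels of the sample can be left unchanged.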

By way of example, if random erasing is used to perform occlusion enhancement on one or more initial single-target images, the enhancement flow is as shown in Fig. 2 and includes: acquiring one or more initial single-target images; randomly selecting the upper-left corner coordinate, width and height of a target area in the one or more initial single-target images, and judging whether the number of selection attempts is smaller than a first threshold value; if the number of attempts is smaller than the first threshold value, calculating whether the number of human body key points contained in the target area is smaller than a second threshold value; if it is not smaller than the second threshold value, reselecting the target area; if it is smaller than the second threshold value, covering the target area with random pixel values and adjusting the visibility of the human body key points in the erased area to invisible; outputting the single-target image for which occlusion enhancement is completed; and if the number of attempts is not smaller than the first threshold value, directly outputting the single-target image. The first threshold value may be set to 30, for example, and the second threshold value may be 3, for example. In the embodiment of the present application, random erasing refers to selecting a block of pixels in a region of the image and replacing it with random pixel values. The embodiment of the present application applies random erasing to human body key point detection with corresponding improvements and adaptations: the erased area is selected with respect to the key points, so as to simulate the effect of human body joints being occluded. In the specific processing, the number of key points in the erased area is kept from exceeding 3, because if key points are occluded over a large area, their position coordinates are difficult to predict.
The human body key point types in the present application may be the existing 14 human body key points.
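
The random erasing flow of Fig. 2 can be sketched as below. This is a minimal sketch under stated assumptions rather than the embodiment's code: the image is a grayscale list of pixel rows, key points are integer (x, y) pixel coordinates, visibility is reduced to two hypothetical labels (VISIBLE and INVISIBLE), and the function name random_erase is hypothetical. The defaults 30 and 3 are the example threshold values given above.

```python
import random

VISIBLE = 1    # hypothetical label
INVISIBLE = 0  # hypothetical label


def random_erase(image, keypoints, visibility, max_attempts=30, max_keypoints=3, rng=None):
    """Erase one random rectangle that hides fewer than max_keypoints key points.

    Returns (new_image, new_visibility); the inputs are not modified.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    for _ in range(max_attempts):  # first threshold: number of attempts
        # Randomly pick the width, height and upper-left corner of the target area.
        w = rng.randint(1, width)
        h = rng.randint(1, height)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        inside = [i for i, (x, y) in enumerate(keypoints)
                  if x0 <= x < x0 + w and y0 <= y < y0 + h]
        if len(inside) >= max_keypoints:  # second threshold: reselect the area
            continue
        erased = [row[:] for row in image]
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                erased[y][x] = rng.randint(0, 255)  # cover with random pixel values
        new_visibility = list(visibility)
        for i in inside:
            new_visibility[i] = INVISIBLE  # erased key points become invisible
        return erased, new_visibility
    # Attempts exhausted: output the image unchanged.
    return image, list(visibility)
```

For example, calling random_erase(image, keypoints, visibility, rng=random.Random(0)) makes the augmentation reproducible for a given seed.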

As another example, if half-body occlusion is used to perform occlusion enhancement on one or more initial single-target images, the enhancement flow is as shown in Fig. 3 and includes: acquiring one or more initial single-target images; judging whether to occlude the one or more initial single-target images with an actual object; if no actual object is used for occlusion, generating a random-pixel-value area matching the size of the lower half of the body of the target object, and covering the lower half of the body of the target object with the generated area; if an actual object is used for occlusion, randomly selecting an occluder image, scaling the occluder image to match the size of the lower half of the body of the target object, and covering the lower half of the body of the target object with the selected occluder image; after the lower half of the body of the target object is covered, adjusting the visibility of the human body key points in the occluded lower-body area to invisible; and outputting the single-target image for which occlusion enhancement is completed. In actual scenes, the lower half of the body is easily occluded by objects such as tables, chairs and fences, and for such scenes the embodiment of the present application provides an online enhancement strategy for half-body occlusion. Half-body occlusion mainly addresses occlusion of the lower half of the body: random pixels or a chosen occluder are used to cover the lower-body pixel area of the original image. Because the occluded lower-body area is large, the key points in it are all regarded as invisible and unpredictable.
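
The half-body occlusion flow of Fig. 3 can be sketched as follows. The sketch makes several assumptions that the text does not specify: the image is a grayscale list of pixel rows covering the human body ROI, the lower half of the ROI is taken as an approximation of the lower half of the body, the occluder is scaled by nearest-neighbour sampling, and the names occlude_lower_body, VISIBLE and INVISIBLE_UNPREDICTABLE are hypothetical.

```python
import random

VISIBLE = 0                  # hypothetical label
INVISIBLE_UNPREDICTABLE = 2  # hypothetical label


def occlude_lower_body(image, keypoints, visibility, occluders=None, rng=None):
    """Cover the lower half of the ROI with random pixels or a scaled occluder image.

    occluders: optional list of occluder images (lists of pixel rows); when it is
    empty or None, random pixel values are used instead of an actual object.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    top = height // 2  # assumption: lower half of the ROI ~ lower half of the body
    region_h = height - top
    occluder = rng.choice(occluders) if occluders else None
    out = [row[:] for row in image]
    for dy in range(region_h):
        for x in range(width):
            if occluder is None:
                out[top + dy][x] = rng.randint(0, 255)
            else:
                # Nearest-neighbour scaling of the occluder to the covered region.
                oy = dy * len(occluder) // region_h
                ox = x * len(occluder[0]) // width
                out[top + dy][x] = occluder[oy][ox]
    # Key points inside the covered region are marked invisible and unpredictable.
    new_visibility = [INVISIBLE_UNPREDICTABLE if y >= top else v
                      for (_, y), v in zip(keypoints, visibility)]
    return out, new_visibility
```

A practical implementation would more likely derive the lower-body region from the hip key points than from the ROI midline; the midline is used here only to keep the sketch short.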

As another example, if half-body truncation is used to perform truncation enhancement on one or more initial single-target images, the enhancement flow is as shown in Fig. 4 and includes: acquiring one or more initial single-target images; judging whether to truncate the one or more initial single-target images vertically (upper/lower) or horizontally (left/right); if the truncation is vertical, further judging whether it is an upper-body truncation or a lower-body truncation; for an upper-body truncation, extracting the key points of the upper-body region; for a lower-body truncation, extracting the key points of the lower-body region; if the truncation is horizontal, further judging whether it is a left-body truncation or a right-body truncation; for a left-body truncation, extracting the key points of the left-body region; for a right-body truncation, extracting the key points of the right-body region; calculating the minimum bounding box of all extracted key points of the half-body region, and expanding the minimum bounding box; cropping the image to the expanded bounding box to generate a truncated image; and outputting the truncated image, completing the truncation enhancement of the single-target image. In the embodiment of the present application, half-body truncation simulates the situation in an actual scene where the human body is largely occluded and only half of the body appears in the human body ROI (region of interest). Half-body truncation includes four kinds: upper-body truncation, lower-body truncation, left-body truncation and right-body truncation.
Truncation differs from occlusion in that, after truncation, the human body ROI region contains only part of the human body, whereas with occlusion the ROI contains the entire body region together with the occluding object. Truncation also differs from cropping: truncation targets the upper, lower, left or right half of the human body and ensures that the key points of that half lie within the truncated region, whereas a randomly cropped region is chosen more arbitrarily and is limited to a relatively small area.
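
The half-body truncation flow of Fig. 4 can be sketched as below. Several details are assumptions, since the text does not fix them: which key point indices belong to each half is supplied by the caller through a hypothetical groups mapping, the expansion is a fixed fraction of the bounding-box size, key points are integer pixel coordinates, and the function name half_body_truncate is hypothetical.

```python
import random


def half_body_truncate(image, keypoints, groups, expand=0.2, half=None, rng=None):
    """Crop the image to the expanded bounding box of one half-body key point group.

    groups: dict mapping "upper", "lower", "left", "right" to key point indices.
    Returns (cropped_image, shifted_keypoints, chosen_half).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if half is None:
        # First choose vertical or horizontal truncation, then which half to keep.
        if rng.random() < 0.5:
            half = rng.choice(["upper", "lower"])
        else:
            half = rng.choice(["left", "right"])
    points = [keypoints[i] for i in groups[half]]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Minimum bounding box of the selected key points.
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    # Expand the box by a fraction of its size, clamped to the image borders.
    pad_x = int(expand * (x1 - x0 + 1))
    pad_y = int(expand * (y1 - y0 + 1))
    x0, x1 = max(0, x0 - pad_x), min(width - 1, x1 + pad_x)
    y0, y1 = max(0, y0 - pad_y), min(height - 1, y1 + pad_y)
    cropped = [row[x0:x1 + 1] for row in image[y0:y1 + 1]]
    # Express all key points in the coordinate system of the cropped image.
    shifted = [(x - x0, y - y0) for x, y in keypoints]
    return cropped, shifted, half
```

Key points of the discarded half may fall outside the cropped image after shifting; how their visibility labels are updated is not stated for this flow, so the sketch leaves them to the caller.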

In an exemplary embodiment, the single-target human body key point detection algorithm is trained on the scene-enhanced single-target images, and the process of generating the single-target human body detection model includes: extracting human body key points and human body key point visibility from the scene-enhanced single-target images through a convolutional neural network; performing regression on the extracted human body key points, and classifying the visibility of the human body key points; calculating a training loss based on the regressed human body key points and the classified key point visibility, and training the single-target human body key point detection algorithm according to the calculated training loss to generate the single-target human body detection model. In the embodiment of the present application, when features for key point detection and visibility classification are extracted through the convolutional neural network, a shallow model, such as ResNet18 or HRNet-w18-small, can be used to improve speed. The feature extraction network includes, but is not limited to, ResNet, VGG, HRNet, MobileNet, and the like. In the embodiment of the present application, the human body key points and their visibility can be predicted by two separate head branches. The networks used for the prediction branches include, but are not limited to: a fully connected network and a 1x1 convolutional network. The regression head network predicts the horizontal and vertical coordinates of the human body key point positions, and the classification head network predicts the visibility category of each human body key point. Existing mainstream human body key point algorithms include Top-Down and Bottom-Up algorithms, and most of them adopt a heatmap-based approach.
Existing human body key point regression generally uses heatmaps, but the heatmap approach is time-consuming and requires many post-processing steps, which significantly increases the overall time consumption. In the present method, the heatmap scheme is replaced by a more direct fully connected network layer and/or 1x1 convolutional network, so that essentially no post-processing is needed, post-processing time is reduced, and the model inference speed is improved.
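
To illustrate why direct regression needs essentially no post-processing, the following sketch decodes the outputs of two fully connected heads. It is a toy forward pass under stated assumptions, not the trained network of the embodiment: the backbone feature is assumed to be a flat vector, the weights are plain nested lists supplied by the caller, and the names linear and predict_keypoints as well as the layout of the outputs (x, y pairs, then three visibility scores per key point) are hypothetical.

```python
NUM_KEYPOINTS = 14          # the 14 human body key points mentioned in the text
NUM_VISIBILITY_CLASSES = 3  # the three visibility categories


def linear(features, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def predict_keypoints(features, reg_weights, reg_bias, cls_weights, cls_bias):
    """Run the regression head and the classification head on one feature vector.

    The regression head outputs 2 * NUM_KEYPOINTS values (x, y per key point);
    the classification head outputs NUM_VISIBILITY_CLASSES scores per key point.
    """
    coords = linear(features, reg_weights, reg_bias)
    scores = linear(features, cls_weights, cls_bias)
    keypoints = [(coords[2 * k], coords[2 * k + 1]) for k in range(NUM_KEYPOINTS)]
    visibility = []
    for k in range(NUM_KEYPOINTS):
        start = k * NUM_VISIBILITY_CLASSES
        per_class = scores[start:start + NUM_VISIBILITY_CLASSES]
        # The predicted category is simply the index of the highest score.
        visibility.append(max(range(NUM_VISIBILITY_CLASSES), key=per_class.__getitem__))
    return keypoints, visibility
```

Decoding is a reshape plus an argmax, whereas a heatmap scheme would additionally have to locate the peak of one heatmap per key point.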

According to the above description, the process of generating the single-target human body detection model further includes: sorting the training losses of the target samples according to a preset sorting rule, and assigning target weights to the sorted training losses according to preset weight coefficients; the target samples consist of one or more scene-enhanced single-target images; computing a weighted average of the weighted training losses, and optimizing for the human body postures in the target samples according to the weighted-average training loss; and training the single-target human body detection algorithm with the target samples after this optimization, to generate the single-target human body detection model. Specifically, when the training loss is calculated, the learning weight of hard posture samples can be increased by an OHEM (Online Hard Example Mining) method, and/or the learning weight of hard key points can be increased by an OKM (Online Hard Keypoint Mining) method. For a batch of training samples of preset size N (the value of N can be customized according to actual conditions), the samples in the batch whose training loss satisfies a preset condition are regarded as hard samples, that is, the target samples. As an example, a target sample may be a human body image with a complex human body posture or a complex scene. If the learning weight of hard posture samples is increased by the OHEM method, then for a training batch of size N, the samples with large training loss in the batch are regarded as hard samples, that is, human body images with complex human body postures or complex scenes.
By way of example, the embodiment of the present application first sorts the training losses of the samples in a batch, with larger losses ranked higher. The top third of the samples by training loss are assigned a weight coefficient of 3.0, the middle third are assigned a weight coefficient of 2.0, and the remaining bottom third are assigned a weight coefficient of 1.0; the final training loss is the weighted average over the samples in the batch. If the learning weight of hard key points is increased by the OKM method, each training sample contains one human body, and each human body corresponds to 14 human body key points. Human body postures can be complex, especially for the agile arms and legs, whose extremities, the wrists and ankles, move over a wider range and are relatively difficult to predict. For example, in the embodiment of the present invention, the training losses of extremity key points such as the wrists and ankles are given a weight coefficient of 3.0, the training losses of mid-limb key points such as the elbows and knees are given a weight coefficient of 2.0, and the remaining human body joints, whose relative range of motion is small, are given a weight coefficient of 1.0; the final training loss of a single sample is the weighted average of the training losses of the 14 key points. In the embodiment of the present application, the difficulty of hard samples, hard posture samples and hard key points is determined by the corresponding training losses: the larger the training loss, the greater the difficulty.
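
The two weighting schemes can be sketched as follows, using the example coefficients 3.0, 2.0 and 1.0 from the text. Two points are assumptions: the weighted average is taken as the weighted sum divided by the sum of the weights, and the 14 key point names in KEYPOINT_NAMES are a hypothetical layout, since the text does not list them. The function names are also hypothetical.

```python
# Hypothetical 14-point layout; the text does not name the key points.
KEYPOINT_NAMES = [
    "head", "neck",
    "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist",
    "l_hip", "r_hip", "l_knee", "r_knee", "l_ankle", "r_ankle",
]


def ohem_batch_loss(sample_losses):
    """Weight the samples of one batch by loss rank: top third 3.0, middle 2.0, rest 1.0."""
    n = len(sample_losses)
    order = sorted(range(n), key=lambda i: sample_losses[i], reverse=True)
    weights = [1.0] * n
    for rank, index in enumerate(order):
        if rank * 3 < n:
            weights[index] = 3.0
        elif rank * 3 < 2 * n:
            weights[index] = 2.0
    weighted_sum = sum(w * loss for w, loss in zip(weights, sample_losses))
    return weighted_sum / sum(weights)


def keypoint_weight(name):
    """Extremities (wrists, ankles) 3.0, mid-limb joints (elbows, knees) 2.0, others 1.0."""
    if name.endswith(("wrist", "ankle")):
        return 3.0
    if name.endswith(("elbow", "knee")):
        return 2.0
    return 1.0


def okm_sample_loss(keypoint_losses):
    """Weighted average of the per-key-point losses of one sample."""
    weights = [keypoint_weight(name) for name in KEYPOINT_NAMES]
    weighted_sum = sum(w * loss for w, loss in zip(weights, keypoint_losses))
    return weighted_sum / sum(weights)
```

For a batch with losses [6, 5, 4, 3, 2, 1] the sample weights are [3, 3, 2, 2, 1, 1], so the batch loss is 50 / 12. The two schemes can be combined by feeding the output of okm_sample_loss for each sample into ohem_batch_loss.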

In an exemplary embodiment, after the single-target human body detection model is generated, the method further comprises: packaging the model to form a software development kit (SDK); performing inference and quantization on the packaged model, and converting the output of the model after inference and quantization into output coordinates in the image coordinate system; and acquiring one or more human body key points in the initial single-target image and the corresponding visibility information according to the output coordinates. When inference is performed, the model output can be obtained by GPU inference with the TensorRT inference engine on the input human body ROI region. In the embodiment of the application, the single-target human body detection model is packaged as an SDK (Software Development Kit) and accelerated by model quantization, so that the actual inference time of the model is reduced within an acceptable precision range and a balance between precision and speed is achieved. Models and operators with a small number of parameters and a small amount of computation can be adopted, and model quantization is used to accelerate the inference of the single-target human body detection model and reduce inference time. In the embodiment of the application, model quantization means converting float32 model weight data to the fp16 or int8 data type, which reduces the number of bits required to represent each weight and thereby compresses and accelerates the network.
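
The arithmetic behind reducing the number of bits per weight can be illustrated with a small Python sketch. It assumes symmetric per-tensor int8 quantization with a single scale factor, which is one common scheme and is not stated in the description; it does not reproduce the calibration performed by TensorRT, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: symmetric quantization, the largest magnitude maps to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight, which is the precision and speed trade-off the description refers to.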

According to the above, before the single-target human body detection model is packaged, one or more initial single-target images are preprocessed. The preprocessing at least comprises: acquiring the coordinates of the human body ROI image region in the one or more initial single-target images; and cropping the human body ROI image region, resizing it to the input size of the single-target human body detection model, and using it as the model input. In the embodiment of the present application, the human body ROI region refers to the bounding box of the area where the human body is located, generally represented by a rectangular box. The human body ROI region can be determined by the coordinates of the upper-left corner of the rectangular box together with its width and height.
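
The preprocessing and the conversion back to image coordinates can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the image is a list of pixel rows, resizing uses nearest-neighbour sampling, and the model is assumed to output key point coordinates in the pixel space of its resized input. None of these details, nor the function names, are specified in the description.

```python
from typing import List, Sequence, Tuple

Roi = Tuple[int, int, int, int]  # (x, y, width, height), upper-left corner first


def crop_and_resize(image: Sequence[Sequence[int]], roi: Roi,
                    out_w: int, out_h: int) -> List[List[int]]:
    """Crop the ROI and resize it to (out_w, out_h) by nearest-neighbour sampling."""
    x0, y0, w, h = roi
    resized = []
    for r in range(out_h):
        src_row = image[y0 + r * h // out_h]
        resized.append([src_row[x0 + c * w // out_w] for c in range(out_w)])
    return resized


def to_image_coords(points: Sequence[Tuple[float, float]], roi: Roi,
                    in_w: int, in_h: int) -> List[Tuple[float, float]]:
    """Map key points from model-input pixels back to the original image."""
    x0, y0, w, h = roi
    return [(x0 + px * w / in_w, y0 + py * h / in_h) for px, py in points]
```

For example, with an ROI of `(1, 1, 2, 2)` and a 4x4 model input, a key point predicted at the centre of the input, `(2.0, 2.0)`, maps back to `(2.0, 2.0)` in the original image, the centre of the ROI.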

In summary, the invention provides a single-target human body key point detection method. Aiming at the problems of current algorithms, a Top-Down human body key point detection algorithm based on a deep model is designed to detect human body key points and their visibility. The method can be optimized for complex scenes such as blur, occlusion, truncation and abnormal postures, and can improve the precision of human body key point detection in such scenes. The method can also package the trained single-target human body detection model as an SDK and accelerate the packaged model by quantization, which reduces the actual inference time of the model within an acceptable precision range and achieves a balance between precision and speed. The method can adopt models and operators with a small number of parameters and a small amount of computation, and uses model quantization to accelerate model inference, reducing the inference time of existing algorithm models. In addition, among its various data enhancement methods, the method designs special ones such as half-length occlusion and half-length truncation. By combining hard sample mining and hard key point mining strategies, key point detection in complex scenes such as blur, occlusion, truncation and abnormal postures is greatly improved; meanwhile, model inference is accelerated by TensorRT GPU inference and int8 quantization, and the inference time can reach the sub-millisecond level on the GPU.

As shown in fig. 5, the present invention further provides a single-target human body key point detection system, which comprises:

an image acquisition module M10 for acquiring one or more initial single-target images; wherein the single target image includes only a single target object;

an image enhancement module M20, configured to perform scene enhancement on one or more initial single-target images; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture;

the model training module M30 is used for learning and training a single-target human key point detection algorithm based on the single-target image after scene enhancement to generate a single-target human detection model;

and the key point detection module M40 is configured to obtain, according to the single-target human body detection model, human body key points and corresponding visibility information in one or more single-target images to be detected.

The system can not only detect the human body key points of a single-target image, but can also be optimized for single-target images in complex scenes such as blur, occlusion, truncation and abnormal postures, improving the precision of human body key point detection in such scenes.

In summary, the invention provides a single-target human body key point detection system. Aiming at the problems of current algorithms, a Top-Down human body key point detection algorithm based on a deep model is designed to detect human body key points and their visibility. The system can be optimized for complex scenes such as blur, occlusion, truncation and abnormal postures, and can improve the precision of human body key point detection in such scenes. The system can also package the trained single-target human body detection model as an SDK and accelerate the packaged model by quantization, which reduces the actual inference time of the model within an acceptable precision range and achieves a balance between precision and speed. The system can adopt models and operators with a small number of parameters and a small amount of computation, and uses model quantization to accelerate model inference, reducing the inference time of existing algorithm models. In addition, among its various data enhancement methods, the system designs special ones such as half-length occlusion and half-length truncation. By combining hard sample mining and hard key point mining strategies, key point detection in complex scenes such as blur, occlusion, truncation and abnormal postures is greatly improved; meanwhile, model inference is accelerated by TensorRT GPU inference and int8 quantization, and the inference time can reach the sub-millisecond level on the GPU.

According to the above description, the present invention also provides a single-target human body key point detection system formed by packaging the generated single-target human body detection model. As an example, the system may be an SDK. The specific functions and technical effects of the system can be obtained by referring to the above embodiments, which are not described herein again.

The embodiment of the application also provides single-target human key point detection equipment, which comprises:

performing scene enhancement on one or more initial single-target images, and performing learning training on a single-target human body key point detection algorithm based on the single-target images subjected to scene enhancement to generate a single-target human body detection model; the scene includes at least one of: blur, erasure, occlusion, truncation, and posture;

In this embodiment, the single-target human body key point detection device executes the above system or method; for specific functions and technical effects, reference may be made to the above embodiments, which are not described herein again.

An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the methods of fig. 1-4. In practical applications, the apparatus may be used as a terminal device or as a server. Examples of the terminal device may include: a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.

The present application further provides a non-volatile readable storage medium in which one or more modules (programs) are stored; when the one or more modules are applied to a device, the device may be caused to execute the instructions of the steps included in the methods in fig. 1 to 4 of the present application.

Fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.

Optionally, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.

Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.

In this embodiment, the processor of the terminal device includes functions for executing each module of the single-target human body key point detection system described above; for specific functions and technical effects, reference may be made to the above embodiments, which are not described herein again.

Fig. 7 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. FIG. 7 is a specific embodiment of the implementation of FIG. 6. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.

The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.

The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.

The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.

The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.

The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.

The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.

The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.

The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.

As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 7 may be implemented as the input device in the embodiment of fig. 6.

The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the claims of the present invention.

Claims

1. A single-target human body key point detection method is characterized by comprising the following steps:

2. The single-target human keypoint detection method of claim 1, wherein the specific way of performing scene enhancement on the one or more initial single-target images comprises at least one of: performing blur enhancement on one or more initial single-target images by Gaussian blur, performing blur enhancement on one or more initial single-target images by motion blur, performing occlusion enhancement on one or more initial single-target images by random erasure, performing occlusion enhancement on one or more initial single-target images by half-length occlusion, performing truncation enhancement on one or more initial single-target images by random cropping, and performing truncation enhancement on one or more initial single-target images by half-length truncation.

3. The single-target human keypoint detection method of claim 2, wherein if random erasure is used to perform occlusion enhancement on one or more initial single-target images, the enhancement procedure comprises:

4. The method for detecting single-target human key points according to claim 2, wherein if the occlusion enhancement is performed on one or more initial single-target images by using a half-length occlusion, the enhancement procedure comprises:

5. The method for detecting single-target human key points according to claim 2, wherein if the truncation enhancement is performed on one or more initial single-target images by using half-length truncation, the enhancement flow comprises:

6. The single-target human key point detection method of claim 1, wherein the process of generating the single-target human detection model comprises:

7. The method for detecting the single-target human key points according to claim 6, wherein in the process of generating the single-target human detection model, the method comprises the steps of performing regression on the extracted human key points through a regression head network, and predicting the horizontal and vertical coordinates of the positions of the human key points;

8. The method for detecting the single-target human key points according to claim 6, wherein in the process of generating the single-target human detection model, the method further comprises:

9. The method of any one of claims 1 to 8, further comprising packaging the single-target human body model to form a software development kit after generating the single-target human body model.

10. A single-target human body key point detection system is characterized by comprising:

11. The single-target human keypoint detection system of claim 10, wherein the specific way of scene enhancement of the single-target image by the image enhancement module comprises at least one of: performing blur enhancement on one or more initial single-target images by Gaussian blur, performing blur enhancement on one or more initial single-target images by motion blur, performing occlusion enhancement on one or more initial single-target images by random erasure, performing occlusion enhancement on one or more initial single-target images by half-length occlusion, performing truncation enhancement on one or more initial single-target images by random cropping, and performing truncation enhancement on one or more initial single-target images by half-length truncation.

12. The single target human keypoint detection system of claim 10, further comprising, after generating the single target human model, encapsulating the single target human model; reasoning and quantifying the packaged single-target human body model;

13. The single-target human keypoint detection system of claim 11, wherein if random erasure is used for occlusion enhancement of one or more initial single-target images, the enhancement procedure comprises:

acquiring one or more initial single-target images;

14. The single-target human keypoint detection system of claim 11, wherein if a half-length occlusion is used for occlusion enhancement of one or more initial single-target images, the enhancement procedure comprises:

acquiring one or more initial single-target images;

15. The single-target human keypoint detection system of claim 11, wherein if truncation enhancement is applied to one or more initial single-target images using half-length truncation, the enhancement procedure comprises:

acquiring one or more initial single-target images;

16. A single-target human body key point detection device, characterized by comprising:

17. An apparatus, comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-9.

18. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-9.