WO2021179822A1

WO2021179822A1 - Human body feature point detection method and apparatus, electronic device, and storage medium

Info

Publication number: WO2021179822A1
Application number: PCT/CN2021/073863
Authority: WO
Inventors: 吴佳涛
Original assignee: Oppo广东移动通信有限公司; 上海瑾盛通信科技有限公司
Priority date: 2020-03-12
Filing date: 2021-01-27
Publication date: 2021-09-16
Also published as: CN111414823A; CN111414823B

Abstract

The present application relates to the technical field of electronic devices, and disclosed are a human body feature point detection method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining an image to be detected; performing down-sampling processing on said image to obtain first image features of said image; performing multi-scale feature extraction on the first image features to obtain a plurality of second image features of said image; and performing convolution operation on the plurality of second image features to obtain human body feature point position information and human body feature point connection information of said image. According to the present application, multi-scale feature extraction is performed on said image to obtain image features under different scales, and the human body feature point position information and the human body feature point connection information are obtained on the basis of the image features under different scales, thereby greatly improving the accuracy and efficiency of human body feature point detection.

Description

Detection method, device, electronic equipment and storage medium of human body feature points

Cross-references to related applications

This application claims the priority of the Chinese application with the application number CN202010171918.8 filed on March 12, 2020, which is hereby incorporated in its entirety by reference for all purposes.

Technical field

This application relates to the technical field of electronic equipment, and more specifically, to a method, device, electronic equipment, and storage medium for detecting human body feature points.

Background art

With the continuous development of artificial intelligence technology, it has gradually been applied to the detection of human body feature points. At present, when artificial intelligence technology is used to detect human body feature points in an image, a target detection algorithm must first detect each human body in the image, and human body feature point detection is then performed on each detected human body. As a result, the detection time grows linearly with the number of human bodies in the image.

Summary of the invention

In view of the above-mentioned problems, this application proposes a detection method, device, electronic equipment and storage medium for human body feature points to solve the above-mentioned problems.

In a first aspect, an embodiment of the present application provides a method for detecting human body feature points. The method includes: acquiring an image to be detected; performing down-sampling processing on the image to be detected to obtain a first image feature of the image to be detected; performing multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected; and performing a convolution operation on the multiple second image features to obtain human body feature point position information and human body feature point connection information in the image to be detected.

In a second aspect, an embodiment of the present application provides a device for detecting human body feature points. The device includes: a to-be-detected image acquisition module, configured to acquire an image to be detected; a first image feature acquisition module, configured to perform down-sampling processing on the image to be detected to obtain a first image feature of the image to be detected; a second image feature acquisition module, configured to perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected; and a human body feature point detection module, configured to perform a convolution operation on the multiple second image features to obtain human body feature point position information and human body feature point connection information in the image to be detected.

In a third aspect, an embodiment of the present application provides an electronic device including a memory and a processor. The memory is coupled to the processor and stores instructions; when the instructions are executed by the processor, the processor performs the above method.

In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, and the computer readable storage medium stores program code, and the program code can be invoked by a processor to execute the above method.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can be obtained from these drawings without creative work.

FIG. 1 shows a schematic flowchart of a method for detecting human body feature points according to an embodiment of the present application;

FIG. 2 shows a schematic flowchart of a method for detecting human body feature points according to another embodiment of the present application;

FIG. 3 shows a schematic flowchart of step S260 of the method for detecting human body feature points shown in FIG. 2 of the present application;

FIG. 4 shows a schematic flowchart of a method for detecting human body feature points according to another embodiment of the present application;

FIG. 5 shows the overall framework diagram of the detection model provided by the embodiment of the present application;

FIG. 6 shows a block diagram of a module of a device for detecting human body feature points provided by an embodiment of the present application;

FIG. 7 shows a block diagram of an electronic device used in an embodiment of the present application to execute the method for detecting human body feature points according to the embodiment of the present application;

FIG. 8 shows a storage unit for storing or carrying program code that implements the method for detecting human body feature points according to the embodiment of the present application.

Detailed description

In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application.

A convolutional neural network is a neural network that includes convolution calculations and has a deep structure. It is one of the representative algorithms of deep learning. Convolutional neural networks generally consist of the following types of stacked layers: input layer, convolutional layer, pooling layer, normalization layer (also called Batch Norm layer), activation function layer, fully connected layer, output layer, and so on. In the field of computer vision, the input is generally an RGB three-channel color image. The convolutional layer extracts features from the input data; its calculation is a convolution operation with weight coefficients and a bias. The pooling layer selects and filters feature information; commonly used pooling methods include maximum pooling and average pooling. The normalization layer normalizes the input data so that the distribution of each feature is similar, which makes the network easier to train. The activation function layer adds nonlinearity to the model so that it has a stronger fitting ability. The fully connected layer is generally located in the last part of the convolutional neural network and nonlinearly combines the input features to obtain the output. The output layer outputs the type of result required by the model. For image classification problems, the output layer uses functions such as softmax (the normalized exponential function, often used as an output layer in deep learning to obtain a specified type of output) to output classification labels. For image semantic segmentation problems, the output layer directly outputs the classification result of each pixel. For human body feature point detection problems, the output layer outputs heat maps of human body feature points (different algorithm models may also output other heat maps that assist feature point detection and assignment).

Human body feature point detection, also known as pose estimation, mainly detects feature points of the human body such as the eyes, nose, elbows, and shoulders, connects them in a defined order, and describes human body information through these feature points. By extension, it can also describe the posture, gait, behavior, and other information of the human body. Human body feature point detection is one of the basic algorithms of computer vision and plays a foundational role in research in related fields of computer vision, such as behavior recognition and intelligent composition. Existing human body feature point detection algorithms based on deep learning fall into two categories: top-down detection methods and bottom-up detection methods.

The top-down human body feature point detection algorithm divides the task into two parts: human body detection and single-person feature point detection. Each person in the image is first detected individually by a target detection algorithm, and feature point detection is then performed for each person on the basis of the detection frame. The top-down method tends to have higher detection accuracy, but its detection time grows linearly with the number of people in the image, and it needs an additional target detection algorithm as support.

The bottom-up method also consists of two parts: multi-person feature point detection in the image and post-processing. All feature points in the image are detected first, and a post-processing module then applies an assignment strategy to assign the detected feature points to different individuals. Representative algorithms include OpenPose and PersonLab. The detection accuracy of the bottom-up method is lower than that of the top-down method, but detection is faster and the detection time is independent of the number of people in the image. The post-processing module is often composed of logic strategies such as greedy algorithms.

In addition to detecting the distribution heat map of feature points (the heatmap), the OpenPose algorithm proposes a heat map that represents the connection information of feature points: the pafmap. A position with high confidence in the pafmap indicates a high probability that a feature point connection exists at that position. The heatmap and pafmap are the outputs of the algorithm model, and a greedy algorithm is used as the post-processing strategy to assign the feature points of multiple people to independent person instances. The method has evolved through two versions. In the first released version, the model structure is divided into a basic network and a heat map detection network. The heat map detection network contains multiple stages, and each stage is divided into an upper and a lower branch. The two branches have exactly the same network structure but learn different image information: one learns the feature point distribution heat map (heatmap), and the other learns the feature point connection distribution heat map (pafmap). Each subsequent stage takes as input the feature information of the basic network together with the heatmap and pafmap detected by the previous stage. In the second released version, the heat map detection network is still divided into multiple stages, but the dual-branch structure is changed to a single branch: the first N stages learn only the pafmap, and the last M stages learn only the heatmap. At the same time, each 7*7 convolution in the model is replaced with three 3*3 convolutions with residual connections, which reduces the amount of computation and enriches the receptive fields that the model can learn.
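The trade-off behind replacing one 7*7 convolution with three 3*3 convolutions can be checked with a little arithmetic. The following is a minimal sketch, not the OpenPose implementation: it assumes stride-1 convolutions, equal input and output channel counts, and an illustrative channel width of 128, none of which is stated in this application.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def conv_weights(kernel_size, channels):
    """Weights in one k x k convolution with `channels` inputs and outputs (bias ignored)."""
    return kernel_size * kernel_size * channels * channels


channels = 128  # assumed width, for illustration only

single_7x7 = conv_weights(7, channels)
three_3x3 = 3 * conv_weights(3, channels)

# Receptive field after the first, second, and third 3*3 convolution.
intermediate_fields = [stacked_receptive_field([3] * n) for n in (1, 2, 3)]

print(stacked_receptive_field([7]))  # 7
print(intermediate_fields)           # [3, 5, 7]
print(three_3x3 / single_7x7)        # 27/49, roughly 0.55
```

Under these assumptions the stack reaches the same 7*7 receptive field with about 55% of the weights, and its intermediate outputs expose the 3*3 and 5*5 receptive fields as well.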

However, the inventor found in research that although top-down human body feature point detection algorithms do not require complicated post-processing, they cannot detect all human body feature points in the image at once. They must first use a target detection algorithm to detect the human bodies in the image and then perform feature point detection on each detected person, so the detection time grows linearly with the number of human bodies in the image. All top-down algorithms therefore suffer from slow detection and cannot achieve real-time detection; when they are deployed on a mobile terminal, the computational cost of the model is too high, the parameter count is large, and deployment is difficult. The detection speed of the OpenPose algorithm is independent of the number of people in the image, and it needs no additional target detection algorithm for preprocessing, but its model also has high complexity and a large amount of computation. Stacking multiple stages does not significantly improve the accuracy of the model, yet it brings many redundant calculations. In the second version, with its single-branch structure, the 3*3 residual connections enrich the receptive field information but bring only a very small gain in accuracy while wasting a great deal of computation. With these design choices, deploying the model on a mobile terminal leads to excessive computation, a large parameter count, and deployment difficulty.

In response to the above problems, the inventor, through long-term research, proposes the method, device, electronic equipment, and storage medium for detecting human body feature points provided by the embodiments of this application. Multi-scale feature extraction is performed on the image to be detected to obtain image features at different scales, and the human body feature point position information and connection information are obtained based on those features, thereby greatly improving the accuracy and efficiency of human body feature point detection. The specific detection method is described in detail in the following embodiments.

Please refer to FIG. 1, which shows a schematic flowchart of a method for detecting human body feature points provided by an embodiment of the present application. The method performs multi-scale feature extraction on the image to be detected to obtain image features at different scales and, based on those features, obtains the human body feature point position information and connection information, thereby greatly improving the accuracy and efficiency of human body feature point detection. In a specific embodiment, the method is applied to the human body feature point detection device 200 shown in FIG. 6 and to the electronic device 100 equipped with the device 200 (FIG. 7). The following takes an electronic device as an example to describe the specific process of this embodiment. It is understandable that the electronic device in this embodiment may be a mobile terminal, a smart phone, a tablet computer, a wearable electronic device, and so on, which is not limited here. The process shown in FIG. 1 is described in detail below. The method for detecting human body feature points may specifically include the following steps:

Step S110: Obtain an image to be detected.

In this embodiment, an image to be detected may be acquired, where the acquired image includes at least one human body. In some embodiments, the image to be detected may be a preview image collected by a camera of an electronic device, a photo taken by a camera of an electronic device and stored in an album, an image downloaded from the Internet and stored in an album, and so on, which is not limited here. In addition, in some embodiments, the acquired image to be detected may be a static image or a dynamic image, which is not limited here.

Step S120: Perform down-sampling processing on the image to be detected to obtain a first image feature of the image to be detected.

In this embodiment, after the image to be detected is acquired, it may be down-sampled to obtain the first image feature of the image to be detected. The image may be subjected to 2-fold down-sampling repeatedly until the obtained first image feature meets the processing requirements. In some embodiments, 2-fold down-sampling is applied four times in sequence, which amounts to 16-fold down-sampling of the image to be detected, so that the first image feature contains sufficiently abstract features without excessive feature extraction. Specifically, after the image to be detected is acquired, it is first down-sampled by a factor of 2; the resulting image feature is down-sampled again to reach 4-fold down-sampling overall; that result is down-sampled again to reach 8-fold; and that result is down-sampled once more to reach 16-fold, which yields the first image feature of the image to be detected.

Of course, in some embodiments, the image to be detected can also be down-sampled by a larger factor, for example 32-fold or 64-fold, which is not limited here.

In this embodiment, multiple first image features of the image to be detected are obtained.
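The effect of applying 2-fold down-sampling four times can be illustrated by tracking the feature-map size at each step. This is only a bookkeeping sketch: the 368 x 368 input size is an assumed example, and the function name `downsample_sizes` is hypothetical.

```python
def downsample_sizes(height, width, steps=4, factor=2):
    """Return the feature-map size after each successive `factor`-fold down-sampling."""
    sizes = []
    for _ in range(steps):
        height, width = height // factor, width // factor
        sizes.append((height, width))
    return sizes


# Assumed 368 x 368 input; four 2-fold steps give 2 ** 4 = 16-fold overall.
sizes = downsample_sizes(368, 368)
overall_factor = 2 ** len(sizes)

print(sizes)           # [(184, 184), (92, 92), (46, 46), (23, 23)]
print(overall_factor)  # 16
```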

Step S130: Perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected.

The down-sampling of the image to be detected applies 2-fold down-sampling in sequence: after the first 2-fold down-sampling, the resulting image feature is down-sampled again to reach 4-fold, and so on. In other words, the down-sampling is performed serially. The input of a given convolutional layer can only be the output of the previous convolutional layer, which means that the feature information the layer can learn is limited to the single receptive field represented by that output. The scale and receptive field of the first image feature obtained through down-sampling are therefore relatively uniform.

Therefore, in this embodiment, to enrich the scales and receptive fields of the obtained image features, multi-scale feature extraction can be performed on the first image feature to obtain multiple second image features of the image to be detected at different scales and with different receptive fields. In some implementations, the first image feature can be processed in parallel by multiple convolutional layers with different convolution kernels. Specifically, the first image feature is input to each of these convolutional layers, each layer processes it separately, and each produces a second image feature. Because the parallel convolutional layers use convolution kernels of different sizes, the same input (the first image feature) yields multiple second image features of different scales and different receptive fields at the same time. These are passed together to the next layer as input, so that more scales and receptive fields of the image to be detected are captured.
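The parallel-branch idea can be sketched on a single-channel feature map. The example below is an illustration under assumptions, not the model of this application: the kernel sizes (1, 3, 5, 7), the averaging kernels, and the helper names `conv2d_same` and `box_kernel` are all hypothetical. A single impulse is fed to every branch so that each output shows the footprint, and hence the receptive field, of its kernel.

```python
def conv2d_same(image, kernel):
    """Zero-padded, stride-1 2D convolution whose output has the input's size."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def box_kernel(k):
    """k x k averaging kernel standing in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


size = 9
first_feature = [[0.0] * size for _ in range(size)]
first_feature[4][4] = 1.0  # single impulse at the centre

kernel_sizes = (1, 3, 5, 7)  # assumed branch kernels
# Every branch receives the same input and is computed independently.
branches = [conv2d_same(first_feature, box_kernel(k)) for k in kernel_sizes]

# Number of output positions influenced by the impulse, per branch.
footprints = [sum(v != 0.0 for row in b for v in row) for b in branches]
print(footprints)  # [1, 9, 25, 49]
```

Because every branch keeps the 9 x 9 spatial size, the four outputs can be stacked along the channel axis and passed on together to the next layer.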

Step S140: Perform a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

In this embodiment, after the multiple second image features of the image to be detected are obtained, a convolution operation may be performed on them to obtain the human body feature point position information (heatmap) and the human body feature point connection information (pafmap) of the image to be detected. In some embodiments, the multiple second image features are fed into two branches for convolution: one branch performs a convolution operation on the multiple second image features and outputs the human body feature point position information, and the other branch performs a convolution operation on the multiple second image features and outputs the human body feature point connection information.

In some embodiments, after the human body feature point position information and connection information in the image to be detected are acquired, human body feature point information may be obtained from them. In this embodiment, the human body feature points at the known positions can be connected according to the connection information, thereby drawing and generating the human body feature point information.

In the method for detecting human body feature points provided by this embodiment of the present application, an image to be detected is acquired; down-sampling is performed on it to obtain its first image feature; multi-scale feature extraction is performed on the first image feature to obtain multiple second image features at different scales and with different receptive fields; and a convolution operation is performed on the multiple second image features to obtain the human body feature point position information and connection information in the image. Because the position and connection information are obtained from image features at different scales and with different receptive fields, the accuracy and efficiency of human body feature point detection are improved.
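The two-branch output can be sketched with a 1x1 convolution per branch. This is a hypothetical illustration: the application only states that each branch performs a convolution operation, so the 1x1 kernel, the random weights, the 18 feature points, the 19 connections, and the two pafmap channels per connection (an x and a y component, as in OpenPose) are all assumptions.

```python
import random


def conv1x1(features, weights):
    """1x1 convolution: each output channel is a per-pixel weighted sum of input channels."""
    h, w = len(features[0]), len(features[0][0])
    return [
        [[sum(wc * ch[y][x] for wc, ch in zip(w_out, features)) for x in range(w)]
         for y in range(h)]
        for w_out in weights
    ]


def random_weights(out_channels, in_channels, rng):
    """Placeholder for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_channels)]
            for _ in range(out_channels)]


rng = random.Random(0)
num_keypoints, num_limbs = 18, 19       # assumed counts
in_channels, height, width = 4, 6, 6    # tiny sizes for illustration

# Stand-in for the stacked second image features: in_channels maps of height x width.
second_features = [[[rng.random() for _ in range(width)] for _ in range(height)]
                   for _ in range(in_channels)]

# Branch 1: one heatmap channel per feature point.
heatmap = conv1x1(second_features, random_weights(num_keypoints, in_channels, rng))
# Branch 2: two pafmap channels (x and y components) per connection.
pafmap = conv1x1(second_features, random_weights(2 * num_limbs, in_channels, rng))

print(len(heatmap), len(pafmap))  # 18 38
```

Both branches read the same input and keep its spatial size; only the number of output channels differs.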
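This embodiment does not specify how the feature points are connected. As one possible illustration, the greedy strategy mentioned above for bottom-up methods can be sketched as follows. The score table is hypothetical: each entry stands for a connection confidence between two candidate feature points that would be derived from the connection information, and `greedy_match` is an assumed helper name.

```python
def greedy_match(scores, threshold=0.0):
    """Pair candidates greedily: best score first, each candidate used at most once."""
    candidates = sorted(
        ((s, i, j)
         for i, row in enumerate(scores)
         for j, s in enumerate(row)
         if s > threshold),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for s, i, j in candidates:
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs


# Rows: shoulder candidates; columns: elbow candidates (hypothetical confidences).
scores = [
    [0.9, 0.2],
    [0.3, 0.8],
]
pairs = greedy_match(scores)
print(pairs)  # [(0, 0), (1, 1)]
```

Each resulting pair links one shoulder candidate to one elbow candidate, so the two pairs here would belong to two different people.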

Please refer to FIG. 2, which shows a schematic flowchart of a method for detecting human body feature points according to another embodiment of the present application. The process shown in FIG. 2 will be described in detail below. The method for detecting human feature points may specifically include the following steps:

Step S210: Obtain an image to be detected.

For the specific description of step S210, please refer to step S110, which will not be repeated here.

Step S220: Perform N1-fold down-sampling processing on the image to be detected to obtain an image feature to be processed, where N1 = 2^M1 and N1 is a positive integer.

In this embodiment, after the image to be detected is acquired, it may be down-sampled by a factor of N1 to obtain the image feature to be processed. In some embodiments, the N1-fold down-sampling may be 16-fold down-sampling, that is, 2-fold down-sampling is applied to the image to be detected four times in sequence to achieve 16-fold down-sampling. In this case, N1 = 16 and M1 = 4.

Step S230: Perform N2-fold up-sampling processing on the image feature to be processed to obtain the first image feature of the image to be detected, where N2 = 2^M2, N2 < N1, and N2 is a positive integer.

N1-fold down-sampling of the image to be detected yields more abstract features, but the resulting feature map is often small. If multi-scale feature extraction is performed directly on the image feature to be processed obtained from the N1-fold down-sampling, convolution with a large kernel tends to over-extract image features and introduce unnecessary redundant information. For example, to obtain more abstract features, the image to be detected is generally down-sampled by a factor of 16, and the resulting feature map is correspondingly small. If multi-scale feature extraction is performed on that feature map, a 7*7 convolution easily over-extracts image features and introduces unnecessary redundant information.

Therefore, in this embodiment, after N1-fold down-sampling is performed on the image to be detected to obtain the image feature to be processed, N2-fold up-sampling can be performed on the image feature to be processed, and the newly obtained image feature is taken as the first image feature of the image to be detected. This avoids excessive extraction of image features and the introduction of unnecessary redundant information. In some embodiments, the N1-fold down-sampling may be 16-fold and the N2-fold up-sampling may be 2-fold, so that N1 = 16, M1 = 4, N2 = 2, and M2 = 1. In other words, after 2-fold up-sampling of the image feature to be processed, the first image feature is restored to the size of an image feature under 8-fold down-sampling, which retains the more abstract features while avoiding excessive feature extraction and unnecessary redundant information.
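The size restoration in step S230 can be illustrated with a simple up-sampling routine. The application does not state which up-sampling method is used, so nearest-neighbour repetition is assumed here, and the 368-pixel input size and the name `upsample_nearest` are likewise hypothetical.

```python
def upsample_nearest(feature, factor=2):
    """Repeat every value `factor` times along both axes (nearest-neighbour up-sampling)."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


n1, n2 = 16, 2    # N1-fold down-sampling, N2-fold up-sampling
input_size = 368  # assumed input height and width

coarse_size = input_size // n1  # 23: size of the image feature to be processed
coarse = [[float(y * coarse_size + x) for x in range(coarse_size)]
          for y in range(coarse_size)]

first_feature = upsample_nearest(coarse, n2)
effective_stride = n1 // n2

print(len(first_feature), len(first_feature[0]))  # 46 46
print(effective_stride)                           # 8
```

The 46 x 46 result matches what direct 8-fold down-sampling of a 368 x 368 image would give, while its values still come from the 16-fold feature.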

Step S240: Perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected.

For the specific description of step S240, please refer to step S130, which will not be repeated here.

Step S250: Perform down-sampling processing on the image to be detected, and obtain a third image feature of the image to be detected.

To further enrich the available feature scale information and receptive fields, in addition to performing convolution operations on the multiple second image features to obtain the human body feature point position information and connection information, an additional feature obtained by down-sampling the image to be detected can also take part in the convolution operation. This not only further increases the feature scale information and receptive fields, but also adds accurate shallow pixel position information, which improves the accuracy of the obtained position information and connection information. Specifically, the multiple second image features obtained by multi-scale feature extraction on the first image feature are abstract features of the image to be detected, whereas the third image feature obtained by down-sampling the image to be detected is a shallow feature of the image. The second image features and the third image feature therefore have different scales and different receptive fields, so including the third image feature in the convolution operation increases the scales and receptive fields of the data from which the position information and connection information are obtained. Furthermore, because the third image feature is a shallow image feature and the pixel position information of shallow image features is more accurate, it improves the accuracy of the obtained human body feature point position information and connection information.

Therefore, in this embodiment, the image to be detected can also be down-sampled to obtain the third image feature of the image to be detected, and the third image feature is involved in the convolution operation. In some embodiments, the feature extraction of the image to be detected may be performed through a convolutional layer, which is not limited herein.

Among them, when performing channel connection of two image features, it is necessary to ensure that the image sizes corresponding to the two image features are consistent. Therefore, in this embodiment, if the channel connection between the first image feature and the third image feature is to be performed, it is necessary to ensure that the image size corresponding to the first image feature and the image size corresponding to the third image feature are one foot. For example, if the first image feature is obtained by downsampling 16 times on the image to be detected, the third image feature also needs to be obtained by downsampling 16 times. If the first image feature is obtained by down sampling 8 times on the image to be detected If it is obtained, the third image feature also needs to be obtained through 8-fold down-sampling.

Therefore, in this embodiment, after the image to be detected is acquired, the image to be detected may be down-sampled by N3 times to obtain the third image feature of the image to be detected. Wherein, the N3 times downsampling process of the image to be detected can be the 2 ^M1-M2 times downsampling process for the image to be detected, so that the image size corresponding to the third image feature obtained by the N3 times downsampling process of the image to be detected can be compared with the first The image size corresponding to one image feature is consistent, so as to provide a connection basis when a plurality of second to-be-processed image features and third image features are subsequently channel-connected.

In some embodiments, the first image feature may be obtained by performing N1-fold down-sampling on the image to be detected and then performing N2-fold up-sampling, where N1 = 2^M1 and N2 = 2^M2, and the third image feature may be obtained by performing feature extraction after N3-fold down-sampling of the image to be detected, where N3 = 2^(M1-M2). This ensures that the image size corresponding to the first image feature is consistent with the image size corresponding to the third image feature. For example, when N1 = 16 and N2 = 2, the image size corresponding to the first image feature equals that of the image to be detected after 8-fold down-sampling. In this case M1 = 4 and M2 = 1, and since N3 = 2^(M1-M2), N3 = 8; that is, the image size corresponding to the third image feature is also that of the image to be detected after 8-fold down-sampling.
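The relationship between N1, N2 and N3 can be checked with a few lines of arithmetic. The following is a minimal sketch, not the implementation of this application; the function names `third_feature_factor` and `feature_size` and the 368x368 input size are hypothetical and chosen only for illustration.

```python
def third_feature_factor(n1: int, n2: int) -> int:
    """Return N3 = 2^(M1-M2), given N1 = 2^M1 and N2 = 2^M2 with N2 < N1."""
    for name, n in (("N1", n1), ("N2", n2)):
        # A positive power of two has exactly one bit set.
        if n < 1 or n & (n - 1):
            raise ValueError(f"{name} must be a positive power of two, got {n}")
    if n2 >= n1:
        raise ValueError("N2 must be smaller than N1")
    m1 = n1.bit_length() - 1  # exponent M1
    m2 = n2.bit_length() - 1  # exponent M2
    return 2 ** (m1 - m2)


def feature_size(height: int, width: int, n1: int, n2: int):
    """Spatial size after N1-fold down-sampling followed by N2-fold up-sampling."""
    return (height // n1 * n2, width // n1 * n2)


# Example from the text: N1 = 16, N2 = 2 gives N3 = 8.
n3 = third_feature_factor(16, 2)
first_size = feature_size(368, 368, 16, 2)   # hypothetical 368x368 input
third_size = (368 // n3, 368 // n3)
```

With these values the first and third image features have the same spatial size, which is the condition required for the later channel connection.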

Step S260: Perform a convolution operation on the plurality of second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

In some embodiments, after the multiple second image features and the third image feature of the image to be detected are obtained, a convolution operation may be performed on the multiple second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information of the image to be detected. In some embodiments, the multiple second image features and the third image feature may be fed into two branches for the convolution operation: one branch performs a convolution operation on the multiple second image features and the third image feature to output the human body feature point position information, and the other branch performs a convolution operation on them to output the human body feature point connection information.

Please refer to FIG. 3, which shows a schematic flowchart of step S260 of the method for detecting human body feature points shown in FIG. 2 of the present application. The following will elaborate on the process shown in FIG. 3, and the method may specifically include the following steps:

Step S261: Channel connecting the plurality of second image features and the third image feature to obtain a fourth image feature.

In this embodiment, the multiple second image features and the third image feature can be channel-connected to obtain the fourth image feature, and the fourth image feature then participates in the convolution operation to obtain the human body feature point position information and the human body feature point connection information of the image to be detected. In some embodiments, after the multiple second image features and the third image feature are obtained, they can be channel-connected through the concat operator. For example, if the multiple second image features include two second image features that are 19-dimensional and 38-dimensional respectively, and the third image feature is 38-dimensional, then after the channel concat the output fourth image feature is 19+38+38=95-dimensional.
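The channel connection in the example above can be illustrated with NumPy arrays laid out as (channels, height, width). This is a sketch under assumptions: the helper name `channel_concat`, the channel-first layout and the 46x46 spatial size are hypothetical, and `np.concatenate` stands in for whatever concat operator the actual framework provides.

```python
import numpy as np


def channel_concat(features):
    """Concatenate (C, H, W) feature maps along the channel axis."""
    sizes = {f.shape[1:] for f in features}
    if len(sizes) != 1:
        # Channel connection requires identical spatial sizes.
        raise ValueError(f"spatial sizes differ: {sorted(sizes)}")
    return np.concatenate(features, axis=0)


# Two second image features (19 and 38 channels) and one third image
# feature (38 channels), all with the same hypothetical 46x46 size.
second_a = np.zeros((19, 46, 46), dtype=np.float32)
second_b = np.zeros((38, 46, 46), dtype=np.float32)
third = np.zeros((38, 46, 46), dtype=np.float32)

fourth = channel_concat([second_a, second_b, third])
```

The resulting `fourth` array has 19+38+38=95 channels, and the size check shows why the image sizes of the connected features must be consistent.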

Step S262: Perform a convolution operation on the fourth image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

In this embodiment, after the fourth image feature of the image to be detected is obtained, a convolution operation may be performed on the fourth image feature to obtain the human body feature point position information and the human body feature point connection information of the image to be detected. In some embodiments, the fourth image feature can be fed into two branches for the convolution operation: one branch performs a convolution operation on the fourth image feature to output the human body feature point position information, and the other branch performs a convolution operation on the fourth image feature to output the human body feature point connection information.

According to another embodiment of the present application, a method for detecting human body feature points is provided. An image to be detected is obtained; N1-fold down-sampling is performed on the image to be detected to obtain the image feature to be processed; N2-fold up-sampling is performed on the image feature to be processed to obtain the first image feature of the image to be detected; multi-scale feature extraction is performed on the first image feature to obtain multiple second image features of the image to be detected under different scales and different receptive fields; feature extraction is performed on the image to be detected to obtain the third image feature of the image to be detected; and a convolution operation is performed on the multiple second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected. Compared with the method for detecting human body feature points shown in FIG. 1, this embodiment additionally performs N1-fold down-sampling on the image to be detected followed by N2-fold up-sampling to obtain the first image feature, which yields more abstract features while avoiding excessive extraction of image features and the introduction of unnecessary redundant information. In addition, this embodiment performs the convolution operation on both the multiple second image features and the third image feature extracted from the image to be detected, so as to increase the receptive field.

Please refer to FIG. 4, which shows a schematic flowchart of a method for detecting human body feature points according to another embodiment of the present application. The following will elaborate on the process shown in FIG. 4, and the method for detecting human feature points may specifically include the following steps:

Step S310: Obtain an image to be detected.

For the specific description of step S310, please refer to step S110, which will not be repeated here.

Step S320: Perform down-sampling processing on the image to be detected to obtain a first image feature of the image to be detected.

In this embodiment, a trained detection model may be used to process the acquired image to be detected, so as to output the human body feature point position information and the human body feature point connection information of the image to be detected. FIG. 5 shows the overall framework diagram of the detection model provided by the embodiment of the present application. The detection model may include three main parts: a basic network module F, a multi-scale module M, and a heat map detection module S.

After the image to be detected is acquired, it can be input to the basic network module of the detection model. The basic network module down-samples the image to be detected to obtain the first image feature of the image to be detected, and the first image feature is used as the input of the multi-scale module in the detection model. In some embodiments, the basic network module may be a convolutional neural network such as Vgg, ResNet, or Mobilenet. If a deeper network model such as Vgg or ResNet is used, the computational complexity of the model increases, but higher detection accuracy can be obtained; if a lightweight network model such as Mobilenet is used, some detection accuracy is lost, but a faster detection speed can be obtained, and fully real-time detection can be achieved.

Step S330: Input the first image feature into the multi-scale module of the detection model, and perform multi-scale feature extraction on the first image feature through the multi-scale module to obtain multiple second image features of the image to be detected.

In this embodiment, after the first image feature output by the basic network module is obtained, the first image feature can be input to the multi-scale module of the detection model, which performs multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected. In some embodiments, the multi-scale module includes multiple convolutional layers in parallel, the convolution kernel of each of these convolutional layers is different, and each convolutional layer is used to extract, from the first image feature, a second image feature with a different scale and a different receptive field. In one implementation, the multi-scale module can include 4 parallel convolutional layers, in order: a 1*1 convolution, a 3*3 convolution, a 5*5 convolution, and a 7*7 convolution. The convolution kernel sizes increase from one layer to the next, and each layer is responsible for extracting image information at a different scale and receptive field. The four parallel convolutional layers together form the multi-scale module.
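The parallel structure described above can be sketched with a naive NumPy convolution. This is an illustrative sketch only: the function names `conv2d_same` and `multi_scale`, the random weights, the channel counts (8 in, 4 out per branch) and the use of zero padding to keep all outputs the same size are assumptions, not details confirmed by this application.

```python
import numpy as np


def conv2d_same(x, weight):
    """Naive stride-1 convolution with zero padding that preserves H and W.

    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = weight.shape
    pad = (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w), dtype=x.dtype)
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out


def multi_scale(x, weights):
    """Apply each parallel convolution to the same first image feature."""
    return [conv2d_same(x, w) for w in weights]


rng = np.random.default_rng(0)
first_feature = rng.standard_normal((8, 12, 12))       # hypothetical shape
weights = [rng.standard_normal((4, 8, k, k)) for k in (1, 3, 5, 7)]
second_features = multi_scale(first_feature, weights)
```

Each branch sees the same input but aggregates over a 1*1, 3*3, 5*5 or 7*7 neighbourhood, so the four outputs share one spatial size and can later be channel-connected.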

Step S340: Input the plurality of second image features into the heat map detection module of the detection model, and perform a convolution operation on the plurality of second image features through the heat map detection module to obtain the output of the heat map detection module The human body feature point location information and the human body feature point connection information.

In this embodiment, after the multiple second image features output by the multi-scale module are obtained, they can be input to the heat map detection module of the detection model, which performs a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information. In some embodiments, the third image feature output by the basic network module can also be obtained; the multiple second image features and the third image feature are then channel-connected to obtain the fourth image feature, which is input into the heat map detection module of the detection model, and the heat map detection module performs a convolution operation on the fourth image feature to obtain the human body feature point position information and the human body feature point connection information.

In some embodiments, the heat map detection module includes only one convolution stage. This convolution stage includes a first processing branch and a second processing branch: the first processing branch is used to detect and output the human body feature point position information, and the second processing branch is used to detect and output the human body feature point connection information. In addition, the first processing branch includes two convolutional layers, and the second processing branch includes two convolutional layers.

In the Openpose model, the heat map detection module consists of multiple stages connected in series to improve accuracy, but experiments have shown that neither heatmap detection nor pafmap detection requires many stages for correction. Concatenating stages brings only a very limited increase in accuracy while adding a large number of parameters and calculations. In this embodiment, a multi-scale module is added, so that the image features input to the heat map detection module already contain rich feature information and scale information. This makes it possible for the heat map detection module to reduce the number of stages: a single stage is enough to achieve high accuracy, and it also greatly reduces the amount of calculation and the number of parameters of the model, so that the model can perform detection in real time on a mobile terminal. In addition, to further reduce the number of parameters and calculations, only two convolutional layers are used in each branch of the single stage: a 3*3 convolution performs further feature extraction on the input channel-connected image feature, and a 1*1 convolution detects the human body feature point position information or the human body feature point connection information and outputs a feature map with the corresponding number of channels.
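To give a feel for the size of such a single-stage head, the parameter count of the two branches can be tallied directly. This is a rough sketch under assumptions: the channel counts (95 input channels, 128 intermediate channels, 19 position channels, 38 connection channels) are hypothetical values chosen for illustration, and the helper names are not taken from this application.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of one k*k convolutional layer."""
    return c_in * c_out * k * k + c_out


def branch_params(c_in: int, c_mid: int, c_out: int) -> int:
    """One branch: a 3*3 feature-extraction conv followed by a 1*1 detection conv."""
    return conv_params(c_in, c_mid, 3) + conv_params(c_mid, c_out, 1)


def head_params(c_in: int, c_mid: int, n_heat: int, n_paf: int) -> int:
    """Single stage with a position branch and a connection branch."""
    return branch_params(c_in, c_mid, n_heat) + branch_params(c_in, c_mid, n_paf)


# All four channel counts below are hypothetical.
C_IN, C_MID, N_HEAT, N_PAF = 95, 128, 19, 38
single_stage = head_params(C_IN, C_MID, N_HEAT, N_PAF)
```

Because every additional serial stage would add its own convolutional layers, keeping a single stage with two layers per branch bounds the head to the count computed here.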

Regarding the trained detection model in the foregoing embodiment, the embodiment of the present application may further include training and correction of the detection model. The detection model may be trained in advance on an acquired training data set; each subsequent detection can then use the trained detection model, so there is no need to train the detection model each time detection is performed.

In some embodiments, training the detection model includes: obtaining a training data set, where the training data set includes multiple images and the human body feature point position information and human body feature point connection information corresponding to each of the multiple images; and, based on the training data set, using each image as input data and the human body feature point position information and human body feature point connection information corresponding to each image as output data, training through a machine learning algorithm to obtain the trained detection model. The machine learning algorithm may include algorithms corresponding to the above-mentioned basic network module F, multi-scale module M, and heat map detection module S.

In the training process of the detection model, an objective function can be set to measure the difference between the detection result of the detection model and the real label. This function is called a loss function, also known as a cost function. The goal of detection model training is to minimize this function. Setting different loss functions for the detection model means setting different learning goals for its training.

In this embodiment, the loss function contains two parts: L_total = L_heatmap + L_pafmap, where L_heatmap represents the feature point position heat map loss and L_pafmap represents the feature point connection heat map loss.

Among them, the feature point location heat map loss is used to measure the loss between the detected feature point location heat map and the real feature point location heat map:

Where (i, j) represents the position of a pixel in the feature map, P_heat(i, j) represents the value at position (i, j) in the detected feature point position heat map, G_heat(i, j) represents the value at position (i, j) in the real feature point position heat map, and width and height respectively represent the width and height of the feature point position heat map.

Feature point connection heat map loss is used to measure the loss between the detected feature point connection heat map and the real feature point connection heat map:

Where (i, j) represents the position of a pixel in the feature map, P_paf(i, j) represents the value at position (i, j) in the detected feature point connection heat map, G_paf(i, j) represents the value at position (i, j) in the real feature point connection heat map, and width and height respectively represent the width and height of the feature point connection heat map.
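The loss formulas themselves are not reproduced in this text. As a minimal sketch, assuming each loss is a sum of squared differences between the detected map and the real map over all positions (i, j), which is an assumption consistent with the symbols defined above rather than a confirmed form, the two losses and their total could be computed as follows. The function names `map_loss` and `total_loss` are hypothetical.

```python
import numpy as np


def map_loss(pred, gt):
    """Assumed form: sum over all (i, j) of (P(i, j) - G(i, j))^2."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and label must have the same shape")
    return float(np.sum((pred - gt) ** 2))


def total_loss(p_heat, g_heat, p_paf, g_paf):
    """L_total = L_heatmap + L_pafmap."""
    return map_loss(p_heat, g_heat) + map_loss(p_paf, g_paf)


# Tiny example: a 2x2 position map that is off by 1 everywhere and a
# connection map that matches its label exactly.
loss = total_loss(np.ones((2, 2)), np.zeros((2, 2)),
                  np.zeros((1, 2)), np.zeros((1, 2)))
```

Whatever the exact per-pixel term is, the total is the plain sum of the position loss and the connection loss, as stated for L_total.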

In another embodiment of the present application, a method for detecting human body feature points is provided. An image to be detected is obtained; the image to be detected is down-sampled to obtain the first image feature of the image to be detected; the first image feature is input to the multi-scale module of the detection model, and multi-scale feature extraction is performed on the first image feature through the multi-scale module to obtain multiple second image features of the image to be detected; the multiple second image features are input into the heat map detection module of the detection model, and a convolution operation is performed on the multiple second image features through the heat map detection module to obtain the human body feature point position information and the human body feature point connection information in the image to be detected. Compared with the method for detecting human body feature points shown in FIG. 1, this embodiment additionally detects the human body feature points of the image to be detected through the detection model, so as to improve the accuracy of detecting the human body feature points.

Please refer to FIG. 6, which shows a block diagram of a human body feature point detection device 200 provided by an embodiment of the present application. As shown in FIG. 6, the human body feature point detection device 200 includes: a to-be-detected image acquisition module 210, a first image feature acquisition module 220, a second image feature acquisition module 230, and a human body feature point detection module 240, where:

The to-be-detected image acquisition module 210 is used to acquire the to-be-detected image.

The first image feature acquisition module 220 is configured to perform down-sampling processing on the image to be detected to obtain the first image feature of the image to be detected.

Further, the first image feature acquisition module 220 includes: a to-be-processed image feature acquisition sub-module and a first image feature acquisition sub-module, wherein:

The to-be-processed image feature acquisition sub-module is used to perform N1-fold down-sampling on the image to be detected to obtain the image feature to be processed, where N1 = 2^M1 and N1 is a positive integer.

The first image feature acquisition sub-module is configured to perform N2-fold up-sampling on the image feature to be processed to obtain the first image feature of the image to be detected, where N2 = 2^M2, N2 < N1, and N2 is a positive integer.

The second image feature acquisition module 230 is configured to perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected.

Further, the second image feature acquisition module 230 includes: a second image feature acquisition sub-module, wherein:

The second image feature acquisition sub-module is used to input the first image feature into the multi-scale module of the detection model, and perform multi-scale feature extraction on the first image feature through the multi-scale module to obtain multiple second image features of the image to be detected.

The human body feature point detection module 240 is configured to perform a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

Further, the human body feature point detection module 240 includes: a third feature image acquisition sub-module and a first human body feature point detection sub-module, wherein:

The third feature image acquisition sub-module is configured to perform feature extraction on the image to be detected, and obtain the third image feature of the image to be detected.

Further, the third characteristic image acquisition sub-module includes: a third characteristic image acquisition unit, wherein:

The third feature image acquisition unit is configured to perform N3-fold down-sampling on the image to be detected to obtain the third image feature of the image to be detected, where N3 = 2^(M1-M2) and N3 is a positive integer.

The first human body feature point detection sub-module is used to perform a convolution operation on the plurality of second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

Further, the first human body feature point detection sub-module includes: a fourth image feature obtaining unit and a human body feature point detection unit, wherein:

The fourth image feature obtaining unit is configured to channel-connect the plurality of second image features and the third image feature to obtain a fourth image feature.

The human body feature point detection unit is configured to perform a convolution operation on the fourth image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.

Further, the human body feature point detection module 240 includes: a second human body feature point detection sub-module, wherein:

The second human body feature point detection sub-module is used to input the multiple second image features into the heat map detection module of the detection model, and perform a convolution operation on the multiple second image features through the heat map detection module, to obtain the human body feature point position information and the human body feature point connection information output by the heat map detection module.

Further, the device 200 for detecting human body feature points further includes: a training data set acquisition module and a model training module, wherein:

The training data set acquisition module is used to acquire a training data set, the training data set includes a plurality of images, and the human body feature point position information and the human body feature point connection information corresponding to each of the multiple images.

The model training module is configured to, based on the training data set, use each image as input data and the human body feature point position information and human body feature point connection information corresponding to each image as output data, and train through a machine learning algorithm to obtain the trained detection model.

Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the device and module described above can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, the coupling between the modules may be electrical, mechanical or other forms of coupling.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.

Please refer to FIG. 7, which shows a structural block diagram of an electronic device 100 provided by an embodiment of the present application. The electronic device 100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, or an e-book. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more application programs are configured to execute the method described in the foregoing method embodiment.

The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the various parts of the electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, and application programs; the GPU is used for rendering and drawing the content to be displayed; the modem is used for processing wireless communication. It can be understood that the above-mentioned modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include random access memory (RAM) or read-only memory (ROM). The memory 120 may be used to store instructions, programs, codes, code sets or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the various method embodiments, and the like. The data storage area can store data created during use of the electronic device 100 (such as phone book, audio and video data, chat record data) and the like.

Please refer to FIG. 8, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application. The computer-readable storage medium 300 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.

The computer-readable storage medium 300 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 300 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 300 has storage space for the program code 310 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products. The program code 310 may be compressed in a suitable form, for example.

In summary, the human body feature point detection method, device, electronic device, and storage medium provided in the embodiments of the present application acquire the image to be detected; perform down-sampling processing on the image to be detected to obtain the first image feature of the image to be detected; perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected; and perform a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected. By performing multi-scale feature extraction on the image to be detected, image features at different scales are obtained, and the human body feature point position information and the human body feature point connection information are obtained based on the image features at different scales, thereby greatly improving the accuracy and efficiency of human body feature point detection.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of the technical features thereof may be equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for detecting human body feature points, characterized in that the method includes:

Obtaining an image to be detected;

Performing down-sampling processing on the image to be detected to obtain the first image feature of the image to be detected;

Performing multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected;

Performing a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.
The method according to claim 1, wherein performing the convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected comprises:

Performing down-sampling processing on the image to be detected to obtain a third image feature of the image to be detected;

Performing a convolution operation on the plurality of second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.
The method according to claim 2, wherein performing the convolution operation on the plurality of second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected comprises:

Channel-connecting the plurality of second image features and the third image feature to obtain a fourth image feature;

Performing a convolution operation on the fourth image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.
The method according to claim 3, wherein the channel connection of the plurality of second image features and the third image feature to obtain the fourth image feature comprises:

Channel-connecting the plurality of second image features and the third image feature through a concat operator to obtain the fourth image feature.
The method according to claim 3 or 4, wherein the image size corresponding to the second image feature is the same as the image size corresponding to the third image feature.
The method according to claim 2, wherein the performing down-sampling processing on the image to be detected to obtain the first image feature of the image to be detected comprises:

Performing N1-fold down-sampling processing on the image to be detected to obtain the image feature to be processed, where N1 = 2^M1 and N1 is a positive integer;

Performing N2-fold up-sampling processing on the image feature to be processed to obtain the first image feature of the image to be detected, where N2 = 2^M2, N2 < N1, and N2 is a positive integer.
The method according to claim 6, wherein the performing down-sampling processing on the image to be detected to obtain a third image feature of the image to be detected comprises:

Performing N3-fold down-sampling processing on the image to be detected to obtain the third image feature of the image to be detected, where N3 = 2^(M1-M2) and N3 is a positive integer.
The method according to any one of claims 1-7, wherein the performing multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected comprises:

Inputting the first image feature into a multi-scale module of a detection model, and performing multi-scale feature extraction on the first image feature through the multi-scale module to obtain the plurality of second image features of the image to be detected.
The method according to claim 8, wherein the multi-scale module comprises a plurality of convolutional layers in parallel, the convolution kernel of each of the plurality of convolutional layers is different, and each convolutional layer is used to extract a second image feature of a different scale from the first image feature.
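A parallel multi-scale module of this kind can be sketched as follows. The sketch assumes a single-channel feature, zero padding so that every output keeps the input size, kernel sizes of 1, 3 and 5, and fixed averaging weights in place of learned ones; none of these choices are stated in the application, and the names `conv2d_same`, `box_kernel` and `multi_scale` are hypothetical.

```python
def conv2d_same(image, kernel):
    """Slide a square kernel over a 2-D feature with zero padding."""
    size = len(kernel)
    pad = size // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(size):
                for j in range(size):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx] * kernel[i][j]
            out[y][x] = total
    return out


def box_kernel(size):
    """Illustrative averaging weights; a trained model would learn these."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def multi_scale(first_feature, kernel_sizes=(1, 3, 5)):
    """Apply parallel convolutions with different kernel sizes to one input."""
    return [conv2d_same(first_feature, box_kernel(k)) for k in kernel_sizes]


# A 5x5 feature with a single active pixel in the centre.
first_feature = [[0.0] * 5 for _ in range(5)]
first_feature[2][2] = 1.0

second_features = multi_scale(first_feature)
for k, feature in zip((1, 3, 5), second_features):
    print(k, feature[2][2], feature[0][0])
```

Larger kernels spread the central response over a wider neighbourhood, which is the sense in which each branch sees the same input at a different scale while producing outputs of identical size.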
The method according to any one of claims 1-7, wherein the performing a convolution operation on the plurality of second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected comprises:

Inputting the plurality of second image features into a heat map detection module of a detection model, and performing a convolution operation on the plurality of second image features through the heat map detection module to obtain the human body feature point position information and the human body feature point connection information output by the heat map detection module.
The method according to claim 10, wherein the heat map detection module includes one convolution stage, the one convolution stage includes a first processing branch and a second processing branch, the first processing branch is used to detect and output the human body feature point position information, and the second processing branch is used to detect and output the human body feature point connection information.
The method according to claim 11, wherein the first processing branch includes two convolutional layers, and the second processing branch includes two convolutional layers.
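The two-branch convolution stage can be sketched under strong simplifying assumptions: both layers in each branch are 1x1 convolutions (a per-pixel linear map over channels), a ReLU sits between the two layers, and the weights are hand-picked rather than trained. The application specifies only that each branch has two convolutional layers; the kernel size, the activation, and the names `conv1x1`, `relu`, `branch` and `heat_map_stage` are hypothetical.

```python
def conv1x1(feature, weights):
    """1x1 convolution over a [channels][height][width] feature.

    weights[out_channel][in_channel] gives the per-pixel linear map.
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [[[sum(w * feature[c][y][x] for c, w in enumerate(row))
              for x in range(width)]
             for y in range(height)]
            for row in weights]


def relu(feature):
    """Assumed activation between the two layers of a branch."""
    return [[[max(0.0, v) for v in row] for row in channel] for channel in feature]


def branch(feature, first_layer, second_layer):
    """One processing branch: two convolutional layers applied in sequence."""
    return conv1x1(relu(conv1x1(feature, first_layer)), second_layer)


def heat_map_stage(feature, position_weights, connection_weights):
    """Run both branches on the same input feature."""
    positions = branch(feature, *position_weights)
    connections = branch(feature, *connection_weights)
    return positions, connections


# Hypothetical 2-channel input of size 1x2 and hand-picked weights.
feature = [[[1.0, 2.0]], [[3.0, -4.0]]]
position_weights = ([[1, 0], [0, 1]], [[1, 1]])   # 2 -> 2 -> 1 channels
connection_weights = ([[1, 1]], [[2], [-1]])      # 2 -> 1 -> 2 channels

positions, connections = heat_map_stage(feature, position_weights, connection_weights)
print(positions)
print(connections)
```

Both branches read the same input, so the position output and the connection output are produced in a single pass through the stage.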
The method according to any one of claims 1-7, wherein before the acquiring the image to be detected, the method further comprises:

Acquiring a training data set, the training data set including a plurality of images, and human body feature point position information and human body feature point connection information corresponding to each of the multiple images;

Based on the training data set, using each image as input data and the human body feature point position information and human body feature point connection information corresponding to each image as output data, training a machine learning algorithm to obtain a trained detection model.
The method according to any one of claims 1-13, wherein after the performing a convolution operation on the plurality of second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected, the method further comprises:

Obtaining human body feature point information based on the human body feature point position information and the human body feature point connection information.
The method according to claim 14, wherein the obtaining human body feature point information based on the human body feature point position information and the human body feature point connection information comprises:

Obtaining the position of the human body feature point based on the position information of the human body feature point;

Connecting the human body feature points based on the human body feature point connection information, and drawing and generating the human body feature point information.
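The two steps above can be illustrated with a simplified sketch. It assumes a single person, one heat map per feature point whose strongest response marks the point position, and connection information that has already been reduced to one score per pair of feature points; the application does not specify this decoding, and the thresholds and the names `peak`, `locate_points` and `connect_points` are hypothetical.

```python
def peak(heat_map):
    """Return (row, col, score) of the strongest response in one heat map."""
    score, row, col = max(
        ((value, y, x) for y, line in enumerate(heat_map) for x, value in enumerate(line)),
        key=lambda item: item[0],
    )
    return row, col, score


def locate_points(heat_maps, threshold=0.5):
    """Map each feature point index to its (row, col) if it is confident enough."""
    points = {}
    for index, heat_map in enumerate(heat_maps):
        row, col, score = peak(heat_map)
        if score >= threshold:
            points[index] = (row, col)
    return points


def connect_points(points, connection_scores, threshold=0.5):
    """Return line segments between located points whose connection score is high."""
    return [
        (points[a], points[b])
        for (a, b), score in connection_scores.items()
        if score >= threshold and a in points and b in points
    ]


# Three hypothetical 3x3 heat maps; the third has no confident response.
heat_maps = [
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.3, 0.8]],
    [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.1]],
]
connection_scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1}

points = locate_points(heat_maps)
segments = connect_points(points, connection_scores)
print(points)
print(segments)
```

The resulting segments are what a drawing routine would render as the skeleton; pairs involving an undetected point are skipped even when their connection score is high.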
The method according to any one of claims 1-15, wherein the performing multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected comprises:

Inputting the first image feature into a plurality of convolutional layers having different convolution kernels, so that the plurality of convolutional layers having different convolution kernels respectively process the first image feature to obtain the plurality of second image features of the image to be detected.
A detection device for human body feature points, characterized in that the device comprises:

The acquisition module is used to acquire the image to be detected;

The first image feature acquisition module is configured to perform down-sampling processing on the image to be detected to obtain the first image feature of the image to be detected;

The second image feature acquisition module is configured to perform multi-scale feature extraction on the first image feature to obtain multiple second image features of the image to be detected;

The human body feature point detection module is configured to perform a convolution operation on the multiple second image features to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.
The device according to claim 17, wherein the human body feature point detection module comprises:

The third feature image acquisition sub-module is configured to perform down-sampling processing on the image to be detected to acquire the third image feature of the image to be detected;

The first human body feature point detection sub-module is configured to perform a convolution operation on the plurality of second image features and the third image feature to obtain the human body feature point position information and the human body feature point connection information in the image to be detected.
An electronic device, comprising a memory and a processor, wherein the memory is coupled to the processor, the memory stores instructions, and the processor performs the method according to any one of claims 1-16 when the instructions are executed by the processor.
A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and the program code can be called by a processor to execute the method according to any one of claims 1-16.