CN113221626A - Human body posture estimation method based on Non-local high-resolution network - Google Patents

Human body posture estimation method based on Non-local high-resolution network

Info

Publication number
CN113221626A
Authority
CN
China
Prior art keywords
convolution
resolution
channels
network
convolution kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110241318.9A
Other languages
Chinese (zh)
Other versions
CN113221626B (en)
Inventor
何宁 (He Ning)
孙琪翔 (Sun Qixiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202110241318.9A priority Critical patent/CN113221626B/en
Publication of CN113221626A publication Critical patent/CN113221626A/en
Application granted granted Critical
Publication of CN113221626B publication Critical patent/CN113221626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Abstract

The invention discloses a human body posture estimation method based on a Non-local high-resolution network, in which a Non-local network module is newly designed. Its residual connection allows a new Non-local network module to be inserted into any network without damaging the original network structure, so the algorithm can reuse the initial pre-trained weights of the high-resolution network. The non-local operation preserves the size of its input: the non-local network module can change the input values, but the parameters are the same at the input and output of the non-local network. In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features, yielding better experimental results.

Description

Human body posture estimation method based on Non-local high-resolution network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a human body posture estimation method based on a non-local high-resolution network, which can be applied to human body posture estimation and human body skeleton extraction.
Background
Human body pose estimation (Human Pose Estimation) is the process of detecting the key parts or main joints of a human body in a given image or video and finally outputting all or part of the body's limb parameters (the relative positions of the joint points), such as the human body outline, the position and orientation of the head, and the positions and types of the body's joints. Human behavior recognition based on computer vision technology is widely applied in many fields of human life, such as video surveillance, motion retrieval, human-computer interaction, smart homes and medical care. Human pose estimation is a fundamental problem in computer vision.
Approaches to pose estimation fall into two main categories: top-down methods and bottom-up methods. The top-down approach relies on a detector to detect human instances, produces one bounding box per person, and then reduces the problem to single-person pose estimation. Fang et al. propose the Regional Multi-Person Pose Estimation (RMPE) framework to mitigate the impact of erroneous detections on single-person pose estimation. Chen Y et al. propose the Cascaded Pyramid Network (CPN) method, which builds on a Feature Pyramid Network and uses a RefineNet to further refine the keypoints that are hard to predict. Similarly, Huang et al. propose a Coarse-Fine network that uses multi-scale supervision and fuses coarse and fine features to obtain the final prediction. Xiao et al. propose a very simple yet very effective network for multi-person pose estimation and tracking. When human pose estimation is performed top-down, detection boxes of different sizes are normalized to one scale for learning, so the method is insensitive to scale and predicts small-scale people relatively easily; however, it places high demands on the object detector, and detection errors are difficult to recover from. In addition, the amount of computation is proportional to the number of people in the picture: the more people, the larger the computation.
The bottom-up approach first finds all keypoints in the picture, such as all heads, left hands and knees, and then assembles these keypoints into individuals. After the introduction of deep learning, Pishchulin et al. proposed DeepCut, which models the assignment problem as an Integer Linear Programming (ILP) problem on a fully connected graph; much subsequent work centered on solving this ILP problem, and Iqbal et al. proposed solving the ILP problem locally, though with limited success. Insafutdinov et al. proposed DeeperCut, which estimates keypoints with a deeper ResNet and improves inference efficiency using image-dependent pairwise terms and an incremental optimization strategy. Levinkov et al. proposed two local search algorithms that monotonically converge to a local optimum, providing a feasible solution to the ILP problem. Varadarajan et al. proposed a greedy assignment algorithm that exploits the intrinsic structural information of the human body to reduce the complexity of the ILP problem. OpenPose, proposed by Cao Z et al., models the relationships between keypoints as Part Affinity Fields (PAFs) to assign keypoints better and faster; it greatly improved accuracy and speed over previous methods and for the first time brought multi-person pose estimation to real-time speed. Later work learns keypoint locations and keypoint assignment at the same time to avoid complex post-processing. Newell A et al. proposed Associative Embedding, which predicts a tag heatmap alongside the keypoints to help group and assign them. Nie et al. proposed a Pose Partition Network that uses global information from the embedding space to make local judgments, improving inference accuracy and efficiency. PersonLab, proposed by Papandreou et al., learns keypoints and their relative displacements at the same time, directly grouping the keypoints of the same person. The bottom-up method is not affected by the errors of a separate detection task, and its computation is independent of the number of people in the image, which greatly improves efficiency; however, it is sensitive to scale, and small-scale people are difficult to predict.
Deep learning algorithms are based on convolutional neural networks, and the convolution operations currently applied to images are local spatial operations. For example, in the commonly used 3×3 convolution, the filter covers 9 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 8 pixels. This means convolution can only use local information to compute the target pixel, which introduces bias because global information is not visible. To alleviate this problem, larger convolution filters or deeper networks with more convolution layers are typically used. Although this brings improvements, it also leads to computational inefficiency and optimization difficulties. To address these problems, this invention first designs a non-local neural network module, then fuses this module with the high-resolution network, the current best backbone for human pose estimation; finally, 3 NLHR network structures are designed, and the validity of the corresponding algorithms is verified on the MPII (MPII Human Pose dataset) and COCO human keypoint datasets.
Disclosure of Invention
Aiming at the problems and shortcomings of traditional convolutional neural networks, an NLHR network structure using a Non-local network module is proposed, and a novel network structure, the NLHR (Non-local High-Resolution) network, is designed; this network structure greatly improves the accuracy of human body posture estimation while keeping the number of parameters relatively small. A human body posture estimation method based on a Non-local high-resolution network comprises the following steps:
Step 1: acquire an image; a local image is read directly through a function, and noise is removed from the RGB image.
Step 2: detect the human body using the YOLOv3 network to obtain the human bounding box (bbox).
Step 3: expand the height or width of the human detection box to the fixed aspect ratio height : width = 4:3, then crop the detection box out of the image and resize it to a fixed size of 256×192.
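For illustration, the preprocessing of step 3 can be sketched in Python as follows; the use of OpenCV and the rounding and clipping details are assumptions not specified in the patent.

import cv2

def expand_to_aspect(x, y, w, h, target_ratio=4.0 / 3.0):
    # Grow height or width about the box center until height:width = 4:3.
    cx, cy = x + w / 2.0, y + h / 2.0
    if h / w < target_ratio:      # box too wide: grow the height
        h = w * target_ratio
    else:                         # box too tall: grow the width
        w = h / target_ratio
    return cx - w / 2.0, cy - h / 2.0, w, h

def crop_person(image, bbox, out_size=(192, 256)):  # (width, height) for cv2
    # Crop the expanded detection box out of the image and resize to 256x192.
    x, y, w, h = expand_to_aspect(*bbox)
    x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
    x1 = min(int(round(x + w)), image.shape[1])
    y1 = min(int(round(y + h)), image.shape[0])
    return cv2.resize(image[y0:y1, x0:x1], out_size)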
Step 4: extract the human skeleton. The crop from step 3 is fed into the NLHR network, and the statement python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml is run to extract the human skeleton; there are 17 keypoints. The main contribution of the invention is the design of the novel NLHR network, whose parameters and flow are as follows:
the initial part of the network is a convolution with two layers of step size 2 and a convolution kernel of 3 x 3, so that the resolution is reduced to 1/4 at the input, the number of channels becomes 64, followed by the body structure of the network, which contains 4 stages, 4 parallel convolution branches. The resolutions are 1/4, 1/8, 1/16, 1/32, respectively.
Stage 1 comprises 4 Bottleneck residual units. The 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels. The features then enter the transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remaining 1/4 of the input and the number of channels becoming 32, denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 of the input and the number of channels becoming 64, denoted x1.
In stage 2, x0 passes through 4 consecutive BasicBlocks with 32 input channels, and x1 passes through 4 consecutive BasicBlocks with 64 input channels. Then comes a fusion stage (a sketch of this fusion is given after the stage-4 description below). x0 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64. x1 likewise splits into two branches: one remains unchanged, and the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The branches with the same number of channels are then merged into a new x0 and x1. Next the features enter the transition2 module, in which x0 remains unchanged and x1 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
In stage 3, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels. Then comes a fusion stage. x0 splits into 3 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x1 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x2 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged. The branches with the same number of channels are then fused into a new x0, x1 and x2. Next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
In stage 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged. Then comes a fusion stage. x0 splits into 4 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x1 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x2 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x3 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128; the 4th remains unchanged. The branches with the same number of channels are then fused into a new x0, x1, x2 and x3. Next, x0 remains unchanged; x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The 4 branches are then fused with each other.
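The fusion pattern that recurs in stages 2-4 can be illustrated with a minimal PyTorch sketch of the two-branch case from stage 2; the module name and the nearest-neighbor upsampling after the 1×1 convolution are assumptions, not the patent's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFuse(nn.Module):
    # Exchanges information between the 1/4-resolution (32-channel) and
    # 1/8-resolution (64-channel) branches and sums maps of the same scale.
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)  # 1/4 -> 1/8
        self.up = nn.Conv2d(c_low, c_high, 1)                         # 1/8 -> 1/4

    def forward(self, x0, x1):  # x0: (N, 32, H, W), x1: (N, 64, H/2, W/2)
        new_x0 = x0 + F.interpolate(self.up(x1), scale_factor=2, mode="nearest")
        new_x1 = x1 + self.down(x0)
        return new_x0, new_x1

# x0, x1 = TwoBranchFuse()(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))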
Next comes the last layer: the fused features pass through a convolution with stride 1 and a 1×1 kernel, and the number of output channels corresponds to the number of keypoints of the dataset. The corresponding network structure parameters can be found in Table 1, where the square brackets give the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
TABLE 1 High-resolution network structure parameters
(The table is provided as an image in the original publication.)
A non-local network module:
Essentially, the high-resolution network is a standard approach for computer vision tasks, like the conventional CNN (Convolutional Neural Network). Convolutional neural networks are limited: for example, in a 5×5 convolution the filter covers 25 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 24 pixels. This means convolution can only use local information to compute the target pixel, which can cause errors because global information is not visible. There are of course many ways to alleviate this problem, such as larger convolution filters or deeper networks with more convolution layers. However, these methods considerably increase the amount of computation while improving the results only to a limited extent. To address this problem, the algorithm introduces the concept of the non-local mean.
The non-local mean is a classic filtering algorithm proposed by Buades et al. It is essentially an image denoising technique that makes full use of the redundant information in an image and preserves the image's details as much as possible while denoising. Its core idea is that the estimate of the current pixel is obtained by a weighted average of the pixels in the image whose neighborhoods have a similar structure. When computing the output at each pixel position, the correlation is computed with all positions in the image rather than only a neighborhood, and this correlation is then used as a weight representing the similarity between the other positions and the position currently being computed. The non-local mean is defined as follows: given a discrete noisy image v = {v(i) | i ∈ I}, for a pixel i the estimate NL[v](i) is computed as a weighted average of all pixels in the image, as shown in equation (1):
NL[v](i) = Σ_{j∈I} w(i,j) v(j)    (1)
where the family of weights {w(i,j)}_j depends on the similarity between pixels i and j and satisfies 0 ≤ w(i,j) ≤ 1 and Σ_j w(i,j) = 1. The similarity between two pixels i and j depends on the gray-level intensity vectors v(N_i) and v(N_j), where N_k denotes a square neighborhood of fixed size centered at pixel k. The similarity is measured as a decreasing function of the weighted Euclidean distance shown in equation (2):
‖v(N_i) − v(N_j)‖²_{2,a}    (2)

where a > 0 is the standard deviation of the Gaussian kernel.
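As a concrete illustration of equations (1) and (2), the following NumPy sketch estimates one pixel by non-local means; the patch size and the filtering parameter h are assumptions, and the Gaussian weighting of the patch distance (the parameter a) is omitted for brevity.

import numpy as np

def nl_means_pixel(v, i, patch=3, h=10.0):
    # Estimate pixel i = (r0, c0) of the noisy image v as a weighted average of
    # all pixels, weighting by the similarity of their square neighborhoods N_k.
    half = patch // 2
    pad = np.pad(v.astype(np.float64), half, mode="reflect")
    neighborhood = lambda r, c: pad[r:r + patch, c:c + patch]
    r0, c0 = i
    ref = neighborhood(r0, c0)
    weights = np.zeros_like(v, dtype=np.float64)
    for r in range(v.shape[0]):
        for c in range(v.shape[1]):
            d2 = np.sum((ref - neighborhood(r, c)) ** 2)  # squared patch distance
            weights[r, c] = np.exp(-d2 / (h * h))         # decreasing function of d2
    weights /= weights.sum()                              # so that sum_j w(i, j) = 1
    return np.sum(weights * v)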
To combine the concept of the non-local mean with deep learning, the important statistical and machine-learning concepts of variance and covariance are introduced. Both are defined for random variables: variance describes the deviation of a single random variable from its mean, while covariance describes the similarity between two random variables. If the distributions of two random variables are similar, their covariance is large; otherwise, their covariance is small. If each pixel in a feature map is treated as a random variable and the pairwise covariance between all pixels is calculated, the value of each predicted pixel can be enhanced or suppressed according to its similarity to the other pixels in the image.
To give each pixel a global reference, Wang et al., combining the above characteristics of the non-local mean, proposed a generic, simple non-local network module that can be embedded directly into an existing network and can capture long-range dependencies in the image. The specific formula of the non-local network module is given first:
y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (3)
where i is the index of an output position (in space, time or space-time) whose response is to be computed, and j is an index that enumerates all possible positions. x is the input signal (image, sequence or video; usually their features) and y is the output signal of the same size as x. The pairwise function f computes the correlation coefficient between i and all j (representing a relationship such as the degree of similarity). The unary function g computes a representation of the input signal at position j, and the response is normalized by the factor C(x). In contrast to a convolution operation, a non-local operation takes all positions into account (∀j), whereas a convolution accumulates weighted inputs only over a local neighborhood.
The function g uses a linear embedding g(x_j) = W_g x_j, where W_g is a learnable weight matrix, implemented in the experiments as a 1×1 convolution layer. The structure of the non-local network module is shown in the non-local network block diagram (FIG. 2).
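A minimal PyTorch sketch of such a non-local module follows, using the embedded-Gaussian form of f, with θ, φ and g implemented as 1×1 convolutions and with the output projection W_z zero-initialized (see equation (4) below) so that the block initially acts as an identity; the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels=256, inner=128):
        super().__init__()
        self.theta = nn.Conv2d(channels, inner, 1)  # embedding of x_i
        self.phi = nn.Conv2d(channels, inner, 1)    # embedding of x_j
        self.g = nn.Conv2d(channels, inner, 1)      # g(x_j) = W_g x_j
        self.w_z = nn.Conv2d(inner, channels, 1)    # output projection W_z
        nn.init.zeros_(self.w_z.weight)             # W_z = 0 => z_i = x_i at start
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):                             # x: (N, C, H, W)
        n, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, inner)
        k = self.phi(x).flatten(2)                    # (N, inner, HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, inner)
        attn = F.softmax(q @ k, dim=-1)               # f(x_i, x_j) / C(x)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return self.w_z(y) + x                        # z_i = W_z y_i + x_i

The input and output sizes are identical, which matches the requirement that the module can be inserted without changing the surrounding network.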
In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHRv3 network diagram.
Step 4 also calculates the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps, where P denotes the number of keypoints, l denotes the keypoint position information, y'_p denotes the predicted heatmap of keypoint p, and y_p denotes the ground-truth heatmap, as shown in equation (5):
L = (1/P) Σ_{p=1}^{P} ‖y'_p − y_p‖²    (5)
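A short PyTorch sketch of this loss, assuming heatmaps of shape (N, P, H, W) and a mean reduction over keypoints and pixels:

import torch

def keypoint_mse(pred, target):
    # pred, target: (N, P, H, W) predicted and ground-truth keypoint heatmaps.
    return ((pred - target) ** 2).mean()

# loss = keypoint_mse(torch.rand(2, 17, 64, 48), torch.rand(2, 17, 64, 48))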
Step 5: generate a json file containing the corresponding heatmaps and keypoint information.
Step 6: run the statement: python visualization/plot_coco.py --prediction output/coco/pose_hrnet/w32_256x192_adam_lr1e-3/results/keypoints_val2017_results_0.json --save-path visualization/results
Step 7: obtain the human skeleton diagram and display it overlaid on the original image as the result diagram.
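For illustration, the overlay of step 7 can be sketched with OpenCV as follows; the helper and the limb pairs (a subset of the COCO 17-keypoint skeleton) are assumptions, not the patent's code.

import cv2

# A few example limb pairs over the 17 COCO keypoints; the full skeleton has more edges.
LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (5, 6),
         (11, 13), (13, 15), (12, 14), (14, 16)]

def draw_skeleton(image, keypoints):
    # keypoints: list of 17 (x, y) pixel coordinates predicted by the network.
    out = image.copy()
    for a, b in LIMBS:
        cv2.line(out, tuple(map(int, keypoints[a])),
                 tuple(map(int, keypoints[b])), (0, 255, 0), 2)
    for x, y in keypoints:
        cv2.circle(out, (int(x), int(y)), 3, (0, 0, 255), -1)
    return out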
Compared with the prior art, the invention has the following advantages:
1. The invention designs a non-local network module, which is specifically defined by equation (4):
z_i = W_z y_i + x_i    (4)
where y_i is given by equation (3) and "+ x_i" denotes a residual connection [9]. When W_z is initialized to 0, the residual connection allows any new non-local network module to be inserted into any network without destroying the original network structure; that is, the present algorithm can use the initial pre-trained weights of the high-resolution network. It follows from equation (3) that the non-local operation preserves the size of its input: the non-local network module can change the input values, but the parameters are the same at the input and output of the non-local network.
2. The invention designs and verifies the NLHR network: in the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, with the number of channels set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHR network diagram.
Drawings
FIG. 1 is a flow chart of the present invention
FIG. 2 is a non-local network block diagram
FIG. 3 is a simplified diagram of the non-local network module
FIG. 4 is an NLHR network structure diagram
FIG. 5 is a diagram of human body posture estimation results of the present invention
Detailed Description
Aiming at the problems and shortcomings of traditional convolutional neural networks, an NLHR network structure using a Non-local network module is proposed, and a novel network structure, the NLHR (Non-local High-Resolution) network, is designed; this network structure greatly improves the accuracy of human body posture estimation while keeping the number of parameters relatively small. A human body posture estimation method based on a Non-local high-resolution network comprises the following steps:
Step 1: acquire an image; a local image is read directly through a function, and noise is removed from the RGB image.
Step 2: detect the human body using the YOLOv3 network to obtain the human bounding box (bbox).
Step 3: expand the height or width of the human detection box to the fixed aspect ratio height : width = 4:3, then crop the detection box out of the image and resize it to a fixed size of 256×192.
Step 4: extract the human skeleton. The crop from step 3 is fed into the NLHR network, and the statement python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml is run to extract the human skeleton; there are 17 keypoints. The main contribution of the invention is the design of the novel NLHR network, whose parameters and flow are as follows:
the initial part of the network is a convolution with two layers of step size 2 and a convolution kernel of 3 x 3, so that the resolution is reduced to 1/4 at the input, the number of channels becomes 64, followed by the body structure of the network, which contains 4 stages, 4 parallel convolution branches. The resolutions are 1/4, 1/8, 1/16, 1/32, respectively.
Stage 1 comprises 4 Bottleneck residual units. The 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels. The features then enter the transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remaining 1/4 of the input and the number of channels becoming 32, denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 of the input and the number of channels becoming 64, denoted x1.
In stage 2, x0 passes through 4 consecutive BasicBlocks with 32 input channels, and x1 passes through 4 consecutive BasicBlocks with 64 input channels. Then comes a fusion stage. x0 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64. x1 likewise splits into two branches: one remains unchanged, and the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The branches with the same number of channels are then merged into a new x0 and x1. Next the features enter the transition2 module, in which x0 remains unchanged and x1 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
In stage 3, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels. Then comes a fusion stage. x0 splits into 3 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x1 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x2 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged. The branches with the same number of channels are then fused into a new x0, x1 and x2. Next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
In stage 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged. Then comes a fusion stage. x0 splits into 4 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x1 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x2 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x3 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128; the 4th remains unchanged. The branches with the same number of channels are then fused into a new x0, x1, x2 and x3. Next, x0 remains unchanged; x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The 4 branches are then fused with each other.
Next comes the last layer: the fused features pass through a convolution with stride 1 and a 1×1 kernel, and the number of output channels corresponds to the number of keypoints of the dataset. The corresponding network structure parameters can be found in Table 1, where the square brackets give the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
TABLE 1 High-resolution network structure parameters
(The table is provided as an image in the original publication.)
A non-local network module:
Essentially, the high-resolution network is a standard approach for computer vision tasks, like the conventional CNN (Convolutional Neural Network). Convolutional neural networks are limited: for example, in a 5×5 convolution the filter covers 25 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 24 pixels. This means convolution can only use local information to compute the target pixel, which can cause errors because global information is not visible. There are of course many ways to alleviate this problem, such as larger convolution filters or deeper networks with more convolution layers. However, these methods considerably increase the amount of computation while improving the results only to a limited extent. To address this problem, the algorithm introduces the concept of the non-local mean.
The non-local mean is a classic filtering algorithm proposed by Buades et al. It is essentially an image denoising technique that makes full use of the redundant information in an image and preserves the image's details as much as possible while denoising. Its core idea is that the estimate of the current pixel is obtained by a weighted average of the pixels in the image whose neighborhoods have a similar structure. When computing the output at each pixel position, the correlation is computed with all positions in the image rather than only a neighborhood, and this correlation is then used as a weight representing the similarity between the other positions and the position currently being computed. The non-local mean is defined as follows: given a discrete noisy image v = {v(i) | i ∈ I}, for a pixel i the estimate NL[v](i) is computed as a weighted average of all pixels in the image, as shown in equation (1):
NL[v](i) = Σ_{j∈I} w(i,j) v(j)    (1)
where the family of weights {w(i,j)}_j depends on the similarity between pixels i and j and satisfies 0 ≤ w(i,j) ≤ 1 and Σ_j w(i,j) = 1. The similarity between two pixels i and j depends on the gray-level intensity vectors v(N_i) and v(N_j), where N_k denotes a square neighborhood of fixed size centered at pixel k. The similarity is measured as a decreasing function of the weighted Euclidean distance shown in equation (2):
‖v(N_i) − v(N_j)‖²_{2,a}    (2)

where a > 0 is the standard deviation of the Gaussian kernel.
To combine the concept of the non-local mean with deep learning, the important statistical and machine-learning concepts of variance and covariance are introduced. Both are defined for random variables: variance describes the deviation of a single random variable from its mean, while covariance describes the similarity between two random variables. If the distributions of two random variables are similar, their covariance is large; otherwise, their covariance is small. If each pixel in a feature map is treated as a random variable and the pairwise covariance between all pixels is calculated, the value of each predicted pixel can be enhanced or suppressed according to its similarity to the other pixels in the image.
To give each pixel a global reference, Wang et al., combining the above characteristics of the non-local mean, proposed a generic, simple non-local network module that can be embedded directly into an existing network and can capture long-range dependencies in the image. The specific formula of the non-local network module is given first:
y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (3)
where i is the index of an output position (in space, time or space-time) whose response is to be computed, and j is an index that enumerates all possible positions. x is the input signal (image, sequence or video; usually their features) and y is the output signal of the same size as x. The pairwise function f computes the correlation coefficient between i and all j (representing a relationship such as the degree of similarity). The unary function g computes a representation of the input signal at position j, and the response is normalized by the factor C(x). In contrast to a convolution operation, a non-local operation takes all positions into account (∀j), whereas a convolution accumulates weighted inputs only over a local neighborhood.
The function g uses a linear embedding g(x_j) = W_g x_j, where W_g is a learnable weight matrix, implemented in the experiments as a 1×1 convolution layer. The structure of the non-local network module is shown in the non-local network block diagram (FIG. 2).
In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHRv3 network diagram.
Step 4 also calculates the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps, where P denotes the number of keypoints, l denotes the keypoint position information, y'_p denotes the predicted heatmap of keypoint p, and y_p denotes the ground-truth heatmap, as shown in equation (5):
L = (1/P) Σ_{p=1}^{P} ‖y'_p − y_p‖²    (5)
Step 5: generate a json file containing the corresponding heatmaps and keypoint information.
Step 6: run the statement: python visualization/plot_coco.py --prediction output/coco/pose_hrnet/w32_256x192_adam_lr1e-3/results/keypoints_val2017_results_0.json --save-path visualization/results
Step 7: obtain the human skeleton diagram and display it overlaid on the original image as the result diagram.
The effectiveness is further verified experimentally:
here, experiments were first performed on the MPII dataset with 3 versions of NLHR networks, respectively, and then the optimal version of the network was verified again on the COCO dataset.
PCKh evaluation criterion [17]. Detection accuracy is obtained by giving an explicit boundary definition for each person in the test image. Given a candidate region in a bounding box (h, w) containing the original keypoint coordinates, the relevant threshold is controlled to obtain different accuracies when judging whether a predicted keypoint is reasonably located; here the threshold r = 0.5 is selected. PCKh uses the size of the head box, rather than the size of the torso, as the scale that normalizes the distances of the other parts, the distance being the Euclidean distance. If the Euclidean distance between a detected keypoint and the labeled keypoint is within the threshold range, the detection result is correct. Taking the k-th human keypoint as an example, PCKh is calculated as follows:
PCKh(k) = (1/N) Σ_{i=1}^{N} δ(‖p_i^k − g_i^k‖ ≤ r · s_h^i)

where PCKh(k) is the PCKh value of the k-th keypoint, and the final result is the average of these values over all keypoints; g_i^k is the labeled position of the k-th class of human keypoint in the i-th picture, p_i^k is the corresponding predicted position, N is the total number of samples, and s_h is the size of the head box, combined with the distance normalization coefficient r as the judgment condition. The smaller the threshold, the stricter the evaluation criterion; PCKh@0.5 means that with r = 0.5 the distance threshold 0.5·s_h is used when comparing the distance between the true value and the predicted value.
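A small NumPy sketch of this computation, assuming arrays of predicted and labeled coordinates and per-image head-box sizes:

import numpy as np

def pckh(pred, gt, head_size, r=0.5):
    # pred, gt: (N, K, 2) predicted and labeled keypoint coordinates;
    # head_size: (N,) head-box sizes s_h. Returns per-keypoint PCKh values (K,).
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, K) Euclidean distances
    correct = dist <= r * head_size[:, None]    # within the threshold r * s_h
    return correct.mean(axis=0)

# PCKh@0.5 averaged over all keypoints:
# score = pckh(pred, gt, head_size, r=0.5).mean()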
OKS evaluation criterion [18]. The COCO evaluation index is based on OKS; AP normalizes over different keypoint types and human sizes and is the average keypoint similarity between keypoints, lying in [0, 1]: the closer the prediction is to the original value, the closer it tends to 1, and otherwise it tends to 0. OKS is defined as in equation (6):
OKS = Σ_i [exp(−d_i² / (2 s² k_i²)) · δ(v_i > 0)] / Σ_i δ(v_i > 0)    (6)

where d_i is the Euclidean distance between the ground-truth coordinate θ^(p) and the predicted coordinate of the i-th keypoint, s² is the area occupied by the human body in the image, k_i is a normalization factor, and δ(v_i > 0) indicates that the visibility of the keypoint is greater than 0. The AP of human posture estimation is the average accuracy, calculated as in equation (7):
AP = Σ_p δ(OKS_p > t) / Σ_p 1    (7)

where t takes the OKS thresholds (0.50, 0.55, …, 0.90, 0.95); the prediction accuracy is calculated from the OKS of every person in all pictures of the test set and averaged over these thresholds.
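The following NumPy sketch illustrates equations (6) and (7); treating the person area as s² and the handling of people without visible keypoints are assumptions:

import numpy as np

def oks(pred, gt, vis, area, k):
    # pred, gt: (K, 2) coordinates; vis: (K,) visibility flags; area: the person
    # area s^2; k: (K,) per-keypoint normalization factors k_i.
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    e = np.exp(-d2 / (2.0 * area * k ** 2))
    mask = vis > 0
    return e[mask].sum() / max(mask.sum(), 1)

def average_precision(oks_values, thresholds=np.arange(0.50, 1.00, 0.05)):
    # Fraction of predictions with OKS > t, averaged over t = 0.50, 0.55, ..., 0.95.
    oks_values = np.asarray(oks_values)
    return np.mean([(oks_values > t).mean() for t in thresholds])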
Experimental results
Table 1 Experimental results on the MPII validation set (provided as an image in the original publication)
Table 2 Experimental results on the COCO validation set (provided as an image in the original publication)
Table 3 Parameters, GFLOPs and experimental results of human posture estimation networks (provided as an image in the original publication)

Claims (7)

1. A human body posture estimation method based on a Non-local high-resolution network, characterized by comprising the following steps:
step 1, acquiring an image: a local image is read directly through a function, and noise is removed from the RGB image;
step 2, detecting the human body by using the YOLOv3 network to obtain the human body bounding box bbox;
step 3, expanding the height or width of the human body detection frame to the fixed aspect ratio height : width = 4:3, then cutting the human body detection frame out of the image and resizing it to a fixed size of 256×192;
step 4, extracting the human skeleton: the crop from step 3 is fed into the NLHR network and a statement is run to extract the human skeleton; there are 17 keypoints; the NLHR network parameters and flow are as follows:
the initial part of the network consists of two layers of convolution with stride 2 and 3×3 kernels, so that the resolution is reduced to 1/4 of the input and the number of channels becomes 64; then comes the main structure of the network, which comprises 4 stages with 4 parallel convolution branches at resolutions 1/4, 1/8, 1/16 and 1/32;
step 4, calculating the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps;
step 5, generating a json file containing the corresponding heat map and the key point information;
and step 6, running the statement to obtain the human skeleton diagram and displaying it overlaid on the original image.
2. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in the NLHR network of step 4, the 1st stage comprises 4 Bottleneck residual units; the 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels; the data then enter a transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remains 1/4 of the input, the number of channels becomes 32, and it is denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution is reduced to 1/8 of the input, the number of channels becomes 64, and it is denoted x1.
3. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 2 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels and x1 through 4 consecutive BasicBlocks with 64 input channels; then comes a fusion stage: x0 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; x1 likewise splits into two branches, one remaining unchanged while the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the branches with the same number of channels are then merged into a new x0 and x1; next the data enter a transition2 module, in which x0 remains unchanged and x1 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
4. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 3 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels; then comes a fusion stage: x0 splits into 3 branches, the 1st remaining unchanged, the 2nd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64, and the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; x1 splits into 3 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd remaining unchanged, and the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; x2 splits into 3 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, and the 3rd remaining unchanged; the branches with the same number of channels are then fused into a new x0, x1 and x2; next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
5. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 4 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged; then comes a fusion stage: x0 splits into 4 branches, the 1st remaining unchanged, the 2nd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64, the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x1 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd remaining unchanged, the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x2 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, the 3rd remaining unchanged, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x3 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, the 3rd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128, and the 4th remaining unchanged; the branches with the same number of channels are then fused into a new x0, x1, x2 and x3; next, x0 remains unchanged, x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, and x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 4 branches are then fused with each other.
6. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in the NLHR network of step 4, the last layer passes the fused features through a convolution with stride 1 and a 1×1 kernel and outputs a number of channels corresponding to the number of keypoints of the dataset; the corresponding network structure parameters are given in Table 1, where the square brackets indicate the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
7. The human body posture estimation method based on Non-local high-resolution network according to claim 5, characterized in that: the non-local module is added at this stage because the lowest-resolution features are obtained here; these small-resolution features carry strong semantic information, and adding the non-local module to the low-resolution branch highlights the main features.
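The claims leave the internal structure of the non-local module to the description; the sketch below shows the standard embedded-Gaussian non-local block with a residual connection (after Wang et al., 2018), which matches the claimed properties that input and output sizes are equal and that the block can be inserted without disturbing the pretrained network. The 256-channel default follows the abstract; zero-initialising the output projection so the block starts as an identity mapping is an additional assumption, not claim text.

import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    # Embedded-Gaussian non-local block with a residual connection, so the
    # input and output shapes match and the block can be dropped into an
    # existing network. The patent's module may differ in detail.
    def __init__(self, channels=256):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)     # key embedding
        self.g = nn.Conv2d(channels, inter, 1)       # value embedding
        self.out = nn.Conv2d(inter, channels, 1)     # output projection
        nn.init.zeros_(self.out.weight)              # start as identity
        nn.init.zeros_(self.out.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinity
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection


# Example: the block keeps the 256-channel, 1/32-resolution map unchanged
# in size, as required by claim 5.
z = NonLocalBlock(256)(torch.randn(1, 256, 8, 6))      # shape (1, 256, 8, 6)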
CN202110241318.9A 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network Active CN113221626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241318.9A CN113221626B (en) 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network


Publications (2)

Publication Number Publication Date
CN113221626A true CN113221626A (en) 2021-08-06
CN113221626B CN113221626B (en) 2023-10-20

Family

ID=77084763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241318.9A Active CN113221626B (en) 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network

Country Status (1)

Country Link
CN (1) CN113221626B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN108334847A (en) * 2018-02-06 2018-07-27 哈尔滨工业大学 A kind of face identification method based on deep learning under real scene
CN109523470A (en) * 2018-11-21 2019-03-26 四川长虹电器股份有限公司 A kind of depth image super resolution ratio reconstruction method and system
US20190220993A1 (en) * 2018-01-18 2019-07-18 Samsung Electronics Co., Ltd. Pose estimation method, method of displaying virtual object using estimated pose, and apparatuses performing the same
CN110175575A (en) * 2019-05-29 2019-08-27 南京邮电大学 A kind of single Attitude estimation method based on novel high-resolution network model
CN110175566A (en) * 2019-05-27 2019-08-27 大连理工大学 A kind of hand gestures estimating system and method based on RGBD converged network
CN110930306A (en) * 2019-10-28 2020-03-27 杭州电子科技大学 Depth map super-resolution reconstruction network construction method based on non-local perception
CN110969105A (en) * 2019-11-22 2020-04-07 清华大学深圳国际研究生院 Human body posture estimation method
US10701394B1 (en) * 2016-11-10 2020-06-30 Twitter, Inc. Real-time video super-resolution with spatio-temporal networks and motion compensation
CN111460928A (en) * 2020-03-17 2020-07-28 中国科学院计算技术研究所 Human body action recognition system and method
CN111597976A (en) * 2020-05-14 2020-08-28 杭州相芯科技有限公司 Multi-person three-dimensional attitude estimation method based on RGBD camera
CN112131959A (en) * 2020-08-28 2020-12-25 浙江工业大学 2D human body posture estimation method based on multi-scale feature reinforcement
CN112232134A (en) * 2020-09-18 2021-01-15 杭州电子科技大学 Human body posture estimation method based on hourglass network and attention mechanism
CN112232106A (en) * 2020-08-12 2021-01-15 北京工业大学 Two-dimensional to three-dimensional human body posture estimation method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Congcong; HE Ning: "Human action recognition method based on key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition) *
WEN Boge: "Research on predicting action boundaries of metro drivers based on a temporal-convolution non-local mean neural network", Railway Locomotive & Motor Car *
LU Hao; SHI Min; LI Hao; ZHU Dengming: "Camera pose estimation method for dynamic scenes based on deep learning", High Technology Letters *

Also Published As

Publication number Publication date
CN113221626B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN109800628B (en) Network structure for enhancing detection performance of SSD small-target pedestrians and detection method
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN111460926B (en) Video pedestrian detection method fusing multi-target tracking clues
CN102682302B (en) Human body posture identification method based on multi-characteristic fusion of key frame
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN105069434B (en) A kind of human action Activity recognition method in video
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN110765906A (en) Pedestrian detection algorithm based on key points
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
US20190108400A1 (en) Actor-deformation-invariant action proposals
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN109697727A (en) Method for tracking target, system and storage medium based on correlation filtering and metric learning
CN111242985B (en) Video multi-pedestrian tracking method based on Markov model
CN114842553A (en) Behavior detection method based on residual shrinkage structure and non-local attention
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113269038B (en) Multi-scale-based pedestrian detection method
CN113850221A (en) Attitude tracking method based on key point screening
CN110309729A (en) Tracking and re-detection method based on anomaly peak detection and twin network
CN111862147B (en) Tracking method for multiple vehicles and multiple lines of human targets in video
Wang et al. Summary of object detection based on convolutional neural network
CN116416503A (en) Small sample target detection method, system and medium based on multi-mode fusion
CN116109673A (en) Multi-frame track tracking system and method based on pedestrian gesture estimation
CN113221626A (en) Human body posture estimation method based on Non-local high-resolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant