CN116363750A - Human body posture prediction method, device, equipment and readable storage medium - Google Patents

Publication number
CN116363750A
Authority
CN
China
Prior art keywords
human body
key point
heatmap
predicted
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310251993.9A
Other languages
Chinese (zh)
Inventor
魏格格
薛楠
吴田富
夏桂松
张良培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202310251993.9A
Publication of CN116363750A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human body posture prediction method, device, equipment and readable storage medium. The method comprises: extracting a feature map from an input picture; obtaining, based on the feature map, the offsets of the human body key points, a key point heatmap and a key point center heatmap; obtaining predicted human body key points from the offsets and the center heatmap; generating a local key point expansion window centered on each predicted human body key point and converting it into a key point attraction field; generating a global key point expansion window from the predicted human body key points and the key point heatmap, and convolving the global key point expansion window with the key point attraction field as the convolution kernel to obtain a corrected key point heatmap; obtaining the two-dimensional coordinates of the predicted human body joint points from the corrected key point heatmap; and encoding and decoding the spatial and temporal information of those two-dimensional coordinates, respectively, to obtain three-dimensional information of the predicted human body joint points. The method and device improve the accuracy of human body posture prediction.

Description

Human body posture prediction method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of computer vision, and in particular, to a human body posture prediction method, apparatus, device, and readable storage medium.
Background
One important task in computer vision research is human skeleton key point detection, in which a computer perceives the positions of the skeleton key points of a human body. This provides a foundation for many practical scenarios such as action recognition, abnormal action detection, intelligent monitoring and autonomous driving.
The goal of human skeleton key point detection is to take a picture as input and output the coordinates, in the picture and in the real world, of each skeleton key point of each human body. Currently, the mainstream deep-learning-based human body key point detection techniques fall into two categories: top-down and bottom-up. Top-down methods first detect the bounding boxes of human bodies and then perform single-person posture estimation within each box. They achieve higher accuracy, but the cost of single-person posture estimation grows in proportion to the number of boxes, so computational efficiency is low, and accuracy is limited by that of the human body detector. Bottom-up methods generally involve two steps: first detecting human body key points and then grouping them. One representative of this category predicts key point locations from key point centers and offsets, which avoids complex key point grouping and is faster to compute. Its drawback is equally apparent: offset vectors are estimated inaccurately for key points far from the center, so the overall accuracy of the final model is not high.
Disclosure of Invention
To address the shortcomings of existing techniques that predict key points from a key point center point and offsets, the invention provides a human body posture prediction method, device, equipment and readable storage medium.
In a first aspect, the present invention provides a human body posture prediction method, including:
extracting a feature map from an input picture through a convolutional neural network;
obtaining, based on the extracted feature map, the offsets of the human body key points, a human body key point heatmap and a human body key point center heatmap; obtaining predicted human body key points from the offsets of the human body key points and the human body key point center heatmap; and generating a local key point expansion window centered on each predicted human body key point;
converting the local key point expansion window into a key point attraction field;
generating a global key point expansion window by using the predicted human body key points and the human body key point heatmap, and performing a convolution operation on the global key point expansion window with the key point attraction field as the convolution kernel, to obtain a corrected key point heatmap;
obtaining the two-dimensional coordinates of the predicted human body joint points from the corrected key point heatmap;
and encoding and decoding the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint points, respectively, to obtain three-dimensional information of the predicted human body joint points.
Optionally, the step of obtaining the offsets of the human body key points, the human body key point heatmap and the human body key point center heatmap based on the extracted feature map includes:
reducing the dimension of the feature map through two parallel branches respectively, and then performing convolution processing with a convolution layer whose kernel size is 1×1;
wherein one branch outputs the human body key point heatmap and the human body key point center heatmap, denoted by the expression shown in Figure BDA0004128205380000021, where H is a heatmap and h and w depend on the sampling stride of the backbone network; the resulting low-resolution heatmap is upsampled by bilinear interpolation to obtain a high-resolution heatmap, denoted by the expression shown in Figure BDA0004128205380000022; and the other branch outputs the offsets of the 17 human body key points, denoted by the expression shown in Figure BDA0004128205380000023.
Optionally, the step of obtaining predicted human body key points from the offsets of the human body key points and the human body key point center heatmap, and generating a local key point expansion window centered on each predicted human body key point, includes:
performing non-maximum suppression on the human body key point center heatmap, and selecting the N highest-scoring points as candidate key point center points, where N is a positive integer;
obtaining N candidate human body postures from the candidate key point center points and the offsets of the human body key points, while removing human body postures whose human body key point center heatmap score is smaller than a threshold;
for the N candidate human body postures, taking the 17 key points of each candidate human body posture as predicted human body key points;
generating a local key point expansion window, denoted by the expression shown in Figure BDA0004128205380000024, centered on each predicted human body key point.
Optionally, the step of converting the local key point expansion window into a key point attraction field includes:
processing the feature map with a convolution layer, batch normalization and a rectified linear unit to obtain a feature map of dimension C3;
using the local key point expansion window, obtaining a local key point expansion window feature map from the feature map of dimension C3 by bilinear interpolation;
processing the local key point expansion window feature map with three different convolution layers, batch normalization and rectified linear units to obtain the key point attraction field.
Optionally, the step of generating a global key point expansion window by using the predicted human body key points and the human body key point heatmap, and performing a convolution operation on the global key point expansion window with the key point attraction field as the convolution kernel to obtain a corrected key point heatmap, includes:
generating a global key point expansion window for each predicted human body key point;
using the generated global key point expansion window, obtaining a global key point expansion window feature map from the feature map of dimension C3 by bilinear interpolation;
weighting the global key point expansion window feature map with a Gaussian kernel to obtain a new global key point expansion window feature map;
and performing a convolution operation on the new global key point expansion window feature map with the key point attraction field as the convolution kernel, thereby fusing local and global context information and obtaining the corrected key point heatmap.
Optionally, the step of obtaining the two-dimensional coordinates of the predicted human body joint points from the corrected key point heatmap includes:
for each predicted human body key point, selecting the two highest-scoring positions in the corrected key point heatmap, weighting the scores of the two positions, calculating a first product of the two weighted scores, multiplying the first product by the score of the corresponding candidate key point center point to obtain a second product, and taking the second product as the confidence score of the predicted human body key point;
and taking the position of the key point with the highest confidence score as the two-dimensional coordinates of the predicted human body joint point.
Optionally, the step of encoding and decoding the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint points, respectively, to obtain three-dimensional information of the predicted human body joint points includes:
performing spatial-scale encoding with the two-dimensional coordinates of the predicted human body joint points as key values and a latent vector as the index, to obtain output features;
encoding the output features at the temporal scale to obtain an output array;
and regressing the three-dimensional coordinates of the predicted human body joint points from the output array.
In a second aspect, the present invention also provides a human body posture predicting apparatus, comprising:
an extraction module, configured to extract a feature map from an input picture through a convolutional neural network;
a first generation module, configured to obtain, based on the extracted feature map, the offsets of the human body key points, a human body key point heatmap and a human body key point center heatmap, to obtain predicted human body key points from the offsets of the human body key points and the human body key point center heatmap, and to generate a local key point expansion window centered on each predicted human body key point;
a conversion module, configured to convert the local key point expansion window into a key point attraction field;
a second generation module, configured to generate a global key point expansion window by using the predicted human body key points and the human body key point heatmap, and to perform a convolution operation on the global key point expansion window with the key point attraction field as the convolution kernel, to obtain a corrected key point heatmap;
a third generation module, configured to obtain the two-dimensional coordinates of the predicted human body joint points from the corrected key point heatmap;
and an encoding and decoding module, configured to encode and decode the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint points, respectively, to obtain three-dimensional information of the predicted human body joint points.
In a third aspect, the present invention also provides a human body posture predicting device comprising a processor, a memory, and a human body posture predicting program stored on the memory and executable by the processor, wherein the human body posture predicting program, when executed by the processor, implements the steps of the human body posture predicting method as described above.
In a fourth aspect, the present invention also provides a readable storage medium having stored thereon a human body posture prediction program, wherein the human body posture prediction program, when executed by a processor, implements the steps of the human body posture prediction method as described above.
In the invention, a feature map is extracted from an input picture through a convolutional neural network; the offsets of the human body key points, a human body key point heatmap and a human body key point center heatmap are obtained based on the extracted feature map; predicted human body key points are obtained from the offsets of the human body key points and the human body key point center heatmap, and a local key point expansion window is generated centered on each predicted human body key point; the local key point expansion window is converted into a key point attraction field; a global key point expansion window is generated by using the predicted human body key points and the human body key point heatmap, and a convolution operation is performed on the global key point expansion window with the key point attraction field as the convolution kernel to obtain a corrected key point heatmap; the two-dimensional coordinates of the predicted human body joint points are obtained from the corrected key point heatmap; and the spatial and temporal information of those two-dimensional coordinates is encoded and decoded, respectively, to obtain three-dimensional information of the predicted human body joint points. The invention makes effective use of feature information, combining features from earlier and later stages as well as global and local information, and thereby outputs richer feature information, improves the localization of human body key points, and in turn improves the accuracy of human body posture prediction.
Drawings
FIG. 1 is a flow chart of an embodiment of a human body posture prediction method according to the present invention;
FIG. 2 is a schematic diagram of functional modules of an embodiment of a human body posture predicting device according to the present invention;
FIG. 3 is a schematic diagram of the hardware structure of a human body posture predicting device according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In a first aspect, an embodiment of the present invention provides a human body posture prediction method.
In an embodiment, referring to FIG. 1, which is a flowchart of an embodiment of the human body posture prediction method according to the present invention, the human body posture prediction method includes:
Step S10, extracting a feature map from an input picture through a convolutional neural network;
In this embodiment, for a given input picture, a low-dimensional feature map is extracted by a convolutional neural network. The feature extraction network may be HRNet, and the spatial dimension of the final output feature map depends on the overall stride of the feature extraction network.
For example, for an original picture of a given height and width, each pass through a pooling layer rescales the height and width of the features by a factor of s. Taking the original picture as input, a series of convolution layers, rectified linear units and pooling layers computes a feature map $F \in \mathbb{R}^{C \times h \times w}$ with feature dimension C, where the spatial dimension h×w depends on the overall stride of the feature extraction module.
Step S20, obtaining, based on the extracted feature map, the offsets of the human body key points, a human body key point heatmap and a human body key point center heatmap; obtaining predicted human body key points from the offsets of the human body key points and the human body key point center heatmap; and generating a local key point expansion window centered on each predicted human body key point;
In this embodiment, after the feature map $F \in \mathbb{R}^{C \times h \times w}$ is obtained, a convolution layer with a 1×1 kernel, batch normalization and a rectified linear unit are first used to reduce the dimension of the feature map, and a convolution layer with a 1×1 kernel then performs convolution processing to produce the outputs of two branches. Branch one outputs the heatmaps of the key points and of the key point centers, denoted by the expression shown in Figure BDA0004128205380000061; the resulting low-resolution heatmap is upsampled by bilinear interpolation to obtain a high-resolution heatmap, denoted by the expression shown in Figure BDA0004128205380000062.
Branch two outputs the offsets of the 17 human body key point positions, denoted by the expression shown in Figure BDA0004128205380000063.
A series of candidate human body postures is obtained from the computed key point center heatmap and the key point position offsets. Specifically, non-maximum suppression (window size 3×3) is first applied to the center heatmap, and the N highest-scoring points are selected as candidate key point center points. The candidate key point center points and the key point offsets are then used to obtain N candidate human body posture estimates, and human body postures whose key point center heatmap score is smaller than a given threshold (0.01 in this embodiment) are removed.
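The candidate selection just described can be sketched as follows. This is a minimal illustration in plain Python rather than the patented implementation: the function names, the nested-list data layout and the `offsets[k][y][x] = (dx, dy)` convention are assumptions made for the example, and plateaus of equal scores are not specially handled.

```python
def center_nms_topn(center, n, thresh=0.01):
    """3x3 non-maximum suppression on a center heatmap, then keep the top-n peaks.

    center is a list of rows of scores; the result is a list of (score, x, y).
    """
    h, w = len(center), len(center[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            s = center[y][x]
            if s < thresh:
                continue  # drop low-scoring centers (0.01 in this embodiment)
            # a peak must be >= every value inside its 3x3 neighborhood
            neigh = [center[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            if s >= max(neigh):
                peaks.append((s, x, y))
    peaks.sort(reverse=True)  # highest score first
    return peaks[:n]


def decode_poses(peaks, offsets):
    """Turn each center peak into one candidate posture.

    offsets[k][y][x] is the assumed (dx, dy) offset of key point k
    predicted at center pixel (x, y).
    """
    poses = []
    for score, cx, cy in peaks:
        kps = [(cx + off[cy][cx][0], cy + off[cy][cx][1]) for off in offsets]
        poses.append({"score": score, "keypoints": kps})
    return poses
```

With 17 offset maps, each returned posture carries 17 predicted key points together with the score of its center point.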
A local human body key point expansion window is then generated. Specifically, for the N computed candidate human body postures, each of the 17 key points of each posture generates a local grid (11×11 in this embodiment) centered on the position of that point, to compensate for the information lost when the human body posture is estimated from center-point offsets. This grid is defined as the human body key point expansion window, denoted by the expression shown in Figure BDA0004128205380000064.
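A grid of this kind can be built as in the short sketch below. The function name and the `step` spacing parameter are hypothetical; the text only states that the grid is 11×11 and centered on the key point.

```python
def expansion_window(cx, cy, size=11, step=1.0):
    """Return a size x size grid of (x, y) sample points centered on (cx, cy).

    step is the assumed spacing between neighboring grid points.
    """
    half = (size - 1) / 2.0
    return [[(cx + (i - half) * step, cy + (j - half) * step)
             for i in range(size)]
            for j in range(size)]
```

Because the key point position is generally fractional, the grid points are fractional too, which is why the later steps read features at these points by bilinear interpolation.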
Further, in an embodiment, the step of obtaining the offsets of the human body key points, the human body key point heatmap and the human body key point center heatmap based on the extracted feature map includes:
reducing the dimension of the feature map through two parallel branches respectively, and then performing convolution processing with a convolution layer whose kernel size is 1×1; wherein one branch outputs the human body key point heatmap and the human body key point center heatmap, denoted by the expression shown in Figure BDA0004128205380000065, where H is a heatmap and h and w depend on the sampling stride of the backbone network; the resulting low-resolution heatmap is upsampled by bilinear interpolation to obtain a high-resolution heatmap, denoted by the expression shown in Figure BDA0004128205380000066; and the other branch outputs the offsets of the 17 human body key points, denoted by the expression shown in Figure BDA0004128205380000067.
In this embodiment, the specific flow is as follows:
Figure BDA0004128205380000068
Figure BDA0004128205380000069
where $F \in \mathbb{R}^{C \times h \times w}$ is the feature map, C is the feature dimension, and C1 and C2 are predefined parameters (C1 = 32 and C2 = 256 are commonly used in the algorithm). Conv denotes convolution, BN denotes batch normalization, and ReLU denotes the rectified linear unit.
Further, in an embodiment, the step of obtaining predicted human body key points from the offsets of the human body key points and the human body key point center heatmap, and generating a local key point expansion window centered on each predicted human body key point, includes:
performing non-maximum suppression on the human body key point center heatmap, and selecting the N highest-scoring points as candidate key point center points, where N is a positive integer; obtaining N candidate human body postures from the candidate key point center points and the offsets of the human body key points, while removing human body postures whose human body key point center heatmap score is smaller than a threshold; for the N candidate human body postures, taking the 17 key points of each candidate human body posture as predicted human body key points; and generating a local key point expansion window, denoted by the expression shown in Figure BDA0004128205380000071, centered on each predicted human body key point.
In this embodiment, the specific flow is as follows:
Figure BDA0004128205380000072
Step S30, converting the local key point expansion window into a key point attraction field;
In this embodiment, the feature map is first processed with a convolution layer, batch normalization and a rectified linear unit to obtain a feature map of dimension C3 (64 in this embodiment); a local key point expansion window feature map is then obtained from the processed feature map by bilinear interpolation over the local geometric grid.
The key point attraction field is then generated by a convolutional message-passing module. Specifically, the local key point expansion window feature map obtained in the preceding step is processed with three different convolution layers together with normalization and rectified linear units to obtain the key point attraction field. Within the convolutional message-passing module, attention-based normalization is used in place of batch normalization, in order to emphasize the specificity of different posture instances.
Further, in an embodiment, step S30 includes:
processing the feature map with a convolution layer, batch normalization and a rectified linear unit to obtain a feature map of dimension C3; using the local key point expansion window, obtaining a local key point expansion window feature map from the feature map of dimension C3 by bilinear interpolation; and processing the local key point expansion window feature map with three different convolution layers, batch normalization and rectified linear units to obtain the key point attraction field.
In this embodiment, the specific flow is as follows:
Figure BDA0004128205380000073
Figure BDA0004128205380000081
where Cin, Cout and C6 are all dimensions.
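The bilinear interpolation used to read a window feature map out of a feature map can be illustrated for a single channel as below. This is only a sketch: treating samples outside the map as zero and the helper names are assumptions, and in the method the same sampling would be applied to each channel of the feature map.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D map (list of rows) at fractional (x, y).

    Positions outside the map are treated as 0 (an assumed padding rule).
    """
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0  # fractional parts act as interpolation weights

    def at(i, j):
        return fmap[j][i] if 0 <= i < w and 0 <= j < h else 0.0

    top = (1 - ax) * at(x0, y0) + ax * at(x0 + 1, y0)
    bot = (1 - ax) * at(x0, y0 + 1) + ax * at(x0 + 1, y0 + 1)
    return (1 - ay) * top + ay * bot


def sample_window(fmap, window):
    """Read one value per grid point of an expansion window of (x, y) pairs."""
    return [[bilinear_sample(fmap, x, y) for (x, y) in row] for row in window]
```

Sampling an 11×11 window in this way yields an 11×11 patch per channel, which is the input that the convolution layers then turn into the key point attraction field.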
Step S40, generating a global key point expansion window by using the predicted human body key points and the human body key point heatmap, and performing a convolution operation on the global key point expansion window with the key point attraction field as the convolution kernel, to obtain a corrected key point heatmap;
In this embodiment, for each predicted human body key point, an expanded global grid (a×a), denoted by the expression shown in Figure BDA0004128205380000082, is generated. This grid can be understood as the global key point expansion window. The generated global key point expansion window is then used to obtain a global key point window feature map from the predicted global heatmap by bilinear interpolation. Finally, a Gaussian kernel is used for re-weighting, giving the global key point expansion window feature map denoted by the expression shown in Figure BDA0004128205380000083. The specific flow is as follows:
Figure BDA0004128205380000084
where the quantity shown in Figure BDA0004128205380000085 is the Gaussian kernel given in Figure BDA0004128205380000086.
A convolution operation is then performed on the global key point expansion window feature map, with the obtained key point attraction field as the convolution kernel, to obtain the corrected key point heatmap. Specifically, the learned key point attraction field serves as the convolution kernel applied to the global key point expansion window feature map, which fuses local and global context information and finally yields the corrected heatmaps of the 17 key points.
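The re-weighting and the per-key-point convolution can be sketched for a single channel as follows. Several choices here are assumptions, since the exact formulas appear only in the figures: the Gaussian is applied as an element-wise weight centered on the window, `sigma` is a free parameter, and the convolution is computed without padding as the cross-correlation used by CNN layers.

```python
import math


def gaussian_kernel(size, sigma):
    """size x size Gaussian weights with peak value 1 at the window center."""
    c = (size - 1) / 2.0
    return [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
             for i in range(size)]
            for j in range(size)]


def reweight(window, sigma):
    """Element-wise Gaussian re-weighting of a square window (assumed form)."""
    n = len(window)
    g = gaussian_kernel(n, sigma)
    return [[window[j][i] * g[j][i] for i in range(n)] for j in range(n)]


def correlate_valid(window, kernel):
    """Unpadded 2D cross-correlation of an a x a window with a k x k kernel.

    The kernel plays the role of the key point attraction field; the
    output has side length a - k + 1.
    """
    a, k = len(window), len(kernel)
    out = a - k + 1
    return [[sum(window[y + j][x + i] * kernel[j][i]
                 for j in range(k) for i in range(k))
             for x in range(out)]
            for y in range(out)]
```

The point of the construction is that the kernel is not a fixed learned weight but is predicted separately for each key point from its local window, so every key point filters its own global window with its own kernel.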
Further, in an embodiment, step S40 includes:
generating a global key point expansion window for each predicted human body key point; using the generated global key point expansion window, obtaining a global key point expansion window feature map from the feature map of dimension C3 by bilinear interpolation; weighting the global key point expansion window feature map with a Gaussian kernel to obtain a new global key point expansion window feature map; and performing a convolution operation on the new global key point expansion window feature map with the key point attraction field as the convolution kernel, thereby fusing local and global context information and obtaining the corrected key point heatmap.
Step S50, obtaining the two-dimensional coordinates of the predicted human body joint point from the corrected key point thermodynamic diagram;
Further, in an embodiment, step S50 includes:
selecting the first two positions with the highest scores in the corrected key point thermodynamic diagram for each predicted human body key point, then weighting the scores of the two positions, calculating a first product of the scores of the two weighted positions, multiplying the first product by the score of the corresponding candidate key point center point to obtain a second product, and taking the second product as the confidence score of the predicted human body key point; and taking the position of the key point with the highest confidence score as the two-dimensional coordinate of the predicted human body joint point.
In this embodiment, the specific flow is as follows:
Figure BDA0004128205380000091
Figure BDA0004128205380000092
is the predicted human body key point, N' is the number of pose instances that are ultimately predicted in one image I.
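The confidence computation of step S50 can be sketched as follows. The weights w_first and w_second and the function name keypoint_confidence are hypothetical; the embodiment states that the two scores are weighted but does not give the weight values.

```python
import numpy as np

def keypoint_confidence(heatmap, center_score, w_first=0.75, w_second=0.25):
    """Return ((x, y), confidence) for one corrected key point heatmap.

    heatmap: (H, W) corrected thermodynamic diagram of one key point.
    center_score: score of the corresponding candidate key point center.
    w_first, w_second: assumed weights for the two best scores.
    """
    flat = heatmap.ravel()
    best, second = np.argsort(flat)[-2:][::-1]  # two highest-scoring positions
    first_product = (w_first * flat[best]) * (w_second * flat[second])
    second_product = first_product * center_score  # confidence score
    y, x = np.unravel_index(best, heatmap.shape)
    return (int(x), int(y)), float(second_product)
```

The returned coordinate is the location of the best-scoring position, expressed as (x, y) in heatmap pixels.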
Step S60, encoding and decoding respectively by using the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint points, to obtain the three-dimensional information of the predicted human body joint points.
In this embodiment, spatial-scale encoding is performed with the two-dimensional coordinates of the predicted human body joint points as the key values and an implicit vector as the index, to obtain output features. Time-scale encoding is then performed on the output features, with the implicit vector serving as both the key values and the index, to obtain an output array Y. The positions of the joint points in the real three-dimensional world are then regressed from Y through a weighted average and a multi-layer perceptron, giving the predicted three-dimensional information of the human body joint points.
Further, in one embodiment, step S60 includes:
performing spatial scale coding by taking the two-dimensional coordinates of the predicted human body joint points as key values and implicit vectors serving as indexes to obtain output characteristics; encoding the output characteristics through a time scale to obtain an output array; and obtaining the three-dimensional coordinates of the predicted human body joint point according to the output array regression.
Compared with the prior art, the embodiment has the following advantages:
1. This embodiment addresses the low accuracy of human body posture prediction methods based on key point centers and offsets: a local expansion window is constructed around each predicted key point to further refine its position, and the whole key point detection and correction network is an end-to-end network.
2. This embodiment provides a novel local and global information adaptation module that fuses local and global structural information and yields structured human body posture estimates.
3. Through two-stage training, this embodiment can handle multi-person posture estimation under severe self-occlusion.
Further, the loss function of the full convolution network according to this embodiment includes the following parts:
Key point thermodynamic diagram loss function. Let
Figure BDA0004128205380000093
denote the ground-truth thermodynamic diagrams of each key point and of the key point center, generated by modeling the mean and variance of a given dataset with a Gaussian model. The elements of the set with dimensions 10×h×w are denoted by p = (i, x). For the predicted key point and center thermodynamic diagram
Figure BDA0004128205380000101
the discrepancy between it and
Figure BDA0004128205380000102
is computed as the loss function of the key point thermodynamic diagram, as follows:
Figure BDA0004128205380000103
where w(x) represents the weight of a pixel: w(x) = 1 for foreground pixels and w(x) = 0.1 for background pixels.
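The foreground and background weighting can be sketched as follows. The squared-error form, the mean reduction, the explicit foreground mask and the function name heatmap_loss are assumptions made for this sketch; only the weights 1 and 0.1 come from the description above.

```python
import numpy as np

def heatmap_loss(pred, target, fg_mask, w_fg=1.0, w_bg=0.1):
    """Weighted squared error between predicted and ground-truth heatmaps.

    pred, target: arrays of identical shape, e.g. (channels, h, w).
    fg_mask: boolean array of the same shape, True on foreground pixels
             (how foreground is determined is not specified here).
    """
    weights = np.where(fg_mask, w_fg, w_bg)  # w(x) = 1 or 0.1
    return float(np.mean(weights * (pred - target) ** 2))
```

Down-weighting the background keeps the many near-zero background pixels from dominating the few foreground pixels around each key point.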
Key point offset field loss function. Let
Figure BDA0004128205380000104
denote the ground-truth offsets, and let c_GT denote the non-empty set of key point centers in the ground truth. For the predicted key point offset field
Figure BDA0004128205380000105
the loss function is expressed as follows:
Figure BDA0004128205380000106
where
Figure BDA0004128205380000107
represents the area of the human body centered on the pixel p, and β represents the cutoff threshold (e.g., 1/9).
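A loss with a cutoff threshold β of this kind is commonly a smooth L1 loss. The sketch below assumes that form and assumes that the loss at each ground-truth center is divided by the human body area; neither assumption is confirmed by the text, since the formula itself is only available as a figure. The function names smooth_l1 and offset_loss are hypothetical.

```python
import numpy as np

def smooth_l1(diff, beta=1.0 / 9.0):
    """Quadratic below the cutoff beta, linear above it."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def offset_loss(pred, target, area, beta=1.0 / 9.0):
    """Assumed area-normalized smooth L1 loss over ground-truth centers.

    pred, target: (M, D) offsets read at the M ground-truth center pixels.
    area: (M,) area of the human body centered on each of those pixels.
    """
    per_center = smooth_l1(pred - target, beta).sum(axis=1) / area
    return float(per_center.mean())
```

Normalizing by area, if that is indeed the role of the area term, would make offset errors on large and small people comparable.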
OKS loss function. When predicting each human body posture, a local key point attraction field must be learned as a convolution kernel to correct the global key point thermodynamic diagram. Specifically, given the N_GT annotated ground-truth human body postures in a picture, the similarity score between each candidate key point in the local key point expansion window and the ground truth is calculated, giving the similarity score tensor
Figure BDA0004128205380000108
The tensor is then truncated with a threshold of 0.5, i.e.
Figure BDA0004128205380000109
Next, the truncated similarity score tensor is averaged over its first three dimensions to obtain a matching score against the annotated ground truth of each human body posture, and the ground-truth human body posture n* with the largest matching score is selected; its matching score is denoted by
Figure BDA00041282053800001012
Based on the selected ground-truth human body posture, a similarity score is calculated for each key point of the predicted human body posture, denoted by s_k (k ∈ [1, 17]). The loss function of the key point attraction field is therefore defined as:
Figure BDA00041282053800001010
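The matching step can be illustrated with the object keypoint similarity (OKS) as defined for COCO-style evaluation, exp(-d² / (2·area·κ²)) per key point. The sketch is a simplification: it scores one predicted posture instead of every candidate in the expansion window, it interprets truncation as zeroing scores below 0.5, and the per-key-point constants kappa are passed in by the caller. These choices and the function names are assumptions.

```python
import numpy as np

def keypoint_similarity(pred, gt, area, kappa):
    """Per-key-point similarity exp(-d^2 / (2 * area * kappa^2)).

    pred, gt: (K, 2) coordinates; area: object area; kappa: (K,) constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * area * kappa ** 2))

def match_ground_truth(pred, gts, areas, kappa, thresh=0.5):
    """Pick the ground-truth posture n* with the largest truncated mean score.

    gts: (N_GT, K, 2) annotated postures; areas: (N_GT,) their areas.
    Returns (n_star, s) where s holds the K similarity scores s_k for n*.
    """
    scores = np.stack([keypoint_similarity(pred, g, a, kappa)
                       for g, a in zip(gts, areas)])       # (N_GT, K)
    truncated = np.where(scores >= thresh, scores, 0.0)    # assumed truncation
    n_star = int(np.argmax(truncated.mean(axis=1)))
    return n_star, scores[n_star]
```

The scores s_k returned for the selected posture are the quantities that the key point attraction field loss is built from.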
Finally, the overall loss function of the whole network is defined as
Figure BDA00041282053800001011
where λ balances the values of the different loss terms; for example, λ is taken as 0.01. The defined loss function measures the difference between the predicted human body posture and the ground truth. This difference is used as an error signal, the partial derivatives with respect to all parameters in the convolution layers are computed by the back-propagation algorithm, and the parameters of the neural network are updated according to the result.
In a second aspect, the embodiment of the invention further provides a human body posture prediction device.
In an embodiment, referring to fig. 2, fig. 2 is a schematic functional block diagram of a human body posture predicting device according to an embodiment of the invention. As shown in fig. 2, the human body posture predicting apparatus includes:
the extraction module 10 is used for performing feature extraction on an input picture through a convolutional neural network to obtain a feature map;
the first generating module 20 is configured to obtain an offset of a human body key point, a thermodynamic diagram of the human body key point, and a central thermodynamic diagram of the human body key point based on the extracted feature map, obtain a predicted human body key point according to the offset of the human body key point and the central thermodynamic diagram of the human body key point, and generate a local key point expansion window with the predicted human body key point as a center;
a conversion module 30 for converting the local keypoint expansion window into a keypoint attractive field;
a second generating module 40, configured to generate a global key point expansion window by using the predicted human body key points and the human body key point thermodynamic diagram, and to perform a convolution operation on the global key point expansion window, using the key point attraction field as the convolution kernel, to obtain a corrected key point thermodynamic diagram;
a third generating module 50, configured to obtain the two-dimensional coordinates of the predicted human body joint points from the corrected key point thermodynamic diagram;
the encoding and decoding module 60 is configured to encode and decode respectively by using the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint points, so as to obtain the three-dimensional information of the predicted human body joint points.
Further, in an embodiment, the first generating module 20 is configured to:
reducing the dimension of the feature map in each of two parallel branches, and then applying a convolution layer with a 1×1 convolution kernel;
wherein one branch outputs the human body key point thermodynamic diagrams and the human body key point central thermodynamic diagram, denoted by
Figure BDA0004128205380000111
where H is a thermodynamic diagram and h and w depend on the sampling stride of the backbone network; the resulting low-resolution thermodynamic diagram is up-sampled by bilinear interpolation to obtain a high-resolution thermodynamic diagram, denoted by
Figure BDA0004128205380000112
and the other branch outputs the offsets of the 17 human body key points, denoted by
Figure BDA0004128205380000113
Further, in an embodiment, the first generating module 20 is configured to:
performing non-maximum suppression on the human body key point central thermodynamic diagram, and selecting the N highest-scoring points as candidate key point center points, wherein N is a positive integer;
obtaining N candidate human body postures from the candidate key point center points and the offsets of the human body key points, while removing human body postures whose human body key point central thermodynamic diagram score is smaller than a threshold value;
for the N candidate human body postures, taking the 17 key points of each candidate human body posture as predicted human body key points;
generating a local key point expansion window centered on each predicted human body key point
Figure BDA0004128205380000121
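The candidate generation described above can be sketched as follows. The 3×3 suppression neighborhood, the (K, 2, h, w) offset layout storing (dx, dy) from the center to each key point, and the function names heatmap_nms and decode_candidates are assumptions introduced for illustration.

```python
import numpy as np

def heatmap_nms(center):
    """Keep values that are the maximum of their 3x3 neighborhood, zero the rest."""
    h, w = center.shape
    padded = np.pad(center, 1, constant_values=-np.inf)
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)])
    return np.where(center >= neigh.max(axis=0), center, 0.0)

def decode_candidates(center, offsets, top_n, score_thresh):
    """Return (poses, scores): poses is (M, K, 2) in (x, y) order, M <= top_n.

    center: (h, w) central heatmap; offsets: (K, 2, h, w) assumed layout.
    """
    peaks = heatmap_nms(center)
    order = np.argsort(peaks.ravel())[::-1][:top_n]  # N highest-scoring points
    poses, scores = [], []
    for idx in order:
        y, x = np.unravel_index(idx, center.shape)
        if peaks[y, x] < score_thresh:               # drop low-score postures
            continue
        poses.append(np.array([x, y]) + offsets[:, :, y, x])
        scores.append(float(peaks[y, x]))
    if not poses:
        return np.zeros((0, offsets.shape[0], 2)), np.zeros(0)
    return np.stack(poses), np.array(scores)
```

Each row of a returned posture is then the center of one local key point expansion window.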
Further, in an embodiment, the conversion module 30 is configured to:
processing the feature map by using a convolution layer, batch normalization and a rectified linear unit to obtain a feature map of dimension C_3;
bilinearly interpolating, by using the local key point expansion window, from the feature map of dimension C_3 to obtain a local key point expansion window feature map;
processing the local key point expansion window feature map by using three different convolution layers, batch normalization and rectified linear units to obtain the key point attraction field.
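Sampling a window feature map by bilinear interpolation can be sketched as follows. The window size, the unit grid spacing, the clamping of coordinates at the border and the function names bilinear_sample and local_window_features are assumptions made for this sketch.

```python
import numpy as np

def bilinear_sample(feat, xs, ys):
    """Sample feat (C, H, W) at float coordinates xs, ys of identical shape.

    Coordinates are clamped to the map border. Returns (C, *xs.shape).
    """
    _, h, w = feat.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = xs - x0, ys - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def local_window_features(feat, keypoint_xy, size=3):
    """Sample a size x size window centered on a (possibly fractional) key point."""
    ax = np.arange(size) - (size - 1) / 2.0
    dy, dx = np.meshgrid(ax, ax, indexing="ij")
    return bilinear_sample(feat, keypoint_xy[0] + dx, keypoint_xy[1] + dy)
```

Because the key point coordinates are generally fractional, interpolation lets the window stay centered on the predicted location instead of snapping to the nearest pixel.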
Further, in an embodiment, the second generating module 40 is configured to:
generating a global key point expansion window for each predicted human body key point;
bilinearly interpolating, by using the generated global key point expansion window, from the feature map of dimension C_3 to obtain a global key point expansion window feature map;
re-weighting the global key point expansion window feature map with a Gaussian kernel to obtain a new global key point expansion window feature map;
and performing a convolution operation on the new global key point expansion window feature map, using the key point attraction field as the convolution kernel, so as to fuse local and global context information and obtain the corrected key point thermodynamic diagram.
Further, in an embodiment, the third generating module 50 is configured to:
selecting the first two positions with the highest scores in the corrected key point thermodynamic diagram for each predicted human body key point, then weighting the scores of the two positions, calculating a first product of the scores of the two weighted positions, multiplying the first product by the score of the corresponding candidate key point center point to obtain a second product, and taking the second product as the confidence score of the predicted human body key point;
and taking the position of the key point with the highest confidence score as the two-dimensional coordinate of the predicted human body joint point.
Further, in an embodiment, the codec module 60 is configured to:
performing spatial scale coding by taking the two-dimensional coordinates of the predicted human body joint points as key values and implicit vectors serving as indexes to obtain output characteristics;
encoding the output characteristics through a time scale to obtain an output array;
and obtaining the three-dimensional coordinates of the predicted human body joint point according to the output array regression.
The function implementation of each module in the human body posture prediction device corresponds to each step in the human body posture prediction method embodiment, and the function and implementation process of each module are not described in detail herein.
In a third aspect, embodiments of the present invention provide a human body posture prediction apparatus, which may be an apparatus having a data processing function such as a personal computer (personal computer, PC), a notebook computer, a server, or the like.
Referring to fig. 3, fig. 3 is a schematic hardware configuration diagram of a human body posture predicting apparatus according to an embodiment of the present invention. In an embodiment of the present invention, the human body posture prediction apparatus may include a processor 1001 (e.g., a central processing unit (CPU)), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components; the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface); the memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory, and the memory 1005 may alternatively be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration shown in fig. 3 does not limit the invention, and that the apparatus may include more or fewer components than shown, combine certain components, or arrange the components differently.
With continued reference to fig. 3, an operating system, a network communication module, a user interface module, and a human body posture prediction program may be included in a memory 1005, which is one type of computer storage medium in fig. 3. The processor 1001 may call a human body posture prediction program stored in the memory 1005, and execute the human body posture prediction method provided by the embodiment of the present invention.
In a fourth aspect, embodiments of the present invention also provide a readable storage medium.
The human body posture predicting program is stored on the readable storage medium, and when the human body posture predicting program is executed by the processor, the steps of the human body posture predicting method are realized.
The method implemented when the human body posture prediction program is executed may refer to various embodiments of the human body posture prediction method of the present invention, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising several instructions for causing a terminal device to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A human body posture prediction method, characterized in that the human body posture prediction method comprises:
performing feature extraction on an input picture through a convolutional neural network to obtain a feature map;
obtaining an offset of a human body key point, a human body key point thermodynamic diagram and a human body key point central thermodynamic diagram based on the extracted feature diagram, obtaining a predicted human body key point according to the offset of the human body key point and the human body key point central thermodynamic diagram, and generating a local key point expansion window by taking the predicted human body key point as a center;
converting the local key point expansion window into a key point attraction field;
generating a global key point expansion window by using the predicted human body key points and the human body key point thermodynamic diagram, and performing a convolution operation on the global key point expansion window, using the key point attraction field as the convolution kernel, to obtain a corrected key point thermodynamic diagram;
obtaining predicted two-dimensional coordinates of the human body joint point from the corrected key point thermodynamic diagram;
and respectively encoding and decoding by utilizing the space and time sequence information of the two-dimensional coordinates of the predicted human body joint point to obtain the three-dimensional information of the predicted human body joint point.
2. The human body posture prediction method of claim 1, wherein the step of obtaining the offset of the human body key point, the human body key point thermodynamic diagram and the human body key point center thermodynamic diagram based on the extracted feature map comprises:
reducing the dimension of the feature map in each of two parallel branches, and then applying a convolution layer with a 1×1 convolution kernel;
wherein one branch outputs the human body key point thermodynamic diagrams and the human body key point central thermodynamic diagram, denoted by
Figure FDA0004128205370000011
where H is a thermodynamic diagram and h and w depend on the sampling stride of the backbone network; the resulting low-resolution thermodynamic diagram is up-sampled by bilinear interpolation to obtain a high-resolution thermodynamic diagram, denoted by
Figure FDA0004128205370000012
and the other branch outputs the offsets of the 17 human body key points, denoted by
Figure FDA0004128205370000013
3. The human body posture prediction method of claim 2, wherein the step of obtaining predicted human body key points according to the offset of the human body key points and the human body key point central thermodynamic diagram, and generating the local key point expansion window by taking the predicted human body key points as the center comprises:
performing non-maximum suppression on the human body key point central thermodynamic diagram, and selecting the N highest-scoring points as candidate key point center points, wherein N is a positive integer;
obtaining N candidate human body postures from the candidate key point center points and the offsets of the human body key points, while removing human body postures whose human body key point central thermodynamic diagram score is smaller than a threshold value;
for the N candidate human body postures, taking the 17 key points of each candidate human body posture as predicted human body key points;
generating a local key point expansion window centered on each predicted human body key point
Figure FDA0004128205370000021
4. The human body posture prediction method of claim 3, wherein the step of converting the local key point expansion window into a key point attraction field comprises:
processing the feature map by using a convolution layer, batch normalization and a rectified linear unit to obtain a feature map of dimension C_3;
bilinearly interpolating, by using the local key point expansion window, from the feature map of dimension C_3 to obtain a local key point expansion window feature map;
processing the local key point expansion window feature map by using three different convolution layers, batch normalization and rectified linear units to obtain the key point attraction field.
5. The human body posture prediction method of claim 4, wherein the step of generating a global key point expansion window by using the predicted human body key points and the human body key point thermodynamic diagram, and performing a convolution operation on the global key point expansion window, using the key point attraction field as the convolution kernel, to obtain a corrected key point thermodynamic diagram comprises:
generating a global key point expansion window for each predicted human body key point;
bilinearly interpolating, by using the generated global key point expansion window, from the feature map of dimension C_3 to obtain a global key point expansion window feature map;
re-weighting the global key point expansion window feature map with a Gaussian kernel to obtain a new global key point expansion window feature map;
and performing a convolution operation on the new global key point expansion window feature map, using the key point attraction field as the convolution kernel, so as to fuse local and global context information and obtain the corrected key point thermodynamic diagram.
6. The human body posture prediction method of claim 5, wherein the step of obtaining the two-dimensional coordinates of the predicted human body joint point from the corrected key point thermodynamic diagram comprises:
selecting the first two positions with the highest scores in the corrected key point thermodynamic diagram for each predicted human body key point, then weighting the scores of the two positions, calculating a first product of the scores of the two weighted positions, multiplying the first product by the score of the corresponding candidate key point center point to obtain a second product, and taking the second product as the confidence score of the predicted human body key point;
and taking the position of the key point with the highest confidence score as the two-dimensional coordinate of the predicted human body joint point.
7. The human body posture prediction method of claim 6, wherein the step of encoding and decoding respectively by using the spatial and temporal information of the two-dimensional coordinates of the predicted human body joint point to obtain the three-dimensional information of the predicted human body joint point comprises:
performing spatial scale coding by taking the two-dimensional coordinates of the predicted human body joint points as key values and implicit vectors serving as indexes to obtain output characteristics;
encoding the output characteristics through a time scale to obtain an output array;
and obtaining the three-dimensional coordinates of the predicted human body joint point according to the output array regression.
8. A human body posture predicting device, characterized in that the human body posture predicting device comprises:
the extraction module is used for performing feature extraction on the input picture through a convolutional neural network to obtain a feature map;
the first generation module is used for obtaining the offset of the human body key points, the thermodynamic diagram of the human body key points and the central thermodynamic diagram of the human body key points based on the extracted feature diagram, obtaining predicted human body key points according to the offset of the human body key points and the central thermodynamic diagram of the human body key points, and generating a local key point expansion window by taking the predicted human body key points as the center;
the conversion module is used for converting the local key point expansion window into a key point attraction field;
the second generation module is used for generating a global key point expansion window by using the predicted human body key points and the human body key point thermodynamic diagram, and performing a convolution operation on the global key point expansion window, using the key point attraction field as the convolution kernel, to obtain a corrected key point thermodynamic diagram;
the third generation module is used for obtaining the two-dimensional coordinates of the predicted human body joint point from the corrected key point thermodynamic diagram;
and the encoding and decoding module is used for encoding and decoding by utilizing the space and time sequence information of the two-dimensional coordinates of the predicted human body joint point respectively to obtain the three-dimensional information of the predicted human body joint point.
9. A human body posture prediction device, characterized in that it comprises a processor, a memory, and a human body posture prediction program stored on the memory and executable by the processor, wherein the human body posture prediction program, when executed by the processor, implements the steps of the human body posture prediction method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that a human body posture prediction program is stored on the readable storage medium, wherein the human body posture prediction program, when executed by a processor, implements the steps of the human body posture prediction method according to any one of claims 1 to 7.
CN202310251993.9A 2023-03-13 2023-03-13 Human body posture prediction method, device, equipment and readable storage medium Pending CN116363750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310251993.9A CN116363750A (en) 2023-03-13 2023-03-13 Human body posture prediction method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310251993.9A CN116363750A (en) 2023-03-13 2023-03-13 Human body posture prediction method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116363750A true CN116363750A (en) 2023-06-30

Family

ID=86915571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310251993.9A Pending CN116363750A (en) 2023-03-13 2023-03-13 Human body posture prediction method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116363750A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631010A (en) * 2023-07-17 2023-08-22 粤港澳大湾区数字经济研究院(福田) Interactive key point detection method and related device
CN116631010B (en) * 2023-07-17 2023-10-31 粤港澳大湾区数字经济研究院(福田) Interactive key point detection method and related device
CN116645699A (en) * 2023-07-27 2023-08-25 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium
CN116645699B (en) * 2023-07-27 2023-09-29 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN112926396B (en) Action identification method based on double-current convolution attention
CN116363750A (en) Human body posture prediction method, device, equipment and readable storage medium
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN110942471A (en) Long-term target tracking method based on space-time constraint
CN111507184B (en) Human body posture detection method based on parallel cavity convolution and body structure constraint
CN115205336A (en) Feature fusion target perception tracking method based on multilayer perceptron
CN114529793A (en) Depth image restoration system and method based on gating cycle feature fusion
CN117523645B (en) Face key point detection method and device, electronic equipment and storage medium
Shi et al. Lightweight Context-Aware Network Using Partial-Channel Transformation for Real-Time Semantic Segmentation
CN114550014A (en) Road segmentation method and computer device
CN111753670A (en) Human face overdividing method based on iterative cooperation of attention restoration and key point detection
CN115471718A (en) Construction and detection method of lightweight significance target detection model based on multi-scale learning
CN114913541A (en) Human body key point detection method, device and medium based on orthogonal matching pursuit
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN117237858B (en) Loop detection method
Niu et al. Designing compact convolutional filters for lightweight human pose estimation
CN113343762B (en) Human body posture estimation grouping model training method, posture estimation method and device
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN116030347B (en) High-resolution remote sensing image building extraction method based on attention network
CN116486101B (en) Image feature matching method based on window attention
CN117392180B (en) Interactive video character tracking method and system based on self-supervision optical flow learning
CN110458092B (en) Face recognition method based on L2 regularization gradient constraint sparse representation
Zhang et al. Satellite component tracking and segmentation based on position information encoding
CN116978124A (en) Lip language identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination