CN112633220A - Human body posture estimation method based on bidirectional serialization modeling - Google Patents

Human body posture estimation method based on bidirectional serialization modeling

Info

Publication number
CN112633220A
CN112633220A (application CN202011610311.1A)
Authority
CN
China
Prior art keywords
human body
network
posture
attitude
posture estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011610311.1A
Other languages
Chinese (zh)
Other versions
CN112633220B (en)
Inventor
刘振广 (Liu Zhenguang)
封润洋 (Feng Runyang)
陈豪明 (Chen Haoming)
王勋 (Wang Xun)
钱鹏 (Qian Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN202011610311.1A
Publication of CN112633220A
Application granted
Publication of CN112633220B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input and fully exploits the temporal information of the video: it first computes the approximate spatial range of each joint and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in human body posture estimation and giving the model stronger generalization and higher accuracy. By fully exploiting video temporal information, the method strengthens the reasoning ability of the model and locates key body parts more reliably, which is of practical importance in industries that must extract postures in real time for analysis, such as security surveillance and short-video platforms.

Description

Human body posture estimation method based on bidirectional serialization modeling
Technical Field
The invention belongs to the technical field of human body posture estimation, and in particular relates to a human body posture estimation method based on bidirectional serialization modeling.
Background
Human body posture estimation is an active research area in computer vision that aims to locate key body parts (such as wrists and ankles) in pictures or videos. As a bridge between machines and people, it has great practical significance and is widely applied in many fields. In stage animation, recognizing a performer's posture enables real-time interactive animation effects; in automatic driving, predicting the motion trend of pedestrians can help avoid traffic accidents in advance; in security surveillance, abnormal behaviors can be detected by recognizing specific posture sequences.
Current human body posture estimation methods fall into two main categories. (1) Top-down: all human positions in the picture are detected first, usually marking each person with a rectangular bounding box; a joint detector then identifies the joints of each person, and the cropped posture information is mapped back to the original picture by affine transformation, yielding posture estimates for every person in the picture. Because top-down methods separate person detection from joint detection and concentrate on the posture estimation itself, they achieve high accuracy; however, their running time grows with the number of people in the picture, they depend on an object detection stage, and the quality of the detected bounding boxes directly affects the final result. (2) Bottom-up: the joint positions of all people in the picture are detected first, and the joint coordinates belonging to the same person are then clustered, giving posture estimates for everyone in the picture. Bottom-up methods are efficient and their running time is largely independent of the number of people in the picture, but their accuracy lags slightly behind.
Mainstream human body posture estimation methods, both top-down and bottom-up, use network architectures designed for static pictures and excel at single-frame estimation. At 25 frames per second, each frame lasts only 1/25 of a second, so consecutive video frames change very little and are highly similar. This rich geometric consistency between adjacent frames provides extra cues that can be used to correct keypoints that are otherwise hard to predict, such as those affected by occlusion or motion blur.
Traditional image-based posture estimation methods cannot exploit this extra information effectively, so they struggle with the frequent entanglement, mutual occlusion, and motion blur of people in video sequences and rarely achieve good results on video posture estimation. To address this problem, the document [Flowing ConvNets for Human Pose Estimation in Videos - Pfister, T., Charles, J. & Zisserman, A. (ICCV 2015)] proposes computing dense optical flow between every two frames and then correcting the initial pose estimate using flow-based temporal information. When the optical flow can be computed correctly, this method performs well; however, optical flow is strongly affected by picture quality, occlusion, and similar factors, cannot be computed accurately everywhere in a video, and its computation is expensive. Other researchers propose modeling the video directly with a Long Short-Term Memory network (LSTM) to capture temporal information; owing to the structural limitations of the LSTM, this approach works well only when the people in a video frame are sparse, and in complex scenes it still cannot handle occlusion or motion blur.
Disclosure of Invention
In view of the above, the present invention provides a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input, fully exploits the temporal information of the video to compute the approximate spatial range of each joint, and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in human body posture estimation and giving the model stronger generalization and higher accuracy.
A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for human body posture estimation and preprocessing it;
(2) for each complete video in the data set, taking every 3 consecutive frames as one group of samples and manually labeling the coordinates of each key body part in the video images;
(3) constructing a bidirectional continuous convolutional neural network and training it with a large number of samples to obtain a human body posture estimation model;
(4) inputting the 3 consecutive frames to be estimated into the human body posture estimation model and outputting the posture estimation result for the person in the 2nd frame, namely the coordinates of each key body part.
Further, in step (1), for each frame of video image in the video data set, the position coordinates of each human body ROI (region of interest, i.e. the person's bounding box) are detected with the YOLOv5 algorithm, and each ROI is enlarged by 25%.
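As an illustration, the 25% ROI enlargement can be sketched as below. The function name is hypothetical, and whether the enlargement is applied about the box center (as assumed here) or per side is not specified in the text.

```python
def enlarge_roi(box, scale=1.25, img_w=None, img_h=None):
    """Enlarge an (x1, y1, x2, y2) person bounding box about its center.

    scale=1.25 corresponds to the 25% enlargement described above; anchoring
    the enlargement at the box center is an assumption of this sketch.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    nx1, ny1 = cx - w / 2.0, cy - h / 2.0
    nx2, ny2 = cx + w / 2.0, cy + h / 2.0
    if img_w is not None:  # optionally clamp to the image bounds
        nx1, nx2 = max(0.0, nx1), min(float(img_w), nx2)
    if img_h is not None:
        ny1, ny2 = max(0.0, ny1), min(float(img_h), ny2)
    return (nx1, ny1, nx2, ny2)
```

The enlarged box is then used to crop the same person from the preceding and following frames as well.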
Further, the bidirectional continuous convolutional neural network consists of a Backbone network, a posture temporal merging network, a posture residual fusion network, and a posture correction network. The Backbone network preliminarily computes the human posture feature vectors $h_{i-1}$, $h_i$, $h_{i+1}$ for the three frames of an input sample; the three feature vectors are concatenated into a vector $\phi(h)$, which is fed to both the posture temporal merging network and the posture residual fusion network. The posture temporal merging network encodes the approximate spatial range of each joint of the human body, producing a feature vector $\xi(h)$; the posture residual fusion network computes the posture residual vector $\psi(h)$ of the human body; and the feature vector $\eta$ obtained by stacking $\xi(h)$ and $\psi(h)$ is fed to the posture correction network, which computes the human body posture prediction result.
Further, the posture temporal merging network is formed by stacking three residual blocks (Residual Blocks); the vector $\phi(h)$ is regrouped by joint order and used as the network input, and the feature vector $\xi(h)$ is output. The posture residual fusion network is formed by stacking five residual blocks: the posture feature vectors of the second and first frames, and of the second and third frames, of a sample are differenced, and the differences are concatenated with weights to obtain the tensor $\zeta$, which is used as the network input; the posture residual vector $\psi(h)$ is output. The tensor $\zeta$ is given by:
$$\zeta = w_1\,(h_i - h_{i-1}) \oplus w_2\,(h_i - h_{i+1})$$
where $\oplus$ denotes channel-wise concatenation and $w_1$, $w_2$ are the concatenation weights.
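The difference-and-weighted-concatenation step described above can be sketched as follows. The function name, the scalar weights `w1`/`w2`, and the channel-first layout are assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def prf_input(h_prev, h_cur, h_next, w1=1.0, w2=1.0):
    """Build the posture residual fusion input: difference the current-frame
    features against the previous and next frames, then concatenate the two
    weighted residuals along the channel axis."""
    d_prev = h_cur - h_prev   # residual between frame i and frame i-1
    d_next = h_cur - h_next   # residual between frame i and frame i+1
    return np.concatenate([w1 * d_prev, w2 * d_next], axis=0)
```

With per-frame features of shape (17, 96, 72), this yields a (34, 96, 72) tensor; the 128-channel $\psi(h)$ is then produced by the five residual blocks.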
Further, each residual block consists of a 3 × 3 convolutional layer, a batch normalization layer, and a ReLU activation layer connected in sequence. The residual blocks in the posture temporal merging network use grouped convolution with groups = 17 (following the COCO keypoint standard, which defines 17 keypoints); the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. groups = 1.
Furthermore, the posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12, and 15. Each deformable convolution takes the stacked feature vector $\eta$ (formed from $\xi(h)$ and $\psi(h)$) as input and outputs a predicted Gaussian heatmap; the five Gaussian heatmaps output by the five convolutions are averaged to obtain the human body posture prediction result.
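The five-branch, average-the-heatmaps structure can be sketched with ordinary dilated convolutions. This is a simplification: the patent's branches are deformable convolutions, which additionally learn per-pixel kernel offsets; here the offsets are dropped to keep the sketch dependency-free, and only the dilation/averaging logic is shown.

```python
import numpy as np

def dilated_conv3x3(x, kernel, dilation):
    """Naive single-channel 3x3 dilated convolution with 'same' zero padding."""
    h, w = x.shape
    xp = np.pad(x, dilation)                 # zero-pad by the dilation amount
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            for ki in range(3):
                for kj in range(3):
                    out[i, j] += kernel[ki, kj] * xp[i + ki * dilation,
                                                     j + kj * dilation]
    return out

def posture_correction(eta_map, kernels, dilations=(3, 6, 9, 12, 15)):
    """Run the five parallel branches and average their predicted heatmaps."""
    preds = [dilated_conv3x3(eta_map, k, d) for k, d in zip(kernels, dilations)]
    return np.mean(preds, axis=0)
```

Larger dilation rates sample taps farther apart, so each branch sees a different receptive field; averaging fuses the global and local views.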
Further, training the bidirectional continuous convolutional neural network in step (3) proceeds in two stages: first the Backbone network is trained; then the Backbone parameters are fixed and the posture temporal merging network, posture residual fusion network, and posture correction network are trained.
Further, the Backbone network is trained as follows: the human body ROIs in all video images of a sample are fed into the Backbone network one by one, the loss function L1 between the human body posture prediction output by the whole bidirectional continuous convolutional neural network and the sample's manual annotations is computed, and the Backbone parameters are repeatedly updated by back-propagating L1 until it converges. The expression of the loss function L1 is:
$$L_1 = \frac{1}{N}\sum_{i=1}^{N} v_i \left\| H_{pred\_i} - H_{gt\_i} \right\|_2^2$$
wherein: n is the number of labeled human body key parts, Hgt_iTransforming the coordinates of manually marked ith key part of ROI in a group of samples to generate the result of superposition of Gaussian heatmaps Hpred_iGenerating a Gaussian heatmap superimposed result by transforming coordinates of the ith key part of the ROI of all human bodies in a group of samples and predicting output by a bidirectional continuous convolutional neural network2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
Further, the posture temporal merging network, posture residual fusion network, and posture correction network are trained as follows: the trained Backbone parameters are fixed; the human body ROIs in all video images of a sample are fed into the Backbone network one by one; the loss function L2 between the human body posture prediction output by the whole bidirectional continuous convolutional neural network and the sample's manual annotations is computed; and the parameters of the posture temporal merging network, posture residual fusion network, and posture correction network are repeatedly updated by back-propagating L2 until it converges. The expression of the loss function L2 is:
$$L_2 = \frac{1}{N}\sum_{i=1}^{N} v_i \left\| G_{pred\_i} - G_{gt\_i} \right\|_2^2$$
wherein: n is the number of labeled human body key parts, Ggt_iArtificially marking coordinates of the ith key part of human ROI in the 2 nd frame video image of a group of samples to generate a Gaussian heat map G through conversionpred_iPredicting a Gauss heatmap generated by transformation of output coordinates by a bi-directional continuous convolutional neural network for an ith key part of a human ROI in a 2 nd frame video image of a set of samples, | |2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
Further, step (4) is implemented as follows: the human body ROIs of the same person in the 3 consecutive frames to be estimated are input into the human body posture estimation model, which outputs a Gaussian heatmap; the heatmap is transformed to obtain the coordinates of the person's key parts in the 2nd frame; the coordinates are mapped back into the 2nd frame, and the key parts are linked in order to generate the predicted human skeleton, thereby realizing human body posture estimation.
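The heatmap-to-coordinate transformation can be sketched as a per-joint argmax. Real decoders usually add sub-pixel refinement; plain argmax is the simplification used here, and the function name is hypothetical.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Recover one (x, y) coordinate per joint as the argmax of its heatmap.

    heatmaps: array of shape (num_joints, H, W) of Gaussian responses.
    """
    coords = []
    for hm in heatmaps:                             # hm has shape (H, W)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords
```

The resulting coordinates are in heatmap space and still need to be mapped back through the crop/resize transform into the original picture.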
The human body posture estimation method based on bidirectional continuity of the present invention mainly uses deformable convolutional networks with different dilation rates as the prediction model. A deformable convolutional network is a variant of the traditional convolutional neural network: traditional convolution kernels are square, while ordinary objects such as human bodies are not, so traditional convolution has inherent limitations; a deformable convolutional network learns an offset for each pixel of the kernel and can therefore form kernels of arbitrary shape, adapting better to objects of various shapes. Each convolutional layer uses a different dilation rate, corresponding to a different receptive field: the larger the dilation rate, the larger the receptive field and the more global the captured information, while a smaller dilation rate captures finer local information. This design of the deformable convolutional network is therefore well suited to estimating human postures in video.
The invention fully exploits the temporal information of the video, strengthens the reasoning ability of the model, and locates key body parts more reliably, which is of practical importance in industries that must extract postures in real time for analysis, such as security surveillance and short-video platforms. Its beneficial technical effects are as follows:
1. Through an accurate posture estimation algorithm, the method better estimates keypoints in occluded and motion-blurred images, and its detection is both more accurate and faster.
2. The method is designed for video and fits a wide range of application scenarios; by adopting grouped convolution, dilated convolution, and similar techniques, it achieves better results with fewer parameters, so posture estimation can be applied in real time.
Drawings
FIG. 1 is a flowchart illustrating a human body posture estimation method according to the present invention.
FIG. 2 is a schematic diagram of a Residual Block structure and its stacking method.
FIG. 3 is a schematic structural diagram of a bi-directional continuous convolutional neural network according to the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
As shown in FIG. 1, the human body posture estimation method based on bidirectional continuity of the invention comprises the following steps:
(1) Collecting a human body posture estimation video data set and preprocessing it.
This embodiment uses the PoseTrack data set as training data. The data set targets human posture tracking; its people are often occluded and many of its videos exhibit motion blur, which greatly increases the difficulty of video human posture estimation. Since this embodiment is a top-down method, the data set must be preprocessed: the bounding box of each person in the frame to be estimated is first detected with the YOLOv5 detection algorithm, then each bounding box is enlarged by 25% and used to crop the preceding and following frames, yielding three images of the same person.
(2) Constructing a bidirectional continuous convolutional neural network model as the human body posture estimation model.
As shown in fig. 3, the bidirectional continuous convolutional neural network (DCPose) mainly consists of the following parts: a Backbone network module, a posture temporal merging module PTM, a posture residual fusion module PRF, and a posture correction network module PCN. In this embodiment, the Backbone module adopts the high-resolution network HRNet to preliminarily compute the human postures in the three input pictures, producing the feature vectors $h_{i-1}$, $h_i$, $h_{i+1}$; the three vectors are concatenated into a vector $\phi(h)$ and fed into two parallel branches. The posture temporal merging module encodes the approximate spatial range $\xi(h)$ of each joint, the posture residual fusion module produces the posture residual vector $\psi(h)$, and the feature vector $\eta$ formed by stacking $\xi(h)$ and $\psi(h)$ is then fed into the posture correction network to obtain the final posture prediction.
The posture temporal merging module consists of three stacked residual blocks (Residual Blocks). A group of samples passes through the Backbone network to obtain the feature vector $\phi(h)$, which is regrouped by joint order and used as the module input, outputting the feature vector $\xi(h)$. Each residual block uses grouped convolution with groups = 17 (following the COCO keypoint standard, which defines 17 keypoints).
The posture residual fusion module consists of five stacked residual blocks. The posture feature vectors of the first and second frames, and of the third and second frames, of the sample group are differenced; the differences are concatenated with weights to obtain the tensor $\zeta$, which is used as the module input, and the posture residual vector $\psi(h)$ is output. The tensor $\zeta$ can be formalized as:
$$\zeta = w_1\,(h_i - h_{i-1}) \oplus w_2\,(h_i - h_{i+1})$$
where $\oplus$ denotes channel-wise concatenation and $w_1$, $w_2$ are the concatenation weights.
As shown in fig. 2, each residual block consists of a 3 × 3 convolutional layer, a batch normalization layer, and a ReLU activation layer. The difference is that the groups parameter of the convolutional layers in the three residual blocks of the PTM module is 17, while the corresponding PRF module does not use grouped convolution, its convolutional layers having groups = 1.
The posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12, and 15. Each deformable convolution takes the stacked feature vector $\eta$ as input and outputs a predicted Gaussian heatmap; finally the five heatmaps are averaged to obtain the final prediction.
(3) Inputting the data preprocessed in step (1) into the model, and updating the parameters and training the model with the L2 distance as the loss function.
DCPose adopts a two-stage training scheme: the Backbone network is trained first, then its parameters are fixed and the remaining networks are trained.
In DCPose, each frame of the video is treated in turn as the current frame to be estimated; together with one frame taken before it and one after it, the video is divided into sub-picture sequences of length 3. Each sub-picture sequence carries label information for the keypoint positions of all human bodies, and each divided sub-picture sequence is used as an input to DCPose.
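The windowing step above can be sketched as follows. Clamping at the video boundaries (duplicating the first/last frame there) is an assumption of this sketch, as the text does not state how the ends of the video are handled.

```python
def make_subsequences(frames):
    """Center a length-3 window on every frame of the video.

    At the boundaries the missing neighbour is replaced by the nearest
    existing frame, keeping every window at length 3.
    """
    n = len(frames)
    return [(frames[max(i - 1, 0)], frames[i], frames[min(i + 1, n - 1)])
            for i in range(n)]
```

Each returned triple (previous, current, next) is one DCPose input; only the middle frame's posture is estimated.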
The Backbone network first loads official pre-trained model parameters, then takes a group of sub-picture sequences as input and outputs posture feature vectors; the mean square error against the ground-truth posture vectors gives the loss value of each frame. The expression of the loss function L is:
$$L = \frac{1}{N}\sum_{i=1}^{N} v_i \left\| H_{pred\_i} - H_{gt\_i} \right\|_2^2$$
wherein: hgt_iThe result obtained by superposing Gaussian heatmaps generated by converting real coordinates of the ith key part of all people in the subsequence is Hpred_iSuperimposed gaussian heatmap results generated for coordinate transformation of all person's ith key site predictions in subsequences |2Represents L2 norm, N is the number of key parts marked by human body, viAnd whether the coordinate has a label is shown, if so, the value is 1, otherwise, the value is 0.
After the Backbone network is trained, its parameters are fixed and each sub-picture sequence is input into the DCPose network. The Backbone network yields the posture feature vector $\phi(h)$ with dimensions [4, 51, 96, 72]; the PTM network then yields the feature vector $\xi(h)$ with dimensions [4, 17, 96, 72], and the PRF network yields the feature vector $\psi(h)$ with dimensions [4, 128, 96, 72]. The vector $\eta$ obtained by stacking $\xi(h)$ and $\psi(h)$, with dimensions [4, 145, 96, 72], is then fed into the PCN network; each deformable convolutional layer outputs a posture feature vector with dimensions [4, 17, 96, 72], and the final Gaussian heatmap is obtained by averaging the 5 different posture feature vectors.
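The channel bookkeeping quoted above can be checked with a few dummy tensors; only the dimensions are taken from the description, and the helper name is hypothetical.

```python
import numpy as np

def assemble_pcn_input(xi, psi):
    """Stack the PTM output (17 channels) and the PRF output (128 channels)
    along the channel axis to form the 145-channel PCN input."""
    return np.concatenate([xi, psi], axis=1)

# Dummy tensors with the batch/spatial sizes quoted in the description.
xi = np.zeros((4, 17, 96, 72))    # xi(h) from the PTM network
psi = np.zeros((4, 128, 96, 72))  # psi(h) from the PRF network
eta = assemble_pcn_input(xi, psi) # eta: (4, 145, 96, 72), since 17 + 128 = 145
```

The 51 Backbone channels correspond to 3 frames times 17 joints, while the 17-channel PTM output keeps one channel per joint.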
DCPose training mainly adopts the L2 loss. In each picture sequence fed into the bidirectional continuous convolutional neural network, only the 2nd frame actually needs its posture estimated, so no loss is computed for frames 1 and 3. The loss for the 2nd frame is computed essentially as in the Backbone training; the only difference is that $H_{gt\_i}$ is the Gaussian heatmap generated by transforming the ground-truth coordinates of the $i$-th key part of the person in the 2nd frame of the sample, and $H_{pred\_i}$ is the Gaussian heatmap generated by transforming the predicted coordinates of that key part in the 2nd frame. By fully exploiting the bidirectional information of the preceding and following frames, the network attains more accurate prediction.
(4) After model training is finished, the test set is input and the human body posture estimation result is output. The specific process is as follows:
4.1 the test set is input into the trained model to obtain the Gaussian heatmap of each frame.
4.2 A Gaussian-heatmap coordinate conversion algorithm computes the coordinates of the key body parts from the final heatmaps of step 4.1; the coordinates are then mapped back to the original picture to obtain the key-part positions, and finally the key parts are linked in order to generate the predicted human skeleton, achieving the goal of human body posture estimation.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the present invention.

Claims (10)

1. A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for human body posture estimation and preprocessing it;
(2) for each complete video in the data set, taking every 3 consecutive frames as one group of samples and manually labeling the coordinates of each key body part in the video images;
(3) constructing a bidirectional continuous convolutional neural network and training it with a large number of samples to obtain a human body posture estimation model;
(4) inputting the 3 consecutive frames to be estimated into the human body posture estimation model and outputting the posture estimation result for the person in the 2nd frame, namely the coordinates of each key body part.
2. The human body posture estimation method according to claim 1, characterized in that: in step (1), for each frame of video image in the video data set, the position coordinates of each human body ROI in the image are detected with the YOLOv5 algorithm, and each ROI is enlarged by 25%.
3. The human body posture estimation method according to claim 1, characterized in that: the bidirectional continuous convolutional neural network consists of a Backbone network, a posture temporal merging network, a posture residual fusion network, and a posture correction network; the Backbone network preliminarily computes the human posture feature vectors $h_{i-1}$, $h_i$, $h_{i+1}$ for the three frames of an input sample; the three feature vectors are concatenated into a vector $\phi(h)$, which is fed to both the posture temporal merging network and the posture residual fusion network; the posture temporal merging network encodes the approximate spatial range of each joint of the human body to obtain a feature vector $\xi(h)$; the posture residual fusion network computes the posture residual vector $\psi(h)$ of the human body; and the feature vector $\eta$ obtained by stacking $\xi(h)$ and $\psi(h)$ is fed to the posture correction network, which computes the human body posture prediction result.
4. The human body posture estimation method according to claim 3, characterized in that: the posture temporal merging network is formed by stacking three residual blocks; the vector φ(h) is regrouped in joint order and used as the network input, and the feature vector ξ(h) is output. The posture residual fusion network is formed by stacking five residual blocks; first, the difference between the posture feature vectors of the 2nd and 1st frames of a sample and the difference between those of the 2nd and 3rd frames are computed, and these are cascaded with weights into a tensor ζ, which serves as the network input; the posture residual vector ψ(h) is output. The specific expression of the tensor ζ is as follows:
Figure FDA0002872121240000011
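The exact expression for ζ survives only as a formula image in the source. Based on the description above (residuals of the 2nd frame's features against the 1st and 3rd, cascaded with weights), a hedged numpy sketch, with the weights w1 and w2 as placeholders since the patent's actual values are not recoverable from the text:

```python
import numpy as np

def build_zeta(h_prev, h_mid, h_next, w1=0.5, w2=0.5):
    """Assumed reading of claim 4: take the residuals of the middle frame's
    posture features against its neighbours and concatenate them with
    weights. w1, w2 are illustrative placeholders only."""
    r1 = h_mid - h_prev          # 2nd frame minus 1st frame
    r2 = h_mid - h_next          # 2nd frame minus 3rd frame
    return np.concatenate([w1 * r1, w2 * r2], axis=0)
```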
5. The human body posture estimation method according to claim 4, characterized in that: each residual block is formed by sequentially connecting a 3×3 convolutional layer, a batch normalization layer, and a ReLU activation layer. The residual blocks in the posture temporal merging network use grouped convolution with the number of groups set to 17; the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. the number of groups is 1.
6. The human body posture estimation method according to claim 3, characterized in that: the posture correction network consists of five parallel deformable convolutions with dilation rates of 3, 6, 9, 12 and 15, respectively. Each deformable convolution takes the result of stacking the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap; the five Gaussian heatmaps output by the five convolutions are averaged to obtain the human body posture prediction result.
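Deformable convolution requires learned sampling offsets; as a simplified stand-in, the sketch below uses plain dilated convolutions (dilation without offsets) to illustrate the multi-rate parallel-branch-and-average structure of the correction network. The kernels here are assumptions for illustration, not the patent's trained weights:

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """Naive 'same'-padded 2D convolution of image x with kernel k at
    dilation rate d (plain dilated convolution, standing in for the
    deformable convolutions of claim 6)."""
    kh, kw = k.shape
    ph, pw = d * (kh // 2), d * (kw // 2)
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i * d:i * d + x.shape[0],
                                j * d:j * d + x.shape[1]]
    return out

def correction_head(x, kernels, rates=(3, 6, 9, 12, 15)):
    """Average the heatmaps produced by five parallel branches, one per
    dilation rate, mirroring the averaging step of the posture
    correction network."""
    return np.mean([dilated_conv2d(x, k, d)
                    for k, d in zip(kernels, rates)], axis=0)
```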
7. The human body posture estimation method according to claim 1, characterized in that: the training of the bidirectional continuous convolutional neural network in step (3) is divided into two stages: first the Backbone network is trained; then the Backbone network parameters are fixed, and the posture temporal merging network, the posture residual fusion network and the posture correction network are trained.
8. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the Backbone network is as follows: the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L1 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed, and the Backbone network parameters are repeatedly updated through backpropagation according to L1 until L1 converges. The expression of the loss function L1 is as follows:
L1 = Σ_{i=1}^{N} v_i · ||H_pred_i − H_gt_i||_2
wherein: n is the number of labeled human body key parts, Hgt_iTransforming the coordinates of manually marked ith key part of ROI in a group of samples to generate the result of superposition of Gaussian heatmaps Hpred_iConverting coordinates predicted and output by all human body ROI (region of interest) key parts in a group of samples through a bidirectional continuous convolutional neural network to generate a result after superposition of Gaussian heatmaps, | | | | sweet wind2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
9. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the posture temporal merging network, the posture residual fusion network and the posture correction network is as follows: first, the trained Backbone network parameters are fixed; then the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L2 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed, and the parameters of the posture temporal merging network, the posture residual fusion network and the posture correction network are repeatedly updated through backpropagation according to L2 until L2 converges. The expression of the loss function L2 is as follows:
L2 = Σ_{i=1}^{N} v_i · ||G_pred_i − G_gt_i||_2
wherein: n is the number of labeled human body key parts, Ggt_iArtificially marking coordinates of the ith key part of human ROI in the 2 nd frame video image of a group of samples to generate a Gaussian heat map G through conversionpred_iPredicting a Gauss heat map generated by converting output coordinates of an ith key part of a human body ROI in a 2 nd frame video image of a group of samples through a bidirectional continuous convolution neural network, | | | | | ventilation2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
10. The human body posture estimation method according to claim 1, characterized in that: the specific implementation of step (4) is as follows: the human body ROIs of the same person in the 3 consecutive frames of video images to be estimated are input into the human body posture estimation model, which outputs a Gaussian heatmap; the heatmap is converted to obtain the coordinates of the key parts of that person in the 2nd frame; the coordinates are mapped back into the 2nd frame, and the key parts are linked in sequence to generate a predicted human skeleton, thereby realizing human body posture estimation.
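The heatmap-to-coordinate conversion in claim 10 is typically an argmax over each keypoint's heatmap; the claim does not name the method, so this is a common-practice sketch rather than the patent's exact procedure:

```python
import numpy as np

def heatmaps_to_coords(heatmaps):
    """heatmaps: (N, H, W) array -> list of (x, y) pixel coordinates,
    one per key part, taken at each heatmap's peak."""
    coords = []
    for hm in heatmaps:
        r, c = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(c), int(r)))   # (x, y) = (column, row)
    return coords
```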
CN202011610311.1A 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling Active CN112633220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011610311.1A CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011610311.1A CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Publications (2)

Publication Number Publication Date
CN112633220A true CN112633220A (en) 2021-04-09
CN112633220B CN112633220B (en) 2024-01-09

Family

ID=75286799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011610311.1A Active CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Country Status (1)

Country Link
CN (1) CN112633220B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016190934A2 (en) * 2015-02-27 2016-12-01 Massachusetts Institute Of Technology Methods, systems, and apparatus for global multiple-access optical communications
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN108932500A (en) * 2018-07-09 2018-12-04 广州智能装备研究院有限公司 A kind of dynamic gesture identification method and system based on deep neural network
CN111695457A (en) * 2020-05-28 2020-09-22 浙江工商大学 Human body posture estimation method based on weak supervision mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈昱昆; 汪正祥; 于莲芝: "Human pose estimation with a lightweight two-path convolutional neural network and inter-frame information reasoning", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 10 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205043A (en) * 2021-04-30 2021-08-03 武汉大学 Video sequence two-dimensional attitude estimation method based on reinforcement learning
CN113673469A (en) * 2021-08-30 2021-11-19 广州深灵科技有限公司 Human body key point analysis training and reasoning method and device based on video stream
CN113627396A (en) * 2021-09-22 2021-11-09 浙江大学 Health monitoring-based skipping rope counting method
CN113627396B (en) * 2021-09-22 2023-09-05 浙江大学 Rope skipping counting method based on health monitoring
CN113920545A (en) * 2021-12-13 2022-01-11 中煤科工开采研究院有限公司 Method and device for detecting posture of underground coal mine personnel
CN115116132A (en) * 2022-06-13 2022-09-27 南京邮电大学 Human behavior analysis method for deep perception in Internet of things edge service environment
CN115116132B (en) * 2022-06-13 2023-07-28 南京邮电大学 Human behavior analysis method for depth perception in Internet of things edge service environment
CN116386089A (en) * 2023-06-05 2023-07-04 季华实验室 Human body posture estimation method, device, equipment and storage medium under motion scene
CN116386089B (en) * 2023-06-05 2023-10-31 季华实验室 Human body posture estimation method, device, equipment and storage medium under motion scene

Also Published As

Publication number Publication date
CN112633220B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
US11810366B1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
CN115601549A (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN111986240A (en) Drowning person detection method and system based on visible light and thermal imaging data fusion
CN113283525B (en) Image matching method based on deep learning
CN112669350A (en) Adaptive feature fusion intelligent substation human body target tracking method
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN111695457A (en) Human body posture estimation method based on weak supervision mechanism
CN116524062B (en) Diffusion model-based 2D human body posture estimation method
CN112084952B (en) Video point location tracking method based on self-supervision training
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN106887010A (en) Ground moving target detection method based on high-rise scene information
CN110705366A (en) Real-time human head detection method based on stair scene
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116092190A (en) Human body posture estimation method based on self-attention high-resolution network
CN113269038B (en) Multi-scale-based pedestrian detection method
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN111680640B (en) Vehicle type identification method and system based on domain migration
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention
CN112950481B (en) Water bloom shielding image data collection method based on image mosaic network
CN115331171A (en) Crowd counting method and system based on depth information and significance information
Kim et al. Global convolutional neural networks with self-attention for fisheye image rectification
CN114419729A (en) Behavior identification method based on light-weight double-flow network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant