CN112633220A - Human body posture estimation method based on bidirectional serialization modeling - Google Patents
Human body posture estimation method based on bidirectional serialization modeling
- Publication number
- CN112633220A (application CN202011610311.1A)
- Authority
- CN
- China
- Prior art keywords
- human body
- network
- posture
- attitude
- posture estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/40: Scenes; scene-specific elements in video content
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Neural network architectures; combinations of networks
- G06N3/08: Neural network learning methods
- G06V10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
- G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- Y02T10/40: Engine management systems
Abstract
The invention discloses a human body posture estimation method based on bidirectional serialization modeling. The method takes three consecutive frames as input and fully exploits the temporal information of the video: it first computes the approximate spatial range of each joint, then regresses the precise joint position within that smaller range. This better handles the occlusion and motion blur inherent in human posture estimation, yielding a model with stronger generalization and higher accuracy. By fully exploiting video temporal information, the method strengthens the model's reasoning ability and localizes key body parts more reliably, which is valuable for industries that must extract postures in real time for analysis, such as security and short-video platforms.
Description
Technical Field
The invention belongs to the technical field of human body posture estimation, and particularly relates to a human body posture estimation method based on bidirectional serialization modeling.
Background
Human body posture estimation is a leading research area in computer vision. Its goal is to locate key body parts (such as wrists and ankles) in pictures or videos and thereby estimate the human pose. Posture estimation bridges machines and people, has great practical significance, and is widely applied in many fields. In stage animation, recognizing a performer's postures and actions can drive real-time interactive effects; in automatic driving, predicting the motion trend of pedestrians can help avoid traffic accidents; in security, recognizing specific posture sequences can reveal abnormal behaviour.
Currently, human posture estimation methods fall into two main classes. (1) Top-down: first detect every person in the picture, usually marking each with a rectangular bounding box; then identify the joints of each person with a joint detector; finally map the cropped person's posture information back to the original picture by affine transformation, thereby estimating all human poses in the picture. Top-down methods separate person detection from joint detection and concentrate on the posture estimation itself, so they achieve high accuracy; however, their detection time grows with the number of people in the picture, they depend on an object detection technique, and the quality of the detected position coordinates directly affects the final posture estimate. (2) Bottom-up: first detect the joint positions of all people in the picture, then cluster the joint coordinates belonging to the same person, thereby estimating the posture of every person. Bottom-up methods are efficient, and their detection time is largely independent of the number of people in the picture, but their accuracy lags slightly behind.
Mainstream human body posture estimation methods, both top-down and bottom-up, use network architectures designed for static pictures and excel at single-frame estimation. In a video, however, one frame typically lasts only 1/25 s, so adjacent frames change little and are highly similar. This rich geometric consistency between neighbouring frames provides extra cues that can be used to correct keypoints that are hard to predict from a single frame, such as those affected by occlusion or motion blur.
Traditional image-based posture estimation methods cannot exploit this extra information, so they fail in video sequences where people are highly entangled, mutually occluded, or motion-blurred, and struggle to produce good results for video posture estimation. To address this, the document [Flowing ConvNets for Human Pose Estimation in Videos - Pfister, T., Charles, J. & Zisserman, A. (ICCV 2015)] proposed computing dense optical flow between every pair of frames and then correcting the initial pose estimate with flow-based temporal information. When the optical flow can be computed correctly, this works well; however, optical flow is strongly affected by picture quality and occlusion, cannot be computed accurately throughout a video, and often demands a large amount of computation. Other researchers proposed modelling the video directly with a Long Short-Term Memory network (LSTM) to capture temporal information, but due to the structural limitations of the LSTM this only works well when the people in a video frame are sparse; in complex scenes it still cannot handle occlusion or motion blur.
Disclosure of Invention
In view of the above, the present invention provides a human body posture estimation method based on bidirectional serialization modeling. It takes three consecutive frames as input, fully exploits the temporal information of the video to compute the approximate spatial range of each joint, and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in posture estimation and yielding a model with stronger generalization and higher accuracy.
A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for human body posture estimation and preprocessing it;
(2) for each complete video in the data set, taking every 3 consecutive video frames as a group of samples and manually annotating the coordinates of each key body part in the video images;
(3) constructing a bidirectional continuous convolutional neural network and training it with a large number of samples to obtain a human body posture estimation model;
(4) inputting 3 consecutive video frames to be estimated into the human body posture estimation model and outputting the posture estimation result for the person in the 2nd frame, namely the coordinates of each key body part.
Further, in step (1), for each video frame in the data set, the position coordinates of each human ROI (region of interest, i.e. the person's bounding box) are detected with the YOLOv5 algorithm, and each ROI is enlarged by 25%.
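As an illustration of this preprocessing step, the sketch below enlarges a detected box by 25% about its centre and clips it to the image; the function name, the (x1, y1, x2, y2) box format and the default image size are assumptions for illustration, not taken from the patent.

```python
def enlarge_roi(box, scale=1.25, img_w=1920, img_h=1080):
    """Enlarge an (x1, y1, x2, y2) person box about its centre and clip to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

print(enlarge_roi((100, 100, 300, 300)))  # (75.0, 75.0, 325.0, 325.0)
```

The same enlarged box is reused to crop the previous and next frames, so all three crops cover the same person.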
Further, the bidirectional continuous convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network. The Backbone network computes initial posture feature vectors h_{i-1}, h_i, h_{i+1} for the human body in the three video frames of an input sample; the three feature vectors are stacked into a vector φ(h), which is fed to both the posture time merging network and the posture residual fusion network. The posture time merging network encodes the approximate spatial range of each joint into a feature vector ξ(h); the posture residual fusion network computes a posture residual vector ψ(h); finally ξ(h), stacked with ψ(h) into the feature vector η, is fed to the posture correction network, which computes the human posture prediction.
Further, the posture time merging network is formed by stacking three Residual Blocks; the vector φ(h) is regrouped by joint order and fed to the network, which outputs the feature vector ξ(h). The posture residual fusion network is formed by stacking five residual blocks: the posture feature vectors of the second and first frames, and of the second and third frames, are first differenced respectively, the differences are concatenated with weights into a tensor ζ, which is fed to the network, and the posture residual vector ψ(h) is output; the specific expression of the tensor ζ is as follows:
further, the residual block is formed by sequentially connecting a convolution layer with the size of 3 × 3, a batch normalization layer and a Relu activation layer, the residual block in the attitude time merging network adopts packet convolution, and the packet number groups is 17 (according to the key point standard of the COCO data set, there are 17 key points in total); the residual block in the posture residual fusion network does not use packet convolution, and the packet number groups is 1.
Furthermore, the posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the stack of the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap; the five heatmaps output by the five convolutions are averaged to obtain the human posture prediction.
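The role of the five dilation rates can be sanity-checked with the standard receptive-field formula for a single dilated k × k layer, rf = k + (k - 1)(d - 1), and the final fusion is a plain per-pixel average; this sketch assumes 3 × 3 kernels, which the patent does not state for this layer.

```python
def dilated_rf(kernel=3, dilation=1):
    """Receptive field (one axis) of a single dilated convolution layer."""
    return kernel + (kernel - 1) * (dilation - 1)

for d in (3, 6, 9, 12, 15):
    print(d, dilated_rf(3, d))  # larger dilation rate -> larger receptive field

def average_heatmaps(heatmaps):
    """Per-pixel mean of equally sized 2-D heatmaps (lists of rows)."""
    n = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(h[r][c] for h in heatmaps) / n for c in range(cols)]
            for r in range(rows)]

fused = average_heatmaps([[[1.0, 0.0]], [[0.0, 1.0]]])
print(fused)  # [[0.5, 0.5]]
```

Averaging branches with mixed receptive fields blends global context (large dilation) with fine local detail (small dilation).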
Further, training the bidirectional continuous convolutional neural network in step (3) proceeds in two stages: first the Backbone network is trained; then the Backbone parameters are fixed and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
Further, the specific process of training the Backbone network is as follows: the human ROIs in all video images of a sample are fed into the Backbone network one by one; the loss function L1 between the human posture prediction output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed; and the Backbone parameters are repeatedly updated by back-propagation of L1 until L1 converges. Consistent with the symbols defined below, L1 can be written as

L1 = (1/N) Σ_{i=1}^{N} v_i ‖H_pred_i - H_gt_i‖₂²

wherein: N is the number of annotated key body parts; H_gt_i is the result of superimposing the Gaussian heatmaps generated from the manually annotated coordinates of the i-th key part of every human ROI in a sample group; H_pred_i is the corresponding superimposed Gaussian heatmap generated from the coordinates of the i-th key part predicted by the bidirectional continuous convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image (1 if annotated, 0 otherwise).
Further, the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: the trained Backbone parameters are first fixed; the human ROIs in all video images of a sample are then fed into the Backbone network one by one; the loss function L2 between the human posture prediction output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed; and the parameters of the three networks are repeatedly updated by back-propagation of L2 until L2 converges. Consistent with the symbols defined below, L2 can be written as

L2 = (1/N) Σ_{i=1}^{N} v_i ‖G_pred_i - G_gt_i‖₂²

wherein: N is the number of annotated key body parts; G_gt_i is the Gaussian heatmap generated from the manually annotated coordinates of the i-th key part of a human ROI in the 2nd frame of a sample group; G_pred_i is the Gaussian heatmap generated from the coordinates of that key part predicted by the bidirectional continuous convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image (1 if annotated, 0 otherwise).
Further, step (4) is implemented as follows: the ROIs of the same person in the 3 consecutive video frames to be estimated are fed into the human body posture estimation model, which outputs a Gaussian heatmap; the heatmap is converted to obtain the coordinates of the person's key parts in the 2nd frame; the coordinates are mapped back into the 2nd frame, and the key parts are linked in order to produce a predicted human skeleton, thereby realizing human posture estimation.
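The heatmap-to-coordinate conversion can be sketched as a simple argmax decode followed by mapping back into the original image through the ROI offset and scale; the helper names and the ROI bookkeeping are illustrative assumptions:

```python
def decode_heatmap(heatmap):
    """Return the (x, y) grid position of the heatmap's maximum response."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best:
                best, best_xy = val, (x, y)
    return best_xy

def to_image_coords(xy, roi_x, roi_y, stride=4):
    """Map a heatmap position back to original-image pixels,
    assuming the heatmap is `stride` times smaller than the ROI crop."""
    x, y = xy
    return (roi_x + x * stride, roi_y + y * stride)

hm = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
peak = decode_heatmap(hm)
print(peak)                            # (1, 1)
print(to_image_coords(peak, 100, 50))  # (104, 54)
```

Production decoders often refine the argmax with a sub-pixel offset toward the second-highest neighbour, but the raw argmax shows the principle.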
The method of the invention mainly uses deformable convolution networks with different dilation (atrous) rates as the prediction model. A deformable convolution is a variant of the traditional convolutional layer: traditional convolution kernels are square, but ordinary objects such as human bodies are not, which limits traditional convolutional networks; a deformable convolution learns an offset for each kernel position and can therefore take on an arbitrary shape, adapting better to objects of various shapes. Each convolutional layer uses a different dilation rate, and hence a different receptive field: the larger the dilation rate, the larger the receptive field and the more global the captured information, while a smaller dilation rate captures finer local information. This makes the deformable-convolution design well suited to estimating human posture in video.
The invention fully exploits the temporal information of the video, strengthens the model's reasoning ability and localizes key body parts more reliably, which is valuable in industries that must extract postures in real time for analysis, such as security and short-video platforms. Its beneficial technical effects are as follows:
1. Through an accurate posture estimation algorithm, keypoints in occluded and motion-blurred images are estimated better, making detection both more accurate and faster.
2. The method is designed for video and fits a wide range of application scenarios; by adopting grouped convolution, dilated convolution and the like, it achieves good results with fewer parameters, so posture estimation can be applied in real time.
Drawings
FIG. 1 is a flowchart illustrating a human body posture estimation method according to the present invention.
FIG. 2 is a schematic diagram of a Residual Block structure and its stacking method.
FIG. 3 is a schematic structural diagram of a bi-directional continuous convolutional neural network according to the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
As shown in FIG. 1, the human body posture estimation method based on bidirectional continuity of the invention comprises the following steps:
(1) Collect and select a human body posture estimation video data set, and preprocess it.
This embodiment uses the PoseTrack data set as training data. PoseTrack targets human posture tracking; its videos contain frequent occlusion and motion blur, which greatly increases the difficulty of video posture estimation. The embodiment follows a top-down approach, so the data set is preprocessed: the bounding box of every person in the frame to be estimated is first detected with the YOLOv5 detection algorithm; each bounding box is then enlarged by 25% and used to crop the previous and next frames as well, yielding three images of the same person.
(2) Construct a bidirectional continuous convolutional neural network model as the human body posture estimation model.
As shown in Fig. 3, the bidirectional continuous convolutional neural network (DCPose) mainly consists of a Backbone network module, a posture time merging module (PTM), a posture residual fusion module (PRF) and a posture correction network module (PCN). In this embodiment the Backbone module uses the high-resolution network HRNet to compute initial posture features h_{i-1}, h_i, h_{i+1} for the three input pictures; the three vectors are stacked into φ(h) and fed to two parallel branches: the posture time merging module encodes the approximate spatial range ξ(h) of each joint, and the posture residual fusion module produces the posture residual vector ψ(h); then ξ(h), stacked with ψ(h) into the feature vector η, is fed to the posture correction network to obtain the final posture prediction.
The posture time merging module consists of three stacked Residual Blocks. A sample group passes through the Backbone network to give the feature vector φ(h), which is regrouped by joint order and fed to the module, outputting the feature vector ξ(h); each residual block uses grouped convolution with groups = 17 (following the COCO keypoint standard of 17 keypoints).
The posture residual fusion module consists of five stacked residual blocks. The posture feature vectors of the first and second frames and of the third and second frames of the sample group are first differenced respectively; the differences are concatenated with weights into the tensor ζ, which is fed to the module, and the posture residual vector ψ(h) is output; the tensor ζ can be formalized as:
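A minimal sketch of this residual construction, operating on per-channel feature lists; the fusion weights w_prev and w_next are illustrative assumptions, since the patent's formula for ζ is not reproduced in this text:

```python
def pose_residual_input(h_prev, h_cur, h_next, w_prev=0.5, w_next=0.5):
    """Concatenate weighted frame-difference features:
    zeta = [w_prev * (h_cur - h_prev), w_next * (h_cur - h_next)]."""
    d_prev = [w_prev * (c - p) for c, p in zip(h_cur, h_prev)]
    d_next = [w_next * (c - n) for c, n in zip(h_cur, h_next)]
    return d_prev + d_next  # channel-wise concatenation

zeta = pose_residual_input([1.0, 2.0], [3.0, 4.0], [2.0, 2.0])
print(zeta)  # [1.0, 1.0, 0.5, 1.0]
```

The differences approximate the motion of the pose features between the middle frame and its neighbours, which is the cue the PRF module learns to fuse.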
as shown in fig. 2, the residual block is composed of a 3 × 3 convolution layer, a batch normalization layer, and a Relu activation layer; the difference lies in that the groups parameter in the three residual block convolution layers forming the PTM module is 17, the corresponding PRF module does not use the packet convolution, and the groups parameter in the convolution layers is 1 at the moment.
The posture correction network consists of five parallel deformable convolutions with dilation rates set to 3, 6, 9, 12 and 15. Each deformable convolution takes the stack of the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap; finally the five heatmaps are averaged to obtain the final prediction.
(3) Feed the data preprocessed in step (1) into the model, and update the parameters and train the model with the L2 distance as the loss function.
DCPose is trained in two separate stages: the Backbone network is trained first; the Backbone is then frozen and the remaining networks are trained.
In DCPose, every frame of a video is treated in turn as the current frame to be estimated; together with one frame before and one after, the video is divided into sub-picture sequences of length 3, each carrying annotation information for the keypoint positions of all people, and each sub-picture sequence is used as one input to DCPose.
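The division into length-3 sub-sequences is an ordinary sliding window; a sketch follows (boundary frames are simply skipped here, which is an assumption, since the patent does not say how the first and last frames are handled):

```python
def split_subsequences(frames):
    """Yield (previous, current, next) triples for every interior frame."""
    return [(frames[i - 1], frames[i], frames[i + 1])
            for i in range(1, len(frames) - 1)]

print(split_subsequences([0, 1, 2, 3]))  # [(0, 1, 2), (1, 2, 3)]
```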
The Backbone network first loads the official pre-trained model parameters; a group of sub-picture sequences is then fed in, posture feature vectors are output, and the mean square error against the ground-truth posture vectors gives the loss of each frame. Consistent with the symbols defined below, the loss function L can be written as

L = (1/N) Σ_{i=1}^{N} v_i ‖H_pred_i - H_gt_i‖₂²

wherein: H_gt_i is the result of superimposing the Gaussian heatmaps generated from the ground-truth coordinates of the i-th key part of every person in the sub-sequence; H_pred_i is the superimposed Gaussian heatmap generated from the predicted coordinates of the i-th key part of every person in the sub-sequence; ‖·‖₂ denotes the L2 norm; N is the number of annotated key body parts; and v_i indicates whether the coordinate is annotated (1 if annotated, 0 otherwise).
After the Backbone network has been trained, its parameters are fixed and each sub-picture sequence is fed into the DCPose network. The Backbone produces a posture feature vector φ(h) with dimensions [4, 51, 96, 72]; the PTM network maps it to a feature vector ξ(h) with dimensions [4, 17, 96, 72], and the PRF network to a feature vector ψ(h) with dimensions [4, 128, 96, 72]. The feature vector ξ(h) and the vector η obtained by stacking ξ(h) with ψ(h) are then fed into the PCN network with dimensions [4, 145, 96, 72]; each deformable convolution layer outputs a posture feature vector with dimensions [4, 17, 96, 72], and the final Gaussian heatmap is obtained by averaging the 5 posture feature vectors.
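The channel bookkeeping in these dimensions can be checked directly (51 = 3 frames × 17 joints, 145 = 17 + 128); a small sketch using shape tuples only:

```python
BATCH, H, W = 4, 96, 72
phi = (BATCH, 3 * 17, H, W)      # Backbone output: 3 frames x 17 joints = 51 channels
xi = (BATCH, 17, H, W)           # PTM output: one channel per joint
psi = (BATCH, 128, H, W)         # PRF output

def concat_channels(a, b):
    """Shape of concatenating two feature maps along the channel axis."""
    assert a[0] == b[0] and a[2:] == b[2:]
    return (a[0], a[1] + b[1], a[2], a[3])

eta = concat_channels(xi, psi)   # PCN input
print(phi, eta)  # (4, 51, 96, 72) (4, 145, 96, 72)
```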
DCPose training mainly uses the L2 loss. In each picture sequence fed into the bidirectional continuous convolutional neural network, only frame 2 actually needs its posture estimated, so no loss is computed for frames 1 and 3. The loss function for frame 2 is essentially the same as the one used for Backbone training; the only difference is that H_gt_i is the Gaussian heatmap generated from the ground-truth coordinates of the i-th key part of the person in frame 2 of the sample, and H_pred_i is the Gaussian heatmap generated from the predicted coordinates of that key part in frame 2. By making full use of the bidirectional information of the previous and subsequent frames, the network gains more accurate prediction ability.
(4) After training, feed in the test set and output the human body posture estimation results; the specific process is as follows:
4.1 The test set is input into the trained model to obtain the Gaussian heatmap of each frame.
4.2 Using a Gaussian-heatmap coordinate-conversion algorithm, the coordinates of the key body parts are computed from the final Gaussian heatmap of step 4.1; the coordinates are then mapped back to the original picture to obtain the positions of the key parts, and finally the key parts are linked in order to produce a predicted human skeleton, achieving the goal of human posture estimation.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.
Claims (10)
1. A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for estimating the human body posture and preprocessing the video data set;
(2) regarding a section of complete video in a video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
(4) inputting 3 consecutive frames of video images to be estimated into the human body posture estimation model, and outputting the posture estimation result of the person in the 2nd frame video image, namely the coordinates of each key part of the human body.
2. The human body posture estimation method according to claim 1, characterized in that: in step (1), for each frame of video image in the video data set, the position coordinates of the human body ROI in the image are detected by the YOLOv5 algorithm, and the ROI is enlarged by 25%.
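Claim 2's 25% ROI enlargement can be sketched as a centre-preserving box expansion. Whether the 25% applies to each dimension or per side is not specified in the claim; this sketch assumes width and height each grow by 25% about the box centre:

```python
def expand_roi(box, ratio=0.25):
    """Enlarge a detector box (x0, y0, x1, y1) so that its width
    and height grow by `ratio`, keeping the centre fixed."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * ratio / 2  # half the extra width on each side
    dy = (y1 - y0) * ratio / 2
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)
```

The enlarged box gives the pose network some context beyond the tight detector output, so that limbs clipped by the detection are still covered.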
3. The human body posture estimation method according to claim 1, characterized in that: the bidirectional continuous convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network; the Backbone network preliminarily calculates the posture feature vectors h_{i-1}, h_i, h_{i+1} of the human body in the three frames of video images of an input sample; the three feature vectors are superposed to obtain a vector φ(h), which is input to the posture time merging network and the posture residual fusion network respectively; the posture time merging network encodes the approximate spatial range of each joint of the human body to obtain a feature vector ξ(h), and the posture residual fusion network calculates a posture residual vector ψ(h) of the human body; ξ(h), together with the feature vector η obtained by superposition from ψ(h), is input to the posture correction network, which calculates the human body posture prediction result.
4. The human body posture estimation method according to claim 3, characterized in that: the posture time merging network is formed by stacking three residual blocks; the vector φ(h) is regrouped according to joint order and then used as the input of this network, which outputs the feature vector ξ(h); the posture residual fusion network is formed by stacking five residual blocks; first the posture feature vectors of the second and first frames and of the second and third frames in a sample are differenced respectively, and the results are cascaded with weights to obtain a tensor ζ, which is used as the input of this network; the output is the posture residual vector ψ(h), and the specific expression of the tensor ζ is as follows:
5. The human body posture estimation method according to claim 4, characterized in that: each residual block is formed by sequentially connecting a 3×3 convolutional layer, a batch normalization layer and a ReLU activation layer; the residual blocks in the posture time merging network adopt grouped convolution with the number of groups set to 17, while the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. the number of groups is 1.
6. The human body posture estimation method according to claim 3, characterized in that: the posture correction network is composed of five parallel deformable convolutions with dilation rates of 3, 6, 9, 12 and 15 respectively; each deformable convolution takes the result of stacking the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap, and the five Gaussian heatmaps output by the five convolutions are averaged to obtain the human body posture prediction result.
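The five-branch fusion in claim 6 reduces to a mean over the per-branch heatmaps. A minimal sketch of just this averaging step (the deformable convolutions themselves, and the branch count, are taken from the claim; the array layout is an illustrative assumption):

```python
import numpy as np

def fuse_branch_heatmaps(branch_heatmaps):
    """Average the Gaussian heatmaps predicted by the parallel
    deformable-convolution branches into a single prediction.

    branch_heatmaps: (B, N, H, W) array with B branches (five in
    the claim) and N key parts.
    """
    return np.mean(branch_heatmaps, axis=0)
```

Averaging branches with different dilation rates blends receptive fields of different sizes into one estimate.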
7. The human body posture estimation method according to claim 3, characterized in that: the process of training the bidirectional continuous convolutional neural network in step (3) is divided into two steps: first the Backbone network is trained, then the Backbone network parameters are fixed and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
8. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the Backbone network is as follows: the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L1 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotation information corresponding to the sample is calculated, and the Backbone network parameters are repeatedly updated by back propagation according to the loss function L1 until L1 converges; the expression of the loss function L1 is as follows:
wherein: n is the number of labeled human body key parts, Hgt_iTransforming the coordinates of manually marked ith key part of ROI in a group of samples to generate the result of superposition of Gaussian heatmaps Hpred_iConverting coordinates predicted and output by all human body ROI (region of interest) key parts in a group of samples through a bidirectional continuous convolutional neural network to generate a result after superposition of Gaussian heatmaps, | | | | sweet wind2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
9. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: the trained Backbone network parameters are first fixed, then the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L2 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotation information corresponding to the sample is calculated, and the parameters of the posture time merging network, the posture residual fusion network and the posture correction network are repeatedly updated by back propagation according to the loss function L2 until L2 converges; the expression of the loss function L2 is as follows:
wherein: n is the number of labeled human body key parts, Ggt_iArtificially marking coordinates of the ith key part of human ROI in the 2 nd frame video image of a group of samples to generate a Gaussian heat map G through conversionpred_iPredicting a Gauss heat map generated by converting output coordinates of an ith key part of a human body ROI in a 2 nd frame video image of a group of samples through a bidirectional continuous convolution neural network, | | | | | ventilation2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
10. The human body posture estimation method according to claim 1, characterized in that: the specific implementation process of step (4) is as follows: the human body ROIs of the same person in 3 consecutive frames of video images to be estimated are input into the human body posture estimation model, which outputs a Gaussian heatmap; the coordinate information of the key parts of the person in the 2nd frame video image is calculated from the Gaussian heatmap by conversion, the coordinate information is mapped into the 2nd frame video image, and the key parts are linked in order to generate the predicted human skeleton, thereby realizing human body posture estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011610311.1A CN112633220B (en) | 2020-12-30 | 2020-12-30 | Human body posture estimation method based on bidirectional serialization modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633220A true CN112633220A (en) | 2021-04-09 |
CN112633220B CN112633220B (en) | 2024-01-09 |
Family
ID=75286799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011610311.1A Active CN112633220B (en) | 2020-12-30 | 2020-12-30 | Human body posture estimation method based on bidirectional serialization modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633220B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016190934A2 (en) * | 2015-02-27 | 2016-12-01 | Massachusetts Institute Of Technology | Methods, systems, and apparatus for global multiple-access optical communications |
WO2017133009A1 (en) * | 2016-02-04 | 2017-08-10 | 广州新节奏智能科技有限公司 | Method for positioning human joint using depth image of convolutional neural network |
CN108932500A (en) * | 2018-07-09 | 2018-12-04 | 广州智能装备研究院有限公司 | A kind of dynamic gesture identification method and system based on deep neural network |
CN111695457A (en) * | 2020-05-28 | 2020-09-22 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
Non-Patent Citations (1)
Title |
---|
CHEN Yukun; WANG Zhengxiang; YU Lianzhi: "Human pose estimation with a lightweight dual-path convolutional neural network and inter-frame information reasoning", Journal of Chinese Computer Systems (小型微型计算机系统), no. 10 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113205043A (en) * | 2021-04-30 | 2021-08-03 | 武汉大学 | Video sequence two-dimensional attitude estimation method based on reinforcement learning |
CN113673469A (en) * | 2021-08-30 | 2021-11-19 | 广州深灵科技有限公司 | Human body key point analysis training and reasoning method and device based on video stream |
CN113627396A (en) * | 2021-09-22 | 2021-11-09 | 浙江大学 | Health monitoring-based skipping rope counting method |
CN113627396B (en) * | 2021-09-22 | 2023-09-05 | 浙江大学 | Rope skipping counting method based on health monitoring |
CN113920545A (en) * | 2021-12-13 | 2022-01-11 | 中煤科工开采研究院有限公司 | Method and device for detecting posture of underground coal mine personnel |
CN115116132A (en) * | 2022-06-13 | 2022-09-27 | 南京邮电大学 | Human behavior analysis method for deep perception in Internet of things edge service environment |
CN115116132B (en) * | 2022-06-13 | 2023-07-28 | 南京邮电大学 | Human behavior analysis method for depth perception in Internet of things edge service environment |
CN116386089A (en) * | 2023-06-05 | 2023-07-04 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium under motion scene |
CN116386089B (en) * | 2023-06-05 | 2023-10-31 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium under motion scene |
Also Published As
Publication number | Publication date |
---|---|
CN112633220B (en) | 2024-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112633220B (en) | Human body posture estimation method based on bidirectional serialization modeling | |
US11810366B1 (en) | Joint modeling method and apparatus for enhancing local features of pedestrians | |
CN115601549A (en) | River and lake remote sensing image segmentation method based on deformable convolution and self-attention model | |
CN111986240A (en) | Drowning person detection method and system based on visible light and thermal imaging data fusion | |
CN113283525B (en) | Image matching method based on deep learning | |
CN112669350A (en) | Adaptive feature fusion intelligent substation human body target tracking method | |
CN111797688A (en) | Visual SLAM method based on optical flow and semantic segmentation | |
CN111695457A (en) | Human body posture estimation method based on weak supervision mechanism | |
CN116524062B (en) | Diffusion model-based 2D human body posture estimation method | |
CN112084952B (en) | Video point location tracking method based on self-supervision training | |
Wang et al. | MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection | |
CN106887010A (en) | Ground moving target detection method based on high-rise scene information | |
CN110705366A (en) | Real-time human head detection method based on stair scene | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN113269038B (en) | Multi-scale-based pedestrian detection method | |
Guo et al. | Scale region recognition network for object counting in intelligent transportation system | |
CN111680640B (en) | Vehicle type identification method and system based on domain migration | |
CN111274901B (en) | Gesture depth image continuous detection method based on depth gating recursion unit | |
CN113066074A (en) | Visual saliency prediction method based on binocular parallax offset fusion | |
Huang et al. | Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention | |
CN112950481B (en) | Water bloom shielding image data collection method based on image mosaic network | |
CN115331171A (en) | Crowd counting method and system based on depth information and significance information | |
Kim et al. | Global convolutional neural networks with self-attention for fisheye image rectification | |
CN114419729A (en) | Behavior identification method based on light-weight double-flow network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||