CN108229445A

CN108229445A - A multi-person pose estimation method based on a cascaded pyramid network

Info

Publication number: CN108229445A
Application number: CN201810132802.6A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2018-06-29

Abstract

The present invention proposes a multi-person pose estimation method based on a cascaded pyramid network. Its main components are: the cascaded pyramid network (CPN), multi-person pose estimation, and training and testing. The process is as follows: bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes, which enclose the persons in the image. The cascaded pyramid network then locates the keypoints within each person's bounding box: the global network locates the simple keypoints, while the refinement network handles the difficult keypoints by integrating feature representations from all levels of the global network, and the loss is back-propagated only from the selected keypoints. The invention applies a top-down approach to multi-person pose estimation and, by using the cascaded pyramid network, substantially improves pose estimation performance, so that it can meet the demanding requirements of pose estimation in practical applications.

Description

A multi-person pose estimation method based on a cascaded pyramid network

Technical field

The present invention relates to the field of pose estimation, and more particularly to a multi-person pose estimation method based on a cascaded pyramid network.

Background art

Multi-person pose estimation identifies and locates the keypoints of every person in an image. It is a basic research topic for many vision applications, such as human action recognition and human-computer interaction, and a challenging one. Multi-person pose estimation can be used in fields such as sports or dance, where a person's pose must be estimated: by recognizing and analyzing the poses of athletes or performers, it helps them analyze their own or others' movements objectively and quantitatively, or compile relevant statistics, which can be used to build personalized training and analysis systems and to guide athletes or performers toward scientific and effective training. It can also be used for pedestrian pose estimation in traffic: by recognizing and analyzing the poses of many pedestrians, the direction in which each is heading can be judged, helping drivers plan their route and take appropriate measures. Related pose estimation techniques can further be applied to human-computer interaction, security systems in public places, and similar fields, bringing more convenience to people's lives. However, existing pose estimation methods still cannot adequately handle the low estimation accuracy caused by occluded keypoints, invisible keypoints, and complex backgrounds.

The present invention proposes a multi-person pose estimation method based on a cascaded pyramid network. Bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes, which enclose the persons in the image. The cascaded pyramid network then locates the keypoints within each person's bounding box: the global network locates the simple keypoints, while the refinement network handles the difficult keypoints by integrating feature representations from all levels of the global network, and the loss is back-propagated only from the selected keypoints. The invention applies a top-down approach to multi-person pose estimation and, by using the cascaded pyramid network, substantially improves pose estimation performance, so that it can meet the demanding requirements of pose estimation in practical applications.

Summary of the invention

To address the low estimation accuracy caused by occluded keypoints, invisible keypoints, and complex backgrounds, the object of the present invention is to provide a multi-person pose estimation method based on a cascaded pyramid network. Bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes, which enclose the persons in the image. The cascaded pyramid network then locates the keypoints within each person's bounding box: the global network locates the simple keypoints, while the refinement network handles the difficult keypoints by integrating feature representations from all levels of the global network, and the loss is back-propagated only from the selected keypoints.

To solve the above problems, the present invention provides a multi-person pose estimation method based on a cascaded pyramid network, whose main components are:

(1) the cascaded pyramid network (CPN);

(2) multi-person pose estimation;

(3) training and testing.

The cascaded pyramid network (CPN) comprises two sub-networks: a global network and a refinement network. The global network is a feature pyramid network that can locate "simple" keypoints such as the eyes and hands, but may fail to precisely locate occluded or invisible keypoints. The refinement network handles "hard" keypoints by integrating feature representations from all levels of the global network.

Further, in the global network, the last residual blocks of the second through fifth convolutional stages are denoted C₂, C₃, …, C₅. 3 × 3 convolution filters are applied to C₂, C₃, …, C₅ to generate the keypoint heatmaps. Shallow features such as C₂ and C₃ have higher spatial resolution but carry less semantic information for recognition, whereas deeper feature layers such as C₄ and C₅ carry more semantic information as a result of convolution (and pooling) but have lower spatial resolution. A U-shaped structure is therefore usually used to integrate them, preserving both the spatial resolution and the semantic information of the feature layers. The feature pyramid network (FPN) further improves the U-shaped structure through deep supervision, and a similar feature pyramid structure is applied here to keypoint estimation. A 1 × 1 convolution kernel is applied before each element-wise summation in the upsampling process.
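The top-down merge step of the global network can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: it assumes nearest-neighbour 2× upsampling, assumes the 1 × 1 convolution is applied to the upsampled coarse map, and assumes the finer map already has the matching channel count. The function names `conv1x1`, `upsample2x`, and `merge_top_down` are hypothetical.

```python
def conv1x1(feat, weight):
    """1x1 convolution: mix channels independently at every pixel.

    feat:   [C_in][H][W] nested lists
    weight: [C_out][C_in] nested lists
    """
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][y][x] for c in range(len(feat)))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map (assumed mode)."""
    return [[[row[x // 2] for x in range(2 * len(row))]
             for row in channel for _ in range(2)]
            for channel in feat]


def merge_top_down(fine, coarse, weight):
    """Upsample the coarse map, apply the 1x1 convolution, then sum element-wise."""
    up = conv1x1(upsample2x(coarse), weight)
    return [[[a + b for a, b in zip(row_f, row_u)]
             for row_f, row_u in zip(ch_f, ch_u)]
            for ch_f, ch_u in zip(fine, up)]


# A 1-channel 1x1 coarse map merged into a 2-channel 2x2 fine map.
coarse = [[[1.0]]]
fine = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
weight = [[2.0], [3.0]]  # maps 1 input channel to 2 output channels
merged = merge_top_down(fine, coarse, weight)
```

In a real network the same merge would be repeated from C₅ down to C₂, with learned weights and a tensor library instead of nested lists.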

Further, in the refinement network, to improve the efficiency of information transmission while preserving its integrity, information is passed across the different levels and finally integrated by upsampling and concatenation. The refinement network concatenates all pyramid features. In addition, more bottleneck blocks are stacked at the deeper levels, whose smaller spatial size gives a good trade-off between effectiveness and efficiency.

As training proceeds, the network tends to pay more and more attention to the majority of "simple" keypoints and less to occluded and hard keypoints. In the refinement network, the "hard" keypoints are therefore explicitly selected according to the training loss, and the loss is back-propagated only from the selected keypoints.
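The selection of hard keypoints by training loss can be sketched as below. The text does not state how many keypoints are kept, so the parameter `top_k` is an assumption, as are the function names; the sketch only shows that keypoints outside the selection contribute nothing to the loss and therefore nothing to the gradient.

```python
def hard_keypoint_mask(per_keypoint_losses, top_k):
    """Return a 0/1 mask marking the top_k keypoints with the largest loss."""
    order = sorted(range(len(per_keypoint_losses)),
                   key=lambda i: per_keypoint_losses[i],
                   reverse=True)
    selected = set(order[:top_k])
    return [1 if i in selected else 0 for i in range(len(per_keypoint_losses))]


def hard_keypoint_loss(per_keypoint_losses, top_k):
    """Average the loss over the selected hard keypoints only.

    Keypoints with mask 0 are dropped from the sum, so no loss is
    back-propagated from them.
    """
    mask = hard_keypoint_mask(per_keypoint_losses, top_k)
    kept = [loss for loss, m in zip(per_keypoint_losses, mask) if m]
    return sum(kept) / len(kept)


# Four keypoints; with top_k = 2 only the second and fourth are kept.
losses = [0.1, 0.9, 0.3, 0.7]
mask = hard_keypoint_mask(losses, top_k=2)
loss = hard_keypoint_loss(losses, top_k=2)
```

Choosing `top_k` trades off focus on hard keypoints against the amount of supervision each training step receives.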

In the multi-person pose estimation, methods are broadly divided into bottom-up and top-down approaches. The present method adopts the top-down approach: all persons are first located in the image and enclosed in bounding boxes, and the single-person pose estimation problem is then solved within each bounding box.

To achieve good performance, a top-down method for multi-person pose estimation requires both a person detector and a single-person pose estimator.
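The two-component top-down structure can be sketched as a simple loop. Both callables, `detect_people` and `estimate_single_pose`, are hypothetical placeholders standing in for the person detector and the single-person pose estimator; nothing here reflects a concrete detector or network.

```python
def estimate_poses(image, detect_people, estimate_single_pose):
    """Top-down pipeline: detect every person, then estimate one pose per box.

    detect_people(image)             -> list of bounding boxes (hypothetical)
    estimate_single_pose(image, box) -> list of keypoints      (hypothetical)
    """
    results = []
    for box in detect_people(image):
        keypoints = estimate_single_pose(image, box)
        results.append((box, keypoints))
    return results


# Stub components, used only to show the data flow.
def fake_detector(image):
    return [(0, 0, 10, 20), (30, 5, 12, 25)]


def fake_pose_estimator(image, box):
    x, y, w, h = box
    return [(x + w / 2.0, y + h / 2.0)]  # a single keypoint at the box centre


poses = estimate_poses("image placeholder", fake_detector, fake_pose_estimator)
```

Because the pose estimator only ever sees one box at a time, its accuracy depends directly on the quality of the detected boxes.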

Further, in the person detection, the detection method usually consists of two stages: bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes.

Further, in the cropping, each person's detection box is extended to a fixed aspect ratio, for example height : width = 256 : 192, and the image is then cropped without distorting its aspect ratio. Finally, the cropped image is resized to a default fixed size of 256 pixels in height and 192 pixels in width.
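Extending a detection box to the fixed 256 : 192 aspect ratio can be sketched as follows. The box format (top-left corner plus width and height) and the choice to grow the box around its centre are assumptions made for illustration; the function name is hypothetical.

```python
def extend_box_to_aspect(x, y, w, h, target_h=256, target_w=192):
    """Grow a box around its centre until height : width == target_h : target_w.

    (x, y) is the top-left corner. Only one side is enlarged, so the person
    inside the box is never squeezed or stretched.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    if h * target_w > w * target_h:
        # Box is too tall for the target ratio: widen it.
        w = h * target_w / target_h
    else:
        # Box is too wide (or already correct): make it taller.
        h = w * target_h / target_w
    return cx - w / 2.0, cy - h / 2.0, w, h


# A narrow 96 x 256 box is widened to 192 x 256.
tall = extend_box_to_aspect(10, 20, 96, 256)
# A wide 192 x 100 box is made taller, to 192 x 256.
wide = extend_box_to_aspect(0, 0, 192, 100)
```

The extended box can reach outside the image (note the negative coordinates in both examples), so a real crop would need padding before the final resize to 256 × 192 pixels.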

In the training and testing, validation is performed on a data set containing 5000 images, and the test sets comprise a test-dev set (20K images) and a test-challenge set (20K images). Most experiments are evaluated on the basis of object keypoint similarity (OKS), which defines the similarity between different human poses. After the image is cropped, the data are augmented by random flipping, random rotation (-40° to +40°), and random scaling (0.7 to 1.3).
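Drawing one set of augmentation parameters in the stated ranges can be sketched as below. The flip probability of 0.5 and the uniform sampling are assumptions, since the text gives only the ranges; `sample_augmentation` is a hypothetical name.

```python
import random


def sample_augmentation(rng):
    """Sample flip, rotation and scale parameters for one cropped image."""
    return {
        "flip": rng.random() < 0.5,                 # assumed 50% flip chance
        "rotation_deg": rng.uniform(-40.0, 40.0),   # random rotation range
        "scale": rng.uniform(0.7, 1.3),             # random scale range
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
params = [sample_augmentation(rng) for _ in range(100)]
```

When an image is flipped, the left and right keypoint labels have to be swapped as well, otherwise the ground truth no longer matches the image.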

Further, in the training, all pose estimation models are trained with the stochastic gradient descent algorithm. The initial learning rate is 5 × 10⁻⁴, and the learning rate is halved every 10 epochs. A weight decay of 10⁻⁵ is used, and batch normalization is used in the network.
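The learning-rate schedule described here is a step decay, which can be written out as a short function. Counting epochs from zero is an assumption, and `learning_rate` is a hypothetical helper rather than part of any training framework.

```python
def learning_rate(epoch, base_lr=5e-4, step=10, factor=2.0):
    """Step decay: start at base_lr and divide by `factor` every `step` epochs.

    Epochs are counted from 0 (an assumption).
    """
    return base_lr / (factor ** (epoch // step))


WEIGHT_DECAY = 1e-5  # weight decay stated in the text

schedule = [learning_rate(e) for e in (0, 9, 10, 25)]
```

Under these assumptions the rate stays at 5 × 10⁻⁴ for epochs 0 to 9, drops to 2.5 × 10⁻⁴ for epochs 10 to 19, and reaches 1.25 × 10⁻⁴ at epoch 20.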

Further, in the testing, a two-dimensional Gaussian filter is applied to the predicted heatmaps to reduce the variance of the prediction. The pose of the corresponding flipped image is also predicted, and the heatmaps are averaged to obtain the final prediction. The final position of each keypoint is obtained by shifting a quarter of the way from the highest response toward the second-highest response.
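The flip averaging and the quarter offset can be sketched as follows. This is one reading of the text, not a confirmed implementation: it assumes the two responses are the two largest cells of a single keypoint's heatmap and that the shift is 0.25 pixel along the direction between them. The Gaussian filtering step is omitted, and all function names are hypothetical.

```python
import math


def average_with_flip(heatmap, flipped_heatmap):
    """Mirror the flipped image's heatmap back horizontally and average the two."""
    return [[(a + b) / 2.0 for a, b in zip(row, reversed(flipped_row))]
            for row, flipped_row in zip(heatmap, flipped_heatmap)]


def refine_peak(heatmap, delta=0.25):
    """Return (x, y): the highest response shifted `delta` pixels toward
    the second-highest response. The heatmap needs at least two cells."""
    cells = [(value, x, y)
             for y, row in enumerate(heatmap)
             for x, value in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    (_, x1, y1), (_, x2, y2) = cells[0], cells[1]
    dist = math.hypot(x2 - x1, y2 - y1)
    if dist < 1e-3:
        return float(x1), float(y1)
    return x1 + delta * (x2 - x1) / dist, y1 + delta * (y2 - y1) / dist


# Highest response at (x=1, y=0), second highest at (x=2, y=0).
heatmap = [[0.1, 0.9, 0.5],
           [0.0, 0.2, 0.1]]
position = refine_peak(heatmap)
averaged = average_with_flip([[1.0, 3.0]], [[5.0, 7.0]])
```

For left/right keypoint pairs, the heatmap channels of the flipped prediction would also have to be swapped before averaging; this sketch handles a single channel only.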

Brief description of the drawings

Fig. 1 is a system framework diagram of the multi-person pose estimation method based on a cascaded pyramid network according to the present invention.

Fig. 2 shows the cascaded pyramid network structure of the multi-person pose estimation method based on a cascaded pyramid network according to the present invention.

Fig. 3 shows the output heatmaps of different features in the multi-person pose estimation method based on a cascaded pyramid network according to the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the present application and the features within them may be combined with one another. The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a system framework diagram of the multi-person pose estimation method based on a cascaded pyramid network according to the present invention. It mainly comprises the cascaded pyramid network (CPN), multi-person pose estimation, and training and testing.

Methods for multi-person pose estimation are broadly divided into bottom-up and top-down approaches. The present method adopts the top-down approach: all persons are first located in the image and enclosed in bounding boxes, and the single-person pose estimation problem is then solved within each bounding box.

The detection method usually consists of two stages: bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes.

Each person's detection box is extended to a fixed aspect ratio, for example height : width = 256 : 192, and the image is then cropped without distorting its aspect ratio. Finally, the cropped image is resized to a default fixed size of 256 pixels in height and 192 pixels in width.

For training and testing, validation is performed on a data set containing 5000 images, and the test sets comprise a test-dev set (20K images) and a test-challenge set (20K images). Most experiments are evaluated on the basis of object keypoint similarity (OKS), which defines the similarity between different human poses. After the image is cropped, the data are augmented by random flipping, random rotation (-40° to +40°), and random scaling (0.7 to 1.3).

All pose estimation models are trained with the stochastic gradient descent algorithm. The initial learning rate is 5 × 10⁻⁴, and the learning rate is halved every 10 epochs. A weight decay of 10⁻⁵ is used, and batch normalization is used in the network.

During testing, a two-dimensional Gaussian filter is applied to the predicted heatmaps to reduce the variance of the prediction. The pose of the corresponding flipped image is also predicted, and the heatmaps are averaged to obtain the final prediction. The final position of each keypoint is obtained by shifting a quarter of the way from the highest response toward the second-highest response.

Fig. 2 shows the cascaded pyramid network structure of the multi-person pose estimation method based on a cascaded pyramid network according to the present invention. The cascaded pyramid network comprises two sub-networks: a global network and a refinement network. The global network is a feature pyramid network that can locate "simple" keypoints such as the eyes and hands, but may fail to precisely locate occluded or invisible keypoints. The refinement network handles "hard" keypoints by integrating feature representations from all levels of the global network.

Fig. 3 shows the output heatmaps of different features in the multi-person pose estimation method based on a cascaded pyramid network according to the present invention. As shown in Fig. 3, the global network can locate the eye keypoints effectively but may fail to locate the hip precisely. Locating a keypoint such as the hip usually requires more contextual information rather than nearby appearance features. It is therefore often difficult to recognize these "hard" keypoints directly from the global network, and the refinement network is needed to handle them.

In the global network, the last residual blocks of the second through fifth convolutional stages are denoted C₂, C₃, …, C₅. 3 × 3 convolution filters are applied to C₂, C₃, …, C₅ to generate the keypoint heatmaps. Shallow features such as C₂ and C₃ have higher spatial resolution but carry less semantic information for recognition, whereas deeper feature layers such as C₄ and C₅ carry more semantic information as a result of convolution (and pooling) but have lower spatial resolution. A U-shaped structure is therefore usually used to integrate them, preserving both the spatial resolution and the semantic information of the feature layers. The feature pyramid network (FPN) further improves the U-shaped structure through deep supervision, and a similar feature pyramid structure is applied here to keypoint estimation. A 1 × 1 convolution kernel is applied before each element-wise summation in the upsampling process.

In the refinement network, to improve the efficiency of information transmission while preserving its integrity, information is passed across the different levels and finally integrated by upsampling and concatenation. The refinement network concatenates all pyramid features. In addition, more bottleneck blocks are stacked at the deeper levels, whose smaller spatial size gives a good trade-off between effectiveness and efficiency.

For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it can be realized in other specific forms without departing from its spirit and scope. Furthermore, those skilled in the art may make various modifications and variations to the present invention without departing from its spirit and scope, and such improvements and variations shall also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A multi-person pose estimation method based on a cascaded pyramid network, characterized in that it mainly comprises: (1) a cascaded pyramid network (CPN); (2) multi-person pose estimation; and (3) training and testing.
2. The cascaded pyramid network (CPN) (1) according to claim 1, characterized in that the cascaded pyramid network comprises two sub-networks, namely a global network and a refinement network; the global network is a feature pyramid network that can locate "simple" keypoints such as the eyes and hands, but may fail to precisely locate occluded or invisible keypoints; the refinement network handles "hard" keypoints by integrating feature representations from all levels of the global network.
3. The global network according to claim 2, characterized in that the last residual blocks of the second through fifth convolutional stages are denoted C₂, C₃, …, C₅; 3 × 3 convolution filters are applied to C₂, C₃, …, C₅ to generate the keypoint heatmaps; shallow features such as C₂ and C₃ have higher spatial resolution but carry less semantic information for recognition, whereas deeper feature layers such as C₄ and C₅ carry more semantic information as a result of convolution (and pooling) but have lower spatial resolution; a U-shaped structure is therefore usually used to integrate them, preserving both the spatial resolution and the semantic information of the feature layers; the feature pyramid network (FPN) further improves the U-shaped structure through deep supervision, and a similar feature pyramid structure is applied to keypoint estimation; a 1 × 1 convolution kernel is applied before each element-wise summation in the upsampling process.
4. The refinement network according to claim 2, characterized in that, to improve the efficiency of information transmission while preserving its integrity, information is passed across the different levels and finally integrated by upsampling and concatenation; the refinement network concatenates all pyramid features; in addition, more bottleneck blocks are stacked at the deeper levels, whose smaller spatial size gives a good trade-off between effectiveness and efficiency;

as training proceeds, the network tends to pay more and more attention to the majority of "simple" keypoints and less to occluded and hard keypoints; in the refinement network, the "hard" keypoints are therefore explicitly selected according to the training loss, and the loss is back-propagated only from the selected keypoints.
5. The multi-person pose estimation (2) according to claim 1, characterized in that methods for multi-person pose estimation are broadly divided into bottom-up and top-down approaches; the present method adopts the top-down approach, in which all persons are first located in the image and enclosed in bounding boxes, and the single-person pose estimation problem is then solved within each bounding box;

to achieve good performance, a top-down method for multi-person pose estimation requires both a person detector and a single-person pose estimator.
6. The person detection according to claim 5, characterized in that the detection method usually consists of two stages: bounding-box proposals are first generated from default anchors, then cropped from the feature map and further refined by a recursive convolutional neural network (R-CNN) to obtain the final bounding boxes.
7. The cropping according to claim 6, characterized in that each person's detection box is extended to a fixed aspect ratio, for example height : width = 256 : 192, and the image is then cropped without distorting its aspect ratio; finally, the cropped image is resized to a default fixed size of 256 pixels in height and 192 pixels in width.
8. The training and testing (3) according to claim 1, characterized in that validation is performed on a data set containing 5000 images, and the test sets comprise a test-dev set (20K images) and a test-challenge set (20K images); most experiments are evaluated on the basis of object keypoint similarity (OKS), which defines the similarity between different human poses; after the image is cropped, the data are augmented by random flipping, random rotation (-40° to +40°), and random scaling (0.7 to 1.3).
9. The training according to claim 8, characterized in that all pose estimation models are trained with the stochastic gradient descent algorithm; the initial learning rate is 5 × 10⁻⁴; the learning rate is halved every 10 epochs; a weight decay of 10⁻⁵ is used, and batch normalization is used in the network.
10. The testing according to claim 8, characterized in that, during testing, a two-dimensional Gaussian filter is applied to the predicted heatmaps to reduce the variance of the prediction; the pose of the corresponding flipped image is also predicted, and the heatmaps are averaged to obtain the final prediction; the final position of each keypoint is obtained by shifting a quarter of the way from the highest response toward the second-highest response.