CN109190537A - A multi-person pose estimation method based on mask-aware deep reinforcement learning - Google Patents

A multi-person pose estimation method based on mask-aware deep reinforcement learning

Info

Publication number
CN109190537A
CN109190537A (application CN201810968949.9A)
Authority
CN
China
Prior art keywords
network
pose estimation
person
mask
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810968949.9A
Other languages
Chinese (zh)
Other versions
CN109190537B (en)
Inventor
田彦
王勋
吴佳辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yunqi Smart Vision Technology Co Ltd
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN201810968949.9A priority Critical patent/CN109190537B/en
Publication of CN109190537A publication Critical patent/CN109190537A/en
Application granted granted Critical
Publication of CN109190537B publication Critical patent/CN109190537B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person pose estimation method based on mask-aware deep reinforcement learning. The method first constructs a multi-person pose estimation model composed of three sub-networks: a detection network that produces detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network. The model is then trained with training samples. At test time, an image to be detected is fed into the trained model to obtain the pose of each person in all detection boxes of the image. The method introduces mask information into both the deep reinforcement learning network and the single-person pose estimation network, improving the effect of both stages, and employs residual structures to alleviate the vanishing-gradient and exploding-gradient problems. Compared with other state-of-the-art multi-person pose estimation methods, the proposed method is competitive.

Description

A multi-person pose estimation method based on mask-aware deep reinforcement learning
Technical field
The present invention relates to human pose estimation techniques, and in particular to a multi-person pose estimation method based on mask-aware deep reinforcement learning.
Background art
With the deployment of large numbers of multimedia sensors and the wide application of motion capture in fields such as fashion design, clinical analysis, human-computer interaction, activity recognition, and motor rehabilitation, human pose estimation has become a focus of attention in the multimedia industry.
Recently, single-person pose estimation has made remarkable progress by using deep-learning-based architectures. Multi-person pose estimation, however, which judges the poses of multiple people in an image, especially individuals in a crowd, remains an arduous task. Its main difficulties are as follows. First, the number of people in an image is unknown, and a person may appear at any position and at any scale. Second, people in an image interact in certain ways, such as occlusion, conversation, and contact, which makes the estimation harder. Third, the computational complexity grows with the number of people in the image, so designing an efficient algorithm is itself a challenge. These difficulties are illustrated in Fig. 1(a)-(d).
Top-down and bottom-up approaches are the two main ways to handle human pose estimation. Top-down methods apply a detector followed by a single-person pose estimator to each detected person. However, when people are very close to each other, the single-person detector may fail, and the computational cost grows with the number of people in the picture. Bottom-up methods, in contrast, first detect joints and then judge each person's pose using local context; because the final parsing step requires global information, their efficiency cannot be guaranteed.
Because the detections produced by both top-down and bottom-up methods can be inaccurately localized, the accuracy of multi-person pose estimation is further reduced. The relationship between detection results and pose estimation results is shown in Fig. 1(e). In object detection, a detection box generated by a deep-learning model is considered correct if its intersection over union (IoU) with the ground-truth box exceeds 0.5. Such redundant, loosely localized detections, however, are harmful to human pose estimation, so the original detection boxes need to be corrected. Deep reinforcement learning is an effective way to do this: it can select the best action according to environmental information to reach an optimal value.
Summary of the invention
In view of the above deficiencies of the prior art, it is an object of the present invention to provide a multi-person pose estimation method based on mask-aware deep reinforcement learning that effectively improves pose estimation accuracy.
The object of the present invention is achieved through the following technical solution: a multi-person pose estimation method based on mask-aware deep reinforcement learning, comprising the following steps:
(1) Construct a multi-person pose estimation model. The model consists of three sub-networks: a detection network that produces detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network.
(1.1) Detection network: the detection boxes of the original image and the binary human mask within each detection box are obtained through a multi-task learning network.
(1.2) Deep reinforcement learning network: unlike existing sampling-based calibration of localization results, the present invention formulates the refinement as a Markov decision process; the detection box is updated through a recursive reward-or-penalty learning process. The goal of this part is to learn an optimal policy function that maps a state S to an action A.
In computer vision, most deep reinforcement learning methods use feature maps as the state vector. However, cluttered background produces high activation values in the feature map, which interferes with the refinement result and in turn affects the pose estimation process. In the present invention, the environment state is defined as a tuple (h, i), where h is the history-decision vector from the decision network and i is the feature map with mask. A pre-trained convolutional neural network model f_1 extracts the original feature map from image x; the feature map is then passed to a multi-task network f_2, whose mask output serves as an attention map over the feature map. The expression for i is as follows:
i = f_2(f_1(x)) ⊙ f_1(x)
where ⊙ denotes the Hadamard (element-wise) product.
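For illustration only (not part of the claimed method), the masked feature map i can be sketched in NumPy, treating f_1(x) as a (C, H, W) feature map and the mask output of f_2 as an (H, W) array in [0, 1]:

```python
import numpy as np

def masked_feature_map(feat: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hadamard (element-wise) product of a feature map with a foreground
    mask, suppressing activations that come from cluttered background.
    feat: (C, H, W) feature map f_1(x); mask: (H, W) output of f_2."""
    return mask[None, :, :] * feat  # broadcast the mask over all channels

# Toy example: activations where mask == 0 (background) are zeroed out.
feat = np.ones((2, 4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                # person occupies the centre 2x2 region
i = masked_feature_map(feat, mask)
```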
With an accurate foreground mask, the redundant information in the feature map is removed. The masked feature map provides both low-level information, such as shape and contour, and high-level information, such as the pose of the human body, which facilitates the refinement process.
The binary human mask produced by the detection network is resized to match the fully connected layer of the deep reinforcement learning network, multiplied with the detection-box image, and used as the input of the deep reinforcement learning network. The output of the network is the reward value of each of 11 box-adjustment actions.
The box-adjustment actions fall into four classes: scaling actions (shrink and enlarge), translation actions (in the four directions up, down, left, and right), a termination action (whether to stop the box refinement), and aspect-ratio adjustment actions (increase or decrease of the width, and increase or decrease of the height). To keep the detector relatively stable, each action moves or scales the current window by 0.1 times the window size.
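As an illustrative sketch (the action names and box encoding are hypothetical; boxes are (x, y, w, h) tuples and the step factor 0.1 is taken from the text), the 11-action set and its effect on a box can be written as:

```python
ACTIONS = ["shrink", "enlarge",                            # scaling
           "move_up", "move_down", "move_left", "move_right",  # translation
           "wider", "narrower", "taller", "shorter",       # aspect ratio
           "terminate"]                                    # stop refinement
DELTA = 0.1  # each action moves/scales by 0.1x the window size

def apply_action(box, action):
    """Apply one box-adjustment action to box = (x, y, w, h)."""
    x, y, w, h = box
    dx, dy = DELTA * w, DELTA * h
    if action == "shrink":     return (x + dx/2, y + dy/2, w - dx, h - dy)
    if action == "enlarge":    return (x - dx/2, y - dy/2, w + dx, h + dy)
    if action == "move_up":    return (x, y - dy, w, h)
    if action == "move_down":  return (x, y + dy, w, h)
    if action == "move_left":  return (x - dx, y, w, h)
    if action == "move_right": return (x + dx, y, w, h)
    if action == "wider":      return (x - dx/2, y, w + dx, h)
    if action == "narrower":   return (x + dx/2, y, w - dx, h)
    if action == "taller":     return (x, y - dy/2, w, h + dy)
    if action == "shorter":    return (x, y + dy/2, w, h - dy)
    return box  # "terminate": box unchanged, refinement stops
```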
The action with the highest reward value is selected to adjust the detection box, and the newly obtained detection-box image is fed back into the deep reinforcement learning network iteratively until the highest-reward action is the termination action, at which point the refined detection box is output.
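The refinement loop described above can be sketched as follows (a minimal sketch with a hypothetical interface: `q_values(box)` stands for a forward pass of a trained network, and the toy policy below is purely for demonstration):

```python
def refine_box(box, q_values, step, actions, max_steps=20):
    """Repeatedly apply the highest-reward action until the best
    action is 'terminate' (or a step limit is reached)."""
    for _ in range(max_steps):
        q = q_values(box)
        best = actions[max(range(len(actions)), key=lambda i: q[i])]
        if best == "terminate":
            break
        box = step(box, best)
    return box

# Toy policy: prefers "move_right" until x >= 3, then terminates.
actions = ["move_right", "terminate"]
step = lambda box, a: (box[0] + 1, *box[1:]) if a == "move_right" else box
policy = lambda box: [1.0, 0.0] if box[0] < 3 else [0.0, 1.0]
refined = refine_box((0, 0, 10, 10), policy, step, actions)
```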
(1.3) Single-person pose estimation network: the mask and the refined detection-box image are passed to a single-person pose estimation network to obtain the single-person pose.
(2) Train the multi-person pose estimation model with training samples. Feed an image to be detected into the trained multi-person pose estimation model to obtain the pose of each person in all detection boxes of the image.
Further, the detection network uses a two-stage pipeline. In the first stage, a deep residual network extracts the feature map of the original image and an RPN generates several candidate boxes. In the second stage, each candidate box is passed to three branches for multi-task learning, producing the classification confidence, the detection-box offsets, and the binary human mask within the detection box.
Further, the detection network uses the following joint loss function in each second-stage branch:
L = L_cls + α_1 L_box + α_2 L_mask
where L_cls is the classification loss, expressed as a cross entropy; L_box is the localization loss, measuring the difference between the detection box and the ground-truth box with an L1 norm; L_mask is the segmentation loss, expressed as an average binary cross entropy; and α_1 and α_2 are coefficients balancing the three losses.
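As a minimal numerical sketch of this joint loss (the component losses here are toy stand-ins for the branch outputs; α_1 = 4.0 and α_2 = 10.0 are the values given later in the training details):

```python
import numpy as np

def cls_loss(probs, label):
    """Cross-entropy classification loss for a predicted probability vector."""
    return float(-np.log(probs[label] + 1e-12))

def box_loss(pred_box, gt_box):
    """L1 localization loss between detection box and ground-truth box."""
    return float(np.abs(np.asarray(pred_box) - np.asarray(gt_box)).sum())

def mask_loss(pred_mask, gt_mask):
    """Average binary cross-entropy segmentation loss."""
    p = np.clip(pred_mask, 1e-12, 1 - 1e-12)
    return float(np.mean(-(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))))

def joint_loss(l_cls, l_box, l_mask, alpha1=4.0, alpha2=10.0):
    """Joint loss L = L_cls + a1 * L_box + a2 * L_mask of one branch."""
    return l_cls + alpha1 * l_box + alpha2 * l_mask
```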
Further, the deep reinforcement learning network comprises a sequentially connected 8×8 convolutional layer, 4×4 convolutional layer, and 3×3 convolutional layer. The output of the 3×3 convolutional layer feeds two branches: one obtains the 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other obtains the state-value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional parameters, α and β are the parameters of the respective fully connected branches, a is a box-adjustment action, and s is the input of the deep reinforcement learning network. The Q function is obtained by adding the advantage function to the state-value function, and the reward value of each action is computed from the Q function: Q(s, a; θ, α, β) = V(s; θ, β) + A(a, s; θ, α).
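The dueling head can be sketched numerically as follows. This is a toy stand-in: random linear layers replace the trained convolution stack, and Q(s, a) = V(s) + A(a, s) is combined exactly as stated in the text (without the mean-subtraction used in some dueling-DQN variants):

```python
import numpy as np

N_ACTIONS = 11

def dueling_q(features, w_adv, w_val):
    """Combine an 11-dim advantage branch and a 1-dim value branch
    into Q(s, a) = V(s) + A(a, s)."""
    advantage = features @ w_adv          # A(a, s): shape (11,)
    value = float(features @ w_val)       # V(s): scalar
    return value + advantage              # Q(s, a): shape (11,)

rng = np.random.default_rng(0)
feats = rng.standard_normal(64)           # stand-in for conv features
w_adv = rng.standard_normal((64, N_ACTIONS))
w_val = rng.standard_normal(64)
q = dueling_q(feats, w_adv, w_val)
best_action = int(np.argmax(q))           # a = argmax_a Q(s, a)
```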
Further, in the deep reinforcement learning network, the reward value r of the current iteration is expressed as:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term, and the second term λ b′/B′ is a regularization term added to constrain the detection-box size. Here w_i and w′_i respectively denote the detection box of target i before and after action a, g_i denotes the ground-truth box, b′ denotes the intersection area of the detection box and the ground-truth box after action a, B′ denotes the area of the detection box after action a, IoU denotes intersection over union, and λ is a scale factor balancing the reward term and the regularization term (its value is determined during parameter tuning and is generally 1 to 10).
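The reward computation can be sketched directly from these definitions, assuming axis-aligned boxes given as (x1, y1, x2, y2):

```python
def inter_area(a, b):
    """Intersection area of two boxes (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def iou(a, b):
    """Intersection over union of two boxes."""
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    inter = inter_area(a, b)
    return inter / (area(a) + area(b) - inter)

def reward(box_before, box_after, gt, lam=1.0):
    """r = (IoU(w', g) - IoU(w, g)) + lam * b'/B'."""
    b_prime = inter_area(box_after, gt)   # intersection with gt after action
    B_prime = (box_after[2] - box_after[0]) * (box_after[3] - box_after[1])
    return (iou(box_after, gt) - iou(box_before, gt)) + lam * b_prime / B_prime
```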
The termination action is an additional action: it does not move the detection box, and only judges, within the reinforcement learning process, whether the optimal result has been found. Its reward value is defined as:
r_T = η if the IoU of the detection box and the ground-truth box is at least τ, and r_T = -η otherwise,
where τ is the IoU threshold that decides whether the reward is positive or negative, and η is the corresponding reward value.
An action a is selected according to the Q function, where Q(s, a) represents the current plus future cumulative reward:
a = argmax_a Q(s, a)
Q(s, a) = r + γ max_{a′} Q(s′, a′)
The loss function loss(θ) for training the Q function is expressed as:
loss(θ) = E[(r + γ max_{a′} Q(s′, a′, θ) − Q(s, a, θ))²]
where θ denotes the parameters of the deep reinforcement learning network; s and a are respectively the input and the box-adjustment action of the current iteration, and s′ and a′ those of the next iteration; Q(s, a, θ) is the cumulative reward starting from the current iteration and Q(s′, a′, θ) the cumulative reward starting from the next iteration; r is the reward value of the current iteration; γ is the discount factor; and E denotes the expectation of the loss over all iterations.
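A minimal sketch of the temporal-difference training step (scalar Q values for illustration; in the patent Q is computed by the deep Q network):

```python
import numpy as np

def td_loss(q_sa, reward, q_next_max, gamma=0.9):
    """Squared TD error: (r + gamma * max_a' Q(s', a') - Q(s, a))^2."""
    target = reward + gamma * q_next_max
    return (target - q_sa) ** 2

# Toy check: when Q already satisfies the Bellman equation, the loss is 0.
gamma = 0.9
q_next = np.array([0.2, 1.0, -0.3])          # Q(s', a') for three actions
q_sa = 0.5 + gamma * q_next.max()            # reward r = 0.5
loss = td_loss(q_sa, 0.5, q_next.max(), gamma)
```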
Further, in the deep reinforcement learning network, the following techniques are used to improve the learning efficiency of the parameters θ:
(a) To improve training stability, a target network separate from the online network is introduced; the online network is updated every iteration, while the target network is updated at fixed intervals.
(b) To avoid falling into local minima, an ε-greedy strategy is used as the action policy.
(c) To solve the data-dependence problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, and during training a fixed number of samples is drawn at random from the buffer to reduce the correlation between data.
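Techniques (b) and (c) can be sketched together as follows (the buffer capacity of 10,000 and batch size of 32 are the values given later in the training details; the class and function names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') transitions; uniform random
    sampling breaks the correlation between consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)   # oldest entries are evicted
    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))
    def sample(self, batch_size=32):
        return random.sample(self.buf, batch_size)

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore a random action; otherwise
    exploit the action with the highest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```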
Further, the deep reinforcement learning network adopts the dueling DQN structure, which can quickly identify the correct action during policy evaluation. Like a standard Q network, the dueling structure is trained with ordinary backpropagation and requires no extra supervision or algorithmic modification; it automatically estimates V(s; θ, β) and A(a, s; θ, α).
Further, the single-person pose estimation network combines the binary human mask produced by the detection network with a cascaded pyramid network (CPN) to perform human pose estimation. The loss function of the single-person pose estimation network is:
L = L_inf + k L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term measuring the error between the predicted pose and the binary human mask, and k is a scale factor balancing the two (set from practical experience, generally 1 to 5); L_mask = Σ_p L_p, where p indexes the human joints, and
in L_p, the predicted value of joint p is taken at position l, the position of maximum activation in its activation map, and m_l is the binary human mask at position l, with 1 indicating the human region and 0 the background. If a joint lies outside the human region, the result is penalized; otherwise the loss function is unaffected.
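The exact per-joint formula L_p appears in the patent as an image and is not reproduced here; the sketch below assumes the form L_p = h_p(l) · (1 − m_l), which matches the verbal description (the peak activation is penalized exactly when it falls outside the mask), but this form is an assumption:

```python
import numpy as np

def mask_regularizer(heatmaps: np.ndarray, mask: np.ndarray) -> float:
    """L_mask = sum_p L_p under the ASSUMED L_p = h_p(l) * (1 - m_l):
    for each joint p, l is the argmax of its activation map, and the peak
    value is penalized iff the binary mask is 0 (background) at l."""
    total = 0.0
    for hm in heatmaps:                       # one activation map per joint
        l = np.unravel_index(np.argmax(hm), hm.shape)
        total += hm[l] * (1.0 - mask[l])      # zero when the peak is on the body
    return total

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                          # person occupies the centre
inside = np.zeros((4, 4)); inside[1, 1] = 0.9    # joint peak on the body
outside = np.zeros((4, 4)); outside[0, 3] = 0.7  # joint peak in background
penalty = mask_regularizer(np.stack([inside, outside]), mask)
```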
Further, the training stage of the multi-person pose estimation model is computed on a GPU.
Preferably, the training details of the detection network are as follows. In the loss function, α_1 and α_2 are set to 4.0 and 10.0 respectively. The whole network is trained with stochastic gradient descent with momentum 0.9 and weight decay 0.0005; the learning rate is 0.01 for the first 60,000 iterations and 0.001 for the last 20,000. In each batch, 48 positive samples are taken from 4 training pictures and 48 negative samples from cluttered background. At the verification stage, the confidence threshold is set to 0.7 and the IoU threshold used for localization to 0.6.
In the refinement process based on deep reinforcement learning, the buffer holds 10,000 transitions and each batch contains 32 samples. λ in the loss function is set between 1 and 10. An ε-greedy strategy is used during the experiments; after 5,000 training steps, ε drops from 0.3 to 0.05. The discount factor γ is 0.9.
In the single-person pose estimation stage, k in the loss function is set to 0.4. The model is trained with stochastic gradient descent with an initial learning rate of 0.0005, halved after every 10 passes over the dataset. The weight decay is 0.00005, and batch normalization is used.
Compared with the prior art, the present invention has the following beneficial effects:
1. The proposed multi-person pose estimation method based on mask-aware deep reinforcement learning improves detection accuracy.
2. Mask information is used to eliminate the negative effects of cluttered background information, and the best action is selected according to a reward function.
3. A regularization term is added at the pose estimation stage to penalize joints outside the human contour.
4. Tested on the MPII test set, the multi-person pose estimation model improves mean Average Precision (mAP) by 1.1 over prior-art models; on the MS-COCO test-dev set it reaches an average precision of 73.0.
Brief description of the drawings
Fig. 1 illustrates the main difficulties of multi-person pose estimation according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the multi-person pose estimation framework based on mask-aware deep reinforcement learning;
Fig. 3 shows activation maps inside detection boxes at the detection stage;
Fig. 4 illustrates the different actions;
Fig. 5 is a schematic diagram of the deep Q network;
Fig. 6 is a schematic diagram of the mask-aware pose estimation framework;
Fig. 7 shows the accuracy curves of the state and reward functions;
Fig. 8 shows test results on the MPII dataset.
Specific embodiment
To describe the present invention more specifically, the technical solution of the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The multi-person pose estimation method provided in this embodiment obtains the positions and poses of a variable number of people in an image, and can be applied to multimedia industries such as clinical analysis, human-computer interaction, and activity recognition.
This embodiment obtains detection boxes and masks with a multi-task learning network, refines localization with a deep reinforcement learning network, and finally estimates the pose of the person in each detection box with a single-person pose estimation network. The embodiment is explained below with reference to the drawings.
Fig. 1 illustrates the difficulties of multi-person pose estimation: (a) shows that the number and positions of people in a picture are unknown; (b), (c) and (d) show occlusion, conversation and contact respectively, reflecting the interactions between people; (e) shows the relationship between box detection and human pose estimation.
Fig. 2 shows the framework of multi-person pose estimation based on mask-aware deep reinforcement learning: a multi-task network simultaneously produces detection boxes and person masks, and the localization result is refined with the deep reinforcement learning network; finally, the pose of each person is estimated with an hourglass network. Mask information is exploited in both the refinement and estimation stages.
Fig. 3 shows activation maps inside detection boxes at the detection stage: the original image (a) is passed through a convolutional neural network to obtain the activation map (b); it can be seen that cluttered background also produces high activation values in (b). Fig. 3(c) shows that with an accurate foreground mask the redundant information in the feature map is removed.
Fig. 4 illustrates the four classes of actions: scaling, translation, termination, and aspect-ratio adjustment.
Fig. 5 is a schematic diagram of the deep Q network, which comprises a sequentially connected 8×8 convolutional layer, 4×4 convolutional layer, and 3×3 convolutional layer. The output of the 3×3 convolutional layer feeds two branches: one obtains the 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other obtains the state-value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional parameters, α and β are the parameters of the respective fully connected branches, a is a box-adjustment action, and s is the input of the deep reinforcement learning network. The Q function is obtained by adding the advantage function to the state-value function, and the reward value of each action is computed from the Q function.
Fig. 6 is a schematic diagram of the mask-combined pose estimation framework. The single-person pose estimation network combines the binary human mask produced by the detection network with a cascaded pyramid network (CPN) to perform human pose estimation. Its loss function is:
L = L_inf + k L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term measuring the error between the predicted pose and the binary human mask, and k is a scale factor balancing the two (set from practical experience, generally 1 to 5); L_mask = Σ_p L_p, where p indexes the human joints. In L_p, the predicted value of joint p is taken at position l, the position of maximum activation in its activation map, and m_l is the binary human mask at position l, with 1 indicating the human region and 0 the background. If a joint lies outside the human region, the result is penalized; otherwise the loss function is unaffected.
Fig. 7 shows the accuracy curves of the state and reward functions: (a) the training accuracy of the state, (b) the test accuracy of the state, (c) the training accuracy of the reward function, and (d) the test accuracy of the reward function.
Multi-person pose estimation was performed on the MPII dataset using this embodiment; the experimental results are shown in Fig. 8, where (a) shows successful predictions and (b) failed predictions. From the failure cases we conclude: (1) although the detection method is improved, the top-down pipeline is still affected by early commitment; (2) our method is best suited to people who appear within the estimated region and have little interaction with others.
The specific embodiments described above explain the technical solution and beneficial effects of the present invention in detail. It should be understood that the above are only preferred embodiments of the invention and are not intended to limit it; any modification, supplement or equivalent replacement made within the scope of the principles of the invention shall be included in the protection scope of the invention.

Claims (10)

1. A multi-person pose estimation method based on mask-aware deep reinforcement learning, characterized in that the method comprises the following steps:
(1) constructing a multi-person pose estimation model, the model consisting of three sub-networks: a detection network that produces detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network;
the detection network being as follows: the detection boxes of the original image and the binary human mask within each detection box are obtained through a multi-task learning network;
the deep reinforcement learning network being as follows: the binary human mask produced by the detection network is resized to match the fully connected layer of the deep reinforcement learning network, multiplied with the detection-box image, and used as the input of the deep reinforcement learning network; the output of the deep reinforcement learning network is the reward value of each of 11 box-adjustment actions; the box-adjustment actions fall into four classes: scaling, translation, termination, and aspect-ratio adjustment; the action with the highest reward value is selected to adjust the detection box, and the newly obtained detection-box image is fed back into the deep reinforcement learning network iteratively until the highest-reward action is the termination action, whereupon the refined detection box is output;
the single-person pose estimation network being as follows: the mask and the refined detection-box image are passed to a single-person pose estimation network to obtain the single-person pose;
(2) training the multi-person pose estimation model with training samples; and feeding an image to be detected into the trained multi-person pose estimation model to obtain the pose of each person in all detection boxes of the image.
2. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the detection network uses a two-stage pipeline: in the first stage, a deep residual network extracts the feature map of the original image and an RPN generates several candidate boxes; in the second stage, each candidate box is passed to three branches for multi-task learning, producing the classification confidence, the detection-box offsets, and the binary human mask within the detection box.
3. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 2, characterized in that the detection network uses the following joint loss function in each second-stage branch:
L = L_cls + α_1 L_box + α_2 L_mask
where L_cls is the classification loss, expressed as a cross entropy; L_box is the localization loss, measuring the difference between the detection box and the ground-truth box with an L1 norm; L_mask is the segmentation loss, expressed as an average binary cross entropy; and α_1 and α_2 are coefficients balancing the three losses.
4. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the deep reinforcement learning network comprises a sequentially connected 8×8 convolutional layer, 4×4 convolutional layer, and 3×3 convolutional layer; the output of the 3×3 convolutional layer feeds two branches: one obtains the 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other obtains the state-value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional parameters, α and β are the parameters of the respective fully connected branches, a is a box-adjustment action, and s is the input of the deep reinforcement learning network; the Q function is obtained by adding the advantage function to the state-value function, and the reward value of each action is computed from the Q function.
5. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 4, characterized in that in the deep reinforcement learning network the loss function loss(θ) of the Q function is expressed as:
loss(θ) = E[(r + γ max_{a′} Q(s′, a′, θ) − Q(s, a, θ))²]
where θ denotes the parameters of the deep reinforcement learning network; s and a are respectively the input and the box-adjustment action of the current iteration, and s′ and a′ those of the next iteration; Q(s, a, θ) is the cumulative reward starting from the current iteration and Q(s′, a′, θ) the cumulative reward starting from the next iteration; r is the reward value of the current iteration; γ is the discount factor; and E denotes the expectation of the loss over all iterations.
6. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 5, characterized in that in the deep reinforcement learning network the reward value r of the current iteration is expressed as:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term, the second term λ b′/B′ is a regularization term added to constrain the detection-box size, w_i and w′_i respectively denote the detection box of target i before and after action a, g_i denotes the ground-truth box, b′ denotes the intersection area of the detection box and the ground-truth box after action a, B′ denotes the area of the detection box after action a, IoU denotes intersection over union, and λ is a scale factor balancing the reward term and the regularization term.
7. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 5, characterized in that, in the deep reinforcement learning network, the following measures are adopted to improve the learning efficiency of the parameters θ:
(a) to improve learning stability, a target network is introduced and kept separate from the online network; the online network is updated at every iteration, while the target network is updated only at fixed intervals;
(b) to avoid falling into local minima, an ε-greedy strategy is used as the action policy;
(c) to address the data-correlation problem, experience replay is used: each transition (s, a, r, s′) is stored in a buffer, and during training a fixed number of samples is drawn at random from the buffer to reduce the correlation between data.
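Measures (b) and (c) can be sketched as follows; this is a generic illustration of ε-greedy selection and an experience-replay buffer, not the patent's code, and the `capacity`/`batch_size` values are arbitrary:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay per measure (c): store (s, a, r, s′) tuples and
    sample a fixed-size random minibatch to break temporal correlation."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Measure (b): with probability ε take a random action (exploration),
    otherwise the greedy argmax action (exploitation)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

Measure (a), the target network, would be a second copy of the Q-network whose weights are overwritten from the online network only every N iterations.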
8. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the deep reinforcement learning network adopts a dueling DQN structure, which can quickly identify correct actions during policy evaluation.
9. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the single-person pose estimation network fuses the human-body binary mask obtained by the detection network with a cascaded pyramid network to perform human pose estimation; the loss function of the single-person pose estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term measuring the error between the predicted single-person pose and the human-body binary mask, and k is a scale factor balancing the two; L_mask = Σ_p L_p, where p indexes the human joint points;
where the predicted value of joint p is taken at position l, the location of the maximum activation value in its activation map; m_l is the value of the human-body binary mask at position l, with 1 indicating the human region and 0 the background region; if a joint is predicted outside the human region the result is penalized, otherwise the loss function is unaffected.
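The claim's explicit formula for L_p is not preserved here; one plausible reading consistent with the description is L_p = (1 − m_l)·h_p(l), where h_p(l) is joint p's peak activation: zero cost inside the mask, peak-sized cost in the background. A short NumPy sketch of that reading:

```python
import numpy as np

def mask_loss(heatmaps, mask):
    """L_mask = Σ_p L_p with the assumed L_p = (1 − m_l) · h_p(l), where
    l is the argmax of joint p's activation map: a joint contributes
    nothing when its peak lies inside the human mask (m_l = 1) and is
    penalized by its peak activation when it falls in background (m_l = 0)."""
    total = 0.0
    for hm in heatmaps:                               # one map per joint p
        l = np.unravel_index(np.argmax(hm), hm.shape)  # peak position l
        total += (1.0 - mask[l]) * hm[l]
    return total

def total_loss(l_inf, heatmaps, mask, k):
    """L = L_inf + k · L_mask: pose-error term plus mask regularizer."""
    return l_inf + k * mask_loss(heatmaps, mask)
```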
10. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the training stage of the multi-person pose estimation model is accelerated using a GPU.
CN201810968949.9A 2018-08-23 2018-08-23 Mask perception depth reinforcement learning-based multi-person attitude estimation method Active CN109190537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810968949.9A CN109190537B (en) 2018-08-23 2018-08-23 Mask perception depth reinforcement learning-based multi-person attitude estimation method


Publications (2)

Publication Number Publication Date
CN109190537A true CN109190537A (en) 2019-01-11
CN109190537B CN109190537B (en) 2020-09-29

Family

ID=64919381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810968949.9A Active CN109190537B (en) 2018-08-23 2018-08-23 Mask perception depth reinforcement learning-based multi-person attitude estimation method

Country Status (1)

Country Link
CN (1) CN109190537B (en)



Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150544A (en) * 2011-08-30 2013-06-12 Seiko Epson Corporation Method and apparatus for object pose estimation
US20180151090A1 (en) * 2015-04-22 2018-05-31 Jeffrey B. Matthews Visual and kinesthetic method and educational kit for solving algebraic linear equations involving an unknown variable
US20180096478A1 (en) * 2016-09-30 2018-04-05 Siemens Healthcare Gmbh Atlas-based contouring of organs at risk for radiation therapy
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN106897697A (en) * 2017-02-24 2017-06-27 深圳市唯特视科技有限公司 A kind of personage and pose detection method based on visualization compiler
CN106951512A (en) * 2017-03-17 2017-07-14 深圳市唯特视科技有限公司 A kind of end-to-end session control method based on hybrid coding network
CN107392118A (en) * 2017-07-04 2017-11-24 竹间智能科技(上海)有限公司 The recognition methods of reinforcing face character and the system of generation network are resisted based on multitask
CN107944443A (en) * 2017-11-16 2018-04-20 深圳市唯特视科技有限公司 One kind carries out object consistency detection method based on end-to-end deep learning
CN108256489A (en) * 2018-01-24 2018-07-06 清华大学 Behavior prediction method and device based on deeply study
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KOIZUMI, YUMA et al.: "DNN-Based Source Enhancement Self-Optimized by Reinforcement Learning Using Sound Quality Measurements", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
YAN TIAN et al.: "Canonical Locality Preserving Latent Variable Model for Discriminative Pose Inference", Image and Vision Computing *
LU Huchuan et al.: "Survey of Object Tracking Algorithms" (in Chinese), Pattern Recognition and Artificial Intelligence *
SU Yanchao et al.: "Human Pose Estimation Based on Part Detectors in Images and Videos" (in Chinese), Journal of Electronics & Information Technology *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766887B (en) * 2019-01-16 2022-11-11 中国科学院光电技术研究所 Multi-target detection method based on cascaded hourglass neural network
CN109766887A (en) * 2019-01-16 2019-05-17 中国科学院光电技术研究所 A kind of multi-target detection method based on cascade hourglass neural network
CN109784296A (en) * 2019-01-27 2019-05-21 武汉星巡智能科技有限公司 Bus occupant quantity statistics method, device and computer readable storage medium
CN110008915A (en) * 2019-04-11 2019-07-12 电子科技大学 The system and method for dense human body attitude estimation is carried out based on mask-RCNN
CN110008915B (en) * 2019-04-11 2023-02-03 电子科技大学 System and method for estimating dense human body posture based on mask-RCNN
CN110188219A (en) * 2019-05-16 2019-08-30 复旦大学 Deeply de-redundancy hash algorithm towards image retrieval
CN110188219B (en) * 2019-05-16 2023-01-06 复旦大学 Depth-enhanced redundancy-removing hash method for image retrieval
CN110222636A (en) * 2019-05-31 2019-09-10 中国民航大学 The pedestrian's attribute recognition approach inhibited based on background
CN110210402A (en) * 2019-06-03 2019-09-06 北京卡路里信息技术有限公司 Feature extracting method, device, terminal device and storage medium
CN110197163A (en) * 2019-06-04 2019-09-03 中国矿业大学 A kind of target tracking sample extending method based on pedestrian's search
CN110197163B (en) * 2019-06-04 2021-02-12 中国矿业大学 Target tracking sample expansion method based on pedestrian search
CN110415332A (en) * 2019-06-21 2019-11-05 上海工程技术大学 Complex textile surface three dimensional reconstruction system and method under a kind of non-single visual angle
CN112184802A (en) * 2019-07-05 2021-01-05 杭州海康威视数字技术股份有限公司 Calibration frame adjusting method and device and storage medium
CN112184802B (en) * 2019-07-05 2023-10-20 杭州海康威视数字技术股份有限公司 Calibration frame adjusting method, device and storage medium
CN112241976A (en) * 2019-07-19 2021-01-19 杭州海康威视数字技术股份有限公司 Method and device for training model
CN110569719B (en) * 2019-07-30 2022-05-17 中国科学技术大学 Animal head posture estimation method and system
CN110569719A (en) * 2019-07-30 2019-12-13 中国科学技术大学 animal head posture estimation method and system
CN110866872A (en) * 2019-10-10 2020-03-06 北京邮电大学 Pavement crack image preprocessing intelligent selection method and device and electronic equipment
CN110866872B (en) * 2019-10-10 2022-07-29 北京邮电大学 Pavement crack image preprocessing intelligent selection method and device and electronic equipment
CN111415389A (en) * 2020-03-18 2020-07-14 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
WO2021184530A1 (en) * 2020-03-18 2021-09-23 清华大学 Reinforcement learning-based label-free six-dimensional item attitude prediction method and device
CN111415389B (en) * 2020-03-18 2023-08-29 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
CN111738091A (en) * 2020-05-27 2020-10-02 复旦大学 Posture estimation and human body analysis system based on multi-task deep learning
CN111695457A (en) * 2020-05-28 2020-09-22 浙江工商大学 Human body posture estimation method based on weak supervision mechanism
CN111695457B (en) * 2020-05-28 2023-05-09 浙江工商大学 Human body posture estimation method based on weak supervision mechanism
CN112052886B (en) * 2020-08-21 2022-06-03 暨南大学 Intelligent human body action posture estimation method and device based on convolutional neural network
CN112052886A (en) * 2020-08-21 2020-12-08 暨南大学 Human body action attitude intelligent estimation method and device based on convolutional neural network
CN113012229A (en) * 2021-03-26 2021-06-22 北京华捷艾米科技有限公司 Method and device for positioning human body joint points
CN113361570A (en) * 2021-05-25 2021-09-07 东南大学 3D human body posture estimation method based on joint data enhancement and network training model
CN113361570B (en) * 2021-05-25 2022-11-01 东南大学 3D human body posture estimation method based on joint data enhancement and network training model
CN113436633A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113436633B (en) * 2021-06-30 2024-03-12 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113537070A (en) * 2021-07-19 2021-10-22 中国第一汽车股份有限公司 Detection method, detection device, electronic equipment and storage medium
CN114143710B (en) * 2021-11-22 2022-10-04 武汉大学 Wireless positioning method and system based on reinforcement learning
CN114143710A (en) * 2021-11-22 2022-03-04 武汉大学 Wireless positioning method and system based on reinforcement learning
CN116721471A (en) * 2023-08-10 2023-09-08 中国科学院合肥物质科学研究院 Multi-person three-dimensional attitude estimation method based on multi-view angles

Also Published As

Publication number Publication date
CN109190537B (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN109190537A (en) Multi-person pose estimation method based on mask-aware deep reinforcement learning
CN110135459B (en) Zero sample classification method based on double-triple depth measurement learning network
CN108596327B (en) Seismic velocity spectrum artificial intelligence picking method based on deep learning
CN110033473B (en) Moving target tracking method based on template matching and depth classification network
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN109101865A (en) A kind of recognition methods again of the pedestrian based on deep learning
CN106951825A (en) A kind of quality of human face image assessment system and implementation method
CN108961308B (en) Residual error depth characteristic target tracking method for drift detection
CN109151995B (en) Deep learning regression fusion positioning method based on signal intensity
CN106023257A (en) Target tracking method based on rotor UAV platform
CN110716792B (en) Target detector and construction method and application thereof
CN111259735B (en) Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN108492298A (en) Based on the multispectral image change detecting method for generating confrontation network
CN109598220A (en) A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN110210380B (en) Analysis method for generating character based on expression recognition and psychological test
CN108717548B (en) Behavior recognition model updating method and system for dynamic increase of sensors
CN112633257A (en) Potato disease identification method based on improved convolutional neural network
CN107492114A (en) The heavy detecting method used when monocular is long during the tracking failure of visual tracking method
CN111144462B (en) Unknown individual identification method and device for radar signals
CN108182410A (en) A kind of joint objective zone location and the tumble recognizer of depth characteristic study
CN115346272A (en) Real-time tumble detection method based on depth image sequence
CN110516700B (en) Fine-grained image classification method based on metric learning
CN109583456B (en) Infrared surface target detection method based on feature fusion and dense connection
CN114154530A (en) Training method and device for atrial fibrillation detection model of electrocardio timing signals
CN115511012B (en) Class soft label identification training method with maximum entropy constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210929

Address after: 310000 Room 401, building 2, No.16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Yunqi Smart Vision Technology Co., Ltd.

Address before: 310018, No. 18 Jiao Tong Street, Xiasha Higher Education Park, Hangzhou, Zhejiang

Patentee before: ZHEJIANG GONGSHANG University