CN110175575A - A single-person pose estimation method based on a novel high-resolution network model - Google Patents
- Publication number
- CN110175575A CN110175575A CN201910454096.1A CN201910454096A CN110175575A CN 110175575 A CN110175575 A CN 110175575A CN 201910454096 A CN201910454096 A CN 201910454096A CN 110175575 A CN110175575 A CN 110175575A
- Authority
- CN
- China
- Prior art keywords
- network
- picture
- resolution
- key point
- parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a method for single-person pose estimation based on a novel high-resolution network architecture. First, a detector detects the single pedestrian in the input image and inaccurate detection boxes are removed; the data set is then extended by data augmentation. In the instantiated network structure, parallel multi-resolution subnetworks keep the feature maps at high resolution, with no need to recover resolution; exchange units are introduced among the parallel subnetworks so that each subnetwork repeatedly receives information from the other parallel subnetworks, improving the accuracy of single-person pose estimation. Since keypoints are often occluded in complex scenes, a keypoint-masking data augmentation scheme is further proposed; it effectively fine-tunes the trained convolutional neural network so that occluded keypoints are robustly localized through adjacent cues, improving accuracy under occlusion and yielding a better model.
Description
Technical field
The present invention relates to a method based on a novel high-resolution network architecture, and belongs to the interdisciplinary fields of deep learning, computer vision, and machine learning.
Background technique
2D human pose estimation has long been a fundamental yet challenging problem in computer vision. The goal of single-person pose estimation is to localize human anatomical keypoints (for example, elbows and wrists) or parts. Pose estimation has broad applications, concentrated mainly in intelligent video surveillance, human-computer interaction, virtual reality, and smart homes. This patent focuses on single-person pose estimation, which underlies related problems such as multi-person pose estimation, video pose estimation, action recognition, and tracking.
Recent developments show that deep convolutional neural networks achieve the highest accuracy to date on single-person pose estimation, far surpassing traditional methods. Most existing methods pass the input through a network that downsamples high-resolution feature maps to low resolution and then recovers high resolution from the low-resolution maps (once or repeatedly), thereby realizing multi-resolution feature extraction. In contrast, the network proposed in this patent keeps its feature maps at high resolution throughout, which differs markedly from the previous mainstream: high resolution is maintained by gradually adding lower-resolution feature-map subnetworks in parallel to the high-resolution main network.
As a widely recognized vision problem, pose estimation has challenged researchers for years. Real-life scenes are often complex, and pedestrians are frequently occluded; such scenes are among the most challenging. A drawback of training a network on the original training set alone is that it usually lacks enough single-person images with occlusion to train a deep network for accurate keypoint detection and localization. An outstanding single-person pose estimation system must be robust to occlusion and to keypoint detection on severely deformed pedestrians, and must succeed on rare and novel poses, yet this problem has never been well solved. This patent proposes a novel keypoint-masking data augmentation scheme that enlarges the training data to fine-tune the network and improves accuracy in complex scenes.
In recent years, with deepening research on deep learning, and because deep network models require no elaborate hand-crafted features and can exploit rich image information, more and more researchers have applied deep learning to single-person pose estimation and improved its accuracy.
Summary of the invention
Technical problem: the problem to be solved by the invention is addressed by proposing a single-person pose estimation method based on a novel high-resolution network model. Parallel multi-resolution subnetworks and repeated multi-scale fusion keep the network at high-resolution feature maps throughout, with no need to recover resolution; a new data augmentation method fine-tunes the convolutional neural network, enlarging the training data for occluded cases and improving accuracy in complex scenes where the single pedestrian is occluded.
Technical solution: to achieve the above goals, the invention adopts the following technical scheme.
A single-person pose estimation method based on a novel high-resolution network model, comprising the following steps:
Step 1) Input RGB pictures of single-pedestrian poses annotated with keypoint coordinates as the data set; use a detector to outline the pedestrian in each picture with a rectangular box, and denote the box region as R;
Step 2) Save the rectangular region R as a picture and apply data augmentation to it, the augmentation including random rotation, flipping, and the addition of Gaussian noise;
Step 3) Split the data set 7:3 into a training set and a test set, for training and testing the convolutional neural network;
Step 4) Resize the training set to a fixed resolution;
Step 5) Instantiate the network structure according to the principles of parallel multi-resolution subnetworks and repeated multi-scale fusion; the parallel multi-resolution subnetworks are obtained by adding, in parallel to the high-resolution feature-map main network, feature-map subnetworks of lower resolution; the repeated multi-scale fusion means that the parallel networks repeatedly exchange information with one another, performing multi-scale fusion multiple times;
Step 6) Train a convolutional neural network on the training set; the network convolves the input picture with multiple windows and multiple kernels and comprises convolutional layers, pooling layers, and fully connected layers, wherein: the linear rectification (ReLU) function serves as the activation function, the ReLU activation function being f(x) = max(x, 0), where x is the output of the preceding layer, max(x, 0) takes the larger of x and 0, and f(x) receives the return value of max(x, 0); mean squared error serves as the loss function, and the dropout mechanism and weight regularization constrain the training model, the dropout mechanism being a method for optimizing deep artificial neural networks that randomly zeroes part of the hidden-layer weights or outputs during learning, reducing the interdependency between nodes and regularizing the network; the learning rate is adjusted dynamically during training: the base learning rate is set to 10⁻³ and dropped to 10⁻⁴ and 10⁻⁵ at epochs 150 and 200 respectively, training stops at epoch 250, and every epoch performs forward propagation and backpropagation over all the training data;
Step 7) Apply a novel keypoint-masking data augmentation scheme that robustly localizes occluded keypoints through adjacent cues, enlarging the training data to fine-tune the network and improving accuracy in complex scenes; the novel keypoint-masking augmentation means artificially occluding a keypoint on the image, or copying a patch around a body keypoint and pasting it elsewhere on the image;
Step 8) Input a three-channel RGB picture containing a single pedestrian's pose, the picture being supplied by the user;
Step 9) Detect the pedestrian in the user's picture and outline it with a rectangular box;
Step 10) Feed the box region into the novel high-resolution network and run forward propagation to obtain heatmaps, each heatmap giving the probability of a joint at each pixel;
Step 11) Shift the position of the maximum response by a quarter-pixel offset from the highest response toward the second-highest response, predict the position of each keypoint accordingly, then connect the corresponding keypoints to obtain the pose estimate of the single pedestrian in the picture.
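The keypoint-masking augmentation of step 7 can be sketched as follows. This is an illustrative NumPy-only sketch, not the patent's implementation; the function names, patch size, zero-fill choice, and border handling are assumptions.

```python
import numpy as np

def mask_keypoint(img, kp, size=9):
    """Artificially occlude a keypoint: zero out a size x size patch
    around it. `img` is HxWxC, `kp` is (x, y) in pixel coordinates."""
    out = img.copy()
    h, w = out.shape[:2]
    x, y = int(kp[0]), int(kp[1])
    half = size // 2
    y0, y1 = max(0, y - half), min(h, y + half + 1)
    x0, x1 = max(0, x - half), min(w, x + half + 1)
    out[y0:y1, x0:x1] = 0
    return out

def paste_keypoint_patch(img, src_kp, dst_kp, size=9):
    """Copy the patch around one body keypoint and paste it over another
    location, simulating a confusing duplicate body part. Assumes both
    keypoints lie at least size//2 pixels from the image border."""
    out = img.copy()
    half = size // 2
    sx, sy = int(src_kp[0]), int(src_kp[1])
    dx, dy = int(dst_kp[0]), int(dst_kp[1])
    patch = img[sy - half:sy + half + 1, sx - half:sx + half + 1].copy()
    ph, pw = patch.shape[:2]
    out[dy - half:dy - half + ph, dx - half:dx - half + pw] = patch
    return out
```

In a fine-tuning pipeline these transforms would be applied to training crops while the ground-truth keypoint coordinates are left unchanged, forcing the network to infer the hidden joint from its neighbors.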
Further, the step 5) comprises the following steps:
Step 51) Parallel multi-resolution subnetworks: lower-resolution feature-map subnetworks are added in parallel to the high-resolution feature-map main network; therefore, the resolutions of the parallel subnetworks of a later stage comprise the resolution of the previous stage plus one resolution lower than it;
Step 52) Repeated multi-scale fusion: exchange units are introduced among the parallel subnetworks so that each subnetwork repeatedly receives information from the other parallel subnetworks. Let the inputs be {X_1, X_2, ..., X_n} and the outputs {Y_1, Y_2, ..., Y_n}, where X denotes an input response map, Y an output response map, X_n the n-th input response map, and Y_n the n-th output response map; the resolution of each output equals that of its input, and each output aggregates the resampled inputs: Y_k = Σ_{i=1}^{n} a(X_i, k). An exchange unit across stages has an additional output map Y_{n+1} = a(Y_n, n+1), where i and k are response-map indices, X_i is the i-th input response map, Y_{n+1} the (n+1)-th output response map, and Y_k the k-th output response map; the function a(X_i, k) upsamples or downsamples X_i from resolution i to resolution k, and a(Y_n, n+1) likewise resamples Y_n from n to n+1. Both are realized with convolutions: one stride-2 3×3 convolution performs 2× downsampling, and two consecutive stride-2 3×3 convolutions perform 4× downsampling; upsampling uses a 1×1 convolution followed by interpolation, interpolation meaning that new pixels are inserted between existing pixels of the original image by an interpolation algorithm. If i = k, then a(X_i, k) = X_i. Heatmaps are regressed from the output of the last exchange unit;
Step 53) Network instantiation: the keypoint heatmap estimation network is instantiated. The network body contains four parallel subnetworks; from one subnetwork to the next, the resolution is halved and the width (number of channels) doubled. The 1st stage contains 4 residual units, followed by one 3×3 convolution that reduces the width of the feature maps to S, S being the subnet width; the 2nd, 3rd, and 4th stages contain 1, 4, and 3 exchange blocks respectively. Each exchange block contains 4 residual units (each residual unit consisting of two 3×3 convolutions at each resolution) and one exchange unit across the different resolutions. In total there are 8 exchange units, i.e., 8 rounds of multi-scale fusion.
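The exchange-unit aggregation Y_k = Σ_i a(X_i, k) of step 52 can be sketched as follows. A minimal NumPy sketch under stated assumptions: the patent's learned stride-2 3×3 convolutions and 1×1-convolution-plus-interpolation are stood in for here by 2× average pooling and nearest-neighbor upsampling, and all names are illustrative.

```python
import numpy as np

def down2(x):
    # stand-in for one stride-2 3x3 convolution: 2x average pooling
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    # stand-in for a 1x1 convolution + interpolation: nearest-neighbor upsampling
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def a(x, i, k):
    """Resample response map x from branch i to branch k, where branch j
    has resolution scaled by 2^-j; a(x, i, i) is the identity."""
    while i < k:
        x, i = down2(x), i + 1
    while i > k:
        x, i = up2(x), i - 1
    return x

def exchange(xs):
    """One exchange unit: Y_k = sum_i a(X_i, k) over all parallel branches."""
    n = len(xs)
    return [sum(a(x, i, k) for i, x in enumerate(xs)) for k in range(n)]
```

Chaining several such exchange units over three or four parallel branches reproduces the repeated multi-scale fusion described in step 52.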
Further, in step 2), the random rotation of a picture is between -45° and 45°, i.e., from 45° counterclockwise to 45° clockwise.
Further, in step 2), the flipping of a picture includes horizontal flipping and vertical flipping.
Further, in step 2), each picture is randomly put through any two of the data augmentation modes.
Further, in step 4), the training-set resolution is fixed at 256px × 192px, 256 × 192 being a pixel matrix in which 256 is the number of pixels per column and 192 the number of pixels per row.
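The three augmentation modes of step 2) — random rotation in [-45°, 45°], horizontal/vertical flipping, and additive Gaussian noise — with any two applied per picture, can be sketched as follows. This is a NumPy-only illustration under assumptions: rotation is done by nearest-neighbor resampling with zero fill, and the noise standard deviation is an illustrative choice; a real pipeline would typically rotate via OpenCV or PIL.

```python
import numpy as np

def rotate_nn(img, deg):
    """Rotate HxWxC image about its center by `deg` degrees
    (nearest-neighbor sampling, zero fill outside the source)."""
    h, w = img.shape[:2]
    t = np.deg2rad(deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: where does each output pixel come from in the source?
    xsrc = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    ysrc = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    xi, yi = np.rint(xsrc).astype(int), np.rint(ysrc).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out

def augment(img, rng):
    """Apply any two of the three augmentation modes, chosen at random."""
    for op in rng.choice(3, size=2, replace=False):
        if op == 0:                                    # random rotation
            img = rotate_nn(img, rng.uniform(-45, 45))
        elif op == 1:                                  # vertical or horizontal flip
            img = np.flip(img, axis=int(rng.integers(0, 2))).copy()
        else:                                          # additive Gaussian noise
            img = np.clip(img + rng.normal(0, 5.0, img.shape), 0, 255)
    return img
```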
Beneficial effects: compared with the prior art, the above technical scheme has the following advantages. The invention performs single-person pose estimation based on a novel high-resolution network architecture, instantiating the network on the principles of parallel multi-resolution subnetworks and repeated multi-scale fusion, which improves the accuracy of single-person pose estimation; it also proposes a novel keypoint-masking data augmentation scheme that enlarges the occluded training data to fine-tune the network, improving accuracy in complex scenes where the single pedestrian is occluded.
Specifically:
(1) The invention builds the network on the principle of parallel multi-resolution: starting from a high-resolution subnet as the first stage, subnets of progressively lower resolution are added and connected in parallel, so the resolutions of the parallel subnetworks of a later stage comprise the previous stage's resolution plus a lower one. The parallel multi-resolution subnets keep high-resolution feature maps throughout, with no need to recover resolution;
(2) The invention builds the network on the principle of repeated multi-scale fusion, introducing exchange units among the parallel subnets so that each subnet repeatedly receives information from the other parallel subnets;
(3) The invention proposes a novel keypoint-masking data augmentation scheme that robustly localizes occluded keypoints through adjacent cues, enlarging the training data to fine-tune the network and improving accuracy for a single pedestrian under heavy occlusion in complex scenes.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Specific embodiment
The technical solution of the invention is further described in detail below with reference to the accompanying drawing.
As shown in Fig. 1, a single-person pose estimation method based on a novel high-resolution network model comprises the following steps:
Step 1) Input RGB pictures of single-pedestrian poses annotated with keypoint coordinates as the data set; use a detector to outline the pedestrian in each picture with a rectangular box, and denote the box region as R;
Step 2) Save the rectangular region R as a picture and apply data augmentation to it, the augmentation including random rotation, flipping, and the addition of Gaussian noise. Specifically: the random rotation of a picture is between -45° and 45°, i.e., from 45° counterclockwise to 45° clockwise; flipping includes horizontal and vertical flips; and each picture is randomly put through any two of the augmentation modes.
Step 3) Split the data set 7:3 into a training set and a test set, for training and testing the convolutional neural network;
Step 4) Resize the training set to the fixed resolution 256px × 192px, 256 × 192 being a pixel matrix in which 256 is the number of pixels per column and 192 the number of pixels per row.
Step 5) Instantiate the network structure according to the principles of parallel multi-resolution subnetworks and repeated multi-scale fusion; the parallel multi-resolution subnetworks are obtained by adding, in parallel to the high-resolution feature-map main network, feature-map subnetworks of lower resolution; the repeated multi-scale fusion means that the parallel networks repeatedly exchange information with one another, performing multi-scale fusion multiple times.
In particular, the step 5) comprises the following steps:
Step 51) Parallel multi-resolution subnetworks: lower-resolution feature-map subnetworks are gradually added in parallel to the high-resolution feature-map main network. The initial network is taken to be the high-resolution one, and the feature-map subnetworks added in parallel have lower resolution than the initial network; therefore, the resolutions of the parallel subnetworks of a later stage comprise the resolution of the previous stage plus one resolution lower than it;
Step 52) Repeated multi-scale fusion: exchange units are introduced among the parallel subnetworks so that each subnetwork repeatedly receives information from the other parallel subnetworks. Let the inputs be {X_1, X_2, ..., X_n} and the outputs {Y_1, Y_2, ..., Y_n}, where X denotes an input response map, Y an output response map, X_n the n-th input response map, and Y_n the n-th output response map; the resolution of each output equals that of its input, and each output aggregates the resampled inputs: Y_k = Σ_{i=1}^{n} a(X_i, k). An exchange unit across stages has an additional output map Y_{n+1} = a(Y_n, n+1), where i and k are response-map indices, X_i is the i-th input response map, Y_{n+1} the (n+1)-th output response map, and Y_k the k-th output response map; the function a(X_i, k) upsamples or downsamples X_i from resolution i to resolution k, and a(Y_n, n+1) likewise resamples Y_n from n to n+1. Both are realized with convolutions: one stride-2 3×3 convolution performs 2× downsampling, and two consecutive stride-2 3×3 convolutions perform 4× downsampling; upsampling uses a 1×1 convolution followed by interpolation, interpolation meaning that new pixels are inserted between existing pixels of the original image by an interpolation algorithm. If i = k, then a(X_i, k) = X_i. Heatmaps are regressed from the output of the last exchange unit;
Step 53) Network instantiation: the keypoint heatmap estimation network is instantiated. The network body contains four parallel subnetworks; from one subnetwork to the next, the resolution is halved and the width (number of channels) doubled. The 1st stage contains 4 residual units, followed by one 3×3 convolution that reduces the width of the feature maps to S, S being the subnet width; the 2nd, 3rd, and 4th stages contain 1, 4, and 3 exchange blocks respectively. Each exchange block contains 4 residual units (each residual unit consisting of two 3×3 convolutions at each resolution) and one exchange unit across the different resolutions. In total there are 8 exchange units, i.e., 8 rounds of multi-scale fusion.
Step 6) Train a convolutional neural network on the training set; the network convolves the input picture with multiple windows and multiple kernels and comprises convolutional layers, pooling layers, and fully connected layers, wherein: the linear rectification (ReLU) function serves as the activation function, the ReLU activation function being f(x) = max(x, 0), where x is the output of the preceding layer, max(x, 0) takes the larger of x and 0, and f(x) receives the return value of max(x, 0); mean squared error serves as the loss function, and the dropout mechanism and weight regularization constrain the training model, the dropout mechanism being a method for optimizing deep artificial neural networks that randomly zeroes part of the hidden-layer weights or outputs during learning, reducing the interdependency between nodes and regularizing the network; the learning rate is adjusted dynamically during training: the base learning rate is set to 10⁻³ and dropped to 10⁻⁴ and 10⁻⁵ at epochs 150 and 200 respectively, training stops at epoch 250, and every epoch performs forward propagation and backpropagation over all the training data.
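The activation function, loss function, and learning-rate schedule of step 6 can be sketched as follows. The epoch thresholds and rates come from the text; everything else (function names, the convention that the drop takes effect at the stated epoch) is an illustrative stand-in for a full training loop, which is omitted.

```python
import numpy as np

def relu(x):
    # f(x) = max(x, 0): the larger of x and 0, elementwise
    return np.maximum(x, 0)

def mse_loss(pred, target):
    # mean squared error between predicted and ground-truth heatmaps
    return np.mean((pred - target) ** 2)

def learning_rate(epoch):
    """Step schedule: base 1e-3, dropped to 1e-4 at epoch 150 and to
    1e-5 at epoch 200; training stops after epoch 250."""
    if epoch < 150:
        return 1e-3
    if epoch < 200:
        return 1e-4
    return 1e-5
```

In a framework such as PyTorch the same schedule would typically be expressed as a multi-step decay with milestones at 150 and 200 and a decay factor of 0.1.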
Since pedestrians are often occluded, and in order to cope with such challenging scenes, data augmentation is performed by artificially occluding a keypoint or by copying a patch around a body keypoint and pasting it elsewhere on the image, enlarging the training data to fine-tune the network. The user inputs a three-channel RGB picture containing a single pedestrian. The pedestrian in the user's picture is detected and outlined with a rectangular box; the box region is fed into the network to obtain heatmaps. By shifting the position of the maximum response a quarter pixel from the highest response toward the second-highest, the position of each keypoint can be predicted; connecting the corresponding keypoints then yields the pose estimate of the single pedestrian in the picture.
Step 7) Apply a novel keypoint-masking data augmentation scheme that robustly localizes occluded keypoints through adjacent cues, enlarging the training data to fine-tune the network and improving accuracy in complex scenes; the novel keypoint-masking augmentation means artificially occluding a keypoint on the image, or copying a patch around a body keypoint and pasting it elsewhere on the image;
Step 8) Input a three-channel RGB picture containing a single pedestrian's pose, the picture being supplied by the user;
Step 9) Detect the pedestrian in the user's picture and outline it with a rectangular box;
Step 10) Feed the box region into the novel high-resolution network and run forward propagation to obtain heatmaps, each heatmap giving the probability of a joint at each pixel;
Step 11) Shift the position of the maximum response by a quarter-pixel offset from the highest response toward the second-highest response, predict the position of each keypoint accordingly, then connect the corresponding keypoints to obtain the pose estimate of the single pedestrian in the picture.
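The heatmap decoding of steps 10) and 11) — take the maximum response and shift it a quarter pixel toward the second-highest response — can be sketched as follows. A NumPy sketch under assumptions: the function name is illustrative, the quarter-pixel shift is taken along the normalized direction from the highest to the second-highest response, and the final assembly of decoded keypoints into a skeleton is omitted.

```python
import numpy as np

def decode_heatmap(hm):
    """Return (x, y): the argmax of one joint heatmap, shifted a quarter
    pixel toward the second-highest response."""
    h, w = hm.shape
    flat = hm.ravel().copy()
    y1, x1 = divmod(int(flat.argmax()), w)      # highest response
    flat[y1 * w + x1] = -np.inf
    y2, x2 = divmod(int(flat.argmax()), w)      # second-highest response
    d = np.array([x2 - x1, y2 - y1], dtype=float)
    n = np.linalg.norm(d)
    if n > 0:
        d /= n                                   # unit direction toward 2nd peak
    return np.array([x1, y1], dtype=float) + 0.25 * d
```

Applying this per joint heatmap yields one sub-pixel keypoint position per joint; the pose is then obtained by connecting the corresponding keypoints.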
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (6)
1. A single-person pose estimation method based on a novel high-resolution network model, characterized by comprising the following steps:
Step 1) Input RGB pictures of single-pedestrian poses annotated with keypoint coordinates as the data set; use a detector to outline the pedestrian in each picture with a rectangular box, and denote the box region as R;
Step 2) Save the rectangular region R as a picture and apply data augmentation to it, the augmentation including random rotation, flipping, and the addition of Gaussian noise;
Step 3) Split the data set 7:3 into a training set and a test set, for training and testing the convolutional neural network;
Step 4) Resize the training set to a fixed resolution;
Step 5) Instantiate the network structure according to the principles of parallel multi-resolution subnetworks and repeated multi-scale fusion; the parallel multi-resolution subnetworks are obtained by adding, in parallel to the high-resolution feature-map main network, feature-map subnetworks of lower resolution; the repeated multi-scale fusion means that the parallel networks repeatedly exchange information with one another, performing multi-scale fusion multiple times;
Step 6) Train a convolutional neural network on the training set; the network convolves the input picture with multiple windows and multiple kernels and comprises convolutional layers, pooling layers, and fully connected layers, wherein: the linear rectification (ReLU) function serves as the activation function, the ReLU activation function being f(x) = max(x, 0), where x is the output of the preceding layer, max(x, 0) takes the larger of x and 0, and f(x) receives the return value of max(x, 0); mean squared error serves as the loss function, and the dropout mechanism and weight regularization constrain the training model, the dropout mechanism being a method for optimizing deep artificial neural networks that randomly zeroes part of the hidden-layer weights or outputs during learning, reducing the interdependency between nodes and regularizing the network; the learning rate is adjusted dynamically during training: the base learning rate is set to 10⁻³ and dropped to 10⁻⁴ and 10⁻⁵ at epochs 150 and 200 respectively, training stops at epoch 250, and every epoch performs forward propagation and backpropagation over all the training data;
Step 7) Apply a novel keypoint-masking data augmentation scheme that robustly localizes occluded keypoints through adjacent cues, enlarging the training data to fine-tune the network and improving accuracy in complex scenes; the novel keypoint-masking augmentation means artificially occluding a keypoint on the image, or copying a patch around a body keypoint and pasting it elsewhere on the image;
Step 8) Input a three-channel RGB picture containing a single pedestrian's pose, the picture being supplied by the user;
Step 9) Detect the pedestrian in the user's picture and outline it with a rectangular box;
Step 10) Feed the box region into the novel high-resolution network and run forward propagation to obtain heatmaps, each heatmap giving the probability of a joint at each pixel;
Step 11) Shift the position of the maximum response by a quarter-pixel offset from the highest response toward the second-highest response, predict the position of each keypoint accordingly, then connect the corresponding keypoints to obtain the pose estimate of the single pedestrian in the picture.
2. a kind of single Attitude estimation method based on novel high-resolution network model according to claim 1, special
Sign is, the step 5), comprising the following steps:
Step 51) parallel multi-resolution subnetworks: lower-resolution subnetworks are added in parallel to the high-resolution main network; therefore, the resolutions of the parallel subnetworks of a later stage comprise the resolutions of the previous stage together with one resolution lower than those of the previous stage;
Step 52) repeated multi-scale fusion: exchange units are introduced among the parallel subnetworks, so that each subnetwork repeatedly receives information from the other parallel subnetworks; if the input is {X1, X2, …, Xn}, the output is {Y1, Y2, …, Yn}, wherein: X denotes an input response map, Y denotes an output response map, Xn denotes the n-th input response map, and Yn denotes the n-th output response map; the resolution of each output is identical to the resolution of the corresponding input, and each output is an aggregation of the input maps: Yk = Σ(i=1 to n) a(Xi, k); an exchange unit across stages has an additional output map Yn+1: Yn+1 = a(Yn, n+1), wherein: i denotes the serial number of the i-th response map, k denotes the serial number of the k-th response map, Xi denotes the i-th input response map, Yn+1 denotes the (n+1)-th output response map, and Yk denotes the k-th output response map; the function a(Xi, k) denotes up-sampling or down-sampling Xi from resolution i to resolution k, and a(Yn, n+1) denotes up-sampling or down-sampling Yn from resolution n to resolution n+1, wherein: up-sampling and down-sampling are both implemented by convolution, with strided 3×3 convolutions used for down-sampling: one 3×3 convolution with stride 2 performs 2× down-sampling, and two consecutive 3×3 convolutions with stride 2 perform 4× down-sampling;
for up-sampling, a 1×1 convolution is applied and then followed by interpolation, the interpolation referring to inserting new pixels between the pixels of the original image by means of an interpolation algorithm; if i = k, the function is the identity: a(Xi, k) = Xi; the key point heat maps are regressed from the output of the last exchange unit;
Step 53) network instantiation: the key point heat map estimation network is instantiated; the main body of the network comprises four parallel subnetworks, the resolution of each subnetwork being halved and the width, i.e. the number of channels, being doubled relative to the previous one; the 1st stage comprises 4 residual units, each followed by a 3×3 convolution that reduces the width of the feature maps to S, the S being the width of the subnetwork; the 2nd, 3rd and 4th stages comprise 1, 4 and 3 exchange blocks, respectively; one exchange block comprises 4 residual units, wherein each residual unit comprises two 3×3 convolutions at each resolution and one exchange unit across the different resolutions; there are 8 exchange units in total, i.e. 8 rounds of
multi-scale fusion are performed.
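The exchange-unit aggregation Yk = Σ a(Xi, k) of step 52) can be sketched as follows. This is an illustrative stand-in, not the patented implementation: strided slicing stands in for the stride-2 3×3 convolutions, and nearest-neighbour repetition stands in for the 1×1-convolution-plus-interpolation up-sampling.

```python
import numpy as np

def downsample(x, factor):
    # Stand-in for the stride-2 3x3 convolutions of the claim:
    # keep every `factor`-th pixel in each spatial dimension.
    return x[::factor, ::factor]

def upsample(x, factor):
    # Stand-in for the 1x1 convolution followed by interpolation:
    # nearest-neighbour repetition of each pixel.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def a(x_i, i, k):
    # Resample response map x_i from resolution index i to index k;
    # each index step halves the spatial resolution.
    if i == k:
        return x_i                      # identity case: a(Xi, k) = Xi
    if i < k:
        return downsample(x_i, 2 ** (k - i))
    return upsample(x_i, 2 ** (i - k))

def exchange_unit(inputs):
    # Yk = sum over i of a(Xi, k): each output aggregates all inputs,
    # resampled to its own resolution.
    n = len(inputs)
    return [sum(a(x, i, k) for i, x in enumerate(inputs)) for k in range(n)]

# Two parallel response maps: 8x8 (high resolution) and 4x4 (half resolution).
X = [np.ones((8, 8)), np.ones((4, 4))]
Y = exchange_unit(X)
print(Y[0].shape, Y[1].shape)  # (8, 8) (4, 4): output resolutions match inputs
```

A cross-stage exchange unit would additionally emit a(Yn, n+1), i.e. one further down-sampling of the lowest-resolution output.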
3. The single-person pose estimation method based on a novel high-resolution network model according to claim 1, characterized in that, in the step 2), in the random picture-rotation mode, the rotation angle of the picture is -45° to 45°, i.e. the rotation angle of the picture ranges from 45° counterclockwise to 45° clockwise.
4. The single-person pose estimation method based on a novel high-resolution network model according to claim 1, characterized in that, in the step 2), the picture-flipping modes include horizontal flipping and vertical flipping.
5. The single-person pose estimation method based on a novel high-resolution network model according to claim 1, characterized in that, in the step 2), each picture randomly undergoes any two of the data enhancement methods.
6. The single-person pose estimation method based on a novel high-resolution network model according to claim 1, characterized in that, in the step 4), the resolution of the training set is set to the fixed size 256px × 192px, the 256 × 192 being the pixel-value matrix of the picture, wherein 256 is the number of pixels per column and 192 is the number of pixels per row.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910454096.1A CN110175575A (en) | 2019-05-29 | 2019-05-29 | A kind of single Attitude estimation method based on novel high-resolution network model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910454096.1A CN110175575A (en) | 2019-05-29 | 2019-05-29 | A kind of single Attitude estimation method based on novel high-resolution network model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110175575A true CN110175575A (en) | 2019-08-27 |
Family
ID=67695846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910454096.1A Pending CN110175575A (en) | 2019-05-29 | 2019-05-29 | A kind of single Attitude estimation method based on novel high-resolution network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175575A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110705365A (en) * | 2019-09-06 | 2020-01-17 | 北京达佳互联信息技术有限公司 | Human body key point detection method and device, electronic equipment and storage medium |
CN110889858A (en) * | 2019-12-03 | 2020-03-17 | 中国太平洋保险(集团)股份有限公司 | Automobile part segmentation method and device based on point regression |
CN110969105A (en) * | 2019-11-22 | 2020-04-07 | 清华大学深圳国际研究生院 | Human body posture estimation method |
CN111274865A (en) * | 2019-12-14 | 2020-06-12 | 深圳先进技术研究院 | Remote sensing image cloud detection method and device based on full convolution neural network |
CN111339903A (en) * | 2020-02-21 | 2020-06-26 | 河北工业大学 | Multi-person human body posture estimation method |
CN111950412A (en) * | 2020-07-31 | 2020-11-17 | 陕西师范大学 | Hierarchical dance action attitude estimation method with sequence multi-scale depth feature fusion |
CN112364738A (en) * | 2020-10-30 | 2021-02-12 | 深圳点猫科技有限公司 | Human body posture estimation method, device, system and medium based on deep learning |
CN112580721A (en) * | 2020-12-19 | 2021-03-30 | 北京联合大学 | Target key point detection method based on multi-resolution feature fusion |
CN112861872A (en) * | 2020-12-31 | 2021-05-28 | 浙大城市学院 | Penaeus vannamei phenotype data determination method, device, computer equipment and storage medium |
CN112883761A (en) * | 2019-11-29 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Method, device and equipment for constructing attitude estimation model and storage medium |
CN113221626A (en) * | 2021-03-04 | 2021-08-06 | 北京联合大学 | Human body posture estimation method based on Non-local high-resolution network |
CN113343762A (en) * | 2021-05-07 | 2021-09-03 | 北京邮电大学 | Human body posture estimation grouping model training method, posture estimation method and device |
CN113361378A (en) * | 2021-06-02 | 2021-09-07 | 合肥工业大学 | Human body posture estimation method using adaptive data enhancement |
CN113449609A (en) * | 2021-06-09 | 2021-09-28 | 东华大学 | Subway violation early warning method based on improved HigherHRNet model and DNN (deep neural network) |
CN114241051A (en) * | 2021-12-21 | 2022-03-25 | 盈嘉互联(北京)科技有限公司 | Object attitude estimation method for indoor complex scene |
CN114492216A (en) * | 2022-04-19 | 2022-05-13 | 中国石油大学(华东) | Pumping unit operation track simulation method based on high-resolution representation learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229445A (en) * | 2018-02-09 | 2018-06-29 | 深圳市唯特视科技有限公司 | A kind of more people's Attitude estimation methods based on cascade pyramid network |
CN109447906A (en) * | 2018-11-08 | 2019-03-08 | 北京印刷学院 | A kind of picture synthetic method based on generation confrontation network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229445A (en) * | 2018-02-09 | 2018-06-29 | 深圳市唯特视科技有限公司 | A kind of more people's Attitude estimation methods based on cascade pyramid network |
CN109447906A (en) * | 2018-11-08 | 2019-03-08 | 北京印刷学院 | A kind of picture synthetic method based on generation confrontation network |
Non-Patent Citations (2)
Title |
---|
Ke Sun et al.: "Deep High-Resolution Representation Learning for Human Pose Estimation", arXiv * |
Lipeng Ke et al.: "Multi-Scale Structure-Aware Network for Human Pose Estimation", arXiv * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110705365A (en) * | 2019-09-06 | 2020-01-17 | 北京达佳互联信息技术有限公司 | Human body key point detection method and device, electronic equipment and storage medium |
CN110969105A (en) * | 2019-11-22 | 2020-04-07 | 清华大学深圳国际研究生院 | Human body posture estimation method |
CN112883761A (en) * | 2019-11-29 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Method, device and equipment for constructing attitude estimation model and storage medium |
CN112883761B (en) * | 2019-11-29 | 2023-12-12 | 北京达佳互联信息技术有限公司 | Construction method, device, equipment and storage medium of attitude estimation model |
CN110889858A (en) * | 2019-12-03 | 2020-03-17 | 中国太平洋保险(集团)股份有限公司 | Automobile part segmentation method and device based on point regression |
CN111274865A (en) * | 2019-12-14 | 2020-06-12 | 深圳先进技术研究院 | Remote sensing image cloud detection method and device based on full convolution neural network |
CN111274865B (en) * | 2019-12-14 | 2023-09-19 | 深圳先进技术研究院 | Remote sensing image cloud detection method and device based on full convolution neural network |
CN111339903A (en) * | 2020-02-21 | 2020-06-26 | 河北工业大学 | Multi-person human body posture estimation method |
CN111339903B (en) * | 2020-02-21 | 2022-02-08 | 河北工业大学 | Multi-person human body posture estimation method |
CN111950412A (en) * | 2020-07-31 | 2020-11-17 | 陕西师范大学 | Hierarchical dance action attitude estimation method with sequence multi-scale depth feature fusion |
CN111950412B (en) * | 2020-07-31 | 2023-11-24 | 陕西师范大学 | Hierarchical dance motion gesture estimation method based on sequence multi-scale depth feature fusion |
CN112364738A (en) * | 2020-10-30 | 2021-02-12 | 深圳点猫科技有限公司 | Human body posture estimation method, device, system and medium based on deep learning |
CN112580721A (en) * | 2020-12-19 | 2021-03-30 | 北京联合大学 | Target key point detection method based on multi-resolution feature fusion |
CN112580721B (en) * | 2020-12-19 | 2023-10-24 | 北京联合大学 | Target key point detection method based on multi-resolution feature fusion |
CN112861872A (en) * | 2020-12-31 | 2021-05-28 | 浙大城市学院 | Penaeus vannamei phenotype data determination method, device, computer equipment and storage medium |
CN113221626A (en) * | 2021-03-04 | 2021-08-06 | 北京联合大学 | Human body posture estimation method based on Non-local high-resolution network |
CN113221626B (en) * | 2021-03-04 | 2023-10-20 | 北京联合大学 | Human body posture estimation method based on Non-local high-resolution network |
CN113343762A (en) * | 2021-05-07 | 2021-09-03 | 北京邮电大学 | Human body posture estimation grouping model training method, posture estimation method and device |
CN113361378B (en) * | 2021-06-02 | 2023-03-10 | 合肥工业大学 | Human body posture estimation method using adaptive data enhancement |
CN113361378A (en) * | 2021-06-02 | 2021-09-07 | 合肥工业大学 | Human body posture estimation method using adaptive data enhancement |
CN113449609A (en) * | 2021-06-09 | 2021-09-28 | 东华大学 | Subway violation early warning method based on improved HigherHRNet model and DNN (deep neural network) |
CN114241051A (en) * | 2021-12-21 | 2022-03-25 | 盈嘉互联(北京)科技有限公司 | Object attitude estimation method for indoor complex scene |
CN114492216A (en) * | 2022-04-19 | 2022-05-13 | 中国石油大学(华东) | Pumping unit operation track simulation method based on high-resolution representation learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110175575A (en) | A kind of single Attitude estimation method based on novel high-resolution network model | |
CN105976378B (en) | Conspicuousness object detection method based on graph model | |
CN105631861B (en) | Restore the method for 3 D human body posture from unmarked monocular image in conjunction with height map | |
CN109064405A (en) | A kind of multi-scale image super-resolution method based on dual path network | |
CN112836597B (en) | Multi-hand gesture key point estimation method based on cascade parallel convolution neural network | |
CN108492316A (en) | A kind of localization method and device of terminal | |
CN103839277B (en) | A kind of mobile augmented reality register method of outdoor largescale natural scene | |
CN109271888A (en) | Personal identification method, device, electronic equipment based on gait | |
CN110427937A (en) | A kind of correction of inclination license plate and random length licence plate recognition method based on deep learning | |
CN109377530A (en) | A kind of binocular depth estimation method based on deep neural network | |
CN108229497A (en) | Image processing method, device, storage medium, computer program and electronic equipment | |
CN110795982A (en) | Apparent sight estimation method based on human body posture analysis | |
CN110020989A (en) | A kind of depth image super resolution ratio reconstruction method based on deep learning | |
CN110472542A (en) | A kind of infrared image pedestrian detection method and detection system based on deep learning | |
CN106599805A (en) | Supervised data driving-based monocular video depth estimating method | |
CN103020912B (en) | The remote sensing image restored method of a kind of combination wave band cluster and sparse expression | |
CN104318569A (en) | Space salient region extraction method based on depth variation model | |
CN112465827A (en) | Contour perception multi-organ segmentation network construction method based on class-by-class convolution operation | |
CN109887029A (en) | A kind of monocular vision mileage measurement method based on color of image feature | |
CN106886986B (en) | Image interfusion method based on adaptive group structure sparse dictionary study | |
CN110246181A (en) | Attitude estimation model training method, Attitude estimation method and system based on anchor point | |
US20120299906A1 (en) | Model-Based Face Image Super-Resolution | |
CN110246084A (en) | A kind of super-resolution image reconstruction method and its system, device, storage medium | |
Park et al. | Biologically inspired saliency map model for bottom-up visual attention | |
CN106372597B (en) | CNN Vehicle Detection method based on adaptive contextual information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190827 |
|
RJ01 | Rejection of invention patent application after publication |