CN108108722A

CN108108722A - An accurate three-dimensional hand and human body pose estimation method based on a single depth image

Info

Publication number: CN108108722A
Application number: CN201810046261.5A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2018-01-17
Filing date: 2018-01-17
Publication date: 2018-06-01

Abstract

The present invention proposes an accurate three-dimensional hand and human body pose estimation method based on a single depth image. Its main contents are: a network model, an improved target location, the input of the system, and a voxel-to-voxel prediction network. The process is as follows: the overall architecture of the network is given first; the position of the target is then refined using a method based on a convolutional neural network; the input of the system is then constructed by back-projection; finally, the voxel-to-voxel prediction network is composed of four types of building blocks (volumetric basic blocks, volumetric residual blocks, volumetric down-sampling blocks and volumetric up-sampling blocks) together with an encoder and a decoder. The invention addresses the problems of perspective distortion and nonlinear mapping, obtains high-accuracy three-dimensional hand and human body pose estimates, and has a low running time, so that human behavior can be predicted and estimated in real time.

Description

An accurate three-dimensional hand and human body pose estimation method based on a single depth image

Technical field

The present invention relates to the field of three-dimensional hand and human body pose estimation, and more particularly to an accurate three-dimensional hand and human body pose estimation method based on a single depth image.

Background art

Human behavior interaction means that a computer locates and recognizes people, tracks the trajectories of their limbs and tracks their facial expression features, so as to understand human actions and behavior and respond to them. Its applications are very broad, mainly human-computer interaction, virtual reality, smart homes, intelligent security, intelligent video surveillance, patient monitoring systems and auxiliary training for athletes; many human behavior interaction methods are also used in content-based video retrieval and intelligent image compression. For example, by detecting and estimating suspicious hand motions or postures of people in public places such as railway stations and airports, security personnel can be helped to judge whether a person is a suspect about to commit theft or another dangerous act, which effectively reduces the occurrence of thefts and dangerous incidents. As another example, by monitoring seriously ill patients with depth cameras installed in a clinic, and detecting and estimating the patients' gestures and body poses, medical staff can be helped to judge whether a patient needs help and to respond in time. The main task of human-computer behavior interaction is three-dimensional hand and human body pose estimation. With the advent of inexpensive depth cameras, three-dimensional hand and human body pose estimation based on a single depth image has attracted increasing attention. Recently, methods based on convolutional neural networks have been applied to three-dimensional hand and human body pose estimation from a single depth image and have achieved high accuracy. However, such methods still have limitations, particularly under severe self-occlusion or when the depth image is of poor quality. In addition, traditional three-dimensional hand and human body pose estimation methods have two deficiencies: first, the two-dimensional depth image suffers from perspective distortion, which distorts the estimate; second, the mapping between the depth image and the three-dimensional coordinates is highly nonlinear, which hinders the learning process and prevents the network from accurately estimating the three-dimensional coordinates of the target.

The present invention proposes an accurate three-dimensional hand and human body pose estimation method based on a single depth image. The overall architecture of the network is given first; the position of the target is then refined using a method based on a convolutional neural network; the input of the system is then constructed by back-projection; finally, the voxel-to-voxel prediction network is composed of four types of building blocks (volumetric basic blocks, volumetric residual blocks, volumetric down-sampling blocks and volumetric up-sampling blocks) together with an encoder and a decoder. The invention addresses the problems of perspective distortion and nonlinear mapping, obtains high-accuracy three-dimensional hand and human body pose estimates, and has a low running time, so that human behavior can be predicted and estimated in real time.

Summary of the invention

In view of the problems of perspective distortion and nonlinear mapping, the object of the present invention is to provide an accurate three-dimensional hand and human body pose estimation method based on a single depth image. The overall architecture of the network is given first; the position of the target is then refined using a method based on a convolutional neural network; the input of the system is then constructed by back-projection; finally, the voxel-to-voxel prediction network is composed of four types of building blocks (volumetric basic blocks, volumetric residual blocks, volumetric down-sampling blocks and volumetric up-sampling blocks) together with an encoder and a decoder.

To solve the above problems, the present invention provides an accurate three-dimensional hand and human body pose estimation method based on a single depth image, whose main contents include:

(1) the network model;

(2) the improved target location;

(3) the input of the system;

(4) the voxel-to-voxel prediction network.

The task of the network model is to estimate the three-dimensional coordinates of all joints, in three main steps. First, the points are back-projected into three-dimensional space and the continuous space is discretized, thereby converting the two-dimensional depth map into a three-dimensional volumetric representation. Second, the three-dimensional voxelized data is used as the input of the voxel-to-voxel prediction network, which estimates the likelihood of each voxel for each joint. Third, for each joint, the position of the maximum likelihood is found and converted to the real-world coordinate it represents, and this is taken as the final result of the model.
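The third step (taking the maximum-likelihood voxel of each joint and converting it to a real coordinate) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `decode_joints`, `box_min` and `voxel_size`, the array layout `(N, D, D, D)` and the choice of reporting the voxel center are all assumptions made for the example.

```python
import numpy as np

def decode_joints(heatmaps, box_min, voxel_size):
    """Convert per-joint voxel likelihoods into 3D coordinates.

    heatmaps   : array of shape (N, D, D, D), one likelihood volume per joint
    box_min    : (x, y, z) of the corner of the 3D box with the smallest coordinates (assumed)
    voxel_size : edge length of one voxel, in the same unit as box_min (assumed)
    """
    n_joints = heatmaps.shape[0]
    # Flat index of the maximum-likelihood voxel of every joint
    flat_idx = heatmaps.reshape(n_joints, -1).argmax(axis=1)
    # Back to (i, j, k) voxel indices, stacked into shape (N, 3)
    ijk = np.stack(np.unravel_index(flat_idx, heatmaps.shape[1:]), axis=1)
    # Assumed convention: report the center of the winning voxel
    return np.asarray(box_min, dtype=float) + (ijk + 0.5) * voxel_size
```

With a voxel edge of 10 units and a box starting at the origin, a peak at voxel (1, 2, 3) decodes to (15, 25, 35) under this convention.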

The precondition of the improved target location is a three-dimensional bounding box that contains the hand or the human body in three-dimensional space.

Further, the three-dimensional bounding box is generally placed around a reference point. The reference point can be taken from the annotated (ground-truth) joint positions, or it can be the centroid of the hand region obtained after applying a simple depth threshold.

Further, the annotated joint positions and the centroid have the following limitations:

First, annotated joint positions are not readily available in practical applications;

Second, in complex environments the centroid is subject to error, so there is no guarantee that the target lies exactly inside the resulting three-dimensional bounding box.

Further, these limitations can be overcome by training a simple two-dimensional convolutional neural network to estimate an accurate reference point.

Further, for the two-dimensional convolutional neural network, a simple depth threshold is applied to the hand region and the centroid is computed as an initial reference point; the network takes a depth image as input and outputs the three-dimensional offset from the computed reference point to the center of the annotated joint positions; this offset is then added to the previously computed reference point to obtain the improved reference point.
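The reference point refinement can be sketched as below. The sketch rests on several assumptions that the text does not state: a pinhole camera with intrinsics `fx, fy, cx, cy`, a centroid taken over the back-projected 3D points, and a callable `predict_offset` that stands in for the trained two-dimensional convolutional neural network. All function and parameter names are hypothetical.

```python
import numpy as np

def centroid_reference(depth, fx, fy, cx, cy, near, far):
    """Centroid of the pixels whose depth lies in [near, far], in 3D camera coordinates."""
    v, u = np.nonzero((depth >= near) & (depth <= far))  # simple depth threshold
    if v.size == 0:
        raise ValueError("no pixels fall inside the depth threshold")
    z = depth[v, u].astype(float)
    # Pinhole back-projection of the selected pixels (assumed camera model)
    points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    return points.mean(axis=0)

def refine_reference(depth, intrinsics, near, far, predict_offset):
    """Add the predicted 3D offset to the centroid-based reference point."""
    reference = centroid_reference(depth, *intrinsics, near, far)
    # predict_offset is a placeholder for the trained 2D CNN: depth image -> (dx, dy, dz)
    offset = np.asarray(predict_offset(depth), dtype=float)
    return reference + offset
```

In use, `predict_offset` would wrap a forward pass of the trained network; here any function returning three numbers can be substituted to exercise the arithmetic.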

The input of the system is constructed as follows: first, each pixel of the two-dimensional depth map is back-projected into three-dimensional space; then, the three-dimensional space is discretized according to a predefined voxel size; next, a three-dimensional bounding box is drawn around the reference point to extract the target; finally, the value of every voxel that coincides with the position of a depth point is set to 1, and the value of all other voxels is set to 0.
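The four steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented implementation: a pinhole camera model with intrinsics `fx, fy, cx, cy`, a depth value of 0 marking invalid pixels, and a cubic box of edge `cube_size` split into `grid` voxels per axis. None of these names or conventions come from the text.

```python
import numpy as np

def voxelize(depth, fx, fy, cx, cy, ref_point, cube_size, grid):
    """Build a binary occupancy volume of shape (grid, grid, grid) from a depth map."""
    # 1. Back-project every valid pixel to a 3D point (pinhole model, assumed)
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(float)
    points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # 2. Discretize space with a fixed voxel size, 3. inside a box around the reference point
    box_min = np.asarray(ref_point, dtype=float) - cube_size / 2.0
    voxel_size = cube_size / grid
    idx = np.floor((points - box_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < grid), axis=1)

    # 4. Voxels that contain a depth point get value 1, all others stay 0
    volume = np.zeros((grid, grid, grid), dtype=np.float32)
    volume[tuple(idx[inside].T)] = 1.0
    return volume
```

Points that fall outside the box are simply dropped, which is how the box "extracts the target" in this sketch.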

The voxel-to-voxel prediction network mainly comprises the following three parts:

First, four types of building blocks are used, namely the volumetric basic block, the volumetric residual block, the volumetric down-sampling block and the volumetric up-sampling block;

Second, the network is built: it starts with a volumetric basic block and a volumetric down-sampling block, then extracts useful local features through three consecutive volumetric residual blocks, and then enters the encoder and decoder;

Third, a three-dimensional heatmap is constructed to supervise the per-voxel likelihood of each joint, where the mean of the Gaussian peak is fixed at the annotated (ground-truth) joint position, i.e.:

$$H_n^*(i,j,k) = \exp\left(-\frac{(i-i_n)^2 + (j-j_n)^2 + (k-k_n)^2}{2\sigma^2}\right) \tag{1}$$

Meanwhile,

$$L = \sum_{n=1}^{N} \sum_{i,j,k} \left\| H_n^*(i,j,k) - H_n(i,j,k) \right\|^2 \tag{2}$$

The loss function is the mean squared error function shown in formula (2).
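The Gaussian target heatmap and the squared-error loss of formulas (1) and (2) in claim 9 can be sketched as follows, reading H*_n as the target heatmap of joint n, H_n as the network estimate, (i_n, j_n, k_n) as the voxel index of the annotated joint and σ as the standard deviation of the Gaussian. That reading, the function names and the grid size parameter are assumptions of this sketch.

```python
import numpy as np

def gaussian_heatmap(center, grid, sigma):
    """Formula (1): 3D Gaussian with its peak (value 1) at the voxel index `center`."""
    i, j, k = np.meshgrid(np.arange(grid), np.arange(grid), np.arange(grid), indexing="ij")
    dist2 = (i - center[0]) ** 2 + (j - center[1]) ** 2 + (k - center[2]) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def heatmap_loss(target, estimate):
    """Formula (2): squared error summed over all joints and all voxels."""
    return float(np.sum((target - estimate) ** 2))
```

For several joints, the per-joint heatmaps can be stacked into one array of shape `(N, grid, grid, grid)` and passed to `heatmap_loss` together with the network output of the same shape.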

Further, in the encoder, the volumetric down-sampling blocks reduce the spatial size of the feature maps and the volumetric residual blocks increase the number of channels; in the decoder, the volumetric up-sampling blocks enlarge the spatial size of the feature maps, and during up-sampling the network reduces the number of channels, thereby compressing the extracted features.
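The way feature map size and channel count move in opposite directions through the encoder and decoder can be traced with a small bookkeeping sketch. The factor of 2 per level, the number of levels and the starting values are illustrative assumptions; the text does not specify them.

```python
def trace_shapes(channels, size, levels):
    """List (channels, spatial size) after each encoder and decoder level.

    Assumes each level halves or doubles the spatial size and doubles or
    halves the channel count; these factors are hypothetical.
    """
    shapes = [(channels, size)]
    for _ in range(levels):          # encoder: smaller feature maps, more channels
        channels, size = channels * 2, size // 2
        shapes.append((channels, size))
    for _ in range(levels):          # decoder: larger feature maps, fewer channels
        channels, size = channels // 2, size * 2
        shapes.append((channels, size))
    return shapes
```

Under these assumptions, `trace_shapes(32, 32, 2)` walks from 32 channels at size 32 down to 128 channels at size 8 and back up to 32 channels at size 32.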

Brief description of the drawings

Fig. 1 is the overall architecture diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention.

Fig. 2 shows combinations of different inputs and outputs of the three-dimensional pose estimation network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention.

Fig. 3 shows the reference point refinement network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention.

Fig. 4 is the encoder structure diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention.

Fig. 5 is the decoder structure diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with one another. The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is the overall architecture diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention. First, the points are back-projected into three-dimensional space and the continuous space is discretized, thereby converting the two-dimensional depth map into a three-dimensional volumetric representation. Then, the three-dimensional voxelized data is used as the input of the voxel-to-voxel prediction network, which estimates the likelihood of each voxel for each joint. Finally, for each joint, the position of the maximum likelihood is found and converted to the real-world coordinate it represents, and this is taken as the final result of the model.

Fig. 2 shows combinations of different inputs and outputs of the three-dimensional pose estimation network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention. To solve the problems of perspective distortion and nonlinear mapping, the present invention provides a voxel-to-voxel prediction network for pose estimation. Unlike previous methods, the voxel-to-voxel prediction network takes a voxelized grid as input and estimates the likelihood of each voxel for each joint.

By converting the two-dimensional depth image into three-dimensional voxel form as the input of the network, the network sees the actual appearance of the target object without distortion. Meanwhile, estimating the likelihood of each voxel for each joint makes the desired task easier for the network to learn.
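A small worked example shows why the depth image is distorted while the back-projected points are not. It assumes a pinhole camera with focal length `fx` in pixels; the function names and the numbers are illustrative only.

```python
def pixel_width(width_mm, z_mm, fx):
    """Width in pixels of an object of metric width `width_mm` seen at depth `z_mm`."""
    return fx * width_mm / z_mm

def metric_width(width_px, z_mm, fx):
    """Metric width recovered by back-projecting `width_px` pixels at depth `z_mm`."""
    return width_px * z_mm / fx

# The same 100 mm object spans a different number of pixels at each depth,
# but back-projection recovers the same metric width both times.
near_px = pixel_width(100.0, 500.0, 500.0)
far_px = pixel_width(100.0, 1000.0, 500.0)
near_mm = metric_width(near_px, 500.0, 500.0)
far_mm = metric_width(far_px, 1000.0, 500.0)
```

In the image, the object's size depends on its distance from the camera; in the voxel grid, it occupies the same number of voxels wherever it stands.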

Fig. 3 shows the reference point refinement network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention. To localize the joints, a three-dimensional bounding box containing the hand or the human body in three-dimensional space is required. The box is generally placed around a reference point, which can be taken from the annotated (ground-truth) joint positions or can be the centroid of the hand region obtained after applying a simple depth threshold. However, the annotated joint positions and the centroid have the following limitations: first, annotated joint positions are not readily available in practical applications; second, in complex environments the centroid is subject to error, so there is no guarantee that the target lies exactly inside the resulting three-dimensional bounding box.

Therefore, to overcome these limitations, a simple two-dimensional convolutional neural network can be trained to estimate an accurate reference point. Specifically, a simple depth threshold is applied to the hand region and the centroid is computed as an initial reference point; the network takes a depth image as input and outputs the three-dimensional offset from the computed reference point to the center of the annotated joint positions; this offset is then added to the previously computed reference point to obtain the improved reference point.

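Fig. 4 is the encoder structure diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention. The voxel-to-voxel prediction network mainly comprises the following three parts: first, four types of building blocks are used, namely the volumetric basic block, the volumetric residual block, the volumetric down-sampling block and the volumetric up-sampling block; second, the network starts with a volumetric basic block and a volumetric down-sampling block, then extracts useful local features through three consecutive volumetric residual blocks, and then enters the encoder and decoder; third, a three-dimensional heatmap is constructed to supervise the per-voxel likelihood of each joint, using the Gaussian target of formula (1) and the mean squared error loss of formula (2).

In the encoder, the volumetric down-sampling blocks reduce the spatial size of the feature maps, and the volumetric residual blocks increase the number of channels.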

Fig. 5 is the decoder structure diagram of the voxel-to-voxel prediction network of the accurate three-dimensional hand and human body pose estimation method based on a single depth image of the present invention. In the decoder, the volumetric up-sampling blocks enlarge the spatial size of the feature maps; during up-sampling, the network reduces the number of channels, thereby compressing the extracted features.

For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope, and such improvements and variations shall also be regarded as falling within the scope of protection of the invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. An accurate three-dimensional hand and human body pose estimation method based on a single depth image, characterized in that it mainly comprises: a network model (1); an improved target location (2); an input of the system (3); and a voxel-to-voxel prediction network (4).

2. The network model (1) according to claim 1, characterized in that the task of the model is to estimate the three-dimensional coordinates of all joints, in the following three main steps:

First, the points are back-projected into three-dimensional space and the continuous space is discretized, thereby converting the two-dimensional depth map into a three-dimensional volumetric representation;

Second, the three-dimensional voxelized data is used as the input of the voxel-to-voxel prediction network, which estimates the likelihood of each voxel for each joint;

Third, for each joint, the position of the maximum likelihood is found and converted to the real-world coordinate it represents, and this is taken as the final result of the model.

3. The improved target location (2) according to claim 1, characterized in that its precondition is a three-dimensional bounding box that contains the hand or the human body in three-dimensional space.

4. The three-dimensional bounding box according to claim 3, characterized in that it is generally placed around a reference point; and the reference point can be taken from the annotated (ground-truth) joint positions, or can be the centroid of the hand region obtained after applying a simple depth threshold.

5. The annotated joint positions and the centroid according to claim 4, characterized in that they have the following limitations:

First, annotated joint positions are not readily available in practical applications;

Second, in complex environments the centroid is subject to error, so there is no guarantee that the target lies exactly inside the resulting three-dimensional bounding box.

6. The limitations according to claim 5, characterized in that, to overcome them, a simple two-dimensional convolutional neural network can be trained to estimate an accurate reference point.

7. The two-dimensional convolutional neural network according to claim 6, characterized in that a simple depth threshold is applied to the hand region and the centroid is computed as an initial reference point; a depth image is taken as input, and the three-dimensional offset from the computed reference point to the center of the annotated joint positions is output; this offset is then added to the previously computed reference point to obtain the improved reference point.

8. The input of the system (3) according to claim 1, characterized in that, first, each pixel of the two-dimensional depth map is back-projected into three-dimensional space; then, the three-dimensional space is discretized according to a predefined voxel size; next, a three-dimensional bounding box is drawn around the reference point to extract the target; finally, the value of every voxel that coincides with the position of a depth point is set to 1, and the value of all other voxels is set to 0.

9. The voxel-to-voxel prediction network (4) according to claim 1, characterized in that it mainly comprises the following three parts:

First, four types of building blocks are used, namely the volumetric basic block, the volumetric residual block, the volumetric down-sampling block and the volumetric up-sampling block;

Second, the network is built: it starts with a volumetric basic block and a volumetric down-sampling block, then extracts useful local features through three consecutive volumetric residual blocks, and then enters the encoder and decoder;

Third, a three-dimensional heatmap is constructed to supervise the per-voxel likelihood of each joint, where the mean of the Gaussian peak is fixed at the annotated (ground-truth) joint position, i.e.:

$$H_n^*(i,j,k) = \exp\left(-\frac{(i-i_n)^2 + (j-j_n)^2 + (k-k_n)^2}{2\sigma^2}\right) \tag{1}$$

Meanwhile,

$$L = \sum_{n=1}^{N} \sum_{i,j,k} \left\| H_n^*(i,j,k) - H_n(i,j,k) \right\|^2 \tag{2}$$

10. The encoder and decoder according to claim 9, characterized in that, in the encoder, the volumetric down-sampling blocks reduce the spatial size of the feature maps and the volumetric residual blocks increase the number of channels; in the decoder, the volumetric up-sampling blocks enlarge the spatial size of the feature maps, and during up-sampling the network reduces the number of channels, thereby compressing the extracted features.