CN108830150B - Three-dimensional human body pose estimation method and device - Google Patents

Three-dimensional human body pose estimation method and device Download PDF

Info

Publication number
CN108830150B
CN108830150B (application CN201810426144.1A)
Authority
CN
China
Prior art keywords
image
human body
key point
depth image
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810426144.1A
Other languages
Chinese (zh)
Other versions
CN108830150A (en)
Inventor
吕蕾
张凯
张桂娟
刘弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Micro-Chain Daoi Technology Co.,Ltd.
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201810426144.1A priority Critical patent/CN108830150B/en
Publication of CN108830150A publication Critical patent/CN108830150A/en
Application granted granted Critical
Publication of CN108830150B publication Critical patent/CN108830150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional human body pose estimation method and device. The method includes: S1: acquiring depth images and RGB color images of a human body from different angles with a monocular camera; S2: constructing a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images; S3: constructing a hand joint 2D-3D mapping network; S4: calibrating the depth image and the keypoint-annotated image of the same angle of the human body, then applying a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image; S5: based on the keypoint-annotated image and the colored depth image, predicting the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network; S6: fusing the outputs of steps S3 and S5 to achieve a refined three-dimensional human body pose estimate.

Description

Three-dimensional human body pose estimation method and device
Technical field
The invention belongs to the application fields of computer vision, image processing, computer graphics, and deep learning, and in particular relates to a three-dimensional human body pose estimation method and device.
Background technique
Human body pose estimation refers to matching abstracted hierarchical features against a human body model to obtain the pose of the target at different moments. It is a key problem in human motion capture. The pose of a human body has two components: first, the position and orientation of the whole body in world coordinates; second, the joint angles of the body parts and the skin deformation influenced by those angles. The main application fields of human motion pose estimation fall into three broad directions: surveillance, control, and analysis:
(1) In surveillance, traditional applications include automatically detecting and locating pedestrians in airports or subways, counting people, and analyzing crowd flow and congestion. With growing safety awareness, new applications have emerged in recent years, such as analyzing the behavior and movement of individuals or crowds, for example detecting irregular behavior in queues or identifying activities while shopping.
(2) In control, motion estimation results or pose parameters are used to control a target. This is most common in human-computer interaction, and it is increasingly used in the entertainment industry, such as film and game animation. The captured shape, appearance, and motion of a person can be used to make 3D films or to reconstruct a three-dimensional model of a person in a game.
(3) In analysis, applications include automatic diagnosis of surgical patients and the analysis and improvement of athletes' movements. In visual media there are applications such as content-based video retrieval and video compression, and the automotive industry has related applications such as automatic airbag control, drowsiness detection, and pedestrian detection.
Relatively mature commercial motion capture systems are based on electromechanical devices, electromagnetic sensing, or special optical markers. Magnetic or optical markers are attached to a person's limbs, and their three-dimensional trajectories describe the target's motion. These systems are automatic, but their drawbacks are that the equipment is cumbersome and expensive, so they have not been widely adopted.
Therefore, vision-based human motion capture has become a research hotspot. Using the basic principles of computer vision, three-dimensional human motion sequences are extracted directly from video. This approach requires no sensors attached to the joints, leaves human motion unconstrained, and is low-cost and efficient. Most currently popular methods use matching against a human body model: the goal is to find a set of pose parameters in the state space such that the human pose corresponding to these parameters best matches the low-level features extracted from the observed image.
In vision-based motion tracking, the commonly used research approach is:
At the start of tracking, the position of the human body in the first frame of the image sequence is determined; in subsequent frames the human target is determined by relying on the continuity of human motion and on kinematic constraints. There are two methods for determining the body position in the first frame:
First, manually specifying the initial pose of the target, or setting the human model to the approximate pose of the first frame, which works against automating human body tracking.
Second, removing the background around the human body and then determining each body part with a position detection method. This partially achieves automation, but it requires person-background segmentation.
In the subsequent human tracking and 3D pose estimation, there are model-based and model-free methods. Among them:
(1) Conventional model-based methods build a 3D human model in advance and match it to the first frame of the motion sequence. In subsequent tracking they use constraints such as kinematic parameter limits and optimization methods such as gradient descent or stochastic sampling to estimate the model parameters of each frame, thereby obtaining the model motion sequence. The drawback of this approach is that tracking of subsequent frames accumulates error, so long tracking runs easily go wrong.
(2) Model-free methods do not build a human model; instead, based on information such as the geometry, texture, and color presented by the moving body, they estimate the human motion pose using learning-based or exemplar-based methods. The drawback is that the pose is difficult to estimate without prior knowledge, and only specific behavior sets with a limited state description can be tracked.
Both model-based and model-free tracking can be realized with a monocular camera or with multiple cameras. Because ordinary images carry no depth information, reconstruction suffers from the ambiguity of the three-to-two-dimensional mapping, and pose estimation for complex motion is extremely difficult. For this reason, over the past decade or more most human motion tracking techniques have been realized under multi-camera conditions in order to obtain depth information. However, multiple cameras require calibration and are inconvenient to set up in ordinary households, which hinders the spread of motion capture technology into everyday homes.
In summary, given the usage-condition limitations of multi-camera setups in the prior art and the need to interpret depth images quickly and conveniently, an effective solution is required.
Summary of the invention
To remedy the deficiencies of the prior art, the first object of the present invention is to provide a three-dimensional human body pose estimation method that can accurately recognize the three-dimensional human pose in a depth image.
The technical solution of the three-dimensional human body pose estimation method of the invention is as follows:
A three-dimensional human body pose estimation method, comprising:
S1: acquiring depth images and RGB color images of the human body from different angles with a monocular camera;
S2: constructing a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images;
S3: constructing a hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image;
S4: calibrating the depth image and keypoint-annotated image of the same angle of the human body, then applying a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image;
S5: based on the keypoint-annotated image and the colored depth image, predicting the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network;
S6: fusing the outputs of steps S3 and S5 to achieve a refined three-dimensional human body pose estimate.
In step S1, the monocular camera can be implemented with a Kinect camera.
The Kinect is more intelligent than an ordinary camera: it emits infrared light to perform stereoscopic localization of the entire room, and its camera recognizes human movement through the infrared returns; in addition, combined with high-end software on the Xbox 360, it can track 48 positions on the human body in real time.
It should be noted that, besides the Kinect, the monocular camera can also be implemented with other existing monocular cameras.
Further, constructing the human skeleton keypoint detection neural network based on the RGB color images in step S2 specifically includes:
annotating the human skeleton keypoints in the RGB color images to construct a data set;
dividing the constructed data set into a training set and a test set, and feeding the training set into a preset human skeleton keypoint detection neural network for training;
testing the trained human skeleton keypoint detection neural network with the test set until it reaches a preset requirement.
In step S2, the data set for training the human skeleton keypoint detection neural network is formed by annotating skeleton keypoints on the acquired RGB color images; in this way a skeleton keypoint detection neural network meeting the preset requirement can be obtained quickly and accurately. The preset requirement is that the precision of the skeleton keypoints output by the network lies within a preset accuracy range.
The human skeleton keypoint detection neural network can be composed of a VGG-19 network followed by T stages (T a positive integer greater than or equal to 1), each stage consisting of two fully convolutional network branches.
VGG (Visual Geometry Group) is the Visual Geometry Group of the Department of Science and Engineering at Oxford University, which has released a series of convolutional network models beginning with VGG.
It should be noted that the human skeleton keypoint detection neural network may also be another existing neural network model.
Further, in step S3, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image; the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
The loss function of the above hand joint 2D-3D mapping network uses softmax with a cross-entropy loss.
In the present invention, converting the 2D hand detection problem into a segmentation problem eliminates the effect of the size differences of different hands on network accuracy.
It should be noted that, besides the foregoing structure, the hand joint 2D-3D mapping network can also be realized with other existing neural network structures.
Further, in step S4, the step of obtaining the colored depth image specifically includes:
calibrating the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
matching the keypoint-annotated image and depth image of the same angle;
resizing the matched depth image and voxelizing the point cloud.
The present invention calibrates the depth image and the keypoint-annotated image of the same angle with the checkerboard method, so the coordinate information of the keypoints in the image can be obtained accurately.
Further, in step S5, the preset learning network is a U-shaped reinforcement learning network.
A U-shaped reinforcement learning network learns a mapping from environment states to actions so that the actions selected by the agent obtain the maximum reward from the environment; in other words, the external environment's evaluation of the learning system (or the performance of the whole system) is, in some sense, optimal.
The structure of the U-shaped reinforcement learning network is as follows: the input undergoes a preset number of convolution operations and a preset number of pooling operations (max-pool downsampling), each convolution followed by a ReLU activation layer; this is repeated several times, and the number of convolution filters is multiplied accordingly after each downsampling.
The result obtained after downsampling then undergoes a preset number of deconvolution (upsampling) operations with a preset stride, each convolution followed by a ReLU activation layer; this is repeated several times, with the number of filters reduced by the corresponding factor at each upsampling; the result is concatenated with the corresponding convolution result from the contracting (left) path and convolved again.
Finally the corresponding result is output.
It should be noted that the preset learning network may also be a Q-type reinforcement learning network.
The second object of the invention is to provide a three-dimensional human body pose estimation device that can accurately recognize the three-dimensional human pose in a depth image.
The technical solution of the three-dimensional human body pose estimation device of the invention is as follows:
A three-dimensional human body pose estimation device, comprising:
an image acquisition unit, which acquires depth images and RGB color images of the human body from different angles with a monocular camera;
a keypoint annotation unit, which constructs a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images;
a hand recognition unit, which constructs a hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image;
a depth image coloring unit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body and then applies a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image;
a depth image keypoint prediction unit, which, based on the keypoint-annotated image and the colored depth image, predicts the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network;
a three-dimensional pose estimation unit, which fuses the outputs of the hand recognition unit and the depth image keypoint prediction unit to achieve a refined three-dimensional human body pose estimate.
The monocular camera can be implemented with a Kinect camera.
The Kinect is more intelligent than an ordinary camera: it emits infrared light to perform stereoscopic localization of the entire room, and its camera recognizes human movement through the infrared returns; in addition, combined with high-end software on the Xbox 360, it can track 48 positions on the human body in real time.
It should be noted that, besides the Kinect, the monocular camera can also be implemented with other existing monocular cameras.
Further, the keypoint annotation unit comprises:
a data set construction subunit, which annotates the human skeleton keypoints in the RGB color images and constructs a data set;
a neural network training subunit, which divides the constructed data set into a training set and a test set and feeds the training set into a preset human skeleton keypoint detection neural network for training;
a neural network testing subunit, which tests the trained human skeleton keypoint detection neural network with the test set until it reaches the preset requirement.
In the keypoint annotation unit, the data set for training the human skeleton keypoint detection neural network is formed by annotating skeleton keypoints on the acquired RGB color images; in this way a skeleton keypoint detection neural network meeting the preset requirement can be obtained quickly and accurately. The preset requirement is that the precision of the skeleton keypoints output by the network lies within a preset accuracy range.
The human skeleton keypoint detection neural network can be composed of a VGG-19 network followed by T stages (T a positive integer greater than or equal to 1), each stage consisting of two fully convolutional network branches.
VGG (Visual Geometry Group) is the Visual Geometry Group of the Department of Science and Engineering at Oxford University, which has released a series of convolutional network models beginning with VGG.
It should be noted that the human skeleton keypoint detection neural network may also be another existing neural network model.
Further, in the hand recognition unit, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image; the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
The loss function of the above hand joint 2D-3D mapping network uses softmax with a cross-entropy loss.
In the present invention, converting the 2D hand detection problem into a segmentation problem eliminates the effect of the size differences of different hands on network accuracy.
It should be noted that, besides the foregoing structure, the hand joint 2D-3D mapping network can also be realized with other existing neural network structures.
Further, the depth image coloring unit comprises:
a calibration subunit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
a matching subunit, which matches the keypoint-annotated image and depth image of the same angle;
a point-cloud voxelization subunit, which resizes the matched depth image and voxelizes the point cloud.
The present invention calibrates the depth image and the keypoint-annotated image of the same angle with the checkerboard method, so the coordinate information of the keypoints in the image can be obtained accurately.
Further, in the depth image keypoint prediction unit, the preset learning network is a U-shaped reinforcement learning network.
A U-shaped reinforcement learning network learns a mapping from environment states to actions so that the actions selected by the agent obtain the maximum reward from the environment; in other words, the external environment's evaluation of the learning system (or the performance of the whole system) is, in some sense, optimal.
The structure of the U-shaped reinforcement learning network is as follows: the input undergoes a preset number of convolution operations and a preset number of pooling operations (max-pool downsampling), each convolution followed by a ReLU activation layer; this is repeated several times, and the number of convolution filters is multiplied accordingly after each downsampling.
The result obtained after downsampling then undergoes a preset number of deconvolution (upsampling) operations with a preset stride, each convolution followed by a ReLU activation layer; this is repeated several times, with the number of filters reduced by the corresponding factor at each upsampling; the result is concatenated with the corresponding convolution result from the contracting (left) path and convolved again.
Finally the corresponding result is output.
It should be noted that the preset learning network may also be a Q-type reinforcement learning network.
Compared with the prior art, the beneficial effects of the present invention are:
(1) The invention acquires depth images and RGB color images of the human body from different angles with a monocular camera, which removes the usage-condition limits of multi-camera setups in the field of human pose estimation; the method is easier to realize and can accurately recognize the three-dimensional human pose in a depth image.
(2) After neural network training, the invention can recognize three-dimensional human poses in real time.
(3) The trained neural network models can be stored in miniature terminal devices and conveniently integrated into smart home and intelligent interaction equipment.
Detailed description of the invention
The accompanying drawings, which constitute a part of this application, are used to provide further understanding of the application; the illustrative embodiments of the application and their descriptions explain the application and do not constitute an undue limitation on it.
Fig. 1 is a flow diagram of the three-dimensional human body pose estimation method of the invention;
Fig. 2 is a schematic diagram of one embodiment of the three-dimensional human body pose estimation method of the invention;
Fig. 3 is a schematic diagram of one embodiment of the human skeleton keypoint detection neural network of the invention;
Fig. 4 is a schematic diagram of one embodiment of the hand joint 2D-3D mapping neural network of the invention;
Fig. 5 is a schematic diagram of one embodiment of the U-shaped reinforcement learning neural network of the invention;
Fig. 6 is a structural schematic diagram of the three-dimensional human body pose estimation device of the invention.
Specific embodiment
It should be noted that the following detailed description is illustrative and intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; in addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in Fig. 1, the three-dimensional human body pose estimation method of the invention includes steps S1 to S6.
Specifically, the technical solution of the invention is illustrated below with one embodiment of the three-dimensional human body pose estimation method, as shown in Fig. 2:
The three-dimensional human body pose estimation method of the invention comprises:
S1: acquiring depth images and RGB color images of the human body from different angles with a monocular camera.
In step S1, the monocular camera can be implemented with a Kinect camera.
The Kinect is more intelligent than an ordinary camera: it emits infrared light to perform stereoscopic localization of the entire room, and its camera recognizes human movement through the infrared returns; in addition, combined with high-end software on the Xbox 360, it can track 48 positions on the human body in real time.
It should be noted that, besides the Kinect, the monocular camera can also be implemented with other existing monocular cameras.
S2: constructing the human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images.
Constructing the human skeleton keypoint detection neural network based on the RGB color images in step S2 specifically includes:
Step S21: annotating the human skeleton keypoints in the RGB color images and constructing a data set.
Specifically, the data set is constructed as follows:
Step S211: 12 Kinect depth cameras are used, placed at three different positions in a room with 4 depth cameras per position, forming four different viewing angles at each position; images of several men and women in different body postures are captured, and the collected photos are organized into an image library.
Step S212: a gesture data set is established with the multiple depth cameras. This data set collects images of 39 different gesture actions performed by 20 people; it is divided into a training set and a test set, and the illumination intensity and background of the images are randomly re-rendered to expand data diversity.
Step S213: bone keypoints are annotated on the image libraries obtained in steps S211 and S212, the keypoint coordinate information (x, y, d) is used as the image label, and a shell script dumps the images and labels into lmdb or hdf5 files, where x and y are the horizontal and vertical coordinates of the keypoint in the depth image and d is the depth coordinate.
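As a concrete illustration of step S213, the following is a minimal Python sketch (using h5py rather than a shell script) of dumping annotated frames and their (x, y, d) labels into an hdf5 file; the file name, dataset keys, image size, and the choice of 18 body keypoints are illustrative assumptions, not values fixed by the patent.

```python
import h5py
import numpy as np

def write_hdf5(images, labels, path="pose_train.h5"):
    """images: (N, H, W, 3) uint8 frames; labels: (N, K, 3) float32 rows (x, y, d)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.asarray(images, dtype=np.uint8),
                         compression="gzip")
        f.create_dataset("label", data=np.asarray(labels, dtype=np.float32))

# Example: 100 dummy 368x368 frames with 18 keypoints each (counts are assumptions)
images = np.zeros((100, 368, 368, 3), dtype=np.uint8)
labels = np.zeros((100, 18, 3), dtype=np.float32)   # (x, y, depth d) per keypoint
write_hdf5(images, labels)
```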
Step S22: the constructed data set is divided into a training set and a test set, and the training set is fed into the preset human skeleton keypoint detection neural network for training.
Step S23: the trained human skeleton keypoint detection neural network is tested with the test set until it reaches the preset requirement.
In step S2, the data set for training the human skeleton keypoint detection neural network is formed by annotating skeleton keypoints on the acquired RGB color images; in this way a skeleton keypoint detection neural network meeting the preset requirement can be obtained quickly and accurately. The preset requirement is that the precision of the skeleton keypoints output by the network lies within a preset accuracy range.
As shown in Fig. 3, the human skeleton keypoint detection neural network can be composed of a VGG-19 network followed by T stages (T a positive integer greater than or equal to 1), each stage consisting of two fully convolutional network branches.
VGG (Visual Geometry Group) is the Visual Geometry Group of the Department of Science and Engineering at Oxford University, which has released a series of convolutional network models beginning with VGG.
Specifically, in this example the processing of the human skeleton keypoint detection neural network is as follows:
S221: the w×h 2D RGB image obtained from the Kinect is taken as input and passed through the first 10 layers of VGG-19 to obtain the feature map F, which serves as the input to each branch of the first stage of the model.
S222: in the first stage, the two branches respectively generate a set of detection confidence maps $S^1 = \rho^1(F)$ and a set of local association fields $L^1 = \phi^1(F)$, where $\rho^1$ and $\phi^1$ are the inference functions of the two branch convolutional neural networks at stage 1.
S223: the specific design of fully convolutional branch 1 is as follows:
(a) because the invention performs 3D pose recognition for several people simultaneously, an independent confidence map $S^*_{j,k}$ is first generated for each person in the RGB image;
(b) $x_{j,k} \in \mathbb{R}^2$ denotes the ground-truth position of body part $j$ of person $k$ in the image, where $j$ and $k$ are positive integers;
(c) a Gaussian distribution keeps the detected body-part keypoint highlighted:
$$S^*_{j,k}(p) = \exp\!\left(-\frac{\lVert p - x_{j,k}\rVert_2^2}{\sigma^2}\right) \qquad (1)$$
(d) in each confidence map the keypoint with the maximum Gaussian value is taken:
$$S^*_{j}(p) = \max_{k} S^*_{j,k}(p) \qquad (2)$$
where $p$ is a pixel coordinate.
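The following numpy sketch illustrates equations (1) and (2): each annotated keypoint of one body part produces a Gaussian peak, and the per-person maps are merged with a pixel-wise maximum; the map size and σ are illustrative assumptions.

```python
import numpy as np

def confidence_map(keypoints_xy, h, w, sigma=7.0):
    """keypoints_xy: (n_people, 2) array of (x, y) positions of one body part j."""
    ys, xs = np.mgrid[0:h, 0:w]
    conf = np.zeros((h, w), dtype=np.float32)
    for x, y in keypoints_xy:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / sigma ** 2)  # eq. (1)
        conf = np.maximum(conf, g)                                  # eq. (2)
    return conf

heat = confidence_map(np.array([[100.0, 120.0], [200.0, 60.0]]), 368, 368)
```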
S224: fully convolutional branch 2 detects the position and direction information of the keypoint connections; its specific design is as follows:
(a) construct the supervising ground-truth local association field $L^*_{c,k}$, where $c$ indexes the $c$-th two-keypoint connecting segment on the $k$-th human body. The construction proceeds as follows:
(b) let $x_{j_1,k}$ and $x_{j_2,k}$ be the two keypoints of the $c$-th connection on the $k$-th body in the image;
(c) the local association vector of the body limb on the $c$-th connection is found with:
$$L^*_{c,k}(p) = \begin{cases} v, & \text{if } p \text{ lies on limb } c \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where $v = (x_{j_2,k} - x_{j_1,k}) / \lVert x_{j_2,k} - x_{j_1,k}\rVert_2$ is the unit vector along the limb, and equation (4) gives the criterion for $p$ lying on limb $c$ (the projection of $p - x_{j_1,k}$ onto $v$ falls within the limb length, and its perpendicular distance from the segment falls within the limb width);
(d) linear interpolation between the two keypoints of connection $c$ approximately yields the pixel coordinates of the points of person $k$ lying on connection $c$:
$$p(u) = (1-u)\,x_{j_1} + u\,x_{j_2}, \qquad 0 \le u \le 1 \qquad (5)$$
(e) the association field of all people with overlapping limbs on connection $c$ in the image is found by averaging:
$$L^*_{c}(p) = \frac{1}{n_c(p)} \sum_{k} L^*_{c,k}(p) \qquad (6)$$
where $n_c(p)$ is the number of non-zero vectors at point $p$;
(f) the predicted local association field is sampled along segment $c$, and $L_c$ measures the confidence that the two candidate keypoints belong to the same person $k$:
$$E = \int_{u=0}^{u=1} L_c\big(p(u)\big) \cdot \frac{x_{j_2} - x_{j_1}}{\lVert x_{j_2} - x_{j_1}\rVert_2}\, du \qquad (7)$$
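The following numpy sketch illustrates the ground-truth field of equations (3) and (4) for a single limb (overlapping people would then be averaged per equation (6)); the limb-width threshold is an illustrative assumption.

```python
import numpy as np

def limb_paf(x_j1, x_j2, h, w, limb_width=4.0):
    """Ground-truth association field of one limb: unit vector v on the limb,
    zero elsewhere (eqs. 3-4). x_j1, x_j2: (2,) arrays of (x, y)."""
    v = x_j2 - x_j1
    length = np.linalg.norm(v)
    if length < 1e-8:
        return np.zeros((h, w, 2), dtype=np.float32)
    v = v / length
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - x_j1[0], ys - x_j1[1]], axis=-1)
    along = d @ v                                         # projection onto the limb axis
    across = np.abs(d[..., 0] * v[1] - d[..., 1] * v[0])  # perpendicular distance
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf = np.zeros((h, w, 2), dtype=np.float32)
    paf[on_limb] = v
    return paf

field = limb_paf(np.array([50.0, 80.0]), np.array([90.0, 160.0]), 368, 368)
```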
S225: each of the two branches uses three 3×3 and two 2×2 convolutional layers per stage;
S226: the output of the first-stage fully convolutional networks is merged with the original feature map F as the input of the second stage, and this iterates up to stage T;
S227: the two-branch model continually refines the target of each branch over the T stages. To effectively avoid vanishing gradients, an $L_2$ loss is added at each stage as intermediate supervision. The loss function of one branch is defined as:
$$f_S^t = \sum_{j} \sum_{p} W(p)\,\lVert S_j^t(p) - S_j^*(p)\rVert_2^2 \qquad (8)$$
where $S^*$ is the ground-truth confidence map annotated when building the database, $S_j^t(p)$ is the predicted confidence value, $t \in [1, 2, \ldots, T]$ is the stage of the branch model, $p$ ranges over the keypoint position coordinates in the map, $j$ indexes the $j$-th keypoint, and $W(p)$ is a binary mask with $W(p) = 0$ where the keypoint annotation is missing and 1 otherwise, which avoids penalizing true positions during network training; equation (9) is the analogous loss $f_L^t$ for the association-field branch.
S228: after stage T, the body-part confidence maps and joint relation fields obtained by the two branches are parsed with a greedy algorithm to obtain the 2D keypoint image of each person. Equation (10) is the model formula of the entire keypoint detection network, accumulating both branches over all stages:
$$f = \sum_{t=1}^{T}\left(f_S^t + f_L^t\right) \qquad (10)$$
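The following PyTorch sketch illustrates the per-stage supervision of equations (8)-(10): a masked L2 loss per branch, summed over both branches and all T stages; tensor shapes and channel counts are illustrative assumptions.

```python
import torch

def stage_loss(pred, target, mask):
    """pred, target: (B, J, H, W); mask W(p): (B, 1, H, W), 0 where unlabeled (eq. 8)."""
    return ((pred - target) ** 2 * mask).sum()

def total_loss(stage_S, stage_L, gt_S, gt_L, mask):
    """eq. (10): accumulate both branch losses over every stage t = 1..T."""
    return sum(stage_loss(S, gt_S, mask) + stage_loss(L, gt_L, mask)
               for S, L in zip(stage_S, stage_L))

# Example with T=2 stages, 14 body parts, 10 limb fields of two channels each
S_preds = [torch.rand(1, 14, 46, 46) for _ in range(2)]
L_preds = [torch.rand(1, 20, 46, 46) for _ in range(2)]
loss = total_loss(S_preds, L_preds, torch.rand(1, 14, 46, 46),
                  torch.rand(1, 20, 46, 46), torch.ones(1, 1, 46, 46))
```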
It should be noted that the human skeleton keypoint detection neural network may also be another existing neural network model.
S3: constructing the hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image.
In step S3, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image; the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
The loss function of the above hand joint 2D-3D mapping network uses softmax with a cross-entropy loss.
In the present invention, converting the 2D hand detection problem into a segmentation problem eliminates the effect of the size differences of different hands on network accuracy.
The detailed process of constructing the hand joint 2D-3D mapping network, as shown in Fig. 4, is as follows:
S31: the original 2D RGB image is resized to 256×256×3 as the input of the hand image segmentation network; the network uses the (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling structure, the loss function uses softmax with cross-entropy, and the network outputs a 256×256×3 hand segmentation image.
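A minimal PyTorch sketch of a segmentation network with the structure named in S31 — (convolution + ReLU) blocks, max pooling, and bilinear upsampling back to the 256×256 input, trained with softmax cross-entropy — follows; the channel counts, depth, and two-class output are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandSegNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # 256 -> 128
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # 128 -> 64
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.score = nn.Conv2d(128, n_classes, 1)

    def forward(self, x):
        h = self.score(self.features(x))
        # bilinear upsampling restores the 256x256 input resolution
        return F.interpolate(h, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

net = HandSegNet()
logits = net(torch.zeros(1, 3, 256, 256))          # (1, 2, 256, 256)
# softmax cross-entropy against a per-pixel class map
loss = F.cross_entropy(logits, torch.zeros(1, 256, 256, dtype=torch.long))
```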
S32: a neural network with the same structure as in S31 takes the output of S31 as input; it generates bounding boxes for the 21 joints of the hand, Gaussian noise with mean 0 and variance 10 is added at the bounding-box centers, and the network generates 21 separate 32×32×1 joint heat maps.
S33: the 3D estimates of the 21 2D joint heat maps are then obtained; the specific method is as follows:
S34: first define a three-dimensional hand joint coordinate set $w_i = (x_i, y_i, z_i)$, $i \in [1, J]$, $J = 21$.
S35: using the hand 3D database obtained in step S212, a fully convolutional neural network is trained with an $L_2$ loss; the network uses a (convolutional layer + ReLU activation layer) + fully connected layer structure.
S36: using the prior knowledge obtained from the fully convolutional network trained in S35, a regularized coordinate set is established for each keypoint of the 2D hand image, with the formula:
$$S = \lVert w_{k+1} - w_k \rVert \qquad (12)$$
where $k \in [1, 20]$.
S37: a relative coordinate system is established to eliminate the relative distortion of joint positions caused by factors such as hand-size differences. In this example the first joint of the index finger is taken as the root node, i.e. $r = 1$; the relative position of each remaining node with respect to the first index-finger joint is then found with equation (13):
$$w_i^{rel} = \frac{w_i - w_r}{S} \qquad (13)$$
where $w_r$ is the first joint node of the index finger.
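The following numpy sketch illustrates the normalization of equations (12) and (13): joints are expressed relative to the root joint $w_r$ and divided by the reference bone length S; the joint indices chosen for the root and the reference bone are illustrative assumptions.

```python
import numpy as np

def normalize_hand(w, root=1, bone=(1, 2)):
    """w: (21, 3) 3D joint coordinates. Returns scale-free coordinates relative
    to the root joint (eq. 13), normalized by the reference bone length (eq. 12)."""
    S = np.linalg.norm(w[bone[1]] - w[bone[0]])   # eq. (12)
    return (w - w[root]) / S                      # eq. (13)

w_rel = normalize_hand(np.random.rand(21, 3))
```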
It should be noted that, besides the foregoing structure, the hand joint 2D-3D mapping network can also be realized with other existing neural network structures.
S4: calibrating the depth image and keypoint-annotated image of the same angle of the human body, then applying a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image.
In step S4, the step of obtaining the colored depth image specifically includes:
calibrating the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
matching the keypoint-annotated image and depth image of the same angle;
resizing the matched depth image and voxelizing the point cloud.
The present invention calibrates the depth image and the keypoint-annotated image of the same angle with the checkerboard method, so the coordinate information of the keypoints in the image can be obtained accurately.
S41: the RGB camera of the Kinect is calibrated with the checkerboard method, and the RGB intrinsics are computed with the Matlab Camera Calibration Toolbox.
S42: the depth camera of the Kinect is calibrated with the checkerboard method, and the depth-camera intrinsics are computed with the Matlab Camera Calibration Toolbox.
S43: the 2D RGB camera and the 3D depth camera are registered; the specific steps are as follows:
S44: the depth-image space coordinate system is established with equation (14):
$$P_{ir} = H_{ir}^{-1}\, p_{ir} \qquad (14)$$
where $P_{ir}$ is the space coordinate of a point in the depth-camera frame, $p_{ir}$ is the projection coordinate of that point in the image plane (x and y in pixels, z the depth value in millimeters), and $H_{ir}$ is the intrinsic matrix of the depth camera.
S45: the space coordinates for the RGB camera are established with equations (15) and (16):
$$P_{rgb} = R\, P_{ir} + T \qquad (15)$$
$$p_{rgb} = H_{rgb}\, P_{rgb} \qquad (16)$$
where $P_{rgb}$ is the space coordinate of the same point in the RGB-camera frame, $p_{rgb}$ is the projection coordinate of that point in the RGB image plane, $H_{rgb}$ is the intrinsic matrix of the RGB camera, $R$ is the rotation matrix, and $T$ is the translation vector.
S46: using the camera extrinsic matrices, points in the global (world) coordinate system are transformed into each camera frame, as in equation (17):
$$P_{ir} = R_{ir} P_w + T_{ir}, \qquad P_{rgb} = R_{rgb} P_w + T_{rgb} \qquad (17)$$
from which $R = R_{rgb} R_{ir}^{-1}$ and $T = T_{rgb} - R\, T_{ir}$, where the rotation matrix $R_{ir}$ ($R_{rgb}$) and translation vector $T_{ir}$ ($T_{rgb}$) form the extrinsic matrix of the depth (RGB) camera.
S47: the registered image is resized into a 64×64×64 voxelized point-cloud matrix.
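The following numpy sketch chains equations (14)-(16) to map one depth pixel into RGB pixel coordinates; the matrices below are placeholders that would come from the checkerboard calibration of S41-S42, and the final comment shows how R and T could be derived from the extrinsics as in equation (17).

```python
import numpy as np

def register_depth_pixel(x, y, z, H_ir, H_rgb, R, T):
    """Map a depth pixel (x, y in pixels, depth z in mm) to RGB pixel coordinates."""
    p_ir = np.array([x * z, y * z, z], dtype=np.float64)
    P_ir = np.linalg.inv(H_ir) @ p_ir     # eq. (14): 3D point in the depth frame
    P_rgb = R @ P_ir + T                  # eq. (15): into the RGB camera frame
    p_rgb = H_rgb @ P_rgb                 # eq. (16): project with RGB intrinsics
    return p_rgb[:2] / p_rgb[2]

# Placeholder calibration values; real ones come from the checkerboard steps.
H = np.array([[580.0, 0.0, 320.0], [0.0, 580.0, 240.0], [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.array([25.0, 0.0, 0.0])
# eq. (17) would give R = R_rgb @ inv(R_ir) and T = T_rgb - R @ T_ir
print(register_depth_pixel(100, 120, 1500.0, H, H, R, T))
```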
S5: based on the keypoint-annotated image and the colored depth image, predicting the corresponding positions of the annotated skeleton keypoints in the depth image with the preset learning network.
In step S5, the preset learning network is a U-shaped reinforcement learning network.
A U-shaped reinforcement learning network learns a mapping from environment states to actions so that the actions selected by the agent obtain the maximum reward from the environment; in other words, the external environment's evaluation of the learning system (or the performance of the whole system) is, in some sense, optimal.
The structure of the U-shaped reinforcement learning network is as follows: the input undergoes a preset number of convolution operations and a preset number of pooling operations (max-pool downsampling), each convolution followed by a ReLU activation layer; this is repeated several times, and the number of convolution filters is multiplied accordingly after each downsampling.
The result obtained after downsampling then undergoes a preset number of deconvolution (upsampling) operations with a preset stride, each convolution followed by a ReLU activation layer; this is repeated several times, with the number of filters reduced by the corresponding factor at each upsampling; the result is concatenated with the corresponding convolution result from the contracting (left) path and convolved again.
Finally the corresponding result is output.
It should be noted that the preset learning network may also be a Q-type reinforcement learning network.
Specifically, the U-shaped reinforcement learning network structure, as shown in Fig. 5, is:
S52: two 3×3×3 convolution operations and one 2×2×2 pooling operation (max-pool downsampling) are applied to the input formed from the outputs of steps S2 and S4, each convolution followed by a ReLU activation layer; this is repeated 4 times, and the number of convolution filters doubles after each downsampling.
S53: two 3×3 convolution operations and one deconvolution (upsampling) operation of stride 2×2 are applied to the result obtained after downsampling, each convolution followed by a ReLU activation layer; this is repeated 4 times, with the number of filters halved at each upsampling; the result is concatenated with the corresponding convolution result from the contracting (left) path and convolved again, at which point the number of convolution filters is halved.
S54: the keypoint confidence maps in the point cloud are output.
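A minimal PyTorch sketch of the U-shaped structure of S52-S54 follows, using 3×3×3 convolutions, 2×2×2 max pooling, doubled filters on the way down, and transposed-convolution upsampling with skip concatenation; only two resolution levels are shown (the text repeats four), and the channel counts, input channels, and keypoint count are illustrative assumptions. The reinforcement-learning training procedure is omitted.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    """Two 3x3x3 convolutions, each followed by a ReLU activation."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class UNet3D(nn.Module):
    def __init__(self, cin=4, base=8, n_keypoints=15):
        super().__init__()
        self.enc1, self.enc2 = block(cin, base), block(base, base * 2)
        self.pool = nn.MaxPool3d(2)                      # 2x2x2 max-pool downsampling
        self.mid = block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose3d(base * 4, base * 2, 2, stride=2)
        self.dec2 = block(base * 4, base * 2)            # after skip concatenation
        self.up1 = nn.ConvTranspose3d(base * 2, base, 2, stride=2)
        self.dec1 = block(base * 2, base)
        self.out = nn.Conv3d(base, n_keypoints, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                # 64^3
        e2 = self.enc2(self.pool(e1))                    # 32^3, filters doubled
        m = self.mid(self.pool(e2))                      # 16^3
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)                              # keypoint confidence volumes

vol = torch.zeros(1, 4, 64, 64, 64)   # assumed channels: colored occupancy volume
heatmaps = UNet3D()(vol)
```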
S6: fusing the outputs of steps S3 and S5 to achieve a refined three-dimensional human body pose estimate.
The three-dimensional human body pose estimation method of the invention acquires depth images and RGB color images of the human body from different angles with a monocular camera, which removes the usage-condition limits of multi-camera setups in the field of human pose estimation; the method is easier to realize and can accurately recognize the three-dimensional human pose in a depth image.
As shown in Fig. 6, the technical solution of the three-dimensional human body pose estimation device of the invention is as follows:
A three-dimensional human body pose estimation device, comprising:
(1) an image acquisition unit, which acquires depth images and RGB color images of the human body from different angles with a monocular camera.
The monocular camera can be implemented with a Kinect camera.
The Kinect is more intelligent than an ordinary camera: it emits infrared light to perform stereoscopic localization of the entire room, and its camera recognizes human movement through the infrared returns; in addition, combined with high-end software on the Xbox 360, it can track 48 positions on the human body in real time.
It should be noted that, besides the Kinect, the monocular camera can also be implemented with other existing monocular cameras.
(2) a keypoint annotation unit, which constructs a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images.
The keypoint annotation unit comprises:
a data set construction subunit, which annotates the human skeleton keypoints in the RGB color images and constructs a data set;
a neural network training subunit, which divides the constructed data set into a training set and a test set and feeds the training set into a preset human skeleton keypoint detection neural network for training;
a neural network testing subunit, which tests the trained human skeleton keypoint detection neural network with the test set until it reaches the preset requirement.
In the keypoint annotation unit, the data set for training the human skeleton keypoint detection neural network is formed by annotating skeleton keypoints on the acquired RGB color images; in this way a skeleton keypoint detection neural network meeting the preset requirement can be obtained quickly and accurately. The preset requirement is that the precision of the skeleton keypoints output by the network lies within a preset accuracy range.
The human skeleton keypoint detection neural network can be composed of a VGG-19 network followed by T stages (T a positive integer greater than or equal to 1), each stage consisting of two fully convolutional network branches.
VGG (Visual Geometry Group) is the Visual Geometry Group of the Department of Science and Engineering at Oxford University, which has released a series of convolutional network models beginning with VGG.
It should be noted that the human skeleton keypoint detection neural network may also be another existing neural network model.
(3) a hand recognition unit, which constructs the hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image.
In the hand recognition unit, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image; the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
The loss function of the above hand joint 2D-3D mapping network uses softmax with a cross-entropy loss.
In the present invention, converting the 2D hand detection problem into a segmentation problem eliminates the effect of the size differences of different hands on network accuracy.
It should be noted that, besides the foregoing structure, the hand joint 2D-3D mapping network can also be realized with other existing neural network structures.
(4) a depth image coloring unit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body and then applies a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image.
The depth image coloring unit comprises:
a calibration subunit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
a matching subunit, which matches the keypoint-annotated image and depth image of the same angle;
a point-cloud voxelization subunit, which resizes the matched depth image and voxelizes the point cloud.
The present invention calibrates the depth image and the keypoint-annotated image of the same angle with the checkerboard method, so the coordinate information of the keypoints in the image can be obtained accurately.
(5) a depth image keypoint prediction unit, which, based on the keypoint-annotated image and the colored depth image, predicts the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network.
In the depth image keypoint prediction unit, the preset learning network is a U-shaped reinforcement learning network.
A U-shaped reinforcement learning network learns a mapping from environment states to actions so that the actions selected by the agent obtain the maximum reward from the environment; in other words, the external environment's evaluation of the learning system (or the performance of the whole system) is, in some sense, optimal.
The structure of the U-shaped reinforcement learning network is as follows: the input undergoes a preset number of convolution operations and a preset number of pooling operations (max-pool downsampling), each convolution followed by a ReLU activation layer; this is repeated several times, and the number of convolution filters is multiplied accordingly after each downsampling.
The result obtained after downsampling then undergoes a preset number of deconvolution (upsampling) operations with a preset stride, each convolution followed by a ReLU activation layer; this is repeated several times, with the number of filters reduced by the corresponding factor at each upsampling; the result is concatenated with the corresponding convolution result from the contracting (left) path and convolved again.
Finally the corresponding result is output.
It should be noted that the preset learning network may also be a Q-type reinforcement learning network.
(6) a three-dimensional pose estimation unit, which fuses the outputs of the hand recognition unit and the depth image keypoint prediction unit to achieve a refined three-dimensional human body pose estimate.
The three-dimensional human body pose estimation device of the invention acquires depth images and RGB color images of the human body from different angles with a monocular camera, which removes the usage-condition limits of multi-camera setups in the field of human pose estimation; the device is easier to realize and can accurately recognize the three-dimensional human pose in a depth image.
Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system, a device, or a computer program product. Therefore, the invention can take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor create means for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific way, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that realize the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above embodiment methods can be completed by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Although the specific embodiments of the present invention are described above with reference to the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, based on the technical solutions of the present invention, various modifications or variations that can be made without creative labor still fall within the protection scope of the present invention.

Claims (10)

1. A three-dimensional human body pose estimation method, characterized by comprising:
S1: acquiring depth images and RGB color images of the human body from different angles with a monocular camera;
S2: constructing a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images;
S3: constructing a hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image;
S4: calibrating the depth image and keypoint-annotated image of the same angle of the human body, then applying a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image;
S5: based on the keypoint-annotated image and the colored depth image, predicting the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network;
S6: fusing the outputs of steps S3 and S5 to achieve a refined three-dimensional human body pose estimate.
2. The three-dimensional human body pose estimation method according to claim 1, characterized in that constructing the human skeleton keypoint detection neural network based on the RGB color images in step S2 specifically includes:
annotating the human skeleton keypoints in the RGB color images to construct a data set;
dividing the constructed data set into a training set and a test set, and feeding the training set into a preset human skeleton keypoint detection neural network for training;
testing the trained human skeleton keypoint detection neural network with the test set until it reaches a preset requirement.
3. The three-dimensional human body pose estimation method according to claim 1, characterized in that, in step S3, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image, and the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
4. The three-dimensional human body pose estimation method according to claim 1, characterized in that, in step S4, the step of obtaining the colored depth image specifically includes:
calibrating the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
matching the keypoint-annotated image and depth image of the same angle;
resizing the matched depth image and voxelizing the point cloud.
5. The three-dimensional human body pose estimation method according to claim 1, characterized in that, in step S5, the preset learning network is a U-shaped reinforcement learning network.
6. A three-dimensional human body pose estimation device, characterized by comprising:
an image acquisition unit, which acquires depth images and RGB color images of the human body from different angles with a monocular camera;
a keypoint annotation unit, which constructs a human skeleton keypoint detection neural network based on the RGB color images to obtain keypoint-annotated images;
a hand recognition unit, which constructs a hand joint 2D-3D mapping network based on the corresponding RGB color image and keypoint-annotated image;
a depth image coloring unit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body and then applies a three-dimensional point-cloud coloring conversion to the corresponding depth image to obtain a colored depth image;
a depth image keypoint prediction unit, which, based on the keypoint-annotated image and the colored depth image, predicts the corresponding positions of the annotated skeleton keypoints in the depth image with a preset learning network;
a three-dimensional pose estimation unit, which fuses the outputs of the hand recognition unit and the depth image keypoint prediction unit to achieve a refined three-dimensional human body pose estimate.
7. The three-dimensional human body pose estimation device according to claim 6, characterized in that the keypoint annotation unit comprises:
a data set construction subunit, which annotates the human skeleton keypoints in the RGB color images and constructs a data set;
a neural network training subunit, which divides the constructed data set into a training set and a test set and feeds the training set into a preset human skeleton keypoint detection neural network for training;
a neural network testing subunit, which tests the trained human skeleton keypoint detection neural network with the test set until it reaches a preset requirement.
8. The three-dimensional human body pose estimation device according to claim 6, characterized in that, in the hand recognition unit, the constructed hand joint 2D-3D mapping network outputs a hand segmentation image, and the structure of the hand joint 2D-3D mapping network is: (convolutional layer + ReLU activation layer) + max-pooling layer + bilinear upsampling.
9. The three-dimensional human body pose estimation device according to claim 6, characterized in that the depth image coloring unit comprises:
a calibration subunit, which calibrates the depth image and keypoint-annotated image of the same angle of the human body using the checkerboard method;
a matching subunit, which matches the keypoint-annotated image and depth image of the same angle;
a point-cloud voxelization subunit, which resizes the matched depth image and voxelizes the point cloud.
10. The three-dimensional human body pose estimation device according to claim 6, wherein in the depth image key point prediction unit, the preset learning network is a U-shaped reinforcement learning network.
CN201810426144.1A 2018-05-07 2018-05-07 Three-dimensional human body pose estimation method and device Active CN108830150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810426144.1A CN108830150B (en) Three-dimensional human body pose estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810426144.1A CN108830150B (en) Three-dimensional human body pose estimation method and device

Publications (2)

Publication Number Publication Date
CN108830150A CN108830150A (en) 2018-11-16
CN108830150B true CN108830150B (en) 2019-05-28

Family

ID=64147503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810426144.1A Active CN108830150B (en) Three-dimensional human body pose estimation method and device

Country Status (1)

Country Link
CN (1) CN108830150B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11809616B1 (en) 2022-06-23 2023-11-07 Qing Zhang Twin pose detection method and system based on interactive indirect inference

Families Citing this family (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222379A (en) * 2018-11-27 2020-06-02 株式会社日立制作所 Hand detection method and device
CN109684943B (en) * 2018-12-07 2021-03-16 北京首钢自动化信息技术有限公司 Athlete auxiliary training data acquisition method and device and electronic equipment
CN109815813B (en) * 2018-12-21 2021-03-05 深圳云天励飞技术有限公司 Image processing method and related product
CN109871123B (en) * 2019-01-21 2022-08-16 广东精标科技股份有限公司 Teaching method based on gesture or eye control
CN109886986B (en) * 2019-01-23 2020-09-08 北京航空航天大学 Dermatoscope image segmentation method based on multi-branch convolutional neural network
CN109920208A (en) * 2019-01-31 2019-06-21 深圳绿米联创科技有限公司 Tumble prediction technique, device, electronic equipment and system
CN109934111B (en) * 2019-02-12 2020-11-24 清华大学深圳研究生院 Fitness posture estimation method and system based on key points
CN109949368B (en) * 2019-03-14 2020-11-06 郑州大学 Human body three-dimensional attitude estimation method based on image retrieval
CN110032992B (en) * 2019-04-25 2023-05-23 沈阳图为科技有限公司 Examination cheating detection method based on gestures
CN110175528B (en) * 2019-04-29 2021-10-26 北京百度网讯科技有限公司 Human body tracking method and device, computer equipment and readable medium
CN111914595B (en) * 2019-05-09 2022-11-15 中国科学院软件研究所 Human hand three-dimensional attitude estimation method and device based on color image
CN110119148B (en) * 2019-05-14 2022-04-29 深圳大学 Six-degree-of-freedom attitude estimation method and device and computer readable storage medium
CN110188633B (en) * 2019-05-14 2023-04-07 广州虎牙信息科技有限公司 Human body posture index prediction method and device, electronic equipment and storage medium
CN110135375B (en) * 2019-05-20 2021-06-01 中国科学院宁波材料技术与工程研究所 Multi-person attitude estimation method based on global information integration
CN110176016B (en) * 2019-05-28 2021-04-30 招远市国有资产经营有限公司 Virtual fitting method based on human body contour segmentation and skeleton recognition
CN110197156B (en) * 2019-05-30 2021-08-17 清华大学 Single-image human hand action and shape reconstruction method and device based on deep learning
CN112102223B (en) * 2019-06-18 2024-05-14 通用电气精准医疗有限责任公司 Method and system for automatically setting scan range
CN110298916B (en) * 2019-06-21 2022-07-01 湖南大学 Three-dimensional human body reconstruction method based on synthetic depth data
CN110472476B (en) * 2019-06-24 2024-06-28 平安科技(深圳)有限公司 Gesture matching degree acquisition method, device, computer and storage medium
CN110472481B (en) * 2019-07-01 2024-01-05 华南师范大学 Sleeping gesture detection method, device and equipment
CN110495889B (en) * 2019-07-04 2022-05-27 平安科技(深圳)有限公司 Posture evaluation method, electronic device, computer device, and storage medium
CN110428493B (en) * 2019-07-12 2021-11-02 清华大学 Single-image human body three-dimensional reconstruction method and system based on grid deformation
CN110348524B (en) * 2019-07-15 2022-03-04 深圳市商汤科技有限公司 Human body key point detection method and device, electronic equipment and storage medium
CN110427917B (en) * 2019-08-14 2022-03-22 北京百度网讯科技有限公司 Method and device for detecting key points
CN110555412B (en) * 2019-09-05 2023-05-16 深圳龙岗智能视听研究院 End-to-end human body gesture recognition method based on combination of RGB and point cloud
CN110728739B (en) * 2019-09-30 2023-04-14 杭州师范大学 Virtual human control and interaction method based on video stream
CN111079523B (en) * 2019-11-05 2024-05-14 北京迈格威科技有限公司 Object detection method, device, computer equipment and storage medium
CN111027407B (en) * 2019-11-19 2023-04-07 东南大学 Color image hand posture estimation method for shielding situation
CN111062326B (en) * 2019-12-02 2023-07-25 北京理工大学 Self-supervision human body 3D gesture estimation network training method based on geometric driving
CN111028283B (en) * 2019-12-11 2024-01-12 北京迈格威科技有限公司 Image detection method, device, equipment and readable storage medium
CN113012091A (en) * 2019-12-20 2021-06-22 中国科学院沈阳计算技术研究所有限公司 Impeller quality detection method and device based on multi-dimensional monocular depth estimation
CN111179419B (en) * 2019-12-31 2023-09-05 北京奇艺世纪科技有限公司 Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111160375B (en) * 2019-12-31 2024-01-23 北京奇艺世纪科技有限公司 Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111429499B (en) * 2020-02-24 2023-03-10 中山大学 High-precision three-dimensional reconstruction method for hand skeleton based on single depth camera
CN113382154A (en) * 2020-02-25 2021-09-10 荣耀终端有限公司 Human body image beautifying method based on depth and electronic equipment
CN111046858B (en) * 2020-03-18 2020-09-08 成都大熊猫繁育研究基地 Image-based animal species fine classification method, system and medium
CN113449565A (en) * 2020-03-27 2021-09-28 海信集团有限公司 Three-dimensional attitude estimation method, intelligent device and storage medium
CN111582204A (en) * 2020-05-13 2020-08-25 北京市商汤科技开发有限公司 Attitude detection method and apparatus, computer device and storage medium
CN111753669A (en) * 2020-05-29 2020-10-09 广州幻境科技有限公司 Hand data identification method, system and storage medium based on graph convolution network
CN111753747B (en) * 2020-06-28 2023-11-24 高新兴科技集团股份有限公司 Violent motion detection method based on monocular camera and three-dimensional attitude estimation
CN111753801A (en) * 2020-07-02 2020-10-09 上海万面智能科技有限公司 Human body posture tracking and animation generation method and device
CN111968235B (en) * 2020-07-08 2024-04-12 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112076073A (en) * 2020-07-27 2020-12-15 深圳瀚维智能医疗科技有限公司 Automatic massage area dividing method and device, massage robot and storage medium
CN112069933A (en) * 2020-08-21 2020-12-11 董秀园 Skeletal muscle stress estimation method based on posture recognition and human body biomechanics
CN111881887A (en) * 2020-08-21 2020-11-03 董秀园 Multi-camera-based motion attitude monitoring and guiding method and device
CN112107318B (en) * 2020-09-24 2024-02-27 自达康(北京)科技有限公司 Physical activity ability evaluation system
CN112287866B (en) * 2020-11-10 2024-05-31 上海依图网络科技有限公司 Human body action recognition method and device based on human body key points
CN112116653B (en) * 2020-11-23 2021-03-30 华南理工大学 Object posture estimation method for multiple RGB pictures
CN112836594B (en) * 2021-01-15 2023-08-08 西北大学 Three-dimensional hand gesture estimation method based on neural network
CN112766153B (en) * 2021-01-19 2022-03-11 合肥工业大学 Three-dimensional human body posture estimation method and system based on deep learning
CN112836824B (en) * 2021-03-04 2023-04-18 上海交通大学 Monocular three-dimensional human body pose unsupervised learning method, system and medium
CN113112583B (en) * 2021-03-22 2023-06-20 成都理工大学 3D human body reconstruction method based on infrared thermal imaging
CN113158910A (en) * 2021-04-25 2021-07-23 北京华捷艾米科技有限公司 Human skeleton recognition method and device, computer equipment and storage medium
CN113362452B (en) * 2021-06-07 2022-11-15 中南大学 Hand posture three-dimensional reconstruction method and device and storage medium
CN113609993A (en) * 2021-08-06 2021-11-05 烟台艾睿光电科技有限公司 Attitude estimation method, device and equipment and computer readable storage medium
CN113762177A (en) * 2021-09-13 2021-12-07 成都市谛视科技有限公司 Real-time human body 3D posture estimation method and device, computer equipment and storage medium
CN113689503B (en) * 2021-10-25 2022-02-25 北京市商汤科技开发有限公司 Target object posture detection method, device, equipment and storage medium
CN114821819B (en) * 2022-06-30 2022-09-23 南通同兴健身器材有限公司 Real-time monitoring method for body-building action and artificial intelligence recognition system
CN116797625B (en) * 2023-07-20 2024-04-19 无锡埃姆维工业控制设备有限公司 Monocular three-dimensional workpiece pose estimation method


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597515A (en) * 2011-06-06 2014-02-19 微软公司 System for recognizing an open or closed hand
CN102855470A (en) * 2012-07-31 2013-01-02 中国科学院自动化研究所 Estimation method of human posture based on depth image
CN102982557A (en) * 2012-11-06 2013-03-20 桂林电子科技大学 Method for processing space hand signal gesture command based on depth camera
CN104715493A (en) * 2015-03-23 2015-06-17 北京工业大学 Moving body posture estimating method
CN105069423A (en) * 2015-07-29 2015-11-18 北京格灵深瞳信息技术有限公司 Human body posture detection method and device
CN106570903A (en) * 2016-10-13 2017-04-19 华南理工大学 Visual identification and positioning method based on RGB-D camera
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Human Action Recognition Based on Kinect Skeleton Information; Liu Fei; China Master's Theses Full-text Database, Information Science and Technology; 2014-06-15 (No. 06); I138-955


Also Published As

Publication number Publication date
CN108830150A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108830150B (en) Three-dimensional human body pose estimation method and device
CN111126272B (en) Posture acquisition method, and training method and device of key point coordinate positioning model
CN105787439B (en) Depth image human joint localization method based on convolutional neural networks
CN104715493B (en) Moving human body pose estimation method
CN105069746B (en) Video real-time face replacement method and its system based on local affine invariant and color transfer technology
CN105631861B (en) Method for recovering three-dimensional human pose from unmarked monocular images combined with height maps
CN104978580B (en) Insulator recognition method for UAV inspection of power transmission lines
CN102855470B (en) Estimation method of human posture based on depth image
CN100543775C (en) 3D human motion tracking method based on multi-view cameras
CN104794737B (en) Depth-information-assisted particle filter tracking method
CN107767419A (en) Human skeleton key point detection method and device
CN104036488B (en) Binocular vision-based human body posture and action research method
CN107423730A (en) Active human gait behavior detection and recognition system and method based on semantic folding
Nguyen et al. Static hand gesture recognition using artificial neural network
CN106997605A (en) Method for obtaining a three-dimensional foot shape by collecting foot video and sensor data with a smartphone
CN101520902A (en) System and method for low cost motion capture and demonstration
CN106023211A (en) Robot image positioning method and system based on deep learning
CN109087245A (en) Unmanned aerial vehicle remote sensing image mosaic system based on neighbouring relations model
CN104537705A (en) Augmented reality based mobile platform three-dimensional biomolecule display system and method
CN111160294A (en) Gait recognition method based on graph convolution network
CN109000655A (en) Bionic robot indoor positioning and navigation method
CN102289822A (en) Method for tracking moving target collaboratively by multiple cameras
CN114036969A (en) 3D human body action recognition algorithm under multi-view condition
CN117557755B (en) Virtual scene secondary normal school biochemical body and clothing visualization method and system
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210415

Address after: 102300 No.1 Qiaoyuan Road, Mentougou District, Beijing

Patentee after: Beijing Micro-Chain Daoi Technology Co.,Ltd.

Address before: 250014 No. 88, Wenhua East Road, Lixia District, Shandong, Ji'nan

Patentee before: SHANDONG NORMAL University
