CN107066935A - Hand gestures method of estimation and device based on deep learning - Google Patents
- Publication number
- CN107066935A CN107066935A CN201710061286.8A CN201710061286A CN107066935A CN 107066935 A CN107066935 A CN 107066935A CN 201710061286 A CN201710061286 A CN 201710061286A CN 107066935 A CN107066935 A CN 107066935A
- Authority
- CN
- China
- Prior art keywords
- hand
- image
- point cloud
- dimensional point
- estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/11—Hand-related biometrics; Hand pose recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
Embodiments of the present invention relate to the field of communications and computer technology, and propose a deep-learning-based hand pose estimation method and device: hand region-of-interest detection is performed on a depth image, and a hand image is segmented from the hand region of interest; a three-dimensional point cloud image of the hand is obtained from the hand image; and hand pose estimation is performed on the three-dimensional point cloud image of the hand using deep learning. In this scheme, the deep-learning-based hand pose estimation method eliminates manual hand-feature extraction and greatly improves the robustness of the hand pose estimation.
Description
Technical field
Embodiments of the present invention relate to the field of communications and computer technology, and more specifically to a deep-learning-based hand pose estimation method and device.
Background art
This section is intended to provide background or context for the embodiments of the present invention stated in the claims. The description herein is not admitted to be prior art merely by its inclusion in this section.
Human-computer interaction is the field that studies the interactive relationship between systems and users. It plays an increasingly important role in daily life and can greatly improve the user experience. Traditional interaction modes, such as the mouse and keyboard, satisfy a certain degree of interaction, but their convenience is greatly limited. Gesture recognition is an important technology in human-computer interaction and a current research hotspot: by recognizing static or dynamic gestures, it issues recognition instructions that cause the system to execute the corresponding commands, thereby achieving the purpose of interaction.
This disclosure concerns three-dimensional hand pose estimation within gesture recognition; owing to the popularity of human-computer interaction technology, it has attracted the attention of academia and industry in recent years. Three-dimensional hand pose estimation techniques fall into two major classes: 1) methods based on a discriminative model; 2) methods based on a generative model. A discriminative model is a learning-based method: it first performs feature extraction on the hand region image and then recognizes the hand pose by building a classifier. A generative model, by contrast, has difficulty recovering the pose after hand tracking fails, runs slowly, and is of limited practicality. Generative methods are computationally expensive and often less accurate; discriminative methods are fast, but their estimation results contain a certain error and the range of poses they can express is limited.
Chinese patent CN201510670919.6, published on March 9, 2016, discloses a depth-data-based three-dimensional gesture pose estimation method and system comprising the following steps: 1. extract the depth data and the ROI region of the hand: (1) obtain skeleton-point information using an SDK and detect the hand ROI region from a hand skeleton point; (2) if the skeleton-point information of the hand cannot be obtained, detect the hand ROI region using a skin-color-based approach; 2. preliminarily estimate the three-dimensional global orientation of the hand: first perform feature extraction, then use a trained classifier to regress the global orientation of the hand; 3. estimate the joint pose of the three-dimensional gesture: use a trained classifier to estimate the hand pose, and finally perform pose correction. The method first completes the segmentation of the hand ROI data with the two cooperating approaches, then completes the global orientation estimation of the hand with a regression algorithm on this basis, and finally, aided by these data, applies a regression algorithm again to realize three-dimensional gesture pose estimation.
Chinese patent CN201610321710.3, published on October 26, 2016, discloses a hand pose estimation method based on depth information and a correction scheme, comprising the following steps: 1. obtain the depth data of the hand and segment the hand region from the hand depth data; 2. detect the palm pose from the hand region; 3. combine the palm pose with a standard hand skeleton model to compute the position of each hand joint; 4. compute the projection features of each hand joint; 5. correct the finger pose according to the projection features of each hand joint. That invention works directly on depth data: it segments the hand region, computes the palm pose, and then estimates the finger pose from the depth image with pose correction.
Summary of the invention
However, the aforementioned patents CN201510670919.6 and CN201610321710.3 suffer from the following problems:
1. When estimating the pose, the former uses a regression method and the latter a random-forest method; both require hand-designed features and a manual feature-extraction step. The feature-design process is cumbersome, and the designed features may not fully characterize the hand, which strongly affects the final pose estimation.
2. Hand ROI detection: the former directly detects hand skeleton points using the SDK of the Kinect v2 sensor; after switching to another sensor, these skeleton points may not be detectable, so the applicability is limited. Likewise, its skin-color-based hand segmentation is strongly affected by environmental factors such as the external surroundings, and the skin colors of different ethnic groups require special handling, so that approach too is not broadly applicable.
Therefore, an improved deep-learning-based hand pose estimation method and device are highly desirable, to overcome the prior-art defect that manual hand-feature extraction makes the hand pose estimation poorly robust.
In this context, embodiments of the present invention are expected to provide a deep-learning-based hand pose estimation method and device.
In a first aspect of embodiments of the present invention, a deep-learning-based hand pose estimation method is provided, comprising:
performing hand region-of-interest detection on a depth image, and segmenting a hand image from the hand region of interest;
obtaining a three-dimensional point cloud image of the hand from the hand image; and
performing hand pose estimation on the three-dimensional point cloud image of the hand using deep learning.
In certain embodiments, according to the foregoing method of the present invention, performing hand region-of-interest detection on the depth image comprises:
obtaining the depth image with a depth sensor in an image capture device; and
extracting the hand region of interest from the depth image using the relative depth relationship between foreground and background in the depth image.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, segmenting the hand image from the hand region of interest comprises:
performing edge detection and contour detection on the hand region of interest to detect the hand region; and
denoising the hand region to segment out the hand image.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, contour detection on the hand region of interest is performed using an image convexity-and-concavity point detection method.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, obtaining the three-dimensional point cloud image containing the hand from the hand image comprises:
calibrating the internal parameters of the image capture device;
preliminarily obtaining the three-dimensional point cloud image containing the hand from the internal parameters; and
applying size normalization to the preliminarily obtained three-dimensional point cloud image of the hand to obtain the finished three-dimensional point cloud image containing the hand.
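The two steps above, back-projecting depth pixels through the calibrated intrinsics and size-normalizing the resulting cloud, can be sketched as follows. The pinhole model, the intrinsic values fx, fy, cx, cy, and the unit-cube normalization are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map to an N x 3 point cloud with the
    pinhole model; zero-depth pixels are discarded."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)

def normalize_cloud(points):
    """Center the cloud and scale it into a unit cube (size normalization)."""
    centered = points - points.mean(axis=0)
    scale = np.abs(centered).max()
    return centered / scale if scale > 0 else centered
```

In practice the intrinsics would come from calibrating the depth camera; here they are passed in directly.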
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the step of performing hand pose estimation on the three-dimensional point cloud image of the hand using deep learning comprises:
making a hand training data set and obtaining the hand node annotation information of the hand training data set;
extracting the three-dimensional point cloud region of the hand from the hand training data set;
training a convolutional neural network model with the three-dimensional point cloud region of the hand and the hand node annotation information to form a hand pose model; and
obtaining the position of each hand joint node from the three-dimensional point cloud image of the hand using the hand pose model.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, making the hand training data set and obtaining the hand node annotation information of the hand training data set comprises:
extracting the three-dimensional point cloud data of the hand region in the depth image;
constructing a loss function L for fitting an initialized three-dimensional hand model to the three-dimensional point cloud data of the hand region, and iteratively optimizing the loss function L;
when the iterative optimization of the loss function L satisfies a preset convergence condition, obtaining the successfully fitted three-dimensional hand model; and
obtaining the hand node annotation information of the hand training data set from the successfully fitted three-dimensional hand model.
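The fitting loop described here can be illustrated with a deliberately small example: a template point set is fitted to a target cloud by gradient descent on a squared-distance loss L until a convergence threshold is met. The pure-translation parameterization, the learning rate, and the tolerance are illustrative stand-ins for the patent's three-dimensional hand model and its loss:

```python
import numpy as np

def fit_model(template, target, lr=0.1, tol=1e-10, max_iter=500):
    """Iteratively optimize a translation t so that template + t fits
    target, minimizing L(t) = mean ||template + t - target||^2."""
    t = np.zeros(3)
    prev = np.inf
    for _ in range(max_iter):
        residual = template + t - target           # per-point error
        loss = (residual ** 2).sum(axis=1).mean()  # loss function L
        if prev - loss < tol:                      # preset convergence condition
            break
        prev = loss
        grad = 2.0 * residual.mean(axis=0)         # dL/dt
        t -= lr * grad
    return t, loss
```

A real hand model would optimize joint angles and shape parameters rather than a single translation, but the iterate-until-converged structure is the same.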
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, making the hand training data set and obtaining the hand node annotation information of the hand training data set further comprises:
collecting a color image and its corresponding depth image, wherein the color image contains a hand with a colored wrist region; and
segmenting out the hand region from the position of the colored wrist region in the color image.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the convolutional neural network model comprises a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully connected layers, and an activation layer after each convolutional layer and each fully connected layer.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the plurality of convolutional layers comprises two convolutional layers with 5 × 5 kernels and one convolutional layer with a 3 × 3 kernel.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the plurality of pooling layers comprises two pooling layers with stride 2 and one pooling layer with stride 1.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the activation layers use the ReLU function.
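The spatial sizes produced by a stack of this shape (two 5 × 5 convolutions, one 3 × 3 convolution, two stride-2 poolings, and one stride-1 pooling, with size-preserving ReLU activations) can be traced with a small helper. The 96 × 96 input size, the zero padding, and the 2 × 2 pooling window are illustrative assumptions; the patent does not state them:

```python
def conv_out(n, kernel, stride=1, pad=0):
    """Output size of a convolution or pooling along one spatial axis."""
    return (n + 2 * pad - kernel) // stride + 1

def hand_cnn_sizes(n=96):
    """Trace one spatial axis through an assumed ordering of the stack:
    conv5 -> pool(s=2) -> conv5 -> pool(s=2) -> conv3 -> pool(s=1)."""
    sizes = [n]
    for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2), (3, 1), (2, 1)]:
        n = conv_out(n, kernel, stride)
        sizes.append(n)
    return sizes
```

For a 96 × 96 input this ordering yields 18 × 18 feature maps before the fully connected layers.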
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, obtaining the position of each hand joint node from the three-dimensional point cloud image of the hand using the hand pose model comprises:
inputting the three-dimensional point cloud image of the hand into the convolutional neural network model;
the last fully connected layer among the plurality of fully connected layers outputting hand pose parameters for a preset number of degrees of freedom to the hand pose model; and
the hand pose model outputting the position of each hand joint node.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the hand pose model outputting the position of each hand joint node comprises:
designing a global loss function G(Ψ) in the hand pose model;
iteratively optimizing the global loss function with a preset method; and
outputting the position of each hand joint node when the global loss function reaches a preset convergence condition.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the global loss function G(Ψ) comprises a node position loss function Gjoint(Ψ) and a joint constraint loss function GDoFconstraint(Ψ), where the global loss function G(Ψ) satisfies the following formula:
G(Ψ) = Gjoint(Ψ) + λ·GDoFconstraint(Ψ)
where λ is the weight adjustment factor of the global loss function G(Ψ).
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the node position loss function Gjoint(Ψ) is:
Gjoint(Ψ) = ‖F(Ψ) − Ygt‖²
where F(·) is the forward kinematics function, Ψ is the hand pose parameter vector for the preset number of degrees of freedom, ψi is the rotation angle of the corresponding hand joint node among the preset degrees of freedom, and Ygt is the hand node annotation information in the hand training data set.
In certain embodiments, according to the method of any of the foregoing embodiments of the present invention, the joint constraint loss function GDoFconstraint(Ψ) is:
GDoFconstraint(Ψ) = Σi [max(0, ψi_low − ψi) + max(0, ψi − ψi_high)]
where ψi_low is the lower-limit constraint of each degree of freedom and ψi_high is the upper-limit constraint of each degree of freedom.
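The combined loss described here, a node-position error term plus a λ-weighted penalty for joint angles that leave their per-degree-of-freedom limits, can be sketched in numpy. The identity stand-in for the forward kinematics F(Ψ), the hinge form of the limit penalty, and all numeric values are illustrative assumptions:

```python
import numpy as np

def global_loss(psi, y_gt, lo, hi, lam=0.1, forward=lambda p: p):
    """G(psi) = Gjoint(psi) + lambda * GDoFconstraint(psi).

    forward : stand-in for the forward kinematics F(psi) (identity here);
    lo, hi  : per-DoF lower and upper angle limits."""
    g_joint = np.sum((forward(psi) - y_gt) ** 2)   # node position loss
    under = np.maximum(0.0, lo - psi)              # below the lower limit
    over = np.maximum(0.0, psi - hi)               # above the upper limit
    g_dof = np.sum(under + over)                   # joint constraint loss
    return g_joint + lam * g_dof
```

The constraint term is zero whenever every angle lies inside its limits, so it only activates for anatomically implausible poses.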
In a second aspect of embodiments of the present invention, a deep-learning-based hand pose estimation device is provided, comprising:
a hand image segmentation module, configured to perform hand region-of-interest detection on a depth image and to segment a hand image from the hand region of interest;
a point cloud image acquisition module, configured to obtain the three-dimensional point cloud image containing the hand from the hand image; and
a hand pose estimation module, configured to perform hand pose estimation on the three-dimensional point cloud image of the hand using deep learning.
In certain embodiments, according to the foregoing device of the present invention, the hand image segmentation module comprises an image capture device and a hand region-of-interest extraction unit, wherein:
the image capture device is configured to obtain the depth image through a depth sensor therein; and
the hand region-of-interest extraction unit extracts the hand region of interest from the depth image using the relative depth relationship between foreground and background in the depth image.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the hand image segmentation module comprises a hand region detection unit and a hand image segmentation unit, wherein:
the hand region detection unit is configured to perform edge detection and contour detection on the hand region of interest to detect the hand region; and
the hand image segmentation unit is configured to denoise the hand region and segment out the hand image.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the hand pose estimation module comprises a training data set making submodule, a point cloud region extraction submodule, a hand pose model building submodule, and a hand pose estimation submodule, wherein:
the training data set making submodule is configured to make the hand training data set and obtain the hand node annotation information of the hand training data set;
the point cloud region extraction submodule is configured to extract the three-dimensional point cloud region of the hand from the hand training data set;
the hand pose model building submodule is configured to train a convolutional neural network model with the three-dimensional point cloud region of the hand and the hand node annotation information to form the hand pose model; and
the hand pose estimation submodule is configured to obtain the position of each hand joint node from the three-dimensional point cloud image of the hand using the hand pose model.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the convolutional neural network model comprises a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully connected layers, and an activation layer after each convolutional layer and each fully connected layer.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the hand pose estimation submodule comprises a point cloud image input unit, a pose parameter output unit, and a joint node position output unit, wherein:
the point cloud image input unit is configured to input the three-dimensional point cloud image of the hand into the convolutional neural network model;
the pose parameter output unit is configured to have the last fully connected layer among the plurality of fully connected layers output hand pose parameters for a preset number of degrees of freedom to the hand pose model; and
the joint node position output unit is configured to have the hand pose model output the position of each hand joint node.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the joint node position output unit comprises a global loss function design subunit, an iterative optimization subunit, and a joint node position output subunit, wherein:
the global loss function design subunit is configured to design the global loss function G(Ψ) in the hand pose model;
the iterative optimization subunit iteratively optimizes the global loss function with a preset method; and
the joint node position output subunit is configured to output the position of each hand joint node when the global loss function reaches the preset convergence condition.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the global loss function G(Ψ) comprises a node position loss function Gjoint(Ψ) and a joint constraint loss function GDoFconstraint(Ψ), where the global loss function G(Ψ) satisfies the following formula:
G(Ψ) = Gjoint(Ψ) + λ·GDoFconstraint(Ψ)
where λ is the weight adjustment factor of the global loss function G(Ψ).
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the node position loss function Gjoint(Ψ) is:
Gjoint(Ψ) = ‖F(Ψ) − Ygt‖²
where F(·) is the forward kinematics function, Ψ is the hand pose parameter vector for the preset number of degrees of freedom, ψi is the rotation angle of the corresponding hand joint node among the preset degrees of freedom, and Ygt is the hand node annotation information in the hand training data set.
In certain embodiments, according to the device of any of the foregoing embodiments of the present invention, the joint constraint loss function GDoFconstraint(Ψ) is:
GDoFconstraint(Ψ) = Σi [max(0, ψi_low − ψi) + max(0, ψi − ψi_high)]
where ψi_low is the lower-limit constraint of each degree of freedom and ψi_high is the upper-limit constraint of each degree of freedom.
According to the deep-learning-based hand pose estimation method and device of embodiments of the present invention: hand region-of-interest detection is performed on the depth image, and the hand image is segmented from the hand region of interest; the three-dimensional point cloud image of the hand is obtained from the hand image; and hand pose estimation is performed on the three-dimensional point cloud image of the hand using deep learning. In this scheme, the deep-learning-based hand pose estimation method eliminates manual hand-feature extraction and greatly improves the robustness of the hand pose estimation, thereby overcoming the prior-art defect that manual hand-feature extraction leads to poor hand pose estimation.
In addition, according to some embodiments, the hand region of interest can be extracted for the scene directly from the relative depth relationship between foreground and background in the depth image, and the hand image segmented by contour detection, which further improves the effectiveness of hand detection.
Brief description of the drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become easy to understand by reading the following detailed description with reference to the accompanying drawings. The drawings show several embodiments of the present invention by way of example and not limitation, wherein:
Fig. 1 schematically shows a flow chart of the deep-learning-based hand pose estimation method according to an embodiment of the present invention;
Fig. 2 schematically shows a flow chart of obtaining the hand image from the depth image according to an embodiment of the present invention;
Fig. 3 schematically shows a flow chart of performing hand pose estimation on the three-dimensional point cloud image of the hand using deep learning according to an embodiment of the present invention;
Fig. 4 schematically shows a flow chart of making the hand training data set according to an embodiment of the present invention;
Fig. 5 schematically shows a schematic diagram of the deep-learning-based hand pose estimation method according to an embodiment of the present invention;
Fig. 6 schematically shows a schematic diagram of the deep-learning-based hand pose estimation device according to an embodiment of the present invention;
Fig. 7 schematically shows another schematic diagram of the deep-learning-based hand pose estimation device according to an embodiment of the present invention;
Fig. 8 schematically shows an illustrative diagram of a computer-readable storage medium according to an embodiment of the present invention.
In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.
Embodiments
The principle and spirit of the present invention are described below with reference to several illustrative embodiments. It should be understood that these embodiments are provided only so that those skilled in the art can better understand and then realize the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the disclosure to those skilled in the art.
Those skilled in the art know that embodiments of the present invention can be implemented as a system, device, apparatus, method, or computer program product. Therefore, the disclosure can take the following forms: complete hardware, complete software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
The principle of the deep-learning-based hand pose estimation method used in the present invention is explained below.
Deep learning is a new field in machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, imitating the mechanisms of the human brain to interpret data such as images, sound, and text. In the exemplary embodiments of the disclosure, the convolutional neural network (CNN) technique used is one kind of deep learning technique. A convolutional neural network combines the two-dimensional discrete convolution operation from image processing with an artificial neural network; "convolutional" refers to the convolutional layers at the front end. This convolution operation can be used to extract features automatically.
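The two-dimensional discrete convolution that such a layer applies can be written out directly. A minimal sketch with a hand-picked edge-detecting kernel, purely for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D discrete convolution (kernel flipped, as in the
    mathematical definition used in image processing)."""
    k = np.flipud(np.fliplr(kernel))
    kh, kw = k.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out
```

In a CNN the kernel entries are not hand-picked like this; they are the weights learned during training, which is what makes the feature extraction automatic.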
Herein, any number of elements in the drawings is for example and not limitation, and any naming is only for distinction and carries no limitation whatsoever.
The principle and spirit of the present invention are explained in detail below with reference to several representative embodiments.
Summary of the invention
The inventors discovered that hand pose estimation based on deep learning can proceed in the following way: perform hand region-of-interest detection on the depth image and segment the hand image from the hand region of interest; obtain the three-dimensional point cloud image of the hand from the hand image; and perform hand pose estimation on the three-dimensional point cloud image of the hand using deep learning. In this scheme, the deep-learning-based hand pose estimation method eliminates manual hand-feature extraction and greatly improves the robustness of the hand pose estimation, thereby overcoming the prior-art defect that manual hand-feature extraction leads to poor hand pose estimation.
Having described the general principle of the present invention, the various non-limiting embodiments of the present invention are introduced in detail below.
Illustrative methods
The deep-learning-based hand pose estimation method according to an exemplary embodiment of the present invention is described below with reference to Fig. 1. It should be noted that the following illustrative embodiments are shown only to facilitate the understanding of the spirit and principle of the present invention; embodiments of the present invention are not limited in this regard.
In the deep-learning-based hand pose estimation method, foreground detection and contour detection can first be applied to the collected depth image to accurately segment the complete hand region of interest. Afterwards, a convolutional neural network in deep learning builds the hand features automatically and obtains the 3D node positions of the hand, where ji = (xi, yi, zi) is the 3D position of joint i in a single depth map, achieving the purpose of hand pose estimation. Meanwhile, the embodiments of the disclosure also disclose a method of making the hand training data set.
Fig. 1 schematically shows a flow chart of the deep-learning-based hand pose estimation method according to an embodiment of the present invention. As shown in Fig. 1, the method can comprise steps S10, S20, and S30.
The method starts at step S10: perform hand region-of-interest (ROI) detection on the depth image, and segment the hand image from the hand region of interest.
In an embodiment of the present invention, with reference to Fig. 2, step S10 may further comprise steps S11, S12, S13, and S14.
In step S11, the depth image is obtained with a depth sensor in an image capture device.
For example, the depth sensor can be an Astra depth camera, but the disclosure is not limited to this. The original depth image can be obtained directly from the depth sensor.
The present embodiment is based primarily on depth data, its purpose being to estimate the pose state of the hand in the depth data. The embodiment takes depth data as input: compared with a traditional camera, a depth sensor obtains the distance information of the photographed object, making it easy to segment the target from the background.
In step S12, the hand region of interest is extracted from the depth image using the relative depth relationship between foreground and background in the depth image.
In an embodiment of the present invention, the ROI region containing the hand can be extracted from the depth image using the relative depth relationship between foreground and background, by a background subtraction method such as the frame difference method.
In step S13, edge detection and contour detection are performed on the hand region of interest to detect the hand region.
In an embodiment of the present invention, contour detection on the hand region of interest can use an image convexity-and-concavity point detection method to detect the rough contour region of the hand, preliminarily completing the segmentation of the hand.
In step S14, denoising is performed on the hand region to segment out the hand image.
After the preliminary segmentation of the hand is completed, the following denoising operations can be applied to the detected hand region to finally segment out a complete hand image:
1) Median filtering: rejects depth-data jitter and the influence of local noise;
2) Morphological processing: the hand image is first dilated and then eroded (a morphological closing), smoothing the contour and enhancing the edges;
3) Hole filling: based on the neighbourhood correlation of the image, each "black hole" position is filled using the depth values of its neighbourhood, suppressing the influence that black holes in the hand image would otherwise have on gesture recognition.
It should be noted that the denoising approaches above are given only by way of illustration; the disclosure does not limit the specific denoising method, and any image-denoising method known in the art may be used.
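The hole-filling step (item 3 above) can be sketched as an iterative neighbourhood fill: zero-valued "black holes" take the average depth of their valid 4-neighbours. This is a minimal illustration only; a production pipeline would combine it with the median filtering and morphological closing also listed above.

```python
import numpy as np

def fill_holes(depth, max_iter=10):
    """Fill zero-valued 'black holes' in a depth map with the average
    depth of their valid 4-neighbours, repeated until no holes remain.
    np.roll wraps at the image border; that simplification is acceptable
    for this sketch because the border here is non-zero."""
    d = depth.astype(float).copy()
    for _ in range(max_iter):
        holes = d == 0
        if not holes.any():
            break
        neighbours = [np.roll(d, shift, axis=axis)
                      for axis in (0, 1) for shift in (1, -1)]
        valid = [(n > 0).astype(float) for n in neighbours]
        num, den = sum(neighbours), sum(valid)
        # Average over valid neighbours only; pixels with none stay 0.
        fill = np.divide(num, den, out=np.zeros_like(d), where=den > 0)
        d[holes] = fill[holes]
    return d

depth = np.array([[5., 5., 5.],
                  [5., 0., 5.],
                  [5., 5., 5.]])
print(fill_holes(depth)[1, 1])  # 5.0 — the hole takes its neighbours' depth
```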
After step S10, step S20 can be performed: a three-dimensional point cloud image of the hand is obtained from the hand image.
In an embodiment of the present invention, the three-dimensional (3D) point cloud image containing the hand can be obtained from the hand image in the following manner:
calibrating the internal parameters of the image capture device;
preliminarily obtaining the three-dimensional point cloud image containing the hand according to the internal parameters; and
applying size normalization to the preliminarily obtained point cloud image of the hand, to obtain the finished three-dimensional point cloud image containing the hand.
Specifically, the image capture device is calibrated to obtain its internal parameters, from which the 3D point cloud of the hand image is preliminarily computed. The point cloud image containing the hand is then converted to a preset size (for example 128 × 128 pixels, although not limited thereto), while the depth values are normalized to [-1, 1] to suppress the influence of geometric variation of the hand on the result. The finished 3D point cloud image can then serve as the input to the convolutional neural network.
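The two operations of step S20 — back-projection with the calibrated internal parameters and normalization of the depth values to [-1, 1] — can be sketched as follows. The pinhole intrinsics (fx, fy, cx, cy) used in the example are illustrative assumptions, not values from the patent.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud using the
    calibrated pinhole intrinsics, as in step S20."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

def normalize_cloud(cloud):
    """Scale the depth channel into [-1, 1] to suppress the influence
    of geometric variation, as described above."""
    z = cloud[..., 2]
    zmin, zmax = z.min(), z.max()
    out = cloud.copy()
    out[..., 2] = 2 * (z - zmin) / (zmax - zmin) - 1
    return out

depth = np.linspace(0.4, 0.6, 16).reshape(4, 4)   # depths in metres
cloud = normalize_cloud(depth_to_pointcloud(depth, fx=570.0, fy=570.0,
                                            cx=2.0, cy=2.0))
print(cloud[..., 2].min(), cloud[..., 2].max())   # -1.0 1.0
```

A resize to the preset 128 × 128 input would follow the normalization before the cloud is handed to the network.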
After step S20, step S30 can be performed: hand pose estimation is carried out on the three-dimensional point cloud image of the hand using deep learning.
In an embodiment of the present invention, referring to Fig. 3, performing hand pose estimation on the three-dimensional point cloud image of the hand using deep learning can include steps S31, S32, S33, and S34.
In step S31, a hand training data set is produced and the hand joint annotation information of the data set is obtained.
In an embodiment of the present invention, referring to Fig. 4, producing the hand training data set and obtaining its hand joint annotation information can include steps S311, S312, S313, S314, S315, S316, S317, and S318.
In step S311, a color image and its corresponding depth image are captured, the color image containing a hand wearing a colored wristband.
In an embodiment of the present invention, the hand training data set can again be collected with an Astra camera: the color image (for example, an RGB image) and its corresponding depth image are captured, and the two can be aligned using the SDK supplied with the Astra camera.
In step S312, the hand region is segmented out based on the position of the colored wristband in the color image.
The proportion of the acquired depth data occupied by the hand is very small, and the hand and the arm are hard to distinguish because their depth values are similar; both the background and the arm therefore interfere with hand pose estimation. To reduce these interfering factors, an embodiment of the present invention introduces the wrist position by having the subject wear a colored wristband. With the wrist position thus known, the hand can be segmented by analysing the position of the colored wristband in the color image, cleanly separating the wrist from the hand. Irrelevant content outside the hand region is excluded and only the hand region is retained, i.e., the hand region is segmented from the hand depth data.
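Locating the colored wristband in step S312 can be sketched as simple per-channel colour thresholding. The red target colour and tolerance below are assumptions for illustration; a real system would use a calibrated colour model.

```python
import numpy as np

def wristband_mask(rgb, target=(255, 0, 0), tol=60):
    """Find pixels close to the wristband colour. The top row of the
    resulting mask estimates the wrist line; in image coordinates the
    hand occupies the rows above it."""
    diff = np.abs(rgb.astype(int) - np.array(target)).sum(axis=-1)
    mask = diff < tol
    rows = np.nonzero(mask.any(axis=1))[0]
    wrist_row = int(rows.min()) if rows.size else None
    return mask, wrist_row

img = np.zeros((10, 10, 3), dtype=np.uint8)  # black background
img[7:9, 2:8] = (250, 10, 5)                 # red wristband pixels
mask, wrist_row = wristband_mask(img)
print(wrist_row, int(mask.sum()))            # 7 12
```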
In step S313, the three-dimensional point cloud data of the hand region in the depth image is extracted.
The three-dimensional point cloud of the hand region is extracted from the depth image corresponding to the color image; a concrete implementation can follow step S20 above and is not repeated here.
In step S314, a loss function L fitting an initialized three-dimensional hand model to the three-dimensional point cloud data of the hand region is constructed, and the loss function L is iteratively optimized.
From the acquired 3D point cloud of the hand region and the three-dimensional hand model, the loss function L(F_3Dcloud, F_model) fitting the two is constructed, where F_3Dcloud is the 3D point cloud data and F_model is the fitted three-dimensional hand model. The loss function L(F_3Dcloud, F_model) can be iteratively optimized, for example by the Gauss–Newton method, until it satisfies the following convergence condition:
L(F_3Dcloud, F_model) < λ_threshold (1)
where λ_threshold is a preset threshold.
In step S315, it is judged whether the loss function L has converged; when the loss function L converges, the process jumps to step S317, and otherwise proceeds to step S316.
When the above convergence condition is met, the three-dimensional hand model is considered successfully fitted, the iteration terminates, and fitting proceeds between the three-dimensional point cloud data of the next frame's hand region and the three-dimensional hand model.
In step S316, the three-dimensional hand model is reinitialized.
If the fitting of the three-dimensional hand model fails, i.e., formula (1) does not converge, the three-dimensional hand model must be reinitialized, after which the fitting between the three-dimensional point cloud data of the hand region and the three-dimensional hand model is performed again.
In step S317, the successfully fitted three-dimensional hand model is obtained.
In step S318, the hand joint annotation information of the hand training data set is obtained from the successfully fitted three-dimensional hand model.
If the three-dimensional hand model is fitted successfully, i.e., the above formula converges, the fitted three-dimensional hand model outputs the hand joint data of the hand training data set; this data is annotated, yielding the hand joint annotation information of the hand training data set.
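The fit/converge/reinitialize logic of steps S314–S318 can be sketched as a loop that declares success when the loss falls below the threshold of condition (1) and reports failure otherwise. The "model" here is only a 3-vector matched to a point-cloud centroid, and plain gradient descent stands in for the Gauss–Newton iteration; both are simplifying assumptions.

```python
import numpy as np

def fit_hand_model(cloud_centroid, init_model, threshold=1e-3,
                   max_iters=100, lr=0.5):
    """Toy sketch of steps S314-S318: iterate until the fitting loss L
    drops below the preset threshold (success) or the iteration budget
    runs out (failure, so the caller should re-initialise the model)."""
    m = np.asarray(init_model, dtype=float)
    target = np.asarray(cloud_centroid, dtype=float)
    for _ in range(max_iters):
        residual = m - target
        loss = float(residual @ residual)   # L(F_3Dcloud, F_model)
        if loss < threshold:                # condition (1): L < lambda
            return m, True
        m = m - lr * 2 * residual           # gradient step
    return m, False

model, ok = fit_hand_model([0.1, 0.2, 0.3], [0.0, 0.0, 0.0])
print(ok)  # True
```

On success, the fitted model's joint positions become the annotations for that frame; on failure, the model is reinitialized and the fit retried, exactly as in steps S316–S318.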
In step S32, the three-dimensional point cloud region of the hand is extracted from the hand training data set.
In step S33, a hand pose model is trained with a convolutional neural network model from the three-dimensional point cloud region of the hand and the hand joint annotation information.
It should be noted that although the embodiment of the present invention only exemplifies the convolutional neural network among deep learning techniques for hand pose estimation, the disclosed method can in fact be applied with any deep learning technique; the disclosure is not limited in this respect. Convolutional neural networks are widely used because they avoid complicated image pre-processing and can take the raw image directly as input.
The three-dimensional point cloud region of the hand is extracted from the hand training data set produced above, and the hand pose model is then trained with a convolutional neural network from the hand point clouds and hand joint annotation information in the training data set.
Using the CNN-trained hand pose model and the three-dimensional point cloud image of the hand obtained in step S20, the joint information of the hand pose is obtained through the forward propagation of the convolutional neural network, achieving the goal of hand pose estimation.
The network model of the convolutional neural network can be designed as follows.
In an embodiment of the present invention, the convolutional neural network model includes multiple convolutional layers, multiple pooling layers, multiple fully connected layers, and an activation layer after each convolutional layer and after each fully connected layer.
The embodiment uses multiple convolutional layers followed by fully connected layers for training; the purpose of stacking convolutions is that the features learned by a single convolutional layer are often local, and the higher the layer, the more global the learned features become.
As shown in Fig. 5, the convolutional neural network model includes three convolutional layers C1, C2, C3, each followed by a pooling layer, P1, P2, P3 respectively, and finally three fully connected layers FC1, FC2, FC3, of which the first two may each hold 1024-dimensional features.
In an embodiment of the present invention, the multiple convolutional layers include two convolutional layers with 5 × 5 kernels and one convolutional layer with a 3 × 3 kernel.
With small convolution kernels, the convolutional neural network can retain spatial information and remain robust to translation. The 5 × 5 convolutional layers extract features effectively while keeping the parameter count, and hence the computational cost, small and easy to implement; the final 3 × 3 convolutional layer strengthens the effective features extracted by the network, increasing the network's trainable capacity.
Each convolution is a feature-extraction mode that screens the image for qualifying parts (the larger the activation value, the better the part qualifies). In other embodiments, more convolution kernels can be added; for example, 32 kernels can learn 32 kinds of features. Each kernel generates another image from the input image: two kernels generate two images, which can be regarded as different channels of one image.
After features are obtained by convolution, they are used for classification. In principle, all the extracted features could be used to train a classifier. For example, for a 96 × 96 pixel image, suppose 400 features have been learned, each defined over an 8 × 8 input; convolving each feature with the image yields a (96 − 8 + 1) × (96 − 8 + 1) = 7921-dimensional convolution feature, and with 400 features each sample yields a convolution feature vector of 7921 × 400 = 3,168,400 dimensions. Learning a classifier with over three million input features is inconvenient and prone to overfitting.
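The dimensionality arithmetic above can be checked directly:

```python
# Reproducing the arithmetic: a 96x96 image convolved with 400 learned
# 8x8 features yields (96-8+1)^2 responses per feature, and the full
# per-sample feature vector has that many entries times 400 features.
img, k, n_features = 96, 8, 400
per_feature = (img - k + 1) ** 2
total = per_feature * n_features
print(per_feature, total)  # 7921 3168400
```

This confirms the figure of over three million features per sample that motivates pooling in the next passage.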
In an embodiment of the present invention, the multiple pooling layers include two pooling layers with stride 2 and one pooling layer with stride 1.
To address overfitting when describing large images, aggregate statistics are computed over features at different locations; for example, the average (or maximum) of a particular feature over a region of the image can be computed. These summary features not only have much lower dimension than using all the extracted features, they can also improve the results (less overfitting). This aggregation operation is called pooling — average pooling or max pooling, depending on how the pooled value is computed.
In an embodiment of the present invention, the activation layers use the ReLU function, whose form is:
f(x) = max(0, x) (2)
The activation layer after each convolutional layer in this embodiment uses the ReLU function. ReLU zeroes non-positive elements, works well at preserving active neurons, and thereby effectively avoids the exploding-gradient problem. A hidden layer with ReLU as its activation function is added after each convolutional layer; the activation removes neurons below 0, filtering out the effective features. After the convolutional neural network model to be learned is set up, the model parameters are trained by continually reducing the value of the loss function. Training the convolutional neural network model to form the corresponding hand pose model constructs the mapping from the input image to the hand joint positions; once this effective mapping is established, processing the corresponding image yields the position of every hand joint.
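The layer sizes implied by the architecture of Fig. 5 can be checked with a short sketch. The 128 × 128 input follows the normalization of step S20 and the kernel sizes (5, 5, 3) and pool strides (2, 2, 1) follow the text; "valid" padding and the 2 × 2 / 3 × 3 pooling windows are assumptions, since the patent does not state them.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Walk through C1-P1, C2-P2, C3-P3 and print each stage's feature-map side.
size = 128  # normalised point-cloud input (128 x 128)
sizes = []
for k, (pool_w, pool_s) in zip((5, 5, 3), ((2, 2), (2, 2), (3, 1))):
    size = conv_out(size, k)                    # convolution Ci
    size = conv_out(size, pool_w, stride=pool_s)  # pooling Pi
    sizes.append(size)
print(sizes)  # [62, 29, 25]
# FC1 and FC2 each hold 1024 features; FC3 emits the 26-DoF pose vector.
```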
By introducing convolutional layers and activation layers, the embodiment of the present invention combines the learning ability of the convolutional layers with the screening ability of the activation layers, greatly strengthening the obtained features and the learning capacity of the neural network: the mapping from input image to output image is learned accurately, so that hand pose prediction and estimation can be performed through the learned mapping.
Through the number of convolutional layers and the kernel sizes chosen in the convolutional neural network model of the embodiment, the capacity of the network is assured while problems such as gradient explosion, overfitting, and excessive computational complexity during training are avoided; by introducing pooling layers, the model is easy to train and has sufficient capacity to achieve a good denoising effect.
In step S34, each joint position of the hand is obtained from the three-dimensional point cloud image of the hand using the hand pose model.
During hand pose estimation, the input is the three-dimensional point cloud image of the hand prepared in step S20 and the output is the 3D joint information of the hand; the convolutional neural network, a deep learning technique, is used in the process.
In an embodiment of the present invention, obtaining each joint position of the hand from the three-dimensional point cloud image of the hand using the hand pose model can proceed as follows:
the three-dimensional point cloud image of the hand is input to the convolutional neural network model;
the last of the multiple fully connected layers outputs the hand pose parameters with a preset number of degrees of freedom to the hand pose model; and
the hand pose model outputs each joint position of the hand.
In an embodiment of the present invention, the hand pose model can output each joint position of the hand as follows:
design the global loss function G(Ψ) of the hand pose model;
iteratively optimize the global loss function using a preset method; and
when the global loss function reaches a preset convergence condition, output each joint position of the hand.
With continued reference to Fig. 5, the third fully connected layer FC3 can output the hand pose parameters Ψ with, for example, 26 degrees of freedom (the specific number of degrees of freedom can be preset and may also be 27, 30, or more or fewer). FC3 connects to the hand pose model HML, and the forward kinematics model F(Ψ) outputs each joint position of the hand, i.e., the 3D joint positions J = {j_i}, where j_i = (x_i, y_i, z_i) is the 3D position for a single depth map. The forward kinematics model satisfies the following chain structure:
The hand pose parameters are Ψ ∈ R^D, where D = 26 can be set as the number of degrees of freedom (DoF) of the hand joints: 3 degrees of freedom for the global hand (palm-centre) position and 3 for the global hand (palm-centre) orientation. The remaining degrees of freedom are the rotation angles of the individual joints (each finger has 4 degrees of freedom: the thumb has 3 at the finger-palm joint and 1 at the intermediate joint; each remaining finger has 2 at the finger-palm joint and two further joints with 1 degree of freedom each — 20 degrees of freedom in total). Each rotation angle θ_i is constrained by its lower and upper bounds θ_i^min and θ_i^max. Obtaining all the joints of the hand yields the positions of all the hand joint points.
In an embodiment of the present invention, the following global loss function G(Ψ) is designed for the hand pose model HML in deep learning.
The global loss function G(Ψ) includes a joint-position loss function G_joint(Ψ) and a joint-constraint loss function G_DoFconstraint(Ψ), where the global loss function satisfies:
G(Ψ) = G_joint(Ψ) + λ G_DoFconstraint(Ψ) (4)
where λ is the weight-adjustment factor of the global loss function G(Ψ) and Ψ ∈ R^D are the hand pose parameters.
In an embodiment of the present invention, the joint-position loss function G_joint(Ψ) is:
G_joint(Ψ) = ‖F(Ψ) − Y_gt‖² (5)
where F is the forward kinematics function, Ψ are the hand pose parameters with the preset number of degrees of freedom, θ_i is the rotation angle of the corresponding hand joint among those degrees of freedom, and Y_gt is the hand joint annotation information of the hand training data set, i.e., the annotated hand joint positions.
In an embodiment of the present invention, the joint-constraint loss function G_DoFconstraint(Ψ) penalizes rotation angles that violate their bounds, for example:
G_DoFconstraint(Ψ) = Σ_i [max(θ_i^min − θ_i, 0)² + max(θ_i − θ_i^max, 0)²]
where θ_i^min is the lower-bound constraint of the degree of freedom of joint i and θ_i^max is its upper-bound constraint.
In the iterative optimization of the above global loss function, standard stochastic gradient descent can be used for parameter optimization. It should be noted that the invention is not limited to standard stochastic gradient descent; any algorithm capable of parameter optimization can be used.
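Assuming a squared-hinge form for the constraint term, the iterative optimization of G(Ψ) = G_joint(Ψ) + λ G_DoFconstraint(Ψ) can be sketched with plain gradient descent on a two-angle toy problem. The identity map stands in for the forward kinematics F, and the weight λ = 10 and learning rate are arbitrary choices for the sketch.

```python
import numpy as np

LAM = 10.0  # weight-adjustment factor lambda (illustrative value)

def global_loss(psi, target, lo, hi):
    """Toy G(Psi): squared distance of the (identity) 'kinematics'
    output to the annotated joints, plus a squared-hinge penalty on
    angles outside [lo, hi]. The hinge form is an assumption."""
    g_joint = float(np.sum((psi - target) ** 2))
    g_con = float(np.sum(np.maximum(lo - psi, 0) ** 2 +
                         np.maximum(psi - hi, 0) ** 2))
    return g_joint + LAM * g_con

def descend(psi, target, lo, hi, lr=0.05, steps=500):
    """Full-batch gradient descent on the toy global loss."""
    psi = psi.astype(float).copy()
    for _ in range(steps):
        grad = 2 * (psi - target)
        grad += LAM * 2 * (np.maximum(psi - hi, 0) -
                           np.maximum(lo - psi, 0))
        psi -= lr * grad
    return psi

target = np.array([0.2, 1.5])   # second annotated angle violates hi = 1.0
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
psi = descend(np.zeros(2), target, lo, hi)
print(np.round(psi, 2))  # the constrained angle settles just above its limit
```

The first angle, which respects its bounds, converges to its annotation; the second is pulled back toward its upper bound by the constraint term, illustrating why adding joint constraints improves robustness.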
The embodiment of the present invention proposes a hand pose estimation method based on deep learning: hand region-of-interest detection is performed on a depth image and a hand image is segmented from the hand region of interest; a three-dimensional point cloud image of the hand is obtained from the hand image; and hand pose estimation is performed on the three-dimensional point cloud image of the hand using deep learning. In this scheme, the deep-learning-based hand pose estimation method eliminates the manual extraction of hand features and greatly improves the robustness of the estimation, thereby overcoming the defect of the prior art that traditional machine-learning methods (such as regression or random forests) require manually extracted hand features and therefore estimate hand pose poorly. At the same time, joint constraints are added to the hand pose estimation, considerably improving the robustness of the method. The present embodiment also provides a complete hand pose estimation pipeline, covering hand segmentation, the production of the hand training data set, and the hand pose estimation itself.
In addition, according to some embodiments, the hand region of interest can be extracted for the scene directly from the relative depth relationship between foreground and background in the depth image, and the hand image segmented by contour detection, further improving the effectiveness of hand detection.
Example devices
Having described the method of the exemplary embodiment of the present invention, the hand pose estimation device based on deep learning according to the exemplary embodiment of the present invention is described next with reference to Fig. 6.
Fig. 6 schematically shows the hand pose estimation device 10 based on deep learning according to an embodiment of the present invention. As shown in Fig. 6, the device 10 can include:
a hand image segmentation module 100, which can be used to perform hand region-of-interest detection on a depth image and segment a hand image from the hand region of interest;
a point cloud image acquisition module 110, which can be used to obtain a three-dimensional point cloud image containing the hand from the hand image; and
a hand pose estimation module 120, which can be used to perform hand pose estimation on the three-dimensional point cloud image of the hand using deep learning.
In an embodiment of the present invention, optionally, the hand image segmentation module 100 can include an image capture device and a hand region-of-interest extraction unit, wherein:
the image capture device is configured to obtain the depth image through a depth sensor therein; and
the hand region-of-interest extraction unit extracts the hand region of interest from the depth image using the relative depth relationship between foreground and background in the depth image.
In an embodiment of the present invention, optionally, the hand image segmentation module 100 can also include a hand region detection unit and a hand image segmentation unit, wherein:
the hand region detection unit is configured to perform edge detection and contour detection on the hand region of interest to detect the hand region; and
the hand image segmentation unit is configured to perform denoising on the hand region to segment out the hand image.
In an embodiment of the present invention, optionally, the hand pose estimation module 120 can include a training data set production submodule, a point cloud region extraction submodule, a hand pose model design submodule, and a hand pose estimation submodule, wherein:
the training data set production submodule is configured to produce the hand training data set and obtain its hand joint annotation information;
the point cloud region extraction submodule is configured to extract the three-dimensional point cloud region of the hand from the hand training data set;
the hand pose model design submodule is configured to train a hand pose model with a convolutional neural network model from the three-dimensional point cloud region of the hand and the hand joint annotation information; and
the hand pose estimation submodule is configured to obtain each joint position of the hand from the three-dimensional point cloud image of the hand using the hand pose model.
In an embodiment of the present invention, the convolutional neural network model can include multiple convolutional layers, multiple pooling layers, multiple fully connected layers, and an activation layer after each convolutional layer and after each fully connected layer.
In an embodiment of the present invention, optionally, the hand pose estimation submodule can include a point cloud image input unit, a pose parameter output unit, and a joint position output unit, wherein:
the point cloud image input unit is configured to input the three-dimensional point cloud image of the hand to the convolutional neural network model;
the pose parameter output unit is configured to have the last of the multiple fully connected layers output the hand pose parameters with the preset number of degrees of freedom to the hand pose model; and
the joint position output unit is configured to have the hand pose model output each joint position of the hand.
In an embodiment of the present invention, optionally, the joint position output unit includes a global loss function design subunit, an iterative optimization subunit, and a joint position output subunit, wherein:
the global loss function design subunit is configured to design the global loss function G(Ψ) of the hand pose model;
the iterative optimization subunit is configured to iteratively optimize the global loss function using a preset method; and
the joint position output subunit is configured to output each joint position of the hand when the global loss function reaches the preset convergence condition.
In an embodiment of the present invention, optionally, the global loss function G(Ψ) includes a joint-position loss function G_joint(Ψ) and a joint-constraint loss function G_DoFconstraint(Ψ), where the global loss function satisfies:
G(Ψ) = G_joint(Ψ) + λ G_DoFconstraint(Ψ)
where λ is the weight-adjustment factor of the global loss function G(Ψ).
In an embodiment of the present invention, optionally, the joint-position loss function G_joint(Ψ) is G_joint(Ψ) = ‖F(Ψ) − Y_gt‖², where F is the forward kinematics function, Ψ are the hand pose parameters with the preset number of degrees of freedom, θ_i is the rotation angle of the corresponding hand joint among those degrees of freedom, and Y_gt is the hand joint annotation information of the hand training data set.
In an embodiment of the present invention, optionally, the joint-constraint loss function G_DoFconstraint(Ψ) penalizes each rotation angle θ_i that violates its lower-bound constraint θ_i^min or upper-bound constraint θ_i^max.
The embodiment of the present invention proposes a hand pose estimation device based on deep learning: hand region-of-interest detection is performed on a depth image and a hand image is segmented from the hand region of interest; a three-dimensional point cloud image of the hand is obtained from the hand image; and hand pose estimation is performed on the three-dimensional point cloud image of the hand using deep learning. In this scheme, the deep-learning-based hand pose estimation eliminates the manual extraction of hand features and greatly improves the robustness of the estimation, overcoming the prior-art defect that manually extracted hand features lead to poor hand pose estimation.
In addition, according to some embodiments, the hand region of interest can be extracted for the scene directly from the relative depth relationship between foreground and background in the depth image, and the hand image segmented by contour detection, further improving the effectiveness of hand detection.
Example devices
Having described the method and device of the exemplary embodiments of the present invention, a hand pose estimation device based on deep learning according to another exemplary embodiment of the present invention is introduced next.
Those skilled in the art will understand that various aspects of the present invention can be implemented as a system, a method, or a program product. Accordingly, various aspects of the present invention can take the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
In some possible embodiments, the hand pose estimation device based on deep learning according to the present invention can include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the method according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the processing unit can perform step S10 shown in Fig. 1: performing hand region-of-interest detection on a depth image and segmenting a hand image from the hand region of interest; step S20: obtaining a three-dimensional point cloud image of the hand from the hand image; and step S30: performing hand pose estimation on the three-dimensional point cloud image of the hand using deep learning.
The hand pose estimation device 50 based on deep learning according to this embodiment of the present invention is described below with reference to Fig. 7. The device 50 shown in Fig. 7 is only an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present invention.
As shown in Fig. 7, the hand pose estimation device 50 based on deep learning takes the form of a general-purpose computing device. Its components can include, but are not limited to: the at least one processing unit 500 described above, the at least one storage unit 510 described above, and a bus 530 connecting the different system components (including the storage unit 510 and the processing unit 500).
The bus 530 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The storage unit 510 can include a computer-readable storage medium in the form of volatile memory, such as random access memory (RAM) 512 and/or cache memory 514, and can further include read-only memory (ROM) 516.
The storage unit 510 can also include a program/utility 518 having a set of (at least one) program modules 5182, which include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The hand pose estimation device 50 based on deep learning can also communicate with one or more external devices 560 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the device 50, and/or with any device (e.g., a router, a modem, etc.) that enables the device 50 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 520. Moreover, the device 50 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 550. As shown, the network adapter 550 communicates with the other modules of the device 50 through the bus 530. It should be understood that, although not shown in the drawings, other hardware and/or software modules can be used in combination with the device 50, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Exemplary program product
In some possible embodiments, the various aspects of the present invention may also be implemented as a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the deep-learning-based hand posture estimation method according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the terminal device may perform step S10 shown in Fig. 1: performing hand region-of-interest detection on a depth image, and segmenting a hand image from the hand region of interest; step S20: obtaining a three-dimensional point cloud image of the hand from the hand image; and step S30: performing hand posture estimation on the three-dimensional point cloud image of the hand using deep learning.
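Steps S10 and S20 above can be sketched end to end as follows. This is only an illustrative reading of the pipeline: the depth thresholds, the pinhole intrinsics (fx, fy, cx, cy), and the function names are all assumptions introduced for the example, not values taken from the patent.

```python
import numpy as np

def segment_hand(depth, near=200, far=600):
    """S10 (simplified): keep pixels whose depth lies in a near band,
    assuming the hand is the object closest to the depth sensor."""
    return np.where((depth > near) & (depth < far), depth, 0)

def depth_to_point_cloud(hand, fx=525.0, fy=525.0, cx=160.0, cy=120.0):
    """S20: back-project every valid depth pixel through assumed
    pinhole camera intrinsics into a 3-D point cloud."""
    v, u = np.nonzero(hand)
    z = hand[v, u].astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # (N, 3) point cloud

# S30 would feed the normalized cloud to the trained network; omitted here.
depth = np.zeros((240, 320), dtype=np.uint16)
depth[100:140, 150:190] = 400            # synthetic 40x40 "hand" at depth 400
cloud = depth_to_point_cloud(segment_hand(depth))
print(cloud.shape)                       # (1600, 3)
```

The synthetic blob stands in for a real depth frame; with real data, S10 would rely on the foreground/background depth relationship described in claim 2.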
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in Fig. 8, a program product 60 for deep-learning-based hand posture estimation according to an embodiment of the present invention is described. It may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. The program product of the present invention, however, is not limited thereto: in this document, a readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted over any appropriate medium, including, but not limited to, wireless links, wireline, optical cable, RF, and the like, or any suitable combination of the foregoing.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that although several modules or units of the deep-learning-based hand posture estimation apparatus are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more devices described above may be embodied in a single device; conversely, the features and functions of one device described above may be further divided among, and embodied by, multiple devices.
In addition, although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into a single step, and/or one step may be decomposed into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the disclosed embodiments. Nor does the division of the description into aspects mean that features in those aspects cannot be combined to advantage; that division is made merely for convenience of presentation. The present invention is intended to cover various modifications and equivalent arrangements falling within the spirit and scope of the appended claims.
Claims (10)
1. A deep-learning-based hand posture estimation method, comprising:
performing hand region-of-interest detection on a depth image, and segmenting a hand image from the hand region of interest;
obtaining a three-dimensional point cloud image containing the hand from the hand image; and
performing hand posture estimation on the three-dimensional point cloud image of the hand using deep learning.
2. The method of claim 1, wherein performing hand region-of-interest detection on the depth image comprises:
obtaining the depth image using a depth sensor in an image acquisition device; and
extracting the hand region of interest from the depth image using the relative depth relationship between foreground and background in the depth image.
3. The method of claim 1, wherein segmenting the hand image from the hand region of interest comprises:
performing edge detection and contour detection on the hand region of interest to detect a hand region; and
denoising the hand region to segment the hand image.
4. The method of claim 3, wherein the contour detection performed on the hand region of interest uses an image concave-convex point detection method.
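The concave-convex point idea of claim 4 can be illustrated with a convex hull: convex points of the hand contour (fingertip candidates) are hull vertices, while concave points (the valleys between fingers) fall inside the hull. The monotone-chain hull below is only a stand-in for a full convexity-defect analysis, and the contour coordinates are invented for illustration.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; points is an (N, 2) integer array.
    Returns hull vertices in counterclockwise order."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                          # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):                # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# A toy contour with one concave "valley" point at (2, 1).
contour = np.array([[0, 0], [4, 0], [4, 4], [2, 1], [0, 4]])
hull = convex_hull(contour)
print(len(hull))    # the concave point is excluded from the hull
```

In practice a full implementation would also measure the depth of each concavity (e.g. OpenCV's convexity defects) to tell finger valleys from contour noise.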
5. The method of claim 2, wherein obtaining the three-dimensional point cloud image containing the hand from the hand image comprises:
calibrating the internal parameters of the image acquisition device;
preliminarily obtaining the three-dimensional point cloud image containing the hand according to the internal parameters; and
performing size normalization on the preliminarily obtained three-dimensional point cloud image of the hand to obtain a processed three-dimensional point cloud image containing the hand.
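The size-normalization step of claim 5 can be sketched as centering the cloud at its centroid and scaling it into a fixed-size cube, so that hands imaged at different distances produce comparable network inputs. The cube half-width and the sample coordinates are assumptions; the claim does not fix them.

```python
import numpy as np

def normalize_cloud(cloud, cube_half=1.0):
    """Center a point cloud at its centroid and scale it so the largest
    coordinate magnitude equals cube_half (assumed normalization scheme)."""
    centred = cloud - cloud.mean(axis=0)
    extent = np.abs(centred).max()
    return centred / extent * cube_half if extent > 0 else centred

cloud = np.array([[100., 200., 400.],
                  [140., 220., 420.],
                  [120., 210., 410.]])   # illustrative hand points (mm)
norm = normalize_cloud(cloud)
print(np.abs(norm).max())                # 1.0
```

Any fixed cube size works, as long as the same normalization is applied at training and at inference time.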
6. The method of claim 1, wherein the step of performing hand posture estimation on the three-dimensional point cloud image of the hand using deep learning comprises:
producing a hand training data set, and obtaining hand node annotation information for the hand training data set;
extracting three-dimensional point cloud regions of the hand from the hand training data set;
training a convolutional neural network model on the three-dimensional point cloud regions of the hand and the hand node annotation information to form a hand posture model; and
obtaining the position of each joint node of the hand from the three-dimensional point cloud image of the hand using the hand posture model.
7. The method of claim 6, wherein producing the hand training data set and obtaining the hand node annotation information for the hand training data set comprises:
extracting three-dimensional point cloud data of the hand region in the depth image;
constructing a loss function L that fits an initialized three-dimensional hand model to the three-dimensional point cloud data of the hand region, and iteratively optimizing the loss function L;
when the iterative optimization of the loss function L satisfies a preset convergence condition, obtaining a successfully fitted three-dimensional hand model; and
obtaining the hand node annotation information for the hand training data set from the successfully fitted three-dimensional hand model.
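The optimization loop of claim 7 can be mirrored with a toy model whose only free parameter is a 3-D offset: gradient descent on a squared point-distance loss L, stopped by a preset convergence condition. The real method fits a full articulated hand model; only the shape of the iterate-until-converged loop is shown here, and every name and value is illustrative.

```python
import numpy as np

def fit_offset(cloud, target, lr=0.1, tol=1e-6, max_iter=500):
    """Iteratively minimize L = mean squared distance between the 'model'
    (cloud + offset) and the observed target points."""
    offset = np.zeros(3)
    prev = np.inf
    loss = prev
    for _ in range(max_iter):
        residual = (cloud + offset) - target        # model minus data
        loss = np.mean(residual ** 2)               # loss function L
        if prev - loss < tol:                       # preset convergence condition
            break
        prev = loss
        offset -= lr * 2 * residual.mean(axis=0)    # gradient step on L
    return offset, loss

cloud = np.array([[0., 0., 0.], [1., 1., 1.]])
target = cloud + np.array([0.5, -0.2, 0.3])         # ground-truth displacement
offset, loss = fit_offset(cloud, target)
print(np.round(offset, 2))
```

With an articulated model, the free parameters would be the joint angles, and the successfully fitted model's joint positions would become the node annotations described in the claim.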
8. The method of claim 7, wherein producing the hand training data set and obtaining the hand node annotation information for the hand training data set further comprises:
acquiring a color image and its corresponding depth image, wherein the color image contains a hand wearing a colored wristband; and
segmenting the hand region according to the position of the colored wristband in the color image.
9. The method of claim 6, wherein the convolutional neural network model comprises multiple convolutional layers, multiple pooling layers, multiple fully connected layers, and an activation layer after each convolutional layer and after each fully connected layer.
10. A deep-learning-based hand posture estimation apparatus, comprising:
a hand image segmentation module, configured to perform hand region-of-interest detection on a depth image and to segment a hand image from the hand region of interest;
a point cloud image acquisition module, configured to obtain a three-dimensional point cloud image containing the hand from the hand image; and
a hand posture estimation module, configured to perform hand posture estimation on the three-dimensional point cloud image of the hand using deep learning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710061286.8A CN107066935B (en) | 2017-01-25 | 2017-01-25 | Hand posture estimation method and device based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710061286.8A CN107066935B (en) | 2017-01-25 | 2017-01-25 | Hand posture estimation method and device based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107066935A true CN107066935A (en) | 2017-08-18 |
CN107066935B CN107066935B (en) | 2020-11-24 |
Family
ID=59598426
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710061286.8A Active CN107066935B (en) | 2017-01-25 | 2017-01-25 | Hand posture estimation method and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107066935B (en) |
- 2017-01-25 CN CN201710061286.8A patent/CN107066935B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102135417A (en) * | 2010-12-26 | 2011-07-27 | 北京航空航天大学 | Full-automatic three-dimension characteristic extracting method |
CN105069423A (en) * | 2015-07-29 | 2015-11-18 | 北京格灵深瞳信息技术有限公司 | Human body posture detection method and device |
Cited By (54)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644423A (en) * | 2017-09-29 | 2018-01-30 | 北京奇虎科技有限公司 | Video data real-time processing method, device and computing device based on scene cut |
WO2019061466A1 (en) * | 2017-09-30 | 2019-04-04 | 深圳市大疆创新科技有限公司 | Flight control method, remote control device, and remote control system |
CN107977604A (en) * | 2017-11-06 | 2018-05-01 | 浙江工业大学 | A kind of hand detection method based on improvement converging channels feature |
CN107977604B (en) * | 2017-11-06 | 2021-01-05 | 浙江工业大学 | Hand detection method based on improved aggregation channel characteristics |
CN108196535A (en) * | 2017-12-12 | 2018-06-22 | 清华大学苏州汽车研究院(吴江) | Automated driving system based on enhancing study and Multi-sensor Fusion |
CN109934065A (en) * | 2017-12-18 | 2019-06-25 | 虹软科技股份有限公司 | A kind of method and apparatus for gesture identification |
CN109934065B (en) * | 2017-12-18 | 2021-11-09 | 虹软科技股份有限公司 | Method and device for gesture recognition |
CN108491752A (en) * | 2018-01-16 | 2018-09-04 | 北京航空航天大学 | A kind of hand gestures method of estimation based on hand Segmentation convolutional network |
CN110060296A (en) * | 2018-01-18 | 2019-07-26 | 北京三星通信技术研究有限公司 | Estimate method, electronic equipment and the method and apparatus for showing virtual objects of posture |
CN108460338A (en) * | 2018-02-02 | 2018-08-28 | 北京市商汤科技开发有限公司 | Estimation method of human posture and device, electronic equipment, storage medium, program |
CN108460338B (en) * | 2018-02-02 | 2020-12-11 | 北京市商汤科技开发有限公司 | Human body posture estimation method and apparatus, electronic device, storage medium, and program |
CN108345869A (en) * | 2018-03-09 | 2018-07-31 | 南京理工大学 | Driver's gesture recognition method based on depth image and virtual data |
CN108594997A (en) * | 2018-04-16 | 2018-09-28 | 腾讯科技(深圳)有限公司 | Gesture framework construction method, apparatus, equipment and storage medium |
CN108549489A (en) * | 2018-04-27 | 2018-09-18 | 哈尔滨拓博科技有限公司 | A kind of gestural control method and system based on hand form, posture, position and motion feature |
CN108830150A (en) * | 2018-05-07 | 2018-11-16 | 山东师范大学 | One kind being based on 3 D human body Attitude estimation method and device |
CN108830150B (en) * | 2018-05-07 | 2019-05-28 | 山东师范大学 | One kind being based on 3 D human body Attitude estimation method and device |
CN109002837A (en) * | 2018-06-21 | 2018-12-14 | 网易(杭州)网络有限公司 | A kind of image application processing method, medium, device and calculate equipment |
CN109446952A (en) * | 2018-10-16 | 2019-03-08 | 赵笑婷 | A kind of piano measure of supervision, device, computer equipment and storage medium |
CN111222379A (en) * | 2018-11-27 | 2020-06-02 | 株式会社日立制作所 | Hand detection method and device |
CN109635767A (en) * | 2018-12-20 | 2019-04-16 | 北京字节跳动网络技术有限公司 | A kind of training method, device, equipment and the storage medium of palm normal module |
CN111382637A (en) * | 2018-12-29 | 2020-07-07 | 深圳市优必选科技有限公司 | Pedestrian detection tracking method, device, terminal equipment and medium |
CN111382637B (en) * | 2018-12-29 | 2023-08-08 | 深圳市优必选科技有限公司 | Pedestrian detection tracking method, device, terminal equipment and medium |
CN111460858A (en) * | 2019-01-21 | 2020-07-28 | 杭州易现先进科技有限公司 | Method and device for determining pointed point in image, storage medium and electronic equipment |
CN111460858B (en) * | 2019-01-21 | 2024-04-12 | 杭州易现先进科技有限公司 | Method and device for determining finger tip point in image, storage medium and electronic equipment |
CN109919046B (en) * | 2019-02-19 | 2020-10-13 | 清华大学 | Three-dimensional point cloud feature learning method and device based on relational features |
CN109919046A (en) * | 2019-02-19 | 2019-06-21 | 清华大学 | A kind of three-dimensional point cloud feature learning method and apparatus based on relationship characteristic |
CN110135340A (en) * | 2019-05-15 | 2019-08-16 | 中国科学技术大学 | 3D hand gestures estimation method based on cloud |
CN110210426A (en) * | 2019-06-05 | 2019-09-06 | 中国人民解放军国防科技大学 | Method for estimating hand posture from single color image based on attention mechanism |
CN110210426B (en) * | 2019-06-05 | 2021-06-08 | 中国人民解放军国防科技大学 | Method for estimating hand posture from single color image based on attention mechanism |
CN110348359A (en) * | 2019-07-04 | 2019-10-18 | 北京航空航天大学 | The method, apparatus and system of hand gestures tracking |
CN110348359B (en) * | 2019-07-04 | 2022-01-04 | 北京航空航天大学 | Hand gesture tracking method, device and system |
CN110413111A (en) * | 2019-07-09 | 2019-11-05 | 南京大学 | A kind of target keyboard tracing system and method |
CN110413111B (en) * | 2019-07-09 | 2021-06-01 | 南京大学 | Target keyboard tracking system and method |
CN112288798A (en) * | 2019-07-24 | 2021-01-29 | 鲁班嫡系机器人(深圳)有限公司 | Posture recognition and training method, device and system |
CN110991237B (en) * | 2019-10-30 | 2023-07-28 | 华东师范大学 | Virtual hand natural gripping action generation method based on gripping taxonomy |
CN110991237A (en) * | 2019-10-30 | 2020-04-10 | 华东师范大学 | Grasping taxonomy-based virtual hand natural grasping action generation method |
WO2021098576A1 (en) * | 2019-11-20 | 2021-05-27 | Oppo广东移动通信有限公司 | Hand posture estimation method and apparatus, and computer storage medium |
WO2021098666A1 (en) * | 2019-11-20 | 2021-05-27 | Oppo广东移动通信有限公司 | Hand gesture detection method and device, and computer storage medium |
WO2021098441A1 (en) * | 2019-11-20 | 2021-05-27 | Oppo广东移动通信有限公司 | Hand posture estimation method and apparatus, device and computer storage medium |
CN112883757A (en) * | 2019-11-29 | 2021-06-01 | 北京航空航天大学 | Method for generating tracking attitude result |
CN111178142A (en) * | 2019-12-05 | 2020-05-19 | 浙江大学 | Hand posture estimation method based on space-time context learning |
CN111046948A (en) * | 2019-12-10 | 2020-04-21 | 浙江大学 | Point cloud simulation and deep learning workpiece pose identification and robot feeding method |
CN111046948B (en) * | 2019-12-10 | 2022-04-22 | 浙江大学 | Point cloud simulation and deep learning workpiece pose identification and robot feeding method |
CN111222486A (en) * | 2020-01-15 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Training method, device and equipment for hand gesture recognition model and storage medium |
CN111222486B (en) * | 2020-01-15 | 2022-11-04 | 腾讯科技(深圳)有限公司 | Training method, device and equipment for hand gesture recognition model and storage medium |
CN111597976A (en) * | 2020-05-14 | 2020-08-28 | 杭州相芯科技有限公司 | Multi-person three-dimensional attitude estimation method based on RGBD camera |
WO2021253777A1 (en) * | 2020-06-19 | 2021-12-23 | 北京市商汤科技开发有限公司 | Attitude detection and video processing methods and apparatuses, electronic device, and storage medium |
JP2022541709A (en) * | 2020-06-19 | 2022-09-27 | ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド | Attitude detection and video processing method, device, electronic device and storage medium |
CN111666917A (en) * | 2020-06-19 | 2020-09-15 | 北京市商汤科技开发有限公司 | Attitude detection and video processing method and device, electronic equipment and storage medium |
WO2022116423A1 (en) * | 2020-12-01 | 2022-06-09 | 平安科技(深圳)有限公司 | Object posture estimation method and apparatus, and electronic device and computer storage medium |
CN112446919A (en) * | 2020-12-01 | 2021-03-05 | 平安科技(深圳)有限公司 | Object pose estimation method and device, electronic equipment and computer storage medium |
CN112446919B (en) * | 2020-12-01 | 2024-05-28 | 平安科技(深圳)有限公司 | Object pose estimation method and device, electronic equipment and computer storage medium |
CN112975957A (en) * | 2021-02-07 | 2021-06-18 | 深圳市广宁股份有限公司 | Target extraction method, system, robot and storage medium |
CN112597980A (en) * | 2021-03-04 | 2021-04-02 | 之江实验室 | Brain-like gesture sequence recognition method for dynamic vision sensor |
Also Published As
Publication number | Publication date |
---|---|
CN107066935B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107066935A (en) | Hand gestures method of estimation and device based on deep learning | |
CN108345890A (en) | Image processing method, device and relevant device | |
CN104834922B (en) | Gesture identification method based on hybrid neural networks | |
CN104598915B (en) | A kind of gesture identification method and device | |
CN109902798A (en) | The training method and device of deep neural network | |
US20180082178A1 (en) | Information processing device | |
CN110163813A (en) | A kind of image rain removing method, device, readable storage medium storing program for executing and terminal device | |
CN107423398A (en) | Exchange method, device, storage medium and computer equipment | |
CN106997236A (en) | Based on the multi-modal method and apparatus for inputting and interacting | |
CN104615983A (en) | Behavior identification method based on recurrent neural network and human skeleton movement sequences | |
CN109492596B (en) | Pedestrian detection method and system based on K-means clustering and regional recommendation network | |
CN110047081A (en) | Example dividing method, device, equipment and the medium of chest x-ray image | |
CN111178170B (en) | Gesture recognition method and electronic equipment | |
CN108898669A (en) | Data processing method, device, medium and calculating equipment | |
CN107862664A (en) | A kind of image non-photorealistic rendering method and system | |
CN108073851A (en) | A kind of method, apparatus and electronic equipment for capturing gesture identification | |
CN109815854A (en) | It is a kind of for the method and apparatus of the related information of icon to be presented on a user device | |
CN104239119B (en) | A kind of method and system that training on electric power emulation is realized based on kinect | |
CN112699857A (en) | Living body verification method and device based on human face posture and electronic equipment | |
CN116385660A (en) | Indoor single view scene semantic reconstruction method and system | |
CN115018999A (en) | Multi-robot-cooperation dense point cloud map construction method and device | |
CN110163095A (en) | Winding detection method, winding detection device and terminal device | |
CN113034592B (en) | Three-dimensional scene target detection modeling and detection method based on natural language description | |
CN114373050A (en) | Chemistry experiment teaching system and method based on HoloLens | |
CN114140841A (en) | Point cloud data processing method, neural network training method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20210514 Address after: 311200 Room 102, 6 Blocks, C District, Qianjiang Century Park, Xiaoshan District, Hangzhou City, Zhejiang Province Patentee after: Hangzhou Yixian Advanced Technology Co.,Ltd. Address before: 310052 Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province, 4, 7 stories Patentee before: NETEASE (HANGZHOU) NETWORK Co.,Ltd. |
|
TR01 | Transfer of patent right |