CN110852311A - Three-dimensional human hand key point positioning method and device - Google Patents

Three-dimensional human hand key point positioning method and device

Info

Publication number
CN110852311A
CN110852311A
Authority
CN
China
Prior art keywords
palm
depth image
neural network
normalized
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010034582.0A
Other languages
Chinese (zh)
Inventor
陈俊逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Small Cobalt Technology Co Ltd
Original Assignee
Changsha Small Cobalt Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Small Cobalt Technology Co Ltd filed Critical Changsha Small Cobalt Technology Co Ltd
Priority to CN202010034582.0A
Publication of CN110852311A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a method and a device for positioning key points of a three-dimensional human hand. The method comprises the following steps: acquiring a depth image of an actual scene; performing palm region segmentation on the depth image through a first neural network to obtain a segmented palm region; performing normalization processing and size transformation on the palm region to obtain a depth map of the normalized palm region; judging, through a second neural network, whether the actual scene corresponding to the depth map of the normalized palm region contains a real palm; and if so, predicting the key point coordinates of the depth map of the normalized palm region through a third neural network and, from the predicted coordinates, determining the key point coordinates of the palm in the depth image of the actual scene. The practicability and reliability of three-dimensional positioning of human hand key points can thereby be improved.

Description

Three-dimensional human hand key point positioning method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a three-dimensional human hand key point positioning method, a three-dimensional human hand key point positioning device, terminal equipment and a computer readable medium.
Background
Biometric recognition technologies such as palm print and palm vein recognition, as well as gesture recognition technologies, all require palm detection and key point positioning in images. In palm print recognition, quickly and accurately locating the palm region is critical and directly affects recognition performance. In gesture recognition, if the coordinate positions of the joint points of the fingers and palm can be obtained, the gesture can be judged from the relative positions of the fingers and palm. Palm region positioning and palm key point positioning are therefore very important. Existing palm key point detection is mainly based on two-dimensional RGB images or near-infrared images, and falls into three main categories. The first segments the palm from the background using the palm's color information and infers the key point positions from the palm contour. The second applies a contour extraction algorithm directly to the palm in the image to obtain contour information for the fingers, palm, wrist and other parts, and then infers the key points from that contour information. The third uses deep learning: a deep neural network for object detection is applied to the image to directly obtain a rectangular frame containing the palm, finger knuckle line segments are then located, and the joint point positions of the fingers and palm, i.e. the key point positions, are obtained.
The patent most similar to the present invention is CN108427942A, which comprises the following steps: S1, collecting training samples; S2, constructing a network model, namely a CNN (convolutional neural network) feature extraction network, an RPN (region proposal network) candidate region extraction network and a discrimination network; S3, training the network model by initializing the three networks; S4, constructing a detection model; S5, performing palm detection and key point positioning. That patent uses Faster R-CNN, the highest-performing object detection framework at the time, for fast palm region positioning, and uses a key point positioning network model to perform palm contour detection and key point positioning on the palm image to be detected. It takes a near-infrared image as the input of the object detection framework and the key point positioning network model; compared with an RGB image this is less affected by lighting, but lighting still has an influence. It also cannot resist spoofing: a printed static two-dimensional palm picture is recognized directly, which makes subsequent biometric recognition and gesture recognition unreliable. In addition, that patent uses a two-dimensional near-infrared image for key point positioning, so it cannot locate the three-dimensional information of the key points, and when the palm deflects at an angle, gesture recognition on a two-dimensional picture is severely affected.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for positioning a three-dimensional human hand key point, a terminal device, and a computer readable medium, which can improve the practicability and reliability of three-dimensional positioning of a human hand key point.
The first aspect of the embodiments of the present invention provides a method for positioning a three-dimensional human hand key point, including:
acquiring a depth image of an actual scene;
carrying out palm region segmentation on the depth image through a first neural network to obtain a segmented palm region;
carrying out normalization processing and size transformation on the palm area to obtain a depth map of the normalized palm area;
judging whether the actual scene corresponding to the depth map of the normalized palm area contains the real palm or not through a second neural network;
if yes, predicting the key point coordinates of the depth map of the normalized palm area through a third neural network, and determining the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm area.
A second aspect of the embodiments of the present invention provides a three-dimensional human hand key point positioning device, including:
the acquisition module is used for acquiring a depth image of an actual scene;
the segmentation module is used for carrying out palm region segmentation on the depth image through a first neural network to obtain a segmented palm region;
the normalizing module is used for carrying out normalization processing and size conversion on the palm area to obtain a depth map of the normalized palm area;
the anti-counterfeiting module is used for judging whether the actual scene corresponding to the depth map of the normalized palm area contains the real palm or not through a second neural network;
and the positioning module is used for predicting the key point coordinates of the depth map of the normalized palm area through a third neural network when the anti-counterfeiting module detects that the actual scene corresponding to the depth map contains the real palm, and determining the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm area.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above three-dimensional human hand key point positioning method when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable medium storing a computer program which, when executed by a processor, implements the steps of the above three-dimensional human hand key point positioning method.
The three-dimensional human hand key point positioning method provided by the embodiment of the invention acquires a depth image of an actual scene, performs palm region segmentation on the depth image through a first neural network to obtain a segmented palm region, performs normalization processing and size transformation on the palm region to obtain a depth map of the normalized palm region, and judges through a second neural network whether the actual scene corresponding to the depth map contains a real palm. If so, the key point coordinates of the depth map of the normalized palm region are predicted by a third neural network, and the key point coordinates of the palm in the depth image of the actual scene are determined from the predicted coordinates. The practicability and reliability of three-dimensional positioning of human hand key points can thereby be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a three-dimensional human hand key point positioning method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a three-dimensional human hand key point positioning device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a refinement of the segmentation module in FIG. 2;
FIG. 4 is a schematic diagram of a detailed structure of the anti-counterfeiting module in FIG. 2;
fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a flowchart of a three-dimensional human hand key point positioning method according to an embodiment of the present invention. As shown in fig. 1, the three-dimensional human hand key point positioning method of the present embodiment includes the following steps:
s101: a depth image of an actual scene is acquired.
In the embodiment of the invention, in an actual scene, the depth image to be recognized can be acquired by a depth camera device. The value of each pixel in the depth image represents the distance from the depth camera to the object at that pixel.
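Purely as an illustration, one possible capture path is sketched below using an Intel RealSense depth camera; the device choice and all names are assumptions, since the embodiment only requires some depth camera.

```python
# Illustrative depth-image capture with a RealSense camera (assumed device;
# the embodiment only requires that each pixel store a camera-to-object distance).
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start()                      # default configuration streams depth frames
try:
    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    depth = np.asanyarray(depth_frame.get_data())   # uint16 array of distance values
finally:
    pipeline.stop()
```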
S102: and carrying out palm region segmentation on the depth image through a first neural network to obtain a segmented palm region.
In the embodiment of the invention, the depth image can first be size-transformed to obtain a depth image of a fixed size; the fixed-size depth image is input into a first neural network, which performs palm region segmentation on it; finally, the segmented fixed-size depth image is denoised to obtain the segmented palm region.
More specifically, the first neural network is a convolutional neural network comprising an encoder and a decoder. The encoder comprises 8 convolutional layers, each followed by an activation function layer; the activation function is preferably the linear rectification function (ReLU). Each of the first 6 convolutional layers is followed by a down-sampling layer with a 2 x 2 window; the last 2 convolutional layers are not. The convolution kernel size of all 8 layers is 3 x 3, and the numbers of output feature maps are 64, 64, 128, 128, 256, 256, 512 and 512 respectively.
The decoder is connected to the encoder and takes the encoder's output as input. It comprises 9 convolutional layers: the first 8 are each followed by a ReLU activation function, and the last is followed by a Sigmoid activation function. An up-sampling layer with a 2 x 2 window is connected after each of the first 6 convolution + activation layers of the decoder. The convolution kernel size of the first 8 decoder layers is 3 x 3, with output feature map counts of 256, 256, 128, 128, 64, 64, 32 and 32 respectively; the last convolutional layer has a 1 x 1 kernel and outputs 1 feature map. The output size of the decoder is consistent with the input size of the convolutional neural network and indicates, for each pixel, whether it belongs to the palm region: the point value is 1 if it does and 0 otherwise.
Further, the segmented palm region output by the first neural network often contains noise; for example, some non-palm connected regions may be wrongly judged as palm regions. The output therefore needs to be denoised so that it contains only one connected region, the palm region, with all other connected regions eliminated.
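A minimal PyTorch sketch of this encoder-decoder, under the layer ordering described above, follows; the training procedure, the input resolution and the exact placement of the sampling layers are not fully specified by the text and are assumptions here.

```python
# Sketch of the first (segmentation) network: 8 encoder conv layers
# (2x2 down-sampling after the first six) and 9 decoder conv layers
# (2x2 up-sampling after the first six), ending in a 1x1 conv + Sigmoid.
import torch.nn as nn

class PalmSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        enc, in_ch = [], 1                          # single-channel depth input (assumed)
        for i, out_ch in enumerate([64, 64, 128, 128, 256, 256, 512, 512]):
            enc += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            if i < 6:
                enc.append(nn.MaxPool2d(2))         # 2 x 2 down-sampling window
            in_ch = out_ch
        self.encoder = nn.Sequential(*enc)

        dec = []
        for i, out_ch in enumerate([256, 256, 128, 128, 64, 64, 32, 32]):
            dec += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            if i < 6:
                dec.append(nn.Upsample(scale_factor=2))  # 2 x 2 up-sampling window
            in_ch = out_ch
        dec += [nn.Conv2d(in_ch, 1, 1), nn.Sigmoid()]    # per-pixel palm probability
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):                           # x: (N, 1, H, W), H and W divisible by 64
        return self.decoder(self.encoder(x))
```

Thresholding the Sigmoid output at 0.5 would yield the 0/1 palm mask described above.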
The denoising can proceed as follows:
1) Select a first starting point in the segmented fixed-size depth image, extract the nearby pixel points connected to it, store them in a first list, and mark them as processed; the pixel points connected to the starting point are those whose segmentation value equals that of the starting point.
2) Among the unprocessed pixel points of the segmented fixed-size depth image, keep searching for points that are adjacent to points already in the first list and have the same segmentation value, and add them to the first list, until all pixel points connected to the first starting point have been found, marked as processed, and added to the first list.
3) Select a second starting point among the unprocessed pixel points, find and mark all pixel points connected to it, and add them to a second list.
4) Select further starting points among the remaining unprocessed pixel points, and find and mark all pixel points connected to each of them, until every point in the segmented fixed-size depth image is marked as processed. This yields a number of lists, each holding all pixel points of one connected region.
5) Among all the lists, find the one whose pixel points have value 1 and whose number of pixel points is largest; that list corresponds to the palm region.
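The flood-fill procedure above can be sketched as follows; the function and variable names are illustrative, and 4-connectivity is an assumption.

```python
# Keep only the largest connected region of segmentation value 1 (the palm),
# following steps 1)-5) above.
import numpy as np
from collections import deque

def keep_largest_palm_region(seg: np.ndarray) -> np.ndarray:
    h, w = seg.shape
    processed = np.zeros((h, w), dtype=bool)
    best = []                                    # pixels of the largest value-1 region found
    for sy in range(h):
        for sx in range(w):
            if processed[sy, sx]:
                continue
            value = seg[sy, sx]                  # segmentation value of this start point
            region, queue = [], deque([(sy, sx)])
            processed[sy, sx] = True
            while queue:                         # grow one connected region (one "list")
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not processed[ny, nx] \
                            and seg[ny, nx] == value:
                        processed[ny, nx] = True
                        queue.append((ny, nx))
            if value == 1 and len(region) > len(best):
                best = region
    cleaned = np.zeros_like(seg)
    for y, x in best:
        cleaned[y, x] = 1                        # every other connected region is eliminated
    return cleaned
```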
S103: and carrying out normalization processing and size transformation on the palm area to obtain a depth map of the normalized palm area.
In the embodiment of the present invention, the maximum and minimum values of the horizontal and vertical coordinates of all pixel points in the palm region, in the coordinate system of the fixed-size depth image of S102, are selected, giving (x_min, y_min) and (x_max, y_max). From these, the diagonal position coordinates of the rectangular frame containing the human hand are obtained: the position coordinate of the upper-left corner is (x_min, y_min) and that of the lower-right corner is (x_max, y_max). The depth image is then cropped by this rectangular frame to obtain a depth map of the palm region, the depth values of the palm region are normalized, and the palm region is size-transformed to obtain a normalized, fixed-size depth map of the palm region.
S104: and judging whether the actual scene corresponding to the depth map of the normalized palm region contains the real palm or not through a second neural network.
In the embodiment of the invention, the depth map of the normalized palm region is fed into a second neural network, which outputs a judgment value; this value determines whether the actual scene corresponding to the depth map contains a real palm. The second neural network comprises 5 convolutional layers and 3 fully-connected layers. All convolution kernels are 3 x 3, and the numbers of output feature maps are 32, 64, 128, 256 and 512 respectively. A down-sampling layer with a 3 x 3 window is connected after each of the first three convolutional layers, and a ReLU activation function follows every convolutional layer. The fully-connected layers have 4096, 1024 and 3 nodes respectively; the first two are each followed by a Dropout function, which randomly sets outputs to zero with a certain probability to prevent overfitting. The 3 nodes of the last fully-connected layer represent 3 categories: real person, picture attack and video attack. If the actual scene corresponding to the depth map of the normalized palm region contains a real palm, the process goes to S105; if it does not (for example, it contains a picture or a video), the process returns to S101 to re-acquire a depth image of the actual scene.
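A sketch of this classifier, under the stated layer counts, might look as follows; the input resolution and dropout probability are not specified, so nn.LazyLinear is used to infer the flattened size.

```python
# Sketch of the second (anti-counterfeiting) network: 5 conv layers with
# 3x3 kernels, 3x3 down-sampling after the first three, then FC layers
# of 4096, 1024 and 3 nodes with Dropout after the first two.
import torch.nn as nn

def make_antispoof_net(p_drop: float = 0.5) -> nn.Sequential:   # p_drop is assumed
    layers, in_ch = [], 1
    for i, out_ch in enumerate([32, 64, 128, 256, 512]):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        if i < 3:
            layers.append(nn.MaxPool2d(3))       # 3 x 3 down-sampling window
        in_ch = out_ch
    layers += [
        nn.Flatten(),
        nn.LazyLinear(4096), nn.Dropout(p_drop),
        nn.Linear(4096, 1024), nn.Dropout(p_drop),
        nn.Linear(1024, 3),                      # real person / picture attack / video attack
    ]
    return nn.Sequential(*layers)
```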
S105: and predicting the key point coordinates of the depth map of the normalized palm region through a third neural network, and determining the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm region.
In this embodiment of the present invention, the depth map of the normalized palm region is fed into a third neural network, which outputs predicted coordinate values of the key points of the normalized palm region. The number of predicted values is m x 3, where m is the number of palm key points and each value is an abscissa, an ordinate or a depth value; for example, with m key points in total, the first key point has coordinates (x1, y1, d1), the second (x2, y2, d2), ..., and the m-th (xm, ym, dm). The predicted coordinate values are then converted into coordinate values in the palm region before size transformation (the palm region of S102) by the conversion formula:
xi' = xi / s_w, yi' = yi / s_h, i = 1, ..., m (1)
where s_w and s_h respectively denote the width and height expansion factors of the size transformation, (xi, yi) respectively denote the abscissa and ordinate of a predicted key point coordinate, (xi', yi') respectively denote the abscissa and ordinate of that key point in the palm region before size transformation, and m denotes the number of key points. Finally, the minimum horizontal and vertical coordinates (x_min, y_min) of all pixel points of the pre-transformation palm region, in the coordinate system of the fixed-size depth image, are added to the horizontal and vertical coordinate values respectively, giving the coordinates of the palm key points in the fixed-size depth image of S102; combining these with the depth values of the key points in the fixed-size depth image yields the key point coordinates in three-dimensional space. Preferably, the third neural network is a convolutional neural network with the following structure: 8 convolutional layers and 3 fully-connected layers. Each convolution kernel is 3 x 3, and the numbers of output feature maps are 32, 32, 64, 64, 128, 128, 256 and 256 respectively. A ReLU activation function layer follows each convolutional layer, and a down-sampling layer with a 2 x 2 window follows each of the first 4 convolution + activation layers. The fully-connected layers have 4096, 2048 and m x 3 nodes respectively; the first two are each followed by a ReLU activation function and then a Dropout function, which randomly sets outputs to zero with a certain probability.
In the three-dimensional human hand positioning method of fig. 1, because depth information (a depth image) is used as input, the method is insensitive to illumination and can still locate the human hand under extreme lighting conditions, giving it high practicability. Moreover, the three-dimensional anti-counterfeiting algorithm provided by the embodiment of the invention can effectively prevent malicious picture and video attacks, improving the reliability of human hand positioning. In addition, the embodiment of the invention directly obtains the coordinates of the key points in three-dimensional space, so when the palm deflects, an angle transformation can be performed in three-dimensional space to obtain a front-facing palm, reducing the negative influence on subsequent gesture recognition algorithms.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a three-dimensional human hand key point positioning device according to an embodiment of the present invention. As shown in fig. 2, the three-dimensional human hand key point positioning device 2 of this embodiment includes an obtaining module 21, a segmentation module 22, a normalizing module 23, an anti-counterfeiting module 24 and a positioning module 25, which are respectively configured to execute the specific methods of S101, S102, S103, S104 and S105 in fig. 1; details can be found in the related description of fig. 1 and are only briefly stated here:
an obtaining module 21, configured to obtain a depth image of an actual scene.
And the segmentation module 22 is configured to perform palm region segmentation on the depth image through a first neural network to obtain a segmented palm region.
And the normalizing module 23 is configured to perform normalization processing and size conversion on the palm area to obtain a depth map of the normalized palm area.
And the anti-counterfeiting module 24 is configured to judge whether the actual scene corresponding to the depth map of the normalized palm region includes a real palm through a second neural network.
And the positioning module 25 is configured to predict, when the anti-counterfeit module 24 detects that the actual scene corresponding to the depth map includes a real palm, the key point coordinates of the depth map of the normalized palm region through a third neural network, and determine the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm region.
Further, as can be seen in fig. 3, the segmentation module 22 may specifically include a transformation unit 221, a segmentation unit 222, and a denoising unit 223:
a transforming unit 221, configured to perform size transformation on the depth image to obtain a depth image with a fixed size.
A segmentation unit 222, configured to input the fixed-size depth image into a first neural network, and perform palm region segmentation on the fixed-size depth image using the first neural network.
And a denoising unit 223, configured to perform denoising processing on the depth image with the fixed size after the palm region is segmented, so as to obtain a segmented palm region.
Further, referring to fig. 4, the anti-counterfeit module 24 may specifically include an input unit 241 and a determination unit 242:
an input unit 241, configured to input the normalized depth map of the palm region into a second neural network.
A determining unit 242, configured to obtain a judgment value through the second neural network and determine from it whether the actual scene corresponding to the depth map of the normalized palm region contains a real palm. The second neural network is a convolutional neural network comprising 5 convolutional layers and 3 fully-connected layers; all convolution kernels are 3 x 3, the numbers of output feature maps are 32, 64, 128, 256 and 512 respectively, a down-sampling layer with a 3 x 3 window follows each of the first three convolutional layers, a linear rectification function (ReLU) follows every convolutional layer, the fully-connected layers have 4096, 1024 and 3 nodes respectively, the first two fully-connected layers are each followed by a Dropout function that randomly sets outputs to zero with a certain probability to prevent overfitting, and the 3 nodes of the last fully-connected layer represent 3 categories: real palm, picture attack and video attack.
Because the three-dimensional human hand positioning device of fig. 2 takes depth information (a depth image) as input, it is insensitive to illumination and can still locate the human hand under extreme lighting conditions, giving it high practicability. Moreover, the three-dimensional anti-counterfeiting algorithm it adopts can effectively prevent malicious picture and video attacks, improving the reliability of hand positioning. In addition, the device directly obtains the coordinates of the key points in three-dimensional space, so when the palm deflects, an angle transformation can be performed in three-dimensional space to obtain a front-facing palm, reducing the negative influence on subsequent gesture recognition algorithms.
Fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51 and a computer program 52 stored in said memory 51 and executable on said processor 50, such as a program for performing the three-dimensional human hand key point positioning method. The processor 50, when executing the computer program 52, implements the steps in the above-described method embodiments, e.g., S101 to S105 shown in fig. 1. Alternatively, the processor 50, when executing the computer program 52, implements the functions of the modules/units in the system embodiments, such as the functions of the modules 21 to 25 shown in fig. 2.
Illustratively, the computer program 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 52 in the terminal device 5. For example, the computer program 52 may be partitioned into the obtaining module 21, the segmentation module 22, the normalizing module 23, the anti-counterfeiting module 24 and the positioning module 25 (modules in a virtual system), whose specific functions are as follows:
an obtaining module 21, configured to obtain a depth image of an actual scene.
And the segmentation module 22 is configured to perform palm region segmentation on the depth image through a first neural network to obtain a segmented palm region.
And the normalizing module 23 is configured to perform normalization processing and size conversion on the palm area to obtain a depth map of the normalized palm area.
And the anti-counterfeiting module 24 is configured to judge whether the actual scene corresponding to the depth map of the normalized palm region includes a real palm through a second neural network.
And the positioning module 25 is configured to predict, when the anti-counterfeit module 24 detects that the actual scene corresponding to the depth map includes a real palm, the key point coordinates of the depth map of the normalized palm region through a third neural network, and determine the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm region.
The terminal device 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device 5 may include, but is not limited to, a processor 50, a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of a terminal device 5 and does not constitute a limitation of terminal device 5 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The Processor 50 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device. The memory 51 is used for storing the computer programs and other programs and data required by the terminal device 5. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the functional units, sub-units and modules described above are illustrated as examples, and in practical applications, the functions may be distributed as needed to different functional units, sub-units and modules, that is, the internal structure of the system may be divided into different functional units, sub-units or modules to complete all or part of the functions described above. Each functional unit, sub-unit, and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated units or sub-units may be implemented in a form of hardware, or may be implemented in a form of software functional units. In addition, specific names of the functional units, the sub-units and the modules are only used for distinguishing one from another, and are not used for limiting the protection scope of the application. The specific working processes of the units, sub-units, and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed system/terminal device and method can be implemented in other ways. For example, the above-described system/terminal device embodiments are merely illustrative, and for example, the division of the modules, units or sub-units is only one logical function division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A three-dimensional human hand key point positioning method is characterized by comprising the following steps:
acquiring a depth image of an actual scene;
carrying out palm region segmentation on the depth image through a first neural network to obtain a segmented palm region;
carrying out normalization processing and size transformation on the palm area to obtain a depth map of the normalized palm area;
judging whether the actual scene corresponding to the depth map of the normalized palm area contains the real palm or not through a second neural network;
if yes, predicting the key point coordinates of the depth map of the normalized palm area through a third neural network, and determining the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm area.
2. The three-dimensional human hand key point positioning method according to claim 1, wherein the first neural network is a convolutional neural network, and the palm region segmentation of the depth image by the first neural network to obtain a segmented palm region comprises:
carrying out size transformation on the depth image to obtain a depth image with a fixed size;
inputting the depth image with the fixed size into a first neural network, and performing palm region segmentation on the depth image with the fixed size by using the first neural network;
and denoising the depth image with the fixed size after the palm region segmentation to obtain a segmented palm region.
3. The three-dimensional human hand key point positioning method according to claim 2, wherein the denoising of the fixed-size depth image after palm region segmentation to obtain a segmented palm region comprises:
selecting a first starting point in the segmented fixed-size depth image, extracting the nearby pixel points connected to the starting point, storing them in a first list, and marking them as processed, wherein the pixel points connected to the starting point are those whose segmentation value equals that of the starting point;
among the unprocessed pixel points of the segmented fixed-size depth image, continuing to search for points that are adjacent to points in the first list and have the same segmentation value, and adding them to the first list, until all pixel points connected to the first starting point are found, marked as processed, and added to the first list;
selecting a second starting point among the unprocessed pixel points of the segmented fixed-size depth image, finding and marking all pixel points connected to the second starting point, and adding them to a second list;
selecting further starting points, other than the first starting point and the second starting point, among the unprocessed pixel points of the segmented fixed-size depth image, and finding and marking all pixel points connected to them, until all points in the segmented fixed-size depth image are marked as processed, obtaining a plurality of lists, each list holding all pixel points of one connected region;
and finding, among all the lists, the list whose pixel points have value 1 and whose number of pixel points is largest, and determining that this list corresponds to the palm region.
4. The three-dimensional human hand key point positioning method according to claim 2, wherein the performing normalization processing and size transformation on the palm region to obtain a depth map of the normalized palm region comprises:
selecting the maximum and minimum values of the horizontal and vertical coordinates of all pixel points in the palm region, in the coordinate system of the fixed-size depth image, and obtaining from them the diagonal position coordinates of the rectangular frame containing the human hand, wherein the position coordinate of the upper-left corner is (x_min, y_min) and that of the lower-right corner is (x_max, y_max);
and cropping the depth image by the rectangular frame to obtain a depth map of the palm region, normalizing the depth values of the palm region, and size-transforming the palm region to obtain a normalized, fixed-size depth map of the palm region.
5. The three-dimensional human hand key point positioning method according to claim 1, wherein the second neural network is an anti-counterfeiting convolutional neural network for judging whether the acquired depth image belongs to a malicious attack, and the judging, through the second neural network, whether the actual scene corresponding to the depth map of the normalized palm region contains a real palm comprises:
feeding the depth map of the normalized palm region into the second neural network;
obtaining a judgment value through the second neural network, and judging from the judgment value whether the actual scene corresponding to the depth map of the normalized palm region contains a real palm, wherein the second neural network comprises 5 convolutional layers and 3 fully-connected layers, all convolution kernels are 3 x 3, the numbers of output feature maps are 32, 64, 128, 256 and 512 respectively, a down-sampling layer with a 3 x 3 window follows each of the first three convolutional layers, a linear rectification function ReLU follows every convolutional layer, the fully-connected layers have 4096, 1024 and 3 nodes respectively, the first two fully-connected layers are each followed by a Dropout function that randomly sets outputs to zero with a preset probability to prevent overfitting, and the 3 nodes of the last fully-connected layer represent 3 categories: real person, picture attack and video attack.
6. The three-dimensional human hand key point positioning method of claim 4, wherein the third neural network is a convolutional neural network, and the predicting, by the third neural network, key point coordinates of the depth map of the normalized palm region and determining, from the predicted key point coordinates, the key point coordinates of the palm in the depth image of the actual scene comprises:
feeding the depth map of the normalized palm region into the third neural network, the third neural network outputting predicted coordinate values of the key points of the normalized palm region, wherein the number of predicted coordinate values is m x 3, m represents the number of palm key points, and each coordinate value represents an abscissa, an ordinate or a depth value;
converting the predicted coordinate values into coordinate values in the palm region before size transformation, wherein the conversion formula is:
xi' = xi / s_w, yi' = yi / s_h, i = 1, ..., m (1)
wherein s_w and s_h respectively represent the width and height expansion factors of the size transformation, (xi, yi) respectively represent the abscissa and ordinate of a predicted key point coordinate value, (xi', yi') respectively represent the abscissa and ordinate of that key point in the palm region before size transformation, and m represents the number of key points;
and adding, to the horizontal and vertical coordinate values of the key points in the palm region before size transformation, the minimum values (x_min, y_min) of the horizontal and vertical coordinates of all pixel points of the pre-transformation palm region in the coordinate system of the fixed-size depth image, obtaining the coordinates of the palm key points in the fixed-size depth image.
7. A three-dimensional human hand key point positioning device is characterized by comprising:
the acquisition module is used for acquiring a depth image of an actual scene;
the segmentation module is used for carrying out palm region segmentation on the depth image through a first neural network to obtain a segmented palm region;
the normalizing module is used for carrying out normalization processing and size conversion on the palm area to obtain a depth map of the normalized palm area;
the anti-counterfeiting module is used for judging whether the actual scene corresponding to the depth map of the normalized palm area contains the real palm or not through a second neural network;
and the positioning module is used for predicting the key point coordinates of the depth map of the normalized palm area through a third neural network when the anti-counterfeiting module detects that the actual scene corresponding to the depth map contains the real palm, and determining the key point coordinates of the palm in the depth image of the actual scene through the predicted key point coordinates of the depth map of the normalized palm area.
8. The three-dimensional human hand keypoint locating device of claim 7, wherein said segmentation module comprises:
the transformation unit is used for carrying out size transformation on the depth image to obtain a depth image with a fixed size;
a segmentation unit, configured to input the fixed-size depth image into a first neural network, and perform palm region segmentation on the fixed-size depth image using the first neural network;
and the denoising unit is used for denoising the depth image with the fixed size after the palm region is segmented to obtain the segmented palm region.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-6 when executing the computer program.
10. A computer-readable medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202010034582.0A 2020-01-14 2020-01-14 Three-dimensional human hand key point positioning method and device Pending CN110852311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010034582.0A CN110852311A (en) 2020-01-14 2020-01-14 Three-dimensional human hand key point positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010034582.0A CN110852311A (en) 2020-01-14 2020-01-14 Three-dimensional human hand key point positioning method and device

Publications (1)

Publication Number Publication Date
CN110852311A true CN110852311A (en) 2020-02-28

Family

ID=69610681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010034582.0A Pending CN110852311A (en) 2020-01-14 2020-01-14 Three-dimensional human hand key point positioning method and device

Country Status (1)

Country Link
CN (1) CN110852311A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709268A (en) * 2020-04-24 2020-09-25 中国科学院软件研究所 Human hand posture estimation method and device based on human hand structure guidance in depth image
CN112156451A (en) * 2020-09-22 2021-01-01 歌尔科技有限公司 Handle and size adjusting method, size adjusting system and size adjusting device thereof
CN112233161A (en) * 2020-10-15 2021-01-15 北京达佳互联信息技术有限公司 Hand image depth determination method and device, electronic equipment and storage medium
CN112861783A (en) * 2021-03-08 2021-05-28 北京华捷艾米科技有限公司 Hand detection method and system
CN113065458A (en) * 2021-03-29 2021-07-02 新疆爱华盈通信息技术有限公司 Voting method and system based on gesture recognition and electronic device
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113724317A (en) * 2021-08-31 2021-11-30 南京未来网络产业创新有限公司 Hand joint positioning and local area calculation method, processor and memory
CN114581535A (en) * 2022-03-03 2022-06-03 北京深光科技有限公司 Method, device, storage medium and equipment for marking key points of user bones in image
CN113065458B (en) * 2021-03-29 2024-05-28 芯算一体(深圳)科技有限公司 Voting method and system based on gesture recognition and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106648103A (en) * 2016-12-28 2017-05-10 歌尔科技有限公司 Gesture tracking method for VR headset device and VR headset device
CN108776773A (en) * 2018-05-04 2018-11-09 华南理工大学 A kind of three-dimensional gesture recognition method and interactive system based on depth image
CN109684925A (en) * 2018-11-21 2019-04-26 深圳奥比中光科技有限公司 A kind of human face in-vivo detection method and equipment based on depth image
US20190310716A1 (en) * 2015-12-15 2019-10-10 Purdue Research Foundation Method and System for Hand Pose Detection
CN110383288A (en) * 2019-06-06 2019-10-25 深圳市汇顶科技股份有限公司 The method, apparatus and electronic equipment of recognition of face

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190310716A1 (en) * 2015-12-15 2019-10-10 Purdue Research Foundation Method and System for Hand Pose Detection
CN106648103A (en) * 2016-12-28 2017-05-10 歌尔科技有限公司 Gesture tracking method for VR headset device and VR headset device
CN108776773A (en) * 2018-05-04 2018-11-09 华南理工大学 A kind of three-dimensional gesture recognition method and interactive system based on depth image
CN109684925A (en) * 2018-11-21 2019-04-26 深圳奥比中光科技有限公司 A kind of human face in-vivo detection method and equipment based on depth image
CN110383288A (en) * 2019-06-06 2019-10-25 深圳市汇顶科技股份有限公司 The method, apparatus and electronic equipment of recognition of face

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
王江明 et al., "Multi-gradient fusion RGB-D image edge detection", Journal of Electronic Measurement and Instrumentation *
蔡汉明 et al., "Interactive Microcomputer Graphics", 31 October 1994 *
郭卡 et al., "Python Data Crawling Technology and Practice Manual", 31 August 2018 *
郭峰 et al., "Underwater Robot Motion", 30 April 2012, Harbin Engineering University Press *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709268A (en) * 2020-04-24 2020-09-25 中国科学院软件研究所 Human hand posture estimation method and device based on human hand structure guidance in depth image
CN111709268B (en) * 2020-04-24 2022-10-14 中国科学院软件研究所 Human hand posture estimation method and device based on human hand structure guidance in depth image
CN112156451B (en) * 2020-09-22 2022-07-22 歌尔科技有限公司 Handle and size adjusting method, size adjusting system and size adjusting device thereof
CN112156451A (en) * 2020-09-22 2021-01-01 歌尔科技有限公司 Handle and size adjusting method, size adjusting system and size adjusting device thereof
CN112233161A (en) * 2020-10-15 2021-01-15 北京达佳互联信息技术有限公司 Hand image depth determination method and device, electronic equipment and storage medium
CN112233161B (en) * 2020-10-15 2024-05-17 北京达佳互联信息技术有限公司 Hand image depth determination method and device, electronic equipment and storage medium
CN112861783A (en) * 2021-03-08 2021-05-28 北京华捷艾米科技有限公司 Hand detection method and system
CN113065458A (en) * 2021-03-29 2021-07-02 新疆爱华盈通信息技术有限公司 Voting method and system based on gesture recognition and electronic device
CN113065458B (en) * 2021-03-29 2024-05-28 芯算一体(深圳)科技有限公司 Voting method and system based on gesture recognition and electronic equipment
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113724317A (en) * 2021-08-31 2021-11-30 南京未来网络产业创新有限公司 Hand joint positioning and local area calculation method, processor and memory
CN113724317B (en) * 2021-08-31 2023-09-29 南京未来网络产业创新有限公司 Hand joint positioning and local area calculating method, processor and memory
CN114581535A (en) * 2022-03-03 2022-06-03 北京深光科技有限公司 Method, device, storage medium and equipment for marking key points of user bones in image

Similar Documents

Publication Publication Date Title
CN110852311A (en) Three-dimensional human hand key point positioning method and device
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN102667810B (en) Face recognition in digital images
CN110378297B (en) Remote sensing image target detection method and device based on deep learning and storage medium
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN110222572B (en) Tracking method, tracking device, electronic equipment and storage medium
CN110717497B (en) Image similarity matching method, device and computer readable storage medium
CN111414888A (en) Low-resolution face recognition method, system, device and storage medium
CN112085701A (en) Face ambiguity detection method and device, terminal equipment and storage medium
CN113490947A (en) Detection model training method and device, detection model using method and storage medium
CN111191582A (en) Three-dimensional target detection method, detection device, terminal device and computer-readable storage medium
CN112651380A (en) Face recognition method, face recognition device, terminal equipment and storage medium
CN111199558A (en) Image matching method based on deep learning
CN111104941B (en) Image direction correction method and device and electronic equipment
CN113111880A (en) Certificate image correction method and device, electronic equipment and storage medium
CN110232381B (en) License plate segmentation method, license plate segmentation device, computer equipment and computer readable storage medium
CN108960246B (en) Binarization processing device and method for image recognition
CN112488054B (en) Face recognition method, device, terminal equipment and storage medium
CN111353325A (en) Key point detection model training method and device
CN113191189A (en) Face living body detection method, terminal device and computer readable storage medium
El Ouariachi et al. RGB-D feature extraction method for hand gesture recognition based on a new fast and accurate multi-channel cartesian Jacobi moment invariants
CN112348008A (en) Certificate information identification method and device, terminal equipment and storage medium
CN113228105A (en) Image processing method and device and electronic equipment
WO2023011606A1 (en) Training method of live body detection network, method and apparatus of live body detectoin

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination