CN109635630B - Hand joint point detection method, device and storage medium - Google Patents


Info

Publication number
CN109635630B
CN109635630B (application CN201811238319.2A)
Authority
CN
China
Prior art keywords
hand
image
joint point
detected
prediction result
Prior art date
Legal status
Active
Application number
CN201811238319.2A
Other languages
Chinese (zh)
Other versions
CN109635630A (en
Inventor
沈辉
高原
刘霄
李旭斌
孙昊
文石磊
丁二锐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811238319.2A priority Critical patent/CN109635630B/en
Publication of CN109635630A publication Critical patent/CN109635630A/en
Application granted granted Critical
Publication of CN109635630B publication Critical patent/CN109635630B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a hand joint point detection method, a hand joint point detection device and a storage medium. The method comprises: intercepting a hand image from an image to be detected through a hand detection algorithm; inputting the hand image into a convolutional neural network for joint point prediction to obtain the positions of the hand joint points in the hand image; and optimizing the positions of the hand joint points through a preset cascade structure according to those positions and the constraint conditions of the hand joint points, and outputting the detection results of all hand joint points in the image to be detected. The technical scheme realizes accurate positioning of the hand joint points in the image to be detected, and solves the problem of inaccurate positioning of the hand joint points in the prior art.

Description

Hand joint point detection method, device and storage medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method and apparatus for detecting a hand joint point, and a storage medium.
Background
Gestures play an important role in daily life: people communicate with each other using gestures, and deaf-mute people rely on gestures to communicate with others. Gesture recognition therefore has very large potential in application scenarios such as human-computer interaction, virtual reality and simulated games. Since hand joint points help determine a gesture, detecting and locating hand joint points is of great significance to gesture confirmation and gesture recognition.
In the prior art, the position information of hand joint points is mainly acquired with an RGB-D method. Specifically, a depth camera collects a picture that includes a hand, the picture is processed to obtain an RGB-D picture, and the positions of the hand joint points are then obtained from that picture, so that the spatial information of the hand is restored to a certain extent and the gesture recognition problem is simplified. Here, an RGB-D picture refers to a picture that carries both RGB (red, green, blue) color channels and depth map features.
However, in the RGB-D method, the depth camera is expensive and its effective range is short; when the hand is far from the depth camera, gesture recognition degrades, resulting in inaccurate positioning of the hand joint points.
Disclosure of Invention
The application provides a hand joint detection method, a hand joint detection device and a storage medium, which are used for solving the problem of inaccurate hand joint positioning in the prior art.
The hand joint point detection method provided by the first aspect of the application comprises the following steps:
intercepting a hand image from an image to be detected through a hand detection algorithm;
inputting the hand image into a convolutional neural network to predict joint points, and obtaining the positions of hand joint points in the hand image;
and optimizing the positions of the hand joints through a preset cascade structure according to the positions of the hand joints and constraint conditions of the hand joints, and outputting detection results of all the hand joints in the image to be detected.
Optionally, in a possible implementation manner of the first aspect, the capturing, by a human hand detection algorithm, a hand image from an image to be detected includes:
according to the human hand detection algorithm, determining the position of the human hand in the image to be detected;
and cutting the region where the hand position is located to obtain the hand image.
Optionally, in another possible implementation manner of the first aspect, the inputting the hand image into a convolutional neural network to perform joint point prediction, to obtain a position of a hand joint point in the hand image includes:
normalizing the hand image;
performing joint point prediction on the normalized hand image by adopting two continuous hourglass models to obtain a position prediction result, wherein the position prediction result comprises the position of the hand joint point; wherein the convolutional neural network comprises the two continuous hourglass models.
Optionally, in the foregoing possible implementation manner of the first aspect, the location prediction result further includes: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
Optionally, in still another possible implementation manner of the first aspect, the optimizing, according to the position of the hand joint point and the constraint condition of the hand joint point, the position of the hand joint point through a preset cascade structure, and outputting a detection result of all hand joint points in the image to be detected includes:
inputting the position of the hand joint point into a first stage structure in the cascade structure according to the constraint condition of the hand joint point to obtain a prediction result;
and inputting the prediction result into a next stage structure for optimization to obtain a new prediction result, and repeating this step until the last stage structure of the cascade structure outputs the detection result of all hand joint points in the image to be detected.
Optionally, the cascade structure includes a six-stage optimization structure.
A second aspect of the present application provides a hand joint point detection apparatus, comprising: the device comprises an acquisition module, a processing module and an output module;
the acquisition module is used for intercepting a hand image from the image to be detected through a hand detection algorithm;
the processing module is used for inputting the hand image into a convolutional neural network to conduct joint point prediction, and obtaining the position of a hand joint point in the hand image;
the output module is used for optimizing the positions of the hand joints through a preset cascade structure according to the positions of the hand joints and constraint conditions of the hand joints, and outputting detection results of all the hand joints in the image to be detected.
Optionally, in one possible implementation manner of the second aspect, the acquiring module is specifically configured to determine a hand position in the image to be detected according to the hand detection algorithm, and cut an area where the hand position is located, so as to obtain the hand image.
Optionally, in another possible implementation manner of the second aspect, the processing module is specifically configured to perform normalization processing on the hand image, and perform joint point prediction on the hand image after normalization processing by using two continuous hourglass models to obtain a position prediction result, where the position prediction result includes a position of the hand joint point; wherein the convolutional neural network comprises the two continuous hourglass models.
Optionally, in the foregoing possible implementation manner of the second aspect, the position prediction result further includes: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
Optionally, in still another possible implementation manner of the second aspect, the output module is specifically configured to input, according to the constraint condition of the hand joint point, the position of the hand joint point into a first stage structure in the cascade structure to obtain a prediction result, input the prediction result into a next stage structure for optimization to obtain a new prediction result, and repeat this step until the last stage structure of the cascade structure outputs the detection result of all hand joint points in the image to be detected.
Optionally, the cascade structure includes a six-stage optimization structure.
A third aspect of the present application provides a hand joint detection device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect and the various possible implementations of the first aspect when executing the program.
A fourth aspect of the application provides a storage medium having stored therein instructions which when run on a computer cause the computer to perform the above-described first aspect and the various possible implementations of the first aspect.
According to the hand joint point detection method, device and storage medium provided by the application, joint point prediction is performed on the acquired hand image to obtain the positions of the joint points in the hand image; finally, those positions are optimized according to the joint point positions and the constraint conditions of the hand joint points, and the detection result of the hand joint points of the image is output. Accurate positioning of the hand joint points is thereby achieved, solving the problem of inaccurate positioning of hand joint points in the prior art.
Drawings
Fig. 1 is a schematic flow chart of a first embodiment of a hand joint point detection method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a second embodiment of a hand joint point detection method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a third embodiment of a hand joint point detection method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an hourglass model according to an embodiment of the present application;
FIG. 5 is a heat map of a hand joint point according to an embodiment of the present application;
fig. 6 is a schematic flow chart of a fourth embodiment of a hand joint point detection method according to the present application;
fig. 7 is a schematic structural diagram of a first embodiment of a hand joint point detection device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a second embodiment of a hand joint point detection device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the field of image recognition, gesture recognition is difficult owing to several characteristics of the human hand itself. First, because the human hand is highly flexible, a hand gesture is a continuous state, so the defined boundary between gestures is blurred. Second, self-occlusion is very severe in certain hand postures, and some parts of the hand are invisible from certain angles. For example, with a fist gesture, no viewing angle reveals the tips of all fingers.
Moreover, another problem with gesture recognition is that the same gesture can look very different from different perspectives, while different gestures can look very similar from certain perspectives. For example, a person looking at a palm from the front can clearly see five fingers, but looking from the side may see only one or two. Finally, the same gesture performed by different people can differ greatly; for example, when signaling the number 3 by hand, different people may extend entirely different fingers. These characteristics of gestures make gesture recognition difficult.
Aiming at the problem of inaccurate positioning of hand joint points in the prior art, the embodiment of the application provides a hand joint point detection method, a hand joint point detection device and a storage medium, the position of the joint point in an acquired hand image is obtained by predicting the joint point of the hand image, and finally, the position of the joint point is optimized according to the position of the joint point and constraint conditions of the hand joint point to output a detection result of the hand joint point of the image, so that the accurate positioning of the hand joint point is realized.
The technical scheme of the application is described in detail through specific embodiments. It should be noted that the following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1 is a flowchart of a first embodiment of a hand joint point detection method according to an embodiment of the present application. As shown in fig. 1, the hand joint point detection method may include the following steps:
step 11: and intercepting a hand image from the image to be detected through a hand detection algorithm.
To address the problems of high cost and short effective range of the depth camera in the prior art, the embodiment of the application can acquire an image containing a hand with an ordinary camera, and process the image in RGB color mode to obtain the image to be detected in RGB color mode.
It should be noted that, in this embodiment, the image to be detected processed by the human hand detection algorithm is an RGB image. That is, a hand image including a human hand can be determined and acquired from the image to be detected using a deep convolutional neural network corresponding to the human hand detection algorithm.
For a specific implementation of this step 11, reference may be made to the following description in the embodiment shown in fig. 2, which is not repeated here.
Step 12: and inputting the hand image into a convolutional neural network to predict joint points, and obtaining the positions of hand joint points in the hand image.
In this embodiment, a convolutional neural network may be designed, and the convolutional neural network may be used to process a hand image including a human hand, so as to predict the position of a hand node in the hand image.
Optionally, in order to further improve the accuracy of the prediction result obtained by the convolutional neural network, the hand image may first be normalized, and the convolutional neural network may then predict the hand joint point positions on the normalized hand image.
For a specific implementation of this step 12, reference may be made to the following description in the embodiment shown in fig. 3, which is not repeated here.
Step 13: and optimizing the positions of the hand joints through a preset cascade structure according to the positions of the hand joints and constraint conditions of the hand joints, and outputting detection results of all the hand joints in the image to be detected.
In this embodiment, the detection result for the positions of the hand joint points obtained in step 12 may be inaccurate: normally only the positions of hand joint points with obvious features can be obtained, while the positions of hand joint points with insignificant features cannot.
In this embodiment, the constraint conditions of the hand joint points, combined with the positions of the joint points with obvious features, can assist in obtaining the positions of the joint points that are hard to detect. The obtained positions are then optimized with a preset cascade structure, so that the detection result of the hand joint points in the image to be detected is refined continuously and the detection results of all hand joint points in the image to be detected are finally output.
According to the hand joint point detection method provided by the embodiment of the application, a hand image is intercepted from the image to be detected by a hand detection algorithm; the hand image is input into the convolutional neural network for joint point prediction to obtain the positions of the hand joint points in the hand image; finally, those positions are optimized through a preset cascade structure according to the positions and the constraint conditions of the hand joint points, and the detection results of all hand joint points in the image to be detected are output. This technical scheme obtains the detection result of the hand joint points without a depth camera, avoiding the inaccurate positioning caused in the prior art by the high cost and short effective range of depth cameras.
Optionally, based on the foregoing embodiment, fig. 2 is a schematic flow chart of a second embodiment of the hand joint point detection method according to an embodiment of the present application. As shown in fig. 2, in this embodiment, the step 11 (intercepting the hand image from the image to be detected through a hand detection algorithm) may be specifically implemented by the following steps:
step 21: and determining the hand position in the image to be detected according to a hand detection algorithm.
Alternatively, in this embodiment, the hand key points may be obtained from the RGB image by using a hand detection algorithm corresponding to the deep convolutional neural network, so as to determine the hand position in the image to be detected.
Specifically, in this embodiment, the human hand detection algorithm combines the SSD detection algorithm with a MobileNet backbone and an FPN structure; this combination determines the position of the human hand in the image to be detected while ensuring both the accuracy and the speed of the corresponding detection result.
The SSD (single shot detector) detection algorithm adopts multi-scale feature fusion without an up-sampling process: it extracts features of different scales from different layers of the network to make predictions, without adding extra computation.
MobileNet is based on a streamlined architecture that uses depthwise-separable convolutions to construct a lightweight deep neural network; in the embodiment of the present application, a MobileNet network serves as the feature extraction network for extracting the human hand feature matrix from the image to be detected. Optionally, in this embodiment, each standard convolution layer may be split into two layers, a depthwise convolution and a pointwise (1x1) convolution, which greatly reduces computation time and parameter count while essentially preserving accuracy.
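The parameter savings of this split can be sketched with simple arithmetic (the channel sizes below are illustrative, not taken from the patent):

```python
def standard_conv_params(k, c_in, c_out):
    # a k x k standard convolution mixes space and channels at once
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel;
    # pointwise: a 1 x 1 convolution that mixes channels
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 73728
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192 = 8768
print(std, sep, round(std / sep, 1))          # roughly 8.4x fewer parameters
```

For a 3x3 kernel the factorized form needs roughly one eighth of the parameters, which is where MobileNet's speed advantage comes from.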
In addition, in this embodiment, to ensure that smaller objects can also be detected, a feature pyramid network (FPN) structure is adopted: the human hand feature matrix obtained above is up-sampled to obtain a feature matrix with the same resolution as the bottom-layer feature matrix, which is then fused with the bottom-layer feature matrix. In this way the high resolution of low-level features and the rich semantic information of high-level features are combined, and, unlike conventional feature fusion, a prediction is made independently on each fused feature layer.
Step 22: and cutting the region where the hand is positioned to obtain a hand image.
In this embodiment, after the hand position in the image to be detected is located with the hand detection algorithm, the region where the hand is located may be cropped so that the hand is not distorted and its size keeps a fixed aspect ratio, yielding the hand image.
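A minimal sketch of such a crop (the square-expansion margin and the helper name are assumptions for illustration, not taken from the patent; the patch would subsequently be resized to the network's input size):

```python
import numpy as np

def crop_hand(image, box, margin=0.2):
    """Crop a square patch around a detected hand box (x1, y1, x2, y2),
    expanding the shorter side so the hand keeps its aspect ratio
    and is not distorted."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    side = max(x2 - x1, y2 - y1) * (1.0 + margin)  # square with margin
    h, w = image.shape[:2]
    top = max(0, int(cy - side / 2)); bottom = min(h, int(cy + side / 2))
    left = max(0, int(cx - side / 2)); right = min(w, int(cx + side / 2))
    return image[top:bottom, left:right]

patch = crop_hand(np.zeros((480, 640, 3)), (100, 100, 180, 220))
print(patch.shape)  # (144, 144, 3): a square patch around the hand box
```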
It should be noted that, in this embodiment, after a right-hand image is obtained with the above hand detection algorithm, a left-hand image is equivalent to a horizontally flipped right-hand image, which further reduces hand joint point detection to a single-hand detection problem.
According to the hand joint point detection method provided by the embodiment of the application, the hand position in the image to be detected is determined according to the hand detection algorithm, and the area where the hand position is located is cut to obtain the hand image. According to the technical scheme, the hand position in the image to be detected can be accurately determined, the accuracy and the speed of the detection result corresponding to the hand position are guaranteed, and a precondition is provided for subsequent detection of the hand joint point.
Optionally, on the basis of the foregoing embodiment, fig. 3 is a schematic flow chart of a third embodiment of the hand joint point detection method according to an embodiment of the present application. As shown in fig. 3, in this embodiment, the step 12 (inputting the hand image into the convolutional neural network for joint point prediction to obtain the positions of the hand joint points in the hand image) may be specifically implemented by the following steps:
step 31: and carrying out normalization processing on the hand image.
Optionally, image normalization means finding a set of parameters using the invariant moments of the image so as to eliminate the effect of other transformation functions, i.e., converting the image into a unique standard form that resists affine transformation or interference from non-uniform lighting. Therefore, in this embodiment, after the hand image is obtained from the image to be detected, it may first be normalized to obtain a normalized hand image.
Specifically, the normalization can be stated as: normalized pixel value = (original pixel value - pixel mean) / pixel amplitude range, where the pixel mean is the mean of all pixel values of the images to be detected and the pixel amplitude range is the amplitude range of those pixel values.
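A minimal sketch of this formula for a single image (for simplicity the statistics here are taken from the image itself, whereas the patent computes them over the images to be detected):

```python
import numpy as np

def normalize(image):
    # normalized pixel = (original pixel - pixel mean) / pixel amplitude range
    img = image.astype(np.float32)
    amp = img.max() - img.min()  # amplitude range of the pixel values
    return (img - img.mean()) / amp

x = np.array([[0, 128], [255, 64]], dtype=np.uint8)
y = normalize(x)
print(y.min(), y.max())  # zero-mean values within [-1, 1]
```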
Step 32: and carrying out joint point prediction on the normalized hand image by adopting two continuous hourglass models to obtain a position prediction result.
In this embodiment, the position prediction result includes the position of the hand joint point; wherein the convolutional neural network comprises two continuous hourglass models.
Optionally, this embodiment employs a convolutional neural network comprising two consecutive hourglass models to predict the positions of the hand joint points in the hand image. Specifically, the normalized hand image is used as the input of the convolutional neural network to obtain the initialized hand joint point positions: the feature matrix corresponding to the normalized hand image is fed into a network of two consecutive hourglass structures to obtain a feature matrix carrying both regional features and global semantic features, which is then passed through one convolution layer to predict the initialized hand joint point positions.
Fig. 4 is a schematic structural diagram of the hourglass model in an embodiment of the application. As shown in fig. 4, the upper half of the hourglass model performs convolution and pooling operations on the bottom-layer feature matrix corresponding to the normalized hand image, reducing its resolution; the lower half then up-samples the result to obtain a feature matrix with the same resolution as the bottom-layer feature matrix; finally the two matrices, now of equal resolution, are added element by element, so that the resulting feature matrix carries both regional features and global semantic features.
For example, the operation of the hourglass model may be formulated as X' = BilinearUpsample(MultiConv(X)) + X, where MultiConv denotes a multi-layer convolution operation and BilinearUpsample denotes bilinear-interpolation up-sampling. The hourglass model first convolves the bottom-layer feature matrix X to obtain a higher-layer feature matrix, then up-samples it with bilinear interpolation to obtain a feature matrix with the same resolution as the bottom-layer feature matrix, and finally adds the up-sampled feature matrix to the input feature matrix element by element, yielding a feature matrix with both regional features and global semantic features.
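The formula can be sketched numerically as follows (the 2x average pooling stands in for the learned MultiConv branch, which is an assumption made purely for illustration):

```python
import numpy as np

def bilinear_upsample(x, factor=2):
    """Minimal bilinear up-sampling of a 2-D feature map."""
    h, w = x.shape
    ys = np.clip((np.arange(h * factor) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * factor) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    return (x[np.ix_(y0, x0)] * (1 - wy) * (1 - wx)
            + x[np.ix_(y0, x1)] * (1 - wy) * wx
            + x[np.ix_(y1, x0)] * wy * (1 - wx)
            + x[np.ix_(y1, x1)] * wy * wx)

def multi_conv(x):
    # stand-in for MultiConv: 2x average pooling halves the resolution
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def hourglass_residual(x):
    # X' = BilinearUpsample(MultiConv(X)) + X
    return bilinear_upsample(multi_conv(x)) + x

out = hourglass_residual(np.arange(16, dtype=float).reshape(4, 4))
print(out.shape)  # (4, 4): same resolution as the input
```

The key property the sketch demonstrates is that the branch output is restored to the input resolution before the element-wise addition, so the skip connection is well-defined.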
In summary, the hand image in this embodiment passes successively through two similar hourglass networks, and a coarse prediction of the hand joint point positions is then output from the resulting feature matrix through several convolution layers.
According to the hand joint point detection method provided by the embodiment of the application, the hand image is normalized, and two continuous hourglass models are adopted to conduct joint point prediction on the hand image after normalization processing, so that a position prediction result is obtained, wherein the position prediction result comprises the position of the hand joint point; wherein the convolutional neural network comprises two continuous hourglass models. According to the technical scheme, the hand joint point position of the hand image after normalization processing is predicted by using the hourglass model, so that the prediction accuracy of the hand joint point is improved.
Optionally, in this embodiment, the location prediction result further includes: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
Specifically, two consecutive hourglass models perform joint point prediction on the normalized hand image to obtain a position prediction result, which contains not only the position of each hand joint point but also a heat map for each hand joint point. From the heat map, the confidence of each predicted joint position can be determined, i.e., the probability that the joint actually appears at the corresponding position. High confidence that the obtained hand joint points appear at the corresponding positions provides a precondition for subsequently obtaining accurate predictions of all hand joint points in the image to be detected.
Exemplarily, fig. 5 is a heat map of a hand joint point according to an embodiment of the present application. As shown in fig. 5, the hand image is input into the convolutional neural network with two consecutive hourglass models, which outputs a feature matrix; this feature matrix is passed through several convolution layers to obtain a new feature matrix. The new feature matrix is then convolved with 21 convolution kernels of size 3x3, so that each pixel yields 21 values, each representing the likelihood that the corresponding hand joint point lies at that pixel. With 21 such values per pixel, 21 heat maps are obtained; the brighter a location in a heat map, the higher the likelihood that it belongs to the corresponding hand joint point.
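The step from heat maps back to joint coordinates can be sketched as follows (taking the brightest pixel of each of the 21 maps; any sub-pixel refinement is omitted):

```python
import numpy as np

def heatmaps_to_joints(heatmaps):
    """heatmaps: array of shape (21, H, W). For each joint, return
    (x, y, confidence): the brightest pixel and its heat value."""
    joints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((int(x), int(y), float(hm[y, x])))
    return joints

hm = np.zeros((21, 8, 8))
hm[0, 3, 5] = 0.9                 # joint 0 peaks at row 3, column 5
print(heatmaps_to_joints(hm)[0])  # (5, 3, 0.9)
```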
Optionally, based on the foregoing embodiments, fig. 6 is a schematic flowchart of a fourth embodiment of the hand joint point detection method according to an embodiment of the present application. As shown in fig. 6, in this embodiment, step 13 (optimizing the positions of the hand joint points through a preset cascade structure according to the positions of the hand joint points and the constraint conditions of the hand joint points, and outputting the detection results of all hand joint points in the image to be detected) may be implemented by the following steps:
step 61: and inputting the position of the hand node into a first stage structure in the cascade structure according to the constraint condition of the hand node to obtain a prediction result.
Optionally, in this embodiment, due to self-occlusion of the human hand and the effect of viewing angle on visibility, some hand joint points in the hand image are occluded and therefore not visible. As a result, the convolutional neural network can only predict the positions of the relatively obvious hand joint points, and the resulting predictions are relatively coarse.
However, constraint relationships exist between hand joint points; for example, the distance between finger joint points is less than or equal to the length of the finger, and the finger root joint points are always adjacent in sequence. Take a fist gesture viewed from the back: only the finger root joint points and the third finger joint points are visible, while the fingertips and the second finger joint points are occluded. Nevertheless, according to these latent constraints, the distance between finger joint points cannot exceed the finger length, so each finger's occluded joint points must lie near that finger's visible joint points.
Therefore, after the positions of the hand joint points with obvious features have been accurately predicted, the positions of the remaining hand joint points can be predicted with the aid of the constraint conditions between the hand joint points of the human hand.
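As a hedged illustration of the inter-joint distance constraint described above (the explicit clamping rule and pixel units are assumptions for this sketch; in the patent the constraints are exploited implicitly by the learned cascade structure, not applied as a hard rule):

```python
import numpy as np

def clamp_to_segment_length(joint, anchor, max_len):
    """If a predicted joint lies farther from its neighbouring visible
    joint than the finger segment allows, pull it back onto the circle
    of radius max_len around the anchor -- a simple stand-in for the
    constraint that inter-joint distance cannot exceed finger length."""
    joint = np.asarray(joint, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    v = joint - anchor
    d = np.linalg.norm(v)
    if d <= max_len:
        return tuple(float(c) for c in joint)
    return tuple(float(c) for c in anchor + v * (max_len / d))

# An occluded fingertip predicted 10 px from the second joint, when the
# segment is only 6 px long, gets pulled back onto the constraint.
fixed = clamp_to_segment_length((10.0, 0.0), (0.0, 0.0), 6.0)
print(fixed)   # → (6.0, 0.0)
```

A prediction already within range is left untouched, so only implausible joints are corrected.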
By way of example, this embodiment of the present application uses a cascade structure to continuously optimize the hand joint point predictions: the positions of the hand joint points in the above position prediction result are first input into the first stage structure of the cascade structure to obtain a prediction result.
Step 62: and inputting the prediction result into a next stage structure to perform optimization to obtain a new prediction result, and repeating the steps until the last stage structure of the cascade structure outputs detection results of all hand joints of the image to be detected.
In this embodiment, after the first stage structure of the cascade obtains a prediction result, that result is input into the second stage structure for optimization, yielding a new prediction result. Specifically, each stage structure can be implemented with convolutional layers, which further predict the positions of the hand joint points whose features are not obvious. Likewise, the output of each stage structure in the cascade is input into the next stage structure, further optimizing the predictions of all hand joint points in the hand image. Finally, the output of the last stage structure of the cascade is taken as the detection result of all hand joint points in the image to be detected.
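The stage-by-stage refinement described above can be sketched as a simple loop. The toy stages below stand in for the convolutional stage structures and are purely illustrative; only the control flow (each stage's output becomes the next stage's input) reflects the patent:

```python
import numpy as np

def run_cascade(initial_positions, stages):
    """Feed the initial position prediction through each stage in turn;
    every stage returns a refined prediction that becomes the next
    stage's input, and the last stage's output is the final detection."""
    pred = initial_positions
    for stage in stages:
        pred = stage(pred)
    return pred

# Toy stages: each nudges every joint halfway toward a 'true' location,
# mimicking the progressive refinement of a six-stage structure.
true_joints = np.full((21, 2), 100.0)
def make_stage():
    return lambda p: p + 0.5 * (true_joints - p)

coarse = np.zeros((21, 2))                 # coarse hourglass prediction
final = run_cascade(coarse, [make_stage() for _ in range(6)])
print(np.allclose(final, true_joints, atol=2.0))   # → True
```

With six halving stages the residual error shrinks from 100 to under 2, illustrating why later stages can recover joints the first prediction located only coarsely.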
According to the hand joint point detection method provided by this embodiment of the application, the positions of the hand joint points are input, according to the constraint conditions of the hand joint points, into the first stage structure of the cascade structure to obtain a prediction result; the prediction result is then input into the next stage structure for optimization to obtain a new prediction result, and this step is repeated until the last stage structure of the cascade structure outputs the detection results of all hand joint points in the image to be detected. In this technical scheme, the detection results of all hand joint points in the image to be detected are obtained from the positions of the hand joint points in the position prediction result, which simplifies the hand joint point detection scheme, improves the positioning accuracy of the hand joint points, and has guiding significance for gesture recognition.
Illustratively, in this embodiment, the cascade structure includes a six-stage optimization structure.
Optionally, the cascade structure may include six stages. The input of the first stage structure is the position prediction result obtained above, and the input of every subsequent stage structure is the hand joint point output of the previous stage structure; that is, starting from the position prediction result, each stage refines the positions of all hand joint points in the image to be detected through several convolutional layers.
The following are apparatus embodiments of the present application, which may be used to perform the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 7 is a schematic structural diagram of a first embodiment of a hand joint detection device according to an embodiment of the present application. As shown in fig. 7, the apparatus may include: an acquisition module 71, a processing module 72 and an output module 73.
The acquiring module 71 is configured to intercept a hand image from an image to be detected through a hand detection algorithm;
the processing module 72 is configured to input the hand image into a convolutional neural network for joint prediction, so as to obtain a position of a hand joint point in the hand image;
the output module 73 is configured to optimize the positions of the hand joint points through a preset cascade structure according to the positions of the hand joint points and the constraint conditions of the hand joint points, and output the detection results of all hand joint points in the image to be detected.
Optionally, in one possible implementation manner of this embodiment, the obtaining module 71 is specifically configured to determine a hand position in the image to be detected according to the hand detection algorithm, and cut an area where the hand position is located, so as to obtain the hand image.
Optionally, in another possible implementation manner of this embodiment, the processing module 72 is specifically configured to perform normalization processing on the hand image, and perform joint point prediction on the hand image after normalization processing by using two continuous hourglass models, so as to obtain a position prediction result, where the position prediction result includes a position of the hand joint point; wherein the convolutional neural network comprises the two continuous hourglass models.
Optionally, in the foregoing possible implementation manner of this embodiment, the position prediction result further includes: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
Optionally, in one possible implementation manner of this embodiment, the output module 73 is specifically configured to input, according to the constraint conditions of the hand joint points, the positions of the hand joint points into the first stage structure of the cascade structure to obtain a prediction result, input the prediction result into the next stage structure for optimization to obtain a new prediction result, and repeat this step until the last stage structure of the cascade structure outputs the detection results of all hand joint points in the image to be detected.
Illustratively, the cascade structure includes a six-stage optimization structure.
The device provided in the embodiment of the present application may be used to perform the methods in the embodiments shown in fig. 1 to 6, and its implementation principle and technical effects are similar, and are not described herein again.
It should be noted that the division of the above apparatus into modules is merely a division by logical function; in an actual implementation, the modules may be fully or partially integrated into one physical entity, or may be physically separate. These modules may all be implemented in the form of software called by a processing element, all in hardware, or partly in the form of software called by a processing element and partly in hardware. For example, the determining module may be a separately established processing element, may be integrated into a chip of the above apparatus, or may be stored in the memory of the above apparatus in the form of program code that a processing element of the apparatus calls to execute the module's functions. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, the steps of the above method, or the above modules, may be completed by integrated logic circuits of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU), or another processor that can call program code. For yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
Fig. 8 is a schematic structural diagram of a second embodiment of a hand joint point detection device according to an embodiment of the present application. As shown in fig. 8, the hand joint point detection device may include a processor 81, a memory 82, and a computer program stored in the memory 82 and executable on the processor 81; when the processor 81 executes the program, the method of the embodiments shown in fig. 1 to fig. 6 is implemented.
Optionally, an embodiment of the present application further provides a storage medium storing instructions that, when run on a computer, cause the computer to perform the method of the embodiments shown in fig. 1 to fig. 6.
Optionally, an embodiment of the present application further provides a chip for executing instructions, where the chip is configured to perform the method of the embodiment shown in fig. 1 to fig. 6.
The term "plurality" herein refers to two or more. The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship; in the formula, the character "/" indicates that the front and rear associated objects are a "division" relationship.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.
It should be understood that, in the embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (8)

1. A method for detecting a hand joint point, comprising:
intercepting a hand image from an image to be detected through a hand detection algorithm;
inputting the hand image into a convolutional neural network to predict joint points, and obtaining the positions of hand joint points in the hand image;
optimizing the positions of the hand joints through a preset cascade structure according to the positions of the hand joints and constraint conditions of the hand joints, and outputting detection results of all the hand joints in the image to be detected;
the method for capturing the hand image from the image to be detected through the hand detection algorithm comprises the following steps:
according to the human hand detection algorithm, determining the position of the human hand in the image to be detected;
cutting the region where the hand position is located to obtain a hand image;
inputting the hand image into a convolutional neural network for joint point prediction to obtain the position of a hand joint point in the hand image, wherein the method comprises the following steps:
normalizing the hand image;
performing joint point prediction on the normalized hand image by adopting two continuous hourglass models to obtain a position prediction result, wherein the position prediction result comprises the position of a hand joint point; wherein the convolutional neural network comprises the two continuous hourglass models;
optimizing the positions of the hand nodes through a preset cascade structure according to the positions of the hand nodes and constraint conditions of the hand nodes, and outputting detection results of all the hand nodes in the image to be detected, wherein the detection results comprise:
inputting the position of the hand joint point into a first stage structure in a cascade structure according to the constraint condition of the hand joint point to obtain a prediction result;
and inputting the prediction result into a next stage structure to perform optimization to obtain a new prediction result, and repeating the steps until the last stage structure of the cascade structure outputs detection results of all hand joints in the image to be detected.
2. The method of claim 1, wherein the position prediction result further comprises: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
3. The method of claim 1, wherein the cascade structure comprises a six-stage optimization structure.
4. A hand joint point detection device, comprising: the device comprises an acquisition module, a processing module and an output module;
the acquisition module is used for intercepting a hand image from the image to be detected through a hand detection algorithm;
the processing module is used for inputting the hand image into a convolutional neural network to conduct joint point prediction, and obtaining the position of a hand joint point in the hand image;
the output module is used for optimizing the positions of the hand joint points through a preset cascade structure according to the positions of the hand joint points and constraint conditions of the hand joint points and outputting detection results of all the hand joint points in the image to be detected;
the acquisition module is specifically configured to determine a hand position in the image to be detected according to the hand detection algorithm, and cut an area where the hand position is located to obtain the hand image;
the processing module is specifically configured to perform normalization processing on the hand image, perform joint point prediction on the hand image after normalization processing by using two continuous hourglass models, and obtain a position prediction result, where the position prediction result includes a position of a hand joint point; wherein the convolutional neural network comprises the two continuous hourglass models;
the output module is specifically configured to input the position of the hand node to a first stage structure in the cascade structure according to the constraint condition of the hand node to obtain a prediction result, input the prediction result to a next stage structure to perform optimization to obtain a new prediction result, and repeat the steps until the last stage structure of the cascade structure outputs the detection results of all the hand nodes in the image to be detected.
5. The apparatus of claim 4, wherein the position predictor further comprises: and the corresponding heat degree map of each hand joint point is used for representing the confidence degree of the hand joint point at the corresponding position.
6. The apparatus of claim 4, wherein the cascade structure comprises a six-stage optimization structure.
7. A hand joint detection device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-3 when executing the program.
8. A storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-3.
CN201811238319.2A 2018-10-23 2018-10-23 Hand joint point detection method, device and storage medium Active CN109635630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811238319.2A CN109635630B (en) 2018-10-23 2018-10-23 Hand joint point detection method, device and storage medium


Publications (2)

Publication Number Publication Date
CN109635630A CN109635630A (en) 2019-04-16
CN109635630B true CN109635630B (en) 2023-09-01

Family

ID=66066510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811238319.2A Active CN109635630B (en) 2018-10-23 2018-10-23 Hand joint point detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109635630B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349180B (en) * 2019-07-17 2022-04-08 达闼机器人有限公司 Human body joint point prediction method and device and motion type identification method and device
CN111626101A (en) * 2020-04-13 2020-09-04 惠州市德赛西威汽车电子股份有限公司 Smoking monitoring method and system based on ADAS
CN111539288B (en) * 2020-04-16 2023-04-07 中山大学 Real-time detection method for gestures of both hands
CN112183424A (en) * 2020-10-12 2021-01-05 北京华严互娱科技有限公司 Real-time hand tracking method and system based on video
CN113569821B (en) * 2021-09-24 2022-02-15 深圳市信润富联数字科技有限公司 Gesture recognition method, device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718879A (en) * 2016-01-19 2016-06-29 华南理工大学 Free-scene egocentric-vision finger key point detection method based on depth convolution nerve network
CN106650827A (en) * 2016-12-30 2017-05-10 南京大学 Human body posture estimation method and system based on structure guidance deep learning
CN107451568A (en) * 2017-08-03 2017-12-08 重庆邮电大学 Use the attitude detecting method and equipment of depth convolutional neural networks
CN108256431A (en) * 2017-12-20 2018-07-06 中车工业研究院有限公司 A kind of hand position identification method and device
WO2018153322A1 (en) * 2017-02-23 2018-08-30 北京市商汤科技开发有限公司 Key point detection method, neural network training method, apparatus and electronic device


Also Published As

Publication number Publication date
CN109635630A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109635630B (en) Hand joint point detection method, device and storage medium
US20220383535A1 (en) Object Tracking Method and Device, Electronic Device, and Computer-Readable Storage Medium
He et al. Enhanced boundary learning for glass-like object segmentation
CN110378348B (en) Video instance segmentation method, apparatus and computer-readable storage medium
Chen et al. Visibility-aware point-based multi-view stereo network
CN111753961B (en) Model training method and device, prediction method and device
JP4168125B2 (en) Data processing system and method
Cambuim et al. An FPGA-based real-time occlusion robust stereo vision system using semi-global matching
CN110648363B (en) Camera gesture determining method and device, storage medium and electronic equipment
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
CN113095106A (en) Human body posture estimation method and device
JP6902811B2 (en) Parallax estimation systems and methods, electronic devices and computer readable storage media
KR102305229B1 (en) Method and device for producing feature map information for extracting boundary information from image
CN112183506A (en) Human body posture generation method and system
CN111868738B (en) Cross-device monitoring computer vision system
WO2023202283A1 (en) Image generation model training method and apparatus, image generation method and apparatus, and device
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN105574844B (en) Rdaiation response Function Estimation method and apparatus
CN112115786B (en) Monocular vision odometer method based on attention U-net
CN113537026A (en) Primitive detection method, device, equipment and medium in building plan
CN115984712A (en) Multi-scale feature-based remote sensing image small target detection method and system
KR20230083212A (en) Apparatus and method for estimating object posture
Choi et al. Implementation of Real‐Time Post‐Processing for High‐Quality Stereo Vision
WO2023003646A1 (en) Focused computer detection of objects in images
Xu et al. A real-time semi-dense depth-guided depth completion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant