CN111222486A - Training method, device and equipment for hand gesture recognition model and storage medium - Google Patents

Training method, device and equipment for hand gesture recognition model and storage medium


Publication number
CN111222486A
CN111222486A (application CN202010042559.6A)
Authority
CN
China
Prior art keywords
hand
image
sample
segmentation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010042559.6A
Other languages
Chinese (zh)
Other versions
CN111222486B (en)
Inventor
陈逸飞
吴建宝
范伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010042559.6A priority Critical patent/CN111222486B/en
Publication of CN111222486A publication Critical patent/CN111222486A/en
Application granted granted Critical
Publication of CN111222486B publication Critical patent/CN111222486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/11Hand-related biometrics; Hand pose recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs

Abstract

The application provides a training method, device and equipment for a hand gesture recognition model, and a storage medium, and relates to the technical field of AI. The method comprises the following steps: acquiring a training sample, where the training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image; obtaining confidence maps corresponding to m limbs according to the sample hand posture information, where m is a positive integer; generating a synthesized hand segmentation map corresponding to the sample hand image according to the confidence maps corresponding to the m limbs; and training the hand gesture recognition model using the sample hand image, the synthesized hand segmentation map and the sample hand gesture information. Compared with the related art, in which a hand segmentation map must be labeled manually before the model can be trained on it, the technical scheme provided by the embodiments of the application automatically obtains the synthesized hand segmentation map once the sample hand posture information is available and trains the model based on it, without manual labeling, thereby reducing the labor cost and time cost required for model training.

Description

Training method, device and equipment for hand gesture recognition model and storage medium
Technical Field
The embodiment of the application relates to the technical field of Artificial Intelligence (AI), in particular to a training method, a device, equipment and a storage medium for a hand gesture recognition model.
Background
Hand gesture recognition refers to accurately recognizing the positions of hand skeleton nodes from images. Hand gestures play an important role in many AI applications, such as human-computer interaction, virtual reality and augmented reality.
In the related art, human hand gesture recognition usually employs a multi-task gesture recognition model, such as a Mask-Pose Cascaded CNN (Convolutional Neural Network), which combines gesture recognition with instance segmentation. To train such a gesture recognition model, a large number of hand images and the hand segmentation maps corresponding to those hand images must be obtained as training samples, after which the hand key points are predicted.
In the related art, a large number of hand segmentation maps need to be labeled manually, which results in high labor cost and time cost for model training.
Disclosure of Invention
The embodiment of the application provides a training method, a training device, equipment and a storage medium for a hand gesture recognition model, which can be used for reducing the labor cost and the time cost required by model training. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for training a hand gesture recognition model, where the method includes:
acquiring a training sample, wherein the training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image;
obtaining confidence maps corresponding to m sections of limbs according to the sample hand posture information, wherein the value of a pixel point in the confidence map corresponding to the ith section of limb is used for representing the confidence of the pixel point belonging to the ith section of limb, m is a positive integer, and i is a positive integer less than or equal to m;
generating a synthesized hand part segmentation map corresponding to the sample hand image according to the confidence level maps corresponding to the m limbs respectively, wherein the synthesized hand part segmentation map is an image obtained by segmenting a hand region and a non-hand region in the sample hand image, and the value of a pixel point in the synthesized hand part segmentation map is used for representing the confidence level of the pixel point belonging to the limbs;
and training the hand gesture recognition model by adopting the sample hand image, the synthesized hand part segmentation map and the sample hand gesture information.
In another aspect, an embodiment of the present application provides a hand gesture recognition method, where the method includes:
acquiring a target hand image;
calling a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training a sample hand image, a synthesized hand segmentation graph corresponding to the sample hand image and sample hand gesture information corresponding to the sample hand image, the synthesized hand segmentation graph is obtained according to the sample hand gesture information, and the value of a pixel point in the synthesized hand segmentation graph is used for representing the confidence coefficient that the pixel point belongs to a limb;
and determining hand gesture information corresponding to the target hand image through the hand gesture recognition model.
Optionally, the hand gesture recognition model comprises a feature extraction part, a structure prediction part and a gesture prediction part;
the characteristic extraction part is used for extracting a characteristic map of the target hand image;
the structure prediction part is used for acquiring a predicted hand segmentation graph according to the feature graph, wherein the predicted hand segmentation graph comprises a predicted finger segmentation subgraph, a predicted palm segmentation subgraph and a predicted hand segmentation subgraph which are respectively corresponding to each finger, the predicted finger segmentation subgraph corresponding to the target finger is an image segmented with a target finger region and a non-target finger region, the predicted palm segmentation subgraph is an image segmented with a palm region and a non-palm region, and the predicted hand segmentation subgraph is an image segmented with a hand region and a non-hand region;
and the gesture prediction part is used for obtaining hand gesture information corresponding to the target hand image according to the feature map and the predicted hand segmentation map.
Optionally, after determining, by the hand gesture recognition model, hand gesture information corresponding to the target hand image, the method further includes:
determining a hand skeleton model corresponding to the target hand image according to the hand posture information corresponding to the target hand image;
and determining a hand gesture corresponding to the target hand image based on the hand skeleton model.
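For illustration only, the following hedged sketch shows one plausible way to turn predicted key point coordinates into a simple skeleton-based gesture decision. The key point indices follow the 21-point layout described later in this document, while the helper names, finger triples and the angle threshold are assumptions for the sketch, not part of the patent.

```python
import numpy as np

# Hypothetical finger triples: (base joint, middle joint, tip) indices in the 21-point layout.
FINGER_TRIPLES = {"thumb": (2, 3, 4), "index": (5, 7, 8), "middle": (9, 11, 12),
                  "ring": (13, 15, 16), "little": (17, 19, 20)}

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def classify_fist_or_open(keypoints):
    """keypoints: (21, 2) array of predicted coordinates. Very rough heuristic."""
    bent = sum(joint_angle(keypoints[a], keypoints[b], keypoints[c]) < 120
               for a, b, c in FINGER_TRIPLES.values())
    return "fist" if bent >= 4 else "open hand"
```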
In another aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring a target video, wherein each image frame of the target video comprises a target user hand;
acquiring hand gesture information corresponding to each image frame through a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training a sample hand image, a synthesized hand segmentation image corresponding to the sample hand image and sample hand gesture information corresponding to the sample hand image, the synthesized hand segmentation image is acquired according to the sample hand gesture information, and the value of a pixel point in the synthesized hand segmentation image is used for representing the confidence coefficient that the pixel point belongs to a limb;
determining the hand postures corresponding to the image frames according to the hand posture information corresponding to the image frames;
and determining the gesture recognition result of the hand of the target user according to the hand gestures respectively corresponding to the image frames.
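As a hedged illustration of the video-level flow described above (not the patent's actual implementation), the per-frame loop might look like the following sketch; recognize_pose, pose_to_gesture and aggregate_gestures are hypothetical placeholders for the model call, the per-frame gesture decision and the cross-frame result determination.

```python
def process_target_video(frames, recognize_pose, pose_to_gesture, aggregate_gestures):
    """frames: iterable of image frames, each containing the target user's hand."""
    gestures = []
    for frame in frames:
        pose_info = recognize_pose(frame)            # hand pose info for this frame
        gestures.append(pose_to_gesture(pose_info))  # hand gesture for this frame
    # Combine the per-frame gestures into the final recognition result,
    # e.g. by majority vote or by matching the gesture sequence against a template.
    return aggregate_gestures(gestures)
```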
In another aspect, an embodiment of the present application provides a training apparatus for a hand gesture recognition model, the apparatus including:
a sample acquisition module, configured to acquire a training sample, where the training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image;
a confidence coefficient obtaining module, configured to obtain confidence coefficient maps corresponding to m segments of limbs respectively according to the sample hand posture information, where a value of a pixel point in the confidence coefficient map corresponding to the ith segment of limb is used to represent a confidence coefficient that the pixel point belongs to the ith segment of limb, m is a positive integer, and i is a positive integer less than or equal to m;
a segmentation map generation module, configured to generate a synthesized hand segmentation map corresponding to the sample hand image according to the confidence maps corresponding to the m limbs, where the synthesized hand segmentation map is an image obtained by segmenting a hand region and a non-hand region in the sample hand image, and a value of a pixel point in the synthesized hand segmentation map is used to represent a confidence that the pixel point belongs to a limb;
and the model training module is used for training the hand gesture recognition model by adopting the sample hand image, the synthesized hand part segmentation map and the sample hand gesture information.
In another aspect, an embodiment of the present application provides a hand gesture recognition apparatus, including:
the image acquisition module is used for acquiring a target hand image;
the model calling module is used for calling a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training with a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image and sample hand gesture information corresponding to the sample hand image, the synthesized hand segmentation map is obtained according to the sample hand gesture information, and the value of a pixel point in the synthesized hand segmentation map is used for representing the confidence that the pixel point belongs to a limb;
and the gesture determining module is used for determining hand gesture information corresponding to the target hand image through the hand gesture recognition model.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
the video acquisition module is used for acquiring a target video, and each image frame of the target video comprises a target user hand;
the information acquisition module is used for acquiring hand gesture information corresponding to each image frame through a hand gesture recognition model, the hand gesture recognition model is obtained by training a sample hand image, a synthesized hand segmentation graph corresponding to the sample hand image and sample hand gesture information corresponding to the sample hand image, the synthesized hand segmentation graph is acquired according to the sample hand gesture information, and the value of a pixel point in the synthesized hand segmentation graph is used for representing the confidence coefficient of the pixel point belonging to a limb;
the gesture determining module is used for determining the hand gestures respectively corresponding to the image frames according to the hand gesture information respectively corresponding to the image frames;
and the result determining module is used for determining the gesture recognition result of the hand of the target user according to the hand gestures respectively corresponding to the image frames.
In yet another aspect, embodiments of the present application provide a computer device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement a training method for a hand gesture recognition model according to the above aspect, or implement a hand gesture recognition method according to the above aspect, or implement an image processing method according to the above aspect.
In yet another aspect, embodiments of the present application provide a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement a method for training a hand gesture recognition model according to the above aspect, or to implement a method for hand gesture recognition according to the above aspect, or to implement a method for image processing according to the above aspect.
In yet another aspect, embodiments of the present application provide a computer program product, which when executed by a processor, is configured to implement the above-mentioned training method for a hand gesture recognition model, or implement the hand gesture recognition method according to the above-mentioned aspect, or implement the image processing method according to the above-mentioned aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
A synthesized hand segmentation map corresponding to the sample hand image is obtained according to the sample hand posture information, and the hand gesture recognition model is trained using the sample hand image, the synthesized hand segmentation map and the sample hand posture information. Compared with the prior art, in which manual labeling is needed to obtain a hand segmentation map and model training is carried out based on that map, the technical scheme automatically obtains the synthesized hand segmentation map once the sample hand posture information is acquired and trains the model based on it, without manual labeling, thereby reducing the labor cost and time cost required for model training.
Drawings
FIG. 1 illustrates a flow chart of a training method of a hand gesture recognition model provided herein;
FIG. 2 is a flow chart of a method for training a hand gesture recognition model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an expression of a composite hand segmentation map of the present application;
FIG. 4 is a flow chart of a method for training a hand gesture recognition model according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating exemplary hand keypoints of the present application;
FIG. 6 is a schematic diagram illustrating another synthetic hand segmentation representation of the present application;
FIG. 7 is a diagram illustrating a composite finger segmentation subgraph according to the present application;
FIG. 8 is a schematic diagram illustrating a composite palm segmentation subgraph according to the present application;
FIG. 9 is a schematic diagram illustrating a composite hand segmentation subgraph of one embodiment of the present application;
FIG. 10 illustrates a schematic diagram of a hand gesture recognition model;
FIG. 11 is a flow diagram of a method of hand gesture recognition provided by one embodiment of the present application;
FIG. 12 is a flow chart of an image processing method provided by an embodiment of the present application;
FIG. 13 is a flow chart illustrating an image processing method provided by an embodiment of the present application;
FIG. 14 is a flow chart illustrating an image processing method provided by another embodiment of the present application;
FIG. 15 is a flow chart illustrating an image processing method provided by another embodiment of the present application;
FIG. 16 is a block diagram of a training apparatus for a hand gesture recognition model provided by an embodiment of the present application;
FIG. 17 is a block diagram of a hand gesture recognition apparatus provided by one embodiment of the present application;
fig. 18 is a block diagram of an image processing apparatus according to an embodiment of the present application;
fig. 19 is a block diagram of a terminal according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
CV (Computer Vision) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers in place of human eyes to identify, track and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and map construction, and also include common biometric technologies such as face recognition and fingerprint recognition.
ML (Machine Learning) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiments of the application relates to AI technologies such as CV and ML, and provides a hand gesture recognition method that can be applied to the analysis and study of movement-disorder diseases, such as Parkinson's disease; to the field of gesture control, for example controlling the functions of smart devices through user gestures in a smart home system, or controlling functions and parameters in a vehicle through gestures in a driving assistance system; to motion-sensing game operation, for example operating an electronic game through gesture changes; and to the field of sign language recognition, for example converting the gestures of deaf-mute users into text and other information to enable barrier-free communication between deaf-mute people and others.
In the method provided by the embodiment of the present application, the execution subject of each step may be a Computer device, which refers to an electronic device with data calculation, processing and storage capabilities, such as a PC (Personal Computer) or a server. Alternatively, the computer device may be a terminal device, a medical device, a monitoring device, or the like.
Referring to fig. 1, a flow chart of a training method of a hand gesture recognition model provided by the present application is exemplarily shown. Firstly, a computer device can obtain training samples 100, wherein each training sample 100 comprises a sample hand image 101 and sample hand posture information 102 corresponding to the sample hand image 101; then, a composite hand segmentation chart 103 corresponding to the sample hand image is determined from the sample hand posture information 102. Optionally, the synthetic hand segmentation graph 103 may include a synthetic finger segmentation subgraph, a synthetic palm segmentation subgraph and a synthetic hand segmentation subgraph corresponding to each finger; then, the hand gesture model may be trained by using the sample hand image 101, the synthesized hand segmentation graph 103, and the sample hand gesture information 102, so as to obtain a trained hand gesture recognition model, and then the hand gesture information corresponding to the target hand image may be determined by using the hand gesture model. Optionally, the hand gesture recognition model includes a feature extraction part 104, a structure prediction part 105 and a gesture prediction part 106; a feature extraction section 104 for extracting a feature map of the sample hand image 101; a structure prediction part 105 for obtaining a predicted hand part segmentation map from the feature map; and the gesture prediction part 106 is used for obtaining predicted hand gesture information corresponding to the sample hand image according to the feature map and the predicted hand segmentation map.
The technical solution of the present application will be described below by means of several embodiments.
Referring to fig. 2, a flowchart of a training method of a hand gesture recognition model according to an embodiment of the present application is shown. In the present embodiment, the method is mainly exemplified by being applied to the computer device described above. The method may include the steps of:
step 201, a training sample is obtained, where the training sample includes a sample hand image and sample hand posture information corresponding to the sample hand image.
Before training the hand gesture recognition model, training samples need to be obtained, and there may be a plurality of training samples. Each training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image.
The sample hand image is an image including a hand, which is a limb portion including a palm and five fingers (thumb, index finger, middle finger, ring finger, and little finger). The sample hand image may be an image acquired by an image acquisition device (such as a camera, a video camera, a scanner, a medical device, a laser radar, etc.), may also be an image pre-stored locally, and may also be an image acquired from a network, which is not limited in this embodiment of the present application. The sample hand image may be a single image or an image frame in a video, which is not limited in the embodiment of the present application.
The hand posture information is used for reflecting the posture of the hand; the sample hand posture information is the labeled, accurate hand posture information. The sample hand posture information corresponding to the sample hand image may be represented by the real position information of the hand key points in the sample hand image, or by the angles between the limbs of the hand, and so on, which is not limited in the embodiments of the present application. Optionally, after the sample hand image is acquired, a person may manually label the real position information of the hand key points included in the sample hand image; the real position information of the hand key points included in the sample hand image may also be labeled by a related device (e.g., a computer device). The real position information refers to the accurate positions of the labeled key points.
Optionally, the sample hand pose information corresponding to the sample hand image may be represented by using coordinates, and when the sample hand image is a two-dimensional image (such as an RGB color image), the sample hand pose information may be represented by using two-dimensional coordinates; when the sample hand image is a depth image (including two-dimensional information and depth information), the sample hand pose information may also be characterized using three-dimensional coordinates. The embodiments of the present application do not limit this.
Step 202, obtaining confidence maps corresponding to m sections of limbs respectively according to the sample hand posture information, wherein m is a positive integer.
The m limbs refer to m parts of the hand divided according to joint structures. And the value of the pixel point in the confidence coefficient map corresponding to the ith limb is used for representing the confidence coefficient of the pixel point belonging to the ith limb, and i is a positive integer less than or equal to m.
Optionally, the value range of the confidence may be [0, 1], in which case the value range of the pixel points is also [0, 1].
The confidence may be determined using a probability density function, for example a Gaussian probability density function, to obtain the confidence S_LPM(p | L), that is:

$$S_{LPM}(p \mid L_{jk}) = \exp\left(-\frac{D(p, L_{jk})^{2}}{\sigma_{LPM}^{2}}\right)$$

where p_j denotes the j-th hand key point (j ≤ n), p_k denotes the k-th hand key point (k ≤ n), L_{jk} denotes the i-th limb connecting p_j and p_k, D(p, L_{jk}) denotes the distance from a pixel point p in the sample hand image to the i-th limb L_{jk}, and σ_LPM is a hyperparameter that adjusts the width of the Gaussian.
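For illustration only, the following is a minimal NumPy sketch of how such a Gaussian limb confidence map could be computed; the function and parameter names (limb_confidence_map, sigma_lpm) are hypothetical, and the formula follows the reconstruction above.

```python
import numpy as np

def point_to_segment_distance(points, a, b):
    """Distance from each pixel coordinate in `points` (N, 2) to the segment a-b."""
    ab = b - a
    denom = np.dot(ab, ab) + 1e-8
    # Projection parameter clamped to [0, 1] so distances are measured to the segment.
    t = np.clip(((points - a) @ ab) / denom, 0.0, 1.0)
    closest = a + t[:, None] * ab
    return np.linalg.norm(points - closest, axis=1)

def limb_confidence_map(height, width, kp_j, kp_k, sigma_lpm=8.0):
    """Confidence map for one limb: exp(-D(p, L_jk)^2 / sigma^2) for every pixel p."""
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    d = point_to_segment_distance(pixels, np.asarray(kp_j, np.float32),
                                  np.asarray(kp_k, np.float32))
    return np.exp(-(d ** 2) / (sigma_lpm ** 2)).reshape(height, width)

# Example: confidence map of one limb between two key points in a 368x368 image.
conf = limb_confidence_map(368, 368, kp_j=(100, 120), kp_k=(140, 180))
```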
In some other embodiments, the confidence level may also be determined in other manners, which is not limited in the embodiments of the present application.
The detailed process of obtaining the composite hand segmentation corresponding to the sample hand image according to the sample hand pose information is described in the following embodiments, and will not be described herein again.
And step 203, generating a synthesized hand part cutting image corresponding to the sample hand image according to the confidence level maps corresponding to the m sections of limbs respectively.
After the confidence maps corresponding to the m sections of limbs are obtained, the confidence maps corresponding to the m sections of limbs can be subjected to image fusion to generate a synthetic hand segmentation map. The composite hand segmentation image is an image obtained by dividing a hand region and a non-hand region in a sample hand image. The hand region is a region occupied by the hand in the sample hand image, and the non-hand region is a region other than the hand region in the sample hand image. And the values of the pixel points in the composite hand part segmentation graph are used for representing the confidence coefficient that the pixel points belong to the limbs.
For example, the confidence maps corresponding to the m limbs are spliced, that is, the hand limb structures corresponding to the m limbs are spliced, so as to obtain a synthesized hand segmentation map in which the whole hand region and the non-hand region are segmented.
For another example, the confidence maps corresponding to the multiple palm limbs in the m limbs may be spliced, that is, the hand limb structures corresponding to the respective palm limbs are spliced, so as to obtain a composite hand segmentation map obtained by segmenting a palm region and a non-palm region.
The values of the pixel points in the synthesized hand segmentation map are used for representing the confidence that the pixel points belong to a limb.
Illustratively, as shown in FIG. 3, for the limb 30 corresponding to hand key points k7 and k8, the values of the pixel points are determined according to the confidence that each pixel point belongs to the limb 30, and the confidence ranges over [0, 1]. The confidence of the pixel points inside the limb 30 is higher than the confidence of the pixel points outside the limb 30; for example, the confidence of pixel points inside the limb 30 may include 1, 0.99, 0.98, etc., and the confidence of pixel points outside the limb 30 may include 0, 0.01, 0.02, etc.
The confidence map is a probabilistic representation, which better matches the appearance of the hand under complex backgrounds, self-occlusion or motion blur, and thus improves the accuracy of the hand segmentation map.
And step 204, training the hand gesture recognition model by adopting the sample hand image, the synthesized hand segmentation image and the sample hand gesture information.
After the synthesized hand segmentation map is acquired, the hand gesture recognition model can be further trained by using the sample hand image, the synthesized hand segmentation map and the sample hand gesture information.
Optionally, the hand gesture recognition model may include at least two parts, one of which is used for acquiring a hand segmentation image and the other of which is used for recognizing a hand gesture, and the hand gesture recognition is also referred to as hand gesture recognition.
For example, taking the hand gesture recognition model as a Mask-Pose Cascaded CNN, during training one part of the model learns the hand segmentation map, and the learned hand segmentation map is then used to further learn the hand gesture information.
Optionally, when the training stopping condition is met, stopping training the hand gesture recognition model, and obtaining the hand gesture recognition model after training.
To sum up, the technical scheme provided by the embodiments of the application obtains the synthesized hand segmentation map corresponding to the sample hand image according to the sample hand posture information, and trains the hand gesture recognition model using the sample hand image, the synthesized hand segmentation map and the sample hand posture information. Compared with the prior art, in which manual labeling is needed to obtain a hand segmentation map and model training is carried out based on that map, the technical scheme automatically obtains the synthesized hand segmentation map once the sample hand posture information is acquired and trains the model based on it, without manual labeling, thereby reducing the labor cost and time cost required for model training.
In addition, the synthesized hand segmentation map is obtained automatically; compared with obtaining a hand segmentation map by manual labeling, this removes the influence of human subjectivity on the final result of the hand gesture recognition model and improves the accuracy of its recognition results.
In addition, the synthesized hand segmentation map is generated using confidence maps. The confidence map is a probabilistic representation, which better matches the appearance of the hand under complex backgrounds, self-occlusion or motion blur, and improves the accuracy of the hand segmentation map.
Referring to fig. 4, a flowchart of a training method for a hand gesture recognition model according to another embodiment of the present application is shown. In the present embodiment, the method is mainly exemplified by being applied to the computer device described above. The method may include the steps of:
step 401, a training sample is obtained, where the training sample includes a sample hand image and sample hand posture information corresponding to the sample hand image.
This step is the same as or similar to the content of step 201 in the embodiment of fig. 2, and is not described here again.
Optionally, the sample hand pose information includes real position information of n hand key points in the sample hand image, where n is a positive integer.
The hand key points refer to key positions of hand skeletons. The hand key points may include: the tips of the individual fingers (including thumb, index finger, middle finger, ring finger, and pinky finger), the phalangeal joints of the individual fingers, and the like.
Illustratively, as shown in fig. 5, a schematic diagram of hand key points is exemplarily shown. The hand includes the carpal bones, 5 metacarpals and 5 phalanges (the thumb, index, middle, ring and little phalanges), where the 5 metacarpals are connected to the 5 phalanges respectively. The thumb phalanx is divided into 2 segments, the first and second segments in the direction from the fingertip toward the palm, while the index, middle, ring and little phalanges are each divided into 4 segments, the first, second, third and fourth segments in the direction from the fingertip toward the palm. In addition, in the direction toward the wrist, the metacarpal connected to the thumb phalanx can be further divided into a first metacarpal and a second metacarpal. The hand may include the following 21 key points: k0, the connection point of the carpal bones and the metacarpals; k1, the connection point of the first metacarpal and the second metacarpal; k2, the connection point of the second segment of the thumb phalanx and the first metacarpal; k3, the connection point of the first segment and the second segment of the thumb phalanx; k4, the tip of the thumb phalanx; k5, the connection point of the third segment of the index phalanx and the metacarpal connected to it; k6, the connection point of the second segment and the third segment of the index phalanx; k7, the connection point of the first segment and the second segment of the index phalanx; k8, the tip of the index phalanx; k9, the connection point of the third segment of the middle phalanx and the metacarpal connected to it; k10, the connection point of the second segment and the third segment of the middle phalanx; k11, the connection point of the first segment and the second segment of the middle phalanx; k12, the tip of the middle phalanx; k13, the connection point of the third segment of the ring phalanx and the metacarpal connected to it; k14, the connection point of the second segment and the third segment of the ring phalanx; k15, the connection point of the first segment and the second segment of the ring phalanx; k16, the tip of the ring phalanx; k17, the connection point of the third segment of the little phalanx and the metacarpal connected to it; k18, the connection point of the second segment and the third segment of the little phalanx; k19, the connection point of the first segment and the second segment of the little phalanx; k20, the tip of the little phalanx.
Step 402, according to the real position information of the n hand key points, connecting the n hand key points according to the joint structure to obtain m limbs, where m is a positive integer.
After the actual position information of the n hand key points is obtained, the n hand key points are connected by combining the actual hand joint structure, and therefore m limbs are obtained. Wherein, the joint structure is used for indicating the connection relation of the key positions of the hand skeleton.
Illustratively, with continued reference to FIG. 5, there are 21 hand key points, and the 21 hand key points are connected according to the hand joint structure to obtain limbs such as k0k1, k1k2, k3k4, k5k6, k7k8, k9k10, k11k12, k13k14, k15k16, k17k18, k18k19 and k19k20, 20 limbs in total.
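For reference only, one plausible listing of the 20 limbs implied by the 21 key points above is sketched below; the exact edge set is an assumption consistent with the joint structure and the 20-limb count described in this section, not a definitive list from the patent.

```python
# Hypothetical 20-edge hand skeleton over key points k0..k20 (pairs of indices).
HAND_LIMBS = [
    (0, 1), (1, 2), (2, 3), (3, 4),         # thumb chain
    (0, 5), (5, 6), (6, 7), (7, 8),         # index finger chain
    (0, 9), (9, 10), (10, 11), (11, 12),    # middle finger chain
    (0, 13), (13, 14), (14, 15), (15, 16),  # ring finger chain
    (0, 17), (17, 18), (18, 19), (19, 20),  # little finger chain
]
assert len(HAND_LIMBS) == 20
```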
Step 403, for the ith limb in the m limbs, calculating the distance from the pixel point in the sample hand image to the ith limb, where i is a positive integer less than or equal to m.
After the ith limb in the m limbs is determined, the distance from the pixel point in the sample hand image to the ith limb can be calculated.
Optionally, the distance from each pixel point in the sample hand image to the ith limb may be calculated, and the distance from a part of the pixel points in the sample hand image to the ith limb may also be calculated, which is not limited in the embodiment of the present application.
And step 404, generating a confidence map corresponding to the ith limb according to the distance.
And the value of the pixel point in the confidence coefficient map corresponding to the ith limb is used for representing the confidence coefficient of the pixel point belonging to the ith limb.
For the description of the confidence level, reference may be made to the content of step 202 in the embodiment of fig. 2, and details are not repeated here.
And 405, generating a synthesized hand part cutting image corresponding to the sample hand image according to the confidence level maps corresponding to the m limbs respectively.
The synthesized hand segmentation graph is an image obtained by segmenting a hand region and a non-hand region in a sample hand image, and the value of a pixel point in the synthesized hand segmentation graph is used for representing the confidence coefficient that the pixel point belongs to limbs.
This step is the same as or similar to the step 203 in the embodiment of fig. 2, and is not described here again.
Optionally, after obtaining the m limbs, the following steps may be further adopted to generate a composite hand segmentation chart:
(1) and for the ith limb in the m limbs, determining a target area taking the ith limb as a central axis as a limb unit corresponding to the ith limb, wherein i is a positive integer less than or equal to m.
The shape of the target area may be rectangular, square, oval, etc., which is not limited in the embodiments of the present application.
Illustratively, as shown in fig. 6, the target area 60 is a rectangle, which takes the ith limb as a central axis and has both sides with target widths. The target width may be set according to practical experience, and the embodiment of the present application is not limited thereto.
(2) And generating a synthetic hand part cutting chart according to the limb units corresponding to the m sections of limbs respectively.
The value of the pixel point in the synthetic hand segmentation graph is a first numerical value or a second numerical value, the first numerical value is used for representing that the pixel point belongs to the limb, and the second numerical value is used for representing that the pixel point does not belong to the limb.
Optionally, the first numerical value may be 1, the second numerical value may be 0, and for a pixel point in the composite hand segmentation graph, when the value of the pixel point is 1, it indicates that the pixel point belongs to a limb; when the value of the pixel point is 0, the pixel point does not belong to the limb. In some other embodiments, the first numerical value and the second numerical value may also be represented by other numerical values, which are not limited in the embodiments of the present application.
For example, taking the first value as 1 and the second value as 0, the limb unit may be expressed as follows:

$$S_{LDM}(p \mid L) = \begin{cases} 1, & p \in R(L, \sigma_{LDM}) \\ 0, & \text{otherwise} \end{cases}$$

where S_LDM(p | L) denotes the limb unit; L denotes the i-th limb; R(L, σ_LDM) denotes the rectangular target area whose central axis is the i-th limb L and whose width on each side of the central axis is σ_LDM; if a pixel point p is inside the rectangle its value is 1, and if it is outside the rectangle its value is 0.
The synthetic hand part segmentation graph is expressed by the first numerical value and the second numerical value, so that the expression of the synthetic hand part segmentation graph is simpler and more efficient, and the calculated amount in the synthetic hand part segmentation graph is less.
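A minimal sketch of the binary limb unit described above, assuming NumPy and hypothetical names: every pixel within the rectangle whose central axis is the limb and whose half-width is sigma_ldm gets value 1, all others 0.

```python
import numpy as np

def limb_unit_map(height, width, kp_j, kp_k, sigma_ldm=6.0):
    """Binary map: 1 inside the rectangle around segment kp_j-kp_k, else 0."""
    a, b = np.asarray(kp_j, np.float32), np.asarray(kp_k, np.float32)
    ab = b - a
    length = np.linalg.norm(ab) + 1e-8
    axis = ab / length                       # unit vector along the limb
    ys, xs = np.mgrid[0:height, 0:width]
    rel = np.stack([xs - a[0], ys - a[1]], axis=-1).astype(np.float32)
    along = rel @ axis                       # projection along the limb axis
    perp = np.abs(rel @ np.array([-axis[1], axis[0]], np.float32))  # distance from the axis
    inside = (along >= 0) & (along <= length) & (perp <= sigma_ldm)
    return inside.astype(np.float32)
```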
Optionally, the m limbs include a segments of finger limbs and b segments of palm limbs, and the synthesized hand segmentation map includes synthesized finger segmentation subgraphs, a synthesized palm segmentation subgraph and a synthesized hand segmentation subgraph, where a is a positive integer smaller than m and b is a positive integer smaller than m.
In this case, the above-mentioned generation of the synthetic hand segmentation map according to the m limbs may include the following steps:
(1) and respectively combining the finger limbs belonging to each finger in the a-section finger limbs to obtain a synthesized finger segmentation subgraph corresponding to each finger.
The synthesized finger segmentation subgraph corresponding to the target finger is an image segmented with a target finger region and a non-target finger region.
Illustratively, as shown in fig. 7, a schematic diagram of a composite finger segmentation subgraph is exemplarily shown. Fig. 7(a) shows a composite finger segmentation subgraph 71 corresponding to the thumb; FIG. 7(b) shows a composite finger segmentation sub-graph 72 corresponding to the index finger; fig. 7(c) shows a synthesized finger segmentation map 73 corresponding to the middle finger; FIG. 7(d) shows a composite finger segmentation subgraph 74 corresponding to a ring finger; fig. 7(e) shows a composite finger segmentation subgraph 75 corresponding to the little finger.
(2) And combining the b-section palm limbs to obtain a synthetic palm segmentation subgraph.
The synthetic palm segmentation subgraph refers to an image segmented with a palm region and a non-palm region.
Illustratively, as shown in fig. 8, a schematic diagram of a synthetic palm segmentation subgraph 81 is exemplarily shown.
(3) And combining the m limbs to obtain a synthetic hand segmentation subgraph.
The synthesized hand-divided subgraph is an image divided into a hand region and a non-hand region.
Illustratively, as shown in fig. 9, a schematic diagram of a composite hand segmentation subgraph 91 is exemplarily shown.
The above steps (1), (2) and (3) can be characterized by the following formulas:

$$S^{*}(p \mid g) = \max\left(S(p \mid L_1), S(p \mid L_2), \ldots, S(p \mid L_{|g|})\right)$$
$$g = \{L_1, L_2, \ldots, L_{|g|}\}$$

where S*(p | g) denotes the synthesized hand segmentation map; p denotes a pixel point; L_1 denotes the 1st limb, L_2 denotes the 2nd limb, and L_{|g|} denotes the |g|-th limb in the group g; S(p | L_1), S(p | L_2) and S(p | L_{|g|}) denote the confidence (or the limb unit) corresponding to the 1st, 2nd and |g|-th limbs, respectively.
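Under the same assumptions as the sketches above, the max-fusion of per-limb maps into a subgraph for a group g of limbs could be implemented as follows; limb_maps may hold either the Gaussian confidence maps or the binary limb unit maps, and the grouping shown in the comments is illustrative.

```python
import numpy as np

def fuse_group(limb_maps, group):
    """limb_maps: list of (H, W) per-limb maps; group: indices of the limbs in g."""
    return np.max(np.stack([limb_maps[i] for i in group], axis=0), axis=0)

# Hypothetical grouping: one subgraph per finger, one for the palm, one for the whole hand.
# thumb_map = fuse_group(limb_maps, group=[0, 1, 2, 3])
# hand_map  = fuse_group(limb_maps, group=list(range(len(limb_maps))))
```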
The values of the pixel points in the synthesized finger segmentation subgraph, the synthesized palm segmentation subgraph and the synthesized hand segmentation subgraph can be used for representing the confidence coefficient that the pixel points belong to limbs. For example, for a synthesized finger segmentation sub-graph corresponding to a target finger, the pixel value of a pixel point in the synthesized finger segmentation sub-graph is used for representing the confidence that the pixel point belongs to the target finger region; for another example, for a synthesized palm segmentation sub-graph, the pixel value of a pixel point in the synthesized palm segmentation sub-graph is used for representing the confidence that the pixel point belongs to a palm region; for example, for the synthesized hand-segmentation sub-graph, the pixel values of the pixel points in the synthesized hand-segmentation sub-graph are used to represent the confidence that the pixel points belong to the hand region.
In addition, the values of the pixel points in the synthesized finger-divided subgraph, the synthesized palm-divided subgraph and the synthesized hand-divided subgraph can also be a first value or a second value, the first value is used for representing that the pixel points belong to limbs, and the second value is used for representing that the pixel points do not belong to the limbs. Taking the first numerical value as 1 and the second numerical value as 0 as an example, regarding the synthesized finger segmentation subgraph corresponding to the target finger, the pixel values of the pixel points in the synthesized finger segmentation subgraph include 1 and 0, when the value of the pixel point is 1, the pixel point is represented to belong to the target finger region, and when the value of the pixel point is 0, the pixel point is represented not to belong to the target finger region; for the synthesized palm segmentation subgraph, the pixel value of a pixel point in the synthesized palm segmentation subgraph comprises 1 and 0, when the value of the pixel point is 1, the pixel point is represented to belong to the palm region, and when the value of the pixel point is 0, the pixel point is represented not to belong to the palm region; for the synthesized hand segmentation subgraph, the pixel value of the pixel point in the synthesized hand segmentation subgraph comprises 1 and 0, when the value of the pixel point is 1, the pixel point is represented to belong to the hand region, and when the value of the pixel point is 0, the pixel point is represented not to belong to the hand region.
In the embodiment of the application, the hand segmentation graph comprises a finger segmentation subgraph, a palm segmentation subgraph and a hand segmentation subgraph of each finger, so that the overall structure and the detail structure of the hand can be obtained, and the prediction accuracy of the hand gesture recognition model obtained based on the training is further improved.
After the synthesized hand segmentation map is obtained, the hand gesture recognition model can be further trained by using the sample hand image, the synthesized hand segmentation map and the sample hand gesture information.
Optionally, the hand gesture recognition model includes a feature extraction part, a structure prediction part, and a gesture prediction part. The characteristic extraction part is used for extracting a characteristic diagram of the sample hand image; the structure prediction part is used for acquiring a predicted hand part segmentation graph according to the feature graph, and the predicted hand part segmentation graph comprises a predicted finger segmentation subgraph, a predicted palm segmentation subgraph and a predicted hand segmentation subgraph; and the gesture prediction part is used for obtaining predicted hand gesture information corresponding to the sample hand image according to the feature map and the predicted hand segmentation map.
Alternatively, the feature extraction part, the structure prediction part and the posture prediction part may be a Residual network (Residual Net), a Stacked Hourglass network (Stacked Hourglass Net), or the like, which is not limited in this embodiment of the present application.
Alternatively, the size of the image input to the feature extraction section may be adjusted to a target size, such as 368 × 368.
The above feature map is used to characterize abstract features of a sample hand image. The predicted finger segmentation subgraph is obtained by predicting a hand gesture recognition model, and is segmented into an image of a target finger region and a non-target finger region, wherein the target finger can be any finger; the predicted palm segmentation subgraph is obtained by predicting a hand posture recognition model and is segmented into an image with a palm region and a non-palm region; the predicted hand segmentation subgraph is obtained by predicting through a hand gesture recognition model, and is an image obtained by segmenting a hand region and a non-hand region.
Illustratively, as shown in fig. 10, a schematic diagram of a hand gesture recognition model is exemplarily shown. The hand gesture recognition model may include a feature extraction portion 104, a structure prediction portion 105, and a gesture prediction portion 106. Using VGG-19 as the feature extraction part 104, its output may be a 128-channel feature map. This is followed by 6 DCNN stages, each including 5 convolutional layers with 7 × 7 kernels and 2 convolutional layers with 1 × 1 kernels; the first 3 stages serve as the structure prediction part 105 and the last 3 stages serve as the pose prediction part 106.
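The following PyTorch-style sketch illustrates the three-part layout described above (feature extraction, structure prediction, pose prediction). The layer counts and channel sizes are taken from this paragraph; everything else, including the class names, the VGG-19 truncation point, the exact wiring of stage inputs and the choice of 7 subgraphs (five fingers, palm, whole hand), is a simplifying assumption rather than the patent's definitive architecture.

```python
import torch
import torch.nn as nn
import torchvision

def make_stage(in_ch, out_ch):
    """One prediction stage: five 7x7 conv layers followed by two 1x1 conv layers."""
    layers, ch = [], in_ch
    for _ in range(5):
        layers += [nn.Conv2d(ch, 128, kernel_size=7, padding=3), nn.ReLU(inplace=True)]
        ch = 128
    layers += [nn.Conv2d(128, 128, kernel_size=1), nn.ReLU(inplace=True),
               nn.Conv2d(128, out_ch, kernel_size=1)]
    return nn.Sequential(*layers)

class HandPoseNet(nn.Module):
    def __init__(self, num_subgraphs=7, num_keypoints=21):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=None).features[:23]  # truncated VGG-19
        self.features = nn.Sequential(vgg, nn.Conv2d(512, 128, kernel_size=3, padding=1))
        # First 3 stages predict hand segmentation subgraphs (structure prediction part).
        self.structure_stages = nn.ModuleList(
            [make_stage(128 if t == 0 else 128 + num_subgraphs, num_subgraphs)
             for t in range(3)])
        # Last 3 stages predict keypoint confidence maps (pose prediction part).
        self.pose_stages = nn.ModuleList(
            [make_stage(128 + num_subgraphs if t == 0 else 128 + num_keypoints, num_keypoints)
             for t in range(3)])

    def forward(self, x):
        f = self.features(x)
        seg_outs, pose_outs, prev = [], [], None
        for t, stage in enumerate(self.structure_stages):
            prev = stage(f if t == 0 else torch.cat([f, prev], dim=1))
            seg_outs.append(prev)
        for stage in self.pose_stages:
            prev = stage(torch.cat([f, prev], dim=1))
            pose_outs.append(prev)
        return seg_outs, pose_outs
```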
And 406, acquiring a value of the structure loss function according to the synthesized hand part segmentation graph and the predicted hand part segmentation graph.
The above-described structure loss function is a loss function corresponding to a structure prediction section, and the value of the structure loss function is used to characterize the difference between the synthetic hand segmentation and the predicted hand segmentation.
Optionally, the obtaining the value of the structure loss function according to the synthesized hand part segmentation map and the predicted hand part segmentation map may include: and obtaining the value of the structure loss function according to the synthesized finger segmentation subgraph, the predicted finger segmentation subgraph, the synthesized palm segmentation subgraph, the predicted palm segmentation subgraph, the synthesized hand segmentation subgraph and the predicted hand segmentation subgraph.
Illustratively, the value L_S of the structure loss function can be calculated using the following formula:

$$L_S = \sum_{t} \sum_{g \in G} \sum_{p} \left\| \hat{S}_t(p \mid g) - S^{*}(p \mid g) \right\|_2^2$$

where t denotes the stage of the structure prediction part; G denotes the set consisting of the finger segmentation subgraphs corresponding to the respective fingers, the palm segmentation subgraph and the hand segmentation subgraph; g denotes any element of that set; p denotes a pixel point; Ŝ_t(p | g) denotes the predicted hand segmentation map at stage t; and S*(p | g) denotes the synthesized hand segmentation map.
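A hedged sketch of the structure loss as reconstructed above, assuming the predicted subgraphs of each stage and the synthesized target share the same tensor shape (num_subgraphs, H, W); the names are illustrative.

```python
import torch

def structure_loss(pred_seg_stages, target_seg):
    """Sum of squared differences over stages t, subgraphs g and pixels p."""
    # Assumes every element of pred_seg_stages has the same shape as target_seg.
    return sum(((pred - target_seg) ** 2).sum() for pred in pred_seg_stages)
```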
Step 407, obtaining a value of the gesture loss function according to the sample hand gesture information and the predicted hand gesture information.
The above-described pose loss function is a loss function corresponding to a pose prediction section, which is used to characterize the difference between the sample hand pose information and the predicted hand pose information.
Illustratively, the value L of the above-described attitude loss functionKThe following formula can be used to calculate:
Figure BDA0002368256110000163
k, the total number of the key points of the hand, and t, the stage of the posture prediction part; k represents any one key point; p represents a pixel point; c*(p | k) represents sample hand pose information;
Figure BDA0002368256110000164
the representation gesture module predicts hand gesture information.
The above C*(p|k) can be expressed as:

C^{*}(p \mid k) = \exp\!\left( - \frac{\left\| p - x_{k}^{*} \right\|_2^2}{\sigma_{KCM}^{2}} \right)

wherein x_k^* represents the real coordinates of the k-th key point, and σ_KCM is a hyperparameter that adjusts the width of the Gaussian and can be set to 1.
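A minimal sketch of building the ground-truth key-point confidence maps C*(p|k) under the Gaussian form above; the image size and the σ_KCM value are illustrative:

```python
import numpy as np

def keypoint_confidence_maps(keypoints, height, width, sigma_kcm=1.0):
    """One Gaussian confidence map per hand key point.

    keypoints: array of shape (K, 2), the real (x, y) coordinate of each key point
    returns:   array of shape (K, height, width) with values in (0, 1]
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (kx, ky) in enumerate(keypoints):
        d2 = (xs - kx) ** 2 + (ys - ky) ** 2        # squared distance of every pixel to key point k
        maps[k] = np.exp(-d2 / (sigma_kcm ** 2))    # Gaussian centred on the key point
    return maps
```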
Step 408, obtaining a value of the target loss function according to the value of the structure loss function and the value of the pose loss function.
After the values of the structure loss function and the pose loss function are obtained, the value of the target loss function of the hand gesture recognition model can be further determined.
Optionally, the obtaining a value of the target loss function according to the value of the structure loss function and the value of the pose loss function may include: determining the value of the target loss function according to the value of the structure loss function, the value of the pose loss function, and the respective weights of these two values.
Illustratively, the value L of the above target loss function may be expressed as:

L = \lambda_1 L_S + \lambda_2 L_K

wherein G1 denotes the configuration in which the hand segmentation map includes only the hand segmentation subgraph, and G1&G6 denotes the configuration in which the hand segmentation map includes the finger segmentation subgraphs corresponding to the respective fingers, the palm segmentation subgraph and the hand segmentation subgraph; L_S represents the value of the structure loss function; L_K represents the value of the pose loss function; λ_1 represents the weight of the value of the structure loss function; and λ_2 represents the weight of the value of the pose loss function.
Optionally, the respective weights of the value of the structure loss function and the value of the pose loss function may be adjusted according to the prediction accuracy of the model on the validation set.
Step 409, adjusting parameters of the hand gesture recognition model according to the value of the target loss function.
After the value of the target loss function is obtained, the parameters of the hand gesture recognition model can be further adjusted through the value of the target loss function.
Optionally, when the value of the target loss function meets the condition, stopping adjusting parameters of the hand gesture recognition model to obtain the trained hand gesture recognition model.
Further, when the value of the target loss function is smaller than a preset value, stopping adjusting the parameters of the hand gesture recognition model.
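Putting steps 406 to 409 together, one training iteration with the threshold-based stopping check might look as follows. The weighted-sum target loss, the optimizer, the model interface and the preset threshold are all illustrative assumptions:

```python
import torch

def train_step(model, optimizer, image, target_seg, target_pose,
               lambda_s=1.0, lambda_k=1.0):
    """One optimization step: compute the structure, pose and target losses,
    back-propagate and update the model parameters."""
    seg_stages, pose_stages = model(image)
    loss_s = sum(((s - target_seg) ** 2).sum() for s in seg_stages)    # structure loss
    loss_k = sum(((c - target_pose) ** 2).sum() for c in pose_stages)  # pose loss
    loss = lambda_s * loss_s + lambda_k * loss_k                       # target loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Training stops once the target loss falls below a preset value (illustrative):
# while train_step(model, optimizer, image, target_seg, target_pose) >= PRESET_VALUE:
#     pass
```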
To sum up, in the technical solution provided by this embodiment of the application, a plurality of limb segments are obtained from the sample hand pose information, a synthesized hand segmentation map is generated from these limb segments, the value of the structure loss function is obtained from the synthesized hand segmentation map and the predicted hand segmentation map, the value of the pose loss function is obtained from the sample hand pose information and the predicted hand pose information, the value of the target loss function is then obtained, and the parameters of the hand gesture recognition model are adjusted using the value of the target loss function. In the prior art, by contrast, a hand segmentation map must be manually labeled and the model is trained based on that manually labeled map. In the technical solution of the application, after the sample hand pose information is acquired, the synthesized hand segmentation map is obtained automatically without manual labeling, and the model is trained based on the synthesized hand segmentation map, thereby reducing the labor cost and time cost required for model training.
In addition, in this embodiment of the application, the hand segmentation map includes a finger segmentation subgraph corresponding to each finger, a palm segmentation subgraph and a hand segmentation subgraph, so that both the overall structure and the detailed structure of the hand can be captured, further improving the prediction accuracy of the hand gesture recognition model obtained by training.
Referring to fig. 11, a flowchart of a hand gesture recognition method provided by an embodiment of the present application is shown. In the present embodiment, the method is mainly exemplified by being applied to the computer device described above. The method may include the steps of:
step 1101, a target hand image is acquired.
Step 1102, calling a hand gesture recognition model.
The hand gesture recognition model is obtained by training sample hand images, synthesized hand segmentation graphs corresponding to the sample hand images and sample hand gesture information corresponding to the sample hand images, wherein the synthesized hand segmentation graphs are obtained according to the sample hand gesture information, and the values of the pixel points in the synthesized hand segmentation graphs are used for representing the confidence coefficients that the pixel points belong to limbs.
The hand gesture recognition model comprises a feature extraction part, a structure prediction part and a gesture prediction part. The characteristic extraction part is used for extracting a characteristic diagram of the target hand image; the structure prediction part is used for acquiring a predicted hand segmentation graph according to the characteristic graph, wherein the predicted hand segmentation graph comprises a predicted finger segmentation subgraph, a predicted palm segmentation subgraph and a predicted hand segmentation subgraph which are respectively corresponding to each finger, the predicted finger segmentation subgraph corresponding to the target finger is an image segmented with a target finger region and a non-target finger region, the predicted palm segmentation subgraph is an image segmented with a palm region and a non-palm region, and the predicted hand segmentation subgraph is an image segmented with a hand region and a non-hand region; and the gesture prediction part is used for obtaining hand gesture information corresponding to the target hand image according to the feature map and the predicted hand segmentation map.
The training process of the hand gesture recognition model is described in detail above, and is not described in detail here.
Step 1103, determining hand gesture information corresponding to the target hand image through the hand gesture recognition model.
Based on the hand gesture recognition model, the hand pose information corresponding to the target hand image can be further determined. The hand pose information is used to reflect the pose of the hand; the hand pose information corresponding to the target hand image can be represented by the position information of hand key points in the target hand image, or by the limb segments of the hand and the angles between these limb segments, and the like.
Optionally, the hand pose information corresponding to the target hand image may be represented by coordinates, and when the target hand image is a two-dimensional image (such as an RGB color image), the hand pose information corresponding to the target hand image may be represented by two-dimensional coordinates; when the target hand image is a depth image (including two-dimensional information and depth information), the hand pose information corresponding to the target hand image may also be characterized using three-dimensional coordinates. The embodiments of the present application do not limit this.
Optionally, after determining the hand posture information corresponding to the target hand image through the hand posture recognition model, the method may further include the following steps:
(1) and determining a hand skeleton model corresponding to the target hand image according to the hand posture information corresponding to the target hand image.
After determining the hand posture information corresponding to the target hand image, a hand skeleton model can be further determined.
Alternatively, when the hand posture information is represented by two-dimensional coordinates of a key point, after the two-dimensional coordinates of the key point are acquired, three-dimensional coordinates of the key point may be acquired based on the two-dimensional coordinates of the key point, and further, the hand skeleton model may be determined based on the three-dimensional coordinates of the key point.
(2) Determining a hand gesture corresponding to the target hand image based on the hand skeleton model.
After the hand skeleton model is determined, the hand skeleton model corresponds to a hand posture, so that after the hand skeleton model is determined, the hand posture corresponding to the target hand image can be determined based on the hand skeleton model.
To sum up, in the technical solution provided by this embodiment of the application, after the target hand image is acquired, the hand pose information corresponding to the target hand image is determined by calling the hand gesture recognition model. The hand gesture recognition model is obtained by training with the sample hand image, the synthesized hand segmentation map corresponding to the sample hand image and the sample hand pose information corresponding to the sample hand image. In the prior art, by contrast, a hand segmentation map must be manually labeled and the model is trained based on that manually labeled map. In the technical solution of the application, after the sample hand pose information is acquired, the synthesized hand segmentation map is obtained automatically without manual labeling, and the model is trained based on the synthesized hand segmentation map, thereby reducing the labor cost and time cost required for model training.
In addition, in this embodiment of the application, the predicted hand segmentation map includes predicted finger segmentation subgraphs corresponding to the respective fingers, a predicted palm segmentation subgraph and a predicted hand segmentation subgraph, so that both the overall structure and the detailed structure of the hand can be captured, further improving the accuracy of the hand pose information determined by the hand gesture recognition model.
The beneficial effects of the present solution are further illustrated below by testing it on two public two-dimensional hand keypoint datasets:
The two datasets are OneHand10K and Panoptic, respectively.
The OneHand10K dataset contains 11703 hand images (divided into a training set and a test set) taken in natural scenes, each labeled with hand key points and a hand segmentation map. The keypoint recognition accuracy is expressed by PCK (Percentage of Correct Keypoints); a sketch of how PCK can be computed is given after Table-1. The PCK mean values are shown in Table-1 below:
Model | PCK mean value | Improvement
CPM | 87.06 | -
LDM-G1 | 87.64 | +0.59 (+0.67%)
LPM-G1 | 88.07 | +1.02 (+1.17%)
RMask | 88.05 | +0.99 (+1.14%)
Table-1
In table-1 above, the first column represents the model employed, the second column represents the PCK mean value, and the third column represents the prediction accuracy improvement rate.
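PCK counts a predicted key point as correct when it lies within a threshold distance of the ground truth; the mean over key points and samples gives the values reported above. The normalization length and threshold below are illustrative assumptions, since the exact evaluation protocol differs per dataset:

```python
import numpy as np

def pck(pred, gt, ref_size, threshold=0.2):
    """Percentage of Correct Keypoints.

    pred, gt:  arrays of shape (N, K, 2), predicted / ground-truth key-point coordinates
    ref_size:  array of shape (N,), a per-sample normalization length (e.g. bounding-box side)
    threshold: a key point counts as correct if its error is below threshold * ref_size
    """
    errors = np.linalg.norm(pred - gt, axis=-1)            # (N, K) distances in pixels
    correct = errors < threshold * ref_size[:, None]       # (N, K) boolean matrix
    return 100.0 * correct.mean()
```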
The Panoptic dataset contains 14817 images of people captured in the Panoptic Studio, each annotated with 21 key points of the right hand, with high labeling quality. The dataset is randomly divided into a training set, a validation set and a test set, accounting for 80%, 10% and 10% of the data respectively. The PCK mean values are shown in Table-2 below:
Model | PCK mean value | Improvement
CPM | 76.94 | -
LDM-G1 | 79.14 | +2.20 (+2.86%)
LDM-G1&G6 | 79.32 | +2.38 (+3.09%)
LPM-G1 | 79.78 | +2.84 (+3.69%)
LPM-G1&G6 | 80.03 | +3.09 (+4.01%)
Table-2
Combining the results shown in Table-1 and Table-2 above, it can be seen that:
1) the hand gesture recognition model provided by the embodiments of the application improves the prediction accuracy over the baseline CPM (Convolutional Pose Machines) model;
2) the prediction accuracy of the model when the hand segmentation map is represented by an LPM is better than when it is represented by an LDM;
3) compared with using only the hand segmentation subgraph (namely G1), additionally using the finger segmentation subgraphs corresponding to the respective fingers and the palm segmentation subgraph (namely G1&G6) captures both the overall structure and the detailed structure of the hand, thereby improving the accuracy of the hand pose information determined by the hand gesture recognition model.
Referring to fig. 12, a flowchart of an image processing method according to an embodiment of the present application is shown. In the present embodiment, the method is mainly applied to the computer device described above for example, and the computer device may be a medical device. The method may include the steps of:
step 1201, acquiring a target video.
Each image frame of the target video includes a target user hand.
The target video may be obtained by shooting a hand of a target user by the computer device, or may be obtained by the computer device from a network, which is not limited in this embodiment of the application.
Step 1202, acquiring hand gesture information corresponding to each image frame through the hand gesture recognition model.
After the target video is acquired, further, hand gesture information corresponding to each image frame can be acquired through a hand gesture recognition model.
The hand gesture recognition model is obtained by training a sample hand image, a synthesized hand segmentation image corresponding to the sample hand image and sample hand gesture information corresponding to the sample hand image, wherein the synthesized hand segmentation image is obtained according to the sample hand gesture information, and the value of a pixel point in the synthesized hand segmentation image is used for representing the confidence coefficient that the pixel point belongs to a limb.
Step 1203, determining hand postures corresponding to the image frames according to the hand posture information corresponding to the image frames.
How to determine the hand gesture from the hand gesture information has been described above and will not be described in detail here.
Step 1204, determining a gesture recognition result of the target user's hand according to the hand gestures respectively corresponding to the image frames.
After the hand gestures corresponding to the respective image frames are determined, the gesture recognition result of the target user's hand can be further determined.
It should be noted that, when the image processing method is applied to different scenes, the meaning represented by the gesture recognition result is also different. For example, when the image processing method is applied to auxiliary analysis of motion disorder, the gesture recognition result may be a hand motion evaluation index, and the hand motion evaluation index is used for representing the motion disorder degree of the target user hand; when the image processing method is applied to sign language recognition, the gesture recognition result can be target semantic information, and the target semantic information is used for representing the meaning expressed by the hand of the target user in the target video; when the image processing method is applied to gesture control, the gesture recognition result may be a target control instruction for controlling a target device to perform a corresponding action.
In some other embodiments, the gesture recognition result may also have other meanings, which is not limited in this application.
To sum up, according to the technical scheme provided by the embodiment of the application, after the target video obtained by shooting the hand of the target user is obtained, the hand gesture information corresponding to each image frame is obtained through the hand gesture recognition model, and the gesture recognition result of the hand of the target user is determined according to the hand gesture information corresponding to each image frame, so that the user can realize different functions in different application scenes based on the gesture recognition result.
The application of the above image processing method and the corresponding gesture recognition result will be described below by several embodiments.
Referring to fig. 13, a flowchart of an image processing method according to an embodiment of the present application is shown. In the present embodiment, the method is mainly applied to the computer device described above for example, and the computer device may be a medical device. The method may include the steps of:
step 1301, a target video is obtained.
Each image frame of the target video includes the hand of a target user, who is a patient with a movement disorder. The movement disorder can be Parkinson's disease, diffuse Lewy body disease, essential tremor, hepatolenticular degeneration and the like, which is not limited in the embodiments of the application.
In the target video, the patient performs the required hand actions according to actual needs, such as opening-and-closing movements of the thumb and the index finger.
Optionally, the computer device may invoke an image acquisition device (such as a camera, a video camera, a medical device, etc.) to capture the dyskinetic patient to obtain the target video; in addition, the computer device may further obtain the target video from the network, which is not limited in this embodiment of the application.
Step 1302, acquiring hand gesture information corresponding to each image frame through the hand gesture recognition model.
After the target video is acquired, further, hand gesture information corresponding to each image frame in the target video can be acquired through a hand gesture recognition model.
The hand gesture recognition model is obtained by training sample hand images, synthetic hand part segmentation maps corresponding to the sample hand images and sample hand gesture information corresponding to the sample hand images, wherein the synthetic hand part segmentation maps are obtained according to the sample hand gesture information.
Step 1303, determining the hand postures corresponding to the image frames according to the hand posture information corresponding to the image frames.
After the hand posture information corresponding to each image frame is acquired, the hand posture corresponding to each image frame can be further determined.
How to determine the hand gesture from the hand gesture information has been described above and will not be described in detail here.
Step 1304, determining hand motion characteristic information according to the hand postures corresponding to the image frames.
Each image frame corresponds to a hand gesture. When a target parameter (such as the distance between fingers or the speed of finger movement) is used to represent the motion characteristic, the target parameter corresponding to each image frame can be determined from the hand gesture corresponding to that image frame in the target video, and the hand motion characteristic information can then be obtained based on the target parameters of all image frames. The hand motion characteristic information is used to represent the motion characteristics of the target user's hand, and these motion characteristics can be represented by the target parameter.
Taking the example in which the thumb and the index finger of the target user's hand perform opening-and-closing motions in the target video, the hand motion characteristic information can represent how the distance between the thumb and the index finger changes over time. It can be represented by a waveform diagram whose abscissa is time and whose ordinate is the distance between the thumb and the index finger. The flexibility of the patient's hand movement can be observed from this hand motion characteristic information.
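For this opening-and-closing example, the hand motion characteristic information reduces to a time series of thumb-to-index-finger distances; a minimal sketch is given below, where the fingertip indices follow a commonly used 21-key-point hand layout and are assumptions:

```python
import numpy as np

THUMB_TIP, INDEX_TIP = 4, 8    # assumed indices in a 21-key-point hand layout

def thumb_index_distance_series(keypoints_per_frame, fps):
    """Distance between the thumb tip and the index fingertip for every frame.

    keypoints_per_frame: array of shape (T, 21, 2), the hand pose of each image frame
    returns: (times, distances), two arrays of shape (T,) suitable for a waveform plot
    """
    kp = np.asarray(keypoints_per_frame, dtype=np.float32)
    distances = np.linalg.norm(kp[:, THUMB_TIP] - kp[:, INDEX_TIP], axis=-1)
    times = np.arange(len(kp)) / float(fps)
    return times, distances
```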
Step 1305, determining a hand motion evaluation index of the target user hand according to the hand motion characteristic information.
After the hand motion characteristic information is determined, a hand motion evaluation index may be further determined based on the hand motion characteristic information.
The hand movement evaluation index is used for representing the hand movement disorder degree of the patient with the movement disorder. Alternatively, the hand movement evaluation index may be expressed in the form of a score.
Optionally, after the hand motion characteristic information is determined, an evaluation index prediction model for predicting the hand motion evaluation index from the hand motion characteristic information may be called to determine the hand motion evaluation index. The embodiments of the application do not limit the architecture of the evaluation index prediction model.
Step 1306, displaying the hand motion evaluation index.
After determining the hand motion evaluation index, the computer device may display the hand motion evaluation index, so that the medical staff may analyze the degree and condition of the hand motion disorder of the dyskinetic patient in combination with the hand motion evaluation index to determine a further treatment plan.
To sum up, the technical scheme provided by the embodiment of the application can acquire the hand posture information corresponding to each image frame through the hand posture recognition model after acquiring the target video obtained by shooting the target hand of the patient with dyskinesia, and determine the hand movement evaluation index according to the hand posture information corresponding to each image frame, so that medical personnel can analyze the degree and condition of the hand dyskinesia of the patient with dyskinesia by combining the hand movement evaluation index and determine a further treatment scheme. Compared with the related art, according to the technical scheme provided by the embodiment of the application, on one hand, the computer equipment can automatically acquire the hand movement evaluation index to assist the analysis of medical staff, so that the diagnosis time of the medical staff is saved; on the other hand, the interference of subjective factors can be reduced through the result obtained by the computer equipment, so that the result is more objective and more robust.
Referring to fig. 14, a flowchart of an image processing method according to another embodiment of the present application is schematically shown. In the present embodiment, the method is mainly exemplified by being applied to the computer device described above, for example, the computer device may be a wearable device (e.g., AR glasses). The method may include the steps of:
Step 1401, acquiring a target video.
The target video includes the hand of a target user who communicates in sign language.
The user collects, through a computer device such as a camera on a wearable device or an intelligent terminal device, images of the sign language user communicating in sign language.
Step 1402, acquiring hand gesture information corresponding to each image frame in the target video through the hand gesture recognition model.
This step is the same as or similar to the step 1302 in the embodiment of fig. 13, and is not repeated here.
Step 1403, determining the hand postures corresponding to the image frames according to the hand posture information corresponding to the image frames.
This step is the same as or similar to the step 1303 in the embodiment of fig. 13, and is not repeated here.
Step 1404, determining semantic information corresponding to each image frame according to the hand gesture corresponding to each image frame.
After the hand postures corresponding to the image frames are obtained, semantic information corresponding to the image frames can be determined, wherein the semantic information corresponding to the target image frame is used for representing meanings expressed by the hand postures corresponding to the target image frame.
Exemplarily, it is assumed that the target video includes three image frames: a first image frame, a second image frame, and a third image frame. The meaning represented by the hand gesture corresponding to the first image frame is "me", that is, the semantic information corresponding to the first image frame is "me"; the meaning represented by the hand gesture corresponding to the second image frame is "love", that is, the semantic information corresponding to the second image frame is "love"; and the meaning represented by the hand gesture corresponding to the third image frame is "my country", that is, the semantic information corresponding to the third image frame is "my country".
In a possible implementation manner, the computer device may store a plurality of correspondence relationships between the hand postures and the semantic information in advance, so that after the computer device acquires the hand posture corresponding to a certain image frame, the semantic information corresponding to the hand posture of the image frame, that is, the semantic information corresponding to the image frame, may be further determined according to the correspondence relationships.
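The stored-correspondence implementation can be sketched as a simple lookup table; the gesture labels and phrases below are purely illustrative:

```python
# Illustrative correspondence between recognized hand postures and semantic information.
GESTURE_TO_SEMANTICS = {
    "point_to_self": "me",
    "hands_crossed_on_chest": "love",
    "flat_palm_forward": "my country",
}

def frames_to_semantics(gesture_labels):
    """Map per-frame gesture labels to semantic tokens and join them into a sentence."""
    tokens = [GESTURE_TO_SEMANTICS[g] for g in gesture_labels if g in GESTURE_TO_SEMANTICS]
    return " ".join(tokens)
```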
In another possible implementation manner, after the hand gestures corresponding to the image frames are obtained, a sign language recognition model may be called to determine semantic information corresponding to the image frames. The sign language recognition model is used for determining semantic information corresponding to the hand gesture according to the hand gesture. Optionally, the sign language recognition model is trained by semantic information that includes images of standard sign language actions and corresponds to the standard sign language actions.
Step 1405, determining the semantic information corresponding to the hand of the target user based on the semantic information corresponding to each image frame.
Further, after semantic information corresponding to each image frame is determined, semantic information corresponding to the hand of the target user may be determined. The target semantic information is used for representing meaning expressed by the target user hand in the target video.
Exemplarily, assuming that the target video includes three image frames, a first, a second and a third image frame, where the semantic information corresponding to the first image frame is "me", the semantic information corresponding to the second image frame is "love", and the semantic information corresponding to the third image frame is "my country", it may be determined that the target semantic information corresponding to the hand of the target user in the target video is "I love my country".
Optionally, the target semantic information may be presented to the user as text information or in the form of voice, which is not limited in the embodiments of the application. For example, when the computer device is a pair of AR glasses, the target semantic information may be directly presented as text in the AR glasses, and the user may view it directly.
To sum up, in the technical solution provided by this embodiment of the application, the hand gesture information corresponding to each image frame in the target video is obtained through the hand gesture recognition model, the hand gesture corresponding to each image frame is then obtained, the semantic information corresponding to each image frame can be determined, and the target semantic information corresponding to the target video is finally determined. With the hand gesture recognition model, the semantic information corresponding to a hand gesture can be recognized, so that a user who does not know sign language can quickly obtain the semantic information corresponding to the hand gesture, improving communication efficiency.
Referring to fig. 15, a flowchart of an image processing method according to another embodiment of the present application is schematically shown. In the present embodiment, the method is mainly applied to the computer device described above for example, and the computer device may be an image pickup device (e.g., a video camera, a video recorder). The method may include the steps of:
Step 1501, acquiring a target video.
The target video is obtained by shooting gesture actions performed by the target user's hand.
Optionally, the target video may be captured by the computer device in real time, or may be pre-stored in the computer device, which is not limited in the embodiments of the application.
When the computer device is an independent camera device, for example an electronic device with an image acquisition function such as a video camera or a video recorder, the computer device can be arranged around the environment where the user is located, so that the gesture actions performed by the user can be captured from different angles, which helps improve the accuracy of gesture recognition.
Step 1502, acquiring hand gesture information corresponding to each image frame in the target video through the hand gesture recognition model.
This step is the same as or similar to the step 1302 in the embodiment of fig. 13, and is not repeated here.
Step 1503, determining the hand postures corresponding to the image frames according to the hand posture information corresponding to the image frames.
This step is the same as or similar to the step 1303 in the embodiment of fig. 13, and is not repeated here.
Step 1504, determining a target control instruction according to the hand postures respectively corresponding to the image frames.
After the hand gestures respectively corresponding to the image frames are obtained, the complete hand gesture of the target user hand in the target video can be obtained, and then the target control instruction is further obtained based on the complete hand gesture.
In a possible implementation manner, the computer device may store the correspondence between hand gestures and control instructions in advance, so that after the hand gestures corresponding to the image frames are acquired, the target control instruction may be determined according to this correspondence.
In another possible implementation manner, the computer device may generate the target control instruction through an instruction generation model, which is used to generate a control instruction according to the hand gesture. The instruction generation model may be a neural network model whose structure may include convolutional layers and fully-connected layers. The convolutional layers can be constructed from a multilayer convolutional neural network, a recurrent neural network or a deep neural network, and the fully-connected layers can be constructed from a multilayer bidirectional long short-term memory network or a long short-term memory network.
In step 1505, the target device is controlled to perform a corresponding action via the target control command.
The target control instruction is used for controlling the target device to execute corresponding actions. The target device may be an intelligent device pre-arranged in a gateway, such as an intelligent air conditioner, an intelligent door lock, a curtain motor, an intelligent curtain, an intelligent television, and the like.
After the target control instruction is acquired, the target device may be further controlled to execute a corresponding action based on the target control instruction. For example, the intelligent air conditioner is controlled to perform actions such as opening, closing, heating, cooling and the like; for another example, the intelligent curtain is controlled to perform actions such as opening, closing and opening and closing degree adjustment; for example, the smart television is controlled to perform actions such as turning on, turning off, and changing playing content, which is not limited in the embodiment of the present application.
To sum up, the technical scheme provided by the embodiment of the application acquires the hand gesture information corresponding to the target image through the hand gesture recognition model, further acquires the hand gesture, and acquires the control instruction based on the hand gesture to control the target device. According to the technical scheme, the control of the intelligent device can be realized only by identifying the hand gesture of the user, and the practicability of the control of the intelligent device is effectively improved.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 16, a block diagram of a training device for a hand gesture recognition model according to an embodiment of the present application is shown. The device has the function of realizing the training method example of the hand gesture recognition model, and the function can be realized by hardware or by hardware executing corresponding software. The device may be the computer device described above, or may be provided on a computer device. The apparatus 1600 may include: a sample acquisition module 1610, a confidence acquisition module 1620, a segmentation map generation module 1630, and a model training module 1640.
The sample obtaining module 1610 is configured to obtain a training sample, where the training sample includes a sample hand image and sample hand posture information corresponding to the sample hand image.
A confidence obtaining module 1620, configured to obtain confidence maps corresponding to m segments of limbs respectively according to the sample hand posture information, where a value of a pixel point in the confidence map corresponding to the ith segment of limb is used to represent a confidence that the pixel point belongs to the ith segment of limb, m is a positive integer, and i is a positive integer less than or equal to m;
a segmentation map generation module 1630, configured to generate a synthesized hand segmentation map corresponding to the sample hand image according to the confidence maps corresponding to the m limbs, where the synthesized hand segmentation map is an image obtained by segmenting a hand region and a non-hand region in the sample hand image, and a value of a pixel point in the synthesized hand segmentation map is used to represent a confidence that the pixel point belongs to a limb.
And the model training module 1640 is used for training the hand gesture recognition model by adopting the sample hand image, the synthesized hand part segmentation map and the sample hand gesture information.
To sum up, in the technical solution provided by this embodiment of the application, the synthesized hand segmentation map corresponding to the sample hand image is obtained according to the sample hand pose information, and the hand gesture recognition model is trained using the sample hand image, the synthesized hand segmentation map and the sample hand pose information. In the prior art, by contrast, a hand segmentation map must be manually labeled and the model is trained based on that manually labeled map. In the technical solution of the application, after the sample hand pose information is acquired, the synthesized hand segmentation map is obtained automatically without manual labeling, and the model is trained based on the synthesized hand segmentation map, thereby reducing the labor cost and time cost required for model training.
In some possible designs, the sample hand pose information comprises real position information of n hand key points in the sample hand image, n being a positive integer; the confidence obtaining module 1620 is configured to: connect the n hand key points according to the joint structure, based on the real position information of the n hand key points, to obtain the m segments of limbs; for the ith limb among the m limbs, calculate the distance from a pixel point in the sample hand image to the ith limb; and generate the confidence map corresponding to the ith limb according to the distance.
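A minimal sketch of the computation this module performs for a single limb segment is given below: connect two key points into a limb, compute each pixel's distance to that segment, and map the distance to a confidence value. The Gaussian fall-off is an assumption; the exact distance-to-confidence mapping is the one defined earlier in the description:

```python
import numpy as np

def limb_confidence_map(p0, p1, height, width, sigma=5.0):
    """Confidence map for one limb segment with key-point endpoints p0 and p1.

    Each pixel's value decreases with its distance to the segment, so pixels
    lying on or near the limb receive confidence close to 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs, ys], axis=-1).astype(np.float32)        # (H, W, 2) pixel coordinates
    p0 = np.asarray(p0, dtype=np.float32)
    p1 = np.asarray(p1, dtype=np.float32)
    seg = p1 - p0
    seg_len2 = max(float(seg @ seg), 1e-6)
    t = np.clip(((pts - p0) @ seg) / seg_len2, 0.0, 1.0)        # projection onto the segment
    nearest = p0 + t[..., None] * seg                           # closest point on the limb
    dist = np.linalg.norm(pts - nearest, axis=-1)               # point-to-segment distance
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))            # distance -> confidence (assumed Gaussian)
```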
In some possible designs, the m limbs include an a-segment finger limb and a b-segment palm limb, the synthetic hand segmentation graph includes a synthetic finger segmentation subgraph, a synthetic palm segmentation subgraph and a synthetic hand segmentation subgraph, a is a positive integer smaller than m, and b is a positive integer smaller than m; the segmentation map generation module 1630 is configured to combine the finger limbs belonging to each finger in the a-segment finger limb respectively to obtain a synthesized finger segmentation sub-map corresponding to each finger, where the synthesized finger segmentation sub-map corresponding to the target finger is an image segmented with a target finger region and a non-target finger region; combining the b-section palm limbs to obtain the synthetic palm segmentation subgraph, wherein the synthetic palm segmentation subgraph is an image segmented with a palm region and a non-palm region; and combining the m limbs to obtain the synthetic hand segmentation subgraph, wherein the synthetic hand segmentation subgraph is an image segmented with a hand region and a non-hand region.
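Combining the per-limb confidence maps into the segmentation subgraphs described above can be done, for example, with a pixel-wise maximum over the relevant limbs; the combination rule and the limb indices here are assumptions:

```python
import numpy as np

def combine_limb_maps(limb_maps, limb_indices):
    """Pixel-wise maximum over a subset of limb confidence maps.

    limb_maps:    array of shape (M, H, W), one confidence map per limb segment
    limb_indices: indices of the limbs belonging to a finger, the palm, or the whole hand
    """
    return np.max(limb_maps[list(limb_indices)], axis=0)

# Usage sketch (indices illustrative): a finger subgraph from that finger's limbs,
# the palm subgraph from the palm limbs, and the hand subgraph from all m limbs.
# index_finger_subgraph = combine_limb_maps(limb_maps, [4, 5, 6])
# palm_subgraph = combine_limb_maps(limb_maps, range(15, 20))
# hand_subgraph = combine_limb_maps(limb_maps, range(len(limb_maps)))
```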
In some possible designs, the hand gesture recognition model includes a feature extraction portion, a structure prediction portion, and a gesture prediction portion; the characteristic extraction part is used for extracting a characteristic map of the sample hand image; the structure prediction part is used for acquiring a predicted hand segmentation graph according to the feature graph, and the predicted hand segmentation graph comprises a predicted finger segmentation subgraph, a predicted palm segmentation subgraph and a predicted hand segmentation subgraph; and the gesture prediction part is used for obtaining predicted hand gesture information corresponding to the sample hand image according to the feature map and the predicted hand part segmentation map.
In some possible designs, the model training module 1640 is configured to: obtain a value of the structure loss function according to the synthesized hand segmentation map and the predicted hand segmentation map; obtain a value of the pose loss function according to the sample hand pose information and the predicted hand pose information; obtain a value of the target loss function according to the value of the structure loss function and the value of the pose loss function; and adjust the parameters of the hand gesture recognition model according to the value of the target loss function.
Referring to fig. 17, a block diagram of a hand gesture recognition apparatus provided in an embodiment of the present application is shown. The device has the function of realizing the hand gesture recognition method, and the function can be realized by hardware or by hardware executing corresponding software. The device may be the computer device described above, or may be provided on a computer device. The apparatus 1700 may include: image acquisition module 1710, model call module 1720, and pose determination module 1730.
An image obtaining module 1710, configured to obtain a target hand image.
The model calling module 1720 is configured to call a hand gesture recognition model, where the hand gesture recognition model is obtained by training a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image, and sample hand gesture information corresponding to the sample hand image, where the synthesized hand segmentation map is obtained according to the sample hand gesture information, and values of pixels in the synthesized hand segmentation map are used to represent confidence that the pixels belong to a limb.
A gesture determining module 1730, configured to determine, through the hand gesture recognition model, hand gesture information corresponding to the target hand image.
To sum up, in the technical solution provided by this embodiment of the application, after the target hand image is acquired, the hand pose information corresponding to the target hand image is determined by calling the hand gesture recognition model. The hand gesture recognition model is obtained by training with the sample hand image, the synthesized hand segmentation map corresponding to the sample hand image and the sample hand pose information corresponding to the sample hand image. In the prior art, by contrast, a hand segmentation map must be manually labeled and the model is trained based on that manually labeled map. In the technical solution of the application, after the sample hand pose information is acquired, the synthesized hand segmentation map is obtained automatically without manual labeling, and the model is trained based on the synthesized hand segmentation map, thereby reducing the labor cost and time cost required for model training.
Referring to fig. 18, a block diagram of an image processing apparatus according to an embodiment of the present application is shown. The apparatus has a function of implementing the above-mentioned image processing method example, and the function may be implemented by hardware, or may be implemented by hardware executing corresponding software. The device may be the computer device described above, or may be provided on a computer device. The apparatus 1800 may include: video acquisition module 1810, information acquisition module 1820, pose determination module 1830, and result determination module 1840.
The video acquiring module 1810 is configured to acquire a target video, where each image frame of the target video includes a hand of a target user.
The information acquisition module 1820 is configured to acquire, through a hand gesture recognition model, hand gesture information corresponding to each image frame, where the hand gesture recognition model is obtained by training a sample hand image, a composite hand segmentation map corresponding to the sample hand image, and the sample hand gesture information corresponding to the sample hand image, where the composite hand segmentation map is acquired according to the sample hand gesture information, and values of pixels in the composite hand segmentation map are used to represent confidence levels that the pixels belong to limbs.
A gesture determining module 1830, configured to determine, according to the hand gesture information corresponding to each image frame, a hand gesture corresponding to each image frame;
a result determining module 1840, configured to determine a gesture recognition result of the target user's hand according to the hand gestures respectively corresponding to the image frames.
To sum up, according to the technical scheme provided by the embodiment of the application, after the target video obtained by shooting the hand of the target user is obtained, the hand gesture information corresponding to each image frame is obtained through the hand gesture recognition model, and the gesture recognition result of the hand of the target user is determined according to the hand gesture information corresponding to each image frame, so that the user can realize different functions in different application scenes based on the gesture recognition result.
In some possible designs, the result determining module 1840 is configured to determine hand motion feature information according to the hand postures corresponding to the respective image frames, where the hand motion feature information is used to characterize the motion features of the target user's hand; according to the hand motion characteristic information, determining a hand motion evaluation index of the target user hand, wherein the hand motion evaluation index is used for representing the motion obstacle degree of the target user hand.
In some possible designs, the result determining module 1840 is configured to determine semantic information corresponding to each of the image frames according to a hand pose corresponding to each of the image frames, where the semantic information corresponding to a target image frame is used to represent a meaning expressed by the hand pose corresponding to the target image frame; and determining target semantic information corresponding to the target user hand based on semantic information corresponding to each image frame, wherein the target semantic information is used for representing the meaning expressed by the target user hand in the target video.
In some possible designs, the result determining module 1840 is configured to determine, according to the hand gestures respectively corresponding to the image frames, a target control instruction corresponding to the hand of the target user, where the target control instruction is used to control a target device to perform a corresponding action.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 19, a block diagram of a terminal according to an embodiment of the present application is shown. Generally, terminal 1900 includes: a processor 1901 and a memory 1902.
The processor 1901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 1901 may further include an AI processor for processing computational operations related to machine learning.
The memory 1902 may include one or more computer-readable storage media, which may be non-transitory. The memory 1902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1902 is used to store at least one instruction, at least one program, code set, or set of instructions for execution by the processor 1901 to implement a method of training a hand gesture recognition model provided by method embodiments of the present application, or to implement a method of hand gesture recognition as described in the above-mentioned aspects, or to implement a method of image processing as described in the above-mentioned aspects.
In some embodiments, terminal 1900 may further optionally include: a peripheral interface 1903 and at least one peripheral. The processor 1901, memory 1902, and peripheral interface 1903 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 1903 via a bus, signal line, or circuit board. Specifically, the peripheral device may include: at least one of a communication interface 1904, a display screen 1905, audio circuitry 1906, a camera assembly 1907, a positioning assembly 1908, and a power supply 1909.
Those skilled in the art will appreciate that the configuration shown in FIG. 19 is not intended to be limiting of terminal 1900 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Referring to fig. 20, a schematic structural diagram of a server according to an embodiment of the present application is shown. Specifically, the method comprises the following steps:
the server 2000 includes a CPU (Central Processing Unit) 2001, a system Memory 2004 including a RAM (Random Access Memory) 2002 and a ROM (Read Only Memory) 2003, and a system bus 2005 connecting the system Memory 2004 and the Central Processing Unit 2001. The server 2000 also includes a basic I/O (Input/Output) system 2006 to facilitate information transfer between devices within the computer, and a mass storage device 2007 for storing an operating system 2013, application programs 2014, and other program modules 2012.
The basic input/output system 2006 includes a display 2008 for displaying information and an input device 2009 such as a mouse, keyboard, etc. for a user to input information. Wherein the display 2008 and the input devices 2009 are coupled to the central processing unit 2001 through an input-output controller 2010 coupled to the system bus 2005. The basic input/output system 2006 may also include an input/output controller 2010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 2010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 2007 is connected to the central processing unit 2001 through a mass storage controller (not shown) connected to the system bus 2005. The mass storage device 2007 and its associated computer-readable media provide non-volatile storage for the server 2000. That is, the mass storage device 2007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 2004 and mass storage device 2007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 2000 may also be connected, through a network such as the Internet, to a remote computer on the network for operation. That is, the server 2000 may be connected to the network 2012 through the network interface unit 2011 coupled to the system bus 2005, or the network interface unit 2011 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes at least one instruction, at least one program, set of codes, or set of instructions stored in the memory and configured to be executed by the one or more processors to implement the method of training a hand gesture recognition model described above, or to implement the method of hand gesture recognition described above, or to implement the method of image processing described above.
In an exemplary embodiment, a computer device is also provided. The computer device may be a terminal or a server. The computer device comprises a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the training method of the hand gesture recognition model described above, or to implement the hand gesture recognition method described above, or to implement the image processing method described above.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions which, when executed by a processor, implement the above-mentioned method of training a hand gesture recognition model, or implement the hand gesture recognition method as described in the above-mentioned aspect, or implement the image processing method as described in the above-mentioned aspect.
In an exemplary embodiment, a computer program product is also provided, which, when being executed by a processor, is adapted to carry out the above-mentioned method of training a hand gesture recognition model, or to carry out the hand gesture recognition method as described in the above-mentioned aspect, or to carry out the image processing method as described in the above-mentioned aspect.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for training a hand gesture recognition model, the method comprising:
acquiring a training sample, wherein the training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image;
obtaining confidence maps corresponding to m sections of limbs according to the sample hand posture information, wherein the value of a pixel point in the confidence map corresponding to the ith section of limb is used for representing the confidence of the pixel point belonging to the ith section of limb, m is a positive integer, and i is a positive integer less than or equal to m;
generating a synthesized hand part segmentation map corresponding to the sample hand image according to the confidence level maps corresponding to the m limbs respectively, wherein the synthesized hand part segmentation map is an image obtained by segmenting a hand region and a non-hand region in the sample hand image, and the value of a pixel point in the synthesized hand part segmentation map is used for representing the confidence level of the pixel point belonging to the limbs;
and training the hand gesture recognition model by adopting the sample hand image, the synthesized hand part segmentation map and the sample hand gesture information.
2. The method of claim 1, wherein the sample hand pose information comprises true position information for n hand keypoints in the sample hand image, the n being a positive integer;
the obtaining of the confidence maps corresponding to the m limbs according to the sample hand posture information includes:
connecting the n hand key points according to joint structures to obtain the m sections of limbs according to the real position information of the n hand key points;
for the ith limb in the m limbs, calculating the distance from a pixel point in the sample hand image to the ith limb;
and generating a confidence map corresponding to the ith limb according to the distance.
3. The method according to claim 1, wherein the m limb segments comprise a finger limb segments and b palm limb segments, the synthesized hand segmentation map comprises synthesized finger segmentation sub-maps, a synthesized palm segmentation sub-map and a synthesized hand segmentation sub-map, a is a positive integer smaller than m, and b is a positive integer smaller than m;
the generating the synthesized hand segmentation map according to the confidence maps respectively corresponding to the m limb segments comprises:
combining, for each finger, the finger limb segments belonging to that finger among the a finger limb segments, to obtain the synthesized finger segmentation sub-map corresponding to each finger, wherein the synthesized finger segmentation sub-map corresponding to a target finger is an image in which a target finger region and a non-target finger region are segmented;
combining the b palm limb segments to obtain the synthesized palm segmentation sub-map, wherein the synthesized palm segmentation sub-map is an image in which a palm region and a non-palm region are segmented;
and combining the m limb segments to obtain the synthesized hand segmentation sub-map, wherein the synthesized hand segmentation sub-map is an image in which a hand region and a non-hand region are segmented.
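Illustrative sketch (not part of the claims): combining per-segment confidence maps into the sub-maps of claim 3 can be done with a pixel-wise maximum; the segment-to-finger grouping shown is hypothetical and depends on the joint structure used when connecting the keypoints.
    import numpy as np

    def combine_confidence_maps(maps):
        # A pixel is as confident of belonging to the combined part as it is of belonging
        # to its most confident contributing limb segment.
        return np.max(np.stack(maps, axis=0), axis=0)

    # Hypothetical grouping of segment indices by finger / palm; seg_maps would be the
    # per-segment confidence maps produced as in claim 2.
    # finger_sub_maps = {finger: combine_confidence_maps([seg_maps[i] for i in idx])
    #                    for finger, idx in {"thumb": [0, 1, 2], "index": [3, 4, 5]}.items()}
    # palm_sub_map = combine_confidence_maps([seg_maps[i] for i in palm_indices])
    # hand_sub_map = combine_confidence_maps(seg_maps)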
4. The method of any one of claims 1 to 3, wherein the hand gesture recognition model comprises a feature extraction component, a structure prediction component and a posture prediction component;
the feature extraction component is configured to extract a feature map of the sample hand image;
the structure prediction component is configured to obtain a predicted hand segmentation map according to the feature map, the predicted hand segmentation map comprising predicted finger segmentation sub-maps, a predicted palm segmentation sub-map and a predicted hand segmentation sub-map;
and the posture prediction component is configured to obtain predicted hand posture information corresponding to the sample hand image according to the feature map and the predicted hand segmentation map.
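Illustrative sketch (not part of the claims): a minimal PyTorch-style layout of the three components of claim 4; the backbone, layer sizes, the choice of 7 segmentation sub-maps (5 fingers + palm + hand) and 21 keypoints are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class HandPoseModel(nn.Module):
        def __init__(self, num_seg_maps=7, num_keypoints=21):
            super().__init__()
            self.features = nn.Sequential(                       # feature extraction component
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
            self.structure = nn.Conv2d(128, num_seg_maps, 1)      # structure prediction component
            self.posture = nn.Conv2d(128 + num_seg_maps,          # posture prediction component
                                     num_keypoints, 1)

        def forward(self, image):
            feat = self.features(image)
            seg = torch.sigmoid(self.structure(feat))             # predicted hand segmentation sub-maps
            pose = self.posture(torch.cat([feat, seg], dim=1))    # predicted keypoint heatmaps
            return seg, pose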
5. The method of claim 4, wherein the training the hand gesture recognition model by using the sample hand image, the synthesized hand segmentation map and the sample hand posture information comprises:
obtaining a value of a structure loss function according to the synthesized hand segmentation map and the predicted hand segmentation map;
obtaining a value of a posture loss function according to the sample hand posture information and the predicted hand posture information;
obtaining a value of a target loss function according to the value of the structure loss function and the value of the posture loss function;
and adjusting parameters of the hand gesture recognition model according to the value of the target loss function.
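Illustrative sketch (not part of the claims): the target loss of claim 5 can be formed as a weighted sum of a structure loss and a posture loss; the use of mean-squared error and the weighting factor seg_weight are assumptions.
    import torch.nn.functional as F

    def target_loss(pred_seg, synth_seg, pred_pose, gt_pose, seg_weight=1.0):
        structure_loss = F.mse_loss(pred_seg, synth_seg)   # predicted vs. synthesized segmentation maps
        posture_loss = F.mse_loss(pred_pose, gt_pose)      # predicted vs. sample hand posture (e.g. heatmaps)
        return posture_loss + seg_weight * structure_loss  # value used to adjust the model parameters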
6. A method of hand gesture recognition, the method comprising:
acquiring a target hand image;
calling a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training with a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image and sample hand posture information corresponding to the sample hand image, the synthesized hand segmentation map is obtained according to the sample hand posture information, and the value of a pixel point in the synthesized hand segmentation map represents the confidence that the pixel point belongs to a limb segment;
and determining hand posture information corresponding to the target hand image through the hand gesture recognition model.
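Illustrative sketch (not part of the claims): an inference flow for claim 6, reusing the HandPoseModel sketch given after claim 4; the checkpoint path, input size and argmax read-out are assumptions.
    import torch

    model = HandPoseModel()
    model.load_state_dict(torch.load("hand_pose_model.pt"))   # hypothetical checkpoint path
    model.eval()
    with torch.no_grad():
        target_image = torch.rand(1, 3, 256, 256)              # stand-in for the acquired target hand image
        _, heatmaps = model(target_image)
        # One simple read-out: each keypoint as the (row, col) argmax of its heatmap channel.
        keypoints = [divmod(int(h.argmax()), h.shape[-1]) for h in heatmaps[0]]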
7. An image processing method, characterized in that the method comprises:
acquiring a target video, wherein each image frame of the target video contains a target user's hand;
acquiring hand posture information respectively corresponding to the image frames through a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training with a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image and sample hand posture information corresponding to the sample hand image, the synthesized hand segmentation map is obtained according to the sample hand posture information, and the value of a pixel point in the synthesized hand segmentation map represents the confidence that the pixel point belongs to a limb segment;
determining hand postures respectively corresponding to the image frames according to the hand posture information respectively corresponding to the image frames;
and determining a gesture recognition result of the target user's hand according to the hand postures respectively corresponding to the image frames.
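Illustrative sketch (not part of the claims): a per-frame pipeline for claim 7; classify_posture and aggregate are hypothetical helpers standing in for the per-frame posture decision and the final aggregation of claims 8 to 10.
    import torch

    def recognize_video(frames, model, classify_posture, aggregate):
        # frames: iterable of (3, H, W) image tensors decoded from the target video.
        postures = []
        with torch.no_grad():
            for frame in frames:
                _, heatmaps = model(frame.unsqueeze(0))          # hand posture information per frame
                postures.append(classify_posture(heatmaps[0]))   # hand posture per frame
        return aggregate(postures)                               # gesture recognition result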
8. The method according to claim 7, wherein the determining a gesture recognition result of the target user's hand according to the hand postures respectively corresponding to the image frames comprises:
determining hand motion feature information according to the hand postures respectively corresponding to the image frames, wherein the hand motion feature information represents motion characteristics of the target user's hand;
and determining a hand motion evaluation index of the target user's hand according to the hand motion feature information, wherein the hand motion evaluation index represents the degree of motor impairment of the target user's hand.
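Illustrative sketch (not part of the claims): one of many possible motion evaluation indices for claim 8, using mean inter-frame keypoint displacement as the motion feature; this is an assumption, not the patent's definition of the index.
    import numpy as np

    def motion_evaluation_index(keypoints_per_frame):
        # keypoints_per_frame: list of (n, 2) arrays of hand keypoint coordinates, one per frame.
        trajectory = np.stack(keypoints_per_frame, axis=0)           # (frames, n, 2)
        step = np.linalg.norm(np.diff(trajectory, axis=0), axis=-1)  # per-keypoint inter-frame displacement
        return float(step.mean())                                    # scalar motion feature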
9. The method according to claim 7, wherein the determining a gesture recognition result of the target user's hand according to the hand postures respectively corresponding to the image frames comprises:
determining semantic information respectively corresponding to the image frames according to the hand postures respectively corresponding to the image frames, wherein the semantic information corresponding to a target image frame represents the meaning expressed by the hand posture in the target image frame;
and determining target semantic information corresponding to the target user's hand based on the semantic information respectively corresponding to the image frames, wherein the target semantic information represents the meaning expressed by the target user's hand in the target video.
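Illustrative sketch (not part of the claims): one possible aggregation of per-frame semantic labels into the target semantic information of claim 9; the stability threshold min_run is an assumption.
    from itertools import groupby

    def aggregate_semantics(per_frame_labels, min_run=5):
        # Collapse runs of identical per-frame labels and keep only runs that persist for
        # at least min_run frames, yielding the sequence of meanings expressed in the video.
        runs = ((label, sum(1 for _ in group)) for label, group in groupby(per_frame_labels))
        return [label for label, length in runs if label is not None and length >= min_run]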
10. The method according to claim 7, wherein the determining a gesture recognition result of the target user's hand according to the hand postures respectively corresponding to the image frames comprises:
and determining a target control instruction corresponding to the target user's hand according to the hand postures respectively corresponding to the image frames, wherein the target control instruction is used for controlling a target device to execute a corresponding action.
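Illustrative sketch (not part of the claims): a hypothetical mapping from a recognized hand posture to a target control instruction for claim 10; the gesture names and instructions are placeholders.
    GESTURE_TO_INSTRUCTION = {"open_palm": "pause", "fist": "play", "swipe_left": "previous"}

    def control_instruction(gesture_label):
        # Returns None when the recognized gesture has no associated instruction.
        return GESTURE_TO_INSTRUCTION.get(gesture_label)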
11. A training device for a hand gesture recognition model, the device comprising:
a sample acquisition module, configured to acquire a training sample, wherein the training sample comprises a sample hand image and sample hand posture information corresponding to the sample hand image;
a confidence obtaining module, configured to obtain confidence maps respectively corresponding to m limb segments according to the sample hand posture information, wherein the value of a pixel point in the confidence map corresponding to the i-th limb segment represents the confidence that the pixel point belongs to the i-th limb segment, m is a positive integer, and i is a positive integer less than or equal to m;
a segmentation map generation module, configured to generate a synthesized hand segmentation map corresponding to the sample hand image according to the confidence maps respectively corresponding to the m limb segments, wherein the synthesized hand segmentation map is an image in which a hand region and a non-hand region of the sample hand image are segmented, and the value of a pixel point in the synthesized hand segmentation map represents the confidence that the pixel point belongs to a limb segment;
and a model training module, configured to train the hand gesture recognition model by using the sample hand image, the synthesized hand segmentation map and the sample hand posture information.
12. A hand gesture recognition apparatus, the apparatus comprising:
an image acquisition module, configured to acquire a target hand image;
a model calling module, configured to call a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training with a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image and sample hand posture information corresponding to the sample hand image, the synthesized hand segmentation map is obtained according to the sample hand posture information, and the value of a pixel point in the synthesized hand segmentation map represents the confidence that the pixel point belongs to a limb segment;
and a posture determining module, configured to determine hand posture information corresponding to the target hand image through the hand gesture recognition model.
13. An image processing apparatus, characterized in that the apparatus comprises:
a video acquisition module, configured to acquire a target video, wherein each image frame of the target video contains a target user's hand;
an information acquisition module, configured to acquire hand posture information respectively corresponding to the image frames through a hand gesture recognition model, wherein the hand gesture recognition model is obtained by training with a sample hand image, a synthesized hand segmentation map corresponding to the sample hand image and sample hand posture information corresponding to the sample hand image, the synthesized hand segmentation map is obtained according to the sample hand posture information, and the value of a pixel point in the synthesized hand segmentation map represents the confidence that the pixel point belongs to a limb segment;
a posture determining module, configured to determine hand postures respectively corresponding to the image frames according to the hand posture information respectively corresponding to the image frames;
and a result determining module, configured to determine a gesture recognition result of the target user's hand according to the hand postures respectively corresponding to the image frames.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method of any one of claims 1 to 5, or to implement the method of claim 6, or to implement the method of claim 7 or 10.
15. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of any one of claims 1 to 5, or to implement the method of claim 6, or to implement the method of claim 7 or 10.
CN202010042559.6A 2020-01-15 2020-01-15 Training method, device and equipment for hand gesture recognition model and storage medium Active CN111222486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010042559.6A CN111222486B (en) 2020-01-15 2020-01-15 Training method, device and equipment for hand gesture recognition model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010042559.6A CN111222486B (en) 2020-01-15 2020-01-15 Training method, device and equipment for hand gesture recognition model and storage medium

Publications (2)

Publication Number Publication Date
CN111222486A true CN111222486A (en) 2020-06-02
CN111222486B CN111222486B (en) 2022-11-04

Family

ID=70810861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010042559.6A Active CN111222486B (en) 2020-01-15 2020-01-15 Training method, device and equipment for hand gesture recognition model and storage medium

Country Status (1)

Country Link
CN (1) CN111222486B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766038A (en) * 2014-01-02 2015-07-08 株式会社理光 Palm opening and closing action recognition method and device
CN106937531A (en) * 2014-06-14 2017-07-07 奇跃公司 Method and system for producing virtual and augmented reality
CN104636725A (en) * 2015-02-04 2015-05-20 华中科技大学 Gesture recognition method based on depth image and gesture recognition system based on depth images
CN106886741A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of gesture identification method of base finger identification
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107219925A (en) * 2017-05-27 2017-09-29 成都通甲优博科技有限责任公司 Pose detection method, device and server
CN107563494A (en) * 2017-08-01 2018-01-09 华南理工大学 A kind of the first visual angle Fingertip Detection based on convolutional neural networks and thermal map
CN107368820A (en) * 2017-08-03 2017-11-21 中国科学院深圳先进技术研究院 One kind becomes more meticulous gesture identification method, device and equipment
CN107688391A (en) * 2017-09-01 2018-02-13 广州大学 A kind of gesture identification method and device based on monocular vision
CN107742102A (en) * 2017-10-13 2018-02-27 北京华捷艾米科技有限公司 A kind of gesture identification method based on depth transducer
CN107958218A (en) * 2017-11-22 2018-04-24 南京邮电大学 A kind of real-time gesture knows method for distinguishing
CN108594997A (en) * 2018-04-16 2018-09-28 腾讯科技(深圳)有限公司 Gesture framework construction method, apparatus, equipment and storage medium
CN108960081A (en) * 2018-06-15 2018-12-07 中控智慧科技股份有限公司 A kind of palm image-recognizing method, device and computer readable storage medium
CN108960163A (en) * 2018-07-10 2018-12-07 亮风台(上海)信息科技有限公司 Gesture identification method, device, equipment and storage medium
CN110163048A (en) * 2018-07-10 2019-08-23 腾讯科技(深圳)有限公司 Identification model training method, recognition methods and the equipment of hand key point
CN109344742A (en) * 2018-09-14 2019-02-15 腾讯科技(深圳)有限公司 Characteristic point positioning method, device, storage medium and computer equipment
CN109934177A (en) * 2019-03-15 2019-06-25 艾特城信息科技有限公司 Pedestrian recognition methods, system and computer readable storage medium again
CN110502986A (en) * 2019-07-12 2019-11-26 平安科技(深圳)有限公司 Identify character positions method, apparatus, computer equipment and storage medium in image

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112762895A (en) * 2020-10-30 2021-05-07 四川写正智能科技有限公司 Method for judging read-write posture based on sensor
CN112762895B (en) * 2020-10-30 2022-07-15 四川写正智能科技有限公司 Method for judging read-write posture based on sensor
CN113221729A (en) * 2021-05-10 2021-08-06 北京航空航天大学 Unmanned aerial vehicle cluster control method and system based on gesture human-computer interaction
CN113311870A (en) * 2021-05-28 2021-08-27 广东工业大学 Accurate logistics delivery method based on unmanned aerial vehicle
CN113311870B (en) * 2021-05-28 2023-09-22 广东工业大学 Accurate logistics delivery method based on unmanned aerial vehicle
CN113469929A (en) * 2021-09-03 2021-10-01 北京美摄网络科技有限公司 Training data generation method and device, electronic equipment and computer readable storage medium
CN116664819A (en) * 2023-05-17 2023-08-29 武汉大学中南医院 Medical staff hand recognition positioning method, device, equipment and storage medium
CN116664819B (en) * 2023-05-17 2024-01-09 武汉大学中南医院 Medical staff hand recognition positioning method, device, equipment and storage medium
CN116704555A (en) * 2023-08-09 2023-09-05 四川大学 Panda identification method and system based on posture adjustment
CN116704555B (en) * 2023-08-09 2023-10-13 四川大学 Panda identification method and system based on posture adjustment

Also Published As

Publication number Publication date
CN111222486B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN111126272B (en) Posture acquisition method, and training method and device of key point coordinate positioning model
CN111222486B (en) Training method, device and equipment for hand gesture recognition model and storage medium
Sincan et al. Autsl: A large scale multi-modal turkish sign language dataset and baseline methods
Jalal et al. Students’ behavior mining in e-learning environment using cognitive processes with information technologies
KR102014377B1 (en) Method and apparatus for surgical action recognition based on learning
CN110781765B (en) Human body posture recognition method, device, equipment and storage medium
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
CN111240476B (en) Interaction method and device based on augmented reality, storage medium and computer equipment
US20130335318A1 (en) Method and apparatus for doing hand and face gesture recognition using 3d sensors and hardware non-linear classifiers
CN107423398A (en) Exchange method, device, storage medium and computer equipment
CN105051755A (en) Part and state detection for gesture recognition
Avola et al. Deep temporal analysis for non-acted body affect recognition
CN111460976B (en) Data-driven real-time hand motion assessment method based on RGB video
CN112116684A (en) Image processing method, device, equipment and computer readable storage medium
Alabbasi et al. Real time facial emotion recognition using kinect V2 sensor
CN114998983A (en) Limb rehabilitation method based on augmented reality technology and posture recognition technology
Adhikari et al. A Novel Machine Learning-Based Hand Gesture Recognition Using HCI on IoT Assisted Cloud Platform.
Gil et al. 3D visual sensing of the human hand for the remote operation of a robotic hand
Golnari et al. Deepfacear: deep face recognition and displaying personal information via augmented reality
CN114967937B (en) Virtual human motion generation method and system
Hasan et al. Gesture feature extraction for static gesture recognition
CN115019396A (en) Learning state monitoring method, device, equipment and medium
Usman et al. Skeleton-based motion prediction: A survey
Abdulhamied et al. Real-time recognition of American sign language using long-short term memory neural network and hand detection
CN113505750A (en) Identification method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40024760
Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant