WO2022174605A1

WO2022174605A1 - Gesture recognition method, gesture recognition apparatus, and smart device

Info

Publication number: WO2022174605A1
Application number: PCT/CN2021/124613
Authority: WO
Inventors: 汤志超; 程骏; 郭渺辰; 钱程浩; 邵池; 庞建新
Original assignee: 深圳市优必选科技股份有限公司
Priority date: 2021-02-21
Filing date: 2021-10-19
Publication date: 2022-08-25
Also published as: CN112949437A

Abstract

The present application is suitable for the technical field of gesture recognition, and provides a gesture recognition method, a gesture recognition apparatus, and a smart device. The method comprises: obtaining a target video comprising a gesture; and inputting the target video into a trained gesture recognition model so as to obtain category information, positioning box information and key point information of the gesture of the target video, wherein the gesture recognition model is obtained by training using a sample gesture image carrying annotation information, and the annotation information comprises category information, positioning box information and key point information of a gesture of the sample gesture image. By means of the solution of the present application, the accuracy and robustness of gesture recognition can be improved.

Description

Gesture recognition method, gesture recognition device and smart device

This application claims the priority of the Chinese Patent Application No. 202110194549.9 filed with the Chinese Patent Office on February 21, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present application belongs to the technical field of gesture recognition, and in particular, relates to a gesture recognition method, a gesture recognition device, a smart device, and a computer-readable storage medium.

Background art

At present, gesture recognition plays an important role in the field of human-computer interaction. Gesture recognition technology can help people solve problems in corresponding scenarios, such as recognizing the sign language of deaf people or playing rock-paper-scissors with a robot. However, current gesture recognition technology still falls short in both recognition accuracy and robustness.

Technical problem

In view of this, the present application provides a gesture recognition method, a gesture recognition device, a smart device, and a computer-readable storage medium, which can improve the accuracy and robustness of gesture recognition.

Technical solution

In a first aspect, the present application provides a gesture recognition method, including:

acquiring a target video containing a gesture; and

inputting the target video into a trained gesture recognition model to obtain category information, positioning frame information, and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training on sample gesture images carrying annotation information, and the annotation information includes the category information, positioning frame information, and key point information of the gesture in each sample gesture image.

In a second aspect, the present application provides a gesture recognition device, including:

an acquisition unit, configured to acquire a target video containing a gesture; and

a recognition unit, configured to input the target video into a trained gesture recognition model to obtain category information, positioning frame information, and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training on sample gesture images carrying annotation information, and the annotation information includes the category information, positioning frame information, and key point information of the gesture in each sample gesture image.

In a third aspect, the present application provides a smart device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of the first aspect when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the steps of the method in the first aspect.

In a fifth aspect, the present application provides a computer program product, wherein the computer program product includes a computer program, and when the computer program is executed by one or more processors, the steps of the method of the first aspect are implemented.

Beneficial effects

As can be seen from the above, in the solution of the present application, after a target video containing a gesture is acquired, the target video is input into a trained gesture recognition model to obtain the category information, positioning frame information, and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training on sample gesture images carrying annotation information, and the annotation information includes the category information, positioning frame information, and key point information of the gesture in each sample gesture image. Because the annotation information covers several kinds of gesture information (i.e., category information, positioning frame information, and key point information), the gesture recognition model can implicitly combine these kinds of information while it is being trained, so that the trained gesture recognition model has high accuracy and robustness. It can be understood that, for the beneficial effects of the second aspect to the fifth aspect, reference may be made to the relevant description of the first aspect, which is not repeated here.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.

FIG. 1 is a schematic flowchart of a gesture recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an application environment of the gesture recognition method provided by an embodiment of the present application;

FIG. 3 is a structural block diagram of a gesture recognition device provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a smart device provided by an embodiment of the present application.

Embodiments of the present invention

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

In order to illustrate the technical solutions proposed in the present application, the following specific embodiments are used for description.

A gesture recognition method provided by an embodiment of the present application is described below. The gesture recognition method is applied to a smart device. Referring to FIG. 1, the gesture recognition method includes:

Step 101: Acquire a target video including gestures.

In this embodiment of the present application, the target video includes gestures, that is, the target video is a video obtained by photographing a human hand with a photographing device. Specifically, the target video can be a video input in real time through a camera connected to the smart device, or it can be a pre-recorded video, which is not limited here. For example, the user can record a hand making the gesture with a mobile phone in advance and then send the recording to the smart device, which uses it as the target video.

The target video includes several frames of images, at least one of which contains a gesture. There are therefore two cases: in one, every frame of the target video contains a gesture; in the other, some frames of the target video contain a gesture and the remaining frames do not.

Step 102: Input the target video into the trained gesture recognition model, and obtain the category information, positioning frame information and key point information of the gesture in the target video.

In the embodiment of the present application, the gesture recognition model is obtained by training on sample gesture images. In order to improve the recognition accuracy of the gesture recognition model, the number of sample gesture images used for training should be as large as possible; for example, the number of sample gesture images may be 10,000. Because the human hand is so flexible, the number of gesture categories it can make is very large, so the gesture recognition model cannot recognize every category of gesture that the human hand can make. Based on this, at least one gesture can be selected as a preset gesture according to the application scenario and user requirements, and sample gesture images containing the preset gestures are then collected, wherein each sample gesture image contains one preset gesture. Exemplarily, 9 kinds of gestures can be selected as preset gestures: the palm gesture, stone gesture, scissors gesture, OK gesture, handsome gesture, call gesture, swear gesture, rock gesture, and one gesture.

Each sample gesture image can be annotated so that it carries annotation information. The annotation information can include the category information, positioning frame information, and key point information of the gesture in the sample gesture image, wherein the category information indicates the category of the gesture, the positioning frame information indicates the positioning frame of the gesture (the positioning frame being the circumscribed rectangle of the gesture), and the key point information indicates the key points of the gesture (i.e., the 21 skeleton points of a single hand).

The trained gesture recognition model is obtained by training the gesture recognition model on the sample gesture images. When the target video is input into the trained gesture recognition model, the model can output the category information, positioning frame information, and key point information of the gesture in the target video. In other words, the gesture recognition model is a multi-task model that can complete multiple tasks: outputting the category information of the gesture, outputting the positioning frame information of the gesture, and outputting the key point information of the gesture. During training, a multi-task model can improve the learning efficiency and quality of each task by learning the connections and differences between the tasks. Therefore, the gesture recognition accuracy of the trained gesture recognition model in the embodiment of the present application is higher than that of traditional gesture recognition models.

It should be noted that after the target video is input into the trained gesture recognition model, the model actually performs gesture recognition on each frame of the target video. For each frame, the model detects whether the image contains a gesture. If it does, the model outputs the category information, positioning frame information, and key point information of the gesture in that image; if it does not, no information is output. The category information of the gesture in a frame indicates which of the at least one preset gesture the gesture belongs to. The positioning frame information indicates the position of the positioning frame of the gesture in that frame; for example, it may be the coordinates of the upper-left and lower-right corners of the positioning frame. The key point information indicates the positions of the key points of the gesture in that frame; for example, it may be the coordinates of the key points.
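The per-frame behaviour described above can be sketched as follows. This is only an illustrative sketch: the `GestureResult` type, the `recognize_video` helper, and the stand-in `fake_model` are hypothetical names that do not appear in the application, and `None` is assumed here as the representation of "no information is output".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class GestureResult:
    """Hypothetical container for the three outputs of one frame."""
    category: str                            # one of the preset gestures
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2): upper-left, lower-right
    keypoints: List[Point]                   # coordinates of the key points


def recognize_video(
    frames: list,
    model: Callable[[object], Optional[GestureResult]],
) -> List[Optional[GestureResult]]:
    """Run the model on every frame; None marks a frame without a gesture."""
    return [model(frame) for frame in frames]


def fake_model(frame) -> Optional[GestureResult]:
    """Stand-in for a trained model: only frames labelled "hand" contain a gesture."""
    if frame != "hand":
        return None
    return GestureResult("palm", (10.0, 20.0, 110.0, 220.0), [(50.0, 60.0)] * 21)


results = recognize_video(["hand", "empty", "hand"], fake_model)
for index, result in enumerate(results):
    print(index, None if result is None else result.category)
```

Keeping one entry per frame, including empty ones, preserves the alignment between results and frames, which is convenient when the results are later drawn back onto the video.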

Optionally, before inputting the target video into the trained gesture recognition model, the method further includes:

Normalize each frame of the target video to obtain a normalized video;

Correspondingly, the above step 102 specifically includes:

The normalized video is input into the trained gesture recognition model, and the category information, positioning frame information and key point information of the gesture in the target video are obtained.

In this embodiment of the present application, the normalization process may apply mean and variance operations to the pixel values of the three RGB channels in each frame of the target video, so that the pixel values are converted from the range 0 to 255 into the range -1 to 1. Through normalization, each frame of the target video meets the image format requirements of the gesture recognition model, which facilitates the subsequent gesture recognition. In the embodiment of the present application, the normalized target video is recorded as a normalized video, and the normalized video is input into the trained gesture recognition model, so that the gesture recognition model outputs the category information, positioning frame information, and key point information of the gesture in the target video on this basis.
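As a minimal sketch of such a normalization, the following maps pixel values from 0 to 255 into -1 to 1 by subtracting a mean and dividing by a scale. The values 127.5 for both parameters are an assumption chosen so that the stated range is reached; the application does not give the actual constants. Frames are represented as plain nested lists to keep the example free of dependencies.

```python
def normalize_pixel(value, mean=127.5, std=127.5):
    """Map one channel value from [0, 255] to [-1, 1] (mean/std are assumed)."""
    return (value - mean) / std


def normalize_frame(frame):
    """Normalize a frame given as rows of (R, G, B) tuples."""
    return [
        [tuple(normalize_pixel(channel) for channel in pixel) for pixel in row]
        for row in frame
    ]


def normalize_video(frames):
    """Normalize every frame of a video."""
    return [normalize_frame(frame) for frame in frames]


frame = [[(0, 127.5, 255), (255, 255, 255)]]
normalized = normalize_video([frame])
print(normalized[0][0][0])  # first pixel of the first frame
```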

Optionally, considering that the gesture recognition model is a multi-task model that can complete several tasks, the gesture recognition model can include a gesture classification branch, a gesture localization branch, and a key point detection branch, wherein each branch completes one task.

Specifically, the gesture classification branch is used to output the category information of the gesture in the target video. The gesture classification branch is implemented by one-hot encoding the gesture categories and using a softmax layer to output the probability of each gesture category. Through the gesture classification branch, the target preset gesture with the highest matching probability for the gesture in the target video can be determined among the at least one preset gesture, and the category information of the gesture in the target video can be determined based on the target preset gesture. For example, suppose the target video contains an unknown gesture X. After the target video is input into the trained gesture recognition model, the matching probability between gesture X and preset gesture A is 14%, the matching probability between gesture X and preset gesture B is 85%, and the matching probability between gesture X and preset gesture C is 1%. Preset gesture B can then be determined as the target preset gesture, and the category information indicates that the unknown gesture X is preset gesture B.
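The classification step can be illustrated with a short sketch: raw scores are turned into probabilities by softmax, and the preset gesture with the highest probability is selected. The function names and the input scores are hypothetical; only the one-hot encoding, the softmax layer, and the "highest matching probability" rule come from the description above.

```python
import math


def one_hot(index, num_classes):
    """One-hot encoding of a gesture category, as used for training labels."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, preset_gestures):
    """Return the preset gesture with the highest probability and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return preset_gestures[best], probs


# Hypothetical scores for an unknown gesture X against preset gestures A, B, C.
label, probs = classify([0.2, 2.0, -2.4], ["A", "B", "C"])
print(label, [round(p, 2) for p in probs])
```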

Specifically, the gesture positioning branch is used to output the positioning frame information of the gesture in the target video. Through the gesture positioning branch, the position of the gesture in the target video can be positioned, and then the positioning frame information of the gesture in the target video can be determined based on the position.

Specifically, the keypoint detection branch is used to output keypoint information of gestures in the target video. The implementation of the keypoint detection branch is network regression. Through the key point detection branch, the position of the key point of the gesture in the target video can be detected, and then the key point information of the gesture in the target video can be determined based on the position.

Optionally, the gesture recognition model further includes a feature extraction layer (i.e., a backbone network), which can be a deep residual network (ResNet) such as ResNet50, or a lightweight network such as ShuffleNet or MobileNet. Which network to choose as the feature extraction layer can be determined according to the performance of the smart device. For example, if the smart device is a desktop computer with strong performance, ResNet50 can be selected as the feature extraction layer; if the smart device is a mobile phone with weaker performance, MobileNet can be selected instead. After the target video is input into the gesture recognition model, the feature extraction layer performs feature extraction on the target video to obtain the feature information of the target video. Referring to FIG. 2, after the feature information of the target video is obtained through the feature extraction layer, the feature information is input to the gesture classification branch, the gesture localization branch, and the key point detection branch respectively. The gesture classification branch can then output the category information of the gesture in the target video based on the feature information, the gesture localization branch can output the positioning frame information of the gesture in the target video based on the feature information, and the key point detection branch can output the key point information of the gesture in the target video based on the feature information.
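The data flow from shared features to the three branches can be sketched as below. This is a toy illustration rather than the model of the application: the backbone is replaced by a fixed feature vector, each branch is a single randomly initialized linear layer, and `FEATURE_DIM`, `MultiTaskHeads`, and the output names are hypothetical. The output sizes follow the earlier description (9 preset gestures in the example, two corner coordinates for the positioning frame, 21 key points).

```python
import random

NUM_CLASSES = 9      # nine preset gestures in the example given earlier
NUM_KEYPOINTS = 21   # skeleton points of a single hand
FEATURE_DIM = 8      # hypothetical size of the backbone feature vector


def make_weights(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, features):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


class MultiTaskHeads:
    """Three branches that all read the same feature information."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.classification = make_weights(NUM_CLASSES, FEATURE_DIM, rng)
        self.localization = make_weights(4, FEATURE_DIM, rng)
        self.keypoints = make_weights(2 * NUM_KEYPOINTS, FEATURE_DIM, rng)

    def forward(self, features):
        return {
            "class_logits": linear(self.classification, features),
            "box": linear(self.localization, features),        # x1, y1, x2, y2
            "keypoints": linear(self.keypoints, features),     # x, y per key point
        }


features = [0.1 * i for i in range(FEATURE_DIM)]  # stand-in for backbone output
outputs = MultiTaskHeads().forward(features)
print({name: len(values) for name, values in outputs.items()})
```

Because the backbone output is computed once and reused by every branch, swapping a heavier backbone for a lighter one changes only how `features` is produced, not the branches.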

Optionally, the gesture classification branch, the gesture localization branch, and the key point detection branch can each be trained with a different loss function. For example, during training, the cross-entropy loss function can be used to guide the training of the gesture classification branch, the GIoU loss function can be used to guide the training of the gesture localization branch, and the Wing loss function can be used to guide the training of the key point detection branch. Since each branch is trained with a loss function suited to its task, the accuracy of the trained branches can be higher.
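For reference, the three loss functions named above can be written out in a few lines. This is a sketch of the standard formulations, not of the training code of the application; the Wing loss parameters `w=10.0` and `eps=2.0` are assumed defaults, and boxes are assumed to be given as `(x1, y1, x2, y2)` with positive area.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy for one sample, given softmax probabilities."""
    return -math.log(probs[target_index])


def giou_loss(a, b):
    """1 - GIoU for two boxes (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def wing_loss(pred, target, w=10.0, eps=2.0):
    """Mean Wing loss over coordinate pairs; w and eps are assumed values."""
    c = w - w * math.log(1.0 + w / eps)  # makes the two pieces meet at |d| = w
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += w * math.log(1.0 + d / eps) if d < w else d - c
    return total / len(pred)


print(cross_entropy([0.14, 0.85, 0.01], 1))
print(giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)))
print(wing_loss([1.0, 30.0], [0.0, 0.0]))
```

Unlike a plain IoU loss, the GIoU loss still gives a useful signal when the predicted and annotated boxes do not overlap, because the enclosing-box term grows with their separation.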

Optionally, before the gesture recognition model is trained, the sample gesture images can also be augmented, and the augmented sample gesture images are then used to train the gesture recognition model. This makes the sample gesture images more varied, which helps improve the generalization and recognition accuracy of the gesture recognition model. The augmentation may include flipping, rotation, and the like.
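When an annotated image is flipped, its positioning frame and key points have to be transformed in the same way, otherwise the annotation no longer matches the image. The following sketch shows a horizontal flip under the assumption that coordinates are integer pixel indices from 0 to width - 1; the function names are hypothetical.

```python
def hflip_image(image):
    """Mirror an image (a list of pixel rows) about its vertical axis."""
    return [list(reversed(row)) for row in image]


def hflip_annotation(width, box, keypoints):
    """Mirror a box (x1, y1, x2, y2) and key points (x, y) to match hflip_image."""
    x1, y1, x2, y2 = box
    # The old right edge becomes the new left edge, and vice versa.
    flipped_box = (width - 1 - x2, y1, width - 1 - x1, y2)
    flipped_points = [(width - 1 - x, y) for x, y in keypoints]
    return flipped_box, flipped_points


image = [[1, 2, 3, 4]]
box, points = hflip_annotation(4, (0, 0, 1, 0), [(0, 0), (3, 0)])
print(hflip_image(image), box, points)
```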

Optionally, after the above step 102, the method further includes:

Based on the category information, positioning frame information and key point information of the gesture in the target video, mark the gesture category, positioning frame and key points in the target video;

Output a target video marked with the gesture's category, positioning box, and keypoints.

In the embodiment of the present application, after the gesture recognition model outputs the category information, positioning frame information, and key point information of the gesture in the target video, the category, positioning frame, and key points of the gesture can be marked in the target video based on that information. A target video marked with the category, positioning frame, and key points of the gesture is then output and shown to the user. In this target video, the user can see the marked category, positioning frame, and key points of the gesture, which gives the user a more visually striking experience.

Exemplarily, for each gesture image of the target video, the category of the gesture may be marked in the gesture image based on the category information of the gesture in the gesture image, the positioning frame of the gesture may be marked in the gesture image based on the positioning frame information of the gesture in the gesture image, and the key points of the gesture may be marked in the gesture image based on the key point information of the gesture in the gesture image. A gesture image is an image that contains a gesture. It can be understood that no marking is performed on the non-gesture images in the target video, wherein a non-gesture image is an image that does not contain a gesture.
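A simplified sketch of the marking step is given below. It assumes a single-channel image stored as nested lists, integer coordinates inside the image, and a result represented as a dictionary with hypothetical keys `category`, `box`, and `keypoints`; a real implementation would normally draw on colour frames with an imaging library.

```python
def mark_frame(image, result, box_value=1, point_value=2):
    """Return (marked image, category label); non-gesture images stay unmarked."""
    if result is None:
        return image, None
    marked = [row[:] for row in image]  # copy so the input frame is untouched
    x1, y1, x2, y2 = result["box"]
    for x in range(x1, x2 + 1):         # top and bottom edges of the frame
        marked[y1][x] = box_value
        marked[y2][x] = box_value
    for y in range(y1, y2 + 1):         # left and right edges of the frame
        marked[y][x1] = box_value
        marked[y][x2] = box_value
    for x, y in result["keypoints"]:    # key points drawn on top
        marked[y][x] = point_value
    return marked, result["category"]


blank = [[0] * 5 for _ in range(5)]
result = {"category": "palm", "box": (1, 1, 3, 3), "keypoints": [(2, 2)]}
marked, label = mark_frame(blank, result)
for row in marked:
    print(row)
print(label)
```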

In an application scenario, the gesture recognition method provided by the embodiment of the present application can be applied to a robot, and the robot can play rock-paper-scissors with a user by executing the gesture recognition method. Specifically, the robot can recognize in real time whether the user's gesture is rock, scissors, or paper, and then determine which of rock, scissors, and paper the robot itself should present.
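One possible decision rule for such a robot is sketched below. Two points are assumptions rather than statements of the application: the mapping from the preset gesture categories `stone`, `scissor`, and `palm` to the three game moves, and the choice that the robot always answers with the winning move.

```python
# Each key beats its value.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

# Hypothetical mapping from recognized gesture categories to game moves.
CATEGORY_TO_MOVE = {"stone": "rock", "scissor": "scissors", "palm": "paper"}


def robot_move(category):
    """Return the move that beats the user's gesture, or None if it is not a game move."""
    user_move = CATEGORY_TO_MOVE.get(category)
    if user_move is None:
        return None
    return next(move for move, loser in BEATS.items() if loser == user_move)


for category in ("stone", "scissor", "palm", "OK"):
    print(category, "->", robot_move(category))
```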

As can be seen from the above, in the solution of the present application, after a target video containing a gesture is acquired, the target video is input into a trained gesture recognition model to obtain the category information, positioning frame information, and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training on sample gesture images carrying annotation information, and the annotation information includes the category information, positioning frame information, and key point information of the gesture in each sample gesture image. Because the annotation information covers several kinds of gesture information (i.e., category information, positioning frame information, and key point information), the gesture recognition model can implicitly combine these kinds of information while it is being trained, so that the trained gesture recognition model has high accuracy and robustness.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Corresponding to the gesture recognition method proposed above, an embodiment of the present application provides a gesture recognition device. Referring to FIG. 3, the gesture recognition device 300 in the embodiment of the present application includes:

an acquisition unit 301, used to acquire a target video containing gestures;

a recognition unit 302, configured to input the target video into a trained gesture recognition model to obtain the category information, positioning frame information, and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training on sample gesture images carrying annotation information, and the annotation information includes the category information, positioning frame information, and key point information of the gesture in each sample gesture image.

Optionally, the above gesture recognition apparatus 300 further includes:

a marking unit, configured to mark the category, positioning frame and key points of the gesture in the above-mentioned target video based on the category information, positioning frame information and key point information of the gesture in the above-mentioned target video;

The output unit is used for outputting the above-mentioned target video marked with the category of the gesture, the positioning frame and the key points.

Optionally, the above-mentioned marking unit is specifically configured to, for each gesture image of the target video, mark the category of the gesture in the gesture image based on the category information of the gesture in the gesture image, mark the positioning frame of the gesture in the gesture image based on the positioning frame information of the gesture in the gesture image, and mark the key points of the gesture in the gesture image based on the key point information of the gesture in the gesture image, wherein the gesture image is an image containing a gesture.

Optionally, the above gesture recognition model includes a gesture classification branch, a gesture positioning branch and a key point detection branch;

The above-mentioned gesture classification branch is used to output the category information of gestures in the above-mentioned target video;

The above-mentioned gesture positioning branch is used to output the positioning frame information of the gesture in the above-mentioned target video;

The above-mentioned key point detection branch is used to output the key point information of the gesture in the above-mentioned target video.

Optionally, the above-mentioned gesture recognition model further includes a feature extraction layer, which is used to perform feature extraction on the above-mentioned target video to obtain feature information;

The above-mentioned gesture classification branch is specifically configured to output the category information of gestures in the above-mentioned target video based on the above-mentioned feature information;

The above-mentioned gesture positioning branch is specifically configured to output the positioning frame information of the gesture in the above-mentioned target video based on the above-mentioned feature information;

The above-mentioned key point detection branch is specifically configured to output the key point information of the gesture in the above-mentioned target video based on the above-mentioned feature information.

Optionally, the above-mentioned gesture classification branch, the above-mentioned gesture localization branch, and the above-mentioned key point detection branch are respectively obtained by training with different loss functions.

Optionally, the above gesture recognition apparatus 300 further includes:

a normalization unit, which is used to normalize each frame of the above-mentioned target video to obtain a normalized video;

Correspondingly, the above-mentioned recognition unit 302 is specifically configured to input the above-mentioned normalized video into the trained gesture recognition model, and obtain the category information, positioning frame information and key point information of the gesture in the above-mentioned target video.

The embodiment of the present application also provides a smart device, which may be a robot, a mobile phone, a desktop computer, or a tablet computer, which is not limited here. Referring to FIG. 4, the smart device 4 in this embodiment of the present application includes a memory 401, one or more processors 402 (only one is shown in FIG. 4), a binocular camera 403, and a computer program stored in the memory 401 and executable on the processor 402. The binocular camera 403 includes a first camera and a second camera. The memory 401 is used to store software programs and units, and the processor 402 executes various functional applications and data processing by running the software programs and units stored in the memory 401, so as to obtain the resources corresponding to the above preset events. Specifically, the processor 402 implements the following steps by running the above-mentioned computer program stored in the memory 401:

acquiring a target video containing a gesture;

Assuming that the above is the first possible implementation, in a second possible implementation provided on the basis of the first possible implementation, after the target video is input into the trained gesture recognition model to obtain the category information, positioning frame information, and key point information of the gesture in the target video, the processor 402 also implements the following steps by running the above computer program stored in the memory 401:

Based on the category information, positioning frame information and key point information of the gesture in the above target video, the category, positioning frame and key points of the gesture are marked in the above target video;

Output the above target video marked with the category of the gesture, the positioning box and the key points.

In a third possible implementation provided on the basis of the above second possible implementation, marking the category, positioning frame, and key points of the gesture in the target video based on the category information, positioning frame information, and key point information of the gesture in the target video includes:

for each gesture image in the target video, marking the category of the gesture in the gesture image based on the category information of the gesture in the gesture image, marking the positioning frame of the gesture in the gesture image based on the positioning frame information of the gesture in the gesture image, and marking the key points of the gesture in the gesture image based on the key point information of the gesture in the gesture image, wherein the gesture image is an image containing a gesture.

In a fourth possible implementation manner provided on the basis of the above-mentioned first possible implementation manner, the above-mentioned gesture recognition model includes a gesture classification branch, a gesture localization branch, and a key point detection branch;

In the fifth possible implementation manner provided on the basis of the above-mentioned fourth possible implementation manner, the above-mentioned gesture recognition model further includes a feature extraction layer, which is used to perform feature extraction on the above-mentioned target video to obtain characteristic information;

In the sixth possible implementation manner provided based on the fourth possible implementation manner, the gesture classification branch, the gesture localization branch, and the key point detection branch are respectively obtained by training with different loss functions.

In a seventh possible implementation provided on the basis of the above first, second, third, fourth, fifth, or sixth possible implementation, before the target video is input into the trained gesture recognition model, the processor 402 also implements the following steps by running the above-mentioned computer program stored in the memory 401:

Normalize each frame of the target video to obtain a normalized video;

Correspondingly, inputting the above-mentioned target video into the trained gesture recognition model to obtain the category information, positioning frame information and key point information of the gesture in the above-mentioned target video includes:

The above normalized video is input into the trained gesture recognition model, and the category information, positioning frame information and key point information of the gesture in the target video are obtained.
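The normalization step and the subsequent model call can be sketched as follows. The application does not state the normalization scheme; scaling 8-bit pixel values into the range [0, 1] is an assumption made here, and `normalize_frame`, `normalize_video`, `recognize`, and the `model` callable are hypothetical names.

```python
from typing import Any, Callable, List, Sequence

Frame = List[List[float]]  # one grayscale frame as rows of pixel values (assumed layout)


def normalize_frame(frame: Sequence[Sequence[float]], max_value: float = 255.0) -> Frame:
    """Scale every pixel of one frame into [0, 1] (assumed normalization scheme)."""
    return [[pixel / max_value for pixel in row] for row in frame]


def normalize_video(video: Sequence[Sequence[Sequence[float]]]) -> List[Frame]:
    """Normalize each frame of the target video to obtain the normalized video."""
    return [normalize_frame(frame) for frame in video]


def recognize(video: Sequence[Sequence[Sequence[float]]],
              model: Callable[[List[Frame]], Any]) -> Any:
    """Feed the normalized video, not the raw video, into the trained model."""
    return model(normalize_video(video))
```

Any callable that accepts the list of normalized frames can be passed as `model`, so the sketch stays independent of how the trained gesture recognition model is implemented.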

It should be understood that, in this embodiment of the present application, the processor 402 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 401 may include read-only memory and random access memory, and provides instructions and data to the processor 402. Part or all of the memory 401 may also include non-volatile random access memory. For example, the memory 401 may also store information about the device type.

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the above device is divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit, and the above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of external device software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are only illustrative. The division of the above-mentioned modules or units is only a logical function division, and in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

If the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer-readable memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable storage media exclude electrical carrier signals and telecommunication signals.

The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application, and should be included within the protection scope of the present application.

Claims

A gesture recognition method, comprising:

acquiring a target video containing a gesture;

inputting the target video into a trained gesture recognition model to obtain category information, positioning frame information and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training with sample gesture images carrying annotation information, and the annotation information includes category information, positioning frame information and key point information of the gesture in the sample gesture image.
The gesture recognition method according to claim 1, wherein after inputting the target video into the trained gesture recognition model to obtain the category information, positioning frame information and key point information of the gesture in the target video, the method further comprises:

marking the category, positioning frame and key points of the gesture in the target video based on the category information, positioning frame information and key point information of the gesture in the target video;

outputting the target video marked with the category, positioning frame and key points of the gesture.
The gesture recognition method according to claim 2, wherein marking the category, positioning frame and key points of the gesture in the target video based on the category information, positioning frame information and key point information of the gesture in the target video comprises:

for each frame of gesture image in the target video, marking the category of the gesture in the gesture image based on the category information of the gesture in the gesture image, marking the positioning frame of the gesture in the gesture image based on the positioning frame information of the gesture in the gesture image, and marking the key points of the gesture in the gesture image based on the key point information of the gesture in the gesture image, wherein the gesture image is an image containing the gesture.
The gesture recognition method according to claim 1, wherein the gesture recognition model comprises a gesture classification branch, a gesture localization branch and a key point detection branch;

The gesture classification branch is used to output category information of gestures in the target video;

The gesture positioning branch is used to output the positioning frame information of the gesture in the target video;

The key point detection branch is used to output the key point information of the gesture in the target video.
The gesture recognition method according to claim 4, wherein the gesture recognition model further comprises a feature extraction layer for performing feature extraction on the target video to obtain feature information;

The gesture classification branch is specifically configured to output category information of gestures in the target video based on the feature information;

The gesture positioning branch is specifically configured to output the positioning frame information of the gesture in the target video based on the feature information;

The key point detection branch is specifically configured to output the key point information of the gesture in the target video based on the feature information.
The gesture recognition method according to claim 4, wherein the gesture classification branch, the gesture localization branch and the key point detection branch are respectively obtained by training with different loss functions.
The gesture recognition method according to any one of claims 1-6, wherein before inputting the target video into the trained gesture recognition model, the method further comprises:

normalizing each frame of the target video to obtain a normalized video;

correspondingly, inputting the target video into the trained gesture recognition model to obtain the category information, positioning frame information and key point information of the gesture in the target video comprises:

inputting the normalized video into the trained gesture recognition model to obtain the category information, positioning frame information and key point information of the gesture in the target video.
A gesture recognition device, comprising:

an acquisition unit for acquiring a target video containing gestures;

a recognition unit for inputting the target video into a trained gesture recognition model to obtain category information, positioning frame information and key point information of the gesture in the target video, wherein the gesture recognition model is obtained by training with sample gesture images carrying annotation information, and the annotation information includes category information, positioning frame information and key point information of the gesture in the sample gesture image.
A smart device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method according to any one of claims 1 to 7 is implemented.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.