CN110796096A - Training method, device, equipment and medium for gesture recognition model - Google Patents

Training method, device, equipment and medium for gesture recognition model

Info

Publication number
CN110796096A
Authority
CN
China
Prior art keywords
training
network
data
gesture
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911047039.8A
Other languages
Chinese (zh)
Other versions
CN110796096B (en)
Inventor
胡玉坤
刘裕峰
郑文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Reach Best Technology Co Ltd
Original Assignee
Reach Best Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Reach Best Technology Co Ltd filed Critical Reach Best Technology Co Ltd
Priority to CN201911047039.8A priority Critical patent/CN110796096B/en
Publication of CN110796096A publication Critical patent/CN110796096A/en
Application granted granted Critical
Publication of CN110796096B publication Critical patent/CN110796096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The disclosure relates to a training method, apparatus, device and medium for a gesture recognition model, intended to solve the problems in the related art that gesture recognition accuracy is low and user experience suffers. The training method of the gesture recognition model comprises the following steps: obtaining gesture key point data, hand data and background data by annotation from pre-acquired picture samples containing gestures; training an initial gesture recognition model by using the gesture key point data to obtain a backbone network for recognizing the gesture key points, wherein the initial gesture recognition model comprises an encoder network and a decoder network; and extracting the encoder network from the backbone network, and performing secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, wherein the secondary training comprises multi-classification training, bounding box regression training and binary classification training.

Description

Training method, device, equipment and medium for gesture recognition model
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a method, an apparatus, a device, and a medium for training a gesture recognition model.
Background
Gesture Detection and Tracking (Hand Gesture Detection and Tracking) is a classic topic in the field of computer vision. Its main tasks are to detect and classify gesture information appearing in a frame-by-frame video stream through a Convolutional Neural Network (CNN), to accurately regress the position of the bounding box of the hand, to perform binary classification on whether a hand exists in the picture, and to output the confidence probabilities of all classes detected in the picture as the detection result.
However, in the prior art, all the data are processed directly and the neural network is then trained with three branches simultaneously, so the data proportions are difficult to control and the trained neural network converges poorly. Meanwhile, gesture classification accuracy is affected by individual differences between users, and some gestures are similar, easily confused, and hard to recognize accurately; hand classification also introduces some training error, since some non-hand samples are very similar to hands in color and texture.
Disclosure of Invention
The present disclosure provides a training method, apparatus, device and medium for a gesture recognition model, so as to at least solve the problem in the related art that the accuracy of gesture recognition is low and user experience is affected. The technical solution of the present disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a training method for a gesture recognition model, including:
obtaining gesture key point data, hand data and background data by annotation from pre-acquired picture samples containing gestures;
training an initial gesture recognition model by using the gesture key point data to obtain a backbone network for recognizing the gesture key points, wherein the initial gesture recognition model comprises an encoder network and a decoder network;
and extracting the encoder network from the backbone network, and performing secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, wherein the secondary training comprises multi-classification training, bounding box regression training and binary classification training.
In one possible embodiment of the method provided by the present disclosure, when the secondary training is multi-classification training, performing secondary training on the encoder network by using the hand data and the background data includes:
connecting a multi-classification branch network behind the decoder network;
and training the multi-classification branch network by using hand data to obtain the multi-classification recognition branch network.
In one possible embodiment of the method provided by the present disclosure, when the secondary training is bounding box regression training, performing secondary training on the encoder network by using the hand data and the background data includes:
connecting a bounding box regression branch network behind the decoder network;
and training the bounding box regression branch network by using hand data to obtain the bounding box regression recognition branch network.
In one possible embodiment of the method provided by the present disclosure, when the secondary training includes multi-classification training and bounding box regression training, performing secondary training on the encoder network by using the hand data and the background data includes:
respectively connecting a multi-classification branch network and a bounding box regression branch network behind the decoder network;
and training the multi-classification branch network and the bounding box regression branch network by using the hand data and the background data to obtain the multi-classification recognition branch network and the bounding box regression recognition branch network.
In one possible embodiment of the method provided by the present disclosure, when the secondary training is binary classification training, performing secondary training on the encoder network by using the hand data and the background data includes:
connecting a binary classification branch network behind the decoder network;
and training the binary classification branch network by using the hand data and the background data to obtain the binary classification recognition branch network.
In one possible embodiment of the method provided by the present disclosure, training the initial gesture recognition model by using the gesture key point data and the background data includes:
setting the key point supervision information of the background data to all zeros, and training the initial gesture recognition model by using the gesture key point data together with the background data whose key point supervision information has been set to zero, wherein the background data is in the same proportion as the gesture key point data.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a gesture recognition model, including:
the acquisition unit is configured to acquire gesture key point data, hand data and background data from a pre-acquired picture sample containing a gesture through annotation;
the first training unit is configured to train an initial gesture recognition model by using the gesture key point data and the background data to obtain a backbone network for recognizing the gesture key points, wherein the initial gesture recognition model comprises an encoder network and a decoder network;
and the second training unit is configured to extract the encoder network from the backbone network, perform secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, and the secondary training comprises multi-classification training, bounding box regression training and binary classification training.
In one possible embodiment, the present disclosure provides an apparatus wherein the second training unit is specifically configured to:
when the secondary training is multi-classification training, connecting a multi-classification branch network behind a decoder network;
and training the multi-classification branch network by using hand data to obtain the multi-classification recognition branch network.
In one possible embodiment, the present disclosure provides an apparatus wherein the second training unit is specifically configured to:
when the secondary training is bounding box regression training, connecting a bounding box regression branch network behind the decoder network;
and training the bounding box regression branch network by using hand data to obtain the bounding box regression recognition branch network.
In one possible embodiment, the present disclosure provides an apparatus wherein the second training unit is specifically configured to:
when the secondary training comprises multi-classification training and bounding box regression training, respectively connecting a multi-classification branch network and a bounding box regression branch network behind a decoder network;
and training the multi-classification branch network and the bounding box regression branch network by using the hand data and the background data to obtain the multi-classification recognition branch network and the bounding box regression recognition branch network.
In one possible embodiment, the present disclosure provides an apparatus wherein the second training unit is specifically configured to:
when the secondary training is binary classification training, connecting a binary classification branch network behind the decoder network;
and training the binary classification branch network by using the hand data and the background data to obtain the binary classification recognition branch network.
In a possible implementation manner, the present disclosure provides an apparatus, wherein the obtaining unit is specifically configured to:
setting the key point supervision information of the background data to all zeros, and training the initial gesture recognition model by using the gesture key point data together with the background data whose key point supervision information has been set to zero, wherein the background data is in the same proportion as the gesture key point data.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the gesture recognition model according to any one of the first aspect of the embodiments of the disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising: a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method for training a gesture recognition model according to any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the training method of a gesture recognition model according to any one of the first aspect of the embodiments of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
gesture key point data, hand data and background data are obtained by annotation from pre-acquired picture samples containing gestures; an initial gesture recognition model is trained with the gesture key point data to obtain a backbone network for recognizing gesture key points; the encoder network is then extracted from the backbone network and trained a second time with the hand data and the background data to obtain the gesture recognition model. Compared with prior-art training methods for gesture recognition models, the gesture key point samples are used for pre-training, and the multi-classification, binary classification and bounding box regression branches are then trained on the basis of the pre-training result, which improves the recognition speed and accuracy of the gesture recognition model and enhances user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating locations of hand keypoints for a gesture recognition model, according to an exemplary embodiment;
FIG. 2 is a schematic illustration of a six-gesture of a gesture recognition model shown in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating an ok gesture of a gesture recognition model in accordance with an exemplary embodiment;
FIG. 4 is a schematic flow chart diagram illustrating a method of training a gesture recognition model in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an arrangement of a training apparatus for a gesture recognition model according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating the structure of an electronic device in accordance with one illustrative embodiment;
fig. 7 is a schematic structural diagram illustrating a terminal to which a training method of a gesture recognition model is applied according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Some of the words that appear in the text are explained below:
1. A Convolutional Neural Network (CNN) is a type of feedforward neural network in which artificial neurons respond to units within a local receptive field; it performs excellently in the field of image processing.
2. The Intersection over Union (IoU) is the intersection area of two regions divided by the union area of the two regions.
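As a quick illustration of this definition, the sketch below computes IoU for two axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption for illustration, not something the present disclosure specifies.

```python
# Minimal IoU sketch: intersection area divided by union area of two boxes.
# Box format (x1, y1, x2, y2) with x1 < x2 and y1 < y2 is assumed here.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, iou((0, 0, 10, 10), (5, 5, 15, 15)) returns 25 / 175 ≈ 0.143.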
3. Gesture key point detection means taking an RGB image containing a human hand as input and outputting the positions of 21 hand key points by using a deep learning algorithm; FIG. 1 shows the positions of these 21 hand key points.
4. Gesture Detection and Tracking (Hand Detection and Tracking) is a classic subject in the field of computer vision. Its main tasks are to detect and classify gesture information appearing in a frame-by-frame video stream through a Convolutional Neural Network (CNN), to accurately regress the position of the bounding box of the hand, to judge whether a hand exists in the picture, and to output the confidence probability of each category detected in the picture as the detection result.
5. The following are examples of the different gesture types that the gesture recognition model used in the present disclosure can distinguish, specifically:
(1) five: a five gesture, with all five fingers open;
(2) heart: a single-handed heart gesture;
(3) thumbs-up: a thumb-up gesture;
(4) 666: a six gesture;
(5) lift: a lift gesture;
(6) victory: a scissor-hand (V) gesture;
(7) pointer: a forefinger-pointing gesture;
(8) heart2: a two-handed heart gesture;
(9) ok: an ok gesture;
(10) fist: a fist gesture;
(11) eight: an eight gesture.
FIG. 2 and FIG. 3 illustrate the six gesture and the ok gesture among the above gestures, respectively.
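For illustration, the eleven gesture types above can be represented as a label map for the multi-classification branch; the sketch below is hypothetical — the index order and string labels are not prescribed by the present disclosure.

```python
# Hypothetical class-index mapping for the eleven gesture types listed above.
GESTURE_CLASSES = {
    0: "five",       # five fingers open
    1: "heart",      # single-handed heart
    2: "thumbs-up",  # thumb tilted up
    3: "666",        # six gesture
    4: "lift",
    5: "victory",    # scissor hand
    6: "pointer",    # forefinger pointing
    7: "heart2",     # two-handed heart
    8: "ok",
    9: "fist",
    10: "eight",
}
NUM_GESTURE_CLASSES = len(GESTURE_CLASSES)  # 11 outputs for the multi-classification branch
```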
In the prior art, the three branches of gesture classification, bounding box regression and hand classification are trained simultaneously; the data proportions are difficult to control, and a neural network trained in this way converges poorly. Gesture classification accuracy is affected by individual differences between users, and some gestures are similar, easily confused and hard to recognize accurately; hand classification also introduces some training error, since some non-hand samples are very similar to hands in color and texture.
In the traditional neural network training method, all data are processed directly and the neural network is then trained on them. In the method of the present disclosure, the network is first trained with gesture key point data, and multi-classification training, binary classification training and bounding box regression training are then performed on the network respectively.
The application scenarios described in the embodiments of the present disclosure are intended to illustrate the technical solutions of the embodiments more clearly and do not limit the technical solutions provided therein; as those of ordinary skill in the art will appreciate, the technical solutions provided in the embodiments of the present disclosure are equally applicable to similar technical problems as new application scenarios emerge. In the description of the present disclosure, unless otherwise indicated, "plurality" means two or more.
The training method of the gesture recognition model provided by the embodiment of the disclosure is shown in fig. 4, and the specific technical scheme is as follows.
Step S1, obtaining gesture key point data, hand data and background data by annotation from pre-acquired picture samples containing gestures.
In specific implementation, a plurality of picture samples containing gestures are obtained in advance, and gesture key point data, hand data and background data are obtained from the picture samples by manual or machine annotation. The gesture key point data are used for recognizing gesture key points, and the hand data and the background data are used for the secondary training.
It should be noted that the secondary training in step S3 includes multi-classification training, binary classification training and bounding box regression training. The multi-classification training and the bounding box regression training need only hand data, while the binary classification training also needs background data. Therefore, to let the network adapt to all data distributions and keep the format of the training data parameters consistent, background data in the same proportion as the gesture key point data is added when training the backbone network on the gesture key points, and the key point supervision information of the background data is set to all zeros to avoid interfering with gesture key point recognition. Of course, other data processing methods may also be adopted, which is not limited in the embodiments of the present disclosure.
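The data mixing just described can be sketched as follows; the per-sample dictionary structure and the 21-key-point count are assumptions for illustration (the patent specifies zeroed key point supervision and an equal proportion, not a data format).

```python
import random

# Sketch: add background samples in the same proportion as the gesture key
# point samples, with all key point supervision set to zero so the background
# does not interfere with key point recognition. Assumes there are at least
# as many background images as key point samples.
def build_pretraining_set(keypoint_samples, background_images, num_keypoints=21):
    zero_supervision = [(0.0, 0.0)] * num_keypoints
    background_samples = [
        {"image": img, "keypoints": zero_supervision}
        for img in random.sample(background_images, len(keypoint_samples))
    ]
    return keypoint_samples + background_samples
```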
Step S2, training an initial gesture recognition model by using the gesture key point data to obtain a backbone network for recognizing the gesture key points, wherein the initial gesture recognition model comprises an encoder network and a decoder network.
In specific implementation, since the gesture key points are used only to train and obtain the backbone network, the requirement on gesture key point accuracy is low. The criterion for judging whether a gesture key point is correct can be set as the distance between the key point position in the gesture key point annotation and the key point position computed by the gesture recognition model. In the prior art, gesture key point training generally uses 3 pixels as the determination standard; the determination standard in the present disclosure may be slightly looser than this value, for example 5 pixels or 6 pixels, which is not limited in the embodiments of the present disclosure.
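A minimal sketch of this correctness criterion, assuming key points are given as (x, y) pixel coordinates:

```python
# A predicted key point counts as correct when it lies within `threshold`
# pixels of the annotated position; 5 px here, looser than the usual 3 px.
def keypoint_accuracy(pred_points, gt_points, threshold=5.0):
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred_points, gt_points)
        if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= threshold
    )
    return correct / len(gt_points)
```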
Compared with multi-classification training, bounding box regression training and binary classification training, gesture key point training requires a stronger ability of the neural network to extract picture features. To train well, the depth and parameter count of the neural network need to be increased to some extent compared with the prior art. In specific implementation, an additional decoder network can be connected in the initial gesture recognition model, i.e., an encoder-decoder network structure is adopted to increase the computation and parameter capacity of the neural network; after the backbone network is trained, the additionally connected decoder network is removed, so the size and computation of the finally trained gesture recognition model are unchanged.
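A minimal PyTorch sketch of this encoder-decoder arrangement is given below; the layer sizes and the heatmap-style key point output are assumptions for illustration, not the patent's architecture.

```python
import torch.nn as nn

class KeypointBackbone(nn.Module):
    """Encoder-decoder used only for key point pre-training; the decoder is
    discarded afterwards, so the deployed model keeps only the encoder."""
    def __init__(self, num_keypoints=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # extra capacity during pre-training only
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 1),  # one heatmap per key point
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

After pre-training, `model.encoder` alone is carried forward, so the removed decoder adds nothing to the size or computation of the final model.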
Step S3, extracting the encoder network from the backbone network, and performing secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, wherein the secondary training comprises multi-classification training, bounding box regression training and binary classification training.
In specific implementation, after the backbone network is obtained, the encoder network is taken out of the backbone network and its parameters are fixed, and three network branches are connected behind it, namely a binary classification network branch, a multi-classification network branch and a bounding box regression network branch.
After the network branches are connected, the multi-classification branch network is trained with hand data to obtain the multi-classification recognition branch network; the bounding box regression branch network is trained with hand data to obtain the bounding box regression recognition branch network; and the binary classification branch network is trained with hand data and background data to obtain the binary classification recognition branch network. The multi-classification recognition branch network, the bounding box regression recognition branch network and the binary classification recognition branch network together form the gesture recognition model.
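The branch arrangement described above can be sketched as follows; the frozen encoder, the 128-channel feature width and the pooled linear heads are illustrative assumptions that match the backbone sketch earlier, not the patent's exact structure.

```python
import torch
import torch.nn as nn

class GestureRecognitionModel(nn.Module):
    """Pre-trained encoder with three branch heads: multi-classification,
    bounding box regression, and binary (hand / no hand) classification."""
    def __init__(self, encoder, num_classes=11, feat_channels=128):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # parameters taken from the backbone are fixed
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.multiclass_head = nn.Linear(feat_channels, num_classes)  # gesture type
        self.bbox_head = nn.Linear(feat_channels, 4)                  # bounding box
        self.binary_head = nn.Linear(feat_channels, 1)                # hand present?

    def forward(self, x):
        feat = self.pool(self.encoder(x)).flatten(1)
        return (self.multiclass_head(feat),
                self.bbox_head(feat),
                torch.sigmoid(self.binary_head(feat)))
```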
It should be noted that the secondary training includes multi-classification training, binary classification training and bounding box regression training. Since the multi-classification training and the bounding box regression training can be completed with hand data alone, while the binary classification training also needs background data, the multi-classification training and the bounding box regression training can be performed simultaneously to save computation. Of course, the multi-classification training, binary classification training and bounding box regression training may also be performed step by step, or any two or all three of them may be performed simultaneously, which is not limited in the embodiments of the present disclosure.
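Training the multi-classification and bounding box regression branches together on hand data might then use a combined loss such as the sketch below; the equal loss weighting is an assumption (the binary classification branch, which also needs background data, can be trained separately with a binary cross-entropy loss).

```python
import torch.nn.functional as F

# Joint loss for the two branches that need only hand data: cross-entropy for
# the gesture class plus smooth L1 for the bounding box coordinates.
def multiclass_bbox_loss(class_logits, bbox_pred, class_target, bbox_target,
                         bbox_weight=1.0):
    cls_loss = F.cross_entropy(class_logits, class_target)
    box_loss = F.smooth_l1_loss(bbox_pred, bbox_target)
    return cls_loss + bbox_weight * box_loss
```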
In the present disclosure, the backbone network is generated by pre-training. Because the gesture key points are pre-trained, the spatial information in the gesture recognition model is strong, so the Intersection over Union (IoU) of bounding box regression recognition improves by nearly 3 percentage points, and the accuracy of binary classification recognition improves by 1 percentage point. With the same computation and parameter count, the stability and accuracy of bounding box regression recognition are greatly improved, and the robustness of binary classification recognition to abnormal samples is significantly improved.
As shown in fig. 5, a schematic structural diagram of a training apparatus for a gesture recognition model provided in an embodiment of the present disclosure includes:
an obtaining unit 501 configured to perform obtaining gesture key point data, hand data and background data by annotation from a pre-obtained picture sample containing a gesture;
a first training unit 502 configured to perform training of an initial gesture recognition model by using the gesture key point data and the background data to obtain a backbone network for recognizing the gesture key points, where the initial gesture recognition model includes an encoder network and a decoder network;
and a second training unit 503 configured to perform extraction of the encoder network from the backbone network, and perform secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, where the secondary training includes multi-class training, bounding box regression training, and binary training.
In one possible implementation, the present disclosure provides an apparatus, wherein the second training unit 503 is specifically configured to:
when the secondary training is multi-classification training, connecting a multi-classification branch network behind a decoder network;
and training the multi-classification branch network by using hand data to obtain the multi-classification recognition branch network.
In one possible implementation, the present disclosure provides an apparatus, wherein the second training unit 503 is specifically configured to:
when the secondary training is bounding box regression training, connecting a bounding box regression branch network behind the decoder network;
and training the bounding box regression branch network by using hand data to obtain a bounding box regression recognition branch network.
In one possible implementation, the present disclosure provides an apparatus, wherein the second training unit 503 is specifically configured to:
when the secondary training comprises multi-classification training and bounding box regression training, respectively connecting a multi-classification branch network and a bounding box regression branch network behind a decoder network;
and training the multi-classification branch network and the bounding box regression branch network by using the hand data and the background data to obtain the multi-classification recognition branch network and the bounding box regression recognition branch network.
In one possible implementation, the present disclosure provides an apparatus, wherein the second training unit 503 is specifically configured to:
when the secondary training is binary classification training, connecting a binary classification branch network behind the decoder network;
and training the binary classification branch network by using the hand data and the background data to obtain the binary classification recognition branch network.
In one possible implementation, the present disclosure provides an apparatus, wherein the obtaining unit 501 is specifically configured to:
setting the key point supervision information of the background data to all zeros, and training the initial gesture recognition model by using the gesture key point data together with the background data whose key point supervision information has been set to zero, wherein the background data is in the same proportion as the gesture key point data.
FIG. 6 is a block diagram of an electronic device, namely a training device 600 for a gesture recognition model, according to an exemplary embodiment. The device includes:
A processor 610;
a memory 620 for storing instructions executable by the processor 610;
wherein the processor 610 is configured to execute the instructions to implement the training method of the gesture recognition model in the embodiments of the present disclosure.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 620 comprising instructions, executable by the processor 610 of the device 600 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In the embodiment of the present disclosure, as shown in FIG. 7, a terminal 700 to which the training method of the gesture recognition model provided in the embodiment of the present disclosure is applied includes: Radio Frequency (RF) circuit 710, power supply 720, processor 730, memory 770, input unit 750, display unit 760, camera 770, communication interface 780, and Wireless Fidelity (Wi-Fi) module 790. Those skilled in the art will appreciate that the terminal structure shown in FIG. 7 is not limiting, and the terminal provided by the embodiments of the present application may include more or fewer components than shown, combine some components, or arrange the components differently.
The following describes the various components of the terminal 700 in detail with reference to fig. 7:
the RF circuit 710 may be used for receiving and transmitting data during a communication or conversation. Specifically, the RF circuit 710 sends the downlink data of the base station to the processor 730 for processing after receiving the downlink data; and in addition, sending the uplink data to be sent to the base station. Generally, the RF circuit 710 includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like.
In addition, the RF circuit 710 may also communicate with a network and other terminals through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The Wi-Fi technology belongs to a short-distance wireless transmission technology, and the terminal 700 may connect to an Access Point (AP) through a Wi-Fi module 790, thereby implementing Access to a data network. The Wi-Fi module 790 may be used for receiving and transmitting data during communication.
The terminal 700 may be physically connected to other terminals through the communication interface 780. Optionally, the communication interface 780 is connected to the communication interfaces of the other terminals through a cable, so as to implement data transmission between the terminal 700 and the other terminals.
In the embodiment of the present application, the terminal 700 can implement a communication service and send information to other contacts, so the terminal 700 needs to have a data transmission function, that is, the terminal 700 needs to include a communication module inside. Although fig. 7 illustrates communication modules such as the RF circuit 710, the Wi-Fi module 790, and the communication interface 780, it is to be understood that at least one of the above-described components or other communication modules (e.g., a bluetooth module) for implementing communication may be present in the terminal 700 for data transmission.
For example, when the terminal 700 is a mobile phone, the terminal 700 may include the RF circuit 710 and may further include the Wi-Fi module 790; when the terminal 700 is a computer, the terminal 700 may include the communication interface 780 and may further include the Wi-Fi module 790; and when the terminal 700 is a tablet computer, the terminal 700 may include the Wi-Fi module 790.
The memory 770 may be used to store software programs and modules. The processor 730 executes various functional applications and data processing of the terminal 700 by executing software programs and modules stored in the memory 770, and part or all of the processes in fig. 4 of the embodiments of the present disclosure can be implemented when the processor 730 executes program codes in the memory 770.
Alternatively, the memory 770 may mainly include a program storage area and a data storage area. The storage program area can store an operating system, various application programs (such as communication applications), a training module of a gesture recognition model and the like; the storage data area may store data created according to the use of the terminal (such as various multimedia files like pictures, video files, etc., and training information templates of gesture recognition models), etc.
Further, the memory 770 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 750 may be used to receive numeric or character information input by a user and generate key signal inputs related to user settings and function control of the terminal 700.
Alternatively, the input unit 750 may include a touch panel 751 and other input terminals 752.
The touch panel 751, also referred to as a touch screen, can collect touch operations by a user (such as operations performed on or near the touch panel 751 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 751 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 730, and can receive and execute commands sent by the processor 730. In addition, the touch panel 751 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave.
Optionally, the other input terminals 752 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 760 may be used to display information input by a user or information provided to the user and various menus of the terminal 700. The display unit 760 is a display system of the terminal 700, and is configured to present an interface to implement human-computer interaction.
The display unit 760 may include a display panel 761. Alternatively, the Display panel 761 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-emitting diode (OLED), or the like.
Further, the touch panel 751 can cover the display panel 761; when the touch panel 751 detects a touch operation on or near it, the operation is transmitted to the processor 730 to determine the type of the touch event, and the processor 730 then provides a corresponding visual output on the display panel 761 according to the type of the touch event.
Although in fig. 7, the touch panel 751 and the display panel 761 are implemented as two separate components to implement the input and output functions of the terminal 700, in some embodiments, the touch panel 751 and the display panel 761 can be integrated to implement the input and output functions of the terminal 700.
The processor 730 is a control center of the terminal 700, connects various components using various interfaces and lines, performs various functions of the terminal 700 and processes data by operating or executing software programs and/or modules stored in the memory 770 and calling data stored in the memory 770, thereby implementing various services based on the terminal.
Optionally, the processor 730 may include one or more processing units. Optionally, the processor 730 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, and the like, and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 730.
The camera 770 is used for implementing a shooting function of the terminal 700 and shooting pictures or videos. The camera 770 may also be used to implement a scanning function of the terminal 700, and scan a scanned object (two-dimensional code/barcode).
The terminal 700 also includes a power supply 720, such as a battery, for powering the various components. Optionally, the power supply 720 may be logically connected to the processor 730 through a power management system, so as to implement functions of managing charging, discharging, power consumption, and the like through the power management system.
It is noted that the processor 730 of the embodiments of the present disclosure can perform the functions of the processor 610 in FIG. 6, and the memory 770 can store the contents stored in the memory 620.
In addition, in an exemplary embodiment, the present disclosure also provides a storage medium, and when instructions in the storage medium are executed by a processor of the electronic device, the electronic device is enabled to implement the training method for the gesture recognition model in the embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a gesture recognition model is characterized by comprising the following steps:
obtaining gesture key point data, hand data and background data by annotation from pre-acquired picture samples containing gestures;
training an initial gesture recognition model by using the gesture key point data to obtain a backbone network for recognizing the gesture key points, wherein the initial gesture recognition model comprises an encoder network and a decoder network;
and extracting the encoder network from the backbone network, and performing secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, wherein the secondary training comprises multi-classification training, bounding box regression training and binary classification training.
2. The method of claim 1, wherein, when the secondary training is multi-classification training, performing secondary training on the encoder network by using the hand data and the background data comprises:
connecting a multi-classification branch network behind the decoder network;
and training the multi-classification branch network by using the hand data to obtain the multi-classification recognition branch network.
3. The method of claim 1, wherein, when the secondary training is bounding box regression training, performing secondary training on the encoder network by using the hand data and the background data comprises:
connecting a bounding box regression branch network behind the decoder network;
and training the bounding box regression branch network by using the hand data to obtain a bounding box regression recognition branch network.
4. The method of claim 1, wherein, when the secondary training comprises multi-classification training and bounding box regression training, performing secondary training on the encoder network by using the hand data and the background data comprises:
respectively connecting a multi-classification branch network and a bounding box regression branch network behind the decoder network;
and training the multi-classification branch network and the bounding box regression branch network by using the hand data and the background data to obtain a multi-classification recognition branch network and a bounding box regression recognition branch network.
5. The method of claim 1, wherein, when the secondary training is binary classification training, performing secondary training on the encoder network by using the hand data and the background data comprises:
connecting a binary classification branch network behind the decoder network;
and training the binary classification branch network by using the hand data and the background data to obtain the binary classification recognition branch network.
6. The method of claim 1, wherein training the initial gesture recognition model by using the gesture key point data and the background data comprises:
setting the key point supervision information of the background data to all zeros, and training the initial gesture recognition model by using the gesture key point data together with the background data whose key point supervision information has been set to zero, wherein the background data is in the same proportion as the gesture key point data.
7. A device for training a gesture recognition model, comprising:
the acquisition unit is configured to acquire gesture key point data, hand data and background data from a pre-acquired picture sample containing a gesture through annotation;
a first training unit, configured to perform training on an initial gesture recognition model by using the gesture key point data and the background data to obtain a backbone network for recognizing the gesture key points, where the initial gesture recognition model includes an encoder network and a decoder network;
and the second training unit is configured to extract the encoder network from the backbone network, perform secondary training on the encoder network by using the hand data and the background data to obtain a gesture recognition model, wherein the secondary training comprises multi-classification training, bounding box regression training and binary classification training.
8. The apparatus according to claim 7, wherein the obtaining unit is specifically configured to:
setting the key point supervision information of the background data to all zeros, and training the initial gesture recognition model by using the gesture key point data together with the background data whose key point supervision information has been set to zero, wherein the background data is in the same proportion as the gesture key point data.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a method of training a gesture recognition model according to any one of claims 1 to 6.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform a method of training a gesture recognition model according to any one of claims 1 to 6.
CN201911047039.8A 2019-10-30 2019-10-30 Training method, device, equipment and medium for gesture recognition model Active CN110796096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911047039.8A CN110796096B (en) 2019-10-30 2019-10-30 Training method, device, equipment and medium for gesture recognition model

Publications (2)

Publication Number Publication Date
CN110796096A true CN110796096A (en) 2020-02-14
CN110796096B CN110796096B (en) 2023-01-24

Family

ID=69442277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911047039.8A Active CN110796096B (en) 2019-10-30 2019-10-30 Training method, device, equipment and medium for gesture recognition model

Country Status (1)

Country Link
CN (1) CN110796096B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461001A (en) * 2020-03-31 2020-07-28 桂林电子科技大学 Computer vision automatic door opening method and system
WO2022174605A1 (en) * 2021-02-21 2022-08-25 深圳市优必选科技股份有限公司 Gesture recognition method, gesture recognition apparatus, and smart device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus
CN107909145A (en) * 2017-12-05 2018-04-13 苏州天瞳威视电子科技有限公司 A kind of training method of convolutional neural networks model
CN108229318A (en) * 2017-11-28 2018-06-29 北京市商汤科技开发有限公司 The training method and device of gesture identification and gesture identification network, equipment, medium
CN108427920A (en) * 2018-02-26 2018-08-21 杭州电子科技大学 A kind of land and sea border defense object detection method based on deep learning
EP3493105A1 (en) * 2017-12-03 2019-06-05 Facebook, Inc. Optimizations for dynamic object instance detection, segmentation, and structure mapping
CN109858524A (en) * 2019-01-04 2019-06-07 北京达佳互联信息技术有限公司 Gesture identification method, device, electronic equipment and storage medium
CN110210416A (en) * 2019-06-05 2019-09-06 中国科学技术大学 Based on the decoded sign Language Recognition optimization method and device of dynamic pseudo label

Also Published As

Publication number Publication date
CN110796096B (en) 2023-01-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant