CN117523654A - Gesture recognition model training method and related device


Info

Publication number
CN117523654A
Authority
CN
China
Prior art keywords: gesture, video, sample, recognition model, recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210907224.5A
Other languages
Chinese (zh)
Inventor
杨伟明
唐惠忠
王少鸣
郭润增
卢鑫畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210907224.5A
Publication of CN117523654A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a gesture recognition model training method and a related device. A sample gesture video has a corresponding sample gesture type. A plurality of video frame images corresponding to the sample gesture video are collected based on a preset time interval, sample data corresponding to the sample gesture video are generated based on the plurality of video frame images, and the sample gesture type serves as the label corresponding to the sample data. A gesture recognition model can then be trained through the sample data and used to recognize the gesture type corresponding to a gesture video to be recognized. In this way, the gesture recognition model gains the capability of recognizing the gesture type of a gesture video through only the partial video frames extracted from it, which reduces the amount of data the model must process during recognition while preserving gesture recognition accuracy. The processing capacity required of the model is therefore lower, the gesture recognition model can run effectively on devices with limited processing power, and gesture recognition efficiency is improved.

Description

Gesture recognition model training method and related device
Technical Field
The application relates to the technical field of model training, in particular to a gesture recognition model training method and a related device.
Background
Gesture recognition is one of the recognition modes currently in widespread use: a user can trigger subsequent operations, such as payment or unlocking, only after making an accurate gesture.
In the related art, gesture recognition is usually performed on a complete gesture video, and the recognition model needs to analyze and process the whole gesture video to obtain a sufficiently accurate gesture recognition result. However, this requires the gesture recognition model to have high processing capacity, places heavy demands on hardware, and yields low gesture recognition efficiency, so it is difficult for such a model to give users a good gesture recognition experience.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a gesture recognition model training method, which can reduce the amount of data required for gesture recognition, improve gesture recognition efficiency, and reduce the performance requirements of the gesture recognition model.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application discloses a gesture recognition model training method, where the method includes:
acquiring a sample gesture video, wherein the sample gesture video has a corresponding sample gesture type;
collecting a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
generating sample data corresponding to the sample gesture video based on the plurality of video frame images, wherein the sample gesture type is the label corresponding to the sample data;
and training a gesture recognition model through the sample data, wherein the gesture recognition model is used for recognizing the gesture type corresponding to a gesture video to be recognized.
In a second aspect, an embodiment of the present application discloses a gesture recognition model training device, where the device includes a first obtaining unit, a first acquisition unit, a generating unit, and a training unit:
the first obtaining unit is used for obtaining a sample gesture video, where the sample gesture video has a corresponding sample gesture type;
the first acquisition unit is used for acquiring a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
the generating unit is used for generating sample data corresponding to the sample gesture video based on the plurality of video frame images, and the sample gesture type is a label corresponding to the sample data;
the training unit is used for training to obtain a gesture recognition model through the sample data, and the gesture recognition model is used for recognizing gesture types corresponding to gesture videos to be recognized.
In a possible implementation manner, the generating unit is specifically configured to:
taking the plurality of video frame images as sample data corresponding to the sample gesture video;
the plurality of video frame images are respectively used as target video frame images, and the training unit is specifically used for:
determining a first to-be-determined gesture type corresponding to the target video frame image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the first pending gesture type and the sample gesture type to obtain the gesture recognition model.
In a possible implementation manner, the apparatus further includes a second acquisition unit, a first determination unit, and a second determination unit:
the second acquisition unit is used for acquiring the gesture video to be identified;
the second acquisition unit is used for acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
the first determining unit is used for determining gesture types respectively corresponding to the plurality of video frame images to be recognized through the gesture recognition model;
the second determining unit is configured to determine, in response to a ratio of a first gesture type among gesture types respectively corresponding to the plurality of video frame images to be recognized being greater than a first preset threshold, that the gesture type corresponding to the gesture video to be recognized is the first gesture type.
In a possible implementation manner, the generating unit is specifically configured to:
determining a plurality of region images corresponding to the video frame images respectively based on a preset image size;
determining the plurality of area images as sample data corresponding to the sample gesture video;
the training unit is specifically configured to use the plurality of area images as target area images respectively:
determining a second undetermined gesture type corresponding to the target area image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the second undetermined gesture type and the sample gesture type to obtain the gesture recognition model.
In a possible implementation manner, the apparatus further includes a third acquisition unit, a third determination unit, a fourth determination unit, a first response unit, and a second response unit:
the third obtaining unit is used for obtaining the gesture video to be recognized;
the third acquisition unit is used for acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
the third determining unit is configured to determine a plurality of area images corresponding to the plurality of video frame images to be identified respectively based on the preset image size;
the fourth determining unit is configured to take the plurality of video frame images to be recognized respectively as target frame images to be recognized, and to determine, through the gesture recognition model, gesture types respectively corresponding to a plurality of area images corresponding to the target frame image to be recognized;
the first response unit is configured to determine, in response to the ratio of a second gesture type among the gesture types corresponding to the plurality of area images corresponding to the target frame image to be recognized being greater than a second preset threshold, that the gesture type corresponding to the target frame image to be recognized is the second gesture type;
the second response unit is configured to determine that the gesture type corresponding to the gesture video to be recognized is the second gesture type, in response to the ratio of the second gesture type in the gesture types respectively corresponding to the plurality of video frame images to be recognized being greater than a third preset threshold.
In a possible implementation manner, the apparatus further includes a fourth acquisition unit, a fifth determination unit, and a third response unit:
the fourth acquisition unit is configured to acquire a plurality of initial video frame images corresponding to the gesture video to be identified based on an initial time interval, where the initial time interval is greater than the preset time interval;
The fifth determining unit is configured to determine, according to the gesture recognition model and the plurality of initial video frame images, an initial gesture type corresponding to the gesture video to be recognized;
and the third response unit is configured to execute the step of acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval in response to the initial gesture type not being the first gesture type.
In one possible implementation, the gesture recognition model is composed of two convolution layers, two pooling layers, and one fully connected layer.
In a possible implementation manner, the apparatus further includes a sixth determining unit and a seventh determining unit:
the sixth determining unit is configured to determine duty ratio parameters of hand images corresponding to the plurality of area images respectively;
the seventh determining unit is configured to determine a plurality of effective area images with a duty ratio parameter greater than a fourth preset threshold value in the plurality of area images;
the step of using the plurality of region images as target region images respectively includes:
and respectively taking the plurality of effective area images as target area images.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining, in a plurality of area images corresponding to the target frame image to be recognized, a plurality of target effective area images whose hand-image duty ratio parameters are larger than a fourth preset threshold value;
determining gesture types corresponding to the plurality of target effective area images respectively through the gesture recognition model;
the first response unit is specifically configured to:
and determining, according to the fact that the duty ratio of the second gesture type in the gesture types respectively corresponding to the plurality of target effective area images is larger than a second preset threshold, that the gesture type corresponding to the target frame image to be recognized is the second gesture type.
In a third aspect, embodiments of the present application disclose a computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the gesture recognition model training method of any of the first aspects according to instructions in the program code.
In a fourth aspect, embodiments of the present application disclose a computer readable storage medium for storing program code for performing the gesture recognition model training method of any of the first aspects.
In a fifth aspect, embodiments of the present application disclose a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the gesture recognition model training method of any of the first aspects.
According to the technical scheme, when model training is performed, a sample gesture video can be acquired first, where the sample gesture video has a corresponding sample gesture type. In order to reduce the amount of data required in the gesture recognition process, a plurality of video frame images corresponding to the sample gesture video can be collected based on a preset time interval, sample data corresponding to the sample gesture video are then generated based on the plurality of video frame images, and the sample gesture type serves as the label corresponding to the sample data. A gesture recognition model can be trained through the sample data and used to recognize the gesture type corresponding to a gesture video to be recognized. In this way, the gesture recognition model gains the capability of recognizing the gesture type of a gesture video through only the partial video frames extracted from it, which reduces the amount of data the model must process during recognition while preserving gesture recognition accuracy. The processing capacity required of the model is therefore lower, the gesture recognition model can run effectively on devices with limited processing power, and gesture recognition efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a gesture recognition model training method in an actual application scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart of a gesture recognition model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a gesture recognition model training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a model provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a gesture recognition model training method in an actual application scenario provided in the embodiment of the present application;
fig. 6 is a schematic diagram of a gesture recognition model training method in an actual application scenario provided in the embodiment of the present application;
FIG. 7 is a block diagram of a gesture recognition model training apparatus according to an embodiment of the present application;
Fig. 8 is a block diagram of a terminal according to an embodiment of the present application;
fig. 9 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Gesture recognition means that after a user makes a corresponding gesture, the gesture video is passed to a recognition algorithm, and the gesture type corresponding to the gesture video is determined. In the related art, a model for gesture recognition analyzes the whole gesture video and directly obtains the gesture type corresponding to it. Because a gesture video generally comprises a large number of video frames, it carries a large amount of data, so gesture recognition models in the related art place heavy processing pressure on gesture recognition equipment, and gesture recognition efficiency is also poor.
In order to solve the above technical problems, an embodiment of the present application provides a gesture recognition model training method. A gesture recognition model obtained through this training method can perform gesture recognition based on partial video frame images extracted from a gesture video, which preserves the accuracy of gesture recognition while reducing the amount of data the model requires during recognition and improving gesture recognition efficiency.
It will be appreciated that the method may be applied to a processing device that is capable of model training, for example, a terminal device or a server having model training functionality. The method can be independently executed by the terminal equipment or the server, can also be applied to a network scene of communication between the terminal equipment and the server, and is executed by the cooperation of the terminal equipment and the server. The terminal equipment can be a computer, a mobile phone and other equipment. The server can be understood as an application server or a Web server, and can be an independent server or a cluster server in actual deployment.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, a method for training a gesture recognition model provided by the embodiments of the present application will be described next with reference to an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic diagram of a gesture recognition model training method in an actual application scenario provided in the embodiment of the present application, where a processing device is a server 101.
The server 101 may first acquire a sample gesture video, where the sample gesture video has N video frame images, that is, video frame image 1 to video frame image N, and the sample gesture video corresponds to a sample gesture type, that is, the sample gesture type is a gesture type recorded by the sample gesture video.
The server 101 collects a plurality of video frame images from the sample gesture video based on a preset time interval, and then generates sample data corresponding to the sample gesture video based on the plurality of video frame images, where the sample gesture type is the label corresponding to the sample data. That is, during model training with the sample data, the server 101 can make the gesture recognition model learn how to recognize the sample gesture type based on the plurality of video frame images in the sample data. In this way, the gesture recognition model can gain the ability to determine the corresponding gesture type from partial video frame images of the full gesture video. The server 101 may then acquire a gesture video to be recognized and determine, through the gesture recognition model, the gesture type corresponding to it. Because the gesture recognition model does not need to perform gesture recognition on the complete gesture video, the amount of data required in the recognition process is reduced, gesture recognition efficiency is improved, and the processing pressure on the model is relieved.
Next, a method for training a gesture recognition model provided in the embodiments of the present application will be described with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart of a gesture recognition model training method according to an embodiment of the present application, where the method includes:
s201: and acquiring a sample gesture video.
The sample gesture video may be any gesture video with a known corresponding gesture type, and the sample gesture video has a corresponding sample gesture type, which is a gesture type recorded in the sample gesture video, as shown in fig. 3, and fig. 3 illustrates 8 possible gesture types, i.e., G1 to G8.
S202: and acquiring a plurality of video frame images corresponding to the sample gesture video based on a preset time interval.
In order to reduce the amount of data required in the gesture type recognition process, the processing device may set a preset time interval, for example 5 ms, and collect a plurality of video frame images from the sample gesture video based on that interval.
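As a purely illustrative sketch of the frame collection in S202 (the OpenCV dependency, the function name sample_frames, and the 5 ms default are assumptions, not part of this application), the step could look as follows:

import cv2  # assumed dependency; any video decoding library would serve

def sample_frames(video_path, interval_ms=5):
    # Collect video frame images from a gesture video at a preset time interval.
    cap = cv2.VideoCapture(video_path)
    frames, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t)  # jump to the next sampling instant
        ok, frame = cap.read()
        if not ok:  # past the end of the video
            break
        frames.append(frame)
        t += interval_ms
    cap.release()
    return frames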
S203: sample data corresponding to the sample gesture video is generated based on the plurality of video frame images.
The processing device may generate sample data corresponding to the sample gesture video based on the plurality of video frame images, and use the sample gesture type as the label corresponding to the sample data. That is, during model training using the sample data, the sample gesture type is the gesture type the model should recognize when it works correctly.
S204: and training through the sample data to obtain a gesture recognition model.
The above description indicates that the sample gesture type corresponding to the sample gesture video can be represented fairly accurately through the plurality of video frame images, so model training based on the sample data can more effectively yield an accurate gesture recognition model. The gesture recognition model is used to recognize the gesture type corresponding to a gesture video to be recognized, which can be any video requiring gesture recognition. For the gesture recognition model, analysis of gesture types can be realized with only a plurality of video frame images; compared with related-art models that directly analyze the complete gesture video, the amount of data it requires during recognition is small, so the data processing pressure on the model is low and gesture type recognition is fast.
According to the technical scheme, when model training is performed, a sample gesture video can be acquired first, where the sample gesture video has a corresponding sample gesture type. In order to reduce the amount of data required in the gesture recognition process, a plurality of video frame images corresponding to the sample gesture video can be collected based on a preset time interval, sample data corresponding to the sample gesture video are then generated based on the plurality of video frame images, and the sample gesture type serves as the label corresponding to the sample data. A gesture recognition model can be trained through the sample data and used to recognize the gesture type corresponding to a gesture video to be recognized. In this way, the gesture recognition model gains the capability of recognizing the gesture type of a gesture video through only the partial video frames extracted from it, which reduces the amount of data the model must process during recognition while preserving gesture recognition accuracy. The processing capacity required of the model is therefore lower, the gesture recognition model can run effectively on devices with limited processing power, and gesture recognition efficiency is improved.
Specifically, when sample data is generated, the processing device may directly use a plurality of video frame images as sample data corresponding to the sample gesture video, then use the plurality of video frame images as target video frame images respectively, and when model training is performed through the sample data, the processing device may determine a first to-be-determined gesture type corresponding to the target video frame images through an initial gesture recognition model, where the first to-be-determined gesture type is a gesture type recognized by the initial gesture recognition model based on the target video frame images.
Because the sample gesture type is the gesture type actually corresponding to the target video frame image, the difference between the sample gesture type and the first to-be-determined gesture type reflects the error the initial gesture recognition model makes when determining gesture types from video frame images. Therefore, the processing device can adjust the model parameters of the initial gesture recognition model according to the difference between the first to-be-determined gesture type and the sample gesture type, so that the initial gesture recognition model learns how to recognize gesture types more accurately based on video frame images, thereby obtaining the gesture recognition model.
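A minimal sketch of this parameter-adjustment step is given below, assuming a Keras classifier that outputs logits over the gesture types and integer sample-gesture-type labels; the Adam optimizer and cross-entropy loss are illustrative assumptions, as the application does not specify them:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, target_frame_images, sample_gesture_types):
    with tf.GradientTape() as tape:
        pending_types = model(target_frame_images, training=True)  # first to-be-determined gesture types
        loss = loss_fn(sample_gesture_types, pending_types)        # difference from the sample gesture type
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # adjust model parameters
    return loss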
As can be seen from the above training process, the trained gesture recognition model has the capability of recognizing gesture types based on a single video frame image. Therefore, in order to improve the accuracy of the finally recognized gesture type, in one possible implementation the gesture recognition model may use a voting mechanism to recognize gesture types.
Specifically, the processing device may first acquire the gesture video to be recognized and then collect a plurality of video frame images to be recognized corresponding to it based on the preset time interval. The processing device can determine, through the gesture recognition model, the gesture types respectively corresponding to the plurality of video frame images to be recognized, with a first preset threshold set in advance for judging the gesture type corresponding to the plurality of video frame images to be recognized as a whole. In response to the share of a first gesture type among those gesture types being greater than the first preset threshold, meaning that the gesture recognition model has recognized the first gesture type for most of the video frame images to be recognized, the processing device can determine that the gesture type corresponding to the gesture video to be recognized is the first gesture type. With this way of applying the model, even if the gesture recognition model misrecognizes individual video frame images, the other recognition results correct the error, improving gesture recognition accuracy. Meanwhile, the model only needs the function of gesture recognition for a single video frame image, which further reduces the processing pressure on the model and the performance requirements of devices running the gesture recognition model.
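The voting mechanism just described can be sketched as follows; the helper name and the default threshold value are illustrative assumptions:

from collections import Counter

def video_gesture_type(frame_types, first_preset_threshold=0.5):
    # Return the gesture type whose share among the per-frame predictions
    # exceeds the first preset threshold, or None if no type dominates.
    gesture, votes = Counter(frame_types).most_common(1)[0]
    if votes / len(frame_types) > first_preset_threshold:
        return gesture
    return None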
In one possible implementation, to further reduce the model processing pressure, the processing device may segment the video frame image into smaller data units for model training.
Specifically, when model training is performed, the processing device may preset an image size, then divide the plurality of video frame images based on the preset image size, and determine a plurality of area images corresponding to the plurality of video frame images respectively, where each video frame image may be composed of a plurality of area images corresponding to the video frame image.
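The division into region images can be sketched as follows; the NumPy array representation and the 32x32 preset image size are assumptions:

import numpy as np

def split_into_regions(frame, region_h=32, region_w=32):
    # Divide one video frame image (a NumPy array) into region images
    # of a preset image size; edge remainders are discarded in this sketch.
    h, w = frame.shape[:2]
    return [frame[y:y + region_h, x:x + region_w]
            for y in range(0, h - region_h + 1, region_h)
            for x in range(0, w - region_w + 1, region_w)]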
The processing device may determine the plurality of area images as the sample data corresponding to the sample gesture video. When performing model training, the processing device may take the plurality of area images respectively as target area images and determine, through the initial gesture recognition model, a second pending gesture type corresponding to each target area image, where the second pending gesture type is the gesture type the initial gesture recognition model obtains by recognizing that target area image.
Because the sample gesture type is the gesture type actually corresponding to the target area image, the error of the initial gesture recognition model when determining the gesture type based on the area image can be reflected through the difference between the sample gesture type and the second undetermined gesture type. Based on the above, the processing device may adjust the model parameters of the initial gesture recognition model according to the difference between the second pending gesture type and the sample gesture type, so that the initial gesture recognition model learns how to recognize a more accurate gesture type based on the region image, and further obtains the gesture recognition model.
Similar to the voting mechanism for performing gesture recognition based on multiple video frame images, when performing gesture recognition, the processing device may acquire a gesture video to be recognized, collect multiple video frame images to be recognized corresponding to the gesture video to be recognized based on a preset time interval, and then determine multiple area images corresponding to the multiple video frame images to be recognized respectively based on a preset image size.
The processing device may take the plurality of video frame images to be recognized respectively as target frame images to be recognized and determine, through the gesture recognition model, the gesture types respectively corresponding to the plurality of region images of each target frame image, with a second preset threshold set in advance for judging the gesture type of a single frame image to be recognized as a whole. In response to the share of a second gesture type among the gesture types corresponding to the plurality of region images of a target frame image being greater than the second preset threshold, meaning that the gesture recognition model has recognized the second gesture type for most of those region images, the processing device can determine that the gesture type corresponding to that target frame image to be recognized is the second gesture type; this is the first voting mechanism in the gesture recognition process.
In response to the share of the second gesture type among the gesture types respectively corresponding to the plurality of video frame images to be recognized being greater than a third preset threshold, meaning that the gesture recognition model has recognized the second gesture type for most of the frames, the processing device can determine that the gesture type corresponding to the gesture video to be recognized is the second gesture type. With this way of applying the model, even if the gesture recognition model misrecognizes an individual region image or video frame image, the other recognition results correct the error and gesture recognition accuracy improves. In addition, in this recognition mode the gesture recognition model only needs the capability of recognizing gesture types for a single region image, and the amount of data processed in a single recognition pass is extremely small, which further reduces the processing capacity required of the gesture recognition device.
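Combining the two voting stages, a hedged sketch follows, reusing the video_gesture_type helper sketched earlier; both threshold defaults are assumptions:

def two_level_vote(region_types_per_frame, second_threshold=0.5, third_threshold=0.5):
    # First vote the region images of each frame into a frame-level gesture
    # type, then vote the frame-level types into a video-level gesture type.
    frame_types = [video_gesture_type(region_types, second_threshold)
                   for region_types in region_types_per_frame]
    return video_gesture_type(frame_types, third_threshold)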
It can be understood that the hand image of the user generally does not occupy the entire gesture video image; that is, among the multiple region images segmented from the same video frame image, the hand occupies a relatively small area in some region images and a relatively large area in others. Since gesture recognition is mainly realized through analysis of the hand image, in one possible implementation the processing device may determine duty ratio parameters of the hand image corresponding to the multiple region images respectively, where a duty ratio parameter reflects the distribution of the hand image in a region image. The duty ratio parameters may be labeled manually, or a model may be trained on manually labeled parameters so that it can analyze and judge them by itself.
The processing device may preset a fourth preset threshold, which is used to judge whether the hand-image duty ratio in a region image meets the requirement for accurately analyzing the gesture type. The processing device can determine, among the plurality of region images, a plurality of effective region images whose duty ratio parameter is greater than the fourth preset threshold, and during model training it can take these effective region images respectively as the target region images. This avoids fitting non-hand content when training on region images with a low hand-image proportion, which would lower model training accuracy.
Similarly, in the application process, in one possible implementation, after determining the plurality of region images respectively corresponding to the plurality of video frame images to be recognized based on the preset image size, the processing device may determine a plurality of target effective region images whose hand-image duty ratio parameter is greater than the fourth preset threshold; these target effective region images contain enough of the hand image for gesture types to be recognized more accurately. The processing device may determine, through the gesture recognition model, the gesture types respectively corresponding to the plurality of target effective region images, and in response to the share of a second gesture type among them being greater than the second preset threshold, the processing device may determine that the gesture type corresponding to the target frame image to be recognized is the second gesture type.
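The duty-ratio filtering can be sketched as follows; representing the hand distribution as precomputed binary masks and the 0.3 default for the fourth preset threshold are assumptions:

def effective_regions(region_images, hand_masks, fourth_preset_threshold=0.3):
    # Keep only region images whose hand-image duty ratio exceeds the
    # fourth preset threshold; each mask marks hand pixels with 1.
    kept = []
    for region, mask in zip(region_images, hand_masks):
        duty_ratio = mask.mean()  # fraction of hand pixels in this region
        if duty_ratio > fourth_preset_threshold:
            kept.append(region)
    return kept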
In one possible implementation, to further improve gesture recognition accuracy while ensuring gesture recognition efficiency, the processing device may sample the gesture video to be recognized at different sampling intervals.
For example, in the scenario of the embodiment of the present application, the gesture recognition model is used to recognize whether the gesture made by the user is of the first gesture type, for example, the scenario may be a gesture verification scenario for payment, unlocking, or the like. The processing device may first collect a plurality of initial video frame images corresponding to the gesture video to be recognized based on an initial time interval, where the initial time interval is greater than the preset time interval. The processing device may determine, through the gesture recognition model, an initial gesture type corresponding to the gesture video to be recognized according to the plurality of initial video frame images, for example, the processing device may determine, through analyzing a duty ratio condition in gesture types respectively corresponding to the plurality of video frame images, the initial gesture type corresponding to the gesture video to be recognized.
In response to the initial gesture type being the first gesture type, the processing device may directly take the first gesture type as the recognition result for the gesture video to be recognized. In response to the initial gesture type not being the first gesture type, to avoid misrecognition due to low sampling precision, the processing device may perform the step of collecting a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval. Therefore, for gesture videos that are easy to recognize, the processing device can reduce the data processed during model recognition through the larger time interval, further improving gesture recognition efficiency; for gesture videos that are hard to recognize, the processing device can raise the sampling precision and perform gesture type recognition through more video frame images, guaranteeing recognition accuracy.
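The coarse-to-fine strategy could be sketched as follows, reusing the sample_frames and video_gesture_type helpers sketched earlier; classify_frame is a hypothetical per-frame wrapper around the gesture recognition model, and the interval values are assumptions:

def recognize_with_fallback(video_path, first_gesture_type, classify_frame,
                            initial_interval_ms=20, preset_interval_ms=5):
    # Coarse pass: sample sparsely at the initial time interval.
    coarse = [classify_frame(f) for f in sample_frames(video_path, initial_interval_ms)]
    initial_type = video_gesture_type(coarse)
    if initial_type == first_gesture_type:
        return initial_type  # easy case: accept the coarse recognition result
    # Fine pass: fall back to the smaller preset time interval.
    fine = [classify_frame(f) for f in sample_frames(video_path, preset_interval_ms)]
    return video_gesture_type(fine)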
As can be seen from the above model training and application modes, the gesture recognition model in this application has low processing requirements, so a model with a relatively simple structure can be adopted. For example, in one possible implementation, the gesture recognition model may be a lightweight convolutional neural network model composed of two convolution layers (Convolution), two pooling layers (Max-Pool), and one fully connected layer, as shown in fig. 4. The gesture recognition model may be a binary classification model that only recognizes one specific gesture type: if the output result is 1, the corresponding gesture type is the specific gesture type; if the output result is 0, the corresponding gesture type is not the specific gesture type. The model code may be as follows:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv1D(32, kernel_size=64),  # first convolution layer: 32 kernels of size 64
    layers.ReLU(),
    layers.Dropout(.5),
    layers.MaxPooling1D(),
    layers.Conv1D(16, kernel_size=64),  # second convolution layer: 16 kernels of size 64
    layers.ReLU(),
    layers.Dropout(.5),
    layers.MaxPooling1D(),
    layers.Flatten(),
    layers.Dense(8),  # fully connected layer, Dense of 8
    layers.ReLU(),
    layers.Dropout(.5),
    layers.Dense(1, activation='sigmoid'),  # output layer implied by the 0/1 binary output described above (an assumption)
])
in order to facilitate understanding of the technical solution provided by the embodiments of the present application, a method for training a gesture recognition model provided by the embodiments of the present application will be described next in conjunction with an actual application scenario.
Referring to fig. 5, fig. 5 is a schematic diagram of a gesture recognition model training method in an actual application scenario provided in the embodiment of the present application, where the method mainly includes five parts, namely gesture definition, data collection, feature engineering, model training and model deployment.
1. Gesture definition
The processing device may first obtain human gesture data from the device side, for example, may collect the human gesture data by using a millimeter wave sensor, where the collected gesture mainly includes 8 gesture types shown in fig. 3.
2. Data collection
(1) The processing equipment collects a certain number of gesture videos for each gesture type.
(2) Negative sample gesture videos are determined; a negative sample gesture video is one that does not correspond to any gesture type, and the processing device collects as many negative sample gesture videos as positive ones in step (1).
(3) The data collected in steps (1) and (2) are cleaned with technical methods such as searching for repeated values, searching for missing values, and searching for abnormal values, so that invalid data is removed.
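A hedged pandas sketch of the cleaning in step (3) follows; the DataFrame representation, the 'amplitude' column, and the 3-sigma rule for abnormal values are assumptions:

import pandas as pd

def clean_gesture_data(df: pd.DataFrame) -> pd.DataFrame:
    # Remove repeated values, missing values, and abnormal values.
    df = df.drop_duplicates()  # repeated values
    df = df.dropna()           # missing values
    mu, sigma = df["amplitude"].mean(), df["amplitude"].std()
    return df[(df["amplitude"] - mu).abs() <= 3 * sigma]  # abnormal values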
3. Feature engineering
Using signal data processing techniques, the processing device can acquire, process, and extract meaningful features and attributes from the data gathered in the data collection step, so that they can conveniently be sent to the model for training. The main flow is as follows:
(1) Median filtering is applied to the collected gesture data for cleaning.
(2) The file from step (1) is segmented with an attention mechanism at a certain (adjustable, e.g. 5 ms) time length: the original file is divided into a plurality of blocks, the gesture type corresponding to the gesture video is determined as the corresponding label, and the segmentation mechanism is trained using a network model.
This application adopts an attention mechanism. After training with the network model, the input image is divided into weighted blocks; the network model is a lightweight convolutional network that takes the gray-scale values of the block region images as input features, and after one attention layer it finally outputs the gesture label.
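A hedged sketch of such block weighting with an attention layer follows; the layer sizes, the flattened 32x32 gray block, and the sum pooling are assumptions, not necessarily the exact structure used by this application:

import tensorflow as tf
from tensorflow.keras import layers

class BlockAttention(layers.Layer):
    # Weight the block region images by learned attention scores.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1)

    def call(self, blocks):  # blocks: (batch, num_blocks, features)
        w = tf.nn.softmax(self.score(blocks), axis=1)  # one weight per block
        return tf.reduce_sum(w * blocks, axis=1)       # attention-weighted sum

attention_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1024)),  # gray values of each block, flattened (32x32 assumed)
    BlockAttention(),
    layers.Dense(8),  # logits over the 8 gesture types of fig. 3
])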
(3) The region images obtained by dividing into blocks in step (2) are processed as follows until all files are processed:
extracting energy-distribution features with a window of 5 ms duration, and standardizing them;
extracting energy-distribution features with a window of 10 ms duration, and standardizing them;
extracting energy-distribution features with a window of 15 ms duration, and standardizing them;
if the data corresponds to the target gesture, it is labeled 1; otherwise it is labeled 0.
Through this labeling mode, more accurate training data can be obtained when one gesture video includes a plurality of gesture types.
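A hedged NumPy sketch of the multi-scale window feature extraction above; treating energy as mean squared amplitude, one sample per millisecond, and z-score standardization are assumptions:

import numpy as np

def window_energy_features(signal, window_ms=(5, 10, 15)):
    # Extract and standardize energy-distribution features over
    # 5/10/15 ms windows (signal assumed sampled at 1 sample per ms).
    feats = []
    for win in window_ms:
        energies = np.array([np.mean(signal[i:i + win] ** 2)
                             for i in range(0, len(signal) - win + 1, win)])
        feats.append((energies - energies.mean()) / (energies.std() + 1e-8))
    return feats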
4. Model training
The embodiment of the application can adopt a classification method to classify, and the designed network model has 3 layers in total in consideration of the computing power of the mobile terminal:
the first layer is a 1D CNN convolutional neural network with 32 convolution kernels of size 64;
the 1D CNN is followed by a global average pooling layer;
the pooling layer is followed by a ReLU activation function;
the second layer is a 1D CNN convolutional neural network with 16 convolution kernels of size 64;
the 1D CNN is followed by a global average pooling layer;
the pooling layer is followed by a ReLU activation function;
the third layer is a fully connected layer with Dense of 8.
After training the model with the above data, the processing device may perform model evaluation by:
(1) The data to be predicted is denoised.
(2) The file is partitioned to determine the region images.
(3) The extracted region images are sent to the network model for prediction.
(4) The prediction results are counted by a voting method: if a block is predicted to be the target gesture type, a positive vote is cast; otherwise a negative vote is cast.
5. Model deployment
The framework for gesture recognition model deployment is shown in fig. 6.
Based on the gesture recognition model training method provided in the foregoing embodiments, the present application further provides a gesture recognition model training device. Referring to fig. 7, fig. 7 is a block diagram of a gesture recognition model training device 700 provided in the embodiment of the present application, where the device includes a first obtaining unit 701, a first acquisition unit 702, a generating unit 703, and a training unit 704:
The first obtaining unit 701 is configured to obtain a sample gesture video, where the sample gesture video has a corresponding sample gesture type;
the first acquisition unit 702 is configured to acquire a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
the generating unit 703 is configured to generate sample data corresponding to the sample gesture video based on the multiple video frame images, where the sample gesture type is a label corresponding to the sample data;
the training unit 704 is configured to train to obtain a gesture recognition model according to the sample data, where the gesture recognition model is used to recognize a gesture type corresponding to the gesture video to be recognized.
In one possible implementation manner, the generating unit 703 is specifically configured to:
taking the plurality of video frame images as sample data corresponding to the sample gesture video;
the training unit 704 is specifically configured to take the plurality of video frame images as target video frame images respectively:
determining a first to-be-determined gesture type corresponding to the target video frame image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the first pending gesture type and the sample gesture type to obtain the gesture recognition model.
In a possible implementation manner, the apparatus further includes a second acquisition unit, a first determination unit, and a second determination unit:
the second acquisition unit is used for acquiring the gesture video to be identified;
the second acquisition unit is used for acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
the first determining unit is used for determining gesture types respectively corresponding to the plurality of video frame images to be recognized through the gesture recognition model;
the second determining unit is configured to determine, in response to a ratio of a first gesture type among gesture types respectively corresponding to the plurality of video frame images to be recognized being greater than a first preset threshold, that the gesture type corresponding to the gesture video to be recognized is the first gesture type.
In one possible implementation manner, the generating unit 703 is specifically configured to:
determining a plurality of region images corresponding to the video frame images respectively based on a preset image size;
determining the plurality of area images as sample data corresponding to the sample gesture video;
the training unit 704 is specifically configured to:
Determining a second undetermined gesture type corresponding to the target area image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the second undetermined gesture type and the sample gesture type to obtain the gesture recognition model.
In a possible implementation manner, the apparatus further includes a third acquisition unit, a third determination unit, a fourth determination unit, a first response unit, and a second response unit:
the third obtaining unit is used for obtaining the gesture video to be recognized;
the third acquisition unit is used for acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
the third determining unit is configured to determine a plurality of area images corresponding to the plurality of video frame images to be identified respectively based on the preset image size;
the fourth determining unit is configured to take the plurality of video frame images to be recognized respectively as target frame images to be recognized, and to determine, through the gesture recognition model, gesture types respectively corresponding to a plurality of area images corresponding to the target frame image to be recognized;
the first response unit is configured to determine, in response to the ratio of a second gesture type among the gesture types corresponding to the plurality of area images corresponding to the target frame image to be recognized being greater than a second preset threshold, that the gesture type corresponding to the target frame image to be recognized is the second gesture type;
the second response unit is configured to determine that the gesture type corresponding to the gesture video to be recognized is the second gesture type, in response to the ratio of the second gesture type in the gesture types respectively corresponding to the plurality of video frame images to be recognized being greater than a third preset threshold.
In a possible implementation manner, the apparatus further includes a fourth acquisition unit, a fifth determination unit, and a third response unit:
the fourth acquisition unit is configured to acquire a plurality of initial video frame images corresponding to the gesture video to be identified based on an initial time interval, where the initial time interval is greater than the preset time interval;
the fifth determining unit is configured to determine, according to the gesture recognition model and the plurality of initial video frame images, an initial gesture type corresponding to the gesture video to be recognized;
and the third response unit is configured to execute the step of acquiring a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval in response to the initial gesture type not being the first gesture type.
In one possible implementation, the gesture recognition model is composed of two convolution layers, two pooling layers, and one fully connected layer.
In a possible implementation manner, the apparatus further includes a sixth determining unit and a seventh determining unit:
the sixth determining unit is configured to determine duty ratio parameters of hand images corresponding to the plurality of area images respectively;
the seventh determining unit is configured to determine a plurality of effective area images with a duty ratio parameter greater than a fourth preset threshold value in the plurality of area images;
the step of using the plurality of region images as target region images respectively includes:
and respectively taking the plurality of effective area images as target area images.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining, in a plurality of area images corresponding to the target frame image to be recognized, a plurality of target effective area images whose hand-image duty ratio parameters are larger than a fourth preset threshold value;
determining gesture types corresponding to the plurality of target effective area images respectively through the gesture recognition model;
the first response unit is specifically configured to:
and determining, according to the fact that the duty ratio of the second gesture type in the gesture types respectively corresponding to the plurality of target effective area images is larger than a second preset threshold, that the gesture type corresponding to the target frame image to be recognized is the second gesture type.
Embodiments of the present application further provide a computer device, which is described below with reference to the accompanying drawings. Referring to fig. 8, an embodiment of the present application provides a device that may also be a terminal device; the terminal device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a point-of-sale (Point of Sale, POS for short) terminal, a vehicle-mounted computer, and the like. The terminal device being a mobile phone is taken as an example:
fig. 8 is a block diagram showing a part of the structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 8, the mobile phone includes: radio Frequency (RF) circuitry 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuitry 760, wireless fidelity (Wireless Fidelity, wiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 8 is not limiting of the handset and may include more or fewer components than shown, or may combine certain components, or may be arranged in a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 8:
The RF circuit 710 may be configured to receive and transmit signals during a message or a call; specifically, it receives downlink information from a base station and passes it to the processor 780 for processing, and in addition sends uplink data to the base station. Generally, RF circuitry 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA for short), a duplexer, and the like. In addition, the RF circuitry 710 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (Global System of Mobile communication, GSM for short), general packet radio service (General Packet Radio Service, GPRS for short), code division multiple access (Code Division Multiple Access, CDMA for short), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA for short), long term evolution (Long Term Evolution, LTE for short), email, short message service (Short Messaging Service, SMS for short), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 performs the various functional applications and data processing of the handset by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the handset. In addition, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, may collect touch operations by the user on or near it (e.g., operations performed by the user on or near the touch panel 731 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 780; it can also receive commands from the processor 780 and execute them. In addition, the touch panel 731 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 731, the input unit 730 may include other input devices 732. In particular, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, an on/off key, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 740 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 740 may include a display panel 741; optionally, the display panel 741 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), or the like. Further, the touch panel 731 may cover the display panel 741; when the touch panel 731 detects a touch operation on or near it, the operation is transferred to the processor 780 to determine the type of the touch event, and the processor 780 then provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in fig. 8 the touch panel 731 and the display panel 741 are two separate components implementing the input and output functions of the mobile phone, in some embodiments they may be integrated to implement these functions.
The handset may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 741 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 741 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used for applications that recognize the attitude of the mobile phone (such as landscape/portrait switching, related games, magnetometer attitude calibration), vibration-recognition functions (such as a pedometer or tap detection), and the like. Other sensors that may also be configured on the handset, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail herein.
The audio circuit 760, the speaker 761, and the microphone 762 may provide an audio interface between the user and the handset. The audio circuit 760 may transmit the electrical signal converted from the received audio data to the speaker 761, which converts it into a sound signal for output; on the other hand, the microphone 762 converts the collected sound signal into an electrical signal, which the audio circuit 760 receives and converts into audio data. The audio data is then output to the processor 780 for processing and sent, for example, to another mobile phone via the RF circuit 710, or output to the memory 720 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 770, the mobile phone can help the user send and receive emails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 8 shows the WiFi module 770, it is understood that the module is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 780 is the control center of the mobile phone; it connects the various parts of the entire handset using various interfaces and lines, and performs the various functions of the handset and processes data by running or executing the software programs and/or modules stored in the memory 720 and calling the data stored in the memory 720, thereby monitoring the mobile phone as a whole. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 780.
The handset further includes a power supply 790 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically connected to the processor 780 through a power management system, so that functions such as charging, discharging, and power-consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring a sample gesture video, wherein the sample gesture video has a corresponding sample gesture type;
collecting a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
generating sample data corresponding to the sample gesture video based on the plurality of video frame images, wherein the sample gesture type is a label corresponding to the sample data;
and training with the sample data to obtain a gesture recognition model, where the gesture recognition model is used for recognizing the gesture type corresponding to a gesture video to be recognized.
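As an illustration of the frame-collection step above, the following is a minimal sketch, assuming OpenCV (`cv2`) for video decoding; the half-second interval and the fallback frame rate are assumptions, since the patent leaves the preset time interval unspecified.

```python
import cv2


def sample_frames(video_path, interval_s=0.5):
    """Collect one video frame every interval_s seconds from the gesture video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back when FPS metadata is missing
    step = max(1, round(fps * interval_s))   # source frames per collected frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                           # end of video or decode error
            break
        if index % step == 0:
            frames.append(frame)             # keep frames on the preset interval
        index += 1
    cap.release()
    return frames
```

The frames collected this way would then be labelled with the sample gesture type and fed to training, as the steps above describe.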
The embodiment of the present application further provides a server. As shown in fig. 9, fig. 9 is a block diagram of the server 800 provided in the embodiment of the present application. The server 800 may vary considerably in configuration or performance, and may include one or more central processing units (Central Processing Units, CPU for short) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may be transitory or persistent. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 822 may be configured to communicate with the storage medium 830 to execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The embodiments of the present application further provide a computer readable storage medium storing a computer program for executing any one of the gesture recognition model training methods described in the foregoing embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium may be at least one of the following media capable of storing program code: a read-only memory (ROM), a RAM, a magnetic disk, an optical disk, and the like.
It should be noted that the embodiments in this specification are described in a progressive manner: identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can be readily conceived by those skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for training a gesture recognition model, the method comprising:
acquiring a sample gesture video, wherein the sample gesture video has a corresponding sample gesture type;
collecting a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
generating sample data corresponding to the sample gesture video based on the plurality of video frame images, wherein the sample gesture type is a label corresponding to the sample data;
and training with the sample data to obtain a gesture recognition model, wherein the gesture recognition model is used for recognizing the gesture type corresponding to a gesture video to be recognized.
2. The method of claim 1, wherein the generating sample data corresponding to the sample gesture video based on the plurality of video frame images comprises:
taking the plurality of video frame images as the sample data corresponding to the sample gesture video;
and the training with the sample data to obtain a gesture recognition model comprises, with the plurality of video frame images respectively taken as target video frame images:
determining a first undetermined gesture type corresponding to the target video frame image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the first undetermined gesture type and the sample gesture type to obtain the gesture recognition model.
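As an illustration of claim 2, the following is a minimal training-step sketch in PyTorch, assuming cross-entropy loss as the measure of the difference between the undetermined gesture type and the sample gesture type; the tensor shapes and the choice of optimizer are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, frames, sample_label):
    """One parameter update from the frames of a single sample gesture video.

    frames: (N, 3, H, W) float tensor; sample_label: int class index.
    """
    criterion = nn.CrossEntropyLoss()                      # measures the prediction/label difference
    labels = torch.full((frames.shape[0],), sample_label)  # every frame inherits the video label
    logits = model(frames)                                 # undetermined gesture type per frame
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()                                        # adjust model parameters from the difference
    optimizer.step()
    return loss.item()
```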
3. The method according to claim 2, wherein the method further comprises:
acquiring the gesture video to be recognized;
collecting a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
determining gesture types corresponding to the plurality of video frame images to be recognized respectively through the gesture recognition model;
and determining, in response to the proportion of a first gesture type among the gesture types respectively corresponding to the plurality of video frame images to be recognized being larger than a first preset threshold, that the gesture type corresponding to the gesture video to be recognized is the first gesture type.
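As an illustration of the video-level decision in claim 3, a minimal voting sketch follows; the 0.6 threshold is an assumption, since the first preset threshold is not fixed by the claim.

```python
from collections import Counter


def vote_video_type(frame_types, threshold=0.6):
    """Return the majority gesture type if its proportion exceeds the threshold, else None."""
    if not frame_types:
        return None
    gesture, count = Counter(frame_types).most_common(1)[0]
    return gesture if count / len(frame_types) > threshold else None
```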
4. The method of claim 1, wherein the generating sample data corresponding to the sample gesture video based on the plurality of video frame images comprises:
determining, based on a preset image size, a plurality of region images respectively corresponding to the plurality of video frame images;
determining the plurality of region images as the sample data corresponding to the sample gesture video;
and the training with the sample data to obtain the gesture recognition model comprises, with the plurality of region images respectively taken as target region images:
determining a second undetermined gesture type corresponding to the target region image through an initial gesture recognition model;
and adjusting model parameters of the initial gesture recognition model according to the difference between the second undetermined gesture type and the sample gesture type to obtain the gesture recognition model.
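As an illustration of claim 4, the following is a minimal sketch of tiling each video frame image into region images of a preset image size; the 64-pixel region size is an assumption, and incomplete border tiles are simply dropped here for brevity.

```python
def split_into_regions(frame, region_size=64):
    """Tile an (H, W, C) NumPy image array into non-overlapping region images."""
    h, w = frame.shape[:2]
    return [frame[y:y + region_size, x:x + region_size]
            for y in range(0, h - region_size + 1, region_size)   # rows of regions
            for x in range(0, w - region_size + 1, region_size)]  # columns of regions
```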
5. The method according to claim 4, wherein the method further comprises:
acquiring the gesture video to be recognized;
collecting a plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval;
determining, based on the preset image size, a plurality of region images respectively corresponding to the plurality of video frame images to be recognized;
respectively taking the plurality of video frame images to be recognized as target video frame images to be recognized, and determining, through the gesture recognition model, the gesture types respectively corresponding to the plurality of region images corresponding to the target video frame image to be recognized;
determining, in response to the proportion of a second gesture type among the gesture types corresponding to the plurality of region images corresponding to the target video frame image to be recognized being larger than a second preset threshold, that the gesture type corresponding to the target video frame image to be recognized is the second gesture type;
and determining, in response to the proportion of the second gesture type among the gesture types respectively corresponding to the plurality of video frame images to be recognized being larger than a third preset threshold, that the gesture type corresponding to the gesture video to be recognized is the second gesture type.
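As an illustration of the two-level decision in claim 5, the sketch below first votes the region-level gesture types into a frame-level type and then votes the frame-level types into a video-level type. It reuses the `vote_video_type` helper sketched under claim 3, and both threshold values are assumptions.

```python
def recognize_video(region_types_per_frame, second_threshold=0.6, third_threshold=0.6):
    """region_types_per_frame: one list of region-level gesture types per sampled frame."""
    frame_types = [vote_video_type(types, second_threshold)  # frame-level vote (second threshold)
                   for types in region_types_per_frame]
    frame_types = [t for t in frame_types if t is not None]  # drop undecided frames
    if not frame_types:
        return None
    return vote_video_type(frame_types, third_threshold)     # video-level vote (third threshold)
```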
6. A method according to claim 3, characterized in that the method further comprises:
acquiring a plurality of initial video frame images corresponding to the gesture video to be recognized based on an initial time interval, wherein the initial time interval is larger than the preset time interval;
determining an initial gesture type corresponding to the gesture video to be recognized according to the plurality of initial video frame images through the gesture recognition model;
and in response to the initial gesture type not being the first gesture type, executing the step of collecting the plurality of video frame images to be recognized corresponding to the gesture video to be recognized based on the preset time interval.
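As an illustration of claim 6, a coarse-to-fine sketch: the video is first screened at the larger initial interval, and the denser preset-interval pass runs only when the coarse pass does not already yield the first gesture type. It reuses the `sample_frames` and `vote_video_type` helpers sketched earlier; `classify_frames`, the two interval values, and the `first_type` label are all assumptions.

```python
def recognize_coarse_to_fine(video_path, classify_frames, first_type="palm",
                             initial_interval=2.0, preset_interval=0.5):
    """classify_frames maps a list of frames to a list of per-frame gesture types."""
    coarse_types = classify_frames(sample_frames(video_path, initial_interval))
    if vote_video_type(coarse_types) == first_type:
        return first_type                              # the coarse pass already decides
    fine_types = classify_frames(sample_frames(video_path, preset_interval))
    return vote_video_type(fine_types)                 # fall back to the denser pass
```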
7. The method of claim 1, wherein the gesture recognition model consists of two convolution layers, two pooling layers, and one fully connected layer.
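As an illustration of the architecture named in claim 7, the following is a minimal PyTorch sketch with exactly two convolution layers, two pooling layers, and one fully connected layer; the channel counts, kernel sizes, 64×64 input resolution, and number of gesture classes are assumptions, since the claim fixes only the layer structure.

```python
import torch
import torch.nn as nn


class GestureNet(nn.Module):
    """Two convolution layers, two pooling layers, one fully connected layer."""

    def __init__(self, num_gesture_types=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution layer 1
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolution layer 2
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling layer 2
        )
        # a 64x64 input becomes a 32-channel 16x16 feature map after two 2x2 poolings
        self.fc = nn.Linear(32 * 16 * 16, num_gesture_types)  # fully connected layer

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```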
8. The method of claim 4, wherein prior to said respectively taking the plurality of region images as target region images, the method further comprises:
determining hand-image proportion parameters respectively corresponding to the plurality of region images;
determining, among the plurality of region images, a plurality of effective area images whose proportion parameters are larger than a fourth preset threshold;
the step of respectively taking the plurality of region images as target region images includes:
respectively taking the plurality of effective area images as the target region images.
9. The method according to claim 5, wherein the determining, through the gesture recognition model, the gesture types respectively corresponding to the plurality of region images corresponding to the target video frame image to be recognized includes:
determining, among the plurality of region images corresponding to the target video frame image to be recognized, a plurality of target effective area images whose hand-image proportion parameters are larger than a fourth preset threshold;
determining gesture types corresponding to the plurality of target effective area images respectively through the gesture recognition model;
and the determining, in response to the proportion of the second gesture type among the gesture types corresponding to the plurality of region images corresponding to the target video frame image to be recognized being larger than a second preset threshold, that the gesture type corresponding to the target video frame image to be recognized is the second gesture type includes:
determining, in response to the proportion of the second gesture type among the gesture types respectively corresponding to the plurality of target effective area images being larger than the second preset threshold, that the gesture type corresponding to the target video frame image to be recognized is the second gesture type.
10. A gesture recognition model training apparatus, characterized by comprising a first acquisition unit, a first collection unit, a generation unit, and a training unit:
the first acquisition unit is configured to acquire a sample gesture video, the sample gesture video having a corresponding sample gesture type;
the first collection unit is configured to collect a plurality of video frame images corresponding to the sample gesture video based on a preset time interval;
the generating unit is used for generating sample data corresponding to the sample gesture video based on the plurality of video frame images, and the sample gesture type is a label corresponding to the sample data;
The training unit is used for training to obtain a gesture recognition model through the sample data, and the gesture recognition model is used for recognizing gesture types corresponding to gesture videos to be recognized.
11. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the gesture recognition model training method of any of claims 1-9 according to instructions in the program code.
12. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a program code for performing the gesture recognition model training method of any of claims 1-9.
13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the gesture recognition model training method of any of claims 1-9.
CN202210907224.5A 2022-07-29 2022-07-29 Gesture recognition model training method and related device Pending CN117523654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210907224.5A CN117523654A (en) 2022-07-29 2022-07-29 Gesture recognition model training method and related device

Publications (1)

Publication Number Publication Date
CN117523654A true CN117523654A (en) 2024-02-06

Family

ID=89748314



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination