CN115731604A - Model training method, gesture recognition method, device, equipment and storage medium - Google Patents

Model training method, gesture recognition method, device, equipment and storage medium

Info

Publication number: CN115731604A
Authority: CN (China)
Prior art keywords: gesture, hand, feature, model, picture
Legal status: Pending
Application number: CN202110995209.6A
Other languages: Chinese (zh)
Inventors: 皇甫统帅, 程宝平, 谢小燕
Current Assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Original Assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Application filed by: China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd
Priority application: CN202110995209.6A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a model training method, a gesture recognition method, a device, equipment and a storage medium, wherein the method comprises the following steps: obtaining at least one gesture picture and at least one non-gesture picture; respectively carrying out picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture to obtain a sample picture set; training a preset network model by using a sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected. Therefore, the sample picture not only comprises the gesture, but also is fused with non-gesture content, so that the gesture recognition model obtained through training is more suitable for real scenes, the recognition accuracy and the model robustness of the gesture recognition model are enhanced, and the interference of external factors on the recognition result is reduced.

Description

Model training method, gesture recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of human-computer interaction, and in particular, to a model training method, a gesture recognition method, an apparatus, a device, and a storage medium.
Background
The gesture recognition technology provides a natural and intuitive communication mode for the conversation between a person and the terminal equipment. However, in the related art, gesture recognition generally relies on skin-color-based hand segmentation and positioning, manually designed gesture feature extraction, and the like; the recognition result is easily influenced by the environment, so that the accuracy of the gesture recognition result is low.
Disclosure of Invention
The application provides a model training method, a gesture recognition method, a device, equipment and a storage medium, which can obtain a gesture recognition model more suitable for a real scene, improve the accuracy of gesture recognition, and reduce the interference of external factors such as illumination.
The technical scheme of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a model training method, where the method includes:
acquiring at least one gesture picture and at least one non-gesture picture;
performing picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
training a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected.
In a second aspect, an embodiment of the present application provides a gesture recognition method, which is applied to the gesture recognition model according to the first aspect, and the method includes:
acquiring a video stream to be detected;
performing gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determining gesture types and hand positioning information in each video frame;
performing overlapping degree IOU detection on the hand positioning information in the video stream to be detected, and determining a hand detection result of the video stream to be detected;
and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame.
In a third aspect, embodiments of the present application provide a model training apparatus, which includes a first obtaining unit, a fusing unit, and a training unit, wherein,
the first acquisition unit is configured to acquire at least one gesture picture and at least one non-gesture picture;
the fusion unit is configured to perform picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
the training unit is configured to train a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected.
In a fourth aspect, embodiments of the present application provide a gesture recognition apparatus, which includes a second obtaining unit, a gesture detection unit, an IOU detection unit, and a determination unit, wherein,
the second obtaining unit is configured to obtain a video stream to be detected;
the gesture detection unit is configured to perform gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determine a gesture type and hand positioning information in each video frame;
the IOU detection unit is configured to perform overlapping degree IOU detection on the hand positioning information in the video stream to be detected and determine a hand detection result of the video stream to be detected;
the determining unit is configured to determine a gesture type recognition result of the same hand in the video stream to be detected according to the gesture type in each video frame when the hand detection result indicates that the hand in the video stream to be detected is the same hand.
In a fifth aspect, embodiments of the present application provide an electronic device, which includes a memory and a processor, wherein,
the memory is configured to store a computer program operable on the processor;
the processor is configured to perform, when executing the computer program, the model training method according to the first aspect, or perform the gesture recognition method according to the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer storage medium storing a computer program, where the computer program, when executed by a processor, implements the model training method according to the first aspect; alternatively, a gesture recognition method as described in the second aspect is implemented.
According to the model training method, the gesture recognition device, the equipment and the storage medium, at least one gesture picture and at least one non-gesture picture are obtained during model training; respectively carrying out picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture to obtain a sample picture set; training a preset network model by using a sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected. Acquiring a video stream to be detected during gesture recognition; performing gesture detection on each video frame in a video stream to be detected by using a gesture recognition model, and determining gesture types and hand positioning information in each video frame; performing overlapping degree IOU detection on hand positioning information in a video stream to be detected, and determining a hand detection result of the video stream to be detected; and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame. Therefore, the sample picture not only comprises the gesture, but also is fused with non-gesture content, so that the gesture recognition model obtained by training is more suitable for real scenes, the recognition accuracy and the model robustness of the gesture recognition model are enhanced, and the interference of external factors on the recognition result is reduced; in addition, gesture detection is carried out on each video frame in the video stream to be detected, hand tracking is carried out on the video stream to be detected according to the detected hand positioning information, the same hand in the video stream to be detected is determined, and a gesture type recognition result of the same hand in the video stream to be detected is further determined according to the detected gesture type; therefore, the accuracy and precision of gesture recognition are improved.
Drawings
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a process of fusing pictures according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a gesture recognition method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a gesture recognition result of a picture according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating an architecture of a gesture recognition apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic detailed flowchart of a gesture recognition method according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of hierarchical feature fusion of a network structure according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a preset network model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a classification branch network and a regression branch network according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a structure of a model training apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram of a specific hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first\second\third" referred to in the embodiments of the present application are only used for distinguishing similar objects and do not represent a specific ordering of the objects; it should be understood that "first\second\third" may be interchanged where a specific order or sequence permits, so that the embodiments of the present application described herein can be implemented in an order other than that shown or described herein.
It can be understood that the gesture recognition technology provides a natural and intuitive communication mode for a person to have a conversation with the terminal device. Gesture recognition in a real scene mainly relates to a plurality of processes such as hand detection, gesture feature extraction and classification. At present, the hand detection technology mainly comprises a skin color segmentation-based hand detection method and a deep learning hand detection method; the gesture feature extraction and classification mainly comprises a method for extracting and classifying features by traditional manual design and a method based on deep learning.
At present, the related art provides a method for performing gesture recognition based on a YOLO (You Only Look Once) network. The method performs background filtering on a picture to be recognized and then determines a gesture recognition result based on a simple threshold skin-color segmentation algorithm in the YCbCr color space. However, the method cannot filter out skin-color-like backgrounds, faces and other body parts, is easily influenced by the surrounding environment, and the robustness of the gesture recognition algorithm is poor; in addition, the YOLO model runs slowly, resulting in poor real-time performance; moreover, the method only supports single-hand gesture recognition.
In addition, the related art also provides a gesture recognition method based on computer vision, the method can position the fingertips of the hands, however, the fingertips are often shielded in a real scene, and the application scene and the detection precision of the model are greatly restricted; in addition, two deep learning models (a hand detection model and a gesture recognition recurrent neural network model) need to be loaded during gesture recognition, so that the algorithm processing time is increased, the operation efficiency of the algorithm is seriously influenced, and the calculation resources are wasted. The related art also provides a gesture recognition method based on deep learning, and the gesture recognition task is completed through an improved YOLOv3 model. On one hand, however, the YOLO model runs slower; on the other hand, the model of the scheme is easily influenced by factors such as background and illumination, and the accuracy of the model is poor. The related technology also provides a gesture interaction method, the method comprises the steps of shooting gestures of a user through a camera, carrying out hand detection and gesture recognition on a shot image by utilizing a pre-trained deep neural network to obtain a first gesture recognition result, carrying out gesture analysis and confirmation based on the first gesture recognition result in a multi-frame image to obtain a second gesture recognition result, triggering and executing operations corresponding to the second gesture recognition result, and finishing gesture interaction. However, the method needs to load three algorithm models (a hand detection algorithm model, a tracking algorithm model and a gesture recognition algorithm model) at the same time, the algorithm processing time is long, the operation efficiency is low, and the calculation resources are wasted; in addition, hardware equipment such as a depth image camera and the like is needed for acquiring the depth image information by the method, so that the application scene of the method is greatly restricted.
That is to say, the current gesture recognition method needs to divide and position the hand based on skin color, manually extract gesture features, and the like for recognition, and is easily influenced by the environment, and the hand positioning, the design and the selection of the features greatly influence the gesture recognition result. The rapid development of deep learning and the application thereof in the field of gesture recognition greatly improve the accuracy of gesture recognition, but still have the following problems: (1) Due to the complex deep learning model and other reasons, the real-time performance of gesture recognition is poor; (2) The existing static gesture recognition technology often performs feature extraction and recognition on a specific video frame in an isolated manner, and time sequence information among video frames is lost, so that the gesture recognition accuracy is reduced; (3) In the existing gesture recognition method, in order to perform multi-hand gesture recognition by considering the time sequence characteristics of gestures of front and back frames, a target tracking algorithm is often introduced, which not only increases the spatial redundancy and the time redundancy of the gesture recognition algorithm, but also causes the waste of computing resources; (4) At present, the number of available public gesture data sets is small, and the applied scenes of the public gesture data sets are different, so the gesture recognition algorithm training set is usually derived from a self-made data set. Due to the restriction of conditions such as environment and the like, the self-made data set has the defects of single background, single illumination and the like, so that the robustness of the obtained algorithm model is poor.
Based on this, the embodiment of the present application provides a model training method, and the basic idea of the method is: acquiring at least one gesture picture and at least one non-gesture picture; performing picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set; training a preset network model by using a sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected.
The embodiment of the application further provides a gesture recognition method, and the basic idea of the method is as follows: acquiring a video stream to be detected; performing gesture detection on each video frame in a video stream to be detected by using a gesture recognition model, and determining gesture types and hand positioning information in each video frame; performing overlapping degree IOU detection on hand positioning information in a video stream to be detected, and determining a hand detection result of the video stream to be detected; and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame.
Therefore, the sample picture not only comprises the gesture, but also is fused with non-gesture content, so that the gesture recognition model obtained through training is more suitable for real scenes, the recognition accuracy and the model robustness of the gesture recognition model are enhanced, and the interference of external factors on the recognition result is reduced. In addition, gesture detection is carried out on each video frame in the video stream to be detected, hand tracking is carried out on the video stream to be detected according to the detected hand positioning information, the same hand in the video stream to be detected is determined, and a gesture type recognition result of the same hand in the video stream to be detected is further determined according to the detected gesture type; therefore, the accuracy and precision of gesture recognition are improved.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, referring to fig. 1, a flowchart of a model training method provided in an embodiment of the present application is shown. As shown in fig. 1, the method may include:
s101, obtaining at least one gesture picture and at least one non-gesture picture.
S102, performing picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set.
It should be noted that the model training method provided in the embodiments of the present application may be applied to a model training apparatus or an electronic device integrated with the apparatus. Here, the electronic device may be, for example, a computer, a smart phone, a tablet computer, a notebook computer, a palm computer, a Personal Digital Assistant (PDA), a navigation device, a server, and the like, which are not particularly limited in this embodiment of the present application.
It should be further noted that the gesture picture may include a picture collected in a real scene and including a gesture, may also include a gesture picture obtained from a movie, a television, and the like, and may even include a gesture picture of a cartoon, which is not specifically limited in this embodiment of the present application.
The non-gesture picture may include a scene picture captured actually, a landscape picture, and the like, and may also include various pictures in a non-actual scene, and the like, which is not particularly limited in the embodiment of the present application.
It should be further noted that in the embodiment of the present application, a sample picture set is obtained by performing picture fusion processing on the gesture picture and the non-gesture picture. Here, one sample picture in the sample picture set may be a sample picture obtained by performing picture fusion processing on one gesture picture and one non-gesture picture, or may be a sample picture obtained by performing picture fusion processing on one gesture picture and a plurality of non-gesture pictures. Therefore, for one gesture picture, the gesture picture can be respectively subjected to picture fusion processing with a plurality of non-gesture pictures, so that one gesture picture is expanded into a plurality of sample pictures, and a sample picture set is enriched.
By the mode, on one hand, the sample picture not only contains gesture content but also non-gesture content, and after model training is carried out, the influence of environmental factors on a model recognition result can be reduced; on the other hand, when the gesture picture is not rich enough, the sample picture set can be enriched.
In some embodiments, the picture fusion processing may at least include picture blending processing and/or picture splicing processing.
It should be noted that when the gesture picture and the non-gesture picture are subjected to picture fusion processing, the gesture picture and the non-gesture picture may be fused and/or spliced. Exemplarily, refer to fig. 2, which shows a schematic diagram of a process of fusing pictures according to an embodiment of the present application. In fig. 2, a landscape picture and a gesture picture may be subjected to image fusion, so as to obtain a fused sample picture.
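As a concrete illustration of the fusion step described above, the sketch below shows one way such picture blending and picture splicing could be implemented with OpenCV; the file paths, the blending weight alpha, and the choice of weighted blending plus horizontal stitching are assumptions for illustration, not the specific algorithm claimed by the patent.

```python
import cv2
import numpy as np

def fuse_pictures(gesture_path: str, background_path: str, alpha: float = 0.6) -> np.ndarray:
    """Blend a gesture picture onto a non-gesture picture (e.g. a landscape picture)."""
    gesture = cv2.imread(gesture_path)        # hypothetical file paths
    background = cv2.imread(background_path)
    background = cv2.resize(background, (gesture.shape[1], gesture.shape[0]))
    # Weighted pixel-level fusion: alpha * gesture + (1 - alpha) * background.
    return cv2.addWeighted(gesture, alpha, background, 1.0 - alpha, 0.0)

def splice_pictures(gesture_path: str, background_path: str) -> np.ndarray:
    """Splice (stitch) the gesture picture and the non-gesture picture side by side."""
    gesture = cv2.imread(gesture_path)
    background = cv2.imread(background_path)
    background = cv2.resize(background, (gesture.shape[1], gesture.shape[0]))
    return np.hstack([gesture, background])
```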
Further, in some embodiments, the method may further comprise: adding marking information to sample pictures in the sample picture set; the annotation information may include gesture type information in the sample picture and hand positioning information in the sample picture, and the hand positioning information may include center coordinates of the hand positioning frame and width and height of the hand positioning frame.
Note that, in the embodiment of the present application, annotation information may be added to the sample picture, where the annotation information may include gesture type information and hand positioning information of a hand included in the sample picture. The hand positioning information may specifically include a center coordinate of the hand positioning frame and a width and a height of the hand positioning frame.
In addition, according to the embodiment of the application, the labeling information can be added to the gesture picture before the picture fusion is carried out, and then the gesture picture after the labeling information is added and the non-gesture picture are subjected to picture fusion processing, so that the fused sample picture also has the labeling information; the embodiment of the present application is not particularly limited to this.
Note that, when adding annotation information, the annotation can be implemented by an image annotation tool disclosed in the art, for example, the VGG Image Annotator (abbreviated as VIA); the image annotation tool used in the embodiment of the present application is not particularly limited.
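To make the annotation information above concrete, the snippet below sketches one possible per-picture label record; the field names and the dictionary layout are hypothetical and are not a format prescribed by the patent or by the VIA tool.

```python
# Hypothetical annotation record for one fused sample picture; all field names are illustrative.
sample_annotation = {
    "picture": "sample_0001.jpg",      # assumed file name of the fused sample picture
    "hands": [
        {
            "gesture_type": "victory", # gesture type information
            "cx": 412.0,               # center x-coordinate of the hand positioning frame (pixels)
            "cy": 238.0,               # center y-coordinate of the hand positioning frame (pixels)
            "w": 96.0,                 # width of the hand positioning frame
            "h": 120.0,                # height of the hand positioning frame
        }
    ],
}
```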
S103, training a preset network model by using the sample picture set to obtain a gesture recognition model.
It should be noted that the gesture recognition model may include a first sub-model and a second sub-model, where the first sub-model may be used to determine a gesture type of the picture to be detected, and the second sub-model may be used to determine hand positioning information of the picture to be detected.
That is to say, the gesture recognition model obtained by training the preset network model by using the sample picture set can determine not only the gesture type of the picture to be detected, but also the hand positioning information of the picture to be detected.
In addition, in order to enable the recognition result of the gesture recognition model to be more accurate, the embodiment of the application can also preprocess the sample picture before the sample picture is used for training the preset network model. Therefore, in some embodiments, before training the preset network model with the sample picture set, the method may further include:
carrying out size adjustment on sample pictures in the sample picture set; and/or the presence of a gas in the gas,
and carrying out color mode conversion on the sample pictures in the sample picture set.
It should be noted that the resizing may include resizing the sample pictures to a predetermined size, for example, a pixel size of 320 × 320; the color mode conversion may include converting each sample picture into a preset color mode, for example, the red-green-blue color mode (RGB color mode).
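A minimal preprocessing sketch following the description above (resize to 320 × 320 and convert to the RGB color mode); the use of OpenCV and the BGR-to-RGB conversion are assumptions about how the pictures are loaded.

```python
import cv2
import numpy as np

def preprocess_sample(picture: np.ndarray, size: int = 320) -> np.ndarray:
    """Resize a sample picture to the preset 320 x 320 size and convert it to the RGB color mode."""
    resized = cv2.resize(picture, (size, size))
    # OpenCV loads pictures as BGR, so convert to RGB; skip this step if the loader already yields RGB.
    return cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
```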
When training a preset network model with a sample picture set to obtain a gesture recognition model, in some embodiments, the preset network model may include a feature extraction network, a classification branch network, and a regression branch network; correspondingly, the training of the preset network model by using the sample picture set to obtain the gesture recognition model may include:
inputting sample pictures in the sample picture set into a preset network model;
carrying out feature processing on the sample pictures in the sample picture set through a feature extraction network to obtain at least one target feature picture;
training the classification branch network through at least one target characteristic diagram to obtain a first sub-model; training the regression branch network through at least one target feature map to obtain a second sub-model;
and determining a gesture recognition model according to the first sub-model and the second sub-model.
It should be noted that, when the preset network model is trained, the sample pictures of the sample picture set may be sequentially input or simultaneously input into the preset network model, and at least one target feature map corresponding to the sample pictures is obtained through the feature extraction network of the preset network model; and training a classification branch network and a regression branch network of a preset network model respectively through the at least one target feature map to respectively obtain a first sub-model and a second sub-model, and determining a gesture recognition model according to the first sub-model and the second sub-model.
Further, for a feature extraction network, in some embodiments, the feature extraction network comprises a feature extraction layer, a feature fusion layer, and a feature convolution layer; the performing, by the feature extraction network, feature processing on the sample picture in the sample picture set to obtain at least one target feature map may include:
performing initial feature extraction on the sample picture by using the feature extraction layer, and determining at least one initial feature map corresponding to the sample picture;
performing feature interactive fusion on the at least one initial feature map by using the feature fusion layer to obtain at least one fusion feature map corresponding to the sample picture;
and performing convolution operation on the at least one fused feature map by using the feature convolution layer to obtain at least one target feature map.
It should be noted that, when determining the target feature map, for any sample picture, first, the feature extraction layer of the feature extraction network is used to perform initial feature extraction on the sample picture in at least one feature layer, so as to obtain at least one initial feature map corresponding to the sample picture.
It should be further noted that the preset network model includes an SSD_mobilenetV3_I model. In some embodiments, the feature extraction layers may include the feature layers Layer14, Layer17, Layer20, Layer23, Layer26, and Layer29 in the SSD_mobilenetV3_I model.
Illustratively, taking six feature layers, namely Layer14, Layer17, Layer20, Layer23, Layer26 and Layer29, as an example, after a sample picture is subjected to initial feature extraction, initial feature maps corresponding to the sample picture in the six feature layers can be obtained.
Further, feature cross fusion is performed on the at least one initial feature map by using a feature fusion layer of the feature extraction network, so that at least one fusion feature map corresponding to the sample picture is obtained. Taking the SSD_mobilenetV3_I model as an example, six feature fusion maps can be obtained for the six feature layers Layer14, Layer17, Layer20, Layer23, Layer26 and Layer29. Here, the feature fusion layer may include a cascade (Concatenate) layer; that is, feature interactive fusion is performed on the six initial feature maps of the sample picture in a cascade manner, so that six feature fusion maps with rich features can be obtained.
Further, a convolution operation is performed on the at least one feature fusion map by using a feature convolution layer of the feature extraction network to finally obtain at least one target feature map. Still taking the SSD_mobilenetV3_I model as an example, convolution operations are performed on the obtained six feature fusion maps, so as to obtain six target feature maps correspondingly. Here, the feature convolution layer may include a convolution layer with a convolution kernel size of 1 × 1 (denoted by Conv1 × 1).
It should be further noted that, for the fused feature map, in some embodiments, the performing feature interactive fusion on at least one initial feature map by using the feature fusion layer to obtain at least one fused feature map corresponding to the sample picture may include:
determining an initial feature map of the sample picture in an ith feature layer and initial feature maps of other feature layers except the ith feature layer;
sampling the initial feature maps of other feature layers except the ith feature layer to obtain a sampling result;
performing feature fusion on the initial feature map and the sampling result of the ith feature layer by using the feature fusion layer to obtain a fusion feature map of the sample picture on the ith feature layer; wherein i is an integer greater than zero.
When feature fusion layers are used to perform feature interactive fusion on the initial feature maps to obtain a fused feature map, the feature fusion layers are used to fuse the initial feature maps corresponding to each feature layer with the sampling results of sampling the initial feature maps of other feature layers.
That is to say, for the ith feature layer, it is necessary to sample the initial feature maps of other feature layers than the ith feature layer to obtain a sampling result, and the feature fusion layer fuses the sampling result corresponding to the initial feature maps of the other feature layers with the initial feature map of the ith feature layer, so as to obtain a fusion feature map corresponding to the sample picture in the ith feature layer.
Taking the feature layer Layer14 in the SSD_mobilenetV3_I model as an example, the feature layers Layer17, Layer20, Layer23, Layer26, and Layer29 are sampled respectively to obtain five sampling results, and then feature fusion is performed on the five sampling results and the initial feature map of Layer14 to obtain a fusion feature map corresponding to Layer14.
In a specific example, the feature layers other than the ith feature layer may include: a first feature layer part located before the ith feature layer and a second feature layer part located after the ith feature layer, where the level of the first feature layer part is higher than that of the ith feature layer, and the level of the ith feature layer is higher than that of the second feature layer part;
correspondingly, the sampling of the initial feature maps of the feature layers other than the ith feature layer to obtain a sampling result may include:
performing down-sampling processing on the initial feature map of the first feature layer part and up-sampling processing on the initial feature map of the second feature layer part to obtain a sampling result.
It should be noted that, for the ith feature layer, the feature layers other than the ith feature layer may include two parts, namely, a first feature layer part located before the ith feature layer and a second feature layer part located after the ith feature layer.
In addition, for the sampling process, the embodiment of the present application performs a down-sampling process on the initial feature map of the first feature layer portion, and performs an up-sampling process on the initial feature map of the second feature layer portion.
Illustratively, taking the feature layer Layer20 in the SSD_mobilenetV3_I model described above as an example, for Layer20, the first feature layer part before it includes Layer14 and Layer17, and the second feature layer part after it includes Layer23, Layer26 and Layer29; Layer14 and Layer17 are down-sampled, and Layer23, Layer26 and Layer29 are up-sampled, respectively. Taking the feature layer Layer14 in the SSD_mobilenetV3_I model as an example, for Layer14, there is no first feature layer part before it, and the second feature layer part after it includes Layer17, Layer20, Layer23, Layer26 and Layer29; Layer17, Layer20, Layer23, Layer26 and Layer29 are up-sampled, respectively. Taking the feature layer Layer29 in the SSD_mobilenetV3_I model as an example, for Layer29, the first feature layer part before it includes Layer14, Layer17, Layer20, Layer23 and Layer26, and no second feature layer part exists after it; Layer14, Layer17, Layer20, Layer23 and Layer26 are down-sampled, respectively.
In this way, the interactive fusion of the profile detail features of the low-layer feature layer and the middle-layer feature layer and the semantic features of the high-layer feature layer is realized by sampling and then fusing the initial feature maps corresponding to other feature layers except the ith feature layer, and each feature layer simultaneously has the high-layer semantic features and the profile detail features of the low-layer and the middle-layer by adopting a parallel and independent feature fusion mode.
In some embodiments, the downsampling process may employ a maximum pooling function and the upsampling process may employ a deconvolution function.
That is, the down-sampling processing can be implemented by a maximum pooling (max_pooling) function, and the up-sampling processing can be implemented by a deconvolution (Deconvolution) function.
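The sketch below illustrates the fusion scheme described above for one feature layer: initial feature maps of earlier (larger) layers are down-sampled with max pooling, those of later (smaller) layers are up-sampled with a deconvolution, the results are concatenated with the ith layer's own map, and a 1 × 1 convolution produces the target feature map. It is a structural sketch in PyTorch; the channel counts, kernel sizes, the extra interpolation used to align spatial sizes, and the on-the-fly layer construction are assumptions rather than details taken from the patent.

```python
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_feature_layer(i: int, initial_maps: List[torch.Tensor]) -> torch.Tensor:
    """Build the fusion feature map, then the target feature map, for the i-th feature layer.

    initial_maps[k] is the initial feature map of the k-th feature layer (e.g. the six maps
    from Layer14/17/20/23/26/29); spatial size is assumed to shrink as k grows.
    """
    target_h, target_w = initial_maps[i].shape[-2:]
    sampled = []
    for k, fmap in enumerate(initial_maps):
        if k == i:
            continue
        if fmap.shape[-1] > target_w:
            # Earlier (larger) feature layer: down-sample with max pooling.
            sampled.append(F.adaptive_max_pool2d(fmap, (target_h, target_w)))
        else:
            # Later (smaller) feature layer: up-sample with a deconvolution.
            deconv = nn.ConvTranspose2d(fmap.shape[1], fmap.shape[1], kernel_size=3,
                                        stride=2, padding=1, output_padding=1)
            up = deconv(fmap)
            # Align to the target spatial size when a single stride-2 deconvolution is not enough.
            sampled.append(F.interpolate(up, size=(target_h, target_w)))
    # Cascade (Concatenate) layer: fuse the i-th map with all sampling results along channels.
    fused = torch.cat([initial_maps[i]] + sampled, dim=1)
    # Feature convolution layer: Conv1x1 reduces the fused map to the target feature map.
    conv1x1 = nn.Conv2d(fused.shape[1], initial_maps[i].shape[1], kernel_size=1)
    return conv1x1(fused)  # weights are random here; in the actual model they would be learned
```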
Further, for the first sub-model, the classification branching network may include an attention mechanism module; the attention mechanism module can comprise a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolution layer, a transposition convolution layer and a jump layer. Accordingly, in some embodiments, the training the classification branching network through at least one target feature map to obtain the first sub-model may include:
determining a first model loss function, performing classification training on at least one target feature map by using a classification branch network comprising an attention mechanism module, and determining the trained classification branch network as a first sub-model when the value of the first model loss function reaches a preset convergence value.
It should be noted that, an attention mechanism module is introduced into the classification branch network in the embodiment of the present application, so that the classification branch network can focus more on features of the sample picture related to the gesture type, that is, the classification branch network focuses more on semantic features.
Here, the attention mechanism module may include a convolution sub-module, an activation sub-module, a multiplication sub-module, and an addition sub-module. The convolution sub-module is used for further extracting features from a target feature map, and the extracted result may be called an initial classification feature matrix; then, probability estimation is performed on the features in the initial classification feature matrix through the activation sub-module to obtain a classification probability matrix, where features with higher relevance to the gesture type have larger probability values. Illustratively, the activation function of the activation sub-module may adopt a Sigmoid function; because the Sigmoid function is monotonically increasing and its inverse function is also monotonically increasing, the features can be mapped into [0, 1] through the Sigmoid function. The multiplication sub-module is used for performing element-wise multiplication processing (Element-wise product) on the classification probability matrix and the target feature map to obtain a classification product matrix; since the classification probability matrix represents the probability values of the features, features with higher correlation with the gesture type are amplified after being element-wise multiplied with the target feature map. The addition sub-module is used for performing element-wise addition processing (Element-wise sum) on the classification product matrix and the target feature map to obtain a classification sum matrix.
Wherein the convolution sub-module may include a convolution layer, a transposed convolution layer, and a jump layer, wherein the convolution layer may include a convolution layer with a convolution kernel size of 1 × 1 (denoted by Conv1 × 1) and a convolution layer with a convolution kernel size of 3 × 3 (denoted by Conv3 × 3); the transposed convolutional layer includes a transposed convolutional layer (denoted by Deconv3 × 3) having a convolutional kernel size of 3 × 3; the skip layer is used to skip connect (skip connection) the convolution layer and the transposed convolution layer.
In the attention mechanism module of the classification branch network, at least one classification sum matrix can be obtained by processing each target feature map of the sample picture according to the above method. For the at least one classification sum matrix, the classification branch network can perform de-duplication through a non-maximum suppression (NMS) algorithm and predict the gesture type of the corresponding sample picture; the predicted gesture type is compared with the real gesture type to determine the value of the first model loss function, until the value of the first model loss function reaches a preset convergence value, that is, the error of the model is sufficiently small, or further training no longer reduces the error, which indicates that the model has converged. At this moment, the trained classification branch network is determined as the first sub-model.
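A minimal PyTorch sketch of the attention mechanism module as described above (convolution sub-module with a skip connection, Sigmoid activation, element-wise product, element-wise sum); the channel counts, strides and padding are assumptions, and the same structure would apply to the regression branch described further below.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Attention block of the classification / regression branch (sketch under stated assumptions).

    The patent only names the Conv1x1 / Conv3x3 / Deconv3x3 layers, the skip connection,
    the Sigmoid activation and the element-wise product / sum; everything else is assumed.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)             # Conv1x1
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # Conv3x3
        self.deconv3 = nn.ConvTranspose2d(channels, channels, kernel_size=3, padding=1)  # Deconv3x3
        self.activation = nn.Sigmoid()

    def forward(self, target_map: torch.Tensor) -> torch.Tensor:
        # Convolution sub-module with a skip (jump) connection between conv and deconv outputs.
        feat = self.conv3(self.conv1(target_map))
        feat = self.deconv3(feat) + feat
        prob = self.activation(feat)          # probability matrix, mapped into [0, 1]
        product = prob * target_map           # element-wise product: amplify relevant features
        return product + target_map           # element-wise sum: the classification/regression sum matrix
```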
It should be further noted that the first model loss function may be a loss function commonly used in the art, such as an absolute value loss function, a 0-1 loss function, a Hinge loss function, a perceptual loss function, or a cross-entropy loss function, and this is not particularly limited in this embodiment.
Further, for the second sub-model, the regression branch network may include an attention mechanism module; the attention mechanism module can comprise a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolution layer, a transposition convolution layer and a jump layer. Accordingly, in some embodiments, the training the regression branch network through at least one target feature map to obtain the second submodel may include:
and determining a second model loss function, performing regression training on at least one target characteristic diagram by using a regression branch network comprising an attention mechanism module, and determining the trained regression branch network as a second sub-model when the value of the second model loss function reaches a preset convergence value.
It should be noted that, in the embodiment of the present application, an attention mechanism module is introduced into the regression branch network, so that the regression branch network can focus more on features of the sample picture related to the hand positioning information, that is, the regression branch network focuses more on the profile detail features.
Here, the attention mechanism module may still include a convolution sub-module, an activation sub-module, a multiplication sub-module, and an addition sub-module. The convolution sub-module is used for further extracting features from a target feature map, and the extracted result may be called an initial regression feature matrix; then, probability estimation is performed on the features in the initial regression feature matrix through the activation sub-module to obtain a regression probability matrix, where features with higher relevance to the hand positioning information have larger probability values. Illustratively, the activation function of the activation sub-module may still adopt a Sigmoid function for mapping features into [0, 1]. The multiplication sub-module is used for performing element-wise multiplication processing on the regression probability matrix and the target feature map to obtain a regression product matrix; since the regression probability matrix represents the probability values of the features, features with higher correlation with the hand positioning information are amplified after being element-wise multiplied with the target feature map. The addition sub-module is used for performing element-wise addition processing on the regression product matrix and the target feature map to obtain a regression sum matrix.
The convolution submodule can comprise a convolution layer, a transposition convolution layer and a jump layer, wherein the convolution layer comprises a convolution layer with a convolution kernel size of 1 × 1 and a convolution layer with a convolution kernel size of 3 × 3; the transposed convolutional layer includes a transposed convolutional layer having a convolutional kernel size of 3 × 3; the jump layer is used for jumping and connecting the convolution layer and the result of the convolution processing of the transposition convolution layer.
In the attention mechanism module of the regression branch network, at least one regression sum matrix can be obtained by processing each target feature map of the sample picture according to the above method. For the at least one regression sum matrix, the regression branch network can perform de-duplication through a non-maximum suppression algorithm and predict the hand positioning information of the corresponding sample picture; the predicted hand positioning information is compared with the real hand positioning information to determine the value of the second model loss function, until the value of the second model loss function reaches a preset convergence value, that is, the error of the model is sufficiently small, or further training no longer reduces the error, which indicates that the model has converged. At this moment, the trained regression branch network is determined as the second sub-model.
It should be further noted that the second model loss function may also be a loss function commonly used in the art, such as an absolute value loss function, a 0-1 loss function, a Hinge loss function, a perceptual loss function, or a cross-entropy loss function, which is not specifically limited in this embodiment.
In addition, it should be noted that in the foregoing feature extraction network, the initial feature maps of a sample picture at different feature layers are fused, so that a target feature map with richer features can be obtained, but the fused features are relatively redundant; therefore, the attention mechanism modules of the classification branch network and the regression branch network respectively amplify the relevant features and can also play a role in removing redundancy.
The embodiment provides a model training method, which includes the steps of obtaining at least one gesture picture and at least one non-gesture picture; respectively carrying out picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture to obtain a sample picture set; training a preset network model by using a sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected. Therefore, the sample pictures in the sample picture set are obtained by carrying out picture fusion processing on the gesture pictures and the non-gesture pictures, so that the sample pictures not only comprise gestures, but also fuse non-gesture contents, and the gesture recognition model obtained by training the sample picture set is more suitable for a real scene, enhances the recognition accuracy and the model robustness of the gesture recognition model, and can reduce the interference of external factors such as illumination and the like; moreover, the image fusion processing of the gesture images can enrich the sample image set to the maximum extent under the condition that some gesture images are limited; in addition, in the process of model training, because the target feature map is fused with multilayer features when the sample picture is subjected to feature extraction, the features in the target feature map are richer, and the target feature map has high-level semantic features and low-level and middle-level contour detail features; furthermore, as the attention mechanism modules are introduced into the classification branch network and the regression branch network, the first sub-model obtained through training focuses more on the features related to the gesture types, the second sub-model obtained through training focuses more on the features related to the hand positioning information, and the accuracy of the gesture recognition model in recognizing the gesture types and determining the hand positioning information is further improved.
In another embodiment of the present application, referring to fig. 3, a flowchart of a gesture recognition method provided in the embodiment of the present application is shown, and the method may apply the gesture recognition model described in the foregoing embodiment. As shown in fig. 3, the method may include:
s301, acquiring the video stream to be detected.
It should be noted that the gesture recognition method provided in the embodiments of the present application may be applied to a gesture recognition apparatus, or an electronic device integrated with the apparatus. Here, the electronic device may be, for example, a computer, a smart phone, a tablet computer, a notebook computer, a palm computer, a personal digital assistant, a navigation device, a server, and the like, which are not particularly limited in this embodiment of the present application. In addition, the electronic device for executing the gesture recognition method and the electronic device for executing the model training method may be the same electronic device or different electronic devices, and the embodiment of the present application is not limited in particular.
It should be further noted that the video stream to be detected may be video data acquired in real time, or may be video data acquired in non-real time. In some embodiments, the obtaining the video stream to be detected may include:
acquiring video data through a video acquisition module, and determining an initial video stream; wherein the initial video stream comprises at least one video frame;
and preprocessing each video frame in the initial video stream to obtain the video stream to be detected.
It should be noted that, in order to obtain a more accurate gesture recognition result, in the embodiment of the present application, an initial video stream may be first acquired, where the initial video stream includes at least one video frame; for example, real-time or non-real-time video data may be collected by a video collection module in the electronic device to determine an initial video stream; then, each video frame in the initial video stream is preprocessed, so that a preprocessed video stream, that is, a video stream to be detected, is obtained.
Further, in some embodiments, the pre-processing each video frame in the initial video stream may include:
performing size adjustment on each video frame in the initial video stream; and/or the presence of a gas in the gas,
each video frame in the initial video stream is color mode converted.
It should be noted that the preprocessing of the video frame may specifically include resizing and color mode conversion of the video frame. For example, the size of the video frame is adjusted to be the same as the size of a sample picture used for training the gesture recognition model, and the color mode of the video frame is converted into the same color mode as the sample picture, so that the gesture recognition model can perform gesture recognition more quickly and accurately.
Illustratively, the size of the video frame may be adjusted to 320 × 320 pixel size, and the color mode of the video frame may be converted into an RGB color mode (the inventor has proved through practical tests that the picture data of the RGB color mode is more favorable for subsequent gesture recognition).
S302, performing gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determining the gesture type and the hand positioning information in each video frame.
It should be noted that, gesture detection may be performed on each video frame in the video stream to be detected through the gesture recognition model obtained in the foregoing embodiment, so as to obtain a gesture recognition result of each video frame, where the gesture recognition result may specifically include a gesture type and hand positioning information of each hand in each video frame.
In some embodiments, the hand positioning information may include center coordinates of the hand positioning box and a width and a height of the hand positioning box.
It should be noted that, for any video frame (or picture to be detected), refer to fig. 4, which shows a schematic diagram of a gesture recognition result of a picture provided in an embodiment of the present application. As shown in fig. 4, the gesture recognition result of the picture is that the gesture type is victory-1.0, where the white square represents the hand positioning frame; the center coordinates of the hand positioning frame and the width and height of the hand positioning frame are not shown in fig. 4.
That is, the gesture recognition result shown in fig. 4 can be obtained for each video frame of the video stream to be detected. However, if the video frame itself does not contain a hand, the gesture recognition result shown in fig. 4 cannot be obtained naturally, that is, the gesture type and the hand positioning information cannot be recognized.
Further, in some embodiments, before the detecting the IOU for the hand positioning information in the video stream to be detected, the method may further include:
determining the size of a hand positioning frame in each video frame based on the hand positioning information in each video frame;
comparing the size of the hand positioning frame in each video frame with a preset size threshold;
and according to the comparison result, filtering the hand with the size of the hand positioning frame smaller than the preset size threshold value in the video stream to be detected, and determining the filtered video stream as the video stream to be detected.
It should be noted that after the hand positioning information of each video frame is determined, the embodiment of the present application further determines the size of the hand positioning frame in each video frame according to the hand positioning information, and filters the hand with the size of the hand positioning frame smaller than the preset size threshold, so as to eliminate the interference of too small hand detection frame on the gesture recognition accuracy. Illustratively, the preset size threshold is a pixel size of 50 × 50.
Specifically, the size of the hand positioning frame in each video frame may be compared with a preset size threshold. According to the comparison result, hands whose positioning frame size is smaller than the preset size threshold are filtered out of the video stream to be detected, while hands whose positioning frame size is greater than or equal to the preset size threshold are retained; the filtered video stream is then determined as the video stream to be detected.
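A minimal sketch of this filtering step is shown below, assuming each detection is represented as a dictionary whose 'size' entry holds the width and height of the hand positioning frame in pixels; the data layout and the 50 × 50 default threshold are illustrative assumptions.

```python
def filter_small_hands(detections, min_width: int = 50, min_height: int = 50):
    """Keep only detections whose hand positioning frame is at least the
    preset size; smaller frames are filtered out of the frame's results."""
    kept = []
    for det in detections:                 # det is assumed to be a dict
        width, height = det['size']        # width and height of the positioning frame
        if width >= min_width and height >= min_height:
            kept.append(det)
    return kept
```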
S303, performing IOU detection on the hand positioning information in the video stream to be detected, and determining a hand detection result of the video stream to be detected.
It should be noted that, for a video stream to be detected, a hand detection result of the video stream to be detected is determined by performing overlap degree (IOU) detection on hand positioning information in the video stream, where the hand detection result is used to indicate whether hands in the video stream are the same hand.
In some embodiments, the performing an IOU detection on the hand positioning information in the video stream to be detected and determining a hand detection result of the video stream to be detected may include:
determining at least one hand tracking group in a video stream to be detected; each hand tracking group comprises two video frame sequence numbers to be subjected to IOU detection and corresponding hand positioning information;
in each hand tracking group, if the difference value of the video frame sequence numbers in the hand tracking group is smaller than a preset difference threshold value and the IOU of the hand positioning information in the hand tracking group is larger than a preset overlap threshold value, determining that the hand data detected in the hand tracking group is the same hand;
and under the condition that the hand data detected in at least one hand tracking group are the same hand, determining that the hand detection result indicates that the hands in the video stream to be detected are the same hand.
Further, in some embodiments, the method may further comprise: and if the difference value of the video frame sequence numbers in the hand tracking group is not smaller than a preset difference value threshold value, or the IOU of the hand positioning information in the hand tracking group is not larger than a preset overlapping degree threshold value, determining that the hand data detected in the hand tracking group are different hands.
It should be noted that one or more hands may exist in the video stream to be detected. In the embodiment of the present application, on the basis of fully considering the time-sequence characteristics and exploiting the fact that hand motion is relatively slow, the tracking of the one or more hands in the video stream to be detected is realized through a simple IOU-based discrimination algorithm. Specifically, at least one hand tracking group is determined from the video stream to be detected; each hand tracking group includes two video frame sequence numbers to be detected and the corresponding hand positioning information, where a video frame sequence number is the video frame identification number (ID) of a video frame in the video stream. For example, if a video stream to be detected includes ten video frames, the video frame sequence numbers may be 1, 2, 3, and so on in temporal order. That is, one hand tracking group contains two pieces of hand data from two different video frames, and whether these two pieces of hand data belong to the same hand can be determined through IOU detection.
For each hand tracking group, if the difference value of the video frame sequence numbers in the hand tracking group is less than a preset difference threshold value (for example: 5), and the IOU of the hand positioning information (mainly referring to the IOU of the hand positioning frame) is greater than a preset overlap threshold value (for example: 0.8), determining that the two hand data in the hand tracking group are the same hand; otherwise, the two hand data in the hand tracking group are determined to be different hands. Thus, when it is determined that the hand data detected in at least one hand tracking group are all the same hand, it is determined that the hand detection result indicates that the hands in the video stream to be detected are the same hand.
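The following sketch illustrates this same-hand criterion, assuming the hand positioning information is given as (center x, center y, width, height) in pixels; the box format, helper names, and the default thresholds of 5 frames and 0.8 IOU are taken from the examples above or are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def same_hand(frame_id_a, box_a, frame_id_b, box_b,
              max_frame_gap: int = 5, iou_threshold: float = 0.8):
    """Two detections in a hand tracking group are treated as the same hand when
    their frame numbers are close enough and their positioning frames overlap."""
    return (abs(frame_id_a - frame_id_b) < max_frame_gap
            and iou(box_a, box_b) > iou_threshold)
```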
It should be noted that, if hand tracking is required, the usual approach is to train an additional target tracking model (e.g., a mean-shift model), which increases the time redundancy and spatial redundancy of the algorithm and consumes more resources. With the method and the device of the embodiment of the application, each of the at least one hand in the video stream to be detected can be tracked separately, and the hand tracking function is achieved without introducing a target tracking model, thereby reducing the redundancy of the algorithm and saving resources.
S304, when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type recognition result of the same hand in the video stream to be detected according to the gesture type in each video frame.
It should be noted that after the same hand in the video stream to be detected is determined, the gesture type recognition result of the same hand is determined according to the gesture type of each video frame in the video stream to be detected. Specifically, in some embodiments, the determining, according to the gesture type in each video frame, a gesture type recognition result of the same hand in the video stream to be detected may include:
acquiring gesture types of the same hand in a current video frame and initial gesture type recognition results in a preset number of video frames before the current video frame;
and optimizing the initial gesture type recognition result by utilizing the gesture type in the current video frame, and determining the gesture type recognition result of the same hand in the current video frame.
It should be noted that, when determining a gesture type recognition result of the same hand in a video stream to be detected, the embodiment of the present application still fully considers the timing characteristics of gesture recognition, and obtains a gesture recognition type of the hand in a current video frame and an initial gesture type recognition result in a preset number of video frames before the current video frame, where the initial gesture type recognition result includes a gesture type of the hand in each video frame in the preset number of video frames; illustratively, the preset number may be five frames.
And then optimizing the initial gesture type recognition result, and determining the gesture type recognition result of the same hand in the current video frame until the last frame of the video stream to be detected is recognized.
Further, in some embodiments, the optimizing the initial gesture type recognition result by using the gesture type in the current video frame to determine the gesture type recognition result of the same hand in the current video frame may include:
forming a gesture type candidate set according to gesture types in a current video frame and initial gesture type recognition results in a preset number of video frames before the current video frame;
and determining the gesture type with the highest occurrence frequency from the gesture type candidate set, and determining the gesture type with the highest occurrence frequency as a gesture type recognition result of the same hand in the current video frame.
It should be noted that the gesture type candidate set is composed of the gesture types of the same hand in the current video frame and in a preset number of video frames before the current video frame, and the gesture type with the highest occurrence frequency among them is determined as the final gesture type recognition result of the same hand. In this way, the possibility of recognition errors caused by errors of the gesture recognition model or by noise can also be reduced.
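A minimal sketch of this majority-vote correction is given below; the class name and the default window of five preceding frames are illustrative assumptions.

```python
from collections import Counter, deque

class GestureSmoother:
    """Keeps the gesture types of the preceding n frames for one tracked hand and
    returns the most frequent type among them and the current frame's type."""

    def __init__(self, n: int = 5):
        self.history = deque(maxlen=n)   # gesture types of the previous n frames

    def update(self, current_gesture: str) -> str:
        candidates = list(self.history) + [current_gesture]   # gesture type candidate set
        result = Counter(candidates).most_common(1)[0][0]     # highest occurrence frequency
        self.history.append(current_gesture)
        return result
```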
The embodiment provides a gesture recognition method, which includes acquiring a video stream to be detected; performing gesture detection on each video frame in a video stream to be detected by using a gesture recognition model, and determining gesture types and hand positioning information in each video frame; performing IOU detection on hand positioning information in a video stream to be detected, and determining a hand detection result of the video stream to be detected; and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame. In this way, gesture detection is carried out on each video frame in the video stream to be detected, hand tracking is carried out on the video stream to be detected according to the detected hand positioning information, the same hand in the video stream to be detected is determined, and a gesture type recognition result of the same hand in the video stream to be detected is further determined according to the detected gesture type; therefore, the accuracy and precision of gesture recognition are improved; in addition, when determining whether the hand data in the video stream to be detected is the same hand, the video frame sequence number and the hand positioning information are combined, so that the gesture recognition of multiple hands is realized on the basis of fully considering the time sequence characteristics without mutual interference, other target tracking models are not required to be introduced, the efficiency of realizing the gesture recognition by using the algorithm is optimized, the spatial redundancy and the temporal redundancy of the algorithm are effectively reduced, and the computing resources are saved; furthermore, when the gesture type recognition result of the same hand is determined, the time sequence characteristics among video frames are fully considered, the gesture type recognition stability is improved, the gesture type recognition precision is further improved, and the influence of model errors, noise and the like on the recognition result is reduced.
In another embodiment of the present application, fig. 5 is a schematic diagram illustrating an architecture of a gesture recognition apparatus provided in the embodiment of the present application. As shown in fig. 5, the architecture may include a real-time video acquisition module, a data processing module, a gesture recognition module, a hand tracking module, and an optimization and modification module; in this way, the gesture recognition method of the embodiment of the present application can be completed through cooperation between the modules.
Based on the architecture shown in fig. 5, refer to fig. 6, which shows a detailed flowchart of a gesture recognition method provided in an embodiment of the present application. As shown in fig. 6, the detailed process may include:
s601, acquiring a video stream.
It should be noted that, first, a real-time video capturing module may capture a real-time video stream or a non-real-time video stream (which is equivalent to the initial video stream in the foregoing embodiment), and this is not specifically limited in this embodiment of the application.
And S602, acquiring a video frame.
And S603, preprocessing the video frame.
It should be noted that, after the video stream is acquired, the data processing module may extract video frames from the video stream and preprocess each extracted video frame. Specifically, preprocessing a video frame may include resizing the picture, e.g., to 320 × 320 pixels; it may also include converting the color mode of the video frame, for example, into the red-green-blue (RGB) color mode.
And S604, detecting gesture information of the preprocessed video frame and filtering the preprocessed video frame.
It should be noted that, for the preprocessed video frames, the gesture recognition module may be used to perform gesture information recognition on each preprocessed video frame by using the gesture recognition model, and perform filtering processing. The gesture recognition result output by the gesture recognition model may include coordinates of a center point of the hand positioning box, a width and a height of the hand positioning box, a gesture type and a confidence corresponding to the gesture type. The recognition result of the gesture information recognition may refer to fig. 4 described above.
It should be further noted that the filtering process may include filtering out hands whose hand positioning box size (area) in the video frame is smaller than a preset size (area) threshold γ1, for example 50 × 50 pixels, so as to eliminate the interference of undersized hand positioning boxes on the gesture recognition precision. Thus, the video frames after the preprocessing and filtering operations constitute the video stream to be detected.
And S605, respectively tracking the multiple hands in the video stream to be detected according to the hand positioning information and the video frame sequence number.
It should be noted that, the hand tracking module may be used to track multiple hands in the video stream to be detected respectively according to the hand positioning information and the video frame sequence number. The hand positioning information can comprise the coordinates of the center point of the hand positioning frame, the width and the height of the hand positioning frame and the like, and the order number of the video frames is combined, so that the aim of respectively tracking the hands of a plurality of hands can be fulfilled on the basis of fully considering the time sequence characteristics.
Specifically, if the difference between the video frame sequence numbers of two different video frames is smaller than a preset difference threshold γ2 (e.g., 5) and the IOU of the hand positioning frames detected in the two video frames is larger than a preset overlap threshold γ3 (e.g., 0.80), the hand data detected in the two video frames are regarded as the same hand; otherwise, they are regarded as different hands. With this hand tracking method, the hand tracking function can be realized without introducing a target tracking model. The video frame sequence numbers of the two different video frames and the corresponding hand positioning information of the hands in those frames constitute the hand tracking group of the previous embodiment.
And S606, determining a final gesture type recognition result.
It should be noted that the final gesture type recognition result may be determined by the optimization and modification module based on the time-sequence characteristics of gesture recognition. For the same hand, the preliminary gesture type recognition results in the current frame t_i and in the previous n frames (t_{i-1}, t_{i-2}, ..., t_{i-n}; illustratively, n = 5) are obtained, i.e. the gesture type in each of the previous n frames, and the gesture type with the largest occurrence frequency is taken as the gesture type recognition result of the current video frame t_i, that is, the final gesture type recognition result.
S607, whether the last frame is recognized.
It should be noted that, if the determination result is yes, the last frame of the video stream to be detected has been recognized at this time, so the final gesture type recognition result may be directly output and the process ends; if the determination result is no, the last frame of the video stream to be detected has not yet been recognized, and the process returns to step S602 until the last frame is recognized. For example, when the user interacts with the electronic device through gestures, the electronic device may respond or perform an operation according to the recognized gesture type recognition result.
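Purely as an illustration, the loop S601–S607 could be glued together roughly as follows, reusing the hypothetical helpers sketched earlier (preprocess_frame, filter_small_hands, GestureSmoother); the model interface model.detect and the IOU-based assign_track helper are assumptions, not the actual interfaces of the embodiment.

```python
def recognize_video_stream(frames, model, assign_track):
    """Run the per-frame pipeline: preprocess, detect, filter, track, smooth."""
    smoothers = {}                                             # one smoother per tracked hand
    results = []
    for frame_id, frame in enumerate(frames):                  # S602: acquire a video frame
        rgb = preprocess_frame(frame)                          # S603: resize + RGB conversion
        detections = filter_small_hands(model.detect(rgb))     # S604: detect + filter
        for det in detections:
            track_id = assign_track(frame_id, det)             # S605: IOU-based hand tracking
            smoother = smoothers.setdefault(track_id, GestureSmoother())
            det['final_gesture'] = smoother.update(det['gesture'])  # S606: majority vote
        results.append(detections)
    return results                                             # S607: last frame processed
```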
That is to say, the gesture recognition method provided in the embodiment of the present application may be a real-time static gesture recognition method: it is real-time in that gesture recognition can be performed on a video stream acquired in real time, and static in that gesture detection is performed on each individual video frame of the video stream through the gesture recognition model. On one hand, the method selects six feature layers in the preset network model to extract picture features, which reduces the algorithm complexity of the model, enables real-time output of the recognition result with low CPU time consumption from video stream acquisition to gesture recognition output (about 40 ms, as detailed below), and makes mobile terminal applications feasible. On another hand, the method fully considers the time-sequence characteristics of the video frames, improving the gesture recognition precision. On another hand, exploiting the fact that hand motion is relatively slow, the method realizes the hand tracking function through a simple IOU-discrimination algorithm without introducing a target tracking model, which optimizes the algorithm efficiency, effectively reduces the spatial and temporal redundancy of the algorithm, and realizes the multi-hand gesture recognition function on the basis of fully considering the time-sequence characteristics. On yet another hand, the method performs picture fusion processing on the collected gesture pictures, which increases the diversity of the sample picture set and effectively improves the robustness of the gesture recognition model. In short, the embodiment of the application provides a real-time multi-hand static gesture recognition method based on time-sequence characteristics.
Besides, the embodiment of the present application further includes a training set (i.e. the sample picture set in the foregoing embodiment) acquisition and preprocessing of the gesture recognition model, and training of the gesture recognition model. In particular, training set acquisition and preprocessing of gesture recognition models may include: the method includes acquiring original gesture picture data (i.e., at least one gesture picture in the foregoing embodiment) in a real scene, performing picture fusion processing on the gesture picture by using a picture fusion technology, and performing hand information tagging (i.e., adding tagging information in the foregoing embodiment).
The image fusion technology is a technology for fusing the acquired gesture image with other types of image data (i.e., non-gesture images in the foregoing embodiments, such as landscape images, etc.), and includes, but is not limited to, fusion, splicing, etc. The process of performing the image fusion processing on the gesture picture can be seen in fig. 2, which fuses the gesture picture and a landscape picture.
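As a rough illustration of such fusion and splicing operations (not the exact processing of the embodiment), the sketch below blends or pastes a gesture picture onto a non-gesture picture with OpenCV; the blending weight, splice position, and function names are assumptions, and in practice the annotated hand positioning information must stay consistent with the transformed picture.

```python
import cv2
import numpy as np

def fuse_pictures(gesture_img: np.ndarray, background_img: np.ndarray,
                  alpha: float = 0.7) -> np.ndarray:
    """Blend a gesture picture with a non-gesture picture (e.g. a landscape)."""
    bg = cv2.resize(background_img, (gesture_img.shape[1], gesture_img.shape[0]))
    return cv2.addWeighted(gesture_img, alpha, bg, 1.0 - alpha, 0)

def splice_pictures(gesture_img: np.ndarray, background_img: np.ndarray) -> np.ndarray:
    """Paste the gesture picture onto the top-left region of a larger
    non-gesture picture (the background is assumed to be at least as large)."""
    h, w = gesture_img.shape[:2]
    canvas = background_img.copy()
    canvas[:h, :w] = gesture_img
    return canvas
```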
Note that the annotation information may be added before or after the picture fusion process is performed. The annotation tool used may be a public annotation tool known in the art, such as the VGG Image Annotator, with which hand information such as the hand positioning frame can be annotated.
It should be noted that, in the embodiment of the present application, when performing the target detection task (in the embodiment of the present application, the target detection task includes a type classification task for detecting a gesture type and a positioning regression task for determining hand positioning information), the gesture recognition model needs to pay attention to both the classification accuracy of the gesture type and the regression accuracy of the hand positioning information, and therefore, needs to pay attention to both semantic information and contour detail information of the feature layer. An existing network structure (for example, see fig. 7, which shows a hierarchical fusion schematic diagram of a network structure provided in an embodiment of the present application) often focuses only on fusion of a high-level semantic feature to a low-level and a middle-level profile detail feature or fusion of a low-level and a middle-level profile detail feature to a high-level semantic feature, and a layer-by-layer cascade/sequential fusion manner is adopted, so that a good recognition effect cannot be achieved.
Based on this, the embodiment of the application adopts an improved SSD _ mobilenetV3_ I as a preset network model, realizes interactive fusion of low-level and middle-level profile detail features and high-level semantic features, and adopts a parallel and independent feature fusion mode to enable each feature layer to simultaneously have the high-level semantic features and the low-level and middle-level profile detail features.
Referring to fig. 8, a schematic diagram of a configuration of a preset network model provided in an embodiment of the present application is shown. As shown in fig. 8, the preset network model may include a feature extraction network, a classification branch network, and a regression branch network, where the classification branch network and the regression branch network may be collectively referred to as a detection head or a Detection Prediction Layer. The feature extraction network may include a feature extraction layer, a feature fusion layer, and a feature convolution layer, where the feature convolution layer includes two Conv1 × 1 layers corresponding to each feature layer. After a sample picture is input into the preset network model, in the feature extraction network, operations such as up-sampling (upsample), down-sampling (downsample), concatenation (concatenate), and convolution (Conv1 × 1, filter = (1,1)) are first performed to interactively fuse the features of the six layers Layer14, Layer17, Layer20, Layer23, Layer26, and Layer29 in the backbone network SSD_mobilenetV3, and the fused six layers of features are respectively input into the classification branch network and the regression branch network to determine the gesture type and the hand positioning information. A deconvolution function may be adopted for the up-sampling of each feature layer, where the deconvolution parameters of Layer17, Layer20, and Layer29 may be set to kernel_size = [1,2,2,1], stride = [1,2,2,1], padding = 0, and the deconvolution parameters of Layer23 and Layer26 may be set to kernel_size = [1,2,2,1], stride = [1,2,2,1], padding = 1, output_padding = 1. A max_pooling function may be adopted for the down-sampling of each feature layer; the max_pooling parameters of Layer14, Layer17, and Layer26 may be set to kernel_size = [1,2,2,1], stride = [1,2,2,1], padding = 0, and the max_pooling parameters of Layer20 and Layer23 may be set to kernel_size = [1,2,2,1], stride = [1,2,2,1], padding = 1.
After the upsampling and the downsampling are respectively completed, as shown in fig. 8, each feature layer is further subjected to feature fusion through a concatenation (concatenate) operation and two convolution operations (convolutional layer parameters are kernel _ size = [1,1,1,1], stride = [1,1,1,1], and padding = 0), so that fused target feature maps corresponding to six feature layers are respectively obtained. Where kernel _ size represents the convolution kernel size, stride represents the convolution step size, padding represents the feature map fill width, and output _ padding represents the output feature edge extension value.
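For readers more comfortable with code, the parallel fusion of one feature layer with the other layers can be sketched as follows in PyTorch. This is only a simplified sketch: the [1,2,2,1]-style parameters above follow a different framework's layout, the channel counts are assumptions, and for brevity the resizing uses max pooling and nearest-neighbour interpolation instead of the exact max-pooling/deconvolution configuration of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelFeatureFusion(nn.Module):
    """Fuse the target feature layer with all other layers in parallel:
    higher-resolution layers are down-sampled, lower-resolution layers are
    up-sampled, everything is concatenated along the channel dimension and
    passed through two 1x1 convolutions."""

    def __init__(self, total_in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(total_in_channels, out_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, feature_maps, target_index: int):
        h, w = feature_maps[target_index].shape[2:]
        aligned = []
        for fm in feature_maps:
            if fm.shape[2] > h:                        # finer layer -> down-sample
                fm = F.adaptive_max_pool2d(fm, (h, w))
            elif fm.shape[2] < h:                      # coarser layer -> up-sample
                fm = F.interpolate(fm, size=(h, w), mode='nearest')
            aligned.append(fm)
        fused = torch.cat(aligned, dim=1)              # concatenate along channels
        return self.conv2(self.conv1(fused))           # two Conv1x1 layers
```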
Therefore, by carrying out feature interactive fusion on the feature maps of partial feature layers in the preset network model, the features of the target feature map of each layer are enriched, and the complexity of the algorithm can be reduced by only selecting partial feature layers for carrying out feature interactive fusion, so that the gesture recognition result can be output in real time.
Further, the inventors confirmed through a large number of experiments that, when the target detection task is executed in the preset network model, the features attended to by the type classification task and by the hand positioning information regression task differ: the former focuses more on semantic features, while the latter focuses more on contour detail features. Based on this, the embodiment of the present application further improves the detection head of the preset network model (the detection head includes the classification branch network and the regression branch network), that is, an Attention Module is introduced into the classification branch network and the regression branch network, respectively.
Referring to fig. 9, a schematic structural diagram of a classification branch network and a regression branch network provided in an embodiment of the present application is shown. As shown in fig. 9, each fused target feature map is input to the attention module of the classification branch and of the regression branch. For each target feature map, the attention module performs feature extraction through convolution Conv3 × 3 (kernel_size = [1,3,3,1], stride = [1,2,2,1], padding = 1), Conv1 × 1 (kernel_size = [1,1,1,1], stride = [1,1,1,1], padding = 0), and transposed convolution Deconv3 × 3 (kernel_size = [1,3,3,1], stride = [1,2,2,1], padding = 1, output_padding = 1), and applies skip connection processing to capture information at different scales.
Taking the classification branch network as an example, for the target feature map of any layer, a series of convolution operations is first applied to obtain the convolved features of the classification branch (i.e., the initial classification feature matrix in the foregoing embodiment). Attention feature selection is then performed on the convolved features through an activation function (softmax function) to obtain a probability value matrix for each pixel point (i.e., the classification probability matrix in the foregoing embodiment). The obtained probability value matrix is multiplied element-wise with the target feature map, so that important features are "amplified" (the probability values of important features are larger); the multiplication result (i.e., the classification product matrix in the foregoing embodiment) is then added element-wise to the target feature map to obtain a classification addition matrix. The target feature map of each layer is processed in this way to obtain the classification addition matrices corresponding to the six feature layers, after which convolution, gesture type prediction, and so on are performed, until a first sub-model with sufficiently high accuracy is obtained.
In the embodiment of the present application, the attention module may adopt a mixed attention mechanism to perform attention feature selection in both the channel and the spatial dimensions. The mixed attention mechanism can be realized through a sigmoid function, that is, the activation function is the sigmoid function and sigmoid processing is applied to the pixel at each channel and spatial position. The sigmoid function (also called the Logistic function) is one of the important activation functions of neural networks and is shown in formula (1).
sigmoid(x_{i,c}) = 1 / (1 + exp(-x_{i,c}))        (1)

where i is the pixel spatial position, c is the pixel channel position, and x_{i,c} is the pixel data.
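A minimal PyTorch sketch of this sigmoid-gated attention is given below, following the Conv3 × 3 → Conv1 × 1 → Deconv3 × 3 pattern of fig. 9 and the element-wise multiplication and addition described above; the class name and channel handling are illustrative assumptions, and the skip connections inside the attention module are simplified into a single residual addition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttentionGate(nn.Module):
    """Compute a per-channel, per-position sigmoid gate from the target feature
    map, multiply it with the feature map element-wise to amplify important
    features, then add the original feature map back."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.deconv3x3 = nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                            stride=2, padding=1, output_padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.deconv3x3(self.conv1x1(self.conv3x3(x)))  # roughly back to input size
        if attn.shape[2:] != x.shape[2:]:                     # guard for odd spatial sizes
            attn = F.interpolate(attn, size=x.shape[2:], mode='nearest')
        gate = torch.sigmoid(attn)        # formula (1) applied at every channel/position
        amplified = gate * x              # element-wise multiplication ("amplify")
        return amplified + x              # element-wise addition with the feature map
```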
In summary, the key points of the embodiments of the present application mainly lie in:
(1) The process of forming the sample picture set of the gesture recognition model comprises the following steps: acquiring a gesture picture in a real scene; and (4) performing image fusion processing on the gesture picture and the non-gesture picture (such as a landscape) by using technologies such as, but not limited to, fusion, splicing and the like. The sample picture set formed in the way can enable the finally obtained gesture recognition model to be suitable for a real scene, have strong robustness and reduce interference of external factors such as illumination.
(2) The preset network model adopted in the embodiment of the application is an improved SSD_mobilenetV3_I model: interactive fusion of the low-layer and middle-layer contour detail features with the high-layer semantic features is realized through operations such as upsample, downsample, concatenate, and Conv1 × 1, using a parallel and independent feature fusion mode; in addition, the embodiment of the application also improves the detection head of the preset network model, that is, a mixed attention module is introduced into the classification branch network and the regression branch network, respectively.
(3) When gesture recognition is carried out, the hand tracking function is realized through the hand positioning frame IOU detected in the front video frame and the back video frame, and multi-hand gesture recognition is realized without mutual interference on the premise that the time sequence characteristics of the front video frame and the back video frame are fully considered and a target tracking model is not introduced.
In short, the gesture recognition method provided by the embodiment of the present application can be implemented by the following modules:
s1, a real-time video acquisition module acquires real-time video data, and can also be non-real-time video data.
And S2, the data processing module extracts video frames and carries out preprocessing, wherein the preprocessing comprises adjusting the size of the picture (to 320 × 320) and converting the color mode into the RGB mode (data tests show that picture data in the RGB format is more beneficial to the subsequent hand recognition module).
And S3, the gesture recognition module is used for recognizing gesture information of the preprocessed video frame through the gesture recognition model and filtering the preprocessed video frame. The gesture information output by the gesture recognition model comprises coordinates of a center point of the hand positioning frame, the width and the height of the hand positioning frame, a gesture recognition type and a confidence coefficient corresponding to the gesture recognition type; the filtering operation is to filter the hand with the area smaller than a preset size threshold value of the hand positioning frame so as to eliminate the interference of the undersize hand positioning frame on the gesture recognition precision;
and S4, the hand tracking module tracks the multiple hands respectively according to the hand positioning information and the video frame ID (video frame sequence number), so that the aim of respectively identifying the gestures of the multiple hands is fulfilled on the basis of fully considering the time sequence characteristics. If the video frame ID difference is smaller than a preset difference threshold value and the hand positioning frame IOU detected in the two video frames is larger than a preset overlapping degree threshold value, the hand data detected in the different video frames are regarded as the same hand data; otherwise, the hand data detected in the different video frame is different hand data. The hand tracking method realizes the hand tracking function on the premise of not introducing a target tracking model.
And S5, the optimization and modification module determines the final gesture recognition result by considering the time-sequence characteristics of gesture recognition. For the same hand data, the preliminary gesture recognition results in the current frame t_i and in the previous n frames of video data are obtained, and the gesture type with the largest occurrence frequency is used as the final gesture recognition result of the current video frame t_i.
The embodiment of the application also comprises the acquisition and the preprocessing of the sample picture set of the gesture recognition model and the training of the gesture recognition model. The acquisition and preprocessing of the sample picture set of the gesture recognition model comprises the following steps: the method comprises the steps of collecting gesture pictures in a real scene, and carrying out picture fusion processing and hand information labeling on the gesture pictures. Picture fusion is to fuse the acquired gesture picture with other types of picture data (such as landscape, etc.), including but not limited to fusion, splicing.
In the embodiment of the application, an improved SSD_mobilenetV3_I model is adopted to realize the interactive fusion of low-layer and middle-layer contour detail features with high-layer semantic features. A parallel and independent feature fusion mode is adopted so that each feature layer simultaneously has high-layer semantic features and low-layer and middle-layer contour detail features; the overall structure is shown in fig. 8. In this structure, through operations such as upsampling, downsampling, concatenation, and convolution, feature interactive fusion is performed on the six feature layers Layer14, Layer17, Layer20, Layer23, Layer26, and Layer29 in the backbone network SSD_mobilenetV3, and the six fused feature layers (target feature maps) are respectively input into the detection head for gesture type classification and hand positioning information regression.
The embodiment of the application also improves the detection head of the preset network model, and the attention module is respectively introduced into the classification branch network and the regression branch network. The attention module performs feature extraction on a target feature map by convolution and performs jump connection processing to capture information of different scales, and the structure of the attention module is shown in fig. 9. The attention module adopts a mixed attention mechanism to select attention characteristics of two dimensions, namely a channel and a space. The mixed attention mechanism function is realized through a sigmoid function, namely, the sigmoid function processing is carried out on each position pixel of a channel and a space.
The specific implementation of the foregoing embodiment has been elaborated in detail above, and it can be seen that, compared with the related art, the gesture recognition method provided by the embodiment of the present application has at least the following advantages. (1) Real-time performance: a large number of experiments show that, with the gesture recognition method provided by the embodiment of the application, the CPU time from acquiring a video stream to outputting a gesture recognition result is about 40 ms (operating system Windows 10, memory 16 GB, 64-bit system type, Core(TM) i7-8700 processor), whereas under the same conditions the YOLO-network-based gesture recognition method provided in the related art needs more than 300 ms to detect a single picture. (2) High model precision: the gesture recognition model obtained by training the improved SSD_mobilenetV3_I model is remarkably improved in both gesture type classification precision and hand positioning information regression precision. (3) Through the optimization and modification module, the time-sequence characteristics among video frames are fully considered, which improves the gesture recognition stability and further improves the gesture recognition precision. (4) Through the hand positioning information and video frame sequence numbers detected across different video frames, gesture recognition is performed on multiple hands separately and without mutual interference on the basis of fully considering the time-sequence characteristics; this function is realized without introducing MeanShift or other target tracking algorithm models, so the algorithm efficiency is optimized, the spatial and temporal redundancy of the algorithm is effectively reduced, and computing resources are saved. (5) The robustness of the gesture recognition model is improved by performing picture fusion processing on the gesture pictures.
In another embodiment of the present application, refer to fig. 10, which shows a schematic structural diagram of a model training apparatus 100 provided in an embodiment of the present application. As shown in fig. 10, the model training apparatus 100 may include a first obtaining unit 1001, a fusing unit 1002, and a training unit 1003, wherein,
a first obtaining unit 1001 configured to obtain at least one gesture picture and at least one non-gesture picture;
the fusion unit 1002 is configured to perform picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
the training unit 1003 is configured to train a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first submodel and a second submodel, the first submodel is used for determining the gesture type of the picture to be detected, and the second submodel is used for determining the hand positioning information of the picture to be detected.
In some embodiments, the picture fusion process at least includes: and carrying out picture fusion processing and/or picture splicing processing.
In some embodiments, the first obtaining unit 1001 is further configured to add annotation information to the sample pictures in the sample picture set; the labeling information comprises gesture type information in the sample picture and hand positioning information in the sample picture, and the hand positioning information comprises the center coordinate of a hand positioning frame and the width and height of the hand positioning frame.
In some embodiments, the predetermined network model comprises a feature extraction network, a classification branch network, and a regression branch network; a training unit 1003, further configured to input a sample picture in the sample picture set into the preset network model; performing feature processing on the sample pictures in the sample picture set through the feature extraction network to obtain at least one target feature picture; training the classification branch network through the at least one target feature map to obtain the first sub-model; training the regression branch network through the at least one target feature map to obtain the second submodel; and determining the gesture recognition model according to the first sub-model and the second sub-model.
In some embodiments, the feature extraction network comprises a feature extraction layer, a feature fusion layer, and a feature convolution layer; the training unit 1003 is further configured to perform initial feature extraction on the sample picture by using the feature extraction layer, and determine at least one initial feature map corresponding to the sample picture; performing feature interactive fusion on the at least one initial feature map by using the feature fusion layer to obtain at least one fusion feature map corresponding to the sample picture; and performing convolution operation on the at least one fused feature map by using the feature convolution layer to obtain the at least one target feature map.
In some embodiments, the training unit 1003 is further configured to determine an initial feature map of the sample picture at an ith feature layer and initial feature maps of feature layers other than the ith feature layer; sampling the initial characteristic graphs of the characteristic layers except the ith characteristic layer to obtain a sampling result; performing feature fusion on the initial feature map of the ith feature layer and the sampling result by using a feature fusion layer to obtain a fusion feature map of the sample picture on the ith feature layer; wherein i is an integer greater than zero.
In some embodiments, the feature layers other than the ith feature layer include: a first feature layer portion located before the ith feature layer and a second feature layer portion located after the ith feature layer, wherein the level of the first feature layer portion is higher than that of the ith feature layer, and the level of the ith feature layer is higher than that of the second feature layer portion; the training unit 1003 is further configured to perform downsampling on the initial feature map of the first feature layer portion and perform upsampling on the initial feature map of the second feature layer portion to obtain the sampling result.
In some embodiments, the downsampling process employs a maximum pooling function and the upsampling process employs a deconvolution function.
In some embodiments, the preset network model comprises an SSD _ mobilenetV3_ I model; wherein the feature extraction Layer comprises feature layers Layer14, layer17, layer20, layer23, layer26 and Layer29 in the SSD _ mobilenetV3_ I model, the feature fusion Layer comprises cascade Concatenate layers, and the feature convolution Layer comprises convolution layers with convolution kernel size of 1 × 1.
In some embodiments, the classification branching network includes an attention mechanism module; the training unit 1003 is further configured to determine a first model loss function, perform classification training on the at least one target feature map by using the classification branch network including the attention mechanism module, and determine the trained classification branch network as the first sub-model when a value of the first model loss function reaches a preset convergence value; the attention mechanism module comprises a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolutional layer, a transposed convolutional layer and a jump layer.
In some embodiments, the regression limb network includes an attention mechanism module; the training unit 1003 is further configured to determine a second model loss function, perform regression training on the at least one target feature map by using the regression branch network including the attention mechanism module, and determine the trained regression branch network as the second sub-model when a value of the second model loss function reaches a preset convergence value; the attention mechanism module comprises a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolutional layer, a transposed convolutional layer and a jump layer.
In yet another embodiment of the present application, referring to fig. 11, a schematic structural diagram of a gesture recognition apparatus 110 provided in an embodiment of the present application is shown. As shown in fig. 11, the gesture recognition apparatus 110 may include a second acquisition unit 1101, a gesture detection unit 1102, an IOU detection unit 1103, and a determination unit 1104, wherein,
a second obtaining unit 1101 configured to obtain a video stream to be detected;
the gesture detection unit 1102 is configured to perform gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determine a gesture type and hand positioning information in each video frame;
an IOU detection unit 1103 configured to perform overlapping IOU detection on the hand positioning information in the video stream to be detected, and determine a hand detection result of the video stream to be detected;
a determining unit 1104 configured to determine a gesture type recognition result of the same hand in the video stream to be detected according to the gesture type in each video frame when the hand detection result indicates that the hand in the video stream to be detected is the same hand.
In some embodiments, the second obtaining unit 1101 is further configured to perform video data acquisition through a video acquisition module, and determine an initial video stream; wherein the initial video stream comprises at least one video frame; and preprocessing each video frame in the initial video stream to obtain the video stream to be detected.
In some embodiments, the second obtaining unit 1101 is further configured to resize each video frame in the initial video stream; and/or, performing color mode conversion on each video frame in the initial video stream.
In some embodiments, the hand positioning information includes center coordinates of a hand positioning box and a width and a height of the hand positioning box.
In some embodiments, the second obtaining unit 1101 is further configured to determine a hand positioning frame size in each video frame based on the hand positioning information in each video frame; comparing the size of the hand positioning frame in each video frame with a preset size threshold; and according to the comparison result, filtering the hand with the size of the hand positioning frame smaller than a preset size threshold value in the video stream to be detected, and determining the filtered video stream as the video stream to be detected.
In some embodiments, the IOU detection unit 1103 is further configured to determine at least one hand tracking group in the video stream to be detected; each hand tracking group comprises two video frame sequence numbers to be subjected to IOU detection and corresponding hand positioning information; in each hand tracking group, if the difference value of the video frame sequence numbers in the hand tracking group is smaller than a preset difference value threshold value and the IOU of the hand positioning information in the hand tracking group is larger than a preset overlapping degree threshold value, determining that the hand data detected in the hand tracking group is the same hand; and determining that the hand detection result indicates that the hands in the video stream to be detected are the same hand under the condition that the hand data detected in the at least one hand tracking group are the same hand.
In some embodiments, the IOU detection unit 1103 is further configured to determine that the hand data detected in the hand tracking group is a different hand if the difference between the video frame sequence numbers in the hand tracking group is not less than a preset difference threshold, or the IOU of the hand positioning information in the hand tracking group is not greater than a preset overlap threshold.
In some embodiments, the determining unit 1104 is further configured to obtain the gesture type of the same hand in the current video frame and the initial gesture type recognition result in a preset number of video frames before the current video frame; and optimizing the initial gesture type recognition result by using the gesture type in the current video frame, and determining the gesture type recognition result of the same hand in the current video frame.
In some embodiments, the determining unit 1104 is further configured to form a gesture type candidate set according to the gesture type in the current video frame and the initial gesture type recognition results in a preset number of video frames before the current video frame; and determining the gesture type with the highest occurrence frequency from the gesture type candidate set, and determining the gesture type with the highest occurrence frequency as a gesture type recognition result of the same hand in the current video frame.
It is understood that, in this embodiment, a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., and may also be a module, or may be non-modular. Moreover, each component in the embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on the understanding that the technical solution of the present embodiment essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Accordingly, the present embodiment provides a computer storage medium storing a computer program which, when executed by a processor, implements the model training method of any one of the preceding embodiments, or implements the gesture recognition method of any one of the preceding embodiments.
Based on the above-mentioned components of the model training apparatus 100 and/or the gesture recognition apparatus 110 and the computer storage medium, refer to fig. 12, which shows a schematic diagram of a specific hardware structure of an electronic device 120 according to an embodiment of the present application. As shown in fig. 12, the electronic device 120 may include: a communication interface 1201, a memory 1202, and a processor 1203; the various components are coupled together by a bus system 1204. It is understood that the bus system 1204 is used to enable connective communication between these components. The bus system 1204 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 1204 in fig. 12. The communication interface 1201 is used for receiving and sending signals in the process of receiving and sending information with other external network elements;
a memory 1202 for storing a computer program operable on the processor 1203;
a processor 1203, configured to execute, when the computer program runs, the following:
acquiring at least one gesture picture and at least one non-gesture picture;
performing picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
training a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, wherein the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected;
alternatively, the processor 1203, when executing the computer program, is configured to perform:
acquiring a video stream to be detected;
performing gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determining gesture types and hand positioning information in each video frame;
performing overlapping degree IOU detection on the hand positioning information in the video stream to be detected, and determining a hand detection result of the video stream to be detected;
and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame.
It will be appreciated that the memory 1202 in the subject embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), enhanced Synchronous SDRAM (ESDRAM), synchronous Link Dynamic Random Access Memory (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1202 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
And the processor 1203 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 1203. The Processor 1203 may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1202, and the processor 1203 reads the information in the memory 1202 to complete the steps of the above-mentioned method in combination with the hardware thereof.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Optionally, as another embodiment, the processor 1203 is further configured to execute the method of any one of the preceding embodiments when the computer program is executed.
In the embodiment of the application, for the electronic device 120, when the model training is performed, since the sample picture not only includes the gesture but also is fused with the non-gesture content, the gesture recognition model obtained through the training is more suitable for a real scene, the recognition accuracy and the model robustness of the gesture recognition model are enhanced, and the interference of external factors on the recognition result is reduced. When gesture recognition is executed, because gesture detection is carried out on each video frame in a video stream to be detected, hand tracking is carried out on the video stream according to detected hand positioning information, the same hand in the video stream is determined, and a gesture type recognition result of the same hand in the video stream to be detected is further determined according to the detected gesture type; therefore, the accuracy and precision of gesture recognition are improved.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.
It should be noted that, in the present application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only of specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method of model training, the method comprising:
acquiring at least one gesture picture and at least one non-gesture picture;
performing picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
training a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first sub-model and a second sub-model, the first sub-model is used for determining the gesture type of the picture to be detected, and the second sub-model is used for determining the hand positioning information of the picture to be detected.
2. The method according to claim 1, wherein the picture fusion processing at least comprises: picture fusion processing and/or picture splicing processing.
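For illustration only, a minimal NumPy sketch of the fusion and splicing operations described in claims 1 and 2, assuming equally sized three-channel pictures; the function names and the blending weight alpha are assumptions, not taken from the application:

```python
import numpy as np

def fuse_pictures(gesture_img: np.ndarray, non_gesture_img: np.ndarray,
                  alpha: float = 0.7) -> np.ndarray:
    """Blend a gesture picture with a non-gesture picture of the same shape."""
    blended = alpha * gesture_img.astype(np.float32) \
        + (1.0 - alpha) * non_gesture_img.astype(np.float32)
    return np.clip(blended, 0, 255).astype(np.uint8)

def splice_pictures(gesture_img: np.ndarray, non_gesture_img: np.ndarray) -> np.ndarray:
    """Place a gesture picture and a non-gesture picture side by side."""
    height = min(gesture_img.shape[0], non_gesture_img.shape[0])
    return np.concatenate([gesture_img[:height], non_gesture_img[:height]], axis=1)
```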
3. The method of claim 1, further comprising:
adding marking information to the sample pictures in the sample picture set;
the labeling information comprises gesture type information in the sample picture and hand positioning information in the sample picture, and the hand positioning information comprises the center coordinate of a hand positioning frame and the width and height of the hand positioning frame.
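A sketch of how one annotated sample from claim 3 might be represented; the field names and the normalized coordinate convention are assumptions:

```python
# One labeled sample: gesture type plus hand positioning information
# (center coordinates and width/height of the hand positioning frame).
sample_annotation = {
    "image": "fused_sample_0001.jpg",   # hypothetical file name
    "gesture_type": "palm",             # hypothetical gesture class
    "hand_box": {
        "cx": 0.52,  # center x, normalized to image width
        "cy": 0.47,  # center y, normalized to image height
        "w": 0.20,   # frame width, normalized
        "h": 0.31,   # frame height, normalized
    },
}
```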
4. The method of claim 1, wherein the predetermined network model comprises a feature extraction network, a classification branch network, and a regression branch network; the training of the preset network model by utilizing the sample picture set to obtain the gesture recognition model comprises the following steps:
inputting the sample pictures in the sample picture set into the preset network model;
performing feature processing on the sample pictures in the sample picture set through the feature extraction network to obtain at least one target feature map;
training the classification branch network through the at least one target feature map to obtain the first submodel; training the regression branch network through the at least one target feature map to obtain the second submodel;
and determining the gesture recognition model according to the first sub-model and the second sub-model.
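A minimal PyTorch-style sketch of the structure in claim 4: a shared feature extraction network feeding a classification branch (gesture type) and a regression branch (hand positioning). The placeholder backbone, channel counts and num_classes are assumptions; the application itself uses an SSD_mobilenetV3_I backbone (see claim 9).

```python
import torch.nn as nn

class GestureRecognitionModel(nn.Module):
    """Shared backbone with a classification head and a regression head."""
    def __init__(self, num_classes: int = 10, feat_channels: int = 64):
        super().__init__()
        # Placeholder feature extraction network (a single conv block stands in here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Classification branch: per-location gesture type scores.
        self.cls_head = nn.Conv2d(feat_channels, num_classes, kernel_size=3, padding=1)
        # Regression branch: per-location (cx, cy, w, h) box predictions.
        self.reg_head = nn.Conv2d(feat_channels, 4, kernel_size=3, padding=1)

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.reg_head(feats)
```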
5. The method of claim 4, wherein the feature extraction network comprises a feature extraction layer, a feature fusion layer, and a feature convolution layer; the performing feature processing on the sample pictures in the sample picture set through the feature extraction network to obtain at least one target feature map comprises:
performing initial feature extraction on the sample picture by using the feature extraction layer, and determining at least one initial feature map corresponding to the sample picture;
performing feature interactive fusion on the at least one initial feature map by using the feature fusion layer to obtain at least one fusion feature map corresponding to the sample picture;
and performing convolution operation on the at least one fused feature map by using the feature convolution layer to obtain the at least one target feature map.
6. The method according to claim 5, wherein the performing feature interactive fusion on the at least one initial feature map by using the feature fusion layer to obtain at least one fused feature map corresponding to the sample picture comprises:
determining an initial feature map of the sample picture in an ith feature layer and initial feature maps of feature layers except the ith feature layer;
sampling the initial feature maps of the feature layers except the ith feature layer to obtain a sampling result;
performing feature fusion on the initial feature map of the ith feature layer and the sampling result by using the feature fusion layer to obtain a fusion feature map of the sample picture on the ith feature layer; wherein i is an integer greater than zero.
7. The method of claim 6, wherein the feature layers other than the ith feature layer comprise: a first feature layer part located before the ith feature layer and a second feature layer part located after the ith feature layer, wherein the level of the first feature layer part is higher than that of the ith feature layer, and the level of the ith feature layer is higher than that of the second feature layer part;
correspondingly, the sampling processing on the initial feature maps of the feature layers other than the ith feature layer to obtain a sampling result includes:
and performing down-sampling processing on the initial feature map of the first feature layer part and performing up-sampling processing on the initial feature map of the second feature layer part to obtain the sampling result.
8. The method of claim 7, wherein the downsampling process uses a maximum pooling function and the upsampling process uses a deconvolution function.
9. The method of claim 5, wherein the preset network model comprises an SSD_mobilenetV3_I model;
wherein the feature extraction layer comprises feature layers Layer14, Layer17, Layer20, Layer23, Layer26 and Layer29 in the SSD_mobilenetV3_I model, the feature fusion layer comprises cascaded Concatenate layers, and the feature convolution layer comprises convolution layers with a convolution kernel size of 1 × 1.
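As a sketch of the fusion step in claims 5 to 9, assuming the tapped layers differ by exactly a factor of two in spatial resolution and using placeholder channel counts: earlier, higher-resolution maps are downsampled by max pooling, later, lower-resolution maps are upsampled by deconvolution, the results are concatenated, and a 1 × 1 convolution produces the target feature map.

```python
import torch
import torch.nn as nn

class FeatureFusionLayer(nn.Module):
    """Fuse neighboring feature maps onto the resolution of the i-th layer."""
    def __init__(self, c_prev: int, c_i: int, c_next: int, c_out: int):
        super().__init__()
        self.down = nn.MaxPool2d(kernel_size=2, stride=2)                      # claim 8: max pooling
        self.up = nn.ConvTranspose2d(c_next, c_next, kernel_size=2, stride=2)  # claim 8: deconvolution
        self.conv1x1 = nn.Conv2d(c_prev + c_i + c_next, c_out, kernel_size=1)  # claim 9: 1x1 convolution

    def forward(self, f_prev, f_i, f_next):
        # f_prev: initial feature map from a layer before the i-th layer (larger spatial size)
        # f_i:    initial feature map of the i-th layer (target spatial size)
        # f_next: initial feature map from a layer after the i-th layer (smaller spatial size)
        fused = torch.cat([self.down(f_prev), f_i, self.up(f_next)], dim=1)    # Concatenate layer
        return self.conv1x1(fused)                                             # target feature map
```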
10. The method of claim 4, wherein the classification branch network comprises an attention mechanism module; the training the classification branch network through the at least one target feature map to obtain the first submodel includes:
determining a first model loss function, performing classification training on the at least one target feature map by using the classification branch network comprising the attention mechanism module, and determining the trained classification branch network as the first sub-model when the value of the first model loss function reaches a preset convergence value;
the attention mechanism module comprises a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolutional layer, a transposed convolutional layer and a jump layer.
11. The method of claim 4, wherein the regression branch network comprises an attention mechanism module; the training the regression branch network through the at least one target feature map to obtain the second submodel includes:
determining a second model loss function, performing regression training on the at least one target feature map by using the regression branch network comprising the attention mechanism module, and determining the trained regression branch network as the second sub-model when the value of the second model loss function reaches a preset convergence value;
the attention mechanism module comprises a convolution submodule, an activation submodule, a multiplication submodule and an addition submodule, wherein the convolution submodule consists of a convolutional layer, a transposed convolutional layer and a jump layer.
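One possible reading of the attention mechanism module in claims 10 and 11, sketched in PyTorch; the kernel sizes, the Sigmoid activation, and the interpretation of the jump layer as a residual skip connection are assumptions, not details taken from the application:

```python
import torch.nn as nn

class AttentionModule(nn.Module):
    """Convolution submodule (conv + transposed conv + skip), then activation,
    element-wise multiplication and addition."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        # Convolution submodule with a jump (skip) path; assumes even spatial sizes
        # so the transposed convolution restores the original resolution.
        mask = self.deconv(self.conv(x)) + x
        mask = self.activation(mask)   # activation submodule
        out = x * mask                 # multiplication submodule
        return out + x                 # addition submodule
```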
12. A gesture recognition method applied to the gesture recognition model of claim 1, the method comprising:
acquiring a video stream to be detected;
performing gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determining gesture types and hand positioning information in each video frame;
performing overlap degree (IOU) detection on the hand positioning information in the video stream to be detected, and determining a hand detection result of the video stream to be detected;
and when the hand detection result indicates that the hands in the video stream to be detected are the same hand, determining a gesture type identification result of the same hand in the video stream to be detected according to the gesture type in each video frame.
13. The method according to claim 12, wherein said obtaining the video stream to be detected comprises:
acquiring video data through a video acquisition module, and determining an initial video stream; wherein the initial video stream comprises at least one video frame;
and preprocessing each video frame in the initial video stream to obtain the video stream to be detected.
14. The method of claim 13, wherein the pre-processing each video frame in the initial video stream comprises:
resizing each video frame in the initial video stream; and/or,
color mode conversion is performed on each video frame in the initial video stream.
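A minimal OpenCV sketch of the preprocessing in claims 13 and 14; the 320 × 320 input size and the BGR-to-RGB conversion are assumptions, not values taken from the application:

```python
import cv2

def preprocess_frame(frame, size=(320, 320)):
    """Resize a captured video frame and convert its color mode."""
    resized = cv2.resize(frame, size)                 # size adjustment
    return cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)   # color mode conversion
```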
15. The method of claim 12, wherein the hand positioning information comprises center coordinates of a hand positioning box and a width and a height of the hand positioning box.
16. The method of claim 15, wherein before the detecting the IOU of the hand positioning information in the video stream to be detected, the method further comprises:
determining a hand positioning frame size in each video frame based on the hand positioning information in each video frame;
comparing the size of the hand positioning frame in each video frame with a preset size threshold;
and according to the comparison result, filtering the hand with the size of the hand positioning frame smaller than a preset size threshold value in the video stream to be detected, and determining the filtered video stream as the video stream to be detected.
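A sketch of the size filtering in claim 16, assuming each detection carries the positioning-frame width and height in pixels and interpreting the preset size threshold as a hypothetical minimum side length:

```python
def filter_small_hands(detections, min_side: float = 32.0):
    """Keep only detections whose positioning frame is at least min_side wide and high."""
    return [d for d in detections if d["w"] >= min_side and d["h"] >= min_side]
```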
17. The method according to claim 12, wherein the performing the IOU detection on the hand positioning information in the video stream to be detected and determining the hand detection result of the video stream to be detected comprises:
determining at least one hand tracking group in the video stream to be detected; each hand tracking group comprises two video frame sequence numbers to be subjected to IOU detection and corresponding hand positioning information;
in each hand tracking group, if the difference value of the video frame sequence numbers in the hand tracking group is smaller than a preset difference threshold value and the IOU of the hand positioning information in the hand tracking group is larger than a preset overlap threshold value, determining that the hand data detected in the hand tracking group is the same hand;
and under the condition that the hand data detected in the at least one hand tracking group are the same hand, determining that the hand detection result indicates that the hands in the video stream to be detected are the same hand.
18. The method of claim 17, further comprising:
and if the difference value of the video frame sequence numbers in the hand tracking group is not less than a preset difference value threshold value, or the IOU of the hand positioning information in the hand tracking group is not more than a preset overlap threshold value, determining that the hand data detected in the hand tracking group are different hands.
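The overlap test in claims 17 and 18 can be sketched as follows, with boxes given as (cx, cy, w, h) as in claim 15; the frame-gap and overlap thresholds are hypothetical defaults:

```python
def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def is_same_hand(frame_a, box_a, frame_b, box_b,
                 frame_gap_threshold: int = 3, iou_threshold: float = 0.5) -> bool:
    """Same hand if the frame numbers are close enough and the boxes overlap enough."""
    return (abs(frame_a - frame_b) < frame_gap_threshold
            and iou(box_a, box_b) > iou_threshold)
```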
19. The method according to any one of claims 12 to 18, wherein the determining the gesture type recognition result of the same hand in the video stream to be detected according to the gesture type in each video frame comprises:
acquiring gesture types of the same hand in a current video frame and initial gesture type recognition results in a preset number of video frames before the current video frame;
and optimizing the initial gesture type recognition result by utilizing the gesture type in the current video frame, and determining the gesture type recognition result of the same hand in the current video frame.
20. The method of claim 19, wherein the optimizing the initial gesture type recognition result by using the gesture type in the current video frame to determine the gesture type recognition result of the same hand in the current video frame comprises:
forming a gesture type candidate set according to the gesture types in the current video frame and the initial gesture type recognition results in a preset number of video frames before the current video frame;
and determining the gesture type with the highest occurrence frequency from the gesture type candidate set, and determining the gesture type with the highest occurrence frequency as a gesture type recognition result of the same hand in the current video frame.
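A sketch of the smoothing in claims 19 and 20: the gesture type in the current frame and the recognition results from the preceding frames form a candidate set, and the most frequent type wins. The window length is whatever preset number of frames is chosen.

```python
from collections import Counter

def smooth_gesture_type(current_type: str, previous_results: list) -> str:
    """Return the most frequent gesture type among the candidates."""
    candidates = list(previous_results) + [current_type]
    return Counter(candidates).most_common(1)[0][0]

# e.g. smooth_gesture_type("fist", ["palm", "fist", "fist"]) returns "fist"
```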
21. A model training apparatus, characterized in that the model training apparatus comprises a first acquisition unit, a fusion unit, and a training unit, wherein,
the first acquisition unit is configured to acquire at least one gesture picture and at least one non-gesture picture;
the fusion unit is configured to perform picture fusion processing on each gesture picture in the at least one gesture picture and the at least one non-gesture picture respectively to obtain a sample picture set;
the training unit is configured to train a preset network model by using the sample picture set to obtain a gesture recognition model; the gesture recognition model comprises a first submodel and a second submodel, the first submodel is used for determining the gesture type of the picture to be detected, and the second submodel is used for determining the hand positioning information of the picture to be detected.
22. A gesture recognition apparatus comprising a second acquisition unit, a gesture detection unit, an IOU detection unit, and a determination unit, wherein,
the second obtaining unit is configured to obtain a video stream to be detected;
the gesture detection unit is configured to perform gesture detection on each video frame in the video stream to be detected by using the gesture recognition model, and determine gesture types and hand positioning information in each video frame;
the IOU detection unit is configured to perform overlap degree (IOU) detection on the hand positioning information in the video stream to be detected and determine a hand detection result of the video stream to be detected;
the determining unit is configured to determine a gesture type recognition result of the same hand in the video stream to be detected according to the gesture type in each video frame when the hand detection result indicates that the hand in the video stream to be detected is the same hand.
23. An electronic device, comprising a memory and a processor, wherein,
the memory for storing a computer program operable on the processor;
the processor, when running the computer program, for performing the model training method of any one of claims 1 to 11; or, performing the gesture recognition method of any one of claims 12 to 20.
24. A computer storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the model training method according to any one of claims 1 to 11; or, implementing a gesture recognition method according to any one of claims 12 to 20.
CN202110995209.6A 2021-08-27 2021-08-27 Model training method, gesture recognition method, device, equipment and storage medium Pending CN115731604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110995209.6A CN115731604A (en) 2021-08-27 2021-08-27 Model training method, gesture recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110995209.6A CN115731604A (en) 2021-08-27 2021-08-27 Model training method, gesture recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115731604A true CN115731604A (en) 2023-03-03

Family

ID=85290297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110995209.6A Pending CN115731604A (en) 2021-08-27 2021-08-27 Model training method, gesture recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115731604A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519609A (en) * 2024-01-02 2024-02-06 中移(苏州)软件技术有限公司 Video file processing method and device and electronic equipment
CN117519609B (en) * 2024-01-02 2024-04-09 中移(苏州)软件技术有限公司 Video file processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN108520247B (en) Method, device, terminal and readable medium for identifying object node in image
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN111583173B (en) RGB-D image saliency target detection method
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN103208002A (en) Method and system used for recognizing and controlling gesture and based on hand profile feature
CN109977832B (en) Image processing method, device and storage medium
WO2023174098A1 (en) Real-time gesture detection method and apparatus
CN114764868A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111274987A (en) Facial expression recognition method and facial expression recognition device
WO2023207778A1 (en) Data recovery method and device, computer, and storage medium
WO2023025010A1 (en) Stroboscopic banding information recognition method and apparatus, and electronic device
CN112749666A (en) Training and motion recognition method of motion recognition model and related device
CN115731604A (en) Model training method, gesture recognition method, device, equipment and storage medium
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN116229406A (en) Lane line detection method, system, electronic equipment and storage medium
CN114821777A (en) Gesture detection method, device, equipment and storage medium
Zhou et al. Feature fusion detector for semantic cognition of remote sensing
CN113542866B (en) Video processing method, device, equipment and computer readable storage medium
CN117197487B (en) Immune colloidal gold diagnosis test strip automatic identification system
CN117409208B (en) Real-time clothing image semantic segmentation method and system
US20240104696A1 (en) Image processing method and apparatus, storage medium, electronic device, and product
Pittore et al. Role of image recognition in defining the user's in 3G phone applications: the AGAMEMNON experience
CN118015604A (en) Text recognition method, text recognition device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination