CN110458095B - Effective gesture recognition method, control method and device and electronic equipment - Google Patents

Effective gesture recognition method, control method and device and electronic equipment

Info

Publication number
CN110458095B
CN110458095B
Authority
CN
China
Prior art keywords
gesture
image
recognition
neural network
network model
Prior art date
Legal status
Active
Application number
CN201910735669.8A
Other languages
Chinese (zh)
Other versions
CN110458095A (en)
Inventor
徐绍凯
贾宝芝
Current Assignee
Xiamen Ruiwei Information Technology Co ltd
Original Assignee
Xiamen Ruiwei Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Ruiwei Information Technology Co ltd filed Critical Xiamen Ruiwei Information Technology Co ltd
Priority to CN201910735669.8A
Publication of CN110458095A
Application granted
Publication of CN110458095B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an effective gesture recognition method, a control method, a device and electronic equipment. The recognition method comprises the steps of: S11, obtaining a current frame image collected by a camera; S12, performing gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain the possible region, gesture category and confidence of a gesture in the current frame image; S13, sequentially performing gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain the possible regions, gesture categories and confidences of the gestures in those images; S14, judging whether the proportion of image frames containing the same gesture among the image frames in the fixed time interval is larger than a preset proportion threshold, and if so, determining that the gesture is an effective gesture. The invention can effectively and quickly detect and recognize gestures on an embedded terminal, enabling convenient and fast human-computer interaction.

Description

Effective gesture recognition method, control method and device and electronic equipment
Technical Field
The invention relates to a real-time gesture detection and judgment method, device and electronic equipment based on artificial-intelligence deep-learning technology and computer vision.
Background
With the rapid development of computer technology, deep learning is increasingly applied to the field of computer vision. Using gestures for man-machine interaction is a very convenient approach with high application value: gesture recognition and control technology provides a remote, non-contact mode of man-machine interaction, so a fast and accurate gesture recognition algorithm can give users a convenient and friendly experience. The difficulty in applying current deep neural networks on embedded devices is that the networks are huge and complex while the computing power of embedded devices is limited, leading to slow algorithm execution, unsmooth system operation, long response times and, consequently, a poor user experience. To solve the above problems, the present invention provides a method, an apparatus and an electronic device for real-time gesture recognition and control based on a neural network.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an effective gesture recognition method, a control method, a recognition device and an electronic device, which can effectively and quickly detect and recognize gestures on an embedded terminal and enable convenient and fast human-computer interaction.
According to a first aspect of the present invention, there is provided a method for recognizing a valid gesture, comprising the steps of:
S11, acquiring a current frame image acquired by a camera;
S12, performing gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain a possible region of a gesture in the current frame image, a gesture type and a confidence coefficient of a recognition result, and judging whether to accept the recognition result according to the confidence coefficient;
S13, sequentially carrying out gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain possible regions of gestures in the image, gesture categories and confidence degrees of recognition results, and judging whether to accept the recognition results according to the confidence degrees;
and S14, judging whether the proportion of the image frames with the same gesture category in the image frames in the fixed time interval is larger than a preset proportion threshold value, if so, considering the gesture as an effective gesture, and if not, taking the next frame of the current frame as the current frame and returning to step S13.
Optionally, in step S11, the obtained current frame image is further preprocessed: firstly, the current frame image is normalized, and whether the gesture is detected in the previous frame image is judged according to the gesture detection and recognition result of the previous frame image.
Optionally, the detection and identification in step S12 and step S13 specifically include:
selecting a first neural network model or a second neural network model according to the gesture detection result in the previous frame of image, wherein the first neural network model is a pre-trained convolutional network single detection model and is used for directly predicting the possible area and the category of the gesture on the full image, and the second neural network model is a pre-trained convolutional network single detection model and is used for tracking the gesture according to the detection result of the previous frame;
if the current frame is the first frame image or the gesture is not detected in the previous frame image, inputting the current frame image into a first neural network model for gesture detection and recognition, outputting coordinates of a possible region of the gesture in the current frame image, possible types of the gesture and a confidence coefficient of a recognition result by the first neural network model, if the confidence coefficient is greater than or equal to a preset confidence coefficient threshold value, receiving the detection and recognition result predicted by the first neural network model, and if the confidence coefficient is smaller than the preset confidence coefficient threshold value, ignoring the current frame image;
if the gesture is detected in the previous frame of image, mapping the position of the gesture in the previous frame of image to the current frame of image, expanding the mapping region on the current frame of image outwards according to a preset multiple, inputting the expanded mapping region to a second neural network model for gesture detection and recognition, outputting the coordinates of the possible region of the gesture in the current image, the possible types of the gesture and the confidence coefficient of the result by the second neural network model, if the confidence coefficient is greater than a preset confidence coefficient threshold value, receiving the prediction result of the second neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result.
Optionally, the training method of the first neural network model is as follows: acquiring a first type of training sample set and labeling information of gestures; performing data preprocessing on the first class training sample set: cutting the first type of training samples in random size and turning the first type of training samples in a mirror image mode according to a preset aspect ratio; converting the labeling information of the gesture according to the cutting and turning conditions, and performing random color enhancement on the cut picture; and training a first neural network model by using the preprocessed first class sample set.
Optionally, the training method of the second neural network model is as follows: acquiring a second type of training sample set and labeling information of gestures; performing data preprocessing on the second class training sample set: taking the position of the gesture frame, or that position after a random offset, as the center, randomly expanding the second type of training sample outwards by 3 to 6 times to perform cutting and mirror flipping, converting the labeling information of the gesture according to the cutting and flipping, and performing random color enhancement on the cut picture; and training a second neural network model using the preprocessed second class sample set.
According to a second aspect of the present invention, there is provided a control method after recognition of a valid gesture, comprising the steps of:
s21, counting and analyzing the effective gesture recognition results of all detection frames in a fixed time interval before the current frame, and judging whether continuous and stable effective gestures exist in the fixed time interval;
s22, judging whether a continuous stable effective gesture type is changed into another continuous stable effective gesture type in the fixed time interval;
and S23, when the gesture type is found to be changed, executing control operation corresponding to the gesture change.
Wherein whether the gesture category has changed is judged as follows: judging all image frames in the fixed time interval, and if the detected gesture changes from a stable state of one category to a stable state of another category at some image frame, determining that the gesture category has changed; wherein a stable state of a category means that the proportion of image frames containing the same gesture among all image frames of the video in the fixed time interval is greater than a preset proportion threshold.
According to a third aspect of the present invention, there is provided an apparatus for recognizing a valid gesture, comprising:
the image acquisition module is used for acquiring a current frame image acquired by the camera;
the gesture detection and recognition module is used for carrying out gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain the gesture category of the gesture in the current frame image and the confidence coefficient of a recognition result, and judging whether to accept the recognition result according to the confidence coefficient;
sequentially performing gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain gesture types of the gestures in the images and confidence degrees of recognition results, and judging whether to accept the recognition results according to the confidence degrees;
and the gesture recognition module is also used for judging whether the proportion of the image frames with the same gesture in the image frames in the time interval is greater than a preset proportion threshold value, if so, the gesture is considered to be a valid gesture, and a judgment result is returned.
Optionally, the method further includes:
the image preprocessing module is used for carrying out normalization processing on the current frame image and judging whether the gesture is detected in the previous frame image or not according to the gesture detection and recognition result of the previous frame image;
the model selection module is used for selecting a first neural network model or a second neural network model according to the gesture detection result in the previous frame of image, the first neural network model is a pre-trained convolutional network single detection model and is used for directly predicting the possible regions and types of gestures on the full graph, and the second neural network model is a pre-trained convolutional network single detection model and is used for tracking the gestures according to the previous frame of detection result;
if the gesture is not detected in the previous frame of image, inputting the current frame of image into a first neural network model for gesture detection and recognition, outputting coordinates of a possible region of the gesture in the current frame of image, possible types of the gesture and a confidence coefficient of a recognition result by the first neural network model, if the confidence coefficient is greater than or equal to a preset confidence coefficient threshold value, receiving the detection and recognition result predicted by the first neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result;
if the gesture is detected in the previous frame of image, mapping the position of the gesture in the previous frame of image to the current frame of image, expanding the mapping region on the current frame of image outwards according to a preset multiple, inputting the expanded mapping region to a second neural network model for gesture detection and recognition, outputting the coordinates of the possible region of the gesture in the current image, the possible types of the gesture and the confidence coefficient of the recognition result by the second neural network model, if the confidence coefficient is greater than a preset confidence coefficient threshold value, receiving the prediction result of the second neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result.
According to a fourth aspect of the present invention, there is provided an electronic device for recognizing valid gestures, comprising a processor and a memory, wherein the processor is capable of executing the method for recognizing valid gestures as described above; the memory is used for storing all the obtained detection images, the result of image preprocessing and the result of gesture detection and recognition, and also storing an executable program for gesture response.
The invention has the advantages that:
(1) Gesture detection and recognition are carried out on images acquired by an ordinary camera; no extra wearable equipment, extra parameters or excessive image preprocessing is needed, which saves cost, makes use more convenient and helps improve the running speed;
(2) Gesture detection and recognition are carried out by two neural network models working alternately: the first neural network model directly predicts the possible position, gesture category and confidence of the gesture over the full image, while the second neural network tracks and recognizes the possible gesture region of the next frame based on the gesture position of the previous frame. The second neural network preserves the accuracy of gesture detection and recognition while being extremely fast and consuming very little computing resource; its running speed on an ARM chip exceeds 10 FPS, meeting the real-time detection requirement;
(3) In the gesture recognition process, the detection results of multiple frames are combined into the final gesture result, which ensures the stability of the system, allows equipment to be controlled accurately through gestures, and brings a better human-computer interaction experience.
Drawings
The invention will be further described with reference to the following examples and figures.
FIG. 1 is a flowchart illustrating an effective gesture recognition method according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart illustrating an exemplary method for performing control operations corresponding to gesture changes according to an exemplary embodiment of the present invention.
FIG. 3 is a flowchart illustrating the training process of the neural network models in the effective gesture recognition method according to the present invention.
FIG. 4 is a block diagram of an effective gesture recognition device according to a preferred embodiment of the present invention.
Detailed Description
Referring to fig. 1 to 3, a detailed description is given of the effective gesture recognition method of the present invention, which includes the following steps:
s11, acquiring a current frame image acquired by a camera; and preprocessing the current frame image: firstly, normalizing the current frame image, and judging whether the gesture is detected in the previous frame image according to the gesture detection and recognition result of the previous frame image; however, if the image is the first frame image, only normalization processing is required.
S12, performing gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain the possible region of the gesture in the current frame image, the gesture category and the confidence of the recognition result, and judging whether to accept the recognition result according to the confidence. The confidence is a measure of the gesture region and gesture category: it is output by the model and represents the probability that the predicted gesture region and gesture category are correct, so the higher the confidence, the more credible the detected region and category. In practice, a fixed threshold is usually set for the confidence, and a gesture region and category above the threshold are considered a detected effective gesture region and category.
S13, sequentially carrying out gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain the possible regions of the gestures in the images, the gesture categories and their confidences, and judging whether to accept the recognition results according to the confidences. In practice multiple possible gesture regions may be detected, and the confidence determines which region is finally adopted.
S14, judging whether the proportion of image frames containing the same gesture among the image frames in the fixed time interval is larger than a preset proportion threshold, and if so, determining that the gesture is an effective gesture. "The same gesture" means gestures predicted to be of the same category: for example, if the gesture detected in frame T is of category 1 and the gesture detected in frame T+1 is also of category 1, the two frames contain the same gesture.
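As an illustration of the thresholding in steps S12/S13 and the voting in step S14, the following minimal Python sketch processes the per-frame results of one fixed time interval; the function and threshold names and their values are assumptions for illustration, not taken from the patent:

```python
from collections import Counter

CONF_THRESHOLD = 0.5   # illustrative confidence threshold (steps S12/S13)
RATIO_THRESHOLD = 0.8  # illustrative proportion threshold (step S14)

def vote_valid_gesture(frame_results):
    """frame_results: list of (category, confidence) pairs, one per image
    frame in the fixed time interval. Returns the effective gesture
    category, or None if no category is stable enough."""
    # Steps S12/S13: accept a recognition result only if its confidence
    # reaches the preset threshold.
    accepted = [cat for cat, conf in frame_results if conf >= CONF_THRESHOLD]
    if not accepted:
        return None
    # Step S14: a gesture is effective when frames of one category make up
    # more than the preset proportion of all frames in the interval.
    category, count = Counter(accepted).most_common(1)[0]
    if count / len(frame_results) > RATIO_THRESHOLD:
        return category
    return None
```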
The detection and identification in step S12 and step S13 are specifically:
selecting a first neural network model or a second neural network model according to the gesture detection result in the previous frame of image, wherein the first neural network model is a pre-trained convolutional network single detection model and is used for directly predicting the possible area and the category of the gesture on the full image, and the second neural network model is a pre-trained convolutional network single detection model and is used for tracking the gesture according to the detection result of the previous frame;
if the current image is the first detected image or the gesture is not detected in the previous image, inputting the current image into a first neural network model for gesture detection and recognition, outputting coordinates of a possible region of the gesture in the current image, possible types of the gesture and a confidence coefficient of the result by the first neural network model, if the confidence coefficient is greater than or equal to a preset confidence coefficient threshold value, receiving the detection and recognition result predicted by the first neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result;
if the gesture is detected in the previous frame of image, mapping the position of the gesture in the previous frame of image to the current frame of image, expanding the mapping region on the current frame of image outwards according to a preset multiple, inputting the expanded mapping region to a second neural network model for gesture detection and recognition, outputting the coordinates of the possible region of the gesture in the current image, the possible types of the gesture and the confidence coefficient of the result by the second neural network model, if the confidence coefficient is greater than or equal to a preset confidence coefficient threshold value, receiving the prediction result of the second neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result.
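The alternation between the two models in these two branches can be sketched as follows; detect_full and track_region stand in for the first and second neural network models, expand_and_crop is a hypothetical helper implementing the mapping-and-expansion described above, and the expansion factor and confidence threshold are illustrative:

```python
import numpy as np

def expand_and_crop(frame, box, k):
    """Crop a square region around box=(x, y, w, h), expanded by factor k,
    zero-filling wherever the region leaves the frame."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    side = k * max(w, h)
    x1, y1 = int(cx - side / 2), int(cy - side / 2)
    x2, y2 = int(cx + side / 2), int(cy + side / 2)
    out = np.zeros((y2 - y1, x2 - x1, 3), dtype=frame.dtype)
    ix1, iy1 = max(x1, 0), max(y1, 0)
    ix2, iy2 = min(x2, frame.shape[1]), min(y2, frame.shape[0])
    out[iy1 - y1:iy2 - y1, ix1 - x1:ix2 - x1] = frame[iy1:iy2, ix1:ix2]
    return out, (x1, y1)

def process_frame(frame, prev_box, detect_full, track_region,
                  k=2.0, conf_threshold=0.5):
    """One step of the detect/track alternation. prev_box is the
    (x, y, w, h) gesture box from the previous frame, or None."""
    if prev_box is None:
        # First frame, or no gesture previously: full-image detector.
        box, category, conf = detect_full(frame)
    else:
        # Gesture found previously: run the lightweight tracking model
        # only on the expanded region around the previous position.
        crop, (ox, oy) = expand_and_crop(frame, prev_box, k)
        (x, y, w, h), category, conf = track_region(crop)
        box = (x + ox, y + oy, w, h)  # map back to full-image coordinates
    if conf >= conf_threshold:
        return box, category
    return None  # below the preset confidence threshold: ignore
```

In this division of labor the expensive full-image detector runs only when tracking has lost the hand, which is what keeps the per-frame cost low.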
The gesture recognition of the present invention actually involves two tasks: detection and recognition. Gesture detection locates the position of the gesture in the whole picture, i.e. predicts the possible region of the gesture; once the gesture region is located, the category of the gesture is judged, i.e. the gesture is recognized. Detecting the region where the gesture exists is a prerequisite for gesture recognition, and "gesture detection and recognition" is therefore described in both steps S12 and S13. The possible region of the gesture is given as four values (x, y, w, h), representing the vertex coordinates and the width and height respectively, as detailed below:
in the present invention, the training method of the first neural network model is:
(1) Acquiring a first type of training sample set and the labeling information of gestures, wherein the labeling information comprises two aspects: (a) frame information of all gestures to be recognized in the image, comprising the center-point x value of the gesture frame, the center-point y value of the gesture frame, the width of the gesture frame and the height of the gesture frame; (b) category codes of all gestures to be recognized in the image. The gesture labeling information is manually labeled;
(2) Performing data preprocessing on the first class training sample set: cutting and mirror image turning the first type of training samples according to a preset aspect ratio; converting the labeling information of the gesture according to the cutting and turning conditions, and performing random color enhancement on the cut picture;
the cutting area keeps the input aspect ratio of the first neural network model to ensure that the image is not deformed when being input to a network for training, and simultaneously label information can be correspondingly converted, and the random size can ensure that the cut image comprises gestures with different proportions, which is beneficial to ensuring that the neural network model can adapt to gestures with different distances and sizes when gesture detection is carried out;
(3) Training a first neural network model by using the preprocessed first-class sample set.
The training method of the second neural network model comprises the following steps:
(1) Acquiring a second type of training sample set and labeling information of the gesture;
(2) Performing data preprocessing on the second type training sample set: taking the position of the gesture frame, or that position after a random deviation, as the center, randomly expanding the second type of training samples outwards by 3 to 6 times for cutting and mirror flipping, converting the labeling information of the gesture according to the cutting and flipping, and performing random color enhancement on the cut picture;
if the cutting area exceeds the range of the original image, zero value filling is carried out, the training sample diversity is increased by random multiple cutting, the tracking model is beneficial to adapting to the fluctuation of the size of the gesture frame caused by the detection error of the previous frame, and therefore the stability of the model is improved;
(3) And training a second neural network model by using the preprocessed second class training sample set.
After the effective gesture is recognized, the instruction corresponding to the effective gesture can be executed, and the method comprises the following steps:
s21, counting and analyzing the effective gesture recognition results of all detection frames in a fixed time interval before the current frame, and judging whether continuous and stable effective gestures exist in the fixed time interval;
s22, judging whether a continuous stable effective gesture type is changed into another continuous stable effective gesture type in the fixed time interval;
and S23, when the gesture type is found to be changed, executing control operation corresponding to the gesture change.
Wherein, the judgment of whether the gesture generates the category change is as follows: judging all image frames in the fixed time interval, and if the detected gesture in a certain image frame is changed from the stable state of one type to the stable state of another type, determining that the gesture type is changed; wherein the steady state of the classes are: and the proportion of the image frames with the same gesture in all the image frames of the video in the fixed time interval is greater than a preset proportion threshold value.
The method or apparatus according to the invention described above is illustrated below:
example one
As shown in fig. 1, an embodiment of a gesture recognition method includes the following steps:
11. Acquire the image data of the current frame from the camera and convert it into a three-channel RGB image format.
12. Preprocess the acquired image. First, normalize the image; normalization is generally performed with the following formula:

x_i' = (x_i - min) / (max - min)

where min is the minimum of x_i (i = 1, 2, ..., n) and max is the maximum of x_i (i = 1, 2, ..., n).
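As a sketch, this min-max normalization can be written in NumPy as follows; the epsilon-style guard against a constant image is an added safety assumption, not part of the patent:

```python
import numpy as np

def normalize(image):
    """Min-max normalize pixel values to the range [0, 1]."""
    image = image.astype(np.float32)
    span = image.max() - image.min()
    # Guard against a constant image (max == min) -- an added assumption.
    return (image - image.min()) / (span if span > 0 else 1.0)
```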
Then analyze the gesture detection and recognition result of the previous frame image, judge whether an effective gesture was detected, and process accordingly. If no effective gesture was detected in the previous frame image, scale the normalized image to the input size of the first neural network model. If an effective gesture was detected and recognized in the previous frame image, map the position of the gesture in the previous frame onto the normalized image and, taking that position as the center, expand the gesture frame outwards to k times the mean of the original gesture frame's width and height, where k is a preset value; if the expanded frame exceeds the range of the original image, fill the excess with zero values; then cut out the expanded region and scale it to the input size of the second neural network model.
13. Input the preprocessed image into the corresponding neural network model for gesture detection and recognition. If no effective gesture was detected in the previous frame image, input the current preprocessed image into the first neural network model for gesture detection and recognition. If an effective gesture was recognized in the previous frame image, input the current preprocessed image into the second neural network model for gesture tracking and recognition.
14. Output the gesture recognition result for the current image. The model outputs a prediction of whether an effective gesture exists in the current image and the possible region of the gesture. The output is a one-dimensional vector of length 6, whose elements are: the center-point x value of the gesture box, the center-point y value of the gesture box, the width of the gesture box, the height of the gesture box, the category of the gesture, and the confidence of the prediction result.
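For illustration, the length-6 output vector can be unpacked as below; the field names are assumptions chosen to match the order listed above:

```python
from typing import NamedTuple

class GesturePrediction(NamedTuple):
    cx: float          # x value of the gesture-box center point
    cy: float          # y value of the gesture-box center point
    w: float           # width of the gesture box
    h: float           # height of the gesture box
    category: int      # predicted gesture class code
    confidence: float  # confidence of the prediction result

def parse_output(vector):
    """vector: the one-dimensional model output of length 6."""
    cx, cy, w, h, category, confidence = vector
    return GesturePrediction(cx, cy, w, h, int(category), confidence)
```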
In this embodiment, the gestures present in the image are detected and recognized by two neural network models, which output the possible region, gesture category and prediction confidence of each gesture. The first neural network model is responsible for detecting and recognizing gestures over the whole image, and the second neural network model is responsible for tracking and recognizing gestures around the gesture region of the previous frame. This ensures the stability and reliability of gesture recognition; meanwhile, the second neural network model greatly improves the gesture recognition speed, enabling real-time detection on embedded devices.
Example two
As shown in fig. 2, an embodiment of a gesture control method includes the following steps:
21. Count and analyze the gesture recognition results of all detection frames within a fixed time interval before the current frame, and judge whether a continuous and stable effective gesture exists within the fixed time interval.
A continuously stable effective gesture is defined as follows: within a specified number of consecutive frames, the proportion of frames in which the effective gesture is detected is larger than a specified threshold, the fluctuation range of the gesture region is small, and the gesture category does not change. The number of consecutive frames and the proportion threshold are specified by those skilled in the art according to the model performance and the actual product, and the fluctuation of the gesture region is measured by the relative positions of the effective gesture regions detected in two adjacent frames; a sketch of this stability check follows the steps below.
And 22, counting whether the gesture type continuously and stably changes from one type to another type within the fixed time interval, and performing corresponding control operation according to the change of the gesture.
If the statistical result is yes, the gesture change detected in the picture is effective, and corresponding control operation is executed according to the change of the gesture category;
if the statistical result is negative, the gesture change detected in the picture is invalid, at this moment, the control operation is not executed, and the gesture detection and recognition of the next frame are continued.
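A sketch of the checks in steps 21 and 22, assuming each frame yields either None or a ((x, y, w, h), category) pair; the window representation, thresholds, and the use of IoU between adjacent detections to measure area fluctuation are illustrative choices, not mandated by the patent:

```python
def box_iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes, used here as an
    illustrative measure of how much the gesture region fluctuates."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def stable_category(window, ratio_threshold=0.8, iou_threshold=0.3):
    """window: per-frame results (None or (box, category)) for the fixed
    time interval. Returns the category of a continuously stable
    effective gesture, or None."""
    hits = [r for r in window if r is not None]
    if not window or len(hits) / len(window) <= ratio_threshold:
        return None  # too few frames with a detected effective gesture
    categories = {cat for _, cat in hits}
    if len(categories) != 1:
        return None  # the category changed within the window
    # Small area fluctuation: adjacent detections must overlap enough.
    if any(box_iou(a, b) < iou_threshold
           for (a, _), (b, _) in zip(hits, hits[1:])):
        return None
    return categories.pop()

def category_changed(prev_window, curr_window):
    """Step 22: a control action fires when one stable category is
    followed by a different stable category."""
    a, b = stable_category(prev_window), stable_category(curr_window)
    return a is not None and b is not None and a != b
```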
In this embodiment, the intelligent device is controlled according to the results of gesture detection and recognition over consecutive frames. It should be noted that the gesture categories recognizable by the model, and likewise the effective gesture category changes, can be flexibly defined by those skilled in the art according to actual requirements and are not conditions limiting the present invention.
EXAMPLE III
As shown in fig. 3, an embodiment of a neural network model training process in an effective gesture recognition method is provided, which includes:
31. Acquire training images containing the required gestures together with gesture labeling information. The training images are all images containing the gesture categories to be recognized; images without gestures are not used as training images on their own. The gesture categories to be recognized can be flexibly specified by those skilled in the art according to actual requirements and are not limited to any particular category or categories. The gesture labeling information comprises two aspects: (1) frame information of all gestures to be recognized in the image, comprising the center-point x value of the gesture frame, the center-point y value of the gesture frame, the width of the gesture frame and the height of the gesture frame; and (2) category codes of all gesture classes to be recognized in the image. The gesture labeling information is manually labeled.
32. Preprocess the training samples and labeling information:
32-1. To obtain training samples for the first neural network model, cut the training images at random sizes; the cut region keeps the input aspect ratio of the first neural network model so that the image is not deformed when input to the network for training, and the labeling information is converted correspondingly. The random sizes ensure that the cut images contain gestures of different proportions, which helps the neural network model adapt to gestures at different distances and of different sizes during gesture detection.
32-2. To improve the robustness of the neural network model and ensure that it can correctly recognize the left and right hand, randomly mirror-flip the images obtained in step 32-1 and convert the labeling information correspondingly.
32-3. To improve the robustness of the neural network model and let it adapt to color differences caused by different illumination, scenes and cameras, apply random color enhancement, brightness enhancement, contrast enhancement and the like to the images obtained in step 32-2. This step includes, but is not limited to, the three enhancements above.
32-4. To obtain training samples for the second neural network model, cut the training images of step 31 as follows: taking the center point of a gesture frame in the image as the reference, add a random offset in the x and y directions, the offset not exceeding the gesture frame; taking the offset point as the center, perform a square cut whose side length is 3 to 6 times the larger of the gesture frame's width and height, the multiple being a random floating-point number between 3 and 6; if the cut region exceeds the range of the original image, fill it with zero values. Convert the labeling information correspondingly. Cutting at random multiples increases the diversity of the training samples and helps the tracking model adapt to fluctuations in gesture-frame size caused by detection errors in the previous frame, thereby improving the stability of the model; a sketch of this crop is given after step 32-6 below.
32-5. To improve the robustness of the second neural network model and enable it to correctly track and recognize the left and right hand, randomly mirror-flip the images obtained in step 32-4 and convert the labeling information correspondingly.
32-6. To improve the robustness of the second neural network model and let it adapt to color differences caused by different illumination, scenes and cameras, apply random color enhancement, brightness enhancement, contrast enhancement and the like to the images obtained in step 32-5. This step includes, but is not limited to, the three enhancements above.
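A sketch of the random square crop of step 32-4, assuming height x width x channel NumPy images and a center-based (cx, cy, w, h) gesture frame in pixels; mirror flipping and the corresponding label conversion are omitted for brevity:

```python
import random
import numpy as np

def crop_for_tracker(image, box):
    """Random square crop around box=(cx, cy, w, h) as in step 32-4."""
    cx, cy, w, h = box
    # Random offset of the center, bounded so it stays within the frame.
    cx += random.uniform(-w / 2, w / 2)
    cy += random.uniform(-h / 2, h / 2)
    # Side length: 3x to 6x the larger of width and height.
    side = max(w, h) * random.uniform(3.0, 6.0)
    x1, y1 = int(cx - side / 2), int(cy - side / 2)
    x2, y2 = int(x1 + side), int(y1 + side)
    # Zero-fill wherever the crop extends beyond the original image.
    out = np.zeros((y2 - y1, x2 - x1, 3), dtype=image.dtype)
    ix1, iy1 = max(x1, 0), max(y1, 0)
    ix2, iy2 = min(x2, image.shape[1]), min(y2, image.shape[0])
    out[iy1 - y1:iy2 - y1, ix1 - x1:ix2 - x1] = image[iy1:iy2, ix1:ix2]
    return out
```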
And 33, training a neural network model:
33-1. Train the first neural network model using the first-network training samples preprocessed in step 32. Resize the training samples to the input size of the first neural network model, input them into the network for forward propagation, and compute the loss from the model output. The loss consists of three parts: the gesture-box position loss, the gesture-box confidence loss and the gesture-category loss. The gesture-box position loss is computed with a mean-squared-error loss function, while the confidence loss and the category loss are computed with cross-entropy loss functions. According to the computed loss value, optimize the parameters of the network model with gradient descent and the backpropagation algorithm. Repeat the above steps and check whether the model has converged: if so, stop training to obtain the trained first neural network model; otherwise continue training until convergence. Since non-gestures are not treated as a separate class, a specific strategy is needed to distinguish positive and negative samples during training: when the IoU of a bounding box with a ground truth is larger than that of all other bounding boxes, its objectness target is set to 1; if a bounding box is not the one with the largest IoU but its IoU is still greater than 0.5, it is ignored (neither penalized nor rewarded). Only one best bounding box is assigned to each ground truth. If a bounding box does not correspond to any ground truth, it contributes nothing to the regression of box position and size or to the class prediction, and only its confidence is penalized.
33-2. Train the second neural network model using the second-network training samples preprocessed in step 32, in the same way: resize the training samples to the input size of the second neural network model, forward-propagate, and compute the same three-part loss (gesture-box position loss with mean squared error; confidence and category losses with cross entropy); optimize the network parameters with gradient descent and backpropagation; repeat until the model converges to obtain the trained second neural network model. The same positive/negative sample strategy is used: the bounding box with the largest IoU against a ground truth gets an objectness target of 1; boxes whose IoU is greater than 0.5 but not the largest are ignored; each ground truth is assigned only one best bounding box; and a box matching no ground truth is penalized only on its confidence.
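The three-part loss can be sketched in PyTorch as follows. The masks encode the positive/negative strategy described above (the best-IoU box gets an objectness target of 1; boxes with IoU above 0.5 that are not the best are ignored); the tensor names and shapes and the use of binary cross-entropy with logits for the confidence term are illustrative assumptions, not the patent's exact implementation:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_box, true_box, pred_obj, obj_target,
                   pred_cls, cls_target, obj_mask, ignore_mask):
    """Loss = box-position MSE + confidence cross-entropy + class
    cross-entropy. obj_mask marks boxes matched to a ground truth
    (largest IoU); ignore_mask marks boxes with IoU > 0.5 that are
    neither penalized nor rewarded."""
    # Gesture-box position loss: mean squared error, positives only.
    box_loss = F.mse_loss(pred_box[obj_mask], true_box[obj_mask])
    # Confidence loss: cross-entropy, skipping the ignored boxes.
    keep = ~ignore_mask
    conf_loss = F.binary_cross_entropy_with_logits(
        pred_obj[keep], obj_target[keep])
    # Category loss: cross-entropy on matched boxes only.
    cls_loss = F.cross_entropy(pred_cls[obj_mask], cls_target[obj_mask])
    return box_loss + conf_loss + cls_loss
```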
In practical applications, the first neural network model and the second neural network model may both adopt variant structures of the YOLOv3 model; the YOLOv3 variants of the two models are summarized below.
The input size of the first neural network is 576 in width and 320 in height. Features are extracted with stride-1 convolution kernels of size 3 x 3 and 1 x 1; feature maps are downsampled with max-pooling layers; bilinear interpolation is used as the upsampling layer; and routing layers splice feature maps of different depths. The 19th and 25th layers of the model serve as the two output layers, predicting gesture boxes at two scales so that gestures at different distances can be detected more accurately.
The input size of the second neural network is 208 in width and 208 in height. Features are likewise extracted with stride-1 convolution kernels of size 3 x 3 and 1 x 1; feature maps are downsampled with max-pooling layers; bilinear interpolation is used as the upsampling layer; and routing layers splice feature maps of different depths. The 14th and 21st layers of the model serve as the two output layers, predicting gesture boxes at two scales so that the model behaves more stably when tracking the gesture box based on the previous frame.
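The building blocks named above (stride-1 3 x 3 and 1 x 1 convolutions, max-pool downsampling, bilinear upsampling, and route-layer concatenation feeding two output layers) can be sketched in PyTorch. This is a minimal illustration of the pattern only, not the exact 19/25- or 14/21-layer topology; the channel counts, batch normalization and LeakyReLU activations are assumptions borrowed from common YOLOv3 practice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(cin, cout, k):
    # Stride-1 convolution (3x3 or 1x1) used for feature extraction.
    return nn.Sequential(nn.Conv2d(cin, cout, k, 1, k // 2),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.1))

class TwoScaleHead(nn.Module):
    """Illustrative two-scale detector in the spirit described above."""
    def __init__(self, num_outputs=6):
        super().__init__()
        self.stem = nn.Sequential(conv(3, 16, 3), nn.MaxPool2d(2),
                                  conv(16, 32, 3), nn.MaxPool2d(2),
                                  conv(32, 64, 3), nn.MaxPool2d(2))
        self.deep = nn.Sequential(nn.MaxPool2d(2), conv(64, 128, 3))
        self.out_deep = nn.Conv2d(128, num_outputs, 1)     # coarse scale
        self.lateral = conv(128, 64, 1)
        self.out_shallow = nn.Conv2d(128, num_outputs, 1)  # finer scale

    def forward(self, x):
        shallow = self.stem(x)
        deep = self.deep(shallow)
        y1 = self.out_deep(deep)  # first output layer (coarse grid)
        # Bilinear interpolation as the upsampling layer, then a route
        # layer concatenating feature maps of different depths.
        up = F.interpolate(self.lateral(deep), scale_factor=2,
                           mode="bilinear", align_corners=False)
        y2 = self.out_shallow(torch.cat([up, shallow], dim=1))
        return y1, y2
```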
Example four
As shown in fig. 4, this embodiment of a recognition device for effective gestures is a virtual software device comprising: an image acquisition module, an image preprocessing module, a gesture detection and recognition module, a model selection module and a gesture response module.
The image acquisition module is used for acquiring the current frame image acquired by the camera.
The image preprocessing module is used for carrying out normalization processing on the current frame image and judging whether the gesture is detected in the previous frame image or not according to the gesture detection and recognition result of the previous frame image; however, if the image is the first frame image, only normalization processing is required.
The gesture detection and recognition module is used for carrying out gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain a possible region, a gesture category and a confidence coefficient of a gesture in the current frame image; sequentially carrying out gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain possible regions of gestures in the image, gesture categories and confidence degrees of the gestures; and the gesture recognition module is also used for judging whether the proportion of the image frames with the same gesture in the image frames in the time interval is greater than a preset proportion threshold value or not, if so, the gesture is considered to be an effective gesture, and a judgment result is returned.
The model selection module is used for selecting a first neural network model or a second neural network model according to a gesture detection result in a previous frame of image, the first neural network model is a pre-trained convolution network single detection model and is used for directly predicting possible regions and types of gestures on a full graph, the second neural network model is a pre-trained convolution network single detection model, and the gestures are tracked according to the previous frame of detection result;
if the current frame is the first frame of image or the gesture is not detected in the previous frame of image, inputting the current frame of image into a first neural network model for gesture detection and recognition, outputting coordinates of a possible region of the gesture, possible types of the gesture and a confidence coefficient of the result in the current frame of image by the first neural network model, if the confidence coefficient is greater than a preset confidence coefficient threshold value, receiving the detection and recognition result predicted by the first neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result;
if the gesture is detected in the previous frame of image, mapping the position of the gesture in the previous frame of image into the current frame of image, expanding the mapping region on the current frame of image outwards according to a preset multiple, inputting the expanded mapping region into a second neural network model for gesture detection and recognition, outputting the coordinates of the possible region of the gesture in the current image, the possible types of the gesture and the confidence coefficient of the result by the second neural network model, if the confidence coefficient is greater than a preset confidence coefficient threshold value, receiving the prediction result of the second neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the result;
the gesture response module is configured to determine a change of a gesture category within a preset time period, and execute a preset control operation, where the preset time period is a time period obtained by determining the change of the gesture category.
The gesture response module judges whether a category change has occurred as follows: judging all image frames within the preset time interval, and if the detected gesture changes from a stable state of one category to a stable state of another category at some image frame, determining that the gesture category has changed; wherein a stable state of a category means that the proportion of image frames containing the same gesture among all image frames of the video in the preset time interval is greater than a preset proportion threshold.
EXAMPLE five
As shown generally in fig. 1, an embodiment of an electronic device for recognizing valid gestures includes: a processor and a memory, wherein the processor is capable of executing the above-mentioned method for recognizing valid gestures (the specific process is as described above and is not repeated here); the memory is used for storing all the obtained detection images, the result of image preprocessing and the result of gesture detection and recognition, and storing an executable program for gesture response.
While specific embodiments of the invention have been described, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, as equivalent modifications and variations as will be made by those skilled in the art in light of the spirit of the invention are intended to be included within the scope of the appended claims.

Claims (7)

1. A method for recognizing valid gestures is characterized in that: the method comprises the following steps:
s11, acquiring a current frame image acquired by a camera, and preprocessing the acquired current frame image: firstly, normalizing a current frame image, and judging whether a gesture is detected in a previous frame image according to a gesture detection and recognition result of the previous frame image;
s12, performing gesture detection and recognition on the current frame image according to a preset recognition algorithm to obtain a gesture category of a gesture in the current frame image and a confidence coefficient of a recognition result, and judging whether to accept the recognition result according to the confidence coefficient;
s13, sequentially carrying out gesture detection and recognition on all image frames of the video within a fixed time interval after the current frame to obtain the gesture category of the gesture in the image and the confidence coefficient of the recognition result, and judging whether to accept the recognition result according to the confidence coefficient;
s14, judging whether the proportion of the image frames with the same gesture category in the image frames in the fixed time interval is larger than a preset proportion threshold value or not, if so, considering the gesture as an effective gesture, if not, identifying the next image frame, and returning to the step S13;
the detection and identification in step S12 and step S13 are specifically:
selecting a first neural network model or a second neural network model according to a gesture detection result in a previous frame of image, wherein the first neural network model is a pre-trained convolution network single detection model and is used for directly predicting possible regions and types of gestures on a full graph, and the second neural network model is a pre-trained convolution network single detection model and is used for tracking the gestures according to the gesture regions detected in the previous frame;
if the current frame is the first frame image or the gesture is not detected in the previous frame image, inputting the current frame image into a first neural network model for gesture detection and recognition, outputting coordinates of a possible region of the gesture, possible types of the gesture and a confidence coefficient of a recognition result in the current frame image by the first neural network model, if the confidence coefficient is greater than or equal to a preset confidence coefficient threshold value, receiving the detection and recognition result predicted by the first neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the current frame image;
if the gesture is detected in the previous frame of image, mapping the position of the gesture in the previous frame of image to the current frame of image, expanding the mapping region on the current frame of image outwards according to a preset multiple, inputting the expanded mapping region to a second neural network model for gesture detection and recognition, outputting the coordinates of the possible region of the gesture in the current image, the possible types of the gesture and the confidence coefficient of the recognition result by the second neural network model, if the confidence coefficient is greater than a preset confidence coefficient threshold value, receiving the prediction result of the second neural network model, and if the confidence coefficient is less than the preset confidence coefficient threshold value, ignoring the prediction result.
2. A method of active gesture recognition according to claim 1, wherein: the training method of the first neural network model comprises the following steps:
acquiring a first type of training sample set and labeling information of gestures;
performing data preprocessing on the first class training sample set: cutting and mirror image turning the first type of training samples according to a preset aspect ratio;
converting the labeling information of the gesture according to the cutting and turning conditions, and performing random color enhancement on the cut picture;
and training a first neural network model by using the preprocessed first-class sample set.
3. A method of recognizing valid gestures according to claim 1, characterized by: the training method of the second neural network model comprises the following steps:
acquiring a second type of training sample set and labeling information of gestures;
performing data preprocessing on the second type training sample set: taking the position of the gesture frame, or that position after a random offset, as the center, randomly expanding the second type of training sample outwards by 3 to 6 times to perform cutting and mirror flipping, converting the labeling information of the gesture according to the cutting and flipping, and performing random color enhancement on the cut picture;
training a second neural network model using the preprocessed second class sample set.
4. A control method after recognition of an effective gesture is characterized by comprising the following steps: after the effective gesture is recognized by the effective gesture recognition method according to claim 1, the following steps are performed:
s21, counting and analyzing the effective gesture recognition results of all detection frames in a fixed time interval before the current frame, and judging whether continuous and stable effective gestures exist in the fixed time interval;
s22, judging whether a continuous stable effective gesture type is changed into another continuous stable effective gesture type in the fixed time interval;
and S23, when the gesture type is found to be changed, executing control operation corresponding to the gesture change.
5. The method of claim 4, wherein whether the gesture category has changed is judged as follows:
judging all image frames in the fixed time interval, and if the detected gesture changes from a stable state of one category to a stable state of another category at some image frame, determining that the gesture category has changed;
wherein a stable state of a category means that the proportion of image frames containing the same gesture among all image frames of the video in the fixed time interval is greater than a preset proportion threshold.
6. An apparatus for recognizing a valid gesture, characterized by comprising:
the image acquisition module is used for acquiring the current frame image captured by the camera;
the image preprocessing module is used for normalizing the current frame image and for judging, according to the gesture detection and recognition result of the previous frame image, whether a gesture was detected in the previous frame image;
the gesture detection and recognition module is used for performing gesture detection and recognition on the current frame image according to a preset recognition algorithm, obtaining the gesture category of the gesture in the current frame image and the confidence of the recognition result, and judging whether to accept the recognition result according to the confidence;
the module likewise performs gesture detection and recognition, in turn, on all image frames of the video within a fixed time interval after the current frame, obtaining the gesture category and the confidence of the recognition result for each frame and judging whether to accept each result according to its confidence;
the gesture detection and recognition module is also used for judging whether the proportion of image frames containing the same gesture among the image frames within the time interval is greater than a preset proportion threshold; if so, the gesture is considered an effective gesture and the judgment result is returned;
the model selection module is used for selecting the first neural network model or the second neural network model according to the gesture detection result of the previous frame image; the first neural network model is a pre-trained single-shot convolutional detection model used to directly predict candidate gesture regions and categories over the full image, and the second neural network model is a pre-trained single-shot convolutional detection model used to track the gesture based on the previous frame's detection result;
if no gesture was detected in the previous frame image, the current frame image is input into the first neural network model for gesture detection and recognition; the first neural network model outputs the coordinates of the candidate gesture region in the current frame image, the candidate gesture category, and the confidence of the recognition result; if the confidence is greater than or equal to a preset confidence threshold, the detection and recognition result predicted by the first neural network model is accepted, and if the confidence is below the preset confidence threshold, the result is ignored;
if a gesture was detected in the previous frame image, the position of the gesture in the previous frame image is mapped into the current frame image, the mapped region on the current frame image is expanded outward by a preset factor, and the expanded mapped region is input into the second neural network model for gesture detection and recognition; the second neural network model outputs the coordinates of the candidate gesture region in the current image, the candidate gesture category, and the confidence of the recognition result; if the confidence is greater than the preset confidence threshold, the prediction of the second neural network model is accepted, and if the confidence is below the preset confidence threshold, the result is ignored (a sketch of this dispatch follows).
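A rough sketch of this per-frame model dispatch, assuming model objects exposing detect_full(image) and detect_region(patch) methods that return a (box, label, confidence) triple; these interfaces, the 0.5 confidence threshold, and the 3x expansion factor are illustrative assumptions, not part of the patent.

    def detect_gesture(frame, prev_box, model_full, model_track,
                       conf_threshold=0.5, expand=3.0):
        # One frame of the claim-6 dispatch. `frame` is an HxWx3 array;
        # `prev_box` is the previous frame's gesture box or None.
        h, w = frame.shape[:2]
        if prev_box is None:
            # No gesture last frame: single-shot detection over the full image.
            box, label, conf = model_full.detect_full(frame)
        else:
            # Gesture last frame: reuse its box in this frame's coordinates
            # and expand the region outward by the preset factor.
            x1, y1, x2, y2 = prev_box
            cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
            hw, hh = (x2 - x1) * expand / 2, (y2 - y1) * expand / 2
            rx1, ry1 = int(max(0, cx - hw)), int(max(0, cy - hh))
            rx2, ry2 = int(min(w, cx + hw)), int(min(h, cy + hh))
            box, label, conf = model_track.detect_region(frame[ry1:ry2, rx1:rx2])
            if box is not None:
                # Map region-local coordinates back to full-image coordinates.
                box = (box[0] + rx1, box[1] + ry1, box[2] + rx1, box[3] + ry1)
        if box is None or conf < conf_threshold:
            return None  # low-confidence result is ignored per the claim
        return box, label, conf

Returning None makes the next frame fall back to full-image detection, which matches the claim's rule that ignored results leave the pipeline without a previous-frame gesture.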
7. An electronic device for recognition of valid gestures, characterized by comprising a processor and a memory, wherein the processor is operable to perform the method of recognizing valid gestures according to any one of claims 1 to 3, and the memory is used for storing all acquired detection images, the results of image preprocessing, and the results of gesture detection and recognition, as well as an executable program for gesture response.
CN201910735669.8A 2019-08-09 2019-08-09 Effective gesture recognition method, control method and device and electronic equipment Active CN110458095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910735669.8A CN110458095B (en) 2019-08-09 2019-08-09 Effective gesture recognition method, control method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110458095A CN110458095A (en) 2019-11-15
CN110458095B true CN110458095B (en) 2022-11-18

Family

ID=68485693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910735669.8A Active CN110458095B (en) 2019-08-09 2019-08-09 Effective gesture recognition method, control method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110458095B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158467A (en) * 2019-12-12 2020-05-15 青岛小鸟看看科技有限公司 Gesture interaction method and terminal
CN112262393A (en) * 2019-12-23 2021-01-22 商汤国际私人有限公司 Gesture recognition method and device, electronic equipment and storage medium
CN111382687A (en) * 2020-03-05 2020-07-07 平安科技(深圳)有限公司 Face detection method and system
CN111597969A (en) * 2020-05-14 2020-08-28 新疆爱华盈通信息技术有限公司 Elevator control method and system based on gesture recognition
CN111931677A (en) * 2020-08-19 2020-11-13 北京影谱科技股份有限公司 Face detection method and device and face expression detection method and device
CN112306235B (en) * 2020-09-25 2023-12-29 北京字节跳动网络技术有限公司 Gesture operation method, device, equipment and storage medium
CN114510142B (en) * 2020-10-29 2023-11-10 舜宇光学(浙江)研究院有限公司 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment
CN112508016B (en) * 2020-12-15 2024-04-16 深圳万兴软件有限公司 Image processing method, device, computer equipment and storage medium
CN112860212A (en) * 2021-02-08 2021-05-28 海信视像科技股份有限公司 Volume adjusting method and display device
WO2022183321A1 (en) * 2021-03-01 2022-09-09 华为技术有限公司 Image detection method, apparatus, and electronic device
CN113076836B (en) * 2021-03-25 2022-04-01 东风汽车集团股份有限公司 Automobile gesture interaction method
CN113065458B (en) * 2021-03-29 2024-05-28 芯算一体(深圳)科技有限公司 Voting method and system based on gesture recognition and electronic equipment
CN113095292A (en) * 2021-05-06 2021-07-09 广州虎牙科技有限公司 Gesture recognition method and device, electronic equipment and readable storage medium
CN113326829B (en) * 2021-08-03 2021-11-23 北京世纪好未来教育科技有限公司 Method and device for recognizing gesture in video, readable storage medium and electronic equipment
CN113780083A (en) * 2021-08-10 2021-12-10 新线科技有限公司 Gesture recognition method, device, equipment and storage medium
WO2023077886A1 (en) * 2021-11-04 2023-05-11 海信视像科技股份有限公司 Display device and control method therefor
CN114546106A (en) * 2021-12-27 2022-05-27 深圳市鸿合创新信息技术有限责任公司 Method and device for identifying air gesture, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103376890A (en) * 2012-04-16 2013-10-30 富士通株式会社 Gesture remote control system based on vision
CN106247561A (en) * 2016-08-30 2016-12-21 广东美的制冷设备有限公司 A kind of air-conditioning and long-range control method thereof and device
CN108229277A (en) * 2017-03-31 2018-06-29 北京市商汤科技开发有限公司 Gesture identification, control and neural network training method, device and electronic equipment
CN109598198A (en) * 2018-10-31 2019-04-09 深圳市商汤科技有限公司 The method, apparatus of gesture moving direction, medium, program and equipment for identification
CN109814717A (en) * 2019-01-29 2019-05-28 珠海格力电器股份有限公司 Household equipment control method and device, control equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902588B (en) * 2019-01-29 2021-08-20 北京奇艺世纪科技有限公司 Gesture recognition method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN110458095A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110458095B (en) Effective gesture recognition method, control method and device and electronic equipment
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN110610166B (en) Text region detection model training method and device, electronic equipment and storage medium
CN112508975A (en) Image identification method, device, equipment and storage medium
CN111008576B (en) Pedestrian detection and model training method, device and readable storage medium
CN113095152B (en) Regression-based lane line detection method and system
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN109902631B (en) Rapid face detection method based on image pyramid
CN111126209B (en) Lane line detection method and related equipment
CN112085789A (en) Pose estimation method, device, equipment and medium
JP2023527615A (en) Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program
CN111259808A (en) Detection and identification method of traffic identification based on improved SSD algorithm
CN115335872A (en) Training method of target detection network, target detection method and device
CN112801236A (en) Image recognition model migration method, device, equipment and storage medium
CN114445853A (en) Visual gesture recognition system recognition method
CN111027526A (en) Method for improving vehicle target detection, identification and detection efficiency
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN109241893B (en) Road selection method and device based on artificial intelligence technology and readable storage medium
CN111476226B (en) Text positioning method and device and model training method
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN115345932A (en) Laser SLAM loop detection method based on semantic information
CN115311244A (en) Method and device for determining lesion size, electronic equipment and storage medium
CN113033593B (en) Text detection training method and device based on deep learning
CN114092766A (en) Robot grabbing detection method based on characteristic attention mechanism
CN114067359A (en) Pedestrian detection method integrating human body key points and attention features of visible parts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant