CN107368182B - Gesture detection network training, gesture detection and gesture control method and device - Google Patents

Gesture detection network training, gesture detection and gesture control method and device

Info

Publication number
CN107368182B
Authority
CN
China
Prior art keywords
image
neural network
convolutional neural
gesture
gesture detection
Prior art date
Legal status
Active
Application number
CN201610696340.1A
Other languages
Chinese (zh)
Other versions
CN107368182A (en)
Inventor
李全全
闫俊杰
钱晨
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201610696340.1A
Priority to PCT/CN2017/098182 (published as WO2018033154A1)
Publication of CN107368182A
Application granted
Publication of CN107368182B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 - Static hand or arm
    • G06V 40/113 - Recognition of static hand signs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application discloses a method and a device for gesture detection network training, gesture detection and gesture control, relating to the technical field of image processing. The gesture detection network training comprises the following steps: training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image; replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network; and training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, keeping the second feature extraction layer parameters unchanged in the training process. By applying the scheme provided by the embodiment of the application, the requirements on the user in human-computer interaction are reduced and the user experience is improved.

Description

Gesture detection network training, gesture detection and gesture control method and device
Technical Field
The application relates to the technical field of image processing, in particular to a method and a device for gesture detection network training, gesture detection and gesture control.
Background
With the rapid development of electronic technology, human-computer interaction technology is involved in more and more application scenarios. In human-computer interaction, human gestures are first detected, and further human-computer interaction operations can then be carried out according to the detected gestures.
In the prior art, gesture detection is mainly performed based on sensors, so gesture detection can be realized only when the user wears or holds related equipment. However, whether the user wears or holds the related device, the user needs to master the technical knowledge of operating the device; the requirements on the user are high and the user experience is poor.
Disclosure of Invention
The embodiment of the application discloses a gesture detection and control scheme.
The embodiment of the application discloses a gesture detection network training method, which comprises the following steps:
training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image;
replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
and training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
Optionally, training the second convolutional neural network parameter according to the prediction result of the human hand candidate region and the sample image, including:
correcting the prediction information of the human hand candidate area;
and training the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate region and the sample image.
Optionally, the hand labeling information includes labeling information of a hand region.
Optionally, the human hand labeling information includes labeling information of a gesture.
Optionally, the first convolutional neural network comprises: a first input layer, a first feature extraction layer and a first classification output layer, wherein the first classification output layer is used for predicting whether a plurality of candidate regions of the sample image are human hand candidate regions.
Optionally, the second convolutional neural network comprises: a second input layer, a second feature extraction layer and a second classification output layer, wherein the second classification output layer is used for outputting the gesture detection result of the sample image.
Optionally, the gesture detection result includes at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
Optionally, the correcting the prediction information of the human hand candidate region includes:
inputting a plurality of supplementary negative sample images and the prediction information of the human hand candidate region into a third convolutional neural network for classification, so as to filter the negative samples in the human hand candidate region and obtain the corrected prediction information of the human hand candidate region.
Optionally, the difference between the number of human hand candidate regions in the prediction information of the human hand candidate region and the number of the supplementary negative sample images falls within a predetermined allowable range.
Optionally, the number of human hand candidate regions in the prediction information of the human hand candidate region is equal to the number of the supplementary negative sample images.
Optionally, the first convolutional neural network is an RPN, and/or the second convolutional neural network is an FRCNN.
Optionally, the third convolutional neural network is FRCNN.
In order to achieve the above object, an embodiment of the present application discloses a gesture detection method, including:
detecting an image by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate region, wherein the image comprises a static image or an image in a video;
taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
Optionally, the fourth convolutional neural network comprises: a fourth input layer, a fourth feature extraction layer and a fourth classification output layer, wherein the fourth classification output layer is used for detecting whether a plurality of candidate regions divided from the image are human hand candidate regions.
Optionally, the fifth convolutional neural network comprises: a fifth input layer, a fifth feature extraction layer and a fifth classification output layer, wherein the fifth classification output layer is used for outputting the gesture detection result of the image.
Optionally, the gesture detection result includes at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
In order to achieve the above object, an embodiment of the present application discloses a gesture control method, including:
detecting an image with a gesture detection network obtained by training with the above gesture detection network training method, or detecting an image with the above gesture detection method, to obtain a gesture detection result of the image, wherein the image comprises a static image or an image in a video;
and triggering corresponding control operation at least according to the gesture detection result of the image.
Optionally, triggering the corresponding control operation according to at least the gesture detection result of the image includes:
recording the times of obtaining the same gesture detection result by continuously detecting the images in the video within a time period;
and triggering corresponding control operation according to the gesture detection result when the recorded times meet a preset condition.
Optionally, triggering a corresponding control operation according to the gesture detection result includes:
determining a control instruction corresponding to the gesture detection result;
and triggering corresponding operation according to the control instruction.
In order to achieve the above object, an embodiment of the present application discloses a gesture detection network training device, including:
the system comprises a first training module, a second training module and a third training module, wherein the first training module is used for training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network aiming at a human hand candidate region of the sample image;
the parameter replacement module is used for replacing a second feature extraction layer parameter of a second convolutional neural network for detecting the gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
and the second training module is used for training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
Optionally, the second training module comprises:
the correction submodule is used for correcting the prediction information of the human hand candidate area;
and the training sub-module is used for training the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
Optionally, the hand labeling information includes labeling information of a hand region.
Optionally, the human hand labeling information includes labeling information of a gesture.
Optionally, the first convolutional neural network comprises: a first input layer, a first feature extraction layer and a first classification output layer, wherein the first classification output layer is used for predicting whether a plurality of candidate regions of the sample image are human hand candidate regions.
Optionally, the second convolutional neural network comprises: a second input layer, a second feature extraction layer and a second classification output layer, wherein the second classification output layer is used for outputting the gesture detection result of the sample image.
Optionally, the gesture detection result includes at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
Optionally, the modification sub-module is specifically configured to input the multiple supplementary negative sample images and the prediction information of the human hand candidate region into a third convolutional neural network for classification, so as to filter the negative samples in the human hand candidate region, and obtain the modified prediction information of the human hand candidate region.
Optionally, the difference between the number of human hand candidate regions in the prediction information of the human hand candidate region and the number of the supplementary negative sample images falls within a predetermined allowable range.
Optionally, the number of human hand candidate regions in the prediction information of the human hand candidate region is equal to the number of the supplementary negative sample images.
Optionally, the first convolutional neural network is an RPN, and/or the second convolutional neural network is an FRCNN.
Optionally, the third convolutional neural network is FRCNN.
In order to achieve the above object, an embodiment of the present application discloses a gesture detection device, including:
the first obtaining module is used for detecting an image by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate area, wherein the image comprises a static image or an image in a video;
the detection module is used for taking the first characteristic information and the prediction information of the human hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
Optionally, the fourth convolutional neural network comprises: a fourth input layer, a fourth feature extraction layer and a fourth classification output layer, wherein the fourth classification output layer is used for detecting whether a plurality of candidate regions divided from the image are human hand candidate regions.
Optionally, the fifth convolutional neural network comprises: a fifth input layer, a fifth feature extraction layer and a fifth classification output layer, wherein the fifth classification output layer is used for outputting the gesture detection result of the image.
Optionally, the gesture detection result includes at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
Optionally, the gesture detection result further includes: a non-predetermined gesture type.
In order to achieve the above object, an embodiment of the present application discloses a gesture control apparatus, including:
a second obtaining module, configured to detect an image with a gesture detection network obtained by training with the above gesture detection network training apparatus, or to detect the image with the above gesture detection apparatus, to obtain a gesture detection result of the image, where the image includes a still image or an image in a video;
and the triggering module is used for triggering corresponding control operation at least according to the gesture detection result of the image.
Optionally, the triggering module includes:
the recording submodule is used for recording the times of obtaining the same gesture detection result by continuously detecting the images in the video within a time period;
and the triggering submodule is used for triggering corresponding control operation according to the gesture detection result when the recorded times meet the preset condition.
Optionally, the trigger submodule includes:
the determining unit is used for determining a control instruction corresponding to the gesture detection result when the recorded times meet a preset condition;
and the triggering unit is used for triggering corresponding operation according to the control instruction.
In order to achieve the above object, an embodiment of the present application discloses an application program, where the application program is configured to execute the above gesture detection network training method, or the above gesture detection method, or the above gesture control method when running.
In order to achieve the above object, an embodiment of the present application discloses an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the terminal; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the above-mentioned gesture detection network training method, or the above-mentioned gesture detection method, or the above-mentioned gesture control method.
As can be seen from the above, in the embodiment of the application, the first convolutional neural network is trained according to the sample image containing the human hand labeling information, so as to obtain the prediction information of the first convolutional neural network for the human hand candidate region of the sample image; replacing the second feature extraction layer parameters of the second convolutional neural network for detecting the gesture with the first feature extraction layer parameters of the trained first convolutional neural network; and training a second convolutional neural network parameter according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameter unchanged in the training process. Therefore, the gesture detection network can be obtained through training by applying the scheme provided by the embodiment of the application, the gesture detection can be carried out through the convolutional neural network obtained through the training, a user does not need to wear or hold any equipment, and the user does not need to master the operation technical knowledge of related equipment, so that the requirement on the user in man-machine interaction is reduced, and the user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a gesture detection network training method according to an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of a gesture according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of another gesture provided in the embodiments of the present application;
fig. 3 is a schematic structural diagram of a gesture detection network training system according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another gesture detection network training method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a gesture detection method according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a gesture control method according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a gesture detection network training apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of another gesture detection network training apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a gesture detection apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a gesture control apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a gesture detection network training method provided in an embodiment of the present application, where the method includes:
s101: and training the first convolutional neural network according to the sample image containing the hand labeling information to obtain the prediction information of the first convolutional neural network for the hand candidate region of the sample image.
The sample image may be an image in RGB format, or may be an image in other format, for example, YUV, etc., which is not limited in this application.
As can be understood by those skilled in the art, the higher the resolution of an image, the larger its data size, the more computing resources gesture detection requires and the slower the detection speed. Therefore, the sample image may be required to satisfy a preset resolution condition. For example, the preset resolution condition may be: the longest edge of the image is no more than 640 pixels, the shortest edge is no more than 480 pixels, and the like.
In addition, the sample image may be obtained by an image capturing device. However, in practical applications, due to different hardware parameters, different settings, and the like of the image capturing device, the captured image may not meet the preset resolution condition. In order to obtain a target image meeting the preset resolution condition, in an optional implementation manner of the present application, the captured image may be scaled after the image capturing device captures it, so as to obtain the sample image, as in the sketch below.
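By way of illustration only, the following is a minimal Python sketch of such a scaling step; the 640/480 limits are taken from the example above, and the use of the Pillow library is an assumption rather than a requirement of the scheme.

```python
# Minimal sketch, assuming Pillow is available: scale a captured image so that
# it satisfies the example preset resolution condition (longest edge <= 640 px,
# shortest edge <= 480 px) without ever upscaling.
from PIL import Image

MAX_LONG_EDGE = 640   # example limit for the longest edge
MAX_SHORT_EDGE = 480  # example limit for the shortest edge

def scale_to_sample_image(path: str) -> Image.Image:
    img = Image.open(path)
    w, h = img.size
    factor = min(1.0, MAX_LONG_EDGE / max(w, h), MAX_SHORT_EDGE / min(w, h))
    if factor < 1.0:
        img = img.resize((int(w * factor), int(h * factor)))
    return img
```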
Specifically, the human hand labeling information may include labeling information of a human hand region.
Optionally, the hand labeling information may further include labeling information of a gesture.
The present application is described only by way of example, and the actual application is not limited to the specific presentation form of the human hand marking information.
In one implementation manner of the present application, the first convolutional neural network may include: a first input layer, a first feature extraction layer and a first classification output layer, wherein the first classification output layer is used for predicting whether a plurality of candidate regions of a sample image are human hand candidate regions.
The layers of the first convolutional neural network are only divided by function. Specifically, the first feature extraction layer may be composed of a convolutional layer, of a convolutional layer and a nonlinear conversion layer, or of a convolutional layer, a nonlinear conversion layer and a pooling layer; the output result of the first classification output layer may be understood as a binary classification result, which may be, but is not limited to being, implemented by a convolutional layer.
When the first convolutional neural network is trained, a first input layer parameter, a first feature extraction layer parameter and a first classification output layer parameter are obtained through training, and then the first convolutional neural network is constructed according to the obtained parameters.
Specifically, training the first convolutional neural network with the sample image can be understood as follows: the initial model of the first convolutional neural network is trained with the sample image to obtain the final first convolutional neural network. When the initial model of the first convolutional neural network is trained with the sample image, a gradient descent method and a back-propagation algorithm can be used.
The initial model of the first convolutional neural network may be determined according to the number of convolutional layers, the number of neurons in each convolutional layer, and the like, which are manually set, and the number of convolutional layers, the number of neurons, and the like may be determined according to actual requirements.
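As an illustration of such an initial model, the following PyTorch sketch shows one possible layering of the first convolutional neural network (the input layer is implicit in the forward call); the channel counts and kernel sizes are assumptions, not values disclosed by the application.

```python
# Minimal sketch of the first convolutional neural network: a feature
# extraction layer (convolution + nonlinear conversion + pooling) followed by a
# classification output layer implemented as a 1x1 convolution that scores each
# spatial location (candidate region) as hand / non-hand.
import torch
import torch.nn as nn

class FirstConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extraction = nn.Sequential(           # first feature extraction layer
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.cls_output = nn.Conv2d(64, 2, kernel_size=1)   # first classification output layer

    def forward(self, x):
        feats = self.feature_extraction(x)    # feature information of the image
        scores = self.cls_output(feats)       # hand / non-hand scores per candidate location
        return feats, scores
```

Training such an initial model would then follow the gradient descent and back-propagation recipe mentioned above, for example with torch.optim.SGD and a cross-entropy loss over the labeled regions.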
When the sample image is labeled, manual labeling can be adopted to ensure that the labeling result is accurate. In addition, the region where the human hand is located in the sample image may be the minimum rectangular region that covers the whole hand in the image, and the hand gesture may be an open gesture, a closed gesture, or the like. Specifically, fig. 2a and fig. 2b provide schematic diagrams of two gestures, in which the region where the gray rectangular box is located is the region where the human hand is located; fig. 2a shows an open gesture and fig. 2b shows a closed gesture.
In addition, in order to make the trained first convolutional neural network more accurate, sample images under various conditions can be selected when selecting the sample images, and the sample images can include: a positive sample image containing the human hand and a negative sample image not containing the human hand.
S102: and replacing the second feature extraction layer parameters of the second convolutional neural network for detecting the gesture with the first feature extraction layer parameters of the trained first convolutional neural network.
In one implementation of the present application, the second convolutional neural network may include: a second input layer, a second feature extraction layer and a second classification output layer, wherein the second classification output layer is used for outputting a gesture detection result of a sample image.
It should be noted that the second feature extraction layer is similar to the first feature extraction layer, and is not described herein again. The output result of the second classification output layer may be understood as a result of multi-classification, which may be implemented by a fully-connected layer, but is not limited to being implemented by a fully-connected layer.
It is worth mentioning that, in this step, the first feature extraction layer parameters of the trained first convolutional neural network are directly used as the second feature extraction layer parameters, so that separate training of the second feature extraction layer of the second convolutional neural network is omitted; that is, the first convolutional neural network and the second convolutional neural network are jointly trained and share the feature extraction layer, as in the sketch below.
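A minimal sketch of this parameter replacement and freezing step is given below, assuming first_net and second_net are PyTorch modules (such as the FirstConvNet sketch above and an analogous gesture classification network) that both expose a feature_extraction sub-module; the module names are assumptions.

```python
# Copy the trained first feature extraction layer parameters into the second
# network, then freeze them so they stay unchanged while the rest of the
# second network is trained.
import torch

second_net.feature_extraction.load_state_dict(
    first_net.feature_extraction.state_dict())

for p in second_net.feature_extraction.parameters():
    p.requires_grad = False    # keep the shared feature extraction layer fixed

optimizer = torch.optim.SGD(
    (p for p in second_net.parameters() if p.requires_grad), lr=1e-3)
```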
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
In addition, the gesture detection result may further include a non-predetermined gesture type, where the non-predetermined gesture type may be understood as a gesture type other than the predetermined gesture types described above, or a case representing "no gesture"; including it may further improve the gesture classification accuracy of the second convolutional neural network.
It should be noted that, the present application is only described by taking the gesture types as examples, and actually, the predetermined gesture types are not limited to the above.
In an implementation manner of the present application, the first convolutional neural network is an RPN (Region Proposal Network), and/or the second convolutional neural network is an FRCNN (Fast RCNN).
In addition, the first convolutional neural network may also be another binary-class or multi-class CNN (Convolutional Neural Network), or a Multi-Box network, YOLO, or the like; the second convolutional neural network may also be another multi-class CNN, a Recurrent Neural Network, or the like, which is not limited in this application.
S103: and training a second convolutional neural network parameter according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameter unchanged in the training process.
When the second convolutional neural network is trained according to the prediction information of the hand candidate region and the sample image, the hand gesture in the sample image can be labeled again, that is, the hand gesture is labeled as open, closed, or the like, and the initial model of the second convolutional neural network is trained on the basis of the current labeling result and the prediction information to obtain the final second convolutional neural network.
Specifically, fig. 3 provides a schematic structural diagram of a gesture detection network training system. When the system is applied to gesture detection network training, the sample image is used as input information of the first convolutional neural network and input to the first input layer; after feature extraction by the first feature extraction layer, the first classification output layer performs classification, the prediction information of the human hand candidate region of the sample image is obtained, and the first convolutional neural network parameters are output. The sample image and the prediction information are then used as input information of the second convolutional neural network and input to the second input layer; after feature extraction by the second feature extraction layer, the second classification output layer performs classification and the second convolutional neural network parameters are obtained, thereby completing the training of the first convolutional neural network and the second convolutional neural network. The first feature extraction layer parameters are the same as the second feature extraction layer parameters, as the flow sketched below shows.
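The overall flow of fig. 3 can be summarised by the following sketch, in which SecondConvNet, train_first_net and train_second_net are hypothetical helpers standing in for an ordinary supervised training loop; they are not defined by the application.

```python
# High-level sketch of the two-stage training flow: train the first network,
# collect its hand-candidate predictions, share and freeze the feature
# extraction layer, then train the second (gesture classification) network.
def train_gesture_detection_network(sample_images, hand_annotations):
    first_net = FirstConvNet()
    train_first_net(first_net, sample_images, hand_annotations)          # stage 1
    candidate_predictions = [first_net(img.unsqueeze(0))[1]              # add batch dim
                             for img in sample_images]

    second_net = SecondConvNet()
    second_net.feature_extraction.load_state_dict(
        first_net.feature_extraction.state_dict())                       # share parameters
    for p in second_net.feature_extraction.parameters():
        p.requires_grad = False                                          # keep them unchanged
    train_second_net(second_net, sample_images,
                     candidate_predictions, hand_annotations)            # stage 2
    return first_net, second_net
```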
As can be seen from the above, in this embodiment, the first convolutional neural network is trained according to the sample image containing the human hand labeling information, so as to obtain the prediction information of the human hand candidate region of the first convolutional neural network for the sample image; replacing the second feature extraction layer parameters of the second convolutional neural network for detecting the gesture with the first feature extraction layer parameters of the trained first convolutional neural network; and training a second convolutional neural network parameter according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameter unchanged in the training process. Therefore, the scheme provided by the embodiment can be used for training to obtain the gesture detection network, the convolutional neural network obtained through the training can be used for performing gesture detection, a user does not need to wear or hold any equipment, and the user does not need to master the operation technical knowledge of related equipment, so that the requirement on the user in human-computer interaction is reduced, and the user experience is improved.
In a specific implementation manner of the present application, referring to fig. 4, a flowchart of another gesture detection network training method is provided, and compared with the foregoing embodiment, in this embodiment, training a second convolutional neural network parameter according to a prediction result of a human hand candidate region and a sample image, and keeping a second feature extraction layer parameter unchanged in a training process (S103), the method includes:
S103A: and correcting the prediction information of the hand candidate area.
Specifically, when the prediction information of the human hand candidate region is corrected, a plurality of supplementary negative sample images and the prediction information of the human hand candidate region may be input to a third convolutional neural network for classification, so as to filter the negative samples in the human hand candidate region and obtain the corrected prediction information of the human hand candidate region.
The supplementary negative sample image is only used as an input of the third convolutional neural network, but not used as an input of the first convolutional neural network and the second convolutional neural network, and the supplementary negative sample image may be a blank image without a hand or an image that includes an area similar to a hand (not a hand) but is not labeled as including a hand.
Specifically, the third convolutional neural network may be FRCNN, and of course, the third convolutional neural network may also be other two or more classes CNN.
Optionally, the difference between the number of human hand candidate regions in the prediction information of the human hand candidate region and the number of supplementary negative sample images falls within a predetermined allowable range. When the difference falls within the predetermined allowable range, the number of human hand candidate regions in the prediction information may be considered equal or close to the number of supplementary negative sample images; the predetermined allowable range is therefore generally a small value, and the specific value may be determined according to actual conditions.
Preferably, the number of human hand candidate regions in the prediction information of the human hand candidate region is equal to the number of supplementary negative sample images; obviously, in this case, the positive sample rate of the human hand candidate regions obtained through the third convolutional neural network is significantly improved.
In addition, the prediction information of the human hand candidate region may also be corrected manually by an annotator, which is not limited in the present application. A sketch of the automatic correction with the third convolutional neural network is given below.
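By way of illustration, the following sketch shows one way the third convolutional neural network could be used to filter negatives out of the candidate regions; third_net, crop_region and the 0.5 threshold are assumptions, and the crops and supplementary negative images are assumed to be resized to a common input size.

```python
# Minimal sketch of the correction step: the candidate regions and roughly the
# same number of supplementary negative sample images are classified by the
# third network, and only candidates still classified as a hand are kept.
import torch

def correct_candidates(third_net, image, candidate_boxes, negative_images):
    candidate_crops = [crop_region(image, box) for box in candidate_boxes]  # hypothetical helper
    batch = torch.stack(candidate_crops + list(negative_images))
    probs = torch.softmax(third_net(batch), dim=1)          # column 1: probability of "hand"
    kept = [box for box, p in zip(candidate_boxes, probs[:len(candidate_boxes), 1])
            if p > 0.5]                                     # filter out negatives
    return kept
```

Note that in this sketch the supplementary negative images appear only in the third-network batch; they are never fed to the first or second convolutional neural network.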
S103B: and training a second convolutional neural network parameter according to the corrected prediction information of the human hand candidate region and the sample image.
Because the prediction result of the first convolutional neural network may have a relatively large error, the accuracy is poorer when the second convolutional neural network is trained with the prediction information obtained by the first convolutional neural network. Compared with the prediction information obtained by the first convolutional neural network, the corrected prediction information of the human hand candidate region is much more accurate, so the second convolutional neural network obtained by training with the corrected prediction information of the human hand candidate region and the sample image has higher accuracy.
As can be seen from the above, in the scheme provided in this embodiment, the prediction information of the human hand candidate region is corrected, and then the second convolutional neural network parameter is trained according to the corrected prediction information of the human hand candidate region and the sample image, so that the accuracy of the trained second convolutional neural network is improved.
Fig. 5 is a schematic flowchart of a gesture detection method provided in an embodiment of the present application, where the method includes:
s501: and detecting the image by adopting a fourth convolutional neural network to obtain first characteristic information of the image and prediction information of the human hand candidate region, wherein the image comprises a static image or an image in a video.
Specifically, the fourth convolutional neural network may include: a fourth input layer, a fourth feature extraction layer and a fourth classification output layer, wherein the fourth classification output layer is used for detecting whether a plurality of candidate regions divided from the image are human hand candidate regions.
The layers of the fourth convolutional neural network are only divided by function. Specifically, the fourth feature extraction layer may be composed of a convolutional layer, of a convolutional layer and a nonlinear conversion layer, or of a convolutional layer, a nonlinear conversion layer and a pooling layer; the output result of the fourth classification output layer may be understood as a binary classification result, which may be, but is not limited to being, implemented by a convolutional layer.
The fourth convolutional neural network may be an RPN, another binary-class or multi-class CNN, a Multi-Box network, YOLO, or the like.
In addition, the fourth convolutional neural network may be the same as the first convolutional neural network, and will not be described in detail here.
Specifically, the fourth convolutional neural network may include: an input layer, an output layer, and a plurality of convolutional layers. When the image is processed by the plurality of convolutional layers of the fourth convolutional neural network, feature extraction is performed on the image. When the fourth convolutional neural network obtains the candidate human hand regions in the image, the image is obtained through the input layer, the features of the image are extracted through the convolutional layers, the candidate human hand regions in the image are determined by combining the extracted features, and the result is output through the output layer.
Simply put, the fourth convolutional neural network can be understood as performing binary classification on regions in the image, that is, distinguishing whether a region in the image is a human hand region: candidate human hand regions are first found in the image, and then binary classification is performed on the candidate human hand regions, as in the sketch below.
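A minimal sketch of this first detection stage (S501) is shown below, assuming fourth_net behaves like the FirstConvNet sketch above, i.e. it returns a feature map together with per-location hand scores; the 0.5 score threshold is an assumption.

```python
# Run the fourth convolutional neural network once to obtain the first feature
# information of the image and a mask of locations treated as hand candidates.
import torch

def detect_candidates(fourth_net, image_tensor, score_threshold=0.5):
    with torch.no_grad():
        feature_map, scores = fourth_net(image_tensor.unsqueeze(0))  # add batch dim
        hand_prob = torch.softmax(scores, dim=1)[0, 1]               # H x W hand probability map
        candidate_mask = hand_prob > score_threshold                 # human hand candidate locations
    return feature_map, candidate_mask
```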
S502: taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; and the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
Specifically, the fifth convolutional neural network may include: a fifth input layer, a fifth feature extraction layer and a fifth classification output layer, wherein the fifth classification output layer is used for outputting the gesture detection result of the image.
It should be noted that the fifth feature extraction layer is similar to the fourth feature extraction layer, and is not described herein again. The output result of the fifth classification output layer may be understood as a result of multi-classification, which may be specifically implemented by a fully-connected layer, but is not limited to being implemented by a fully-connected layer.
The fifth convolutional neural network may be an FRCNN, another multi-class CNN (Convolutional Neural Network), a Recurrent Neural Network, or the like.
In addition, the fifth convolutional neural network may be the same as the second convolutional neural network, and will not be described in detail here.
Wherein the fifth convolutional neural network may include: an input layer, an output layer, a plurality of convolutional layers, and a plurality of fully-connected layers. The convolutional layer is mainly used for feature extraction, and the fully connected layer is equivalent to a classifier and classifies the features extracted by the fifth convolutional layer. When the fifth convolutional neural network obtains a gesture detection result in the image, a candidate hand area is obtained through the input layer, then the characteristics of the candidate hand area are extracted through the convolutional layer, the full-link layer carries out classification processing according to the characteristics of the candidate hand area, whether the image contains a hand or not and the gesture of the hand under the condition that the image contains the hand are determined, and finally the classification result is output through the output layer.
Briefly, the fifth convolutional neural network is mainly used to solve the multi-classification problem, i.e. to distinguish the types of human hand regions, such as open, closed, not human hand, etc.
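The following PyTorch sketch illustrates one possible form of the fifth convolutional neural network: it reuses the same feature extraction layer parameters as the fourth network and classifies each candidate hand region into gesture classes. The class count, layer sizes and the simplification of recomputing features on the cropped candidate region (rather than reusing the cached feature map) are assumptions.

```python
# Minimal sketch of the fifth convolutional neural network: shared feature
# extraction layer plus fully connected layers acting as a multi-class
# classifier (e.g. not-hand / open gesture / closed gesture).
import torch
import torch.nn as nn

NUM_CLASSES = 3  # assumed classes: not-hand, open gesture, closed gesture

class FifthConvNet(nn.Module):
    def __init__(self, feature_extraction: nn.Module):
        super().__init__()
        self.feature_extraction = feature_extraction     # shared with the fourth network
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(                 # fully connected classifier
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, region_crop):
        feats = self.feature_extraction(region_crop)     # second feature information
        return self.classifier(self.pool(feats))         # gesture class scores
```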
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: a waving hand, a scissors hand, a fist, a holding hand, a thumbs-up, a pistol hand, an OK hand, a heart hand, an open hand and a closed hand.
In addition, the gesture detection result may further include a non-predetermined gesture type, where the non-predetermined gesture type may be understood as a gesture type other than the predetermined gesture types described above; including it may further improve the gesture classification accuracy of the fifth convolutional neural network.
It should be noted that, the present application is only described by taking the gesture types as examples, and actually, the predetermined gesture types are not limited to the above.
Specifically, the gesture detection result for the image may include: no hand, an open gesture, a closed gesture, and the like, which is not limited in this application. When the gesture detection result indicates an open gesture or a closed gesture, the hand gesture may be represented in the form of probabilities: when the probability of the open gesture is higher, the image may be considered to contain a hand with an open gesture, and when the probability of the closed gesture is higher, the image may be considered to contain a hand with a closed gesture.
Of course, in an alternative implementation of the present application, the output result of the fifth convolutional neural network model may include: a probability that the candidate hand region does not contain a human hand, a probability that the candidate hand region contains a human hand with an open gesture, a probability that the candidate hand region contains a human hand with a closed gesture, and so on.
It should be noted that, in this document, the term "plurality" may be understood as: at least two.
The image-based gesture detection method is described below by taking an RPN convolutional neural network and a Fast RCNN convolutional neural network as examples.
The target image is processed with an RPN (Region Proposal Network) convolutional neural network to obtain the first feature information of the target image and the prediction information of the human hand candidate regions, and correspondingly the position information of the human hand candidate regions. The position information is then input, as information to be detected, into a Fast RCNN convolutional neural network; the last layer of the Fast RCNN convolutional neural network outputs feature values, whether the region corresponding to the position information is a human hand region is judged according to the feature values, and if it is a hand, whether the specific gesture is an open gesture or a closed gesture is further judged.
Specifically, a fully connected layer with an output dimension of 2 and a softmax layer at the last layer of the Fast RCNN convolutional neural network form a binary classifier; the outputs of the binary classifier are the probability of being a hand and the probability of being a non-hand, and the class with the larger of the two probabilities is taken as the judgment result, as in the sketch below.
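A minimal sketch of such a binary classifier head is shown below; the 256-dimensional feature vector is an assumption.

```python
# Fully connected layer with an output dimension of 2, followed by softmax; the
# class with the larger probability (hand / non-hand) is taken as the result.
import torch
import torch.nn as nn

binary_head = nn.Linear(256, 2)

def judge_hand(feature_vector: torch.Tensor) -> bool:
    probs = torch.softmax(binary_head(feature_vector), dim=-1)
    return bool(probs[1] > probs[0])   # True when "hand" is the more probable class
```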
As can be seen from the above, in the above embodiments, the fourth convolutional neural network is used to detect the image, obtain the first feature information of the image and the prediction information of the human hand candidate region, use the first feature information and the prediction information of the human hand candidate region as the second feature information of the fifth convolutional neural network, and use the fifth convolutional neural network to perform gesture detection on the image according to the second feature information, so as to obtain the gesture detection result of the image. Therefore, when the gesture is detected by applying the scheme provided by each embodiment, a user does not need to wear or hold any equipment, and the user does not need to master the operation technical knowledge of the related equipment, so that the requirements on the user in human-computer interaction are reduced, and the user experience is improved.
Fig. 6 is a schematic flowchart of a gesture control method provided in an embodiment of the present application, where the method includes:
s601: and (3) obtaining a gesture detection network detection image obtained by training by adopting the gesture detection network training method, or obtaining a gesture detection result of the image by adopting the gesture detection method to detect the image, wherein the image comprises a static image or an image in a video.
S602: and triggering corresponding control operation at least according to the gesture detection result of the image.
Specifically, triggering a corresponding control operation according to the gesture detection result of the image can be understood as directly triggering the control operation without looking up a control instruction, that is, a correspondence exists directly between the gesture and the control operation;
alternatively, it can be understood as follows: a correspondence exists between the gesture and the control instruction, the control instruction corresponding to the gesture is found according to the correspondence, and the control operation is then triggered by sending the control instruction.
The present application is described by way of example only, and is not limited to the above embodiments.
In an implementation manner of the application, when triggering corresponding control operation at least according to a gesture detection result of an image, the number of times of obtaining the same gesture detection result by continuously detecting the image in the video within a time period may be recorded first; and then triggering corresponding control operation according to the gesture detection result when the recorded times meet the preset condition.
The predetermined condition may be that the number of times the open gesture is detected within a certain time is greater than a preset value, that the number of times the closed gesture is detected within a certain time is greater than a preset threshold, and the like.
Specifically, when triggering corresponding control operation according to the gesture detection result, determining a control instruction corresponding to the gesture detection result, and triggering corresponding operation according to the control instruction.
Specifically, a mapping relationship between the control instruction and the gesture may be pre-established, and then the control instruction corresponding to the gesture detection result may be determined according to the pre-established mapping relationship.
The mapping relationship may be:
the opening gesture corresponds to starting the system;
the hand waving gesture corresponds to closing the system;
the scissor hand gesture corresponds to a cutting operation;
the fist making gesture corresponds to a pasting operation;
the vertical thumb gesture corresponds to a praise operation;
the hand-robbing gesture corresponds to a selection operation, and the like.
The present application is described only by way of example, and the specific form of the mapping relationship is not limited in practical applications.
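By way of illustration only, such a pre-established mapping and the triggering of the corresponding operation might look like the following sketch; the gesture labels, instruction names and the send_instruction helper are assumptions.

```python
# Minimal sketch: look up the control instruction corresponding to a gesture
# detection result and trigger the corresponding operation.
GESTURE_TO_INSTRUCTION = {
    "open": "start_system",
    "wave": "close_system",
    "scissors": "cut",
    "fist": "paste",
    "thumbs_up": "like",
}

def trigger_control(gesture_result: str) -> None:
    instruction = GESTURE_TO_INSTRUCTION.get(gesture_result)
    if instruction is not None:
        send_instruction(instruction)   # hypothetical helper that issues the instruction
```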
After the gesture information of the human hand is detected and human-computer interaction is further performed according to that gesture information, factors such as the accuracy of the detection algorithm are considered; human-computer interaction is therefore generally performed based on the detection results of images acquired by the image acquisition device within a period of time, for example, 2 seconds, 3 seconds, and the like.
The above control method is described below by way of a specific example.
Suppose that the triggering action of the human-computer interaction is: the hand is detected to be continuously open for more than 2 seconds and then continuously closed for more than 2 seconds. If gestures can be detected in 20 frames of images per second, the corresponding trigger condition is detecting the open gesture in more than 40 consecutive frames and then detecting the closed gesture in more than 40 consecutive frames.
Detecting a target image may yield one of three results: no human hand region, an open gesture, or a closed gesture.
If no human hand region is detected in the target image, the open gesture counter and the closed gesture counter are reset, and detection continues with the next frame of image.
If the target image is detected to contain a human hand region, the hand gesture is further judged. If it is an open gesture, the open gesture counter is incremented by 1, the closed gesture counter is reset, and gesture detection starts on the next frame of image. If it is a closed gesture, it is further judged whether the accumulated value of the open gesture counter is greater than 40; if not, both the open gesture counter and the closed gesture counter are reset and detection continues with the next frame of image; if so, the closed gesture counter is incremented by 1 and it is judged whether the closed gesture counter exceeds 40: if it exceeds 40, the human-computer interaction is triggered, and if not, detection continues with the next frame of image. The sketch below mirrors this counter logic.
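A minimal sketch of this counter logic is given below; the result strings are assumptions, and the threshold of 40 frames corresponds to the 2-second example above.

```python
# Trigger the interaction only after an open gesture has been seen in more than
# 40 consecutive frames followed by a closed gesture in more than 40
# consecutive frames; any frame without a hand resets both counters.
open_count = 0
closed_count = 0

def process_frame(result: str) -> bool:
    """Return True when the human-computer interaction should be triggered."""
    global open_count, closed_count
    if result == "no_hand":
        open_count = closed_count = 0
    elif result == "open":
        open_count += 1
        closed_count = 0
    elif result == "closed":
        if open_count <= 40:              # open phase not long enough yet
            open_count = closed_count = 0
        else:
            closed_count += 1
            if closed_count > 40:
                return True               # trigger the human-computer interaction
    return False
```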
As can be seen from the above, when the above embodiments are applied to control, an image is detected with a gesture detection network obtained by training with the gesture detection network training method, or the image is detected with the gesture detection method, to obtain the gesture detection result of the image, and a corresponding control operation is triggered at least according to the gesture detection result of the image. Therefore, when the scheme provided by each embodiment is applied to operation control, the user does not need to wear or hold any equipment or master the operation of related equipment, so the requirements on the user in human-computer interaction are reduced and the user experience is improved.
Corresponding to the gesture detection network training method, the embodiment of the application also provides a gesture detection network training device.
Fig. 7 is a schematic structural diagram of a gesture detection network training apparatus provided in an embodiment of the present application, where the apparatus includes:
a first training module 701, configured to train a first convolutional neural network according to a sample image containing human hand labeling information, to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image;
a parameter replacement module 702, configured to replace a second feature extraction layer parameter of a second convolutional neural network used for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
the second training module 703 is configured to train the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keep the second feature extraction layer parameters unchanged in the training process.
Specifically, the human hand labeling information may include labeling information of a human hand region.
Specifically, the hand labeling information may include labeling information of a gesture.
Specifically, the first convolutional neural network may include: a first input layer, a first feature extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether a plurality of candidate regions of the sample image are human hand candidate regions.
Specifically, the second convolutional neural network may include: a second input layer, a second feature extraction layer, and a second classification output layer, where the second classification output layer is used to output the gesture detection result of the sample image.
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
Specifically, the gesture detection result may further include: a non-predetermined gesture type.
Specifically, the first convolutional neural network may be an RPN, and/or the second convolutional neural network may be an FRCNN.
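To make the parameter replacement and the frozen-layer training concrete, the following PyTorch-style sketch is offered as an illustrative assumption rather than the claimed implementation: the feature extraction layer parameters of the trained first network (for example an RPN backbone) are copied into the second network, and those parameters are excluded from gradient updates while the remaining parameters of the second network are trained.

```python
import torch

def transfer_and_freeze(first_net, second_net, lr=0.001):
    """Copy the feature extraction layer parameters of the trained first
    network into the second network, then keep them fixed during training.
    Both networks are assumed to expose a sub-module named `features`
    (a hypothetical name for the shared feature extraction layers)."""
    # Replace the second feature extraction layer parameters with the
    # trained first feature extraction layer parameters.
    second_net.features.load_state_dict(first_net.features.state_dict())

    # Keep the copied parameters unchanged during the subsequent training.
    for param in second_net.features.parameters():
        param.requires_grad = False

    # Only the remaining (non-frozen) parameters are handed to the optimizer.
    trainable = [p for p in second_net.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```

Because the fourth and fifth convolutional neural networks used at detection time share feature extraction layer parameters in the same way, the feature computation only needs to be performed once per image.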
As can be seen from the above, in this embodiment, the first convolutional neural network is trained according to the sample image containing the human hand labeling information to obtain the prediction information of the first convolutional neural network for the human hand candidate region of the sample image; the second feature extraction layer parameters of the second convolutional neural network used for detecting gestures are replaced with the first feature extraction layer parameters of the trained first convolutional neural network; and the second convolutional neural network parameters are trained according to the prediction information of the human hand candidate region and the sample image, with the second feature extraction layer parameters kept unchanged during training. Therefore, the scheme provided by this embodiment can be used to train a gesture detection network, and the convolutional neural network obtained through training can be used for gesture detection, so that a user neither needs to wear or hold any equipment nor needs to master technical knowledge of operating the related equipment, which reduces the requirements on the user in human-computer interaction and improves the user experience.
In a specific implementation manner of the present application, referring to fig. 8, a schematic structural diagram of another gesture detection network training apparatus is provided, and compared with the foregoing embodiment, in this embodiment, the second training module 703 includes:
a correction submodule 703A for correcting the prediction information of the hand candidate region;
the training sub-module 703B is configured to train the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate region and the sample image, and keep the second feature extraction layer parameters unchanged in the training process.
Specifically, the correction submodule 703A is configured to input a plurality of supplementary negative sample images and the prediction information of the human hand candidate region into a third convolutional neural network for classification, so as to filter out the negative samples among the human hand candidate regions and obtain the corrected prediction information of the human hand candidate region.
Specifically, the difference between the number of human hand candidate regions in the prediction information of the human hand candidate region and the number of the supplementary negative sample images falls within a predetermined allowable range.
Specifically, the number of human hand candidate regions in the prediction information of the human hand candidate region may be equal to the number of the supplementary negative sample images.
Specifically, the third convolutional neural network may be FRCNN.
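As an illustration of this correction step, the sketch below assumes a hypothetical third_net classifier whose output logits place the "hand" class at index 1, and assumes that the candidate regions and the supplementary negative sample images are already cropped to tensors of the same size; it scores the candidates together with the supplementary negatives and keeps only the candidates the classifier still judges to be hands.

```python
import torch

def correct_candidates(third_net, candidate_crops, negative_crops, threshold=0.5):
    """Filter negative samples out of the predicted hand candidate regions.

    candidate_crops: (N, C, H, W) crops of the predicted hand candidate regions.
    negative_crops:  (M, C, H, W) supplementary negative sample images, with M
                     roughly equal to N so the batch is balanced.
    third_net:       classifier (hypothetical interface) returning per-crop
                     logits with the "hand" class at index 1.
    Returns the indices of the candidate regions kept after correction."""
    third_net.eval()
    with torch.no_grad():
        batch = torch.cat([candidate_crops, negative_crops], dim=0)
        probs = torch.softmax(third_net(batch), dim=1)[:, 1]  # P(hand) per crop

    # Only the scores of the original candidate regions decide the filtering.
    candidate_scores = probs[: candidate_crops.shape[0]]
    keep = (candidate_scores >= threshold).nonzero(as_tuple=True)[0]
    return keep
```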
As can be seen from the above, in the scheme provided in this embodiment, the prediction information of the human hand candidate region is corrected, and then the second convolutional neural network parameter is trained according to the corrected prediction information of the human hand candidate region and the sample image, so that the accuracy of the trained second convolutional neural network is improved.
Corresponding to the gesture detection method, the embodiment of the application also provides a gesture detection device.
Fig. 9 is a schematic structural diagram of a gesture detection apparatus according to an embodiment of the present application, where the apparatus includes:
a first obtaining module 901, configured to detect an image using a fourth convolutional neural network, and obtain first feature information of the image and prediction information of a human hand candidate region, where the image includes a still image or an image in a video;
a detection module 902, configured to use the first feature information and the prediction information of the human hand candidate region as second feature information of a fifth convolutional neural network, and perform gesture detection on the image according to the second feature information by using the fifth convolutional neural network to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
Specifically, the fourth convolutional neural network may include: a fourth input layer, a fourth feature extraction layer, and a fourth classification output layer, where the fourth classification output layer is used to detect whether a plurality of candidate regions into which the image is divided are human hand candidate regions.
Specifically, the fifth convolutional neural network may include: a fifth input layer, a fifth feature extraction layer, and a fifth classification output layer, where the fifth classification output layer is used to output the gesture detection result of the image.
Specifically, the gesture detection result may include at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
Specifically, the gesture detection result may further include: a non-predetermined gesture type.
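To illustrate how the two detection-time networks cooperate, the following sketch assumes hypothetical fourth_net and fifth_net modules with the interfaces shown; since their feature extraction layer parameters are the same, the feature map produced by the fourth network can be passed on directly as part of the second feature information consumed by the fifth network.

```python
import torch

def detect_gestures(fourth_net, fifth_net, image):
    """Two-stage gesture detection sketch.

    fourth_net(image) is assumed to return (feature_map, candidate_boxes):
        the first feature information of the image plus the prediction
        information of the human hand candidate regions.
    fifth_net(feature_map, candidate_boxes) is assumed to return gesture
        logits for each candidate region.
    Both interfaces are illustrative assumptions."""
    fourth_net.eval()
    fifth_net.eval()
    with torch.no_grad():
        feature_map, candidate_boxes = fourth_net(image)
        # The first feature information and the candidate predictions become
        # the second feature information of the fifth network.
        gesture_logits = fifth_net(feature_map, candidate_boxes)
        gesture_types = gesture_logits.argmax(dim=1)
    return candidate_boxes, gesture_types
```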
As can be seen from the above, in the above embodiments, the fourth convolutional neural network is used to detect the image and obtain the first feature information of the image and the prediction information of the human hand candidate region; the first feature information and the prediction information of the human hand candidate region are then used as the second feature information of the fifth convolutional neural network, which performs gesture detection on the image according to the second feature information to obtain the gesture detection result of the image. Therefore, when gestures are detected by applying the scheme provided by each embodiment, the user neither needs to wear or hold any equipment nor needs to master technical knowledge of operating the related equipment, which reduces the requirements on the user in human-computer interaction and improves the user experience.
Corresponding to the gesture control method, the embodiment of the application also provides a gesture control device.
Fig. 10 is a schematic structural diagram of a gesture control apparatus according to an embodiment of the present application, where the apparatus includes:
a second obtaining module 1001, configured to detect an image using a gesture detection network trained by any one of the training apparatuses provided in the embodiments of the present application, or to detect the image using any one of the gesture detection apparatuses provided in the embodiments of the present application, so as to obtain a gesture detection result of the image, where the image includes a still image or an image in a video;
the triggering module 1002 is configured to trigger a corresponding control operation at least according to a gesture detection result of the image.
Specifically, the triggering module 1002 may include:
the recording submodule is used for recording the number of times that the same gesture detection result is obtained by continuously detecting images in the video within a time period;
and the triggering submodule is used for triggering the corresponding control operation according to the gesture detection result when the recorded number of times meets a preset condition.
Specifically, the trigger sub-module may include:
the determining unit is used for determining a control instruction corresponding to the gesture detection result when the recorded number of times meets a preset condition;
and the triggering unit is used for triggering corresponding operation according to the control instruction.
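A minimal sketch of the determining unit and the triggering unit might look like the following; the gesture-to-instruction mapping and the execute_command callback are assumptions introduced for this example and are not taken from the embodiments.

```python
# Hypothetical mapping from gesture detection results to control instructions.
GESTURE_COMMANDS = {
    "open": "START_INTERACTION",
    "fist": "PAUSE_PLAYBACK",
    "waving hand": "NEXT_PAGE",
}

def trigger_control(gesture_result, recorded_count, required_count, execute_command):
    """When the recorded number of identical gesture detection results meets
    the preset condition, determine the corresponding control instruction
    (determining unit) and trigger the operation (triggering unit)."""
    if recorded_count < required_count:
        return None  # Preset condition not met; nothing is triggered.
    command = GESTURE_COMMANDS.get(gesture_result)
    if command is not None:
        execute_command(command)
    return command
```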
As can be seen from the above, when the above embodiments are applied to control, an image is detected either with a gesture detection network obtained by training with the gesture detection network training method, or with the gesture detection method, so as to obtain a gesture detection result of the image, and a corresponding control operation is triggered at least according to the gesture detection result of the image. Therefore, when the scheme provided by each embodiment is applied to operation control, the user neither needs to wear or hold any equipment nor needs to master technical knowledge of operating the related equipment, which reduces the requirements on the user in human-computer interaction and improves the user experience.
The embodiment of the present application further provides an application program, where the application program is configured to execute the aforementioned gesture detection network training method, the aforementioned gesture detection method, or the aforementioned gesture control method during running.
The gesture detection network training method comprises the following steps:
training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image;
replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
and training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
The gesture detection method comprises the following steps:
detecting an image by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate region, wherein the image comprises a static image or an image in a video;
taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
The gesture control method comprises the following steps:
detecting an image by using a gesture detection network obtained by training with the gesture detection network training method, or detecting the image by using the gesture detection method, so as to obtain a gesture detection result of the image, wherein the image comprises a still image or an image in a video;
and triggering corresponding control operation at least according to the gesture detection result of the image.
Here, only the above gesture detection network training method, the gesture detection method, and the gesture control method are briefly described, and specific cases may refer to the foregoing embodiments, which are not described herein again.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application, including: the device comprises a housing 1101, a processor 1102, a memory 1103, a circuit board 1104 and a power circuit 1105, wherein the circuit board 1104 is arranged inside a space surrounded by the housing 1101, and the processor 1102 and the memory 1103 are arranged on the circuit board 1104; a power supply circuit 1105 for supplying power to each circuit or device of the electronic apparatus; the memory 1103 is used to store executable program code; the processor 1102 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 1103, so as to execute the aforementioned gesture detection network training method, the aforementioned gesture detection method, or the aforementioned gesture control method.
The gesture detection network training method comprises the following steps:
training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image;
replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
and training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
The gesture detection method comprises the following steps:
detecting an image by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate region, wherein the image comprises a static image or an image in a video;
taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
The gesture control method comprises the following steps:
detecting an image by using a gesture detection network obtained by training with the gesture detection network training method, or detecting the image by using the gesture detection method, so as to obtain a gesture detection result of the image, wherein the image comprises a still image or an image in a video;
and triggering corresponding control operation at least according to the gesture detection result of the image.
Here, only the above gesture detection network training method, the gesture detection method, and the gesture control method are briefly described, and specific cases may refer to the foregoing embodiments, which are not described herein again.
The above electronic devices exist in a variety of forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include: smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include: PDA, MID, and UMPC devices, such as iPads.
(3) A portable entertainment device: such devices can display and play multimedia content. This type of device includes: audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: a device that provides computing services. A server comprises a processor, a hard disk, a memory, a system bus, and the like; it is similar in architecture to a general-purpose computer, but has higher requirements for processing capacity, stability, reliability, security, scalability, and manageability, because it needs to provide highly reliable services.
(5) Other electronic devices with data interaction functions.
An embodiment of the present application provides a storage medium, configured to store an executable code, where the executable code is configured to execute the foregoing gesture detection network training method, or the foregoing gesture detection method, or the foregoing gesture control method.
The gesture detection network training method comprises the following steps:
training a first convolutional neural network according to a sample image containing human hand labeling information to obtain prediction information of the first convolutional neural network for a human hand candidate region of the sample image;
replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
and training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
The gesture detection method comprises the following steps:
detecting an image by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate region, wherein the image comprises a static image or an image in a video;
taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
The gesture control method comprises the following steps:
detecting an image by using a gesture detection network obtained by training with the gesture detection network training method, or detecting the image by using the gesture detection method, so as to obtain a gesture detection result of the image, wherein the image comprises a still image or an image in a video;
and triggering corresponding control operation at least according to the gesture detection result of the image.
Here, only the above gesture detection network training method, the gesture detection method, and the gesture control method are briefly described, and specific cases may refer to the foregoing embodiments, which are not described herein again.
For the device, application, electronic device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, which is referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (41)

1. A gesture detection network training method is characterized by comprising the following steps:
training a first convolutional neural network according to a sample image which contains human hand labeling information and meets a preset resolution condition to obtain prediction information of a human hand candidate region of the first convolutional neural network aiming at the sample image; the preset resolution condition is that the number of pixels is within a preset number range;
replacing a second feature extraction layer parameter of a second convolutional neural network for detecting a gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process;
training the second convolutional neural network parameters according to the prediction result of the human hand candidate region and the sample image, wherein the training comprises the following steps:
correcting the prediction information of the human hand candidate area;
and training the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate region and the sample image.
2. The method of claim 1, wherein the human hand labeling information comprises labeling information of a region of a human hand.
3. The method of claim 2, wherein the human hand labeling information comprises labeling information for a gesture.
4. The method of any one of claims 1, 2, and 3, wherein the first convolutional neural network comprises: a first input layer, a first feature extraction layer, and a first classification output layer, wherein the first classification output layer is used for predicting whether a plurality of candidate regions of the sample image are human hand candidate regions.
5. The method of any one of claims 1, 2 and 3, wherein the second convolutional neural network comprises: a second input layer, a second feature extraction layer, and a second classification output layer, wherein the second classification output layer is used for outputting the gesture detection result of the sample image.
6. The method of any one of claims 1, 2 and 3, wherein the gesture detection result comprises at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
7. The method of any of claims 1, 2, and 3, wherein the gesture detection result further comprises: a non-predetermined gesture type.
8. The method according to any one of claims 1-3, wherein said modifying the prediction information of the candidate region of the human hand comprises:
inputting the supplementary negative sample images and the prediction information of the human hand candidate region into a third convolutional neural network for classification so as to filter the negative samples in the human hand candidate region and obtain the corrected prediction information of the human hand candidate region.
9. The method according to claim 8, wherein a difference between the number of human hand candidate regions and the number of the supplementary negative sample images in the prediction information of the human hand candidate region falls within a predetermined allowable range.
10. The method according to claim 9, wherein the number of human hand candidate regions in the prediction information of the human hand candidate region is equal to the number of the supplementary negative sample images.
11. The method of any one of claims 1, 2, 9 and 10, wherein the first convolutional neural network is RPN and/or the second convolutional neural network is FRCNN.
12. The method of any of claims 9 and 10, wherein the third convolutional neural network is FRCNN.
13. A gesture detection method, comprising:
detecting an image meeting a preset resolution condition by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate area, wherein the image comprises a static image or an image in a video; the preset resolution condition is that the number of pixels is within a preset number range;
taking the first characteristic information and the prediction information of the hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
14. The method of claim 13, wherein the fourth convolutional neural network comprises: a fourth input layer, a fourth feature extraction layer, and a fourth classification output layer, wherein the fourth classification output layer is used for detecting whether a plurality of candidate regions into which the image is divided are human hand candidate regions.
15. The method of claim 13 or 14, wherein the fifth convolutional neural network comprises: a fifth input layer, a fifth feature extraction layer, and a fifth classification output layer, wherein the fifth classification output layer is used for outputting the gesture detection result of the image.
16. The method according to claim 13 or 14, wherein the gesture detection result comprises at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
17. The method of claim 13 or 14, wherein the gesture detection result further comprises: a non-predetermined gesture type.
18. A gesture control method, comprising:
detecting an image meeting a preset resolution condition by using a gesture detection network trained by the method according to any one of claims 1 to 12, or detecting an image meeting a preset resolution condition by using the method according to any one of claims 13 to 17 to obtain a gesture detection result of the image, wherein the image comprises a still image or an image in a video;
and triggering corresponding control operation at least according to the gesture detection result of the image.
19. The method of claim 18, wherein triggering a corresponding control operation based at least on the gesture detection result of the image comprises:
recording the times of obtaining the same gesture detection result by continuously detecting the images in the video within a time period;
and triggering corresponding control operation according to the gesture detection result when the recorded times meet a preset condition.
20. The method according to claim 18 or 19, wherein triggering a corresponding control operation according to the gesture detection result comprises:
determining a control instruction corresponding to the gesture detection result;
and triggering corresponding operation according to the control instruction.
21. A gesture detection network training device, comprising:
the system comprises a first training module, a second training module and a third training module, wherein the first training module is used for training a first convolutional neural network according to a sample image which contains human hand labeling information and meets a preset resolution condition to obtain prediction information of a human hand candidate area of the sample image by the first convolutional neural network; the preset resolution condition is that the number of pixels is within a preset number range;
the parameter replacement module is used for replacing a second feature extraction layer parameter of a second convolutional neural network for detecting the gesture with a first feature extraction layer parameter of the trained first convolutional neural network;
the second training module is used for training the second convolutional neural network parameters according to the prediction information of the human hand candidate region and the sample image and keeping the second feature extraction layer parameters unchanged in the training process;
the second training module comprising:
the correction submodule is used for correcting the prediction information of the human hand candidate area;
and the training sub-module is used for training the second convolutional neural network parameters according to the corrected prediction information of the human hand candidate region and the sample image, and keeping the second feature extraction layer parameters unchanged in the training process.
22. The apparatus of claim 21, wherein the human hand labeling information comprises labeling information of a region of a human hand.
23. The apparatus of claim 22, wherein the human hand labeling information comprises labeling information for a gesture.
24. The apparatus of any one of claims 21 and 23, wherein the first convolutional neural network comprises: a first input layer, a first feature extraction layer, and a first classification output layer, wherein the first classification output layer is used for predicting whether a plurality of candidate regions of the sample image are human hand candidate regions.
25. The apparatus of any one of claims 21 and 23, wherein the second convolutional neural network comprises: a second input layer, a second feature extraction layer, and a second classification output layer, wherein the second classification output layer is used for outputting the gesture detection result of the sample image.
26. The apparatus according to any one of claims 21 and 23, wherein the gesture detection result comprises at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
27. The apparatus according to any one of claims 21 and 23, wherein the gesture detection result further comprises: a non-predetermined gesture type.
28. The apparatus according to any one of claims 21 and 23, wherein the modification sub-module is specifically configured to input a plurality of supplementary negative sample images and the prediction information of the human hand candidate region into a third convolutional neural network for classification, so as to filter the negative samples in the human hand candidate region and obtain the modified prediction information of the human hand candidate region.
29. The apparatus according to claim 28, wherein a difference between the number of human hand candidate regions and the number of the supplementary negative sample images in the prediction information of the human hand candidate region falls within a predetermined allowable range.
30. The apparatus according to claim 29, wherein the number of human hand candidate regions in the prediction information of the human hand candidate region is equal to the number of the supplementary negative sample images.
31. The apparatus of any one of claims 21, 23, 29 and 30, wherein the first convolutional neural network is RPN and/or the second convolutional neural network is FRCNN.
32. The apparatus of any one of claims 29 and 30, wherein the third convolutional neural network is FRCNN.
33. A gesture detection apparatus, comprising:
the first obtaining module is used for detecting an image meeting a preset resolution condition by adopting a fourth convolutional neural network, and obtaining first characteristic information of the image and prediction information of a human hand candidate area, wherein the image comprises a static image or an image in a video; the preset resolution condition is that the number of pixels is within a preset number range;
the detection module is used for taking the first characteristic information and the prediction information of the human hand candidate area as second characteristic information of a fifth convolutional neural network, and performing gesture detection on the image by adopting the fifth convolutional neural network according to the second characteristic information to obtain a gesture detection result of the image; wherein the fourth feature extraction layer parameters of the fourth convolutional neural network are the same as the fifth feature extraction layer parameters of the fifth convolutional neural network.
34. The apparatus of claim 33, wherein the fourth convolutional neural network comprises: a fourth input layer, a fourth feature extraction layer, and a fourth classification output layer, wherein the fourth classification output layer is used for detecting whether a plurality of candidate regions into which the image is divided are human hand candidate regions.
35. The apparatus of claim 33 or 34, wherein the fifth convolutional neural network comprises: a fifth input layer, a fifth feature extraction layer, and a fifth classification output layer, wherein the fifth classification output layer is used for outputting the gesture detection result of the image.
36. The apparatus according to claim 33 or 34, wherein the gesture detection result comprises at least one of the following predetermined gesture types: waving hand, scissor hand, fist, holding hand, thumb, pistol, OK hand, peach heart hand, opening and closing.
37. The apparatus of claim 33 or 34, wherein the gesture detection result further comprises: a non-predetermined gesture type.
38. A gesture control apparatus, comprising:
a second obtaining module, configured to detect, by using a gesture detection network trained by the apparatus according to any one of claims 21 to 32, an image that meets a preset resolution condition, or detect, by using the apparatus according to any one of claims 33 to 37, an image that meets the preset resolution condition, and obtain a gesture detection result of the image, where the image includes an image in a still image or a video;
and the triggering module is used for triggering corresponding control operation at least according to the gesture detection result of the image.
39. The apparatus of claim 38, wherein the triggering module comprises:
the recording submodule is used for recording the times of obtaining the same gesture detection result by continuously detecting the images in the video within a time period;
and the triggering submodule is used for triggering corresponding control operation according to the gesture detection result when the recorded times meet the preset condition.
40. The apparatus of claim 38 or 39, wherein the trigger submodule comprises:
the determining unit is used for determining a control instruction corresponding to the gesture detection result when the recorded times meet a preset condition;
and the triggering unit is used for triggering corresponding operation according to the control instruction.
41. An electronic device, comprising: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the terminal; the memory is used for storing executable program codes; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the gesture detection network training method of any one of claims 1-12, or the gesture detection method of any one of claims 13-17, or the gesture control method of any one of claims 18-20.
CN201610696340.1A 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device Active CN107368182B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610696340.1A CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device
PCT/CN2017/098182 WO2018033154A1 (en) 2016-08-19 2017-08-19 Gesture control method, device, and electronic apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610696340.1A CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device

Publications (2)

Publication Number Publication Date
CN107368182A CN107368182A (en) 2017-11-21
CN107368182B true CN107368182B (en) 2020-02-18

Family

ID=60303776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610696340.1A Active CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device

Country Status (1)

Country Link
CN (1) CN107368182B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108227912B (en) * 2017-11-30 2021-05-11 北京市商汤科技开发有限公司 Device control method and apparatus, electronic device, computer storage medium
CN108388859B (en) * 2018-02-11 2022-04-15 深圳市商汤科技有限公司 Object detection method, network training method, device and computer storage medium
CA2995242A1 (en) * 2018-02-15 2019-08-15 Wrnch Inc. Method and system for activity classification
CN108646919A (en) * 2018-05-10 2018-10-12 北京光年无限科技有限公司 Visual interactive method and system based on visual human
CN109740454A (en) * 2018-12-19 2019-05-10 贵州大学 A kind of human body posture recognition methods based on YOLO-V3
CN109858380A (en) * 2019-01-04 2019-06-07 广州大学 Expansible gesture identification method, device, system, gesture identification terminal and medium
CN110297909B (en) * 2019-07-05 2021-07-02 中国工商银行股份有限公司 Method and device for classifying unlabeled corpora

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103926999B (en) * 2013-01-16 2017-03-01 株式会社理光 Palm folding gesture identification method and device, man-machine interaction method and equipment
CN104346607B (en) * 2014-11-06 2017-12-22 上海电机学院 Face identification method based on convolutional neural networks
CN105353634B (en) * 2015-11-30 2018-05-08 北京地平线机器人技术研发有限公司 Utilize the home appliance and method of gesture identification control operation
CN105354574A (en) * 2015-12-04 2016-02-24 山东博昂信息科技有限公司 Vehicle number recognition method and device
CN105718878B (en) * 2016-01-19 2019-08-09 华南理工大学 The aerial hand-written and aerial exchange method in the first visual angle based on concatenated convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Pointing Gesture Based Egocentric Interaction System: Dataset, Approach;Yichao Huang等;《2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops》;IEEE;20160701;第16-23页 *
基于特征共享的高效物体检测;任少卿;《中国优秀博士学位论文全文数据库信息科技辑》;20160815;第43-66页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant