WO2021115181A1 - Gesture recognition method, gesture control method, apparatuses, medium and terminal device - Google Patents


Info

Publication number
WO2021115181A1
Authority
WO
WIPO (PCT)
Prior art keywords
hand
frame
face image
gesture recognition
trajectory
Prior art date
Application number
PCT/CN2020/133410
Other languages
French (fr)
Chinese (zh)
Inventor
刘高强 (Liu Gaoqiang)
Original Assignee
RealMe重庆移动通信有限公司 (RealMe Chongqing Mobile Telecommunications Co., Ltd.)
Application filed by RealMe重庆移动通信有限公司 (RealMe Chongqing Mobile Telecommunications Co., Ltd.)
Publication of WO2021115181A1 publication Critical patent/WO2021115181A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to a gesture recognition method, a gesture control method, a gesture recognition device, a gesture control device, a computer-readable storage medium, and a terminal device.
  • Gesture control refers to the use of computer vision, graphics, and related technologies to recognize human gestures without touching the terminal device and to convert them into control instructions for the device. It is a new interaction mode following the mouse, keyboard, and touch screen; it removes the dependence of traditional interaction methods on input devices and increases the diversity of interaction.
  • Gesture recognition is the prerequisite of gesture control: only by recognizing the user's gestures accurately and promptly can they be transformed into effective gesture control and achieve the interactive result the user wants.
  • the present disclosure provides a gesture recognition method, a gesture control method, a gesture recognition device, a gesture control device, a computer-readable storage medium, and a terminal device, thereby mitigating, at least to some extent, the problems of high data-processing volume and long processing time in gesture recognition.
  • a gesture recognition method is provided, applied to a terminal device equipped with a camera. The method includes: acquiring multiple frames of original images collected by the camera; extracting a face image from each of the multiple frames of original images to obtain multiple frames of face images; detecting hand key points in each frame of face image, and generating a hand trajectory according to the position changes of the hand key points across the multiple frames of face images; and recognizing the hand trajectory to obtain the gesture recognition result.
  • a gesture control method is provided, applied to a terminal device with a camera. The method includes: when the gesture control function is turned on, obtaining a gesture recognition result according to the gesture recognition method of the first aspect; and executing the control instruction corresponding to the gesture recognition result.
  • a gesture recognition device is provided, configured in a terminal device equipped with a camera. The device includes a processor, wherein the processor is used to execute the following program modules stored in the memory: an original image acquisition module, used to obtain multiple frames of original images collected by the camera; a face image extraction module, used to extract face images from the multiple frames of original images to obtain multiple frames of face images; a hand trajectory generation module, used to detect the hand key points in each frame of face image and generate a hand trajectory according to the position changes of the hand key points in the multiple frames of face images; and a hand trajectory recognition module, used to recognize the hand trajectory to obtain the gesture recognition result.
  • a gesture control device is provided, configured in a terminal device equipped with a camera. The device includes a processor, wherein the processor is configured to execute the following program modules stored in the memory: an original image acquisition module, used to obtain multiple frames of original images collected by the camera when the gesture control function is turned on; a face image extraction module, used to extract face images from the multiple frames of original images to obtain multiple frames of face images;
  • a hand trajectory generation module, used to detect the hand key points in each frame of face image and generate the hand trajectory according to the position changes of the hand key points in the multiple frames of face images;
  • a hand trajectory recognition module, used to recognize the hand trajectory to obtain the gesture recognition result;
  • a control instruction execution module, used to execute the control instruction corresponding to the gesture recognition result.
  • a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the gesture recognition method of the first aspect or the gesture control method of the second aspect.
  • a terminal device, including: a processor; a memory for storing executable instructions of the processor; and a camera; wherein the processor is configured to execute the executable instructions to perform the gesture recognition method of the first aspect or the gesture control method of the second aspect.
  • the camera collects multiple frames of original images, face images are extracted from them, hand key points are detected in each frame of face image, the hand trajectory is generated according to the position changes of the hand key points, and finally the hand trajectory is recognized to obtain the gesture recognition result.
  • Since the user's hand is generally located in front of or near the face during gesture operations, extracting the face image from the original image for hand key point detection is equivalent to cropping away the parts of the original image that are irrelevant to gesture recognition. This reduces the amount of image data to process: the system only needs to perform gesture recognition within the face image, which shortens processing time, improves the real-time performance of gesture recognition, and does not demand high hardware processing performance.
  • the control instruction corresponding to the gesture recognition result can be executed immediately, so as to achieve fast interactive response, reduce interaction latency, and improve user experience; this is highly practical for applications such as somatosensory games.
  • Fig. 1 shows a flowchart of a gesture recognition method in this exemplary embodiment.
  • Fig. 2 shows a sub-flowchart of a gesture recognition method in this exemplary embodiment.
  • Fig. 3 shows a schematic flowchart of extracting hand candidate regions in this exemplary embodiment.
  • Fig. 4 shows a schematic flowchart of gesture recognition in this exemplary embodiment.
  • Fig. 5 shows a flowchart of a gesture control method in this exemplary embodiment.
  • Fig. 6 shows a structural block diagram of a gesture recognition device in this exemplary embodiment.
  • Fig. 7 shows a structural block diagram of another gesture recognition device in this exemplary embodiment.
  • Fig. 8 shows a structural block diagram of a gesture control device in this exemplary embodiment.
  • Fig. 9 shows a structural block diagram of another gesture control device in this exemplary embodiment.
  • Fig. 10 shows a computer-readable storage medium for implementing the above-mentioned methods in this exemplary embodiment.
  • Fig. 11 shows a terminal device for implementing the above-mentioned methods in this exemplary embodiment.
  • Existing gesture recognition methods are mostly based on gesture localization and feature extraction in images captured by a camera. As the pixel count of cameras on terminal devices grows and image resolution rises, the amount of data processed during gesture recognition increases and processing takes longer, which hurts the real-time performance of gesture recognition, causes a certain delay in gesture control, and degrades user experience; moreover, such methods require high hardware processing performance, which is not conducive to deployment in scenarios such as mobile terminals.
  • exemplary embodiments of the present disclosure provide a gesture recognition method, which can be applied to terminal devices equipped with cameras, such as mobile phones, tablet computers, digital cameras, virtual reality devices, and the like.
  • Fig. 1 shows a flow of the gesture recognition method, which may include the following steps S110 to S140:
  • Step S110 Obtain multiple frames of original images collected by the camera.
  • A gesture is an action, and multiple frames are required to record the gesture completely.
  • The camera can collect a fixed number of original images, such as 10 frames or 50 frames; alternatively, a matching infrared sensor can detect whether an object is present in front of the camera: when an object is detected, the camera starts collecting original images, and when the object is detected to move away, the camera stops collecting, thereby obtaining multiple frames of original images.
  • In addition, appropriate frame dropping can be performed, for example keeping one frame out of every three, to reduce the amount of subsequent processing with little effect on gesture recognition. The specific frame-dropping rate depends on the number of frames of original images collected by the camera, which is not limited in the present disclosure.
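The frame-dropping step described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function and parameter names (`drop_frames`, `keep_every`) are my own:

```python
def drop_frames(frames, keep_every=3):
    """Keep one frame out of every `keep_every`, cutting the amount of
    downstream processing roughly proportionally.

    Slicing keeps frames 0, keep_every, 2 * keep_every, ...
    """
    return frames[::keep_every]
```

For a 30 fps capture, `keep_every=3` yields an effective 10 fps, which is usually still enough temporal resolution to reconstruct a hand trajectory.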
  • In step S120, face images are extracted from the above-mentioned multiple frames of original images to obtain multiple frames of face images.
  • The face area can be recognized by color and shape detection: for example, preset the color range and shape range of a human face and detect whether the original image contains a local area that satisfies both the color range and the shape range; if so, that local area is the face area.
  • Deep learning techniques can also be used for face region detection, for example YOLO (You Only Look Once, a real-time object detection framework with versions v1, v2, v3, etc., any of which may be used in this disclosure), SSD (Single Shot Multibox Detector), or R-CNN (Region-based Convolutional Neural Network, including improved versions such as Fast R-CNN and Faster R-CNN).
  • When the face area is detected, it can be marked with a rectangular frame and extracted as a face image. To facilitate subsequent processing, the face image can be extracted or sampled at a preset size (or resolution), so that each frame of face image has the same size (or resolution).
  • In some embodiments, a hardware face detection module (HWFD) can be provided on the terminal device. After the collected multiple frames of original images are input to the HWFD, it outputs the coordinates of the face area; mapping these coordinates onto the original image allows the face image to be extracted.
  • The resolution of the collected multiple frames of original images can be adjusted to a preset resolution, and in step S120 face image extraction can be performed on the resolution-adjusted original images.
  • The preset resolution may be determined by the algorithm adopted in step S120. For example, if YOLO is used for face detection and its input layer is set to 640×480, the preset resolution can be 640×480; if the terminal's camera has 16 megapixels and collects original images at 4608×3456, the system can down-sample each original image to 640×480 before inputting it to YOLO.
  • the preset resolution is lower than the resolution of the original image itself, which is equivalent to compressing the original image, reducing the data volume of the original image, and improving processing efficiency.
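The down-sampling described above (e.g. 4608×3456 down to 640×480) can be sketched with nearest-neighbour sampling over a row-major pixel grid. A real system would use the camera ISP or an optimized resize routine; this pure-Python version is only illustrative, and the function name is mine:

```python
def downsample(image, dst_w, dst_h):
    """Nearest-neighbour down-sampling of a row-major 2D pixel grid
    (a list of rows) to dst_w x dst_h.

    Each destination pixel (r, c) samples the source pixel whose
    coordinates scale linearly: src_row = r * src_h // dst_h, etc.
    """
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // dst_h][c * src_w // dst_w] for c in range(dst_w)]
        for r in range(dst_h)
    ]
```

Down-sampling before detection shrinks the per-frame data volume quadratically: going from 4608×3456 to 640×480 reduces the pixel count by a factor of about 52.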
  • Step S130 Detect the key points of the hand in each frame of the face image, and generate a hand trajectory according to the position changes of the key points of the hand in the multiple frames of the face image.
  • the key points of the hand can be selected according to the needs of the scene and the image quality.
  • For example, 21 bone points can be selected as the hand key points, including four joint feature points for each finger plus the palm feature point; alternatively, only a subset of the bone points may be selected as needed, for example only the joint feature points or the fingertip point of the index finger.
  • The detection of hand key points can be achieved through shape detection. For example: perform fingertip shape detection on the face image by finding areas containing arcs and matching their arcs against preset standard fingertip arcs; the apex of an arc with a sufficiently high matching degree is taken as a fingertip point (i.e., a hand key point). Or perform finger shape detection on the face image, designate areas sufficiently similar to a standard finger shape as finger areas, and take the rounded boundary points of each finger area as the hand key points. Or fit an ellipse to the figure in the face image and use the endpoints of the fitted ellipse's major axis as the hand key points.
  • the detection of key points of the hand may be specifically implemented through the following steps S210 and S220:
  • Step S210 Perform region feature detection on each frame of face image, so as to extract hand candidate regions from each frame of face image;
  • Step S220 Detect key points of the hand in the candidate hand area.
  • Region feature detection refers to segmenting many local areas from the face image, extracting and identifying the features of each local area, and, when a local area containing hand features is detected, taking that local area as a hand candidate area. Further detecting the hand key points within the hand candidate area can then improve the detection accuracy of the hand key points.
  • step S210 may be specifically implemented through the following steps:
  • For example, an RPN (Region Proposal Network), as used in networks such as Faster R-CNN (an improved version of R-CNN and Fast R-CNN), can be adopted as a whole.
  • After the face image is input, it first passes through convolution layers (usually including pooling layers) for convolution processing to extract image features.
  • The features then enter the RPN, which extracts candidate frames; NMS (Non-Maximum Suppression) is typically applied to remove redundant, heavily overlapping candidates.
  • The candidate frames extracted at this stage span various categories; for example, besides candidate frames for hands, there may be candidate frames for the nose, mouth, glasses, and other parts.
  • The classification layer can use a Softmax (normalized exponential) function to output a probability value for each target category that may exist in the face image; the category with the highest probability value is taken as the category of the candidate frame. Candidate frames of non-hand categories can then be deleted, keeping only the hand candidate frames. Finally, the hand candidate area is input into the regression layer.
  • The regression layer can fine-tune the position and size of the hand candidate area to obtain its coordinate array (x, y, w, h), where x and y are the position coordinates of the hand candidate area (usually its upper-left corner point), and w and h are the width and height of the hand candidate area.
  • The above R-CNN can be obtained by training on a large number of face image samples: the hand candidate area in each sample image is manually annotated to produce labels, the network is trained on the samples and labels, the network parameters are updated, and a usable R-CNN is obtained.
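The NMS step that prunes the RPN's overlapping candidate frames is named but not detailed in the source. Classic greedy NMS over (x, y, w, h) boxes looks roughly like this; the parameter names and the 0.5 threshold are my own conventional choices:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    remaining box and drop any box overlapping it by more than iou_thresh.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

In a Faster R-CNN-style pipeline, a vectorized NMS (e.g. a library implementation) would be used instead; the logic is the same.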
  • the method in FIG. 2 can be used for each frame of the face image, and the key points of the hand are detected in each frame.
  • In some frames the hand cannot be detected, i.e., the hand candidate area extracted from the current frame of the face image is null. In this case, the hand key points detected in the previous frame can be directly copied to the current frame and regarded as the hand key points of the current frame, which improves the robustness of the algorithm.
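The fallback just described (reusing the previous frame's key points whenever the current frame's hand candidate area is null) can be sketched as follows; the representation of "null" as `None` and the function name are my own:

```python
def fill_missing_keypoints(per_frame_keypoints):
    """per_frame_keypoints: one entry per frame, either the detected hand
    key points or None when the hand candidate area was null.
    Frames with a null detection inherit the previous frame's key points."""
    filled, last = [], None
    for kp in per_frame_keypoints:
        if kp is None:
            kp = last  # copy the previous frame's key points
        filled.append(kp)
        last = kp
    return filled
```

Note that leading frames with no detection at all remain `None`, since there is no earlier frame to copy from.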
  • Detection of hand key points within the hand candidate area can also be achieved with models such as R-CNN: the hand key points are treated as the targets to be detected, and through the extraction and processing of image features, the areas where the targets are located are output, thereby marking the hand key points.
  • The hand trajectory can be in the form of an array, a vector, or a picture; this disclosure does not limit this.
  • Step S140 Recognize the trajectory of the hand to obtain a gesture recognition result.
  • The hand trajectory reflects the user's gesture action; by recognizing the trajectory, the gesture made by the user can be identified and the gesture recognition result obtained.
  • the hand trajectory generated in step S130 may be matched with a preset standard trajectory.
  • The standard trajectories may include shaking the hand left and right, shaking a finger left and right, sliding a finger up and down, opening the hand, and the like. If a standard trajectory's matching rate with the hand trajectory reaches a certain threshold, the hand trajectory is judged to be that standard trajectory, and the gesture represented by it is output as the gesture recognition result of the hand trajectory.
  • step S140 may be specifically implemented through the following steps:
  • the hand trajectory bitmap is processed by Bayesian classifier, and the result of gesture recognition is obtained.
  • the size of the bitmap can be preset, or it can be the same as the size of the face image or the candidate area of the hand.
  • Since the hand trajectory is the position change of the hand key points across frames, the position in each frame can be mapped into the bitmap and the points connected in sequence, which is equivalent to representing the hand trajectory in the bitmap; this bitmap is called the hand trajectory bitmap.
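Mapping the per-frame key-point positions into a bitmap and connecting them in sequence can be sketched as below. The linear interpolation used to connect consecutive points is my choice; the source does not specify how the points are joined:

```python
def trajectory_bitmap(points, width, height):
    """Rasterize a sequence of (x, y) key-point positions into a
    width x height bitmap of 0/1 pixels, connecting consecutive
    points with straight line segments."""
    bitmap = [[0] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Step along the longer axis so the segment has no gaps.
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = x0 + (x1 - x0) * i // steps
            y = y0 + (y1 - y0) * i // steps
            bitmap[y][x] = 1
    return bitmap
```

The resulting 0/1 grid is a compact, resolution-independent summary of the gesture, which is what makes a simple classifier (such as the Bayesian classifier described next in the source) sufficient for recognition.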
  • A Bayesian classifier selects the category that minimizes the classification risk, based on the known posterior probabilities and misjudgment losses. Refer to the following formulas:

    R(c_i | x) = Σ_{j=1..N} λ_ij · P(c_j | x)

    h(x) = argmin_{c_i} R(c_i | x)

  • where h denotes the Bayesian classifier, x is the sample, λ_ij is the loss incurred when a sample of class c_j is misclassified as c_i, P(c_j | x) is the posterior probability of class c_j given x, R(c_i | x) is the expected loss of classifying x as c_i, and N is the number of categories.
  • the hand trajectory bitmap is input into the Bayesian classifier, and the gesture recognition result can be output.
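The minimum-risk decision rule can be sketched directly. The posterior values and loss table below are hypothetical stand-ins for the quantities a trained classifier would supply; only the decision rule itself comes from the source:

```python
def bayes_decide(posteriors, loss):
    """Minimum-risk Bayes decision.

    posteriors: {class: P(class | x)} for the sample x.
    loss[ci][cj]: cost of predicting class ci when the true class is cj
                  (the lambda_ij of the formula).
    Returns the class ci minimizing the conditional risk
    R(ci | x) = sum_j loss[ci][cj] * P(cj | x).
    """
    risks = {
        ci: sum(loss[ci][cj] * p for cj, p in posteriors.items())
        for ci in loss
    }
    return min(risks, key=risks.get)
```

With a 0-1 loss (zero on the diagonal, one elsewhere), this rule reduces to simply picking the class with the highest posterior probability.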
  • Fig. 4 shows a schematic flow of a gesture recognition method.
  • First, the resolution of the original images is adjusted to the preset resolution to shrink the images; then the face image is extracted from the resolution-adjusted original image through the HWFD, so that subsequent processing is concentrated on the face image; the hand key points are detected and the hand trajectory is mapped into a hand trajectory bitmap; finally, the hand trajectory bitmap is input into the Bayesian classifier, which outputs the gesture recognition result.
  • The foregoing terminal device may include multiple cameras. After the gesture recognition result is obtained, the cameras can be switched according to the gesture recognition result. For example, when the gesture recognition result is shaking a finger left and right, the terminal device is triggered to switch to the main camera; when the result is swiping a finger up and down, the terminal device is triggered to switch to the telephoto camera, and so on. In this way, when the user is at some distance from the terminal device, they can operate it by gesturing toward the camera, which is more convenient.
  • With the gesture recognition method of this exemplary embodiment, multiple frames of original images are collected by the camera, face images are extracted from them, hand key points are detected in each frame of face image, the hand trajectory is generated according to the position changes of the hand key points, and finally the hand trajectory is recognized to obtain the gesture recognition result. Since the user's hand is generally located in front of or near the face during gesture operations, extracting the face image from the original image for hand key point detection is equivalent to cropping away the parts of the original image irrelevant to gesture recognition. This reduces the amount of image data to process: the system only needs to perform gesture recognition within the face image, which shortens processing time, improves the real-time performance of gesture recognition, lowers hardware performance requirements, and facilitates deployment in lightweight scenarios such as mobile terminals.
  • Exemplary embodiments of the present disclosure also provide a gesture control method, which can be applied to a terminal device equipped with a camera.
  • the gesture control method may include:
  • the gesture recognition result is obtained according to the gesture recognition method in this exemplary embodiment
  • Ways of enabling the gesture control function include, but are not limited to: when a game program with gesture control is started, the terminal automatically enables the gesture control function; in interfaces such as the camera or a web browser, the user chooses to enable the gesture control function.
  • The correspondence between gestures and control instructions can be preset in the program; for example, waving the palm corresponds to a screenshot instruction, and sliding a finger downward corresponds to a page-turn instruction. When the user's gesture is recognized, the corresponding control instruction can be quickly looked up from the gesture recognition result and executed.
  • For example, the user can be allowed to take pictures through specific gesture control: when the user makes a thumbs-up gesture, the terminal device is triggered to automatically press the shutter. Or, when the terminal device is equipped with multiple cameras, the user can be allowed to control camera switching through specific gestures: for example, when the user shakes a finger, the terminal device is triggered to switch among the main camera, the telephoto camera, and the wide-angle camera, thereby facilitating the user's photographing operation.
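The gesture-to-instruction lookup can be sketched as a simple dispatch table. The gesture names and instruction strings below are hypothetical examples for illustration, not a mapping defined by the source:

```python
# Hypothetical mapping from gesture recognition results to control
# instructions, mirroring the examples in the text above.
GESTURE_TO_INSTRUCTION = {
    "wave_palm": "screenshot",
    "swipe_finger_down": "page_down",
    "thumbs_up": "press_shutter",
    "shake_finger": "switch_camera",
}

def execute_gesture(gesture):
    """Look up the control instruction for a recognized gesture;
    unrecognized gestures yield no instruction (None)."""
    return GESTURE_TO_INSTRUCTION.get(gesture)
```

Keeping the mapping in a table (rather than branching code) makes it easy to let the user re-bind gestures to instructions at runtime.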
  • FIG. 5 shows a flow of a gesture control method, which may include the following steps S510 to S550:
  • Step S510 when the gesture control function is turned on, acquire multiple frames of original images collected by the camera;
  • Step S520 extracting face images from the foregoing multiple frames of original images respectively to obtain multiple frames of face images
  • Step S530 detecting the hand key points in each frame of the face image, and generating a hand trajectory according to the position changes of the hand key points in the multi-frame face image;
  • Step S540 Recognizing the trajectory of the hand to obtain a gesture recognition result
  • Step S550 execute the control instruction corresponding to the gesture recognition result.
  • The control instruction corresponding to the gesture recognition result can be executed immediately, so as to achieve fast interactive response, reduce interaction latency, and improve user experience; this has high practicability for somatosensory games.
  • Exemplary embodiments of the present disclosure also provide a gesture recognition device, which can be configured in a terminal device equipped with a camera.
  • the gesture recognition device 600 may include a processor 610 and a memory 620; the memory 620 stores the following program modules:
  • the original image acquisition module 621 is used to acquire multiple frames of original images collected by the camera;
  • the face image extraction module 622 is configured to extract face images from the above-mentioned multiple frames of original images to obtain multiple frames of face images;
  • the hand trajectory generation module 623 is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
  • the hand trajectory recognition module 624 is used for recognizing the hand trajectory to obtain a gesture recognition result
  • the processor 610 is configured to execute the foregoing program modules.
  • the original image acquisition module 621 is configured to:
  • the resolution of the multiple frames of original images is adjusted to a preset resolution.
  • the hand trajectory generating module 623 is configured to:
  • the hand trajectory generating module 623 is configured to:
  • the hand candidate region extracted from the face image of the current frame is a null value
  • the hand key points detected in the previous frame are used as the hand key points of the current frame.
  • the hand trajectory generating module 623 is configured to:
  • the hand trajectory recognition module 624 is configured to:
  • the hand trajectory bitmap is processed by Bayesian classifier, and the result of gesture recognition is obtained.
  • the foregoing terminal device includes multiple cameras; the hand track recognition module 624 is configured to:
  • Exemplary embodiments of the present disclosure also provide another gesture recognition device, which can be configured in a terminal device equipped with a camera.
  • the gesture recognition apparatus 700 may include:
  • the original image acquisition module 710 is configured to acquire multiple frames of original images collected by the camera;
  • the face image extraction module 720 is configured to extract face images from the foregoing multiple frames of original images to obtain multiple frames of face images;
  • the hand trajectory generating module 730 is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
  • the hand trajectory recognition module 740 is used for recognizing the hand trajectory to obtain a gesture recognition result.
  • the original image acquisition module 710 is configured to:
  • the resolution of the multiple frames of original images is adjusted to a preset resolution.
  • the hand trajectory generating module 730 is configured to:
  • the hand trajectory generating module 730 is configured to:
  • the hand candidate region extracted from the face image of the current frame is a null value
  • the hand key points detected in the previous frame are used as the hand key points of the current frame.
  • the hand trajectory generating module 730 is configured to:
  • the hand trajectory recognition module 740 is configured to:
  • the hand trajectory bitmap is processed by Bayesian classifier, and the result of gesture recognition is obtained.
  • the foregoing terminal device includes multiple cameras; the hand track recognition module 740 is configured to:
  • Exemplary embodiments of the present disclosure also provide a gesture control device, which can be configured in a terminal device equipped with a camera.
  • the gesture control device 800 may include a processor 810 and a memory 820; wherein the memory 820 stores the following program modules:
  • the original image acquisition module 821 is configured to acquire multiple frames of original images collected by the camera when the gesture control function is turned on;
  • the face image extraction module 822 is configured to extract face images from the above-mentioned multiple frames of original images to obtain multiple frames of face images;
  • the hand trajectory generation module 823 is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
  • the hand trajectory recognition module 824 is used for recognizing the hand trajectory to obtain a gesture recognition result
  • the control instruction execution module 825 is configured to execute the control instruction corresponding to the gesture recognition result
  • the processor 810 is used to execute the above-mentioned program modules.
  • the aforementioned control instruction includes a camera switching instruction.
  • the original image acquisition module 821 is configured to:
  • the resolution of the multiple frames of original images is adjusted to a preset resolution.
  • the hand trajectory generating module 823 is configured to:
  • the hand trajectory generating module 823 is configured to:
  • the hand candidate region extracted from the face image of the current frame is a null value
  • the hand key points detected in the previous frame are used as the hand key points of the current frame.
  • the hand trajectory generating module 823 is configured to:
  • the hand trajectory recognition module 824 is configured to:
  • the hand trajectory bitmap is processed by a Bayesian classifier to obtain the gesture recognition result.
  • Exemplary embodiments of the present disclosure also provide another gesture control device, which can be configured in a terminal device equipped with a camera.
  • the gesture control device 900 may include:
  • the original image acquisition module 910 is used to acquire multiple frames of original images collected by the camera when the gesture control function is turned on;
  • the face image extraction module 920 is configured to extract face images from the foregoing multiple frames of original images to obtain multiple frames of face images;
  • the hand trajectory generating module 930 is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
  • the hand trajectory recognition module 940 is used for recognizing the hand trajectory to obtain a gesture recognition result;
  • the control instruction execution module 950 is used to execute the control instruction corresponding to the gesture recognition result.
  • the aforementioned control instruction includes a camera switching instruction.
  • the original image acquisition module 910 is configured to:
  • the resolution of the multiple frames of original images is adjusted to a preset resolution.
  • the hand trajectory generating module 930 is configured to:
  • the hand trajectory generating module 930 is configured to:
  • when the hand candidate region extracted from the face image of the current frame is a null value, the hand key points detected in the previous frame are used as the hand key points of the current frame.
  • the hand trajectory generating module 930 is configured to:
  • the hand trajectory recognition module 940 is configured to:
  • the hand trajectory bitmap is processed by a Bayesian classifier to obtain the gesture recognition result.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium, which can be implemented in the form of a program product including program code.
  • when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • a program product 1000 for implementing the above-mentioned method according to an exemplary embodiment of the present disclosure may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and can run on a terminal device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product can adopt any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code for performing the operations of the present disclosure can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • Exemplary embodiments of the present disclosure also provide a terminal device capable of implementing the above method.
  • the terminal device may be a mobile phone, a tablet computer, a digital camera, or the like.
  • the terminal device 1100 according to this exemplary embodiment of the present disclosure will be described below with reference to FIG. 11.
  • the terminal device 1100 shown in FIG. 11 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • the terminal device 1100 may be represented in the form of a general-purpose computing device.
  • the components of the terminal device 1100 may include but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting different system components (including the storage unit 1120 and the processing unit 1110), a display unit 1140, and an image acquisition unit 1170,
  • the image acquisition unit 1170 includes at least one camera.
  • the storage unit 1120 stores program codes, and the program codes can be executed by the processing unit 1110, so that the processing unit 1110 executes the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • the processing unit 1110 may execute the method steps shown in FIG. 1, FIG. 2 or FIG. 5.
  • the storage unit 1120 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 1121 and/or a cache storage unit 1122, and may further include a read-only storage unit (ROM) 1123.
  • the storage unit 1120 may also include a program/utility tool 1124 having a set of (at least one) program modules 1125.
  • the program modules 1125 include but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the terminal device 1100 may also communicate with one or more external devices 1200 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable the user to interact with the terminal device 1100, and/or with any device (such as a router, modem, etc.) that enables the terminal device 1100 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 1150.
  • the terminal device 1100 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 1160.
  • the network adapter 1160 communicates with other modules of the terminal device 1100 through the bus 1130. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the terminal device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
  • the example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions that cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
  • although modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Abstract

A gesture recognition method, a gesture control method, apparatuses, a storage medium and a terminal device. The gesture recognition method is applied to a terminal device provided with a camera, and comprises: acquiring multiple frames of original images collected by the camera; extracting face images from among the multiple frames of original images respectively to obtain multiple frames of face images; detecting hand key points in each frame of the face images, and generating a hand trajectory according to position changes of the hand key points in the multiple frames of face images; and recognizing the hand trajectory to obtain a gesture recognition result. This reduces the amount of image data processed during gesture recognition and shortens the time consumed by the recognition process.

Description

Gesture Recognition Method, Gesture Control Method, Apparatus, Medium and Terminal Device
This application claims priority to Chinese patent application No. 201911284143.9, filed on December 13, 2019 and entitled "Gesture recognition method, gesture control method, apparatus, medium and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer vision technology, and in particular to a gesture recognition method, a gesture control method, a gesture recognition apparatus, a gesture control apparatus, a computer-readable storage medium, and a terminal device.
Background
Gesture control uses computer vision, graphics and related technologies to recognize a person's operating gestures without any contact with the terminal device, and converts them into control instructions for the device. As a new interaction mode following the mouse, the keyboard and the touch screen, it frees interaction from the traditional dependence on input devices and increases the diversity of interaction.
Gesture recognition is the prerequisite for gesture control. Only when a user's gestures are recognized accurately and promptly can they be converted into effective gesture control and produce the interaction result the user intends.
发明内容Summary of the invention
本公开提供了一种手势识别方法、手势控制方法、手势识别装置、手势控制装置、计算机可读存储介质与终端设备,进而至少在一定程度上改善手势识别数据处理量较高、耗时较长的问题。The present disclosure provides a gesture recognition method, a gesture control method, a gesture recognition device, a gesture control device, a computer-readable storage medium, and a terminal device, thereby improving at least to a certain extent the high processing volume and time-consuming of gesture recognition data The problem.
根据本公开的第一方面,提供一种手势识别方法,应用于具备摄像头的终端设备,所述方法包括:获取由所述摄像头采集的多帧原始图像;分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;对所述手部轨迹进行识别,得到手势识别结果。According to a first aspect of the present disclosure, a gesture recognition method is provided, which is applied to a terminal device equipped with a camera, and the method includes: acquiring multiple frames of original images collected by the camera; and extracting from the multiple frames of original images respectively A face image to obtain multiple frames of face images; detecting hand key points in each frame of face image, and generating a hand trajectory according to the position changes of the hand key points in the multiple frames of face images; The hand trajectory is recognized, and the gesture recognition result is obtained.
根据本公开的第二方面,提供一种手势控制方法,应用于具备摄像头的终端设备,所述方法包括:当开启手势控制功能时,根据上述第一方面的手势识别方法得到手势识别结果;执行所述手势识别结果对应的控制指令。According to a second aspect of the present disclosure, there is provided a gesture control method, which is applied to a terminal device with a camera, and the method includes: when the gesture control function is turned on, obtaining a gesture recognition result according to the gesture recognition method of the first aspect; executing; The control instruction corresponding to the gesture recognition result.
根据本公开的第三方面,提供一种手势识别装置,配置于具备摄像头的终端设备,所述装置包括处理器;其中,所述处理器用于执行存储器中存储的以下程序模块:原始图像获取模块,用于获取由所述摄像头采集的多帧原始图像;人脸图像提取模块,用于分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;手部轨迹生成模块,用于检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;手部轨迹识别模块,用于对所述手部轨迹进行识别,得到手势识别结果。According to a third aspect of the present disclosure, there is provided a gesture recognition device, which is configured in a terminal device equipped with a camera, the device includes a processor; wherein the processor is used to execute the following program modules stored in the memory: original image acquisition module , Used to obtain multiple frames of original images collected by the camera; a face image extraction module, used to extract face images from the multiple frames of original images to obtain multiple frames of face images; hand trajectory generation module, It is used to detect the hand key points in each frame of face image, and generate hand trajectories according to the position changes of the hand key points in the multi-frame face image; the hand trajectory recognition module is used to compare The hand trajectory is recognized, and the gesture recognition result is obtained.
根据本公开的第四方面,提供一种手势控制装置,配置于具备摄像头的终端设备,所述装置包括处理器;其中,所述处理器用于执行存储器中存储的以下程序模块:原始图像获取模块,用于当开启手势控制功能时,获取由所述摄像头采集的多帧原始图像;人脸图像提取模块,用于分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;手部轨迹生成模块,用于检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;手部轨迹识别模块,用于对所述手部轨迹进行识别,得到手势识别结果;控制指令执行模块,用于执行所述手势识别结果对应的控制指令。According to a fourth aspect of the present disclosure, there is provided a gesture control device, which is configured in a terminal device equipped with a camera, the device includes a processor; wherein the processor is configured to execute the following program modules stored in the memory: original image acquisition module , Used to obtain multiple frames of original images collected by the camera when the gesture control function is turned on; a face image extraction module, used to extract face images from the multiple frames of original images to obtain multiple frames of face images The hand trajectory generation module is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image; hand The trajectory recognition module is used to recognize the hand trajectory to obtain the gesture recognition result; the control instruction execution module is used to execute the control instruction corresponding to the gesture recognition result.
根据本公开的第五方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现上述第一方面的手势识别方法或第二方面的手势控制方法。According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, and the computer program implements the gesture recognition method of the first aspect or the gesture control method of the second aspect when the computer program is executed by a processor .
根据本公开的第六方面,提供一种终端设备,包括:处理器;存储器,用于存储所述处理器的可执行指令;以及摄像头;其中,所述处理器配置为经由执行所述可执行指令来执行上述第一方面的手势识别方法或第二方面的手势控制方法。According to a sixth aspect of the present disclosure, there is provided a terminal device, including: a processor; a memory for storing executable instructions of the processor; and a camera; wherein the processor is configured to execute the executable Instructions to execute the gesture recognition method of the first aspect or the gesture control method of the second aspect.
本公开的技术方案具有以下有益效果:The technical solution of the present disclosure has the following beneficial effects:
由摄像头采集多帧原始图像,分别提取人脸图像,并从每帧人脸图像中检测手部关键点,再根据手部关键点的位置变化生成手部轨迹,最后对手部轨迹进行识别,得到手势识别结果。由于用户在进行手势操作时,手部一般位于脸部的前方或附近,从原始图像中提取人脸图像以检测手部关键点,相当于对原始图像进行了裁剪,裁减掉了和手势识别无关的区域,从而降低了图像处理的数据量,使系统仅需在人脸图像中进行手势识别,减小了过程耗时,提高了手势识别的实时性,且对硬件的处理性能要求不高,有利于部署在移动终端等轻量化场景中。进一步的,基于实时性较强的手势识别,当用户做出手势操作后,可以立即执行手势识别结果对应的控制指令,从而实现快速的交互响应,改善交互延迟问题,提高用户体验,对于体感游戏等具有较高的实用性。The camera collects multiple frames of original images, extracts the face images separately, and detects the key points of the hand from each frame of the face image, and then generates the hand trajectory according to the position change of the hand key points, and finally recognizes the hand trajectory. Gesture recognition result. Since the user's hand is generally located in front of or near the face when performing gesture operations, extracting the face image from the original image to detect the key points of the hand is equivalent to cropping the original image, and the cropping has nothing to do with gesture recognition. This reduces the amount of image processing data, so that the system only needs to perform gesture recognition in the face image, which reduces the time-consuming process, improves the real-time performance of gesture recognition, and does not require high hardware processing performance. It is conducive to deployment in lightweight scenarios such as mobile terminals. Further, based on the real-time gesture recognition, when the user makes a gesture operation, the control instruction corresponding to the gesture recognition result can be executed immediately, so as to achieve a fast interactive response, improve the interaction delay problem, and improve the user experience. For somatosensory games And so on has a high practicability.
Brief Description of the Drawings
Fig. 1 shows a flowchart of a gesture recognition method in this exemplary embodiment;
Fig. 2 shows a sub-flowchart of a gesture recognition method in this exemplary embodiment;
Fig. 3 shows a schematic flowchart of extracting hand candidate regions in this exemplary embodiment;
Fig. 4 shows a schematic flowchart of gesture recognition in this exemplary embodiment;
Fig. 5 shows a flowchart of a gesture control method in this exemplary embodiment;
Fig. 6 shows a structural block diagram of a gesture recognition apparatus in this exemplary embodiment;
Fig. 7 shows a structural block diagram of another gesture recognition apparatus in this exemplary embodiment;
Fig. 8 shows a structural block diagram of a gesture control apparatus in this exemplary embodiment;
Fig. 9 shows a structural block diagram of another gesture control apparatus in this exemplary embodiment;
Fig. 10 shows a computer-readable storage medium for implementing the above-mentioned methods in this exemplary embodiment;
Fig. 11 shows a terminal device for implementing the above-mentioned methods in this exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a full understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced while omitting one or more of the specific details, or that other methods, components, devices, steps and the like can be used. In other cases, well-known technical solutions are not shown or described in detail so as not to obscure aspects of the present disclosure.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and repeated descriptions of them are omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In related technologies, gesture recognition methods are mostly based on gesture localization and feature extraction in images captured by a camera. As the pixel count of cameras on terminal devices grows and image resolution increases, the amount of data processed during gesture recognition rises and the process takes longer, which degrades the real-time performance of gesture recognition and introduces a noticeable delay in gesture control, resulting in a poor user experience. Such methods also place high demands on hardware processing performance, which hinders their deployment in scenarios such as mobile terminals.
In view of the foregoing problems, exemplary embodiments of the present disclosure provide a gesture recognition method, which can be applied to terminal devices equipped with a camera, such as mobile phones, tablet computers, digital cameras and virtual reality devices. Fig. 1 shows a flow of the gesture recognition method, which may include the following steps S110 to S140:
Step S110: acquire multiple frames of original images collected by the camera.
A gesture is an action, and multiple frames are needed to record it completely. In this exemplary embodiment, when the gesture recognition function is turned on, the camera may collect a fixed number of frames of original images, for example 10 frames or 50 frames; alternatively, a matching infrared sensor or the like may sense whether an object is present in front of the camera. When an object (by default, a hand) is sensed, the camera starts collecting original images, and when the object is sensed to move away, the camera stops collecting, thereby obtaining multiple frames of original images. In an optional implementation, appropriate frame dropping may be performed after the original images are collected, for example keeping one frame out of every three, to reduce the subsequent processing load with little effect on gesture recognition; the specific frame-dropping rate depends on the number of frames collected by the camera and is not limited by the present disclosure.
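The optional frame dropping described above can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, and the keep-one-in-three ratio is just the example given in the text, not a requirement of the method.

```python
def drop_frames(frames, keep_every=3):
    """Keep one frame out of every `keep_every` captured frames to
    reduce the amount of data passed to later processing stages."""
    return frames[::keep_every]

# Example: 10 captured frames reduced to frames 0, 3, 6 and 9.
captured = list(range(10))
kept = drop_frames(captured)
print(kept)  # [0, 3, 6, 9]
```

The slice-based implementation keeps the first frame of each group, so the start and end of the gesture are still represented in the retained frames.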
Step S120: extract face images from the above-mentioned multiple frames of original images respectively to obtain multiple frames of face images.
The face region can be recognized through color and shape detection: for example, a color range and a shape range of the face are preset, and the original image is examined for a local region that satisfies both ranges; such a local region is the face region. Deep learning techniques may also be used, for example detecting the face region with neural networks such as YOLO (You Only Look Once, a real-time object detection framework with several versions including v1, v2 and v3, any of which may be used in the present disclosure), SSD (Single Shot Multibox Detector) or R-CNN (Region-based Convolutional Neural Network, or improved versions such as Fast R-CNN and Faster R-CNN). When a face region is detected, it can be marked with a rectangular box and extracted as a face image. To facilitate subsequent processing, face images may be extracted or sampled according to a preset size (or resolution) so that every frame of face image has the same size (or resolution).
In an optional implementation, a hardware face detection module (Hardware Face Detection, HWFD) can be provided on the terminal device. After the collected multiple frames of original images are input to the HWFD, it outputs the coordinates of the face region; mapping these coordinates onto the original image allows the face image to be extracted.
In an optional implementation, after step S110, the resolution of the collected multiple frames of original images may be adjusted to a preset resolution, and in step S120 the face image extraction may be performed on the resolution-adjusted original images. The preset resolution may be determined by the algorithm used in step S120. For example, if YOLO is used for face detection and its input layer is set to 640*480, the preset resolution may be 640*480; if the terminal's camera has 16 megapixels, the original images it collects have a resolution of 4608*3456, and the system can downsample them to 640*480 images for input to YOLO. The preset resolution is usually lower than the resolution of the original image itself, which is equivalent to compressing the original image, reduces its data volume and helps improve processing efficiency.
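The downsampling step can be illustrated with a toy nearest-neighbour resizer over nested lists. This is a minimal sketch, not the patent's implementation: a real system would use an image-processing library, and the 640*480 target is only the YOLO example mentioned above.

```python
def resize_nearest(image, new_w, new_h):
    """Nearest-neighbour resize of an image stored as a list of rows.

    `image` is a list of `old_h` rows, each a list of `old_w` pixel
    values; the result has `new_h` rows of `new_w` pixels.
    """
    old_h, old_w = len(image), len(image[0])
    return [
        [image[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]

# A 4x4 "image" downsampled to a 2x2 preset resolution.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
small = resize_nearest(img, 2, 2)
print(small)  # [[0, 2], [8, 10]]
```

In practice one would call, say, an optimized resize routine with the preset resolution (e.g. 640*480) before feeding frames to the face detector; the integer-division index mapping above is the same idea in its simplest form.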
Step S130: detect the hand key points in each frame of face image, and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images.
The choice of hand key points may depend on the scenario requirements and the image quality. For example, 21 skeleton points may be selected as hand key points, including four joint feature points per finger plus a palm-center feature point; alternatively, only a subset of skeleton points may be selected as needed. For example, when recognizing index-finger gestures, only the joint feature points or the fingertip point of the index finger may be used as hand key points.
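The trajectory-generation idea, together with the fallback described in the claims (when the hand candidate region of the current frame is a null value, the previous frame's key points are reused) and the rasterization into a "hand trajectory bitmap" for later classification, can be sketched as follows. All function names and the 8x8 bitmap size are illustrative assumptions, not details taken from the patent.

```python
def build_trajectory(per_frame_points):
    """Collect one key-point position per frame; when detection fails
    for a frame (None), fall back to the previous frame's position."""
    trajectory, last = [], None
    for pt in per_frame_points:
        if pt is None:
            pt = last          # reuse the previous frame's key point
        if pt is not None:
            trajectory.append(pt)
            last = pt
    return trajectory

def rasterize(trajectory, size=8):
    """Draw normalised (x, y) points in [0, 1) onto a size x size bitmap,
    a toy stand-in for the hand trajectory bitmap fed to a classifier."""
    bitmap = [[0] * size for _ in range(size)]
    for x, y in trajectory:
        bitmap[int(y * size)][int(x * size)] = 1
    return bitmap

points = [(0.1, 0.1), None, (0.3, 0.1), (0.5, 0.1)]  # frame 2 lost the hand
traj = build_trajectory(points)
print(len(traj))  # 4 -- the lost frame reuses (0.1, 0.1)
```

The fallback keeps the trajectory continuous across frames where hand detection momentarily fails, which is exactly the purpose of the null-value rule in the claims.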
在一种可选的实施方式中,手部关键点的检测可以通过形状检测而实现。例如:对人脸图像进行指尖形状检测,检测人脸图像中具有弧形的区域,并将这些区域的弧形与预设的标准指尖弧形进行匹配,匹配度较高的区域的弧形顶部即为指尖点(即手部关键点)。或者对人脸图像进行手指形状检测,将与标准手指形状较为相似的区域确定为手指区域,可以指定手指区域的圆形边界点为手部关键点。或者对人脸图像中的图形进行椭圆拟合,并将所拟合的椭圆的长轴端点作为手部关键点。In an alternative embodiment, the detection of the key points of the hand can be achieved through shape detection. For example: Perform fingertip shape detection on face images, detect areas with arcs in the face image, and match the arcs of these areas with the preset standard fingertip arcs, and the arcs of the areas with higher matching degrees The top of the shape is the fingertip point (that is, the key point of the hand). Or perform finger shape detection on the face image, and determine the area that is more similar to the standard finger shape as the finger area, and the circular boundary points of the finger area can be designated as the key points of the hand. Or perform ellipse fitting on the figure in the face image, and use the long axis end point of the fitted ellipse as the key point of the hand.
In an optional implementation, referring to FIG. 2, hand key point detection may be implemented through the following steps S210 and S220:
Step S210: perform region feature detection on each frame of face image to extract a hand candidate region from each frame;
Step S220: detect hand key points in the hand candidate region.
Region feature detection refers to segmenting many local regions from the face image and extracting and identifying the features of each local region; when a local region containing hand features is detected, that region is taken as the hand candidate region. Then detecting hand key points within the hand candidate region improves the detection accuracy of the hand key points.
Further, step S210 may be implemented through the following steps:
extracting features from the face image through convolutional layers;
processing the extracted features through an RPN (Region Proposal Network) to obtain candidate boxes;
classifying the candidate boxes through a classification layer to obtain the hand candidate region;
optimizing the position and size of the hand candidate region through a regression layer.
The above process may refer to FIG. 3 and can be implemented overall with R-CNN (or Fast R-CNN, Faster R-CNN). After the face image is input, it first passes through convolutional layers (usually followed by pooling layers) to extract image features. The features enter the RPN, which extracts candidate boxes; since a large number of candidate boxes is generally proposed, an NMS (Non-Maximum Suppression) algorithm may also be applied in this process to prune them and obtain more accurate candidate boxes. The candidate boxes extracted at this point cover various categories: besides hand candidate boxes there may be candidate boxes for the nose, mouth, glasses, and other parts. These candidate boxes are input to the classification layer, which classifies each of them, thereby yielding the hand candidate box (i.e., the hand candidate region). The classification layer may use a Softmax (normalized exponential) function or the like to output a probability value for each target category that may exist in the face image; the category with the highest probability is taken as the category of the candidate box. Candidate boxes of non-hand categories may be deleted, keeping only the hand candidate box. Finally, the hand candidate region is input to the regression layer, which fine-tunes its position and size to obtain the coordinate array (x, y, w, h) of the hand candidate region, where x and y represent the position coordinates of the region (usually the coordinates of its top-left corner), and w and h represent its width and height.
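The greedy NMS pruning mentioned above can be sketched as follows, operating on (x, y, w, h) boxes matching the coordinate array described here. The function names and the IoU threshold value are illustrative assumptions, not taken from the disclosure.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (aw * ah + bw * bh - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

For example, two heavily overlapping proposals on the same hand collapse to the higher-scoring one, while a distant proposal survives, which is why NMS yields fewer and more accurate candidate boxes.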
The above R-CNN may be obtained by training on a large number of face image samples. The R-CNN is set to the structure shown in FIG. 3, comprising a base network, convolutional layers (and pooling layers), the RPN, a classification layer, and a regression layer. Labels are obtained by manually annotating hand candidate regions in the images; the network is trained on the image samples and labels, updating the network parameters to obtain a usable R-CNN.
It should be noted that the method of FIG. 2 may be applied to every frame of face image, detecting hand key points in each frame. However, considering that some frames may contain no hand, or that poor image quality may prevent the hand from being detected, in an optional implementation, if the hand candidate region extracted from the current frame of face image is null, the hand key points detected in the previous frame are used as the hand key points of the current frame. A null hand candidate region means that no hand could be detected; in that case the previous frame's hand key points may be copied directly into the current frame. This improves the robustness of the algorithm.
It should be added that if the number of frames with a null hand candidate region reaches a preset threshold, indicating that the hand has gone undetected for too many frames, the previously detected data may be cleared and detection restarted, or an unsuccessful gesture recognition result may be output and corresponding information displayed in the user interface, such as "Gesture recognition failed, please make the gesture again."
Detecting hand key points within the hand candidate region may likewise be implemented with models such as R-CNN: the hand key points are taken as the targets to be detected, and through the extraction and processing of image features, the regions where the targets are located are output, thereby marking the hand key points.
By determining the position of the hand key points in each frame of face image, the change of that position between frames forms the hand trajectory. The hand trajectory may take the form of an array, a vector, a picture, or the like; the present disclosure does not limit this.
Step S140: recognize the hand trajectory to obtain a gesture recognition result.
The hand trajectory reflects the user's gesture action; recognizing it therefore identifies the gesture the user made and yields a gesture recognition result.
In an optional implementation, the hand trajectory generated in step S130 may be matched against preset standard trajectories, which may include waving the hand left and right, shaking a finger left and right, sliding a finger up and down, opening the hand, and so on. If the matching rate between a standard trajectory and the hand trajectory reaches a certain threshold, the hand trajectory is judged to be that standard trajectory, and the gesture represented by that standard trajectory is output as the gesture recognition result.
In an optional implementation, step S140 may be implemented through the following steps:
mapping the hand trajectory into a bitmap to obtain a hand trajectory bitmap;
processing the hand trajectory bitmap with a Bayesian classifier to obtain the gesture recognition result.
The size of the bitmap may be preset, or it may be the same as that of the face image or of the hand candidate region. The hand trajectory is the position change of the hand key points; mapping each frame's position into the bitmap and connecting the positions in order represents the hand trajectory in the bitmap, which is therefore called the hand trajectory bitmap.
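A minimal sketch of mapping a trajectory into such a bitmap, assuming key-point positions have already been normalised to [0, 1]; the grid size and the line-drawing density are illustrative choices, not values from the disclosure.

```python
def trajectory_bitmap(points, size=32):
    """Rasterise a sequence of normalised (x, y) positions into a
    size x size 0/1 grid, connecting consecutive positions in order."""
    grid = [[0] * size for _ in range(size)]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        steps = 2 * size  # dense enough that the segment has no gaps
        for t in range(steps + 1):
            x = x0 + (x1 - x0) * t / steps
            y = y0 + (y1 - y0) * t / steps
            col = min(size - 1, int(x * size))
            row = min(size - 1, int(y * size))
            grid[row][col] = 1
    return grid
```

The resulting fixed-size grid decouples the classifier's input from the camera resolution and frame count, which is what makes a simple classifier practical here.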
A Bayesian classifier selects the optimal category based on known probabilities and misclassification losses, minimizing the classification risk. Refer to the following formulas (the original figure is not reproduced here, so they are reconstructed in standard notation):

R(c_i | x) = Σ_{j=1}^{N} λ_ij · P(c_j | x)

h(x) = argmin_{c_i} R(c_i | x)

where h denotes the Bayesian classifier, x is a sample, λ_ij is the loss incurred when a sample of category c_j is misclassified as c_i, P(c_j | x) is the posterior probability of category c_j given x, R(c_i | x) is the expected loss (conditional risk) of labeling x as c_i, and N is the number of candidate categories. Inputting the hand trajectory bitmap into the Bayesian classifier outputs the gesture recognition result.
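The minimum-risk decision can be illustrated with a small sketch; the loss-matrix values are made up for illustration.

```python
def bayes_decision(posteriors, loss):
    """Return the index i minimising the conditional risk
    R(c_i | x) = sum_j loss[i][j] * P(c_j | x)."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)

# With 0-1 loss, minimum risk reduces to picking the largest posterior.
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
chosen = bayes_decision([0.2, 0.5, 0.3], zero_one)
```

With an asymmetric loss matrix the decision can differ from a plain argmax, which is the point of risk-based classification: costly confusions (e.g. triggering the wrong camera command) can be penalised more heavily.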
FIG. 4 shows a schematic flow of the gesture recognition method. As shown, after the camera captures the original images, their resolution may be adjusted to a preset resolution to shrink them; a face image is then extracted from each resolution-adjusted original image via HWFD, so that subsequent processing concentrates on a local region of the original image; the hand candidate region is then detected and extracted from the face image to further narrow the image range; hand key points are detected in the hand candidate region, the hand trajectory is determined from the position changes of the hand key points between frames and mapped into a hand trajectory bitmap; the hand trajectory bitmap is input to the Bayesian classifier, whose processing outputs the gesture recognition result.
In an optional implementation, the above terminal device may include multiple cameras. After the gesture recognition result is obtained, the terminal may switch among the multiple cameras according to the result. For example, when the gesture recognition result is shaking a finger left and right, the terminal device is triggered to switch to the main camera; when it is sliding a finger up and down, the terminal device is triggered to switch to the telephoto camera, and so on. In this way, a user at some distance from the terminal device can conveniently operate it through gestures while facing the camera.
In the gesture recognition method of this exemplary embodiment, the camera collects multiple frames of original images, a face image is extracted from each, hand key points are detected in each frame of face image, a hand trajectory is generated from the position changes of the hand key points, and finally the hand trajectory is recognized to obtain a gesture recognition result. Since the user's hand is generally located in front of or near the face when performing a gesture, extracting the face image from the original image before detecting hand key points is equivalent to cropping the original image and discarding the regions irrelevant to gesture recognition. This reduces the amount of image data to process, so that the system only needs to perform gesture recognition within the face image, which shortens processing time, improves the real-time performance of gesture recognition, and places low demands on hardware processing performance, making the method well suited to deployment in lightweight scenarios such as mobile terminals.
Exemplary embodiments of the present disclosure also provide a gesture control method, which can be applied to a terminal device equipped with a camera. The gesture control method may include:
when the gesture control function is enabled, obtaining a gesture recognition result according to the gesture recognition method of this exemplary embodiment;
executing the control instruction corresponding to the gesture recognition result.
Enabling the gesture control function includes, but is not limited to: the terminal automatically enabling it when a game program with gesture control is launched, or the user choosing to enable it in an interface such as photographing or web browsing. A correspondence between gestures and control instructions may be preset in the program; for example, waving the palm may correspond to a screenshot instruction and sliding a finger downward to a page-turning instruction, so that when the user's gesture is recognized, the corresponding control instruction can be quickly located and executed according to the gesture recognition result. In particular, in the photographing interface the user may be allowed to control photographing through specific gestures: for example, when the user makes a thumbs-up gesture, the terminal device is triggered to automatically press the shutter button. Alternatively, when the terminal device is equipped with multiple cameras, the user may control camera switching through specific gestures: for example, when the user shakes a finger, the terminal device is triggered to switch among the main camera, the telephoto camera, and the wide-angle camera, providing convenience for the user's photographing operations.
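The preset correspondence between gestures and control instructions described above can be sketched as a simple lookup table; all gesture names, instruction names, and the camera cycling order here are illustrative assumptions, not defined by the disclosure.

```python
GESTURE_TO_INSTRUCTION = {
    "wave_palm": "screenshot",
    "slide_finger_down": "page_down",
    "thumbs_up": "press_shutter",
    "shake_finger": "switch_camera",
}

CAMERAS = ["main", "telephoto", "wide_angle"]

def handle_gesture(gesture, current_camera="main"):
    """Look up the instruction for a recognised gesture; for the camera
    switching instruction, cycle to the next camera in the list."""
    instruction = GESTURE_TO_INSTRUCTION.get(gesture)
    if instruction == "switch_camera":
        nxt = CAMERAS[(CAMERAS.index(current_camera) + 1) % len(CAMERAS)]
        return ("switch_camera", nxt)
    return (instruction, current_camera)
```

Keeping the mapping in a table makes the gesture-to-instruction binding easy to extend without touching the recognition pipeline.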
FIG. 5 shows a flow of the gesture control method, which may include the following steps S510 to S550:
Step S510: when the gesture control function is enabled, acquire multiple frames of original images collected by the camera;
Step S520: extract a face image from each of the multiple frames of original images to obtain multiple frames of face images;
Step S530: detect hand key points in each frame of face image, and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images;
Step S540: recognize the hand trajectory to obtain a gesture recognition result;
Step S550: execute the control instruction corresponding to the gesture recognition result.
In the gesture control method of this exemplary embodiment, based on highly real-time gesture recognition, the control instruction corresponding to the gesture recognition result can be executed immediately after the user makes a gesture, achieving fast interactive response, alleviating interaction latency, and improving the user experience; this is highly practical for motion-sensing games and the like.
Exemplary embodiments of the present disclosure also provide a gesture recognition apparatus, which can be configured in a terminal device equipped with a camera. As shown in FIG. 6, the gesture recognition apparatus 600 may include a processor 610 and a memory 620, where the memory 620 stores the following program modules:
an original image acquisition module 621, configured to acquire multiple frames of original images collected by the camera;
a face image extraction module 622, configured to extract a face image from each of the multiple frames of original images to obtain multiple frames of face images;
a hand trajectory generation module 623, configured to detect hand key points in each frame of face image and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images;
a hand trajectory recognition module 624, configured to recognize the hand trajectory to obtain a gesture recognition result;
the processor 610 is configured to execute the above program modules.
In an optional implementation, the original image acquisition module 621 is configured to:
after acquiring the multiple frames of original images collected by the camera, adjust the resolution of the multiple frames of original images to a preset resolution.
In an optional implementation, the hand trajectory generation module 623 is configured to:
perform region feature detection on each frame of face image to extract a hand candidate region from each frame;
detect hand key points in the hand candidate region.
In an optional implementation, the hand trajectory generation module 623 is configured to:
if the hand candidate region extracted from the current frame of face image is null, use the hand key points detected in the previous frame as the hand key points of the current frame.
In an optional implementation, the hand trajectory generation module 623 is configured to:
extract features from the face image through convolutional layers;
process the extracted features through a region proposal network to obtain candidate boxes;
classify the candidate boxes through a classification layer to obtain the hand candidate region;
optimize the position and size of the hand candidate region through a regression layer.
In an optional implementation, the hand trajectory recognition module 624 is configured to:
map the hand trajectory into a bitmap to obtain a hand trajectory bitmap;
process the hand trajectory bitmap with a Bayesian classifier to obtain the gesture recognition result.
In an optional implementation, the above terminal device includes multiple cameras, and the hand trajectory recognition module 624 is configured to:
after the gesture recognition result is obtained, switch among the multiple cameras according to the gesture recognition result.
Exemplary embodiments of the present disclosure also provide another gesture recognition apparatus, which can be configured in a terminal device equipped with a camera. As shown in FIG. 7, the gesture recognition apparatus 700 may include:
an original image acquisition module 710, configured to acquire multiple frames of original images collected by the camera;
a face image extraction module 720, configured to extract a face image from each of the multiple frames of original images to obtain multiple frames of face images;
a hand trajectory generation module 730, configured to detect hand key points in each frame of face image and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images;
a hand trajectory recognition module 740, configured to recognize the hand trajectory to obtain a gesture recognition result.
In an optional implementation, the original image acquisition module 710 is configured to:
after acquiring the multiple frames of original images collected by the camera, adjust the resolution of the multiple frames of original images to a preset resolution.
In an optional implementation, the hand trajectory generation module 730 is configured to:
perform region feature detection on each frame of face image to extract a hand candidate region from each frame;
detect hand key points in the hand candidate region.
In an optional implementation, the hand trajectory generation module 730 is configured to:
if the hand candidate region extracted from the current frame of face image is null, use the hand key points detected in the previous frame as the hand key points of the current frame.
In an optional implementation, the hand trajectory generation module 730 is configured to:
extract features from the face image through convolutional layers;
process the extracted features through a region proposal network to obtain candidate boxes;
classify the candidate boxes through a classification layer to obtain the hand candidate region;
optimize the position and size of the hand candidate region through a regression layer.
In an optional implementation, the hand trajectory recognition module 740 is configured to:
map the hand trajectory into a bitmap to obtain a hand trajectory bitmap;
process the hand trajectory bitmap with a Bayesian classifier to obtain the gesture recognition result.
In an optional implementation, the above terminal device includes multiple cameras, and the hand trajectory recognition module 740 is configured to:
after the gesture recognition result is obtained, switch among the multiple cameras according to the gesture recognition result.
Exemplary embodiments of the present disclosure also provide a gesture control apparatus, which can be configured in a terminal device equipped with a camera. As shown in FIG. 8, the gesture control apparatus 800 may include a processor 810 and a memory 820, where the memory 820 stores the following program modules:
an original image acquisition module 821, configured to acquire multiple frames of original images collected by the camera when the gesture control function is enabled;
a face image extraction module 822, configured to extract a face image from each of the multiple frames of original images to obtain multiple frames of face images;
a hand trajectory generation module 823, configured to detect hand key points in each frame of face image and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images;
a hand trajectory recognition module 824, configured to recognize the hand trajectory to obtain a gesture recognition result;
a control instruction execution module 825, configured to execute the control instruction corresponding to the gesture recognition result;
the processor 810 is configured to execute the above program modules.
In an optional implementation, the above control instruction includes a camera switching instruction.
In an optional implementation, the original image acquisition module 821 is configured to:
after acquiring the multiple frames of original images collected by the camera, adjust the resolution of the multiple frames of original images to a preset resolution.
In an optional implementation, the hand trajectory generation module 823 is configured to:
perform region feature detection on each frame of face image to extract a hand candidate region from each frame;
detect hand key points in the hand candidate region.
In an optional implementation, the hand trajectory generation module 823 is configured to:
if the hand candidate region extracted from the current frame of face image is null, use the hand key points detected in the previous frame as the hand key points of the current frame.
In an optional implementation, the hand trajectory generation module 823 is configured to:
extract features from the face image through convolutional layers;
process the extracted features through a region proposal network to obtain candidate boxes;
classify the candidate boxes through a classification layer to obtain the hand candidate region;
optimize the position and size of the hand candidate region through a regression layer.
In an optional implementation, the hand trajectory recognition module 824 is configured to:
map the hand trajectory into a bitmap to obtain a hand trajectory bitmap;
process the hand trajectory bitmap with a Bayesian classifier to obtain the gesture recognition result.
Exemplary embodiments of the present disclosure also provide another gesture control apparatus, which can be configured in a terminal device equipped with a camera. As shown in FIG. 9, the gesture control apparatus 900 may include:
an original image acquisition module 910, configured to acquire multiple frames of original images collected by the camera when the gesture control function is enabled;
a face image extraction module 920, configured to extract a face image from each of the multiple frames of original images to obtain multiple frames of face images;
a hand trajectory generation module 930, configured to detect hand key points in each frame of face image and generate a hand trajectory according to the position changes of the hand key points across the multiple frames of face images;
a hand trajectory recognition module 940, configured to recognize the hand trajectory to obtain a gesture recognition result;
a control instruction execution module 950, configured to execute the control instruction corresponding to the gesture recognition result.
In an optional implementation, the above control instruction includes a camera switching instruction.
In an optional implementation, the original image acquisition module 910 is configured to:
after acquiring the multiple frames of original images collected by the camera, adjust the resolution of the multiple frames of original images to a preset resolution.
In an optional implementation, the hand trajectory generation module 930 is configured to:
perform region feature detection on each frame of face image to extract a hand candidate region from each frame;
detect hand key points in the hand candidate region.
In an optional implementation, the hand trajectory generation module 930 is configured to:
if the hand candidate region extracted from the current frame of face image is null, use the hand key points detected in the previous frame as the hand key points of the current frame.
In an optional implementation, the hand trajectory generation module 930 is configured to:
extract features from the face image through convolutional layers;
process the extracted features through a region proposal network to obtain candidate boxes;
classify the candidate boxes through a classification layer to obtain the hand candidate region;
optimize the position and size of the hand candidate region through a regression layer.
In an optional implementation, the hand trajectory recognition module 940 is configured to:
map the hand trajectory into a bitmap to obtain a hand trajectory bitmap;
process the hand trajectory bitmap with a Bayesian classifier to obtain the gesture recognition result.
In the above gesture recognition apparatus and gesture control apparatus, the specific details of each module have already been described in detail in the implementations of the gesture recognition method and the gesture control method, respectively; for undisclosed details, refer to the corresponding method implementations, which are not repeated here.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium, which may be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
Referring to FIG. 10, a program product 1000 for implementing the above method according to an exemplary embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and may run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。The program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In the case involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
本公开的示例性实施方式还提供了一种能够实现上述方法的终端设备,该终端设备可以是手机、平板电脑、数码相机等。下面参照图11来描述根据本公开的这种示例性实施方式的终端设备1100。图11显示的终端设备1100仅仅是一个示例,不应对本公开实施方式的功能和使用范围带来任何限制。Exemplary embodiments of the present disclosure also provide a terminal device capable of implementing the above method; the terminal device may be a mobile phone, a tablet computer, a digital camera, or the like. The terminal device 1100 according to this exemplary embodiment of the present disclosure is described below with reference to FIG. 11. The terminal device 1100 shown in FIG. 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
如图11所示,终端设备1100可以以通用计算设备的形式表现。终端设备1100的组件可以包括但不限于:至少一个处理单元1110、至少一个存储单元1120、连接不同系统组件(包括存储单元1120和处理单元1110)的总线1130、显示单元1140和图像采集单元1170,图像采集单元1170包括至少一个摄像头。As shown in FIG. 11, the terminal device 1100 may take the form of a general-purpose computing device. The components of the terminal device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting different system components (including the storage unit 1120 and the processing unit 1110), a display unit 1140, and an image acquisition unit 1170, where the image acquisition unit 1170 includes at least one camera.
存储单元1120存储有程序代码,程序代码可以被处理单元1110执行,使得处理单元1110执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。例如,处理单元1110可以执行图1、图2或图5所示的方法步骤。The storage unit 1120 stores program codes, and the program codes can be executed by the processing unit 1110, so that the processing unit 1110 executes the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification. For example, the processing unit 1110 may execute the method steps shown in FIG. 1, FIG. 2 or FIG. 5.
存储单元1120可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)1121和/或高速缓存存储单元1122,还可以进一步包括只读存储单元(ROM)1123。The storage unit 1120 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 1121 and/or a cache storage unit 1122, and may further include a read-only storage unit (ROM) 1123.
存储单元1120还可以包括具有一组(至少一个)程序模块1125的程序/实用工具1124,这样的程序模块1125包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 1120 may also include a program/utility 1124 having a set of (at least one) program modules 1125. Such program modules 1125 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
总线1130可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
终端设备1100也可以与一个或多个外部设备1200(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该终端设备1100交互的设备通信,和/或与使得该终端设备1100能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口1150进行。并且,终端设备1100还可以通过网络适配器1160与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器1160通过总线1130与终端设备1100的其它模块通信。应当明白,尽管图中未示出,可以结合终端设备1100使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The terminal device 1100 may also communicate with one or more external devices 1200 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the terminal device 1100, and/or with any device (such as a router, a modem, etc.) that enables the terminal device 1100 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 1150. In addition, the terminal device 1100 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 1160. As shown in the figure, the network adapter 1160 communicates with the other modules of the terminal device 1100 through the bus 1130. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the terminal device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本公开实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本公开示例性实施方式的方法。Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
本技术领域的技术人员能够理解,本公开的各个方面可以实现为系统、方法或程序产品。因此,本公开的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art can understand that various aspects of the present disclosure can be implemented as a system, a method, or a program product. Therefore, various aspects of the present disclosure can be specifically implemented in the following forms, namely: complete hardware implementation, complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software implementations, which may be collectively referred to herein as "Circuit", "Module" or "System".
此外,上述附图仅是根据本公开示例性实施方式的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。In addition, the above-mentioned drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiment of the present disclosure, and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of these processings. In addition, it is easy to understand that these processes can be executed synchronously or asynchronously in multiple modules, for example.
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的示例性实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其他实施方式。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施方式仅被视为示例性的,本公开的真正范围和精神由权利要求指出。Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限定。It should be understood that the present disclosure is not limited to the precise structure that has been described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. 一种手势识别方法,应用于具备摄像头的终端设备,其特征在于,所述方法包括:A gesture recognition method applied to a terminal device equipped with a camera, characterized in that the method includes:
    获取由所述摄像头采集的多帧原始图像;Acquiring multiple frames of original images collected by the camera;
    分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;Extracting face images from the multiple frames of original images to obtain multiple frames of face images;
    检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;Detecting hand key points in each frame of face image, and generating a hand trajectory according to position changes of the hand key points in the multi-frame face image;
    对所述手部轨迹进行识别,得到手势识别结果。The hand trajectory is recognized, and the gesture recognition result is obtained.
  2. 根据权利要求1所述的方法,其特征在于,在获取由所述摄像头采集的多帧原始图像后,所述方法还包括:The method according to claim 1, wherein after acquiring multiple frames of original images collected by the camera, the method further comprises:
    将所述多帧原始图像的分辨率调整为预设分辨率。The resolution of the multiple frames of original images is adjusted to a preset resolution.
  3. 根据权利要求1所述的方法,其特征在于,所述检测每帧人脸图像中的手部关键点,包括:The method according to claim 1, wherein the detecting the key points of the hands in each frame of the face image comprises:
    对所述每帧人脸图像进行区域特征检测,以从所述每帧人脸图像中提取出手部候选区域;Performing region feature detection on each frame of the face image, so as to extract a hand candidate region from each frame of the face image;
    在所述手部候选区域中检测手部关键点。Detecting hand key points in the hand candidate region.
  4. 根据权利要求3所述的方法,其特征在于,所述检测每帧人脸图像中的手部关键点,还包括:The method according to claim 3, wherein the detecting the key points of the hands in each frame of the face image further comprises:
    如果从当前帧人脸图像中提取的手部候选区域为空值,则以上一帧检测的手部关键点作为当前帧的手部关键点。If the hand candidate region extracted from the face image of the current frame is a null value, the hand key points detected in the previous frame are used as the hand key points of the current frame.
  5. 根据权利要求3所述的方法,其特征在于,所述对所述每帧人脸图像进行区域特征检测,以从所述每帧人脸图像中提取出手部候选区域,包括:The method according to claim 3, wherein said performing area feature detection on each frame of face image to extract a hand candidate area from said frame of face image comprises:
    通过卷积层从所述人脸图像中提取特征;Extracting features from the face image through a convolutional layer;
    通过区域生成网络对所提取的特征进行处理,得到候选框;Processing the extracted features through a region proposal network to obtain candidate boxes;
    通过分类层对所述候选框进行分类,得到手部候选区域;Classifying the candidate boxes through a classification layer to obtain a hand candidate region;
    通过回归层优化所述手部候选区域的位置和尺寸。Optimizing the position and size of the hand candidate region through a regression layer.
  6. 根据权利要求1所述的方法,其特征在于,对所述手部轨迹进行识别,得到手势识别结果,包括:The method according to claim 1, wherein the recognizing the hand trajectory to obtain a gesture recognition result comprises:
    将所述手部轨迹映射到位图中,得到手部轨迹位图;Mapping the hand trajectory to a bitmap to obtain a hand trajectory bitmap;
    通过贝叶斯分类器对所述手部轨迹位图进行处理,得到手势识别结果。The hand trajectory bitmap is processed by the Bayesian classifier to obtain the gesture recognition result.
  7. 根据权利要求1所述的方法,其特征在于,所述终端设备包括多个摄像头;在得到手势识别结果后,所述方法还包括:The method according to claim 1, wherein the terminal device comprises multiple cameras; after obtaining a gesture recognition result, the method further comprises:
    根据所述手势识别结果在所述多个摄像头之间进行切换。Switching between the multiple cameras according to the gesture recognition result.
  8. 一种手势控制方法,应用于具备摄像头的终端设备,其特征在于,所述方法包括:A gesture control method applied to a terminal device equipped with a camera, characterized in that the method includes:
    当开启手势控制功能时,根据权利要求1至7任一项所述的方法得到手势识别结果;When the gesture control function is enabled, obtaining a gesture recognition result according to the method of any one of claims 1 to 7;
    执行所述手势识别结果对应的控制指令。Execute the control instruction corresponding to the gesture recognition result.
  9. 根据权利要求8所述的方法,其特征在于,所述控制指令包括摄像头切换指令。The method according to claim 8, wherein the control instruction includes a camera switching instruction.
  10. 一种手势识别装置,配置于具备摄像头的终端设备,其特征在于,所述装置包括处理器;其中,所述处理器用于执行存储器中存储的以下程序模块:A gesture recognition device configured in a terminal device equipped with a camera, wherein the device includes a processor; wherein the processor is used to execute the following program modules stored in the memory:
    原始图像获取模块,用于获取由所述摄像头采集的多帧原始图像;An original image acquisition module for acquiring multiple frames of original images collected by the camera;
    人脸图像提取模块,用于分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;The face image extraction module is configured to extract face images from the multiple frames of original images to obtain multiple frames of face images;
    手部轨迹生成模块,用于检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;The hand trajectory generating module is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
    手部轨迹识别模块,用于对所述手部轨迹进行识别,得到手势识别结果。The hand trajectory recognition module is used to recognize the hand trajectory and obtain the gesture recognition result.
  11. 根据权利要求10所述的装置,其特征在于,所述原始图像获取模块,被配置为:The device according to claim 10, wherein the original image acquisition module is configured to:
    在获取由所述摄像头采集的多帧原始图像后,将所述多帧原始图像的分辨率调整为预设分辨率。After acquiring the multiple frames of original images collected by the camera, the resolution of the multiple frames of original images is adjusted to a preset resolution.
  12. 根据权利要求10所述的装置,其特征在于,所述手部轨迹生成模块,被配置为:The device according to claim 10, wherein the hand trajectory generating module is configured to:
    对所述每帧人脸图像进行区域特征检测,以从所述每帧人脸图像中提取出手部候选区域;Performing region feature detection on each frame of the face image, so as to extract a hand candidate region from each frame of the face image;
    在所述手部候选区域中检测手部关键点。Detecting hand key points in the hand candidate region.
  13. 根据权利要求12所述的装置,其特征在于,所述手部轨迹生成模块,被配置为:The device according to claim 12, wherein the hand trajectory generating module is configured to:
    如果从当前帧人脸图像中提取的手部候选区域为空值,则以上一帧检测的手部关键点作为当前帧的手部关键点。If the hand candidate region extracted from the face image of the current frame is a null value, the hand key points detected in the previous frame are used as the hand key points of the current frame.
  14. 根据权利要求12所述的装置,其特征在于,所述手部轨迹生成模块,被配置为:The device according to claim 12, wherein the hand trajectory generating module is configured to:
    通过卷积层从所述人脸图像中提取特征;Extracting features from the face image through a convolutional layer;
    通过区域生成网络对所提取的特征进行处理,得到候选框;Processing the extracted features through a region proposal network to obtain candidate boxes;
    通过分类层对所述候选框进行分类,得到手部候选区域;Classifying the candidate boxes through a classification layer to obtain a hand candidate region;
    通过回归层优化所述手部候选区域的位置和尺寸。Optimizing the position and size of the hand candidate region through a regression layer.
  15. 根据权利要求10所述的装置,其特征在于,所述手部轨迹识别模块,被配置为:The device according to claim 10, wherein the hand trajectory recognition module is configured to:
    将所述手部轨迹映射到位图中,得到手部轨迹位图;Mapping the hand trajectory to a bitmap to obtain a hand trajectory bitmap;
    通过贝叶斯分类器对所述手部轨迹位图进行处理,得到手势识别结果。The hand trajectory bitmap is processed by the Bayesian classifier to obtain the gesture recognition result.
  16. 根据权利要求10所述的装置,其特征在于,所述终端设备包括多个摄像头;所述手部轨迹识别模块,被配置为:The apparatus according to claim 10, wherein the terminal device comprises a plurality of cameras; and the hand track recognition module is configured to:
    在得到所述手势识别结果后,根据所述手势识别结果在所述多个摄像头之间进行切换。After obtaining the gesture recognition result, switch between the multiple cameras according to the gesture recognition result.
  17. 一种手势控制装置,配置于具备摄像头的终端设备,其特征在于,所述装置包括处理器;其中,所述处理器用于执行存储器中存储的以下程序模块:A gesture control device configured in a terminal device equipped with a camera, wherein the device includes a processor; wherein the processor is used to execute the following program modules stored in the memory:
    原始图像获取模块,用于当开启手势控制功能时,获取由所述摄像头采集的多帧原始图像;An original image acquisition module for acquiring multiple frames of original images collected by the camera when the gesture control function is turned on;
    人脸图像提取模块,用于分别从所述多帧原始图像中提取人脸图像,得到多帧人脸图像;The face image extraction module is configured to extract face images from the multiple frames of original images to obtain multiple frames of face images;
    手部轨迹生成模块,用于检测每帧人脸图像中的手部关键点,并根据所述手部关键点在所述多帧人脸图像中的位置变化,生成手部轨迹;The hand trajectory generating module is used to detect the hand key points in each frame of face image, and generate the hand trajectory according to the position changes of the hand key points in the multi-frame face image;
    手部轨迹识别模块,用于对所述手部轨迹进行识别,得到手势识别结果;The hand trajectory recognition module is used to recognize the hand trajectory to obtain a gesture recognition result;
    控制指令执行模块,用于执行所述手势识别结果对应的控制指令。The control instruction execution module is used to execute the control instruction corresponding to the gesture recognition result.
  18. 根据权利要求17所述的装置,其特征在于,所述控制指令包括摄像头切换指令。The device according to claim 17, wherein the control instruction comprises a camera switching instruction.
  19. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现权利要求1至9任一项所述的方法。A computer-readable storage medium having a computer program stored thereon, wherein the computer program implements the method according to any one of claims 1 to 9 when the computer program is executed by a processor.
  20. 一种终端设备,其特征在于,包括:A terminal device, characterized in that it comprises:
    处理器;processor;
    存储器,用于存储所述处理器的可执行指令;以及A memory for storing executable instructions of the processor; and
    摄像头;a camera;
    其中,所述处理器配置为经由执行所述可执行指令来执行权利要求1至9任一项所述的方法。Wherein, the processor is configured to execute the method according to any one of claims 1 to 9 by executing the executable instructions.
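As a control-flow illustration of the trajectory-building steps in claims 1 and 4 (not the disclosed CNN detector), the per-frame loop below accumulates hand key points into a trajectory and falls back to the previous frame's key points when the candidate region comes back null. The detector argument is a stub standing in for the claimed convolutional pipeline (feature extraction, region proposal network, classification and regression layers); the function name `build_trajectory` and the centroid-per-frame sampling are assumptions made for demonstration.

```python
def build_trajectory(frames, detect_hand_keypoints):
    """Return the hand trajectory across frames (claim 1 step sequence)."""
    trajectory = []
    prev_keypoints = None
    for frame in frames:
        keypoints = detect_hand_keypoints(frame)
        if keypoints is None:           # claim 4: null hand candidate region,
            keypoints = prev_keypoints  # reuse the previous frame's key points
        if keypoints is not None:
            # One trajectory sample per frame, here the centroid of the
            # detected hand key points (an assumed sampling choice).
            cx = sum(p[0] for p in keypoints) / len(keypoints)
            cy = sum(p[1] for p in keypoints) / len(keypoints)
            trajectory.append((cx, cy))
            prev_keypoints = keypoints
    return trajectory


# Stub detector: the frame itself holds the key points; frame 2 simulates a
# failed detection (null candidate region).
frames = [
    [(1.0, 1.0), (2.0, 2.0)],   # frame 0: two key points
    [(2.0, 1.0), (3.0, 2.0)],   # frame 1
    None,                        # frame 2: detection fails
    [(4.0, 1.0), (5.0, 2.0)],   # frame 3
]
traj = build_trajectory(frames, lambda f: f)
print(traj)  # → [(1.5, 1.5), (2.5, 1.5), (2.5, 1.5), (4.5, 1.5)]
```

Note how the failed frame repeats the previous centroid rather than dropping a sample, which keeps the trajectory the same length as the frame sequence and avoids gaps in the bitmap that is later classified.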
PCT/CN2020/133410 2019-12-13 2020-12-02 Gesture recognition method, gesture control method, apparatuses, medium and terminal device WO2021115181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911284143.9A CN111062312B (en) 2019-12-13 2019-12-13 Gesture recognition method, gesture control device, medium and terminal equipment
CN201911284143.9 2019-12-13

Publications (1)

Publication Number Publication Date
WO2021115181A1 true WO2021115181A1 (en) 2021-06-17

Family

ID=70301548

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/133410 WO2021115181A1 (en) 2019-12-13 2020-12-02 Gesture recognition method, gesture control method, apparatuses, medium and terminal device

Country Status (2)

Country Link
CN (1) CN111062312B (en)
WO (1) WO2021115181A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469017A (en) * 2021-06-29 2021-10-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment
CN113808007A (en) * 2021-09-16 2021-12-17 北京百度网讯科技有限公司 Method and device for adjusting virtual face model, electronic equipment and storage medium
CN115097936A (en) * 2022-06-16 2022-09-23 慧之安信息技术股份有限公司 Display screen control method based on gesture action deep learning
CN115565253A (en) * 2022-12-08 2023-01-03 季华实验室 Dynamic gesture real-time recognition method and device, electronic equipment and storage medium
CN115830642A (en) * 2023-02-13 2023-03-21 粤港澳大湾区数字经济研究院(福田) 2D whole body key point labeling method and 3D human body grid labeling method

Families Citing this family (19)

Publication number Priority date Publication date Assignee Title
CN111062312B (en) * 2019-12-13 2023-10-27 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control device, medium and terminal equipment
CN111625102A (en) * 2020-06-03 2020-09-04 上海商汤智能科技有限公司 Building display method and device
CN111757065A (en) * 2020-07-02 2020-10-09 广州博冠智能科技有限公司 Method and device for automatically switching lens, storage medium and monitoring camera
CN114153308B (en) * 2020-09-08 2023-11-21 阿里巴巴集团控股有限公司 Gesture control method, gesture control device, electronic equipment and computer readable medium
CN112100075B (en) * 2020-09-24 2024-03-15 腾讯科技(深圳)有限公司 User interface playback method, device, equipment and storage medium
CN112203015B (en) * 2020-09-28 2022-03-25 北京小米松果电子有限公司 Camera control method, device and medium system
CN112328090B (en) * 2020-11-27 2023-01-31 北京市商汤科技开发有限公司 Gesture recognition method and device, electronic equipment and storage medium
CN112527113A (en) * 2020-12-09 2021-03-19 北京地平线信息技术有限公司 Method and apparatus for training gesture recognition and gesture recognition network, medium, and device
CN112488059B (en) * 2020-12-18 2022-10-04 哈尔滨拓博科技有限公司 Spatial gesture control method based on deep learning model cascade
CN112866064A (en) * 2021-01-04 2021-05-28 欧普照明电器(中山)有限公司 Control method, control system and electronic equipment
CN112965602A (en) * 2021-03-22 2021-06-15 苏州惠显智能科技有限公司 Gesture-based human-computer interaction method and device
CN112965604A (en) * 2021-03-29 2021-06-15 深圳市优必选科技股份有限公司 Gesture recognition method and device, terminal equipment and computer readable storage medium
CN113253837A (en) * 2021-04-01 2021-08-13 作业帮教育科技(北京)有限公司 Air writing method and device, online live broadcast system and computer equipment
CN113058260B (en) * 2021-04-22 2024-02-02 杭州当贝网络科技有限公司 Method, system and storage medium for identifying motion of body feeling based on player image
CN113936338A (en) * 2021-12-15 2022-01-14 北京亮亮视野科技有限公司 Gesture recognition method and device and electronic equipment
CN113934307B (en) * 2021-12-16 2022-03-18 佛山市霖云艾思科技有限公司 Method for starting electronic equipment according to gestures and scenes
CN114265499A (en) * 2021-12-17 2022-04-01 交控科技股份有限公司 Interaction method and system applied to customer service terminal
CN115297263B (en) * 2022-08-24 2023-04-07 广州方图科技有限公司 Automatic photographing control method and system suitable for cube shooting and cube shooting
CN115576417A (en) * 2022-09-27 2023-01-06 广州视琨电子科技有限公司 Interaction control method, device and equipment based on image recognition

Citations (7)

Publication number Priority date Publication date Assignee Title
US20150253864A1 (en) * 2014-03-06 2015-09-10 Avago Technologies General Ip (Singapore) Pte. Ltd. Image Processor Comprising Gesture Recognition System with Finger Detection and Tracking Functionality
CN105045399A (en) * 2015-09-07 2015-11-11 哈尔滨市一舍科技有限公司 Electronic device with 3D camera assembly
CN105824406A (en) * 2015-11-30 2016-08-03 维沃移动通信有限公司 Photographing method and terminal
CN106682585A (en) * 2016-12-02 2017-05-17 南京理工大学 Dynamic gesture identifying method based on kinect 2
CN107239731A (en) * 2017-04-17 2017-10-10 浙江工业大学 A kind of gestures detection and recognition methods based on Faster R CNN
CN107846555A (en) * 2017-11-06 2018-03-27 深圳慧源创新科技有限公司 Automatic shooting method, device, user terminal and computer-readable storage medium based on gesture identification
CN111062312A (en) * 2019-12-13 2020-04-24 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control method, device, medium and terminal device

Family Cites Families (23)

Publication number Priority date Publication date Assignee Title
CN101324922B (en) * 2008-07-30 2012-04-18 北京中星微电子有限公司 Method and apparatus for acquiring fingertip track
IL204436A (en) * 2010-03-11 2016-03-31 Deutsche Telekom Ag System and method for hand gesture recognition for remote control of an internet protocol tv
CN102402680B (en) * 2010-09-13 2014-07-30 株式会社理光 Hand and indication point positioning method and gesture confirming method in man-machine interactive system
CN102467657A (en) * 2010-11-16 2012-05-23 三星电子株式会社 Gesture recognizing system and method
CN102200834B (en) * 2011-05-26 2012-10-31 华南理工大学 Television control-oriented finger-mouse interaction method
KR101302638B1 (en) * 2011-07-08 2013-09-05 더디엔에이 주식회사 Method, terminal, and computer readable recording medium for controlling content by detecting gesture of head and gesture of hand
CN102368290B (en) * 2011-09-02 2012-12-26 华南理工大学 Hand gesture identification method based on finger advanced characteristic
TWI454966B (en) * 2012-04-24 2014-10-01 Wistron Corp Gesture control method and gesture control device
CN102854982B (en) * 2012-08-01 2015-06-24 华平信息技术(南昌)有限公司 Method for recognizing customized gesture tracks
JP5665140B2 (en) * 2012-08-17 2015-02-04 Necソリューションイノベータ株式会社 Input device, input method, and program
CN104407694B (en) * 2014-10-29 2018-02-23 山东大学 The man-machine interaction method and device of a kind of combination face and gesture control
CN104809387B (en) * 2015-03-12 2017-08-29 山东大学 Contactless unlocking method and device based on video image gesture identification
CN104992192A (en) * 2015-05-12 2015-10-21 浙江工商大学 Visual motion tracking telekinetic handwriting system
CN105046199A (en) * 2015-06-17 2015-11-11 吉林纪元时空动漫游戏科技股份有限公司 Finger tip point extraction method based on pixel classifier and ellipse fitting
CN106971130A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of gesture identification method using face as reference
CN107679860A (en) * 2017-08-09 2018-02-09 百度在线网络技术(北京)有限公司 A kind of method, apparatus of user authentication, equipment and computer-readable storage medium
CN108229324B (en) * 2017-11-30 2021-01-26 北京市商汤科技开发有限公司 Gesture tracking method and device, electronic equipment and computer storage medium
CN109190461B (en) * 2018-07-23 2019-04-26 中南民族大学 A kind of dynamic gesture identification method and system based on gesture key point
CN110069126B (en) * 2018-11-16 2023-11-03 北京微播视界科技有限公司 Virtual object control method and device
CN109977791A (en) * 2019-03-04 2019-07-05 山东海博科技信息系统股份有限公司 A kind of hand physiologic information detection method
CN109977906B (en) * 2019-04-04 2021-06-01 睿魔智能科技(深圳)有限公司 Gesture recognition method and system, computer device and storage medium
CN110333785B (en) * 2019-07-11 2022-10-28 Oppo广东移动通信有限公司 Information processing method and device, storage medium and augmented reality equipment
CN110490165B (en) * 2019-08-26 2021-05-25 哈尔滨理工大学 Dynamic gesture tracking method based on convolutional neural network

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US20150253864A1 (en) * 2014-03-06 2015-09-10 Avago Technologies General Ip (Singapore) Pte. Ltd. Image Processor Comprising Gesture Recognition System with Finger Detection and Tracking Functionality
CN105045399A (en) * 2015-09-07 2015-11-11 哈尔滨市一舍科技有限公司 Electronic device with 3D camera assembly
CN105824406A (en) * 2015-11-30 2016-08-03 维沃移动通信有限公司 Photographing method and terminal
CN106682585A (en) * 2016-12-02 2017-05-17 南京理工大学 Dynamic gesture identifying method based on kinect 2
CN107239731A (en) * 2017-04-17 2017-10-10 浙江工业大学 A kind of gestures detection and recognition methods based on Faster R CNN
CN107846555A (en) * 2017-11-06 2018-03-27 深圳慧源创新科技有限公司 Automatic shooting method, device, user terminal and computer-readable storage medium based on gesture identification
CN111062312A (en) * 2019-12-13 2020-04-24 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control method, device, medium and terminal device

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN113469017A (en) * 2021-06-29 2021-10-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment
WO2023273071A1 (en) * 2021-06-29 2023-01-05 北京市商汤科技开发有限公司 Image processing method and apparatus and electronic device
CN113808007A (en) * 2021-09-16 2021-12-17 北京百度网讯科技有限公司 Method and device for adjusting virtual face model, electronic equipment and storage medium
CN115097936A (en) * 2022-06-16 2022-09-23 慧之安信息技术股份有限公司 Display screen control method based on gesture action deep learning
CN115565253A (en) * 2022-12-08 2023-01-03 Ji Hua Laboratory Dynamic gesture real-time recognition method and device, electronic equipment and storage medium
CN115565253B (en) * 2022-12-08 2023-04-18 Ji Hua Laboratory Dynamic gesture real-time recognition method and device, electronic equipment and storage medium
CN115830642A (en) * 2023-02-13 2023-03-21 Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) 2D whole-body human keypoint labeling method and 3D human mesh labeling method
CN115830642B (en) * 2023-02-13 2024-01-12 Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) 2D whole-body human keypoint labeling method and 3D human mesh labeling method

Also Published As

Publication number Publication date
CN111062312A (en) 2020-04-24
CN111062312B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2021115181A1 (en) Gesture recognition method, gesture control method, apparatuses, medium and terminal device
US10168794B2 (en) Motion-assisted visual language for human computer interfaces
US10429944B2 (en) System and method for deep learning based hand gesture recognition in first person view
CN109240576B (en) Image processing method and device in game, electronic device and storage medium
JP7073522B2 (en) Methods, devices, devices and computer readable storage media for identifying aerial handwriting
CN110209273B (en) Gesture recognition method, interaction control method, device, medium and electronic equipment
US11430265B2 (en) Video-based human behavior recognition method, apparatus, device and storage medium
WO2019128508A1 (en) Method and apparatus for processing image, storage medium, and electronic device
WO2019120290A1 (en) Dynamic gesture recognition method and device, and gesture interaction control method and device
US9104242B2 (en) Palm gesture recognition method and device as well as human-machine interaction method and apparatus
CN108960163B (en) Gesture recognition method, device, equipment and storage medium
WO2020228643A1 (en) Interactive control method and apparatus, electronic device and storage medium
US10990226B2 (en) Inputting information using a virtual canvas
CN109919077B (en) Gesture recognition method, device, medium and computing equipment
EP4053735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN114138121B (en) User gesture recognition method, device and system, storage medium and computing equipment
WO2021204037A1 (en) Detection method and apparatus for facial key point, and storage medium and electronic device
CN113014846B (en) Video acquisition control method, electronic equipment and computer readable storage medium
CN111399638A (en) Computer for blind users and assistive control method for a smartphone adapted thereto
JP2001016606A (en) Motion recognition system and recording medium storing motion recognition program
WO2021203368A1 (en) Image processing method and apparatus, electronic device and storage medium
CN108227923A (en) Virtual touch control system and method based on somatosensory technology
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN109241942B (en) Image processing method and device, face recognition equipment and storage medium
WO2020224127A1 (en) Video stream capturing method and apparatus, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20897997
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20897997
Country of ref document: EP
Kind code of ref document: A1