WO2023273372A1 - Gesture recognition object determination method and apparatus - Google Patents


Info

Publication number
WO2023273372A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
image
potential
target user
gesture recognition
Prior art date
Application number
PCT/CN2022/078623
Other languages
French (fr)
Chinese (zh)
Inventor
Huang Yunzhen
Wang Hao
Li Donghu
Leng Jinan
Chang Sheng
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111034365.2A external-priority patent/CN115565241A/en
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023273372A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer

Definitions

  • the present application relates to the field of computer vision, in particular to a method and device for determining a gesture recognition object.
  • gesture recognition is a very important way of human-computer interaction.
  • Gesture recognition technology uses various sensors to model the shape and displacement of the hand (arm), forms an information sequence, and then converts the information sequence into corresponding instructions to control certain operations.
  • the present application provides a gesture recognition object determination method and device.
  • a method for determining a gesture recognition object is provided.
  • the method can be applied to general computing devices.
  • the method includes: determining one or more potential users in the shooting area according to multiple frames of first images obtained by shooting the shooting area with the camera, where a potential user is a user whose face image is included in every frame of the multiple frames of first images.
  • the hand movement of a potential user is determined according to the regions to be recognized corresponding to that potential user in the multiple frames of first images, where the region to be recognized corresponding to the potential user includes the hand image of the potential user.
  • a target user among one or more potential users is determined as a gesture recognition object, and a hand movement of the target user matches a preset gesture.
  • a user whose face image exists in every frame of the multiple frames of images captured by the camera and whose hand movement matches a preset gesture is determined as the gesture recognition object within the camera's shooting area. Based on the images captured by the camera, the gesture recognition object can be determined automatically, and gesture recognition can then be performed on that object to realize air gesture operations. The method is simple to implement and is suitable for gesture recognition in various scenarios, especially multi-user scenarios.
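The determination flow described above can be sketched in a few lines of code. This is an illustrative outline only, not the patent's implementation: `detect_faces`, `hand_motion_of`, and `matches` are hypothetical stand-ins for the face detection, keypoint-based motion estimation, and gesture-matching steps described later in the document.

```python
def pick_gesture_target(first_images, preset_gesture, detect_faces,
                        hand_motion_of, matches):
    """Return a potential user whose hand motion matches the preset gesture."""
    # A potential user must have a face image in every first image.
    per_frame_ids = [set(detect_faces(img)) for img in first_images]
    potential = set.intersection(*per_frame_ids) if per_frame_ids else set()
    for user in sorted(potential):
        # A matching potential user becomes the gesture recognition object.
        if matches(hand_motion_of(user, first_images), preset_gesture):
            return user
    return None
```

In the patent's fuller scheme, ties between several matching users are broken by distance to the camera; this sketch simply returns the first match.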
  • the above method further includes: acquiring the region to be recognized corresponding to the target user in multiple frames of second images, where the region to be recognized corresponding to the target user includes the target user's hand image, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object.
  • Gesture recognition is performed on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images.
  • within a period of time, gesture recognition is performed only on the target user serving as the gesture recognition object, and not on any other user.
  • locking recognition onto one user's gestures for a period of time avoids the problem of users' gestures interfering with each other and making accurate gesture control impossible.
  • the preset gesture includes the initial part of the gesture to be recognized
  • the realization process of performing gesture recognition on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images includes: determining whether the target user performs the gesture to be recognized according to the regions to be recognized corresponding to the target user in the multiple frames of first images and in the multiple frames of second images.
  • the initial part of the gesture to be recognized is used as the preset gesture used to determine the gesture recognition object.
  • the gesture to be recognized can be performed directly in the camera's shooting area, without first performing a separate wake-up gesture to enable the device's gesture recognition function; the gesture recognition object is thus determined without the user's awareness, which simplifies user operations and improves user experience.
  • the implementation process of acquiring the region to be recognized corresponding to the target user in multiple frames of second images includes: determining the position of the target user's face image in the second image according to the saved face information of the target user, and determining the region to be recognized corresponding to the target user in the second image according to that position.
  • the face information of the target user can be saved, so that the hand movement of the target user can be associated with the target user's face information, hand tracking of the target user can be realized, and gesture recognition of the target user can then be performed.
  • while the target user serves as the gesture recognition object, the camera continues to shoot the shooting area.
  • when the number of captured images that do not include the target user's face image exceeds a count threshold, or the duration for which the target user has been the gesture recognition object exceeds a duration threshold, the target user ceases to be the gesture recognition object.
  • at most one gesture recognition object can be determined at a time. Since the gesture recognition object may change over time, setting conditions for ending the target user's role as the gesture recognition object satisfies the application scenario's need for the gesture recognition object to change flexibly.
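The end-of-lock conditions described above (too many frames without the target's face, or the lock lasting too long) might be tracked as follows. The class name and threshold values are illustrative assumptions, not values from the application:

```python
MISS_FRAME_THRESHOLD = 10   # frames without the target's face (assumed value)
MAX_LOCK_DURATION = 30.0    # seconds a user may stay locked (assumed value)

class GestureLock:
    """Tracks whether the target user should remain the gesture recognition object."""

    def __init__(self, user_id, start_time):
        self.user_id = user_id
        self.start_time = start_time
        self.missed_frames = 0

    def update(self, face_visible, now):
        """Return True while the lock should be kept, False to release it."""
        if face_visible:
            self.missed_frames = 0
        else:
            self.missed_frames += 1
        if self.missed_frames > MISS_FRAME_THRESHOLD:
            return False           # face absent from too many images
        if now - self.start_time > MAX_LOCK_DURATION:
            return False           # lock duration exceeded
        return True
```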
  • the above method further includes: determining the face image positions of the potential users in the first image. According to the position of the face image of the potential user in the first image, a region to be recognized corresponding to the potential user in the first image is determined.
  • the implementation process of determining a target user among the one or more potential users as the gesture recognition object includes: when the hand movements of multiple potential users in the shooting area match the preset gesture, setting the potential user closest to the camera as the target user.
  • the above method further includes: acquiring a distance from the potential users to the camera.
  • a distance prompt is output, which is used to prompt the potential user to approach the camera.
  • a distance prompt is output to remind potential users to approach the camera. A potential user who wants to perform air gesture operations can approach the camera according to the prompt, which improves the accuracy of determining the gesture recognition object and, in turn, the recognition accuracy of the gesture recognition object's air gestures.
  • the implementation process of obtaining the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the focal length of the camera is f
  • the distance between the eyes of the user in the image (that is, the imaging plane) captured by the camera containing the user's frontal face image is M
  • the preset distance between the eyes of the user is K
  • the distance between the user and the camera is d
  • regardless of whether the camera is a monocular camera, a binocular camera, or a camera integrated with a depth sensor, the distance from the user to the camera can be determined based on the principle of similar triangles: d = f·K/M. The calculation method is simple and the implementation cost is low.
  • the implementation process of determining the hand movements of the potential users includes: performing key point detection on the regions to be recognized corresponding to the potential users in the multiple frames of first images, respectively, to obtain multiple sets of hand key point information of the potential users, and determining the hand motion of each potential user according to that user's multiple sets of hand key point information.
  • the region to be identified corresponding to the potential user also includes an elbow image of the potential user. In this case, the implementation process of determining the hand movement of the potential user according to the regions to be identified in the multiple frames of first images includes: performing key point detection on those regions respectively to obtain multiple sets of hand and elbow key point information of the potential user, and determining the hand movement of the potential user according to the multiple sets of hand and elbow key point information.
  • this application uses the moving direction of the key points of the user's elbow to assist in judging the moving direction of the user's hand, which improves the accuracy of judging the user's hand motion.
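As an illustration of how elbow key points could assist the hand-direction judgment, the sketch below compares the dominant movement direction of a wrist track and an elbow track and accepts the hand direction only when the two agree. This is a simplified, hypothetical stand-in for the keypoint-based motion estimation, not the patent's algorithm:

```python
def movement_direction(points):
    """Dominant left/right/up/down direction of a keypoint track of (x, y) pairs."""
    dx = points[-1][0] - points[0][0]
    dy = points[-1][1] - points[0][1]
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

def hand_motion(wrist_track, elbow_track):
    """Return the hand direction only when the elbow direction agrees, else None."""
    hand_dir = movement_direction(wrist_track)
    elbow_dir = movement_direction(elbow_track)
    return hand_dir if hand_dir == elbow_dir else None
```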
  • the above face image is a frontal face image. That is, every frame of the multiple frames of first images includes the frontal face image of the potential user.
  • this application can also exclude users in the shooting area who are not facing the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
  • an apparatus for determining an object for gesture recognition includes a plurality of functional modules, and the plurality of functional modules interact to implement the methods in the above first aspect and various implementation manners thereof.
  • the multiple functional modules can be implemented based on software, hardware or a combination of software and hardware, and the multiple functional modules can be combined or divided arbitrarily based on specific implementations.
  • a gesture recognition object determination device including: a processor and a memory;
  • the memory is used to store a computer program, and the computer program includes program instructions
  • the processor is configured to invoke the computer program to implement the methods in the above first aspect and various implementation manners thereof.
  • In a fourth aspect, a computer-readable storage medium is provided. Instructions are stored on the computer-readable storage medium; when the instructions are executed by a processor, the methods in the above first aspect and its various implementation manners are realized.
  • a computer program product including a computer program is provided.
  • when the computer program is executed by a processor, the method in the above first aspect and its various implementation manners is implemented.
  • a chip is provided, and the chip includes a programmable logic circuit and/or program instructions, and when the chip is running, implements the method in the above first aspect and various implementation manners thereof.
  • FIG. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for determining a gesture recognition object provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a ranging principle provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the distribution of hand key points provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another gesture recognition object determination device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • gestures can express rich information in a non-contact manner
  • gesture recognition is widely used in human-computer interaction, smart phones, smart TVs and other products.
  • vision-based gesture recognition technology does not require wearing additional sensors or markers on the hand, which is convenient and has broad application prospects in human-computer interaction.
  • the gestures mentioned in this application all refer to non-contact gestures, that is, air gestures.
  • For example, in a conference room scenario, participants can perform air gesture operations such as page up, page down, page left, page right, and screenshot on the display screen of the conference terminal.
  • family members can perform air gesture operations such as fast forward, rewind, turn up the volume, turn down the volume, and pause on the playback screen on the smart TV.
  • a teacher or a student may perform air gesture operations such as scrolling up and scrolling down on the displayed content on the display device.
  • when gesture recognition is performed based on the images collected by the camera, the gestures of multiple users may easily be recognized at the same time.
  • the display device may then be unable to distinguish which user is performing the air gesture operation; users' gestures interfere with each other, so the display device cannot realize accurate gesture control.
  • this application proposes a solution for determining the gesture recognition object: face detection is performed on multiple frames of images captured by the camera to identify potential users in the shooting area, and the gesture recognition object is then determined among those potential users by judging their hand movements.
  • that is, a user whose face image exists in every frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture can be determined as the gesture recognition object within the camera's shooting area.
  • this application can automatically determine the gesture recognition object based on the images captured by the camera, and can further perform gesture recognition on that object to realize air gesture operations. It is suitable for gesture recognition in various scenarios, especially multi-user scenarios, and is simple to implement.
  • gestures of users other than the gesture recognition object will not be recognized, which avoids the problem of users' gestures interfering with each other and preventing accurate gesture control.
  • the solution of this application can also exclude users in the shooting area who are not facing the camera and determine the gesture recognition object only among the users facing the camera. Specifically, a user whose front face image exists in every frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture can be determined as the gesture recognition object within the camera's shooting area. In this way, the probability of misjudging the gesture recognition object is reduced. To further improve determination accuracy, the operation manual of a display device supporting the gesture control function can also clearly state that the user should face the camera when performing air gesture operations.
  • Facing the camera referred to in this application does not mean that the face is completely facing the camera, but there may be a deviation within a set range.
  • "the face completely facing the camera" can mean that the line connecting the eyes is parallel to the imaging plane of the camera. If the face deflection angle when the face is completely facing the camera is 0°, then in this application, facing the camera means that the deflection angle of the face relative to the camera is within a set range; that is, if the user's face deflection angle is within that range, the user is considered to be facing the camera. For example, users whose face deflection angle is within -30° to 30° can be regarded as facing the camera; this range is given only as an example of the deflection angle range used to determine whether a user is facing the camera. In this application, the face of a user facing the camera is called the front face.
  • the method for determining a gesture recognition object may be applied to a general-purpose computing device.
  • the general computing device may be a display device or a post-processing terminal connected to the display device.
  • the display device supports a gesture control function.
  • the display device has a built-in camera, or the display device is connected to an external camera. The camera is used to take pictures of the shooting area to obtain images.
  • the display device or the post-processing terminal connected to the display device is used to determine the gesture recognition object in the shooting area according to the image captured by the camera, and further perform gesture recognition on the gesture recognition object to respond to the gesture operation in the air.
  • the deployment orientation of the camera is generally consistent with the deployment orientation of the display device, and the shooting area of the camera generally includes the area toward which the display surface of the display device faces.
  • the post-processing end may be a server, or a server cluster composed of multiple servers, or a cloud computing platform.
  • the display device can be a conference terminal such as a large screen or an electronic whiteboard.
  • the display device may be a smart TV, a projection device, or a VR device.
  • FIG. 1 is a schematic diagram of an application scenario involved in a method for determining a gesture recognition object provided in an embodiment of the present application.
  • the application scenario is a conference room scenario.
  • the application scenario includes a conference terminal, and the conference terminal has a built-in camera.
  • the conference terminal is installed on the wall.
  • the camera's field of view includes the conference table and several attendees.
  • when the gesture control function of the conference terminal is turned on, the camera continuously photographs the shooting area, and the conference terminal, or the post-processing terminal (not shown in the figure) connected to it, processes the captured images to determine whether a gesture recognition object exists in the shooting area.
  • Fig. 2 is a schematic flowchart of a method for determining a gesture recognition object provided by an embodiment of the present application. As shown in Figure 2, the method includes:
  • Step 201: Determine one or more potential users in the shooting area according to multiple frames of first images obtained by shooting the shooting area with the camera.
  • each frame of the first image in the multiple frames of the first image includes the face image of the potential user. That is to say, a user whose face image exists in each frame of the first image in multiple frames is regarded as a potential user in the shooting area.
  • the number of frames of first images used to determine potential users is pre-configured.
  • for example, the multiple frames of first images may be 3 frames, 5 frames, or 10 frames.
  • the number of frames of first images is not limited in this embodiment of the present application.
  • face detection is performed on the multiple frames of first images respectively to obtain the face images in each frame. It is then determined which face images in different first images belong to the same user, and finally which users' face images exist in every frame of the multiple frames of first images, thereby obtaining the potential users in the shooting area.
  • for example, face detection may be performed based on multi-task cascaded convolutional networks (MTCNN).
  • MTCNN includes three cascaded networks: a proposal network (P-Net), a refinement network (R-Net), and an output network (O-Net).
  • the process of face detection of images based on MTCNN includes:
  • the image pyramid includes multiple images of different sizes obtained by scaling the original image. Since there may be face images of different sizes in the original image, by establishing an image pyramid, face images of different sizes in the original image can be detected at a uniform size, and the robustness of the network to face images of different sizes can be enhanced.
  • the image pyramid is input into the three cascaded networks (P-Net, R-Net, O-Net), and the face image in the image is detected from coarse to fine through the three cascaded networks, and finally Output the face detection result.
  • P-Net is used to regress multiple detection frames for the input image, map these detection frames back to the original image, and remove redundant frames through the non-maximum suppression (NMS) algorithm to obtain preliminary face detection results.
  • R-Net is used to further refine and filter the face detection results output by P-Net.
  • O-Net is used to further refine and filter the face detection results output by R-Net, and output the final face detection results.
  • the face detection result obtained based on the MTCNN includes face detection frame information and face key point information corresponding to each detected face image.
  • the face detection frame information may include the coordinates of the upper left corner and the lower right corner of the face detection frame, and the face image is located in the face detection frame.
  • the human face key point information may include coordinates of multiple human face key points, and the multiple human face key points may include left eye, right eye, nose, left mouth corner, right mouth corner, and the like.
  • the intersection over union (IoU) value of the face detection frames in every two adjacent frames of first images among the multiple frames of first images may be calculated separately.
  • the IoU value here may be equal to the ratio of the intersection area to the union area of the face detection frames in the two adjacent frames of first images after the two frames are superimposed.
  • the IoU value ranges from 0 to 1.
  • assuming two adjacent first images are image A and image B, when the IoU value of the first face detection frame in image A and the second face detection frame in image B is greater than a preset threshold, it can be determined that the face image in the first face detection frame and the face image in the second face detection frame belong to the same user.
  • when the multiple frames of first images are consecutive frames collected by the camera, the preset threshold may take a larger value, for example 0.8; if the multiple frames of first images include frames collected by the camera at intervals, the preset threshold may take a smaller value, for example 0.6.
  • the embodiment of the present application does not limit the specific value of the preset threshold.
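The IoU computation used for associating face detection frames across adjacent frames can be written directly from its definition (intersection area over union area of the superimposed boxes). A minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def same_user(box_a, box_b, threshold=0.8):
    """Treat two detections as the same user when IoU exceeds the threshold."""
    return iou(box_a, box_b) > threshold
```

The 0.8 default mirrors the example threshold for consecutive frames given above.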
  • alternatively, it may be determined which face images in different first images belong to the same user by calculating the face similarity between the face images in the multiple frames of first images.
  • the same user identifier may be used to identify face images belonging to the same user in different images, and different user identifiers may be used to identify face images belonging to different users in the same image. If each frame of the first image in the plurality of frames of first images includes a face image identified by the same user identifier, then the user represented by the user identifier is determined as a potential user.
  • the user identifiers used here only need to distinguish different users, for example, numbers, characters or other identifiers may be used as user identifiers.
  • since the scheme of this application does not need to identify the user's identity but only to distinguish different users, there is no need to preset the gesture recognition objects that may exist in the scene, and the scheme can be flexibly applied to various multi-user scenarios, especially those with changeable user groups, such as public conference rooms.
  • the embodiment of the present application can also exclude users in the shooting area who are not facing the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
  • in this case, the face image of the potential user included in each frame of the multiple frames of first images is a frontal face image.
  • that is, every frame of the multiple frames of first images includes the frontal face image of the potential user, and a user whose frontal face image exists in every frame of the multiple frames of first images is regarded as a potential user in the shooting area.
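Given per-frame sets of detected front-face user identifiers, selecting potential users reduces to a set intersection across frames, as in this minimal sketch (the user-ID assignment via IoU or face similarity is assumed to have already happened):

```python
def potential_users(frames):
    """Users whose front-face ID appears in every frame.

    `frames` is a list where each entry is the set of user IDs whose front
    face was detected in that frame; a user qualifies as a potential user
    only if present in all of them.
    """
    if not frames:
        return set()
    present = set(frames[0])
    for ids in frames[1:]:
        present &= set(ids)   # drop users missing from any frame
    return present
```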
  • the face image may be input into a pre-trained classification model to obtain a classification result output by the classification model, and the classification result indicates whether the input face image belongs to a frontal face or a side face.
  • the classification model can be trained through supervised learning based on the training sample set.
  • the training sample set may include a large number of sample face images, and each sample face image is marked with a label, and the label indicates whether the sample face image belongs to a frontal face or a side face.
  • a lightweight deep neural network, MobileNetV2, can be used to build the binary classification model.
  • MobileNetV2 is often used for classification tasks on mobile terminals such as mobile phones. After a face image is input to MobileNetV2, it outputs the classification result.
  • there are two classification results, which can be represented by 0 and 1 respectively: 0 can indicate that the input face image belongs to a side face, and 1 can indicate that it belongs to a front face.
  • alternatively, a face deflection angle range can be set in advance; if the user's face deflection angle is within this range, the user is considered to be facing the camera, that is, the user's face image in the image is a frontal face image.
  • face pose estimation may be performed based on the face image to obtain the face deflection angle of the user to whom the face image belongs. If the face deflection angle of the user to whom the face image belongs is within the preset range of face deflection angles, it is determined that the face image is a front face image, otherwise it is determined that the face image is a side face image.
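The angle-based frontal-face test reduces to a range check on the estimated face deflection (yaw) angle; the ±30° default below echoes the example range given earlier and is only illustrative:

```python
def is_frontal(yaw_degrees, limit=30.0):
    """Treat a face as frontal when its yaw deflection is within ±limit degrees."""
    return -limit <= yaw_degrees <= limit

def frontal_users(yaws_by_user, limit=30.0):
    """Keep only users whose estimated face yaw marks them as facing the camera."""
    return {uid for uid, yaw in yaws_by_user.items() if is_frontal(yaw, limit)}
```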
  • the distance from the potential users to the camera may also be obtained.
  • a distance prompt is output. The distance prompt is used to prompt potential users to move closer to the camera. A potential user who wants to perform air gesture operations can approach the camera according to the prompt, which improves the accuracy of determining the gesture recognition object and, in turn, the recognition accuracy of the gesture recognition object's air gestures.
  • when the solution of the present application is executed by a display device, outputting a distance prompt may mean that the display device displays the distance prompt.
  • when the solution of the present application is executed by a post-processing end connected to a display device, the post-processing end outputs the distance prompt by sending it to the connected display device, which then displays it.
  • the implementation process of obtaining the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the distance between the eyes of the potential user in the first image may be the distance between the eyes of the potential user in the first image including the front face image of the potential user.
  • the preset distance between the eyes of the user is a preset fixed value. Since the difference between the actual binocular distances of different users is small, an average value of the actual binocular distances of multiple users may be selected as the preset user binocular distance.
  • FIG. 3 is a schematic diagram of a ranging principle provided in an embodiment of the present application.
  • assuming that the focal length of the camera is f, the distance between the user's eyes in the image (that is, on the imaging plane) captured by the camera and containing the user's front face image is M, the preset user inter-eye distance is K, and the distance from the user to the camera is d, then by similar triangles d = f × K / M.
  • this ranging approach can be used regardless of whether the camera is a monocular camera or a binocular camera, and regardless of whether a depth sensor is integrated.
  • the distance from the user to the camera can be determined based on the principle of similar triangles. The calculation method is simple and the implementation cost is low.
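For illustration only, the similar-triangles computation described above can be sketched as follows; the focal length, the image eye distance, and the 63 mm preset inter-eye distance in the example are hypothetical values, not taken from the embodiment:

```python
def estimate_distance(focal_length_px: float,
                      eye_distance_px: float,
                      preset_eye_distance_mm: float = 63.0) -> float:
    """Estimate the user-to-camera distance by similar triangles:
    M / K = f / d  =>  d = f * K / M."""
    if eye_distance_px <= 0:
        raise ValueError("eye distance in the image must be positive")
    return focal_length_px * preset_eye_distance_mm / eye_distance_px

# Example: focal length 1000 px, eyes 42 px apart in the image,
# preset inter-eye distance 63 mm -> d = 1000 * 63 / 42 = 1500 mm
print(estimate_distance(1000.0, 42.0))  # 1500.0
```

With f and M expressed in pixels and K in millimeters, d is obtained in millimeters.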
  • the distance from the potential user to the camera may also be calculated based on a binocular ranging principle.
  • the distance from the potential user to the camera may also be obtained by measuring the depth sensor.
  • the depth sensor may be an ultrasonic radar, a millimeter-wave radar, a laser radar, or a structured light sensor, which is not limited in this embodiment of the present application. It should be understood that the depth sensor may also be other devices capable of measuring distances.
  • Step 202 Obtain areas to be identified corresponding to potential users in multiple frames of the first image respectively, where the areas to be identified corresponding to potential users include hand images of the potential users.
  • each frame of the first image has a region to be identified corresponding to a potential user.
  • the region to be identified corresponding to the potential user further includes an elbow image of the potential user.
  • the region to be identified in the image involved in the embodiment of the present application is a region of interest (ROI) in the image, that is, the region in the image that needs to be processed.
  • the face images of the potential users respectively in the multiple frames of the first images can be obtained.
  • the implementation process of step 202 may include: determining the position of the face image of the potential user in the first image, and, according to the position of the face image of the potential user in the first image, determining the area to be identified corresponding to the potential user in the first image.
  • the area to be identified corresponding to the potential user may include not only the hand image of the potential user, but also the face image of the potential user.
  • FIG. 4 is a schematic diagram of an image provided by an embodiment of the present application.
  • the image includes a human body image of user A, a human body image of user B, a human body image of user C, and a human body image of user D.
  • the human body images of user A and user B include front face images
  • the human body images of user C and user D include side face images.
  • the face imaging area A1 of user A in the image can be expanded, and the area to be recognized (area A2) corresponding to user A in the image can be obtained by cropping; similarly, the face imaging area B1 of user B in the image is expanded, and the area to be recognized (area B2) corresponding to user B in the image is cropped out.
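As an illustrative sketch of the cropping described for FIG. 4, expanding a face detection box into an area to be recognized might look like the following; the expansion factors are assumptions, since the embodiment does not specify them:

```python
def face_box_to_roi(face_box, img_w, img_h,
                    widen=2.0, up=0.5, down=3.0):
    """Expand a face box (x, y, w, h) into an area to be recognized.

    The area is widened sideways and extended downward so that it is
    likely to also contain the user's hand (and elbow) image.  The
    expansion factors here are illustrative assumptions, clipped to
    the image bounds.
    """
    x, y, w, h = face_box
    x1 = max(0, int(x - widen * w))
    y1 = max(0, int(y - up * h))
    x2 = min(img_w, int(x + w + widen * w))
    y2 = min(img_h, int(y + h + down * h))
    return x1, y1, x2, y2

# A 100x100 face at (400, 100) in a 1280x720 image:
print(face_box_to_roi((400, 100, 100, 100), 1280, 720))  # (200, 50, 700, 500)
```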
  • Step 203 Determine the hand movements of the potential user according to the regions to be identified corresponding to the potential user in the multiple frames of the first image.
  • the implementation process of step 203 may include: respectively performing key point detection on regions to be identified corresponding to the potential user in multiple frames of the first image to obtain multiple sets of hand key point information of the potential user. According to the multiple sets of hand key point information of the potential user, the hand motion of the potential user is determined.
  • performing key point detection on the areas to be identified corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information may mean: for each frame of first image, performing key point detection in the area to be identified corresponding to the potential user, so as to obtain one set of hand key point information of the potential user.
  • the set of hand key point information includes the positions of multiple hand key points and the connection relationship among the multiple hand key points.
  • Each hand key point represents a specific part of the hand.
  • FIG. 5 is a schematic diagram of distribution of key points of a hand provided in an embodiment of the present application.
  • the 21 hand key points may include: wrist (0), thumb carpometacarpal joint (1), thumb metacarpophalangeal joint (2), thumb interphalangeal joint (3), thumb fingertip (4), index finger metacarpophalangeal joint (5), index finger proximal interphalangeal joint (6), index finger distal interphalangeal joint (7), index fingertip (8), middle finger metacarpophalangeal joint (9), middle finger proximal interphalangeal joint (10), middle finger distal interphalangeal joint (11), middle fingertip (12), ring finger metacarpophalangeal joint (13), ring finger proximal interphalangeal joint (14), ring finger distal interphalangeal joint (15), ring fingertip (16), little finger metacarpophalangeal joint (17), little finger proximal interphalangeal joint (18), little finger distal interphalangeal joint (19), and little fingertip (20).
  • the above 21 hand key points may be detected, or more or less hand key points may be detected.
  • a key point detector based on a deep neural network can be used to detect key points in the area to be recognized, and the key point detector can be implemented based on heatmap technology.
  • the key point detector can perform key point detection on the area to be recognized in a bottom-up manner. Assuming that the detection target includes 21 hand key points, a heat map containing 21 channels can be generated, where each channel is a probability map (thermal distribution map) of one hand key point; each number in the probability map represents the probability that the corresponding position is that hand key point, and the closer the number is to 1, the higher the probability.
  • a vector map containing 21*2 channels is generated, where every two channels contain the position information (two-dimensional information) of one hand key point. From this, the positions of the hand key points can be obtained. Further, the key point detector connects the detected hand key points based on partial affinity fields (PAF), so that the connection relationship among the multiple hand key points can be obtained.
  • the shape change and/or displacement of the potential user's hand can be determined according to the multiple sets of hand key point information, and the potential user's hand motion can then be determined.
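A minimal sketch of the heat-map decoding described above, assuming a NumPy array of per-keypoint probability maps; the grouping of points via the PAF vector maps is omitted here, and the 0.5 threshold is an illustrative assumption:

```python
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray, threshold: float = 0.5):
    """Decode a (num_keypoints, H, W) heat map into keypoint positions.

    For each channel, the location of the maximum probability is taken
    as that hand key point; channels whose peak falls below `threshold`
    are reported as missing (None).
    """
    points = []
    for channel in heatmaps:
        idx = np.unravel_index(np.argmax(channel), channel.shape)
        conf = channel[idx]
        # (row, col) -> (x, y)
        points.append((int(idx[1]), int(idx[0])) if conf >= threshold else None)
    return points

# Two toy 4x4 channels: keypoint 0 is confident at (x=2, y=1); keypoint 1 is weak.
hm = np.zeros((2, 4, 4), dtype=np.float32)
hm[0, 1, 2] = 0.9
hm[1, 3, 3] = 0.2
print(decode_heatmaps(hm))  # [(2, 1), None]
```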
  • the region to be identified corresponding to the potential user in the first image may include a hand image and an elbow image of the potential user.
  • the implementation process of step 203 may include: performing key point detection on the areas to be identified corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand and elbow key point information of the potential user; and determining the potential user's hand movement according to the multiple sets of hand and elbow key point information.
  • Step 204 Determine a target user among one or more potential users as a gesture recognition object, and the target user's hand motion matches a preset gesture.
  • the preset gesture includes an initial part of the gesture to be recognized. For example, if a complete gesture to be recognized requires 10 frames of images to be determined, then the gesture corresponding to the first 3 of those 10 frames can be selected as the preset gesture. In this way, when the user needs to perform an air gesture operation, the gesture to be recognized can be performed directly in the shooting area of the camera, without performing another specific wake-up gesture to enable the gesture recognition function of the device; the gesture recognition object can thus be determined without the user's perception, which simplifies user operations and improves user experience.
  • the gesture to be recognized is a gesture that is preconfigured in the display device and can be converted into a control instruction.
  • the gestures to be recognized pre-configured on the conference terminal may include a page-up gesture, a page-down gesture, a page-turning gesture to the left, a page-turning gesture to the right, and a screenshot gesture.
  • the potential user closest to the camera is used as the target user.
  • At most one gesture recognition object can be determined at a time, and the gesture recognition object may change over time. The closer the user is to the camera, the higher the probability of being determined as a gesture recognition object.
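Selecting the target user when multiple potential users match the preset gesture reduces, per the text, to taking the potential user closest to the camera. A minimal sketch, assuming a hypothetical mapping from user identifier to measured camera distance:

```python
def pick_target(candidates: dict):
    """Among potential users whose hand movement matched the preset
    gesture, choose the one closest to the camera as the target user.

    `candidates` maps a user identifier to its distance to the camera
    (the data shape is an illustrative assumption).
    """
    return min(candidates, key=candidates.get) if candidates else None

print(pick_target({"user_a": 2.4, "user_b": 1.1, "user_c": 3.0}))  # user_b
```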
  • the number threshold can be set to 3 frames, that is, when more than 3 frames of images captured by the camera do not include the face image of the target user, the target user will no longer be used as the gesture recognition object, and the process of determining a gesture recognition object will be executed again.
  • the duration threshold is a preset aging time, for example 20 seconds. That is, the maximum effective duration of each determined gesture recognition object is 20 seconds; beyond 20 seconds, the determined gesture recognition object becomes invalid and the gesture recognition object needs to be re-determined, so as to meet the flexible and changing requirements for gesture recognition objects in application scenarios.
  • the condition for ending the target user's role as the gesture recognition object may also be that, after becoming the gesture recognition object, the target user does not make a correct gesture to be recognized within a certain period of time (a value less than the aging time), or it is detected that the target user's hand is put down, or that the target user's hand remains still (which can exclude the case of a user maliciously occupying the gesture recognition object role), and so on. The embodiment of the present application does not limit the condition for ending the target user's role as the gesture recognition object.
  • the condition for ending the user's role as the gesture recognition object is: after the user becomes the gesture recognition object, the number of images captured of the shooting area by the camera that do not include the user's face image exceeds the number threshold.
  • the gesture recognition object determination method provided in the embodiment of the present application can be implemented as follows: in the process of determining the gesture recognition object, if there are 3 frames of images that each include the user's front face image, and based on these 3 frames of images it is determined that the user's hand movement matches the preset gesture, then the user can be determined as the gesture recognition object, and gesture recognition will subsequently be performed on the user.
  • After the user is determined as the gesture recognition object, the device also detects in real time whether subsequently collected images include the user's front face image. If the number of images that do not include the user's front face image reaches a certain number, the user ceases to be the gesture recognition object, and the process of determining a gesture recognition object is restarted.
  • the process of determining the gesture recognition object can be executed when there is no gesture recognition object in the shooting area of the camera. That is, after the gesture recognition object is determined, the display device or the post-processing terminal connected to the display device can stop executing the process of determining a gesture recognition object until the last determined gesture recognition object becomes invalid.
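The two end conditions above (missed-frame count and aging time) can be sketched as a small tracker. The thresholds of 3 missed frames and a 20-second aging time follow the examples in the text; the class structure itself is an illustrative assumption:

```python
import time

class GestureTargetTracker:
    """Track the lifetime of the current gesture recognition object."""

    def __init__(self, miss_threshold=3, aging_seconds=20.0,
                 clock=time.monotonic):
        self.miss_threshold = miss_threshold
        self.aging_seconds = aging_seconds
        self.clock = clock
        self.target = None

    def set_target(self, user_id):
        self.target = user_id
        self.missed = 0
        self.start = self.clock()

    def update(self, face_visible: bool) -> bool:
        """Call once per captured frame; returns True while the target is valid."""
        if self.target is None:
            return False
        self.missed = 0 if face_visible else self.missed + 1
        if self.missed > self.miss_threshold or \
           self.clock() - self.start >= self.aging_seconds:
            self.target = None  # re-run gesture recognition object determination
        return self.target is not None

# Usage with a fake clock for determinism:
t = [0.0]
tracker = GestureTargetTracker(clock=lambda: t[0])
tracker.set_target("user_a")
for _ in range(4):           # 4 consecutive frames without the face
    ok = tracker.update(False)
print(ok)  # False: the missed-frame count exceeded the threshold of 3
```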
  • Step 205 Obtain the to-be-recognized area corresponding to the target user in multiple frames of second images, where the to-be-identified area corresponding to the target user includes the hand image of the target user.
  • obtaining the area to be identified corresponding to the target user in the multiple frames of second images can be understood as obtaining only the area to be identified corresponding to the target user, rather than the areas to be identified corresponding to users other than the target user, in the multiple frames of second images.
  • the multiple frames of second images are obtained by shooting the shooting area by the camera after the target user is determined as the gesture recognition object. That is, the shooting moment of the second image is behind the shooting moment of the first image in time sequence.
  • the shooting moments of the multiple frames of second images are continuous with the shooting moments of the multiple frames of first images; that is, the first N frames of images obtained by the camera shooting the shooting area are the first images, and the images captured after the N-th frame are the second images.
  • the shooting moments of the multiple frames of the second images may also be discontinuous with the shooting moments of the multiple frames of the first images.
  • first image and second image are used to distinguish the shooting timing of the image.
  • the first image refers to an image captured by the camera before the gesture recognition object is determined, and the second image refers to an image captured by the camera after the gesture recognition object is determined.
  • the face information of the target user can also be saved, so as to associate the hand movements of the target user with the face information of the target user, thereby realizing hand tracking of the target user and, in turn, gesture recognition of the target user.
  • the face information of the target user includes the position and movement trend of the face image of the target user in the multiple frames of the first images captured by the camera, or the face information of the target user includes the face features of the target user.
  • the implementation process of step 205 may include: determining the face image position of the target user in the second image according to the saved face information of the target user. According to the face image position of the target user in the second image, the region to be recognized corresponding to the target user in the second image is determined.
  • face detection may be performed on the second image to obtain a face image in the second image.
  • the face belonging to the target user in the second image can be determined by the face tracking algorithm
  • face detection is performed on the second image based on MTCNN, and the IoU values between one or more face detection frames in the second image and the face detection frame of the target user in the previous frame image are calculated respectively, so as to determine the face detection frame of the target user in the second image and then obtain the face image position of the target user in the second image.
  • alternatively, which face image in the second image belongs to the target user can be determined by calculating face similarity.
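The IoU-based face tracking described above can be sketched as follows; the 0.3 matching threshold is an illustrative assumption, as is the (x1, y1, x2, y2) box format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_target_face(prev_target_box, detections, iou_threshold=0.3):
    """Among the face boxes detected in the current frame, pick the one
    that best overlaps the target user's box from the previous frame."""
    best = max(detections, key=lambda b: iou(prev_target_box, b), default=None)
    if best is not None and iou(prev_target_box, best) >= iou_threshold:
        return best
    return None  # the target's face was not found in this frame

prev = (100, 100, 200, 200)
dets = [(105, 98, 205, 198), (400, 120, 480, 200)]
print(match_target_face(prev, dets))  # (105, 98, 205, 198)
```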
  • the implementation of determining the area to be recognized corresponding to the target user in the second image can refer to the implementation in step 202 of determining the area to be identified corresponding to the potential user in the first image based on the position of the potential user's face image in the first image, which will not be repeated in this embodiment of the present application.
  • Step 206 Perform gesture recognition on the target user according to the to-be-recognized areas corresponding to the target user in multiple frames of second images.
  • gesture recognition may be continuously performed on the target user. Performing gesture recognition on the target user may be to determine whether the hand movement of the target user matches a preset gesture to be recognized.
  • step 206 includes: inputting the to-be-recognized area corresponding to the target user in multiple frames of second images as an image sequence into the gesture recognition model, so as to obtain a gesture recognition result output by the gesture recognition model.
  • the gesture recognition result may indicate a preset gesture to be recognized, meaning that the target user performed that gesture during the shooting period of the multiple frames of second images; or the gesture recognition result may indicate that no gesture to be recognized was matched, meaning that the target user did not perform any preset gesture to be recognized during that period; or the gesture recognition result may include the confidence of each preset gesture to be recognized, and the display device or the post-processing terminal connected to the display device can take the gesture to be recognized with the highest confidence, provided that confidence exceeds a certain threshold, as the gesture performed by the target user. If no gesture to be recognized has a confidence above the threshold, the target user did not perform any preset gesture to be recognized during the shooting period of the multiple frames of second images.
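The confidence-based selection described above can be sketched as follows; the 0.6 threshold and the gesture names are illustrative assumptions:

```python
def select_gesture(confidences: dict, threshold: float = 0.6):
    """Select the gesture to be recognized from per-gesture confidences.

    Returns the highest-confidence gesture if its confidence exceeds the
    threshold, otherwise None (no preset gesture was performed).
    """
    if not confidences:
        return None
    gesture, conf = max(confidences.items(), key=lambda kv: kv[1])
    return gesture if conf > threshold else None

print(select_gesture({"page_up": 0.1, "page_down": 0.85, "screenshot": 0.3}))
# page_down
print(select_gesture({"page_up": 0.4, "page_down": 0.5}))
# None
```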
  • After performing gesture recognition on the target user, if it is determined that the target user performed a certain gesture to be recognized within the shooting period of the multiple frames of second images, the device responds with the manipulation instruction corresponding to that gesture to realize the air gesture operation, and continues to perform gesture recognition on the target user until the condition for ending the target user's role as the gesture recognition object is met. If it is determined that the target user did not perform any gesture to be recognized during the shooting period of the multiple frames of second images, gesture recognition on the target user may likewise continue until the condition for ending the target user's role as the gesture recognition object is met.
  • based on the areas to be recognized corresponding to the target user in the multiple frames of first images and in the multiple frames of second images, it can be judged whether the target user performed a gesture to be recognized that includes the preset gesture. That is, after the user performs a gesture to be recognized in the shooting area of the camera, the display device or a post-processing terminal connected to the display device can determine the user as the gesture recognition object based on that gesture, and then respond with the manipulation instruction corresponding to the gesture to be recognized.
  • the display device or the post-processing terminal connected to the display device performs gesture recognition only on the target user for a period of time, and does not perform gesture recognition on users other than the target user; that is, it locks onto and recognizes one user's gestures for a period of time, which avoids the problem that users' gestures interfere with one another and accurate gesture control cannot be achieved.
  • In the gesture recognition object determination method provided by the embodiment of the present application, a user whose face image appears in each frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture is determined as the gesture recognition object within the camera's field of view. Based on the images captured by the camera, the gesture recognition object can be determined automatically, and gesture recognition can further be performed on that object to realize air gesture operation. The method is suitable for gesture recognition in various scenarios, especially multi-user scenarios, and the implementation is simple. In addition, during gesture recognition on the gesture recognition object, gestures of users other than the gesture recognition object are not recognized, which avoids the problem that users' gestures interfere with one another and accurate gesture control cannot be achieved.
  • users inside the shooting area who are not facing the camera may also be excluded, and gesture recognition objects are determined only among users facing the camera, which can reduce the probability of misjudging the gesture recognition object.
  • the initial part of the gesture to be recognized is used as a preset gesture used to determine the gesture recognition object.
  • the gesture to be recognized can be performed directly in the shooting area of the camera, without performing another specific wake-up gesture to enable the gesture recognition function of the device; the gesture recognition object is determined without the user's perception, which simplifies user operations and improves user experience.
  • Fig. 6 is a schematic structural diagram of an apparatus for determining an object for gesture recognition provided by an embodiment of the present application. As shown in Figure 6, the device 600 includes:
  • the first determining module 601 is configured to determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area, where the potential users satisfy: each frame of the multiple frames of first images includes a face image of the potential user.
  • the second determining module 602 is configured to determine the hand movement of the potential user according to the to-be-recognized area corresponding to the potential user in the multiple frames of the first image, and the to-be-identified area corresponding to the potential user includes the hand image of the potential user.
  • the third determination module 603 is configured to determine a target user among one or more potential users as a gesture recognition object, and the hand motion of the target user matches a preset gesture.
  • the device 600 further includes: a first acquiring module 604, configured to, after a target user among the one or more potential users is determined as the gesture recognition object, acquire the area to be recognized corresponding to the target user in multiple frames of second images. The area to be recognized corresponding to the target user includes the hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object.
  • the gesture recognition module 605 is configured to perform gesture recognition on the target user according to the to-be-recognized areas corresponding to the target user in multiple frames of second images.
  • the preset gesture includes an initial part of the gesture to be recognized.
  • the gesture recognition module 605 is configured to: judge whether the target user performs a gesture to be recognized according to the regions to be recognized corresponding to the target user in multiple frames of first images and the regions to be recognized corresponding to the target user in multiple frames of second images.
  • the first obtaining module 604 is configured to: determine the face image position of the target user in the second image according to the saved face information of the target user. According to the face image position of the target user in the second image, the region to be recognized corresponding to the target user in the second image is determined.
  • the device 600 further includes: a fourth determination module 606, configured to end the target user's role as the gesture recognition object when the number of images, among those captured of the shooting area after the target user becomes the gesture recognition object, that do not include the target user's face image exceeds the number threshold, or when the duration of the target user as the gesture recognition object reaches the duration threshold.
  • the device 600 further includes: a fifth determination module 607, configured to determine the position of the face image of the potential user in the first image after one or more potential users in the shooting area are determined.
  • the sixth determination module 608 is configured to determine a region to be recognized corresponding to the potential user in the first image according to the position of the face image of the potential user in the first image.
  • the third determining module 603 is configured to: when there are multiple potential users whose hand movements match the preset gestures in the shooting area, take the potential user closest to the camera as the target user.
  • the device 600 further includes: a second acquisition module 609, configured to acquire the distance from the potential user to the camera.
  • the output module 610 is configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, and the distance prompt is used to prompt the potential user to approach the camera. If the above-mentioned device for determining a gesture recognition object is a display device, the output module 610 is specifically a display module. Alternatively, if the above-mentioned apparatus for determining a gesture recognition object is a post-processing end, the output module 610 is specifically a sending module.
  • the second obtaining module 609 is configured to: determine the distance between the potential user and the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the second determining module 602 is configured to: perform key point detection on areas to be identified corresponding to potential users in multiple frames of the first image to obtain multiple sets of hand key point information of the potential user. According to multiple sets of hand key point information of the potential user, the hand motion of the potential user is determined.
  • the region to be identified corresponding to the potential user further includes an elbow image of the potential user.
  • the second determination module 602 is configured to: perform key point detection on regions to be identified corresponding to potential users in multiple frames of the first image, and obtain multiple sets of key point information of hands and elbows of potential users. According to multiple sets of hand and elbow key point information of the potential user, the hand movement of the potential user is determined.
  • the above human face image is a front face image.
  • Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • the gesture recognition object determination device may be a general computing device, for example, in a conference scenario, the general computing device may be a conference terminal or a post-conference processing terminal.
  • the conference terminal can be a large screen or an electronic whiteboard.
  • the post-meeting processing end may be a server, or a server cluster composed of multiple servers, or a cloud computing platform.
  • the gesture recognition object determining device 1100 includes: a processor 1101 and a memory 1102 .
  • memory 1102 configured to store computer programs, the computer programs including program instructions
  • the processor 1101 is configured to call the computer program to implement the method steps shown in FIG. 2 in the above method embodiment.
  • the gesture recognition object determining device 1100 further includes a communication bus 1103 and a communication interface 1104 .
  • the processor 1101 includes one or more processing cores, and the processor 1101 executes various functional applications and data processing by running computer programs.
  • the memory 1102 can be used to store computer programs.
  • the memory may store an operating system and application program units required for at least one function.
  • the operating system can be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS or OS X.
  • the communication interface 1104 is used to communicate with other storage devices or network devices.
  • the communication interface of the post-conference processing terminal may be used to send a result of determining a gesture recognition object to the conference terminal.
  • Network devices can be switches or routers, etc.
  • the memory 1102 and the communication interface 1104 are respectively connected to the processor 1101 through the communication bus 1103 .
  • the embodiment of the present application also provides a computer-readable storage medium storing instructions; when the instructions are executed by a processor, the method steps shown in FIG. 2 in the above method embodiment are implemented.
  • the embodiment of the present application also provides a computer program product, including a computer program. When the computer program is executed by a processor, the method steps shown in FIG. 2 in the above method embodiment are implemented.

Abstract

The present application discloses a gesture recognition object determination method and apparatus, and belongs to the field of computer vision. First, a device determines, according to multiple first image frames obtained by a camera capturing a capture area, one or more potential users in the capture area, a potential user satisfying: each first image frame among the multiple first image frames comprises a face image of the potential user. Then, the device determines a hand action of the potential user according to an area to be recognized corresponding to the potential user in the multiple first image frames, the area to be recognized corresponding to the potential user comprising a hand image of the potential user. Finally, a target user among the one or more potential users is determined to be a gesture recognition object, the hand action of the target user matching a preset gesture. In the present application, a gesture recognition object can be automatically determined on the basis of images captured by a camera, and an air gesture operation of the gesture recognition object can further be implemented, which is applicable to gesture recognition in multiple scenarios, especially in a multi-user scenario. The implementation is simple.

Description

手势识别对象确定方法及装置Gesture recognition object determination method and device
本申请要求于2021年06月30日提交的申请号为202110736357.6、发明名称为“一种手势识别的方法、装置及系统”以及于2021年09月03日提交的申请号为202111034365.2、发明名称为“手势识别对象确定方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application requires that the application number submitted on June 30, 2021 is 202110736357.6, and the name of the invention is "a method, device and system for gesture recognition" and the application number submitted on September 3, 2021 is 202111034365.2, and the name of the invention is The priority of the Chinese patent application "Method and Device for Gesture Recognition Object Determination", the entire content of which is incorporated in this application by reference.
Technical field
The present application relates to the field of computer vision, and in particular to a method and apparatus for determining a gesture recognition object.
Background
In the field of computer vision, gesture recognition is a very important mode of human-computer interaction. Gesture recognition technology uses various sensors to model the shape and displacement of the hand (or arm) to form an information sequence, and then converts the information sequence into corresponding instructions used to control certain operations.
Since the gestures of multiple users may be recognized in a multi-user scenario, how to determine the gesture recognition object among multiple users is the key to accurate gesture control.
Summary
The present application provides a gesture recognition object determination method and apparatus.
In a first aspect, a method for determining a gesture recognition object is provided. The method can be applied to a general-purpose computing device. The method includes: determining one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, where a potential user satisfies the condition that each of the multiple frames of first images includes a face image of the potential user; determining a hand action of the potential user according to an area to be recognized corresponding to the potential user in the multiple frames of first images, where the area to be recognized corresponding to the potential user includes a hand image of the potential user; and determining a target user among the one or more potential users as the gesture recognition object, where the hand action of the target user matches a preset gesture.
In the present application, a user whose face image is present in each of multiple frames of images captured by the camera, and whose hand action matches a preset gesture, is determined as the gesture recognition object within the shooting area of the camera. The gesture recognition object can thus be determined automatically on the basis of images captured by the camera, and gesture recognition can then be performed on that object to implement air gesture operations. The method is applicable to gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement.
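The two conditions above (face present in every frame, hand action matching the preset gesture) can be sketched in Python. This is an illustrative simplification, not the application's implementation: each frame is abstracted as a dict mapping a detected user ID to that user's already-classified hand action, so the face detection and action classification stages are assumed to have run beforehand.

```python
def determine_gesture_target(frames, preset_gesture):
    """Sketch of the target-user decision.

    frames: list of dicts, one per captured frame, mapping
            user_id -> classified hand action (hypothetical abstraction).
    A user is a potential user only if detected in EVERY frame; the
    target user is a potential user whose action matches the preset gesture.
    """
    # Step 1: potential users are the intersection of user IDs over all frames.
    potential = set(frames[0]).intersection(*frames[1:])
    # Steps 2-3: among potential users, pick one whose hand action matches.
    for user in sorted(potential):
        if frames[-1][user] == preset_gesture:
            return user
    return None  # no gesture recognition object in this batch of frames
```

A user who enters or leaves the shooting area mid-sequence is dropped by the intersection in step 1, which mirrors the requirement that the face image appear in each of the multiple frames.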
Optionally, after the target user among the one or more potential users is determined as the gesture recognition object, the above method further includes: acquiring an area to be recognized corresponding to the target user in multiple frames of second images, where the area to be recognized corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and performing gesture recognition on the target user according to the area to be recognized corresponding to the target user in the multiple frames of second images. This can be understood as: only the area to be recognized corresponding to the target user is acquired from the multiple frames of second images, and gesture recognition is performed only on the target user.
In the present application, after the target user is determined as the gesture recognition object, gesture recognition is performed for a period of time only on the target user, and not on any user other than the target user. That is, recognition is locked onto one user's gestures for a period of time, which avoids the problem that gestures of different users interfere with one another and make accurate gesture control impossible.
Optionally, the preset gesture includes the initial part of a gesture to be recognized, and performing gesture recognition on the target user according to the area to be recognized corresponding to the target user in the multiple frames of second images includes: determining, according to the area to be recognized corresponding to the target user in the multiple frames of first images and the area to be recognized corresponding to the target user in the multiple frames of second images, whether the target user has performed the gesture to be recognized.
In the present application, the initial part of the gesture to be recognized is used as the preset gesture for determining the gesture recognition object. When a user wants to perform an air gesture operation, the user can directly perform the gesture to be recognized within the shooting area of the camera, without performing any other specific wake-up gesture to enable the gesture recognition function of the device. The gesture recognition object is thus determined without the user being aware of it, which simplifies user operations and improves user experience.
Optionally, acquiring the area to be recognized corresponding to the target user in the multiple frames of second images includes: determining the face image position of the target user in a second image according to saved face information of the target user; and determining the area to be recognized corresponding to the target user in the second image according to the face image position of the target user in the second image.
In the present application, after the target user is determined as the gesture recognition object, the face information of the target user can be saved, so that the hand actions of the target user can be associated with the target user through the face information. Hand tracking of the target user is thereby achieved, which in turn enables gesture recognition of the target user.
Optionally, the target user ceases to be the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera of the shooting area that do not include the target user's face image exceeds a count threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
In the present application, at most one gesture recognition object can be determined at any given time. Since the gesture recognition object may change over time, setting conditions for ending the target user's role as the gesture recognition object satisfies the need for the gesture recognition object to change flexibly in application scenarios.
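The two release conditions described above can be expressed as a single predicate. This is a minimal sketch; the parameter names and threshold values are illustrative placeholders, not values specified by the application.

```python
def should_end_target(missed_face_count, count_threshold,
                      elapsed_seconds, duration_threshold):
    """The target user stops being the gesture recognition object when
    too many captured images lack the user's face image (condition 1),
    or the role has lasted for the configured duration (condition 2)."""
    return (missed_face_count > count_threshold
            or elapsed_seconds >= duration_threshold)
```

Once the predicate is true, the device can clear the saved face information and return to scanning all potential users for the preset gesture.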
Optionally, after the one or more potential users in the shooting area are determined, the above method further includes: determining the face image position of a potential user in a first image; and determining the area to be recognized corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
Optionally, determining a target user among the one or more potential users as the gesture recognition object includes: when there are multiple potential users in the shooting area whose hand actions match the preset gesture, taking the potential user closest to the camera as the target user.
Optionally, after the one or more potential users in the shooting area are determined, the above method further includes: acquiring the distance from a potential user to the camera; and when the distance from the potential user to the camera exceeds a distance threshold, outputting a distance prompt, where the distance prompt is used to prompt the potential user to move closer to the camera.
If a potential user is far from the camera, the image of that user's body in the pictures captured by the camera will be small and may not show the details of the user's hand, which may lead to subsequent misjudgment of the user's hand action. The present application outputs a distance prompt to remind the potential user to move closer to the camera; a potential user who wants to perform an air gesture operation can then approach the camera according to the prompt. This improves the accuracy of determining the gesture recognition object, and further improves the accuracy of recognizing the air gestures of the gesture recognition object.
Optionally, acquiring the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the interocular distance of the potential user in the first image, and a preset user interocular distance.
For example, assume the focal length of the camera is f, the interocular distance of the user in the image containing the user's frontal face captured by the camera (i.e., in the imaging plane) is M, the preset user interocular distance is K, and the distance from the user to the camera is d. According to the principle of similar triangles, M/f = K/d, from which the distance from the user to the camera can be derived as d = (K*f)/M.
In the present application, regardless of whether the camera is a monocular camera or a binocular camera, or whether it integrates a depth sensor, the distance from the user to the camera can be determined based on the principle of similar triangles. The calculation is simple and the implementation cost is low.
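The similar-triangles relation d = (K*f)/M translates directly into code. A sketch follows; the default interocular distance of 0.063 m is a commonly assumed adult average used here for illustration, not a value from the application.

```python
def eye_distance_to_camera(focal_length_px, eye_px_distance, eye_real_m=0.063):
    """Monocular distance estimate from the relation M/f = K/d => d = K*f/M.

    focal_length_px: camera focal length f, in pixels
    eye_px_distance: interocular distance M measured in the image, in pixels
    eye_real_m:      preset real interocular distance K, in metres
                     (0.063 m is an assumed average, not specified here)
    Returns the user-to-camera distance d in metres.
    """
    return eye_real_m * focal_length_px / eye_px_distance
```

With a 1000-pixel focal length and a measured interocular distance of 63 pixels, the estimate is 1 m; halving the measured pixel distance doubles the estimated range, as the formula implies.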
Optionally, determining the hand action of the potential user according to the areas to be recognized corresponding to the potential user in the multiple frames of first images includes: performing keypoint detection on the areas to be recognized corresponding to the potential user in the multiple frames of first images respectively, to obtain multiple sets of hand keypoint information of the potential user; and determining the hand action of the potential user according to the multiple sets of hand keypoint information.
Optionally, the area to be recognized corresponding to the potential user further includes an elbow image of the potential user, and determining the hand action of the potential user according to the areas to be recognized corresponding to the potential user in the multiple frames of first images includes: performing keypoint detection on the areas to be recognized corresponding to the potential user in the multiple frames of first images respectively, to obtain multiple sets of hand and elbow keypoint information of the potential user; and determining the hand action of the potential user according to the multiple sets of hand and elbow keypoint information.
When the user's hand keypoints are too tightly clustered, or some hand keypoints are missing from the detection results, the user's hand action may be misjudged or missed. The present application uses the movement direction of the user's elbow keypoint to assist in judging the movement direction of the user's hand, which improves the accuracy of judging the user's hand action.
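The elbow-assisted direction judgment can be sketched as a simple heuristic: use the hand keypoint centroid when enough hand keypoints are available, and fall back to the elbow keypoint when they are not. This is an illustrative sketch only, not the application's exact algorithm, and it considers only horizontal movement for brevity.

```python
def movement_direction(hand_tracks, elbow_track=None):
    """Infer horizontal movement direction across frames.

    hand_tracks: list of per-frame hand keypoint lists, each a list of
                 (x, y) tuples (may be empty if detection failed)
    elbow_track: optional list of per-frame elbow (x, y) points, used as
                 a fallback when hand keypoints are missing
    """
    def centroid_x(points):
        return sum(p[0] for p in points) / len(points)

    if hand_tracks[0] and hand_tracks[-1]:
        # Normal case: compare hand centroids in the first and last frames.
        dx = centroid_x(hand_tracks[-1]) - centroid_x(hand_tracks[0])
    elif elbow_track:
        # Fallback: hand keypoints missing, use elbow displacement instead.
        dx = elbow_track[-1][0] - elbow_track[0][0]
    else:
        return "unknown"
    if dx > 0:
        return "right"
    if dx < 0:
        return "left"
    return "still"
```

A fuller implementation would also weigh the vertical component and agreement between hand and elbow motion, but the fallback structure is the point illustrated here.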
Optionally, the above face image is a frontal face image. That is, the potential user satisfies the condition that each of the multiple frames of first images includes a frontal face image of the potential user.
Since a user usually faces the camera when performing an air gesture operation, the present application can also exclude users in the shooting area who are side-on to the camera and determine potential users only among users facing the camera, which reduces the probability of misjudging the gesture recognition object.
In a second aspect, an apparatus for determining a gesture recognition object is provided. The apparatus includes multiple functional modules, and the multiple functional modules interact to implement the methods in the above first aspect and its implementations. The multiple functional modules may be implemented based on software, hardware, or a combination of software and hardware, and may be arbitrarily combined or divided based on the specific implementation.
In a third aspect, a gesture recognition object determination device is provided, including a processor and a memory;
the memory is configured to store a computer program, where the computer program includes program instructions;
the processor is configured to invoke the computer program to implement the methods in the above first aspect and its implementations.
In a fourth aspect, a computer-readable storage medium is provided, where instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the methods in the above first aspect and its implementations are implemented.
In a fifth aspect, a computer program product is provided, including a computer program, where when the computer program is executed by a processor, the methods in the above first aspect and its implementations are implemented.
In a sixth aspect, a chip is provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, the methods in the above first aspect and its implementations are implemented.
Description of drawings
Fig. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided by an embodiment of the present application;

Fig. 2 is a schematic flowchart of a gesture recognition object determination method provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of a ranging principle provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of an image provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of the distribution of hand keypoints provided by an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 7 is a schematic structural diagram of another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of yet another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 9 is a schematic structural diagram of still another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a further gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the implementations of the present application are further described in detail below with reference to the accompanying drawings.
With the development of computer vision technology, products in many forms, such as mobile phones, electronic screens, and virtual reality (VR) devices, are emerging one after another, and the demand for human-machine interaction is growing. Since gestures can express rich information in a contactless manner, gesture recognition is widely used in human-computer interaction, smartphones, smart TVs, and other products. In particular, vision-based gesture recognition technology does not require additional sensors to be worn on the hand as markers, which makes it convenient and gives it broad application prospects in human-computer interaction and related fields. All gestures mentioned in the present application refer to contactless gestures, that is, air gestures.
At present, there is a demand for air gesture operations on display devices in many scenarios. For example, in a conference room scenario, participants can perform air gesture operations on the display screen of a conference terminal, such as page up, page down, page left, page right, and screenshot. In a home scenario, family members can perform air gesture operations on the playback screen of a smart TV, such as fast forward, rewind, volume up, volume down, and pause. In a classroom scenario, a teacher or student can perform air gesture operations on the content displayed on a display device, such as scrolling up or scrolling down.
However, in these scenarios there are usually multiple users in front of the display device. When gesture recognition is performed based on images collected by the camera, the gestures of multiple users are easily recognized, and the display device may be unable to distinguish which user is actually performing the air gesture operation. The users' gestures interfere with one another, so the display device cannot achieve accurate gesture control.
In view of this, the present application proposes a solution for determining the gesture recognition object: face detection is performed on multiple frames of images captured by the camera to identify potential users in the shooting area, and the gesture recognition object is then determined among the potential users by judging their hand actions. Specifically, a user whose face image is present in each of the multiple frames of images captured by the camera, and whose hand action matches a preset gesture, is determined as the gesture recognition object within the shooting area of the camera. In the present application, the gesture recognition object can be determined automatically based on the images captured by the camera, and gesture recognition can then be performed on that object to implement air gesture operations. The solution is applicable to gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement. In addition, during gesture recognition of the gesture recognition object, the gestures of users other than the gesture recognition object are not recognized, which avoids the problem that gestures of different users interfere with one another and make accurate gesture control impossible.
Optionally, in consideration of users' operating habits, and since a user usually faces the camera when performing an air gesture operation, the solution of the present application can also exclude users in the shooting area who are side-on to the camera and determine potential users only among users facing the camera. Specifically, a user whose frontal face image is present in each of the multiple frames of images captured by the camera, and whose hand action matches the preset gesture, is determined as the gesture recognition object within the shooting area of the camera. This reduces the probability of misjudging the gesture recognition object. To improve the accuracy of determining the gesture recognition object, the operation manual of a display device supporting the gesture control function may also clearly state that the user needs to face the camera when performing an air gesture operation. Facing the camera, as referred to in the present application, does not mean that the face is exactly head-on to the camera; a deviation within a set range is allowed. A face exactly head-on to the camera may be one whose interocular line is parallel to the imaging plane of the camera. If the face deflection angle is defined as 0° when the face is exactly head-on to the camera, then in the present application, facing the camera means that the deflection angle of the face relative to the camera is within a certain range. That is, if the deflection angle of a user's face is within a certain range, the user is regarded as facing the camera. For example, users whose face deflection angle is within the range of -30° to 30° may be regarded as facing the camera; this angle range is only an example, and the range used to determine whether a user faces the camera can be set according to actual requirements. In the present application, the face of a user facing the camera is called a frontal face.
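The frontal-face test described above reduces to a range check on the face deflection (yaw) angle. A minimal sketch follows; the ±30° default mirrors the example range given in the text and is configurable, not a fixed value of the application.

```python
def is_facing_camera(yaw_degrees, max_abs_yaw=30.0):
    """A user is treated as facing the camera when the face deflection
    angle relative to the camera lies within the configured range
    (0 degrees = face exactly head-on to the camera)."""
    return -max_abs_yaw <= yaw_degrees <= max_abs_yaw
```

Users failing this check are excluded before the potential-user determination, since they are unlikely to be performing an air gesture operation.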
The gesture recognition object determination method provided by the embodiments of the present application can be applied to a general-purpose computing device. The general-purpose computing device may be a display device or a post-processing end connected to a display device, where the display device supports the gesture control function. The display device has a built-in camera, or the display device is connected to an external camera. The camera is used to shoot the shooting area to obtain images. The display device, or the post-processing end connected to the display device, is used to determine the gesture recognition object in the shooting area according to the images captured by the camera, and further to perform gesture recognition on the gesture recognition object to respond to air gesture operations. The deployment orientation of the camera is usually consistent with that of the display device, and the shooting area of the camera usually includes the area that the display surface of the display device faces. The post-processing end may be a server, a server cluster composed of multiple servers, a cloud computing platform, or the like.
The gesture recognition object determination method provided by the embodiments of the present application can be applied to various scenarios. In a conference room scenario, the display device may be a conference terminal such as a large screen or an electronic whiteboard. In a home or classroom scenario, the display device may be a smart TV, a projection device, a VR device, or the like.
For example, Fig. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided by an embodiment of the present application. The application scenario is a conference room. As shown in Fig. 1, the application scenario includes a conference terminal with a built-in camera. The conference terminal is mounted on a wall. The shooting area of the camera includes a conference table and multiple participants. While the gesture control function of the conference terminal is enabled, the camera continuously shoots the shooting area, and the conference terminal, or a post-processing end (not shown in the figure) connected to the conference terminal, processes the images captured by the camera to determine whether a gesture recognition object exists in the shooting area.
The method flow of the embodiments of the present application is described below.
Fig. 2 is a schematic flowchart of a gesture recognition object determination method provided by an embodiment of the present application. As shown in Fig. 2, the method includes:
Step 201: Determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area.
A potential user satisfies the condition that each of the multiple frames of first images includes the face image of the potential user. That is, a user whose face image is present in each of the multiple frames of first images is taken as a potential user in the shooting area. The number of image frames used to determine potential users is preconfigured; for example, the multiple frames of first images may be 3 frames, 5 frames, or 10 frames. The embodiments of the present application do not limit the number of first image frames used here.
Optionally, face detection is performed on the multiple frames of first images respectively to obtain the face images in each frame of first image. It is then determined which face images in different first images belong to the same user. Finally, it is determined which users' face images are present in each of the multiple frames of first images, thereby obtaining the potential users in the shooting area.
For example, a multi-task cascaded convolutional network (MTCNN) can be used for face detection. MTCNN includes three cascaded networks: a proposal network (P-Net), a refinement network (R-Net), and an output network (O-Net). The process of performing face detection on an image based on MTCNN includes:
First, an image pyramid is built for the input original image. The image pyramid includes multiple images of different sizes obtained by scaling the original image. Since face images of different sizes may exist in the original image, building an image pyramid allows face images of different sizes in the original image to be detected at a uniform size, enhancing the network's robustness to face images of different sizes.
Second, the image pyramid is fed into the three cascaded networks (P-Net, R-Net, O-Net), which detect the face images in the image from coarse to fine and finally output the face detection result. P-Net regresses multiple detection boxes from the input image, maps the regressed detection boxes back to the original image, and removes some redundant boxes through a non-maximum suppression (NMS) algorithm to obtain a preliminary face detection result. R-Net further refines and filters the face detection result output by P-Net. O-Net further refines and filters the face detection result output by R-Net and outputs the final face detection result.
Optionally, the face detection result obtained with MTCNN includes, for each detected face image, face detection box information and face key point information. The face detection box information may include the coordinates of the upper-left and lower-right corners of the face detection box, the face image lying within the box. The face key point information may include the coordinates of multiple facial key points, such as the left eye, right eye, nose, left mouth corner, and right mouth corner.
After the face detection box information in the multiple frames of first images is obtained, the intersection over union (IoU) of the face detection boxes in every two adjacent frames of first images can be computed separately, and the face images belonging to the same user in two adjacent frames are then determined from the IoU values. Here the IoU value equals the ratio of the intersection area to the union area of the face detection boxes when the two adjacent frames of first images are superimposed, and ranges from 0 to 1. For example, let two adjacent frames of first images be image A and image B. When the IoU of a first face detection box in image A and a second face detection box in image B is greater than a preset threshold, it can be determined that the face images in the first and second face detection boxes belong to the same user. If the multiple frames of first images are frames captured consecutively by the camera, the preset threshold can take a larger value, for example 0.8; if they are frames captured at intervals, it can take a smaller value, for example 0.6. The embodiments of this application do not limit the specific value of the preset threshold.
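The IoU matching described above can be sketched as follows, assuming boxes in (x1, y1, x2, y2) form with upper-left and lower-right corners as in the detection result; the helper names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two face detection boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def same_user(box_a, box_b, consecutive_frames=True):
    """Use the stricter 0.8 threshold for consecutive frames, 0.6 for
    frames captured at intervals, following the example values above."""
    threshold = 0.8 if consecutive_frames else 0.6
    return iou(box_a, box_b) > threshold
```

Two face boxes from adjacent frames are attributed to the same user exactly when their IoU exceeds the chosen threshold.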
Alternatively, after the face images in the multiple frames of first images are obtained, which face images in different first images belong to the same user can also be determined by computing face similarity.
In the embodiments of this application, the same user identifier may be used to mark face images belonging to the same user across different images, and different user identifiers may be used to mark face images belonging to different users within the same image. If every frame of the multiple frames of first images includes a face image marked with the same user identifier, the user represented by that identifier is determined as a potential user. The user identifier used here only needs to distinguish different users; for example, numbers, characters, or other marks can serve as user identifiers. Because this solution does not need to identify who a user is, only to distinguish one user from another, there is no need to preconfigure the gesture recognition objects that may exist in a scene, so the solution applies flexibly to various multi-user scenarios, especially those with a changing user population, such as shared conference rooms.
Because a user performing an air gesture normally faces the camera, the embodiments of this application may also exclude users in the shooting area whose faces are turned sideways to the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
Optionally, in the above condition that each frame of the multiple frames of first images includes a face image of the potential user, the face image is a frontal face image; that is, the potential user satisfies: each frame of the multiple frames of first images includes a frontal face image of the potential user. In other words, a user whose frontal face image appears in every frame of the multiple frames of first images is taken as a potential user in the shooting area.
In one implementation, a face image may be input into a pre-trained classification model to obtain a classification result indicating whether the input face image is a frontal face or a profile face. The classification model can be trained by supervised learning on a training sample set. The training sample set may include a large number of sample face images, each labeled to indicate whether the sample face image is a frontal face or a profile face.
For example, a lightweight deep neural network such as MobileNetV2, which is commonly used for classification tasks on mobile devices such as phones, can be used to build the binary classification model. After a face image is input into MobileNetV2, it outputs one of two classification results, which can be represented by 0 and 1 respectively: 0 indicates that the input face image is a profile face, and 1 indicates that it is a frontal face.
In another implementation, a face deflection angle range can be set in advance. If a user's face deflection angle falls within this range, the user is regarded as facing the camera, i.e. the user's face image in the picture is a frontal face image. After a face image is acquired, face pose estimation can be performed on it to obtain the face deflection angle of the user to whom the face image belongs. If that angle lies within the preset face deflection angle range, the face image is judged to be a frontal face image; otherwise it is judged to be a profile face image.
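The angle-threshold variant reduces to a range check on the estimated pose angles. In this sketch the ±30° bounds are an illustrative assumption; the document does not fix the preset range.

```python
def is_frontal(yaw_deg, pitch_deg=0.0, yaw_limit=30.0, pitch_limit=30.0):
    """Judge a face as frontal (True) or profile (False) from its estimated
    deflection angles: frontal iff both angles lie inside the preset range."""
    return abs(yaw_deg) <= yaw_limit and abs(pitch_deg) <= pitch_limit
```

The yaw and pitch values themselves would come from a face pose estimator run on the detected face key points.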
If a potential user is far from the camera, the user's body occupies only a small part of the captured image and may not show hand details, which can lead to subsequent misjudgment of the user's hand movements. Therefore, in the embodiments of this application, after the potential users in the shooting area are determined, the distance from a potential user to the camera may also be obtained. When this distance exceeds a distance threshold, a distance prompt is output, prompting the potential user to move closer to the camera. If that potential user intends to perform an air gesture operation, moving closer to the camera as prompted improves the accuracy of determining the gesture recognition object, and further improves the accuracy of recognizing the gesture recognition object's air gestures.
When the solution of this application is executed by a display device, the display device outputs the distance prompt, for example by displaying it. When the solution is executed by a post-processing end connected to a display device, the post-processing end outputs the distance prompt, for example by sending it to the connected display device so that the display device displays it.
Optionally, obtaining the distance from a potential user to the camera includes: determining the distance according to the focal length of the camera, the distance between the potential user's eyes in the first image, and a preset user interocular distance. Here, the distance between the potential user's eyes in the first image may be measured in a first image that contains the potential user's frontal face image. The preset user interocular distance is a fixed value set in advance; because the actual interocular distances of different users differ little, the average of the actual interocular distances of multiple users can be used as the preset value.
For example, FIG. 3 is a schematic diagram of a ranging principle provided in an embodiment of this application. As shown in FIG. 3, the focal length of the camera is f, the distance between the user's eyes in the image (i.e. the imaging plane) captured by the camera containing the user's frontal face image is M, and the preset user interocular distance is K. Let the distance from the user to the camera be d. By the principle of similar triangles, M/f = K/d, from which the distance from the user to the camera is derived as d = (K*f)/M.
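The relation d = (K*f)/M can be computed directly, as in this sketch. The 63 mm default is an assumed average interpupillary distance for illustration; units must be consistent (focal length and image eye distance in pixels yield d in the unit of K).

```python
def user_to_camera_distance(focal_px, eye_dist_px, preset_eye_dist_mm=63.0):
    """Similar-triangles ranging: M/f = K/d  =>  d = K * f / M.
    focal_px: camera focal length f in pixels.
    eye_dist_px: interocular distance M measured in the image, in pixels.
    preset_eye_dist_mm: preset user interocular distance K (assumed value)."""
    if eye_dist_px <= 0:
        raise ValueError("eye distance in image must be positive")
    return preset_eye_dist_mm * focal_px / eye_dist_px
```

As the user moves away, M shrinks and the estimated d grows, so comparing d against the distance threshold decides whether to output the distance prompt.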
In the embodiments of this application, regardless of whether the camera is monocular or binocular, or whether a depth sensor is integrated, the distance from the user to the camera can be determined from the similar-triangles principle alone; the calculation is simple and the implementation cost is low.
Optionally, when the camera is a binocular camera, the distance from a potential user to the camera can also be calculated based on the binocular ranging principle. Alternatively, when the camera integrates a depth sensor, the distance can be obtained by measurement with the depth sensor. The depth sensor may be an ultrasonic radar, a millimeter-wave radar, a lidar, or a structured-light sensor, which is not limited in the embodiments of this application. It should be understood that the depth sensor may also be any other device capable of measuring distance.
Step 202: separately obtain the regions to be recognized corresponding to a potential user in the multiple frames of first images, where the region to be recognized corresponding to the potential user includes the potential user's hand image.
Optionally, each frame of the first images contains a region to be recognized corresponding to the potential user. Optionally, the region to be recognized corresponding to the potential user further includes the potential user's elbow image. In the embodiments of this application, the region to be recognized in an image is the region of interest (ROI), i.e. the region of the image that needs to be processed.
In the process of determining the potential users in the shooting area in step 201, the face images of each potential user in the multiple frames of first images are obtained. Accordingly, step 202 may be implemented by: determining the position of the potential user's face image in a first image, and determining the region to be recognized corresponding to the potential user in that first image according to that position. Besides the potential user's hand image, the region to be recognized corresponding to the potential user may also include the potential user's face image.
Optionally, the potential user's face imaging area in the first image can be expanded and cropped to obtain a region to be recognized containing both the hand image and the face image. For example, FIG. 4 is a schematic diagram of an image provided in an embodiment of this application. As shown in FIG. 4, the image includes the body images of user A, user B, user C, and user D; the body images of users A and B include frontal face images, while those of users C and D include profile face images. Assuming users A and B are potential users in the shooting area, user A's face imaging area A1 in the image can be expanded and cropped to obtain the region to be recognized corresponding to user A in the image (area A2), and user B's face imaging area B1 can be expanded and cropped to obtain the region to be recognized corresponding to user B (area B2).
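The expansion from a face box to a hand-covering ROI (as from A1 to A2 in FIG. 4) can be sketched as a box enlargement clipped to the image bounds. The expansion factors here are illustrative assumptions; the document does not specify them.

```python
def expand_face_box(face_box, img_w, img_h, w_factor=3.0, h_factor=4.0):
    """Expand a face detection box (x1, y1, x2, y2) into a region to be
    recognized: widen around the face and extend downward so the crop can
    also contain the user's hand, clipping to the image boundaries."""
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2
    new_w, new_h = w * w_factor, h * h_factor
    rx1 = max(0, int(cx - new_w / 2))
    rx2 = min(img_w, int(cx + new_w / 2))
    ry1 = max(0, int(y1))              # keep the face at the top of the ROI
    ry2 = min(img_h, int(y1 + new_h))  # extend toward where hands tend to be
    return (rx1, ry1, rx2, ry2)
```

The returned rectangle is then cropped from the first image as the region to be recognized for that potential user.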
Step 203: determine the potential user's hand motion according to the regions to be recognized corresponding to the potential user in the multiple frames of first images.
Optionally, step 203 may be implemented by: performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information of the potential user, and determining the potential user's hand motion according to those sets of hand key point information.
Here, performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information may mean performing key point detection on the region to be recognized corresponding to the potential user in each frame of first image to obtain one set of hand key point information of the potential user per frame.
Optionally, a set of hand key point information includes the positions of multiple hand key points and the connection relationships among them. Each hand key point represents a specific part of the hand. For example, FIG. 5 is a schematic diagram of the distribution of hand key points provided in an embodiment of this application. As shown in FIG. 5, the hand may include 21 key points: wrist (0), carpometacarpal joint (1), thumb metacarpophalangeal joint (2), thumb interphalangeal joint (3), thumb tip (4), index finger metacarpophalangeal joint (5), index finger proximal interphalangeal joint (6), index finger distal interphalangeal joint (7), index finger tip (8), middle finger metacarpophalangeal joint (9), middle finger proximal interphalangeal joint (10), middle finger distal interphalangeal joint (11), middle finger tip (12), ring finger metacarpophalangeal joint (13), ring finger proximal interphalangeal joint (14), ring finger distal interphalangeal joint (15), ring finger tip (16), little finger metacarpophalangeal joint (17), little finger proximal interphalangeal joint (18), little finger distal interphalangeal joint (19), and little finger tip (20).
In the embodiments of this application, when key point detection is performed on a region to be recognized containing the potential user's hand image, the above 21 hand key points may be detected, or more or fewer hand key points may be detected.
For example, a key point detector based on a deep neural network can be used to perform key point detection on the region to be recognized; the detector can be implemented with heatmap techniques and can detect key points in a bottom-up manner. Assuming the detection target comprises 21 hand key points, a heatmap with 21 channels can be generated, each channel being the probability map (heat distribution map) of one hand key point. A number in the probability map represents the probability that the corresponding location is the hand key point; the closer the number is to 1, the higher that probability. A vector map with 21*2 channels is generated at the same time, with every 2 channels containing the position information (two-dimensional) of one hand key point, from which the positions of the hand key points are obtained. Further, the key point detector connects the detected hand key points based on part affinity fields (PAF), from which the connection relationships among the multiple hand key points are obtained.
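Decoding key point positions from such a per-channel heatmap amounts to taking, for each channel, the cell with the highest probability. This minimal sketch uses plain nested lists in place of the detector's tensor output; the function name is illustrative.

```python
def decode_heatmap(heatmap):
    """heatmap: [channels][rows][cols] of probabilities, one channel per
    hand key point. Returns one (x, y, confidence) triple per channel,
    taking each channel's argmax cell as that key point's position."""
    keypoints = []
    for channel in heatmap:
        best = (0, 0, channel[0][0])
        for y, row in enumerate(channel):
            for x, p in enumerate(row):
                if p > best[2]:
                    best = (x, y, p)
        keypoints.append(best)
    return keypoints
```

A real detector would decode at the heatmap's lower resolution and rescale the coordinates back to the input image, and would then link the points via the PAF step described above.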
Optionally, after the multiple sets of hand key point information of a potential user are obtained, the shape change and/or displacement of the potential user's hand can be determined from those sets of hand key point information, and the potential user's hand motion determined accordingly.
When a user's hand key points are too closely packed, or some hand key points are missing from the detection result, the user's hand motion may be misjudged or missed. The movement direction of the user's elbow key point can therefore be combined to assist in judging the movement direction of the user's hand, and hence the user's hand motion. Optionally, the region to be recognized corresponding to a potential user in a first image may include both the potential user's hand image and elbow image. Step 203 may then be implemented by: performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand and elbow key point information of the potential user, and determining the potential user's hand motion according to those sets of hand and elbow key point information.
Step 204: determine a target user among the one or more potential users as the gesture recognition object, where the target user's hand motion matches a preset gesture.
Optionally, the preset gesture comprises the initial part of a gesture to be recognized. For example, if a complete gesture to be recognized requires 10 frames of images to identify, the gesture corresponding to its first 3 frames can be selected as the preset gesture. In this way, when a user wants to perform an air gesture operation, the user can simply perform the gesture to be recognized within the camera's shooting area, without executing some other specific wake-up gesture to enable the device's gesture recognition function. The gesture recognition object is determined without the user perceiving it, which simplifies user operation and improves the user experience.
Here, a gesture to be recognized is a gesture preconfigured in the display device that can be converted into a control instruction. For example, in a conference scenario, the gestures to be recognized preconfigured on a conference terminal may include a page-up gesture, a page-down gesture, a page-left gesture, a page-right gesture, and a screenshot gesture. When it is determined that a potential user's hand moves from left to right, it can be judged that the potential user's hand motion matches the initial part of the page-left gesture, and the potential user can then be determined as the gesture recognition object.
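Matching a hand motion against the initial part of the configured gestures can be sketched as a check on the dominant displacement of a tracked hand key point over the first few frames. The gesture names, the wrist as reference point, and the 20-pixel minimum shift are illustrative assumptions, not values fixed by the document.

```python
def match_gesture_start(wrist_positions, min_shift=20):
    """wrist_positions: list of (x, y) wrist coordinates from the first frames.
    Returns the name of the gesture whose initial part matches, or None.
    Image coordinates are assumed, with y growing downward."""
    if len(wrist_positions) < 2:
        return None
    dx = wrist_positions[-1][0] - wrist_positions[0][0]
    dy = wrist_positions[-1][1] - wrist_positions[0][1]
    if abs(dx) >= abs(dy):                 # dominant horizontal motion
        if dx >= min_shift:
            return "page_left"             # left-to-right hand movement
        if dx <= -min_shift:
            return "page_right"
    else:                                  # dominant vertical motion
        if dy <= -min_shift:
            return "page_up"
        if dy >= min_shift:
            return "page_down"
    return None                            # motion too small: no match
```

A potential user whose trajectory yields a non-None match is a candidate gesture recognition object; among several matching candidates, the one closest to the camera would be chosen, as described below.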
Optionally, when there are multiple potential users in the camera's shooting area whose hand motions match the preset gesture, the potential user closest to the camera is taken as the target user.
In the embodiments of this application, at most one gesture recognition object can be determined at any moment, and the gesture recognition object may change over time. The closer a user is to the camera, the higher the probability of being determined as the gesture recognition object.
Optionally, the target user stops being the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured of the shooting area that do not include the target user's face image exceeds a count threshold, or the duration for which the target user has been the gesture recognition object reaches a duration threshold. For example, the count threshold may be 3 frames: when more than 3 images captured by the camera do not include the target user's face image, the target user is no longer treated as the gesture recognition object and the gesture recognition object determination procedure is restarted. The duration threshold is a preset aging time, for example 20 seconds: each determined gesture recognition object is valid for at most 20 seconds, after which the determined gesture recognition object becomes invalid and must be determined anew, meeting the need for flexible, changing gesture recognition objects in application scenarios.
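The two release conditions just described can be sketched as a small per-target state object. The class name is illustrative; the concrete values 3 frames and 20 seconds follow the examples in the text.

```python
class GestureTarget:
    """Tracks whether a target user may remain the gesture recognition object."""

    def __init__(self, user_id, start_time, max_lost_frames=3, max_hold_seconds=20.0):
        self.user_id = user_id
        self.start_time = start_time          # when the user became the target
        self.lost_frames = 0                  # consecutive frames without the face
        self.max_lost_frames = max_lost_frames
        self.max_hold_seconds = max_hold_seconds

    def update(self, face_present, now):
        """Call once per captured image; returns True while the user remains
        the gesture recognition object, False once either condition triggers."""
        self.lost_frames = 0 if face_present else self.lost_frames + 1
        if self.lost_frames > self.max_lost_frames:
            return False                      # face missing from too many images
        if now - self.start_time >= self.max_hold_seconds:
            return False                      # aging time reached
        return True
```

Once `update` returns False, the determination procedure of steps 201-204 would be restarted to select a new gesture recognition object.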
Optionally, the condition for ending the target user's role as gesture recognition object may also be that the target user fails to make a correct gesture to be recognized within a certain period (shorter than the aging time) after becoming the gesture recognition object, or that the target user's hand is detected to have been lowered, or that the target user's hand remains motionless (which excludes the case of a user maliciously occupying the gesture recognition object role), and so on. The embodiments of this application do not limit the ending condition under which the target user stops being the gesture recognition object.
For example, assume that 3 frames of images are used to determine a potential user, that the face image in the determination condition is a frontal face image, and that the condition for ending a user's role as gesture recognition object is that the number of images captured of the shooting area after the user becomes the gesture recognition object that do not include the user's face image exceeds the count threshold. The gesture recognition object determination method provided by the embodiments of this application can then be implemented as follows. In the process of determining the gesture recognition object, if there are 3 frames of images that each include a user's frontal face image, and the user's hand motion determined from those 3 frames matches the preset gesture, that user can be determined as the gesture recognition object, and gesture recognition is then performed on that user. Meanwhile, after the user is determined as the gesture recognition object, whether subsequently captured images include the user's frontal face image is detected in real time; if the number of images that do not include the user's frontal face image reaches a certain count, the user's role as gesture recognition object ends and the gesture recognition object determination procedure is restarted. The gesture recognition object determination procedure may be executed when no gesture recognition object exists in the camera's shooting area; that is, after a gesture recognition object is determined, the display device or the post-processing end connected to the display device can stop executing the procedure until the last determined gesture recognition object becomes invalid.
Step 205: obtain the regions to be recognized corresponding to the target user in multiple frames of second images, where the region to be recognized corresponding to the target user includes the target user's hand image.
Here, obtaining the regions to be recognized corresponding to the target user in the multiple frames of second images can be understood as obtaining only the target user's regions to be recognized, and no longer obtaining the regions to be recognized corresponding to users other than the target user. The multiple frames of second images are captured of the shooting area by the camera after the target user is determined as the gesture recognition object; that is, the capture time of a second image follows that of a first image in sequence. For example, in one case, the capture times of the multiple frames of second images are continuous with those of the multiple frames of first images: the first N frames captured of the shooting area are first images, and the images captured after those N frames are second images. In another case, the capture times of the second images may be discontinuous with those of the first images. In this application, "first image" and "second image" distinguish capture timing: a first image is an image captured by the camera before the gesture recognition object is determined, and a second image is an image captured after.
Optionally, after the target user is determined as the gesture recognition object in step 204, the target user's face information can also be saved, so that the target user's hand motions can be associated with the target user through the face information, enabling hand tracking of the target user and hence gesture recognition of the target user. The target user's face information includes the positions and movement trend of the target user's face image in the multiple frames of first images captured by the camera, or alternatively includes the target user's facial features. Step 205 may be implemented by: determining the position of the target user's face image in a second image according to the saved face information of the target user, and determining the region to be recognized corresponding to the target user in the second image according to that position.
Optionally, face detection may be performed on the second image to obtain the face images in the second image. When the saved face information of the target user is the position and motion trend of the target user's face image in the multiple frames of first images captured by the camera, the face image belonging to the target user in the second image can be determined by a face tracking algorithm. For example, face detection may be performed on the second image based on MTCNN, and the IoU values between one or more face detection boxes in the second image and the target user's face detection box in the previous frame may be calculated to determine the target user's face detection box in the second image, and thus the target user's face image position in the second image. Alternatively, when the saved face information of the target user is the target user's face features, after one or more face images are obtained from the second image, which face image belongs to the target user can be determined by computing face similarity.
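The IoU-based association between frames can be sketched as follows. This is a minimal illustration rather than the application's implementation: the `(x1, y1, x2, y2)` box format and the 0.3 matching threshold are assumptions introduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def track_target_face(prev_target_box, detected_boxes, iou_threshold=0.3):
    """Among the face boxes detected in the current frame, pick the one
    that best overlaps the target user's box from the previous frame;
    return None when no detection overlaps enough (track lost)."""
    best = max(detected_boxes, key=lambda b: iou(prev_target_box, b), default=None)
    if best is None or iou(prev_target_box, best) < iou_threshold:
        return None
    return best
```

A lost track (return value `None`) would correspond to the case where the target user's face cannot be found in the current frame.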
Optionally, for the manner of determining the to-be-recognized region corresponding to the target user in the second image according to the target user's face image position in the second image, reference may be made to the implementation in step 202 of determining the to-be-recognized region corresponding to a potential user in the first image according to the potential user's face image position in the first image; details are not repeated here.
Step 206: Perform gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
While the target user serves as the gesture recognition object, gesture recognition may be performed on the target user continuously. Performing gesture recognition on the target user may mean judging whether the target user's hand movements match a preset to-be-recognized gesture.
Optionally, step 206 may be implemented by inputting the to-be-recognized regions corresponding to the target user in the multiple frames of second images into a gesture recognition model as an image sequence, to obtain the gesture recognition result output by the model. The gesture recognition result may indicate a particular preset to-be-recognized gesture, meaning that the target user performed that gesture during the capture period of the second images; or it may indicate that no to-be-recognized gesture matched, meaning that the target user performed no preset to-be-recognized gesture during that period; or it may include a confidence value for each preset to-be-recognized gesture, in which case the display device, or a post-processing end connected to the display device, may take the to-be-recognized gesture with the highest confidence, provided that confidence exceeds a certain threshold, as the gesture performed by the target user. If no gesture's confidence exceeds the threshold, the target user performed no preset to-be-recognized gesture during the capture period of the second images.
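The confidence-based variant of interpreting the model output can be illustrated with a small helper. The gesture names and the 0.8 threshold below are hypothetical; the application only requires that the highest-confidence gesture be taken when its confidence exceeds some threshold.

```python
def select_gesture(confidences, threshold=0.8):
    """Given a mapping from each preset to-be-recognized gesture to the
    confidence reported by the gesture recognition model, return the
    highest-confidence gesture if it exceeds the threshold; otherwise
    return None, meaning no preset gesture was performed."""
    if not confidences:
        return None
    gesture = max(confidences, key=confidences.get)
    return gesture if confidences[gesture] > threshold else None
```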
Further, after gesture recognition is performed on the target user, if it is determined that the target user performed a to-be-recognized gesture during the capture period of the multiple frames of second images, the control instruction corresponding to that gesture is executed, realizing an air-gesture operation, and gesture recognition of the target user continues until the condition for ending the target user's role as the gesture recognition object is met. If it is determined that the target user performed no to-be-recognized gesture during that period, gesture recognition of the target user may likewise continue until that condition is met.
Optionally, if the preset gesture used to determine the gesture recognition object is the initial part of a to-be-recognized gesture, then whether the target user performed the to-be-recognized gesture containing that preset gesture can be judged from the to-be-recognized regions corresponding to the target user in the multiple frames of first images together with those in the multiple frames of second images. That is, after a user performs a to-be-recognized gesture in the camera's shooting area, the display device, or the post-processing end connected to the display device, can determine that user as the gesture recognition object based on the gesture, and then execute the control instruction corresponding to it.
In this embodiment of the application, while the target user serves as the gesture recognition object, the display device or the post-processing end connected to the display device performs gesture recognition only on the target user for a period of time, and not on any other user. Locking recognition to one user's gestures for a period of time avoids the problem of users' gestures interfering with one another and making accurate gesture control impossible.
The order of the steps of the gesture recognition object determination method provided in the embodiments of this application can be adjusted appropriately, and steps can be added or removed as the situation requires. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application falls within the protection scope of this application and is not described further here.
To sum up, in the gesture recognition object determination method provided by the embodiments of this application, a user whose face image appears in every frame of the multiple frames of images captured by the camera, and whose hand movements match the preset gesture, is determined as the gesture recognition object within the camera's shooting area. The gesture recognition object can thus be determined automatically from the images captured by the camera, and gesture recognition can then be performed on that object to realize air-gesture operation. The method is suitable for gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement. In addition, while gesture recognition is performed on the gesture recognition object, the gestures of other users are not recognized, which avoids gestures from different users interfering with one another and preventing accurate gesture control. Optionally, users in the shooting area who are turned sideways to the camera can be excluded, so that the gesture recognition object is determined only among users facing the camera, reducing the probability of misjudging the gesture recognition object. Optionally, the initial part of a to-be-recognized gesture serves as the preset gesture used to determine the gesture recognition object: when a user wants to perform an air-gesture operation, the to-be-recognized gesture can be performed directly in the camera's shooting area, with no other specific wake-up gesture needed to enable the device's gesture recognition function. The gesture recognition object is thus determined without the user being aware of it, simplifying user operation and improving the user experience.
Fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus provided by an embodiment of this application. As shown in Fig. 6, the apparatus 600 includes:
A first determining module 601, configured to determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area, where a potential user satisfies: every frame of the multiple frames of first images includes the potential user's face image.
A second determining module 602, configured to determine the potential user's hand movements according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, where the to-be-recognized region corresponding to the potential user includes the potential user's hand image.
A third determining module 603, configured to determine a target user among the one or more potential users as the gesture recognition object, where the target user's hand movements match the preset gesture.
Optionally, as shown in Fig. 7, the apparatus 600 further includes: a first obtaining module 604, configured to, after the target user among the one or more potential users is determined as the gesture recognition object, obtain the to-be-recognized regions corresponding to the target user in multiple frames of second images, where the to-be-recognized region corresponding to the target user includes the target user's hand image, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and a gesture recognition module 605, configured to perform gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
Optionally, the preset gesture includes the initial part of a to-be-recognized gesture, and the gesture recognition module 605 is configured to judge whether the target user performed the to-be-recognized gesture according to the to-be-recognized regions corresponding to the target user in the multiple frames of first images and in the multiple frames of second images.
Optionally, the first obtaining module 604 is configured to: determine the face image position of the target user in a second image according to the saved face information of the target user; and determine the to-be-recognized region corresponding to the target user in the second image according to that position.
Optionally, as shown in Fig. 8, the apparatus 600 further includes: a fourth determining module 606, configured to end the target user's role as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera from the shooting area that do not include the target user's face image exceeds a number threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
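The two release conditions just described (a face-missing image count and a lock duration) can be sketched as a small state holder. The threshold values and the cumulative counting of frames without the face are assumptions made for illustration; the application does not fix these details.

```python
import time

class GestureLock:
    """Decides when to stop treating the target user as the gesture
    recognition object. Threshold values here are illustrative only."""

    def __init__(self, max_missing_frames=30, max_duration_s=60.0):
        self.max_missing_frames = max_missing_frames
        self.max_duration_s = max_duration_s
        self.missing = 0                  # images captured without the target's face
        self.start = time.monotonic()     # moment the target user was locked

    def update(self, face_visible):
        """Call once per image captured after locking; returns True while the
        target user should remain the gesture recognition object."""
        if not face_visible:
            self.missing += 1
        if self.missing > self.max_missing_frames:
            return False                  # face absent in too many images: release
        if time.monotonic() - self.start > self.max_duration_s:
            return False                  # locked for too long: release
        return True
```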
Optionally, as shown in Fig. 9, the apparatus 600 further includes: a fifth determining module 607, configured to determine the face image position of a potential user in the first image after the one or more potential users in the shooting area are determined; and a sixth determining module 608, configured to determine the to-be-recognized region corresponding to the potential user in the first image according to that position.
Optionally, the third determining module 603 is configured to: when there are multiple potential users in the shooting area whose hand movements match the preset gesture, take the potential user closest to the camera as the target user.
Optionally, as shown in Fig. 10, the apparatus 600 further includes: a second obtaining module 609, configured to obtain the distance from a potential user to the camera; and an output module 610, configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, where the distance prompt is used to prompt the potential user to move closer to the camera. If the gesture recognition object determination apparatus is a display device, the output module 610 is specifically a display module; alternatively, if the apparatus is a post-processing end, the output module 610 is specifically a sending module.
Optionally, the second obtaining module 609 is configured to determine the distance from the potential user to the camera according to the camera's focal length, the potential user's interocular distance in the first image, and a preset user interocular distance.
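This distance estimate follows from similar triangles in a pinhole camera model: distance = focal_length × real interocular distance ÷ interocular distance in the image. A sketch, assuming the focal length is expressed in pixels and using a hypothetical 63 mm preset interpupillary distance:

```python
def estimate_distance(focal_length_px, eye_distance_px, real_eye_distance_m=0.063):
    """Pinhole-camera estimate of the user-to-camera distance in metres
    from the interocular spacing measured in the image (in pixels).
    0.063 m is an assumed average adult interpupillary distance."""
    if eye_distance_px <= 0:
        raise ValueError("interocular distance in the image must be positive")
    return focal_length_px * real_eye_distance_m / eye_distance_px
```

With a 1000-pixel focal length, a user whose eyes appear 63 pixels apart would be estimated at about 1 m from the camera.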
Optionally, the second determining module 602 is configured to: perform key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and determine the potential user's hand movements according to the multiple sets of hand key point information.
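How hand movements might be derived from per-frame key point sets can be illustrated with a deliberately simple rule; a real system would typically feed the key point sequence to a learned classifier. The wrist-first key point layout, the gesture labels, and the swipe threshold are all assumptions introduced here.

```python
def hand_action_from_keypoints(keypoint_sets, min_swipe_px=40):
    """Classify a coarse hand action from per-frame hand key point sets.
    Each element of keypoint_sets is a list of (x, y) key points for one
    frame; the wrist is assumed to be the first key point. If the wrist
    moves far enough horizontally over the sequence, report a swipe;
    otherwise report a static hand."""
    wrist_xs = [points[0][0] for points in keypoint_sets]
    dx = wrist_xs[-1] - wrist_xs[0]
    if dx > min_swipe_px:
        return "swipe_right"
    if dx < -min_swipe_px:
        return "swipe_left"
    return "static"
```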
Optionally, the to-be-recognized region corresponding to the potential user further includes the potential user's elbow image. The second determining module 602 is configured to: perform key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and determine the potential user's hand movements according to the multiple sets of hand and elbow key point information.
Optionally, the above face image is a frontal face image.
Regarding the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of this application. The device may be a general-purpose computing device; for example, in a conference scenario, it may be a conference terminal or a post-conference processing end. Optionally, the conference terminal may be a large screen or an electronic whiteboard, and the post-conference processing end may be a server, a server cluster composed of multiple servers, or a cloud computing platform. As shown in Fig. 11, the gesture recognition object determination device 1100 includes a processor 1101 and a memory 1102.
The memory 1102 is configured to store a computer program, the computer program including program instructions.
The processor 1101 is configured to invoke the computer program to implement the method steps shown in Fig. 2 of the foregoing method embodiment.
Optionally, the gesture recognition object determination device 1100 further includes a communication bus 1103 and a communication interface 1104.
The processor 1101 includes one or more processing cores, and executes various functional applications and performs data processing by running the computer program.
The memory 1102 may be used to store the computer program. Optionally, the memory may store an operating system and the application program units required for at least one function. The operating system may be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, or OS X.
There may be multiple communication interfaces 1104, used to communicate with other storage devices or network devices. For example, in this embodiment of the application, when the gesture recognition object determination device is a post-conference processing end, the communication interface of the post-conference processing end may be used to send gesture recognition object determination results and the like to the conference terminal. A network device may be a switch, a router, or the like.
The memory 1102 and the communication interface 1104 are each connected to the processor 1101 through the communication bus 1103.
An embodiment of this application further provides a computer-readable storage medium storing instructions which, when executed by a processor, implement the method steps shown in Fig. 2 of the foregoing method embodiment.
An embodiment of this application further provides a computer program product including a computer program which, when executed by a processor, implements the method steps shown in Fig. 2 of the foregoing method embodiment.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the embodiments of this application, the terms "first", "second", and "third" are used for description purposes only and are not to be understood as indicating or implying relative importance.
In this application, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" in this document generally indicates an "or" relationship between the preceding and following objects.
The above are only optional embodiments of this application and are not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the concept and principles of this application shall fall within its protection scope.

Claims (27)

1. A gesture recognition object determination method, characterized in that the method comprises:
    determining one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, wherein a potential user satisfies: every frame of the multiple frames of first images includes a face image of the potential user;
    determining hand movements of the potential user according to to-be-recognized regions corresponding to the potential user in the multiple frames of first images, wherein the to-be-recognized region corresponding to the potential user includes a hand image of the potential user; and
    determining a target user among the one or more potential users as a gesture recognition object, wherein hand movements of the target user match a preset gesture.
2. The method according to claim 1, characterized in that, after the target user among the one or more potential users is determined as the gesture recognition object, the method further comprises:
    obtaining to-be-recognized regions corresponding to the target user in multiple frames of second images, wherein the to-be-recognized region corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and
    performing gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
3. The method according to claim 2, characterized in that the preset gesture comprises an initial part of a to-be-recognized gesture, and the performing gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images comprises:
    judging whether the target user has performed the to-be-recognized gesture according to the to-be-recognized regions corresponding to the target user in the multiple frames of first images and the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
4. The method according to claim 2 or 3, characterized in that the obtaining to-be-recognized regions corresponding to the target user in multiple frames of second images comprises:
    determining a face image position of the target user in a second image according to saved face information of the target user; and
    determining the to-be-recognized region corresponding to the target user in the second image according to the face image position of the target user in the second image.
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    ending the target user's role as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera from the shooting area that do not include a face image of the target user exceeds a number threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
6. The method according to any one of claims 1 to 5, characterized in that, after the one or more potential users in the shooting area are determined, the method further comprises:
    determining a face image position of the potential user in the first image; and
    determining the to-be-recognized region corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
7. The method according to any one of claims 1 to 6, characterized in that the determining a target user among the one or more potential users as a gesture recognition object comprises:
    when there are multiple potential users in the shooting area whose hand movements match the preset gesture, taking the potential user closest to the camera as the target user.
8. The method according to any one of claims 1 to 7, characterized in that, after the one or more potential users in the shooting area are determined, the method further comprises:
    obtaining a distance from the potential user to the camera; and
    when the distance from the potential user to the camera exceeds a distance threshold, outputting a distance prompt, wherein the distance prompt is used to prompt the potential user to move closer to the camera.
9. The method according to claim 8, characterized in that the obtaining a distance from the potential user to the camera comprises:
    determining the distance from the potential user to the camera according to a focal length of the camera, an interocular distance of the potential user in the first image, and a preset user interocular distance.
10. The method according to any one of claims 1 to 9, characterized in that the determining hand movements of the potential user according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images comprises:
    performing key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and
    determining the hand movements of the potential user according to the multiple sets of hand key point information of the potential user.
11. The method according to any one of claims 1 to 9, characterized in that the to-be-recognized region corresponding to the potential user further includes an elbow image of the potential user, and the determining hand movements of the potential user according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images comprises:
    performing key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and
    determining the hand movements of the potential user according to the multiple sets of hand and elbow key point information of the potential user.
12. The method according to any one of claims 1 to 11, characterized in that the face image is a frontal face image.
13. A gesture recognition object determination apparatus, characterized in that the apparatus comprises:
    a first determining module, configured to determine one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, wherein a potential user satisfies: every frame of the multiple frames of first images includes a face image of the potential user;
    a second determining module, configured to determine hand movements of the potential user according to to-be-recognized regions corresponding to the potential user in the multiple frames of first images, wherein the to-be-recognized region corresponding to the potential user includes a hand image of the potential user; and
    a third determining module, configured to determine a target user among the one or more potential users as a gesture recognition object, wherein hand movements of the target user match a preset gesture.
  14. The apparatus according to claim 13, wherein the apparatus further comprises:
    a first acquisition module, configured to, after the target user among the one or more potential users is determined as the gesture recognition object, acquire regions to be recognized corresponding to the target user in multiple frames of second images, wherein the region to be recognized corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and
    a gesture recognition module, configured to perform gesture recognition on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images.
  15. The apparatus according to claim 14, wherein the preset gesture comprises an initial part of a gesture to be recognized, and the gesture recognition module is configured to:
    determine, according to the regions to be recognized corresponding to the target user in the multiple frames of first images and the regions to be recognized corresponding to the target user in the multiple frames of second images, whether the target user has performed the gesture to be recognized.
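Because the preset gesture is only the beginning of the full gesture, the judgment in this claim spans both image sequences. A minimal sketch, assuming per-frame motion labels (the labels, the `performed_full_gesture` helper, and the in-order matching rule are illustrative assumptions):

```python
# Sketch: concatenate the frames observed before selection (first images)
# with those observed after selection (second images), then check that
# the full gesture's steps appear in order in the combined sequence.

def performed_full_gesture(first_frames, second_frames, full_gesture):
    """Each *_frames argument is a list of per-frame motion labels."""
    sequence = first_frames + second_frames
    it = iter(sequence)
    # 'step in it' consumes the iterator, so steps must occur in order
    return all(step in it for step in full_gesture)

print(performed_full_gesture(["raise"], ["hold", "swipe_left"],
                             ["raise", "swipe_left"]))  # True
```

The point of the combined check is that frames used to select the target user are not discarded: they contribute the initial part of the gesture to be recognized.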
  16. The apparatus according to claim 14 or 15, wherein the first acquisition module is configured to:
    determine a face image position of the target user in the second image according to stored face information of the target user; and
    determine the region to be recognized corresponding to the target user in the second image according to the face image position of the target user in the second image.
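Deriving a region to be recognized from a face position, as in this claim (and claim 18 for potential users), can be sketched as a geometric expansion of the face bounding box. The expansion factors and the `region_from_face` helper are illustrative assumptions; the patent does not specify how the region is constructed from the face position.

```python
# Sketch: widen the face box sideways and extend it downward so that it
# plausibly covers the user's hands, then clip the result to the image.

def region_from_face(face_box, img_w, img_h, side=1.5, down=3.0):
    x, y, w, h = face_box  # face bounding box: top-left corner, width, height
    rx = max(0, int(x - side * w))            # widen to both sides
    ry = max(0, y)                            # keep the face's top edge
    rw = min(img_w, int(x + w + side * w)) - rx
    rh = min(img_h, int(y + h + down * h)) - ry
    return (rx, ry, rw, rh)

print(region_from_face((100, 50, 40, 40), 640, 480))  # (40, 50, 160, 160)
```

Anchoring the region to the tracked face keeps hand detection cheap: only the cropped region, not the whole second image, needs per-frame processing.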
  17. The apparatus according to any one of claims 13 to 16, wherein the apparatus further comprises:
    a fourth determination module, configured to stop using the target user as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images obtained by the camera shooting the shooting area that do not include the face image of the target user exceeds a quantity threshold, or when the duration for which the target user serves as the gesture recognition object reaches a duration threshold.
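The two release conditions in this claim reduce to a simple disjunction. A minimal sketch (the threshold values and the `should_release` helper are illustrative assumptions):

```python
# Sketch: the target user stops being the gesture recognition object
# once too many captured frames lack their face (user likely left),
# or once they have held the role for the maximum allowed duration.

def should_release(missing_face_count, elapsed_s,
                   count_threshold=30, duration_threshold_s=60.0):
    return (missing_face_count > count_threshold
            or elapsed_s >= duration_threshold_s)

print(should_release(5, 10.0))    # False: still tracked, within time
print(should_release(31, 10.0))   # True: face missing too often
print(should_release(0, 60.0))    # True: duration threshold reached
```

The duration cap prevents one user from monopolizing gesture control indefinitely, while the missing-face count handles the user walking out of frame.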
  18. The apparatus according to any one of claims 13 to 17, wherein the apparatus further comprises:
    a fifth determination module, configured to determine a face image position of the potential user in the first image after the one or more potential users in the shooting area are determined; and
    a sixth determination module, configured to determine the region to be recognized corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
  19. The apparatus according to any one of claims 13 to 18, wherein the third determination module is configured to:
    when there are, in the shooting area, multiple potential users whose hand motions match the preset gesture, take the potential user closest to the camera as the target user.
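The tie-breaking rule in this claim is a nearest-candidate selection. A minimal sketch (the `pick_target` helper and the candidate tuple layout are illustrative assumptions):

```python
# Sketch: among the potential users whose hand motion matched the
# preset gesture, pick the one closest to the camera as the target.

def pick_target(candidates):
    """candidates: list of (user_id, distance_to_camera_m) tuples for
    users whose hand motion matched the preset gesture."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])[0]

print(pick_target([("alice", 2.4), ("bob", 1.1), ("carol", 3.0)]))  # bob
```

The per-user distances could come from the focal-length/interocular estimate of claim 21, which keeps the whole selection monocular.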
  20. The apparatus according to any one of claims 13 to 19, wherein the apparatus further comprises:
    a second acquisition module, configured to acquire a distance from the potential user to the camera; and
    an output module, configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, wherein the distance prompt is used to prompt the potential user to move closer to the camera.
  21. The apparatus according to claim 20, wherein the second acquisition module is configured to:
    determine the distance from the potential user to the camera according to a focal length of the camera, an interocular distance of the potential user in the first image, and a preset user interocular distance.
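The distance estimate in this claim follows directly from similar triangles in the pinhole camera model: distance ≈ focal length (in pixels) × real interocular distance ÷ interocular distance measured in the image. A minimal sketch; the `user_distance` helper and the preset interocular distance of ~0.063 m (a commonly cited adult average) are assumptions, not values from the patent.

```python
# Sketch of monocular distance estimation from the interocular distance:
# with a pinhole model, an object of real size S at distance D projects
# to S * f / D pixels, so D = f * S / (projected size in pixels).

def user_distance(focal_length_px, eye_dist_px, preset_eye_dist_m=0.063):
    if eye_dist_px <= 0:
        raise ValueError("interocular pixel distance must be positive")
    return focal_length_px * preset_eye_dist_m / eye_dist_px

print(round(user_distance(1000.0, 42.0), 3))  # 1.5 (metres)
```

Because only one camera is needed, this keeps the apparatus monocular; the preset interocular distance introduces a per-user error of a few percent, which is acceptable for a coarse "move closer" prompt.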
  22. The apparatus according to any one of claims 13 to 21, wherein the second determination module is configured to:
    perform key point detection separately on the regions to be recognized corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and
    determine the hand motion of the potential user according to the multiple sets of hand key point information of the potential user.
  23. The apparatus according to any one of claims 13 to 21, wherein the region to be recognized corresponding to the potential user further includes an elbow image of the potential user, and the second determination module is configured to:
    perform key point detection separately on the regions to be recognized corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and
    determine the hand motion of the potential user according to the multiple sets of hand and elbow key point information of the potential user.
  24. The apparatus according to any one of claims 13 to 23, wherein the face image is a frontal face image.
  25. A gesture recognition object determination device, comprising a processor and a memory, wherein:
    the memory is configured to store a computer program, the computer program comprising program instructions; and
    the processor is configured to invoke the computer program to implement the gesture recognition object determination method according to any one of claims 1 to 12.
  26. A computer-readable storage medium, wherein instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the gesture recognition object determination method according to any one of claims 1 to 12 is implemented.
  27. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the gesture recognition object determination method according to any one of claims 1 to 12 is implemented.
PCT/CN2022/078623 2021-06-30 2022-03-01 Gesture recognition object determination method and apparatus WO2023273372A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110736357.6 2021-06-30
CN202110736357 2021-06-30
CN202111034365.2A CN115565241A (en) 2021-06-30 2021-09-03 Gesture recognition object determination method and device
CN202111034365.2 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023273372A1

Family

ID=84689882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078623 WO2023273372A1 (en) 2021-06-30 2022-03-01 Gesture recognition object determination method and apparatus

Country Status (1)

Country Link
WO (1) WO2023273372A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100266206A1 (en) * 2007-11-13 2010-10-21 Olaworks, Inc. Method and computer-readable recording medium for adjusting pose at the time of taking photos of himself or herself
CN107239727A (en) * 2016-12-07 2017-10-10 北京深鉴智能科技有限公司 Gesture identification method and system
CN108960163A (en) * 2018-07-10 2018-12-07 亮风台(上海)信息科技有限公司 Gesture identification method, device, equipment and storage medium
CN109977906A (en) * 2019-04-04 2019-07-05 睿魔智能科技(深圳)有限公司 Gesture identification method and system, computer equipment and storage medium
CN110032966A (en) * 2019-04-10 2019-07-19 湖南华杰智通电子科技有限公司 Human body proximity test method, intelligent Service method and device for intelligent Service

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116301363A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116301363B (en) * 2023-02-27 2024-02-27 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116301361A (en) * 2023-03-08 2023-06-23 百度在线网络技术(北京)有限公司 Target selection method and device based on intelligent glasses and electronic equipment

Similar Documents

Publication Publication Date Title
TWI751161B (en) Terminal equipment, smart phone, authentication method and system based on face recognition
US10043308B2 (en) Image processing method and apparatus for three-dimensional reconstruction
CN104956292B (en) The interaction of multiple perception sensing inputs
CN103353935B (en) A kind of 3D dynamic gesture identification method for intelligent domestic system
WO2023273372A1 (en) Gesture recognition object determination method and apparatus
US11776322B2 (en) Pinch gesture detection and recognition method, device and system
CN112585566B (en) Hand-covering face input sensing for interacting with device having built-in camera
US20110273551A1 (en) Method to control media with face detection and hot spot motion
WO2019214442A1 (en) Device control method, apparatus, control device and storage medium
US20120019684A1 (en) Method for controlling and requesting information from displaying multimedia
CN106155315A (en) The adding method of augmented reality effect, device and mobile terminal in a kind of shooting
US20150370336A1 (en) Device Interaction with Spatially Aware Gestures
CN110688914A (en) Gesture recognition method, intelligent device, storage medium and electronic device
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
WO2023173668A1 (en) Input recognition method in virtual scene, device and storage medium
CN114138121B (en) User gesture recognition method, device and system, storage medium and computing equipment
CN115565241A (en) Gesture recognition object determination method and device
CN112083801A (en) Gesture recognition system and method based on VR virtual office
US20150185851A1 (en) Device Interaction with Self-Referential Gestures
Perra et al. Adaptive eye-camera calibration for head-worn devices
CN111103981A (en) Control instruction generation method and device
WO2024055957A1 (en) Photographing parameter adjustment method and apparatus, electronic device and readable storage medium
WO2023169282A1 (en) Method and apparatus for determining interaction gesture, and electronic device
CN117813581A (en) Multi-angle hand tracking
CN112367468B (en) Image processing method and device and electronic equipment

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE