WO2023273372A1 - Gesture recognition object determination method and apparatus - Google Patents


Info

Publication number
WO2023273372A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
image
potential
target user
gesture recognition
Prior art date
Application number
PCT/CN2022/078623
Other languages
French (fr)
Chinese (zh)
Inventor
Huang Yunzhen
Wang Hao
Li Donghu
Leng Jinan
Chang Sheng
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111034365.2A external-priority patent/CN115565241A/en
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023273372A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer

Definitions

  • the present application relates to the field of computer vision, in particular to a method and device for determining a gesture recognition object.
  • gesture recognition is a very important way of human-computer interaction.
  • Gesture recognition technology uses various sensors to model the shape and displacement of the hand (arm), forms an information sequence, and then converts the information sequence into corresponding instructions to control certain operations.
  • the present application provides a gesture recognition object determination method and device.
  • a method for determining a gesture recognition object is provided.
  • the method can be applied to general computing devices.
  • the method includes: determining one or more potential users in the shooting area according to multiple frames of first images obtained by shooting the shooting area with the camera, where a potential user is a user whose face image is included in every frame of the multiple frames of first images.
  • the hand movement of a potential user is determined according to the regions to be recognized corresponding to that potential user in the multiple frames of first images, where the region to be recognized corresponding to the potential user includes the hand image of the potential user.
  • a target user among one or more potential users is determined as a gesture recognition object, and a hand movement of the target user matches a preset gesture.
  • a user whose face image exists in every frame of the multiple frames of images captured by the camera and whose hand movement matches a preset gesture is determined as the gesture recognition object within the camera's shooting area. Based on the images captured by the camera, the gesture recognition object can be determined automatically, and gesture recognition can then be performed on that object to realize air gesture operations. The method is simple to implement and is suitable for gesture recognition in various scenarios, especially multi-user scenarios.
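The determination flow described above can be sketched in a few lines of code. This is an illustrative outline only, not the patent's implementation: `detect_faces`, `hand_motion_of`, and `matches` are hypothetical stand-ins for the face detection, keypoint-based motion estimation, and gesture-matching steps described later in the document.

```python
def pick_gesture_target(first_images, preset_gesture, detect_faces,
                        hand_motion_of, matches):
    """Return a potential user whose hand motion matches the preset gesture."""
    # A potential user must have a face image in every first image.
    per_frame_ids = [set(detect_faces(img)) for img in first_images]
    potential = set.intersection(*per_frame_ids) if per_frame_ids else set()
    for user in sorted(potential):
        # A matching potential user becomes the gesture recognition object.
        if matches(hand_motion_of(user, first_images), preset_gesture):
            return user
    return None
```

In the patent's fuller scheme, ties between several matching users are broken by distance to the camera; this sketch simply returns the first match.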
  • the above method further includes: acquiring the region to be recognized corresponding to the target user in multiple frames of second images, where the region to be recognized corresponding to the target user includes the target user's hand image, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object.
  • Gesture recognition is performed on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images.
  • within a period of time, gesture recognition is performed only on the target user serving as the gesture recognition object, and not on any other user.
  • locking recognition onto one user's gestures for a period of time avoids the problem of users' gestures interfering with each other and making accurate gesture control impossible.
  • the preset gesture includes the initial part of the gesture to be recognized
  • the realization process of performing gesture recognition on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images includes: determining whether the target user performs the gesture to be recognized according to the regions to be recognized corresponding to the target user in the multiple frames of first images and in the multiple frames of second images.
  • the initial part of the gesture to be recognized is used as the preset gesture used to determine the gesture recognition object.
  • the gesture to be recognized can be performed directly in the camera's shooting area, without first performing a separate wake-up gesture to enable the device's gesture recognition function; the gesture recognition object is thus determined without the user's awareness, which simplifies user operations and improves user experience.
  • the implementation process of acquiring the region to be recognized corresponding to the target user in multiple frames of second images includes: determining the position of the target user's face image in the second image according to the saved face information of the target user, and determining the region to be recognized corresponding to the target user in the second image according to that position.
  • the face information of the target user can be saved, so that the hand movement of the target user can be associated with the target user's face information, hand tracking of the target user can be realized, and gesture recognition of the target user can then be performed.
  • while the target user serves as the gesture recognition object, the camera continues to shoot the shooting area.
  • when the number of captured images that do not include the target user's face image exceeds a count threshold, or the duration for which the target user has been the gesture recognition object exceeds a duration threshold, the target user ceases to be the gesture recognition object.
  • at most one gesture recognition object can be determined at a time. Since the gesture recognition object may change over time, setting conditions for ending the target user's role as the gesture recognition object satisfies the application scenario's need for the gesture recognition object to change flexibly.
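The end-of-lock conditions described above (too many frames without the target's face, or the lock lasting too long) might be tracked as follows. The class name and threshold values are illustrative assumptions, not values from the application:

```python
MISS_FRAME_THRESHOLD = 10   # frames without the target's face (assumed value)
MAX_LOCK_DURATION = 30.0    # seconds a user may stay locked (assumed value)

class GestureLock:
    """Tracks whether the target user should remain the gesture recognition object."""

    def __init__(self, user_id, start_time):
        self.user_id = user_id
        self.start_time = start_time
        self.missed_frames = 0

    def update(self, face_visible, now):
        """Return True while the lock should be kept, False to release it."""
        if face_visible:
            self.missed_frames = 0
        else:
            self.missed_frames += 1
        if self.missed_frames > MISS_FRAME_THRESHOLD:
            return False           # face absent from too many images
        if now - self.start_time > MAX_LOCK_DURATION:
            return False           # lock duration exceeded
        return True
```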
  • the above method further includes: determining the face image positions of the potential users in the first image. According to the position of the face image of the potential user in the first image, a region to be recognized corresponding to the potential user in the first image is determined.
  • the implementation process of determining a target user among the one or more potential users as the gesture recognition object includes: when the hand movements of multiple potential users in the shooting area match the preset gesture, setting the potential user closest to the camera as the target user.
  • the above method further includes: acquiring a distance from the potential users to the camera.
  • a distance prompt is output, which is used to prompt the potential user to approach the camera.
  • a distance prompt is output to remind potential users to approach the camera. A potential user who wants to perform air gesture operations can approach the camera according to the prompt, which improves the accuracy of determining the gesture recognition object and, in turn, the recognition accuracy of the gesture recognition object's air gestures.
  • the implementation process of obtaining the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the focal length of the camera is f
  • the distance between the eyes of the user in the image (that is, the imaging plane) captured by the camera containing the user's frontal face image is M
  • the preset distance between the eyes of the user is K
  • the distance between the user and the camera is d
  • regardless of whether the camera is a monocular camera, a binocular camera, or a camera integrated with a depth sensor, the distance from the user to the camera can be determined based on the principle of similar triangles: d = f·K/M. The calculation method is simple and the implementation cost is low.
  • the implementation process of determining the hand movements of the potential users includes: performing key point detection on the regions to be recognized corresponding to the potential users in the multiple frames of first images, respectively, to obtain multiple sets of hand key point information of the potential users, and determining the hand motion of each potential user according to that user's multiple sets of hand key point information.
  • the region to be identified corresponding to the potential user also includes an elbow image of the potential user. In this case, the implementation process of determining the hand movement of the potential user according to the regions to be identified in the multiple frames of first images includes: performing key point detection on those regions respectively to obtain multiple sets of hand and elbow key point information of the potential user, and determining the hand movement of the potential user according to the multiple sets of hand and elbow key point information.
  • this application uses the moving direction of the key points of the user's elbow to assist in judging the moving direction of the user's hand, which improves the accuracy of judging the user's hand motion.
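As an illustration of how elbow key points could assist the hand-direction judgment, the sketch below compares the dominant movement direction of a wrist track and an elbow track and accepts the hand direction only when the two agree. This is a simplified, hypothetical stand-in for the keypoint-based motion estimation, not the patent's algorithm:

```python
def movement_direction(points):
    """Dominant left/right/up/down direction of a keypoint track of (x, y) pairs."""
    dx = points[-1][0] - points[0][0]
    dy = points[-1][1] - points[0][1]
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

def hand_motion(wrist_track, elbow_track):
    """Return the hand direction only when the elbow direction agrees, else None."""
    hand_dir = movement_direction(wrist_track)
    elbow_dir = movement_direction(elbow_track)
    return hand_dir if hand_dir == elbow_dir else None
```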
  • the above face image is a frontal face image. That is, every frame of the multiple frames of first images includes the frontal face image of the potential user.
  • this application can also exclude users in the shooting area who are not facing the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
  • an apparatus for determining an object for gesture recognition includes a plurality of functional modules, and the plurality of functional modules interact to implement the methods in the above first aspect and various implementation manners thereof.
  • the multiple functional modules can be implemented based on software, hardware or a combination of software and hardware, and the multiple functional modules can be combined or divided arbitrarily based on specific implementations.
  • a gesture recognition object determination device including: a processor and a memory;
  • the memory is used to store a computer program, and the computer program includes program instructions
  • the processor is configured to invoke the computer program to implement the methods in the above first aspect and various implementation manners thereof.
  • In a fourth aspect, a computer-readable storage medium is provided. Instructions are stored on the computer-readable storage medium; when the instructions are executed by a processor, the methods in the above first aspect and its various implementation manners are realized.
  • a computer program product including a computer program is provided.
  • when the computer program is executed by a processor, the method in the above first aspect and its various implementation manners is implemented.
  • a chip is provided, and the chip includes a programmable logic circuit and/or program instructions, and when the chip is running, implements the method in the above first aspect and various implementation manners thereof.
  • FIG. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for determining a gesture recognition object provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a ranging principle provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the distribution of hand key points provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another gesture recognition object determination device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another device for determining an object for gesture recognition provided by an embodiment of the present application.
  • FIG. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • gestures can express rich information in a non-contact manner
  • gesture recognition is widely used in human-computer interaction, smart phones, smart TVs and other products.
  • vision-based gesture recognition technology does not require wearing additional sensors or markers on the hand, which is convenient and has broad application prospects in human-computer interaction.
  • the gestures mentioned in this application all refer to non-contact gestures, that is, air gestures.
  • For example, in a conference room scenario, participants can perform air gesture operations such as page up, page down, page left, page right, and screenshot on the display screen of the conference terminal.
  • family members can perform air gesture operations such as fast forward, rewind, turn up the volume, turn down the volume, and pause on the playback screen on the smart TV.
  • a teacher or a student may perform air gesture operations such as scrolling up and scrolling down on the displayed content on the display device.
  • when gesture recognition is performed based on the images collected by the camera, the gestures of multiple users may easily be recognized at the same time.
  • the display device may then be unable to distinguish which user is performing the air gesture operation; users' gestures interfere with each other, so the display device cannot realize accurate gesture control.
  • this application proposes a solution for determining the gesture recognition object: face detection is performed on multiple frames of images captured by the camera to identify potential users in the shooting area, and the gesture recognition object is then determined among those potential users by judging their hand movements.
  • that is, a user whose face image exists in every frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture can be determined as the gesture recognition object within the camera's shooting area.
  • this application can automatically determine the gesture recognition object based on the images captured by the camera, and can further perform gesture recognition on that object to realize air gesture operations. It is suitable for gesture recognition in various scenarios, especially multi-user scenarios, and is simple to implement.
  • gestures of users other than the gesture recognition object will not be recognized, which avoids the problem of users' gestures interfering with each other and preventing accurate gesture control.
  • the solution of this application can also exclude users in the shooting area who are not facing the camera and determine the gesture recognition object only among the users facing the camera. Specifically, a user whose front face image exists in every frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture can be determined as the gesture recognition object within the camera's shooting area. In this way, the probability of misjudging the gesture recognition object is reduced. To further improve determination accuracy, the operation manual of a display device supporting the gesture control function can also clearly state that the user should face the camera when performing air gesture operations.
  • Facing the camera referred to in this application does not mean that the face is completely facing the camera, but there may be a deviation within a set range.
  • "the face completely facing the camera" can mean that the line connecting the eyes is parallel to the imaging plane of the camera. If the face deflection angle when the face is completely facing the camera is 0°, then in this application, facing the camera means that the deflection angle of the face relative to the camera is within a set range; that is, if the user's face deflection angle is within that range, the user is considered to be facing the camera. For example, users whose face deflection angle is within -30° to 30° can be regarded as facing the camera; this range is given only as an example of the deflection angle range used to determine whether a user is facing the camera. In this application, the face of a user facing the camera is called the front face.
  • the method for determining a gesture recognition object may be applied to a general-purpose computing device.
  • the general computing device may be a display device or a post-processing terminal connected to the display device.
  • the display device supports a gesture control function.
  • the display device has a built-in camera, or the display device is connected to an external camera. The camera is used to take pictures of the shooting area to obtain images.
  • the display device or the post-processing terminal connected to the display device is used to determine the gesture recognition object in the shooting area according to the image captured by the camera, and further perform gesture recognition on the gesture recognition object to respond to the gesture operation in the air.
  • the deployment orientation of the camera is generally consistent with the deployment orientation of the display device, and the shooting area of the camera generally includes the area toward which the display surface of the display device faces.
  • the post-processing end may be a server, or a server cluster composed of multiple servers, or a cloud computing platform.
  • the display device can be a conference terminal such as a large screen or an electronic whiteboard.
  • the display device may be a smart TV, a projection device, or a VR device.
  • FIG. 1 is a schematic diagram of an application scenario involved in a method for determining a gesture recognition object provided in an embodiment of the present application.
  • the application scenario is a conference room scenario.
  • the application scenario includes a conference terminal, and the conference terminal has a built-in camera.
  • the conference terminal is installed on the wall.
  • the camera's field of view includes the conference table and several attendees.
  • when the gesture control function of the conference terminal is turned on, the camera continuously photographs the shooting area, and the conference terminal, or the post-processing terminal (not shown in the figure) connected to it, processes the captured images to determine whether a gesture recognition object exists in the shooting area.
  • Fig. 2 is a schematic flowchart of a method for determining a gesture recognition object provided by an embodiment of the present application. As shown in Figure 2, the method includes:
  • Step 201: Determine one or more potential users in the shooting area according to multiple frames of first images obtained by shooting the shooting area with the camera.
  • each frame of the first image in the multiple frames of the first image includes the face image of the potential user. That is to say, a user whose face image exists in each frame of the first image in multiple frames is regarded as a potential user in the shooting area.
  • the number of frames of first images used to determine potential users is pre-configured.
  • for example, the multiple frames of first images may be 3 frames, 5 frames, or 10 frames.
  • the number of frames of first images is not limited in this embodiment of the present application.
  • face detection is performed on the multiple frames of first images respectively to obtain the face images in each frame. It is then determined which face images in different first images belong to the same user, and finally which users' face images exist in every frame of the multiple frames of first images, thereby obtaining the potential users in the shooting area.
  • for example, face detection may be performed based on multi-task cascaded convolutional networks (MTCNN).
  • MTCNN includes three cascaded networks: a proposal network (P-Net), a refinement network (R-Net), and an output network (O-Net).
  • the process of face detection of images based on MTCNN includes:
  • the image pyramid includes multiple images of different sizes obtained by scaling the original image. Since there may be face images of different sizes in the original image, by establishing an image pyramid, face images of different sizes in the original image can be detected at a uniform size, and the robustness of the network to face images of different sizes can be enhanced.
  • the image pyramid is input into the three cascaded networks (P-Net, R-Net, O-Net), and the face image in the image is detected from coarse to fine through the three cascaded networks, and finally Output the face detection result.
  • P-Net is used to regress multiple detection frames for the input image, map these detection frames back to the original image, and remove redundant frames through the non-maximum suppression (NMS) algorithm to obtain preliminary face detection results.
  • R-Net is used to further refine and filter the face detection results output by P-Net.
  • O-Net is used to further refine and filter the face detection results output by R-Net, and output the final face detection results.
  • the face detection result obtained based on the MTCNN includes face detection frame information and face key point information corresponding to each detected face image.
  • the face detection frame information may include the coordinates of the upper left corner and the lower right corner of the face detection frame, and the face image is located in the face detection frame.
  • the human face key point information may include coordinates of multiple human face key points, and the multiple human face key points may include left eye, right eye, nose, left mouth corner, right mouth corner, and the like.
  • the intersection over union (IoU) value of the face detection frames in every two adjacent frames of first images among the multiple frames of first images may be calculated separately.
  • the IoU value here may be equal to the ratio of the intersection area to the union area of the face detection frames in the two adjacent frames of first images after the two frames are superimposed.
  • the IoU value ranges from 0 to 1.
  • assuming two adjacent first images are image A and image B, when the IoU value of the first face detection frame in image A and the second face detection frame in image B is greater than a preset threshold, it can be determined that the face image in the first face detection frame and the face image in the second face detection frame belong to the same user.
  • when the multiple frames of first images are consecutive frames collected by the camera, the preset threshold may take a larger value, for example 0.8; if the multiple frames of first images include frames collected by the camera at intervals, the preset threshold may take a smaller value, for example 0.6.
  • the embodiment of the present application does not limit the specific value of the preset threshold.
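The IoU computation used for associating face detection frames across adjacent frames can be written directly from its definition (intersection area over union area of the superimposed boxes). A minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def same_user(box_a, box_b, threshold=0.8):
    """Treat two detections as the same user when IoU exceeds the threshold."""
    return iou(box_a, box_b) > threshold
```

The 0.8 default mirrors the example threshold for consecutive frames given above.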
  • alternatively, it may be determined which face images in different first images belong to the same user by calculating the face similarity between the face images in the multiple frames of first images.
  • the same user identifier may be used to identify face images belonging to the same user in different images, and different user identifiers may be used to identify face images belonging to different users in the same image. If each frame of the first image in the plurality of frames of first images includes a face image identified by the same user identifier, then the user represented by the user identifier is determined as a potential user.
  • the user identifiers used here only need to distinguish different users, for example, numbers, characters or other identifiers may be used as user identifiers.
  • since the scheme of this application does not need to identify the user's identity but only to distinguish different users, there is no need to preset the gesture recognition objects that may exist in the scene, and the scheme can be flexibly applied to various multi-user scenarios, especially those with changeable user groups, such as public conference rooms.
  • the embodiment of the present application can also exclude users in the shooting area who are not facing the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
  • in this case, the face image of the potential user included in each frame of the multiple frames of first images is a frontal face image.
  • that is, every frame of the multiple frames of first images includes the frontal face image of the potential user, and a user whose frontal face image exists in every frame of the multiple frames of first images is regarded as a potential user in the shooting area.
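Given per-frame sets of detected front-face user identifiers, selecting potential users reduces to a set intersection across frames, as in this minimal sketch (the user-ID assignment via IoU or face similarity is assumed to have already happened):

```python
def potential_users(frames):
    """Users whose front-face ID appears in every frame.

    `frames` is a list where each entry is the set of user IDs whose front
    face was detected in that frame; a user qualifies as a potential user
    only if present in all of them.
    """
    if not frames:
        return set()
    present = set(frames[0])
    for ids in frames[1:]:
        present &= set(ids)   # drop users missing from any frame
    return present
```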
  • the face image may be input into a pre-trained classification model to obtain a classification result output by the classification model, and the classification result indicates whether the input face image belongs to a frontal face or a side face.
  • the classification model can be trained through supervised learning based on the training sample set.
  • the training sample set may include a large number of sample face images, and each sample face image is marked with a label, and the label indicates whether the sample face image belongs to a frontal face or a side face.
  • a lightweight deep neural network, MobileNetV2, can be used to build the binary classification model.
  • MobileNetV2 is often used for classification tasks on mobile terminals such as mobile phones. After a face image is input to MobileNetV2, it outputs the classification result.
  • there are two classification results, which can be represented by 0 and 1 respectively: 0 can indicate that the input face image belongs to a side face, and 1 can indicate that it belongs to a front face.
  • alternatively, a face deflection angle range can be set in advance; if the user's face deflection angle is within this range, the user is considered to be facing the camera, that is, the user's face image in the image is a frontal face image.
  • face pose estimation may be performed based on the face image to obtain the face deflection angle of the user to whom the face image belongs. If the face deflection angle of the user to whom the face image belongs is within the preset range of face deflection angles, it is determined that the face image is a front face image, otherwise it is determined that the face image is a side face image.
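The angle-based frontal-face test reduces to a range check on the estimated face deflection (yaw) angle; the ±30° default below echoes the example range given earlier and is only illustrative:

```python
def is_frontal(yaw_degrees, limit=30.0):
    """Treat a face as frontal when its yaw deflection is within ±limit degrees."""
    return -limit <= yaw_degrees <= limit

def frontal_users(yaws_by_user, limit=30.0):
    """Keep only users whose estimated face yaw marks them as facing the camera."""
    return {uid for uid, yaw in yaws_by_user.items() if is_frontal(yaw, limit)}
```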
  • the distance from the potential users to the camera may also be obtained.
  • a distance prompt is output. The distance prompt is used to prompt potential users to move closer to the camera. A potential user who wants to perform air gesture operations can approach the camera according to the prompt, which improves the accuracy of determining the gesture recognition object and, in turn, the recognition accuracy of the gesture recognition object's air gestures.
  • when the solution of the present application is executed by a display device, outputting a distance prompt may mean that the display device displays the distance prompt.
  • when the solution of the present application is executed by a post-processing end connected to a display device, the post-processing end outputs the distance prompt by sending it to the connected display device, which then displays it.
  • the implementation process of obtaining the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the distance between the eyes of the potential user in the first image may be the distance between the eyes of the potential user in the first image including the front face image of the potential user.
  • the preset distance between the eyes of the user is a preset fixed value. Since the difference between the actual binocular distances of different users is small, an average value of the actual binocular distances of multiple users may be selected as the preset user binocular distance.
  • FIG. 3 is a schematic diagram of a ranging principle provided in an embodiment of the present application.
  • assuming that the focal length of the camera is f, the distance between the user's eyes in the image (that is, on the imaging plane) captured by the camera and containing the user's front face image is M, the preset user inter-eye distance is K, and the distance from the user to the camera is d, then by similar triangles d = f × K / M.
  • this ranging approach can be used regardless of whether the camera is a monocular camera or a binocular camera, and regardless of whether a depth sensor is integrated.
  • the distance from the user to the camera can be determined based on the principle of similar triangles. The calculation method is simple and the implementation cost is low.
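For illustration only, the similar-triangles computation described above can be sketched as follows; the focal length, the image eye distance, and the 63 mm preset inter-eye distance in the example are hypothetical values, not taken from the embodiment:

```python
def estimate_distance(focal_length_px: float,
                      eye_distance_px: float,
                      preset_eye_distance_mm: float = 63.0) -> float:
    """Estimate the user-to-camera distance by similar triangles:
    M / K = f / d  =>  d = f * K / M."""
    if eye_distance_px <= 0:
        raise ValueError("eye distance in the image must be positive")
    return focal_length_px * preset_eye_distance_mm / eye_distance_px

# Example: focal length 1000 px, eyes 42 px apart in the image,
# preset inter-eye distance 63 mm -> d = 1000 * 63 / 42 = 1500 mm
print(estimate_distance(1000.0, 42.0))  # 1500.0
```

With f and M expressed in pixels and K in millimeters, d is obtained in millimeters.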
  • the distance from the potential user to the camera may also be calculated based on a binocular ranging principle.
  • the distance from the potential user to the camera may also be obtained by measuring the depth sensor.
  • the depth sensor may be an ultrasonic radar, a millimeter-wave radar, a laser radar, or a structured light sensor, which is not limited in this embodiment of the present application. It should be understood that the depth sensor may also be other devices capable of measuring distances.
  • Step 202 Obtain areas to be identified corresponding to potential users in multiple frames of the first image respectively, where the areas to be identified corresponding to potential users include hand images of the potential users.
  • each frame of the first image has a region to be identified corresponding to a potential user.
  • the region to be identified corresponding to the potential user further includes an elbow image of the potential user.
  • the region to be identified in the image involved in the embodiment of the present application is a region of interest (ROI) in the image, that is, the region in the image that needs to be processed.
  • the face images of the potential users respectively in the multiple frames of the first images can be obtained.
  • the implementation process of step 202 may include: determining the position of the face image of the potential user in the first image, and, according to the position of the face image of the potential user in the first image, determining the area to be identified corresponding to the potential user in the first image.
  • the area to be identified corresponding to the potential user may include not only the hand image of the potential user, but also the face image of the potential user.
  • FIG. 4 is a schematic diagram of an image provided by an embodiment of the present application.
  • the image includes a human body image of user A, a human body image of user B, a human body image of user C, and a human body image of user D.
  • the human body images of user A and user B include front face images
  • the human body images of user C and user D include side face images.
  • the face imaging area A1 of user A in the image can be expanded, and the area to be recognized (area A2) corresponding to user A in the image can be obtained by cropping; similarly, the face imaging area B1 of user B in the image is expanded, and the area to be recognized (area B2) corresponding to user B in the image is cropped out.
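As an illustrative sketch of the cropping described for FIG. 4, expanding a face detection box into an area to be recognized might look like the following; the expansion factors are assumptions, since the embodiment does not specify them:

```python
def face_box_to_roi(face_box, img_w, img_h,
                    widen=2.0, up=0.5, down=3.0):
    """Expand a face box (x, y, w, h) into an area to be recognized.

    The area is widened sideways and extended downward so that it is
    likely to also contain the user's hand (and elbow) image.  The
    expansion factors here are illustrative assumptions, clipped to
    the image bounds.
    """
    x, y, w, h = face_box
    x1 = max(0, int(x - widen * w))
    y1 = max(0, int(y - up * h))
    x2 = min(img_w, int(x + w + widen * w))
    y2 = min(img_h, int(y + h + down * h))
    return x1, y1, x2, y2

# A 100x100 face at (400, 100) in a 1280x720 image:
print(face_box_to_roi((400, 100, 100, 100), 1280, 720))  # (200, 50, 700, 500)
```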
  • Step 203 Determine the hand movements of the potential user according to the regions to be identified corresponding to the potential user in the multiple frames of the first image.
  • the implementation process of step 203 may include: respectively performing key point detection on regions to be identified corresponding to the potential user in multiple frames of the first image to obtain multiple sets of hand key point information of the potential user. According to the multiple sets of hand key point information of the potential user, the hand motion of the potential user is determined.
  • performing key point detection on the areas to be identified corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information may mean: for each frame of first image, performing key point detection in the area to be identified corresponding to the potential user, so as to obtain one set of hand key point information of the potential user.
  • the set of hand key point information includes the positions of multiple hand key points and the connection relationship among the multiple hand key points.
  • Each hand key point represents a specific part of the hand.
  • FIG. 5 is a schematic diagram of distribution of key points of a hand provided in an embodiment of the present application.
  • the 21 hand key points may include: wrist (0), thumb carpometacarpal joint (1), thumb metacarpophalangeal joint (2), thumb interphalangeal joint (3), thumb fingertip (4), index finger metacarpophalangeal joint (5), index finger proximal interphalangeal joint (6), index finger distal interphalangeal joint (7), index fingertip (8), middle finger metacarpophalangeal joint (9), middle finger proximal interphalangeal joint (10), middle finger distal interphalangeal joint (11), middle fingertip (12), ring finger metacarpophalangeal joint (13), ring finger proximal interphalangeal joint (14), ring finger distal interphalangeal joint (15), ring fingertip (16), little finger metacarpophalangeal joint (17), little finger proximal interphalangeal joint (18), little finger distal interphalangeal joint (19), and little fingertip (20).
  • the above 21 hand key points may be detected, or more or less hand key points may be detected.
  • a key point detector based on a deep neural network can be used to detect key points in the area to be recognized, and the key point detector can be implemented based on heatmap technology.
  • the key point detector can perform key point detection on the area to be recognized in a bottom-up manner. Assuming that the detection target includes 21 hand key points, a heat map containing 21 channels can be generated, where each channel is a probability map (thermal distribution map) of one hand key point; each number in the probability map represents the probability that the corresponding position is that hand key point, and the closer the number is to 1, the higher the probability.
  • a vector map containing 21*2 channels is generated, where every two channels contain the position information (two-dimensional information) of one hand key point. From this, the positions of the hand key points can be obtained. Further, the key point detector connects the detected hand key points based on partial affinity fields (PAF), so that the connection relationship among the multiple hand key points can be obtained.
  • the shape change and/or displacement of the potential user's hand can be determined according to the multiple sets of hand key point information, and the potential user's hand motion can then be determined.
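A minimal sketch of the heat-map decoding described above, assuming a NumPy array of per-keypoint probability maps; the grouping of points via the PAF vector maps is omitted here, and the 0.5 threshold is an illustrative assumption:

```python
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray, threshold: float = 0.5):
    """Decode a (num_keypoints, H, W) heat map into keypoint positions.

    For each channel, the location of the maximum probability is taken
    as that hand key point; channels whose peak falls below `threshold`
    are reported as missing (None).
    """
    points = []
    for channel in heatmaps:
        idx = np.unravel_index(np.argmax(channel), channel.shape)
        conf = channel[idx]
        # (row, col) -> (x, y)
        points.append((int(idx[1]), int(idx[0])) if conf >= threshold else None)
    return points

# Two toy 4x4 channels: keypoint 0 is confident at (x=2, y=1); keypoint 1 is weak.
hm = np.zeros((2, 4, 4), dtype=np.float32)
hm[0, 1, 2] = 0.9
hm[1, 3, 3] = 0.2
print(decode_heatmaps(hm))  # [(2, 1), None]
```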
  • the region to be identified corresponding to the potential user in the first image may include a hand image and an elbow image of the potential user.
  • the implementation process of step 203 may include: performing key point detection on the areas to be identified corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand and elbow key point information of the potential user; and determining the potential user's hand movement according to the multiple sets of hand and elbow key point information.
  • Step 204 Determine a target user among one or more potential users as a gesture recognition object, and the target user's hand motion matches a preset gesture.
  • the preset gesture includes an initial part of the gesture to be recognized. For example, if a complete gesture to be recognized requires 10 frames of images to be determined, then the gesture corresponding to the first 3 of those 10 frames can be selected as the preset gesture. In this way, when the user needs to perform an air gesture operation, the gesture to be recognized can be performed directly in the shooting area of the camera, without performing another specific wake-up gesture to enable the gesture recognition function of the device; the gesture recognition object can thus be determined without the user's perception, which simplifies user operations and improves user experience.
  • the gesture to be recognized is a gesture that is preconfigured in the display device and can be converted into a control instruction.
  • the gestures to be recognized pre-configured on the conference terminal may include a page-up gesture, a page-down gesture, a page-turning gesture to the left, a page-turning gesture to the right, and a screenshot gesture.
  • the potential user closest to the camera is used as the target user.
  • At most one gesture recognition object can be determined at a time, and the gesture recognition object may change over time. The closer the user is to the camera, the higher the probability of being determined as a gesture recognition object.
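Selecting the target user when multiple potential users match the preset gesture reduces, per the text, to taking the potential user closest to the camera. A minimal sketch, assuming a hypothetical mapping from user identifier to measured camera distance:

```python
def pick_target(candidates: dict):
    """Among potential users whose hand movement matched the preset
    gesture, choose the one closest to the camera as the target user.

    `candidates` maps a user identifier to its distance to the camera
    (the data shape is an illustrative assumption).
    """
    return min(candidates, key=candidates.get) if candidates else None

print(pick_target({"user_a": 2.4, "user_b": 1.1, "user_c": 3.0}))  # user_b
```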
  • the number threshold can be set to 3 frames, that is, when more than 3 frames of images captured by the camera do not include the face image of the target user, the target user will no longer be used as the gesture recognition object, and the process of determining a gesture recognition object will be executed again.
  • the duration threshold is a preset aging time, for example 20 seconds. That is, the maximum effective duration of each determined gesture recognition object is 20 seconds; beyond 20 seconds, the determined gesture recognition object becomes invalid and the gesture recognition object needs to be re-determined, so as to meet the flexible and changing requirements for gesture recognition objects in application scenarios.
  • the condition for ending the target user's role as the gesture recognition object may also be that, after becoming the gesture recognition object, the target user does not make a correct gesture to be recognized within a certain period of time (a value less than the aging time), or it is detected that the target user's hand is put down, or that the target user's hand remains still (which can exclude the case of a user maliciously occupying the gesture recognition object role), and so on. The embodiment of the present application does not limit the condition for ending the target user's role as the gesture recognition object.
  • the condition for ending the user's role as the gesture recognition object is: after the user becomes the gesture recognition object, the number of images captured of the shooting area by the camera that do not include the user's face image exceeds the number threshold.
  • the gesture recognition object determination method provided in the embodiment of the present application can be implemented as follows: in the process of determining the gesture recognition object, if there are 3 frames of images that each include the user's front face image, and based on these 3 frames of images it is determined that the user's hand movement matches the preset gesture, then the user can be determined as the gesture recognition object, and gesture recognition will subsequently be performed on the user.
  • After the user is determined as the gesture recognition object, the device also detects in real time whether subsequently collected images include the user's front face image. If the number of images that do not include the user's front face image reaches a certain number, the user ceases to be the gesture recognition object, and the process of determining a gesture recognition object is restarted.
  • the process of determining the gesture recognition object can be executed when there is no gesture recognition object in the shooting area of the camera. That is, after the gesture recognition object is determined, the display device or the post-processing terminal connected to the display device can stop executing the process of determining a gesture recognition object until the last determined gesture recognition object becomes invalid.
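The two end conditions above (missed-frame count and aging time) can be sketched as a small tracker. The thresholds of 3 missed frames and a 20-second aging time follow the examples in the text; the class structure itself is an illustrative assumption:

```python
import time

class GestureTargetTracker:
    """Track the lifetime of the current gesture recognition object."""

    def __init__(self, miss_threshold=3, aging_seconds=20.0,
                 clock=time.monotonic):
        self.miss_threshold = miss_threshold
        self.aging_seconds = aging_seconds
        self.clock = clock
        self.target = None

    def set_target(self, user_id):
        self.target = user_id
        self.missed = 0
        self.start = self.clock()

    def update(self, face_visible: bool) -> bool:
        """Call once per captured frame; returns True while the target is valid."""
        if self.target is None:
            return False
        self.missed = 0 if face_visible else self.missed + 1
        if self.missed > self.miss_threshold or \
           self.clock() - self.start >= self.aging_seconds:
            self.target = None  # re-run gesture recognition object determination
        return self.target is not None

# Usage with a fake clock for determinism:
t = [0.0]
tracker = GestureTargetTracker(clock=lambda: t[0])
tracker.set_target("user_a")
for _ in range(4):           # 4 consecutive frames without the face
    ok = tracker.update(False)
print(ok)  # False: the missed-frame count exceeded the threshold of 3
```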
  • Step 205 Obtain the to-be-recognized area corresponding to the target user in multiple frames of second images, where the to-be-identified area corresponding to the target user includes the hand image of the target user.
  • obtaining the area to be identified corresponding to the target user in the multiple frames of second images can be understood as obtaining only the area to be identified corresponding to the target user, rather than the areas to be identified corresponding to users other than the target user, in the multiple frames of second images.
  • the multiple frames of second images are obtained by shooting the shooting area by the camera after the target user is determined as the gesture recognition object. That is, the shooting moment of the second image is behind the shooting moment of the first image in time sequence.
  • the shooting moments of the multiple frames of second images are continuous with the shooting moments of the multiple frames of first images; that is, the first N frames of images obtained by the camera shooting the shooting area are the first images, and the images captured after the N-th frame are the second images.
  • the shooting moments of the multiple frames of the second images may also be discontinuous with the shooting moments of the multiple frames of the first images.
  • first image and second image are used to distinguish the shooting timing of the image.
  • the first image refers to an image captured by the camera before the gesture recognition object is determined, and the second image refers to an image captured by the camera after the gesture recognition object is determined.
  • the face information of the target user can also be saved, so as to associate the hand movements of the target user with the face information of the target user, thereby realizing hand tracking of the target user and, in turn, gesture recognition of the target user.
  • the face information of the target user includes the position and movement trend of the face image of the target user in the multiple frames of the first images captured by the camera, or the face information of the target user includes the face features of the target user.
  • the implementation process of step 205 may include: determining the face image position of the target user in the second image according to the saved face information of the target user. According to the face image position of the target user in the second image, the region to be recognized corresponding to the target user in the second image is determined.
  • face detection may be performed on the second image to obtain a face image in the second image.
  • the face belonging to the target user in the second image can be determined by the face tracking algorithm
  • face detection is performed on the second image based on MTCNN, and the IoU values between one or more face detection frames in the second image and the face detection frame of the target user in the previous frame image are calculated respectively, so as to determine the face detection frame of the target user in the second image and then obtain the face image position of the target user in the second image.
  • alternatively, which face image in the second image belongs to the target user can be determined by calculating face similarity.
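The IoU-based face tracking described above can be sketched as follows; the 0.3 matching threshold is an illustrative assumption, as is the (x1, y1, x2, y2) box format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_target_face(prev_target_box, detections, iou_threshold=0.3):
    """Among the face boxes detected in the current frame, pick the one
    that best overlaps the target user's box from the previous frame."""
    best = max(detections, key=lambda b: iou(prev_target_box, b), default=None)
    if best is not None and iou(prev_target_box, best) >= iou_threshold:
        return best
    return None  # the target's face was not found in this frame

prev = (100, 100, 200, 200)
dets = [(105, 98, 205, 198), (400, 120, 480, 200)]
print(match_target_face(prev, dets))  # (105, 98, 205, 198)
```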
  • the implementation of determining the area to be recognized corresponding to the target user in the second image can refer to the implementation in step 202 of determining the area to be identified corresponding to the potential user in the first image based on the position of the potential user's face image in the first image, which will not be repeated in this embodiment of the present application.
  • Step 206 Perform gesture recognition on the target user according to the to-be-recognized areas corresponding to the target user in multiple frames of second images.
  • gesture recognition may be continuously performed on the target user. Performing gesture recognition on the target user may be to determine whether the hand movement of the target user matches a preset gesture to be recognized.
  • step 206 includes: inputting the to-be-recognized area corresponding to the target user in multiple frames of second images as an image sequence into the gesture recognition model, so as to obtain a gesture recognition result output by the gesture recognition model.
  • the gesture recognition result may indicate a preset gesture to be recognized, meaning that the target user performed that gesture during the shooting period of the multiple frames of second images; or the gesture recognition result may indicate that no gesture to be recognized was matched, meaning that the target user did not perform any preset gesture to be recognized during that period; or the gesture recognition result may include the confidence of each preset gesture to be recognized, and the display device or the post-processing terminal connected to the display device can take the gesture to be recognized with the highest confidence, provided that confidence exceeds a certain threshold, as the gesture performed by the target user. If no gesture to be recognized has a confidence above the threshold, the target user did not perform any preset gesture to be recognized during the shooting period of the multiple frames of second images.
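The confidence-based selection described above can be sketched as follows; the 0.6 threshold and the gesture names are illustrative assumptions:

```python
def select_gesture(confidences: dict, threshold: float = 0.6):
    """Select the gesture to be recognized from per-gesture confidences.

    Returns the highest-confidence gesture if its confidence exceeds the
    threshold, otherwise None (no preset gesture was performed).
    """
    if not confidences:
        return None
    gesture, conf = max(confidences.items(), key=lambda kv: kv[1])
    return gesture if conf > threshold else None

print(select_gesture({"page_up": 0.1, "page_down": 0.85, "screenshot": 0.3}))
# page_down
print(select_gesture({"page_up": 0.4, "page_down": 0.5}))
# None
```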
  • After performing gesture recognition on the target user, if it is determined that the target user performed a certain gesture to be recognized within the shooting period of the multiple frames of second images, the device responds with the manipulation instruction corresponding to that gesture to realize the air gesture operation, and continues to perform gesture recognition on the target user until the condition for ending the target user's role as the gesture recognition object is met. If it is determined that the target user did not perform any gesture to be recognized during the shooting period of the multiple frames of second images, gesture recognition on the target user may likewise continue until the condition for ending the target user's role as the gesture recognition object is met.
  • based on the areas to be recognized corresponding to the target user in the multiple frames of first images and in the multiple frames of second images, it can be judged whether the target user performed a gesture to be recognized that includes the preset gesture. That is, after the user performs a gesture to be recognized in the shooting area of the camera, the display device or a post-processing terminal connected to the display device can determine the user as the gesture recognition object based on that gesture, and then respond with the manipulation instruction corresponding to the gesture to be recognized.
  • the display device or the post-processing terminal connected to the display device performs gesture recognition only on the target user for a period of time, and does not perform gesture recognition on users other than the target user; that is, it locks onto and recognizes one user's gestures for a period of time, which avoids the problem that users' gestures interfere with one another and accurate gesture control cannot be achieved.
  • In the gesture recognition object determination method provided by the embodiment of the present application, a user whose face image appears in each frame of the multiple frames of images captured by the camera and whose hand movements match the preset gesture is determined as the gesture recognition object within the camera's field of view. Based on the images captured by the camera, the gesture recognition object can be determined automatically, and gesture recognition can further be performed on that object to realize air gesture operation. The method is suitable for gesture recognition in various scenarios, especially multi-user scenarios, and the implementation is simple. In addition, during gesture recognition on the gesture recognition object, gestures of users other than the gesture recognition object are not recognized, which avoids the problem that users' gestures interfere with one another and accurate gesture control cannot be achieved.
  • users inside the shooting area who are not facing the camera may also be excluded, and gesture recognition objects are determined only among users facing the camera, which can reduce the probability of misjudging the gesture recognition object.
  • the initial part of the gesture to be recognized is used as a preset gesture used to determine the gesture recognition object.
  • the gesture to be recognized can be performed directly in the shooting area of the camera, without performing another specific wake-up gesture to enable the gesture recognition function of the device; the gesture recognition object is determined without the user's perception, which simplifies user operations and improves user experience.
  • Fig. 6 is a schematic structural diagram of an apparatus for determining an object for gesture recognition provided by an embodiment of the present application. As shown in Figure 6, the device 600 includes:
  • the first determining module 601 is configured to determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area, where the potential users satisfy: each frame of the multiple frames of first images includes a face image of the potential user.
  • the second determining module 602 is configured to determine the hand movement of the potential user according to the to-be-recognized area corresponding to the potential user in the multiple frames of the first image, and the to-be-identified area corresponding to the potential user includes the hand image of the potential user.
  • the third determination module 603 is configured to determine a target user among one or more potential users as a gesture recognition object, and the hand motion of the target user matches a preset gesture.
  • the device 600 further includes: a first acquiring module 604, configured to, after a target user among the one or more potential users is determined as the gesture recognition object, acquire the area to be recognized corresponding to the target user in multiple frames of second images. The area to be recognized corresponding to the target user includes the hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object.
  • the gesture recognition module 605 is configured to perform gesture recognition on the target user according to the to-be-recognized areas corresponding to the target user in multiple frames of second images.
  • the preset gesture includes an initial part of the gesture to be recognized.
  • the gesture recognition module 605 is configured to: judge whether the target user performs a gesture to be recognized according to the regions to be recognized corresponding to the target user in multiple frames of first images and the regions to be recognized corresponding to the target user in multiple frames of second images.
  • the first obtaining module 604 is configured to: determine the face image position of the target user in the second image according to the saved face information of the target user. According to the face image position of the target user in the second image, the region to be recognized corresponding to the target user in the second image is determined.
  • the device 600 further includes: a fourth determination module 606, configured to end the target user's role as the gesture recognition object when the number of images, among those captured of the shooting area after the target user becomes the gesture recognition object, that do not include the target user's face image exceeds the number threshold, or when the duration of the target user as the gesture recognition object reaches the duration threshold.
  • the device 600 further includes: a fifth determination module 607, configured to determine the position of the face image of the potential user in the first image after one or more potential users in the shooting area are determined.
  • the sixth determination module 608 is configured to determine a region to be recognized corresponding to the potential user in the first image according to the position of the face image of the potential user in the first image.
  • the third determining module 603 is configured to: when there are multiple potential users whose hand movements match the preset gestures in the shooting area, take the potential user closest to the camera as the target user.
  • the device 600 further includes: a second acquisition module 609, configured to acquire the distance from the potential user to the camera.
  • the output module 610 is configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, and the distance prompt is used to prompt the potential user to approach the camera. If the above-mentioned device for determining a gesture recognition object is a display device, the output module 610 is specifically a display module. Alternatively, if the above-mentioned apparatus for determining a gesture recognition object is a post-processing end, the output module 610 is specifically a sending module.
  • the second obtaining module 609 is configured to: determine the distance between the potential user and the camera according to the focal length of the camera, the distance between the eyes of the potential user in the first image, and the distance between the eyes of the preset user.
  • the second determining module 602 is configured to: perform key point detection on areas to be identified corresponding to potential users in multiple frames of the first image to obtain multiple sets of hand key point information of the potential user. According to multiple sets of hand key point information of the potential user, the hand motion of the potential user is determined.
  • the region to be identified corresponding to the potential user further includes an elbow image of the potential user.
  • the second determination module 602 is configured to: perform key point detection on regions to be identified corresponding to potential users in multiple frames of the first image, and obtain multiple sets of key point information of hands and elbows of potential users. According to multiple sets of hand and elbow key point information of the potential user, the hand movement of the potential user is determined.
  • the above human face image is a front face image.
  • Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
  • the gesture recognition object determination device may be a general computing device, for example, in a conference scenario, the general computing device may be a conference terminal or a post-conference processing terminal.
  • the conference terminal can be a large screen or an electronic whiteboard.
  • the post-meeting processing end may be a server, or a server cluster composed of multiple servers, or a cloud computing platform.
  • the gesture recognition object determining device 1100 includes: a processor 1101 and a memory 1102 .
  • memory 1102 configured to store computer programs, the computer programs including program instructions
  • the processor 1101 is configured to call the computer program to implement the method steps shown in FIG. 2 in the above method embodiment.
  • the gesture recognition object determining device 1100 further includes a communication bus 1103 and a communication interface 1104 .
  • the processor 1101 includes one or more processing cores, and the processor 1101 executes various functional applications and data processing by running computer programs.
  • the memory 1102 can be used to store computer programs.
  • the memory may store an operating system and application program units required for at least one function.
  • the operating system can be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS or OS X.
  • the communication interface 1104 is used to communicate with other storage devices or network devices.
  • the communication interface of the post-conference processing terminal may be used to send a result of determining a gesture recognition object to the conference terminal.
  • Network devices can be switches or routers, etc.
  • the memory 1102 and the communication interface 1104 are respectively connected to the processor 1101 through the communication bus 1103 .
  • the embodiment of the present application also provides a computer-readable storage medium storing instructions; when the instructions are executed by a processor, the method steps shown in FIG. 2 in the above method embodiment are implemented.
  • the embodiment of the present application also provides a computer program product, including a computer program. When the computer program is executed by a processor, the method steps shown in FIG. 2 in the above method embodiment are implemented.

Abstract

The present application discloses a gesture recognition object determination method and apparatus, and belongs to the field of computer vision. First, a device determines, according to multiple first image frames obtained by a camera capturing a capture area, one or more potential users in the capture area, a potential user satisfying: each first image frame among the multiple first image frames comprises a face image of the potential user. Then, the device determines a hand action of the potential user according to an area to be recognized corresponding to the potential user in the multiple first image frames, the area to be recognized corresponding to the potential user comprising a hand image of the potential user. Finally, a target user among the one or more potential users is determined to be a gesture recognition object, the hand action of the target user matching a preset gesture. In the present application, a gesture recognition object can be automatically determined on the basis of images captured by a camera, and an air gesture operation of the gesture recognition object can further be implemented, which is applicable to gesture recognition in multiple scenarios, especially in a multi-user scenario. The implementation is simple.

Description

手势识别对象确定方法及装置Gesture recognition object determination method and device
本申请要求于2021年06月30日提交的申请号为202110736357.6、发明名称为“一种手势识别的方法、装置及系统”以及于2021年09月03日提交的申请号为202111034365.2、发明名称为“手势识别对象确定方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application requires that the application number submitted on June 30, 2021 is 202110736357.6, and the name of the invention is "a method, device and system for gesture recognition" and the application number submitted on September 3, 2021 is 202111034365.2, and the name of the invention is The priority of the Chinese patent application "Method and Device for Gesture Recognition Object Determination", the entire content of which is incorporated in this application by reference.
Technical field
The present application relates to the field of computer vision, and in particular to a method and apparatus for determining a gesture recognition object.
Background
In the field of computer vision, gesture recognition is a very important mode of human-computer interaction. Gesture recognition technology uses various sensors to model the shape and displacement of the hand (or arm) to form an information sequence, and then converts the information sequence into corresponding instructions used to control certain operations.
Since the gestures of multiple users may be recognized in a multi-user scenario, how to determine the gesture recognition object among multiple users is the key to accurate gesture control.
Summary
The present application provides a gesture recognition object determination method and apparatus.
In a first aspect, a method for determining a gesture recognition object is provided. The method can be applied to a general-purpose computing device. The method includes: determining one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, where a potential user satisfies the condition that each of the multiple frames of first images includes a face image of the potential user; determining a hand action of the potential user according to an area to be recognized corresponding to the potential user in the multiple frames of first images, where the area to be recognized corresponding to the potential user includes a hand image of the potential user; and determining a target user among the one or more potential users as the gesture recognition object, where the hand action of the target user matches a preset gesture.
In the present application, a user whose face image is present in each of multiple frames of images captured by the camera, and whose hand action matches a preset gesture, is determined as the gesture recognition object within the shooting area of the camera. The gesture recognition object can thus be determined automatically on the basis of images captured by the camera, and gesture recognition can then be performed on that object to implement air gesture operations. The method is applicable to gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement.
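The two conditions above (face present in every frame, hand action matching the preset gesture) can be sketched in Python. This is an illustrative simplification, not the application's implementation: each frame is abstracted as a dict mapping a detected user ID to that user's already-classified hand action, so the face detection and action classification stages are assumed to have run beforehand.

```python
def determine_gesture_target(frames, preset_gesture):
    """Sketch of the target-user decision.

    frames: list of dicts, one per captured frame, mapping
            user_id -> classified hand action (hypothetical abstraction).
    A user is a potential user only if detected in EVERY frame; the
    target user is a potential user whose action matches the preset gesture.
    """
    # Step 1: potential users are the intersection of user IDs over all frames.
    potential = set(frames[0]).intersection(*frames[1:])
    # Steps 2-3: among potential users, pick one whose hand action matches.
    for user in sorted(potential):
        if frames[-1][user] == preset_gesture:
            return user
    return None  # no gesture recognition object in this batch of frames
```

A user who enters or leaves the shooting area mid-sequence is dropped by the intersection in step 1, which mirrors the requirement that the face image appear in each of the multiple frames.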
Optionally, after the target user among the one or more potential users is determined as the gesture recognition object, the above method further includes: acquiring an area to be recognized corresponding to the target user in multiple frames of second images, where the area to be recognized corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and performing gesture recognition on the target user according to the area to be recognized corresponding to the target user in the multiple frames of second images. This can be understood as: only the area to be recognized corresponding to the target user is acquired from the multiple frames of second images, and gesture recognition is performed only on the target user.
In the present application, after the target user is determined as the gesture recognition object, gesture recognition is performed for a period of time only on the target user, and not on any user other than the target user. That is, recognition is locked onto one user's gestures for a period of time, which avoids the problem that gestures of different users interfere with one another and make accurate gesture control impossible.
Optionally, the preset gesture includes the initial part of a gesture to be recognized, and performing gesture recognition on the target user according to the area to be recognized corresponding to the target user in the multiple frames of second images includes: determining, according to the area to be recognized corresponding to the target user in the multiple frames of first images and the area to be recognized corresponding to the target user in the multiple frames of second images, whether the target user has performed the gesture to be recognized.
In the present application, the initial part of the gesture to be recognized is used as the preset gesture for determining the gesture recognition object. When a user wants to perform an air gesture operation, the user can directly perform the gesture to be recognized within the shooting area of the camera, without performing any other specific wake-up gesture to enable the gesture recognition function of the device. The gesture recognition object is thus determined without the user being aware of it, which simplifies user operations and improves user experience.
Optionally, acquiring the area to be recognized corresponding to the target user in the multiple frames of second images includes: determining the face image position of the target user in a second image according to saved face information of the target user; and determining the area to be recognized corresponding to the target user in the second image according to the face image position of the target user in the second image.
In the present application, after the target user is determined as the gesture recognition object, the face information of the target user can be saved, so that the hand actions of the target user can be associated with the target user through the face information. Hand tracking of the target user is thereby achieved, which in turn enables gesture recognition of the target user.
Optionally, the target user ceases to be the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera of the shooting area that do not include the target user's face image exceeds a count threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
In the present application, at most one gesture recognition object can be determined at any given time. Since the gesture recognition object may change over time, setting conditions for ending the target user's role as the gesture recognition object satisfies the need for the gesture recognition object to change flexibly in application scenarios.
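The two release conditions described above can be expressed as a single predicate. This is a minimal sketch; the parameter names and threshold values are illustrative placeholders, not values specified by the application.

```python
def should_end_target(missed_face_count, count_threshold,
                      elapsed_seconds, duration_threshold):
    """The target user stops being the gesture recognition object when
    too many captured images lack the user's face image (condition 1),
    or the role has lasted for the configured duration (condition 2)."""
    return (missed_face_count > count_threshold
            or elapsed_seconds >= duration_threshold)
```

Once the predicate is true, the device can clear the saved face information and return to scanning all potential users for the preset gesture.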
Optionally, after the one or more potential users in the shooting area are determined, the above method further includes: determining the face image position of a potential user in a first image; and determining the area to be recognized corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
Optionally, determining a target user among the one or more potential users as the gesture recognition object includes: when there are multiple potential users in the shooting area whose hand actions match the preset gesture, taking the potential user closest to the camera as the target user.
Optionally, after the one or more potential users in the shooting area are determined, the above method further includes: acquiring the distance from a potential user to the camera; and when the distance from the potential user to the camera exceeds a distance threshold, outputting a distance prompt, where the distance prompt is used to prompt the potential user to move closer to the camera.
If a potential user is far from the camera, the image of that user's body in the pictures captured by the camera will be small and may not show the details of the user's hand, which may lead to subsequent misjudgment of the user's hand action. The present application outputs a distance prompt to remind the potential user to move closer to the camera; a potential user who wants to perform an air gesture operation can then approach the camera according to the prompt. This improves the accuracy of determining the gesture recognition object, and further improves the accuracy of recognizing the air gestures of the gesture recognition object.
Optionally, acquiring the distance from the potential user to the camera includes: determining the distance from the potential user to the camera according to the focal length of the camera, the interocular distance of the potential user in the first image, and a preset user interocular distance.
For example, assume the focal length of the camera is f, the interocular distance of the user in the image containing the user's frontal face captured by the camera (i.e., in the imaging plane) is M, the preset user interocular distance is K, and the distance from the user to the camera is d. According to the principle of similar triangles, M/f = K/d, from which the distance from the user to the camera can be derived as d = (K*f)/M.
In the present application, regardless of whether the camera is a monocular camera or a binocular camera, or whether it integrates a depth sensor, the distance from the user to the camera can be determined based on the principle of similar triangles. The calculation is simple and the implementation cost is low.
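The similar-triangles relation d = (K*f)/M translates directly into code. A sketch follows; the default interocular distance of 0.063 m is a commonly assumed adult average used here for illustration, not a value from the application.

```python
def eye_distance_to_camera(focal_length_px, eye_px_distance, eye_real_m=0.063):
    """Monocular distance estimate from the relation M/f = K/d => d = K*f/M.

    focal_length_px: camera focal length f, in pixels
    eye_px_distance: interocular distance M measured in the image, in pixels
    eye_real_m:      preset real interocular distance K, in metres
                     (0.063 m is an assumed average, not specified here)
    Returns the user-to-camera distance d in metres.
    """
    return eye_real_m * focal_length_px / eye_px_distance
```

With a 1000-pixel focal length and a measured interocular distance of 63 pixels, the estimate is 1 m; halving the measured pixel distance doubles the estimated range, as the formula implies.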
Optionally, determining the hand action of the potential user according to the areas to be recognized corresponding to the potential user in the multiple frames of first images includes: performing keypoint detection on the areas to be recognized corresponding to the potential user in the multiple frames of first images respectively, to obtain multiple sets of hand keypoint information of the potential user; and determining the hand action of the potential user according to the multiple sets of hand keypoint information.
Optionally, the area to be recognized corresponding to the potential user further includes an elbow image of the potential user, and determining the hand action of the potential user according to the areas to be recognized corresponding to the potential user in the multiple frames of first images includes: performing keypoint detection on the areas to be recognized corresponding to the potential user in the multiple frames of first images respectively, to obtain multiple sets of hand and elbow keypoint information of the potential user; and determining the hand action of the potential user according to the multiple sets of hand and elbow keypoint information.
When the user's hand keypoints are too tightly clustered, or some hand keypoints are missing from the detection results, the user's hand action may be misjudged or missed. The present application uses the movement direction of the user's elbow keypoint to assist in judging the movement direction of the user's hand, which improves the accuracy of judging the user's hand action.
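The elbow-assisted direction judgment can be sketched as a simple heuristic: use the hand keypoint centroid when enough hand keypoints are available, and fall back to the elbow keypoint when they are not. This is an illustrative sketch only, not the application's exact algorithm, and it considers only horizontal movement for brevity.

```python
def movement_direction(hand_tracks, elbow_track=None):
    """Infer horizontal movement direction across frames.

    hand_tracks: list of per-frame hand keypoint lists, each a list of
                 (x, y) tuples (may be empty if detection failed)
    elbow_track: optional list of per-frame elbow (x, y) points, used as
                 a fallback when hand keypoints are missing
    """
    def centroid_x(points):
        return sum(p[0] for p in points) / len(points)

    if hand_tracks[0] and hand_tracks[-1]:
        # Normal case: compare hand centroids in the first and last frames.
        dx = centroid_x(hand_tracks[-1]) - centroid_x(hand_tracks[0])
    elif elbow_track:
        # Fallback: hand keypoints missing, use elbow displacement instead.
        dx = elbow_track[-1][0] - elbow_track[0][0]
    else:
        return "unknown"
    if dx > 0:
        return "right"
    if dx < 0:
        return "left"
    return "still"
```

A fuller implementation would also weigh the vertical component and agreement between hand and elbow motion, but the fallback structure is the point illustrated here.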
Optionally, the above face image is a frontal face image. That is, the potential user satisfies the condition that each of the multiple frames of first images includes a frontal face image of the potential user.
Since a user usually faces the camera when performing an air gesture operation, the present application can also exclude users in the shooting area who are side-on to the camera and determine potential users only among users facing the camera, which reduces the probability of misjudging the gesture recognition object.
In a second aspect, an apparatus for determining a gesture recognition object is provided. The apparatus includes multiple functional modules, and the multiple functional modules interact to implement the methods in the above first aspect and its implementations. The multiple functional modules may be implemented based on software, hardware, or a combination of software and hardware, and may be arbitrarily combined or divided based on the specific implementation.
In a third aspect, a gesture recognition object determination device is provided, including a processor and a memory;
the memory is configured to store a computer program, where the computer program includes program instructions;
the processor is configured to invoke the computer program to implement the methods in the above first aspect and its implementations.
In a fourth aspect, a computer-readable storage medium is provided, where instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the methods in the above first aspect and its implementations are implemented.
In a fifth aspect, a computer program product is provided, including a computer program, where when the computer program is executed by a processor, the methods in the above first aspect and its implementations are implemented.
In a sixth aspect, a chip is provided, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, the methods in the above first aspect and its implementations are implemented.
Description of drawings
Fig. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided by an embodiment of the present application;

Fig. 2 is a schematic flowchart of a gesture recognition object determination method provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of a ranging principle provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of an image provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of the distribution of hand keypoints provided by an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 7 is a schematic structural diagram of another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of yet another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 9 is a schematic structural diagram of still another gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a further gesture recognition object determination apparatus provided by an embodiment of the present application;

Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the implementations of the present application are further described in detail below with reference to the accompanying drawings.
With the development of computer vision technology, products in many forms, such as mobile phones, electronic screens, and virtual reality (VR) devices, are emerging one after another, and the demand for human-machine interaction is growing. Since gestures can express rich information in a contactless manner, gesture recognition is widely used in human-computer interaction, smartphones, smart TVs, and other products. In particular, vision-based gesture recognition technology does not require additional sensors to be worn on the hand as markers, which makes it convenient and gives it broad application prospects in human-computer interaction and related fields. All gestures mentioned in the present application refer to contactless gestures, that is, air gestures.
At present, there is a demand for air gesture operations on display devices in many scenarios. For example, in a conference room scenario, participants can perform air gesture operations on the display screen of a conference terminal, such as page up, page down, page left, page right, and screenshot. In a home scenario, family members can perform air gesture operations on the playback screen of a smart TV, such as fast forward, rewind, volume up, volume down, and pause. In a classroom scenario, a teacher or student can perform air gesture operations on the content displayed on a display device, such as scrolling up or scrolling down.
However, in these scenarios there are usually multiple users in front of the display device. When gesture recognition is performed based on images collected by the camera, the gestures of multiple users are easily recognized, and the display device may be unable to distinguish which user is actually performing the air gesture operation. The users' gestures interfere with one another, so the display device cannot achieve accurate gesture control.
In view of this, the present application proposes a solution for determining the gesture recognition object: face detection is performed on multiple frames of images captured by the camera to identify potential users in the shooting area, and the gesture recognition object is then determined among the potential users by judging their hand actions. Specifically, a user whose face image is present in each of the multiple frames of images captured by the camera, and whose hand action matches a preset gesture, is determined as the gesture recognition object within the shooting area of the camera. In the present application, the gesture recognition object can be determined automatically based on the images captured by the camera, and gesture recognition can then be performed on that object to implement air gesture operations. The solution is applicable to gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement. In addition, during gesture recognition of the gesture recognition object, the gestures of users other than the gesture recognition object are not recognized, which avoids the problem that gestures of different users interfere with one another and make accurate gesture control impossible.
Optionally, in consideration of users' operating habits, and since a user usually faces the camera when performing an air gesture operation, the solution of the present application can also exclude users in the shooting area who are side-on to the camera and determine potential users only among users facing the camera. Specifically, a user whose frontal face image is present in each of the multiple frames of images captured by the camera, and whose hand action matches the preset gesture, is determined as the gesture recognition object within the shooting area of the camera. This reduces the probability of misjudging the gesture recognition object. To improve the accuracy of determining the gesture recognition object, the operation manual of a display device supporting the gesture control function may also clearly state that the user needs to face the camera when performing an air gesture operation. Facing the camera, as referred to in the present application, does not mean that the face is exactly head-on to the camera; a deviation within a set range is allowed. A face exactly head-on to the camera may be one whose interocular line is parallel to the imaging plane of the camera. If the face deflection angle is defined as 0° when the face is exactly head-on to the camera, then in the present application, facing the camera means that the deflection angle of the face relative to the camera is within a certain range. That is, if the deflection angle of a user's face is within a certain range, the user is regarded as facing the camera. For example, users whose face deflection angle is within the range of -30° to 30° may be regarded as facing the camera; this angle range is only an example, and the range used to determine whether a user faces the camera can be set according to actual requirements. In the present application, the face of a user facing the camera is called a frontal face.
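The frontal-face test described above reduces to a range check on the face deflection (yaw) angle. A minimal sketch follows; the ±30° default mirrors the example range given in the text and is configurable, not a fixed value of the application.

```python
def is_facing_camera(yaw_degrees, max_abs_yaw=30.0):
    """A user is treated as facing the camera when the face deflection
    angle relative to the camera lies within the configured range
    (0 degrees = face exactly head-on to the camera)."""
    return -max_abs_yaw <= yaw_degrees <= max_abs_yaw
```

Users failing this check are excluded before the potential-user determination, since they are unlikely to be performing an air gesture operation.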
The gesture recognition object determination method provided by the embodiments of the present application can be applied to a general-purpose computing device. The general-purpose computing device may be a display device or a post-processing end connected to a display device, where the display device supports the gesture control function. The display device has a built-in camera, or the display device is connected to an external camera. The camera is used to shoot the shooting area to obtain images. The display device, or the post-processing end connected to the display device, is used to determine the gesture recognition object in the shooting area according to the images captured by the camera, and further to perform gesture recognition on the gesture recognition object to respond to air gesture operations. The deployment orientation of the camera is usually consistent with that of the display device, and the shooting area of the camera usually includes the area that the display surface of the display device faces. The post-processing end may be a server, a server cluster composed of multiple servers, a cloud computing platform, or the like.
The gesture recognition object determination method provided by the embodiments of the present application can be applied to various scenarios. In a conference room scenario, the display device may be a conference terminal such as a large screen or an electronic whiteboard. In a home or classroom scenario, the display device may be a smart TV, a projection device, a VR device, or the like.
For example, Fig. 1 is a schematic diagram of an application scenario involved in a gesture recognition object determination method provided by an embodiment of the present application. The application scenario is a conference room. As shown in Fig. 1, the application scenario includes a conference terminal with a built-in camera. The conference terminal is mounted on a wall. The shooting area of the camera includes a conference table and multiple participants. While the gesture control function of the conference terminal is enabled, the camera continuously shoots the shooting area, and the conference terminal, or a post-processing end (not shown in the figure) connected to the conference terminal, processes the images captured by the camera to determine whether a gesture recognition object exists in the shooting area.
The method flow of the embodiments of the present application is described below.
Fig. 2 is a schematic flowchart of a gesture recognition object determination method provided by an embodiment of the present application. As shown in Fig. 2, the method includes:
Step 201: Determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area.
A potential user satisfies the condition that each of the multiple frames of first images includes the face image of the potential user. That is, a user whose face image is present in each of the multiple frames of first images is taken as a potential user in the shooting area. The number of image frames used to determine potential users is preconfigured; for example, the multiple frames of first images may be 3 frames, 5 frames, or 10 frames. The embodiments of the present application do not limit the number of first image frames used here.
Optionally, face detection is performed on the multiple frames of first images respectively to obtain the face images in each frame of first image. It is then determined which face images in different first images belong to the same user. Finally, it is determined which users' face images are present in each of the multiple frames of first images, thereby obtaining the potential users in the shooting area.
For example, a multi-task cascaded convolutional network (MTCNN) can be used for face detection. MTCNN includes three cascaded networks: a proposal network (P-Net), a refinement network (R-Net), and an output network (O-Net). The process of performing face detection on an image based on MTCNN includes:
First, an image pyramid is built for the input original image. The image pyramid includes multiple images of different sizes obtained by scaling the original image. Since face images of different sizes may exist in the original image, building an image pyramid allows face images of different sizes in the original image to be detected at a uniform size, enhancing the network's robustness to face images of different sizes.
Second, the image pyramid is fed into the three cascaded networks (P-Net, R-Net, O-Net), which detect the face images in the image from coarse to fine and finally output the face detection result. P-Net regresses multiple detection boxes from the input image, maps the regressed detection boxes back to the original image, and removes some redundant boxes through a non-maximum suppression (NMS) algorithm to obtain a preliminary face detection result. R-Net further refines and filters the face detection result output by P-Net. O-Net further refines and filters the face detection result output by R-Net and outputs the final face detection result.
Optionally, the face detection result obtained with MTCNN includes, for each detected face image, face detection box information and face key point information. The face detection box information may include the coordinates of the upper-left and lower-right corners of the face detection box, the face image lying within the box. The face key point information may include the coordinates of multiple facial key points, such as the left eye, right eye, nose, left mouth corner, and right mouth corner.
After the face detection box information in the multiple frames of first images is obtained, the intersection over union (IoU) of the face detection boxes in every two adjacent frames of first images can be computed separately, and the face images belonging to the same user in two adjacent frames are then determined from the IoU values. Here the IoU value equals the ratio of the intersection area to the union area of the face detection boxes when the two adjacent frames of first images are superimposed, and ranges from 0 to 1. For example, let two adjacent frames of first images be image A and image B. When the IoU of a first face detection box in image A and a second face detection box in image B is greater than a preset threshold, it can be determined that the face images in the first and second face detection boxes belong to the same user. If the multiple frames of first images are frames captured consecutively by the camera, the preset threshold can take a larger value, for example 0.8; if they are frames captured at intervals, it can take a smaller value, for example 0.6. The embodiments of this application do not limit the specific value of the preset threshold.
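The IoU matching described above can be sketched as follows, assuming boxes in (x1, y1, x2, y2) form with upper-left and lower-right corners as in the detection result; the helper names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two face detection boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def same_user(box_a, box_b, consecutive_frames=True):
    """Use the stricter 0.8 threshold for consecutive frames, 0.6 for
    frames captured at intervals, following the example values above."""
    threshold = 0.8 if consecutive_frames else 0.6
    return iou(box_a, box_b) > threshold
```

Two face boxes from adjacent frames are attributed to the same user exactly when their IoU exceeds the chosen threshold.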
Alternatively, after the face images in the multiple frames of first images are obtained, which face images in different first images belong to the same user can also be determined by computing face similarity.
In the embodiments of this application, the same user identifier may be used to mark face images belonging to the same user across different images, and different user identifiers may be used to mark face images belonging to different users within the same image. If every frame of the multiple frames of first images includes a face image marked with the same user identifier, the user represented by that identifier is determined as a potential user. The user identifier used here only needs to distinguish different users; for example, numbers, characters, or other marks can serve as user identifiers. Because this solution does not need to identify who a user is, only to distinguish one user from another, there is no need to preconfigure the gesture recognition objects that may exist in a scene, so the solution applies flexibly to various multi-user scenarios, especially those with a changing user population, such as shared conference rooms.
Because a user performing an air gesture normally faces the camera, the embodiments of this application may also exclude users in the shooting area whose faces are turned sideways to the camera and determine potential users only among the users facing the camera, which reduces the probability of misjudging the gesture recognition object.
Optionally, in the above condition that each frame of the multiple frames of first images includes a face image of the potential user, the face image is a frontal face image; that is, the potential user satisfies: each frame of the multiple frames of first images includes a frontal face image of the potential user. In other words, a user whose frontal face image appears in every frame of the multiple frames of first images is taken as a potential user in the shooting area.
In one implementation, a face image may be input into a pre-trained classification model to obtain a classification result indicating whether the input face image is a frontal face or a profile face. The classification model can be trained by supervised learning on a training sample set. The training sample set may include a large number of sample face images, each labeled to indicate whether the sample face image is a frontal face or a profile face.
For example, a lightweight deep neural network such as MobileNetV2, which is commonly used for classification tasks on mobile devices such as phones, can be used to build the binary classification model. After a face image is input into MobileNetV2, it outputs one of two classification results, which can be represented by 0 and 1 respectively: 0 indicates that the input face image is a profile face, and 1 indicates that it is a frontal face.
In another implementation, a face deflection angle range can be set in advance. If a user's face deflection angle falls within this range, the user is regarded as facing the camera, i.e. the user's face image in the picture is a frontal face image. After a face image is acquired, face pose estimation can be performed on it to obtain the face deflection angle of the user to whom the face image belongs. If that angle lies within the preset face deflection angle range, the face image is judged to be a frontal face image; otherwise it is judged to be a profile face image.
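The angle-threshold variant reduces to a range check on the estimated pose angles. In this sketch the ±30° bounds are an illustrative assumption; the document does not fix the preset range.

```python
def is_frontal(yaw_deg, pitch_deg=0.0, yaw_limit=30.0, pitch_limit=30.0):
    """Judge a face as frontal (True) or profile (False) from its estimated
    deflection angles: frontal iff both angles lie inside the preset range."""
    return abs(yaw_deg) <= yaw_limit and abs(pitch_deg) <= pitch_limit
```

The yaw and pitch values themselves would come from a face pose estimator run on the detected face key points.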
If a potential user is far from the camera, the user's body occupies only a small part of the captured image and may not show hand details, which can lead to subsequent misjudgment of the user's hand movements. Therefore, in the embodiments of this application, after the potential users in the shooting area are determined, the distance from a potential user to the camera may also be obtained. When this distance exceeds a distance threshold, a distance prompt is output, prompting the potential user to move closer to the camera. If that potential user intends to perform an air gesture operation, moving closer to the camera as prompted improves the accuracy of determining the gesture recognition object, and further improves the accuracy of recognizing the gesture recognition object's air gestures.
When the solution of this application is executed by a display device, the display device outputs the distance prompt, for example by displaying it. When the solution is executed by a post-processing end connected to a display device, the post-processing end outputs the distance prompt, for example by sending it to the connected display device so that the display device displays it.
Optionally, obtaining the distance from a potential user to the camera includes: determining the distance according to the focal length of the camera, the distance between the potential user's eyes in the first image, and a preset user interocular distance. Here, the distance between the potential user's eyes in the first image may be measured in a first image that contains the potential user's frontal face image. The preset user interocular distance is a fixed value set in advance; because the actual interocular distances of different users differ little, the average of the actual interocular distances of multiple users can be used as the preset value.
For example, FIG. 3 is a schematic diagram of a ranging principle provided in an embodiment of this application. As shown in FIG. 3, the focal length of the camera is f, the distance between the user's eyes in the image (i.e. the imaging plane) captured by the camera containing the user's frontal face image is M, and the preset user interocular distance is K. Let the distance from the user to the camera be d. By the principle of similar triangles, M/f = K/d, from which the distance from the user to the camera is derived as d = (K*f)/M.
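The relation d = (K*f)/M can be computed directly, as in this sketch. The 63 mm default is an assumed average interpupillary distance for illustration; units must be consistent (focal length and image eye distance in pixels yield d in the unit of K).

```python
def user_to_camera_distance(focal_px, eye_dist_px, preset_eye_dist_mm=63.0):
    """Similar-triangles ranging: M/f = K/d  =>  d = K * f / M.
    focal_px: camera focal length f in pixels.
    eye_dist_px: interocular distance M measured in the image, in pixels.
    preset_eye_dist_mm: preset user interocular distance K (assumed value)."""
    if eye_dist_px <= 0:
        raise ValueError("eye distance in image must be positive")
    return preset_eye_dist_mm * focal_px / eye_dist_px
```

As the user moves away, M shrinks and the estimated d grows, so comparing d against the distance threshold decides whether to output the distance prompt.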
In the embodiments of this application, regardless of whether the camera is monocular or binocular, or whether a depth sensor is integrated, the distance from the user to the camera can be determined from the similar-triangles principle alone; the calculation is simple and the implementation cost is low.
Optionally, when the camera is a binocular camera, the distance from a potential user to the camera can also be calculated based on the binocular ranging principle. Alternatively, when the camera integrates a depth sensor, the distance can be obtained by measurement with the depth sensor. The depth sensor may be an ultrasonic radar, a millimeter-wave radar, a lidar, or a structured-light sensor, which is not limited in the embodiments of this application. It should be understood that the depth sensor may also be any other device capable of measuring distance.
Step 202: separately obtain the regions to be recognized corresponding to a potential user in the multiple frames of first images, where the region to be recognized corresponding to the potential user includes the potential user's hand image.
Optionally, each frame of the first images contains a region to be recognized corresponding to the potential user. Optionally, the region to be recognized corresponding to the potential user further includes the potential user's elbow image. In the embodiments of this application, the region to be recognized in an image is the region of interest (ROI), i.e. the region of the image that needs to be processed.
In the process of determining the potential users in the shooting area in step 201, the face images of each potential user in the multiple frames of first images are obtained. Accordingly, step 202 may be implemented by: determining the position of the potential user's face image in a first image, and determining the region to be recognized corresponding to the potential user in that first image according to that position. Besides the potential user's hand image, the region to be recognized corresponding to the potential user may also include the potential user's face image.
Optionally, the potential user's face imaging area in the first image can be expanded and cropped to obtain a region to be recognized containing both the hand image and the face image. For example, FIG. 4 is a schematic diagram of an image provided in an embodiment of this application. As shown in FIG. 4, the image includes the body images of user A, user B, user C, and user D; the body images of users A and B include frontal face images, while those of users C and D include profile face images. Assuming users A and B are potential users in the shooting area, user A's face imaging area A1 in the image can be expanded and cropped to obtain the region to be recognized corresponding to user A in the image (area A2), and user B's face imaging area B1 can be expanded and cropped to obtain the region to be recognized corresponding to user B (area B2).
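The expansion from a face box to a hand-covering ROI (as from A1 to A2 in FIG. 4) can be sketched as a box enlargement clipped to the image bounds. The expansion factors here are illustrative assumptions; the document does not specify them.

```python
def expand_face_box(face_box, img_w, img_h, w_factor=3.0, h_factor=4.0):
    """Expand a face detection box (x1, y1, x2, y2) into a region to be
    recognized: widen around the face and extend downward so the crop can
    also contain the user's hand, clipping to the image boundaries."""
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2
    new_w, new_h = w * w_factor, h * h_factor
    rx1 = max(0, int(cx - new_w / 2))
    rx2 = min(img_w, int(cx + new_w / 2))
    ry1 = max(0, int(y1))              # keep the face at the top of the ROI
    ry2 = min(img_h, int(y1 + new_h))  # extend toward where hands tend to be
    return (rx1, ry1, rx2, ry2)
```

The returned rectangle is then cropped from the first image as the region to be recognized for that potential user.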
Step 203: determine the potential user's hand motion according to the regions to be recognized corresponding to the potential user in the multiple frames of first images.
Optionally, step 203 may be implemented by: performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information of the potential user, and determining the potential user's hand motion according to those sets of hand key point information.
Here, performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand key point information may mean performing key point detection on the region to be recognized corresponding to the potential user in each frame of first image to obtain one set of hand key point information of the potential user per frame.
Optionally, a set of hand key point information includes the positions of multiple hand key points and the connection relationships among them. Each hand key point represents a specific part of the hand. For example, FIG. 5 is a schematic diagram of the distribution of hand key points provided in an embodiment of this application. As shown in FIG. 5, the hand may include 21 key points: wrist (0), carpometacarpal joint (1), thumb metacarpophalangeal joint (2), thumb interphalangeal joint (3), thumb tip (4), index finger metacarpophalangeal joint (5), index finger proximal interphalangeal joint (6), index finger distal interphalangeal joint (7), index finger tip (8), middle finger metacarpophalangeal joint (9), middle finger proximal interphalangeal joint (10), middle finger distal interphalangeal joint (11), middle finger tip (12), ring finger metacarpophalangeal joint (13), ring finger proximal interphalangeal joint (14), ring finger distal interphalangeal joint (15), ring finger tip (16), little finger metacarpophalangeal joint (17), little finger proximal interphalangeal joint (18), little finger distal interphalangeal joint (19), and little finger tip (20).
In the embodiments of this application, when key point detection is performed on a region to be recognized containing the potential user's hand image, the above 21 hand key points may be detected, or more or fewer hand key points may be detected.
For example, a key point detector based on a deep neural network can be used to perform key point detection on the region to be recognized; the detector can be implemented with heatmap techniques and can detect key points in a bottom-up manner. Assuming the detection target comprises 21 hand key points, a heatmap with 21 channels can be generated, each channel being the probability map (heat distribution map) of one hand key point. A number in the probability map represents the probability that the corresponding location is the hand key point; the closer the number is to 1, the higher that probability. A vector map with 21*2 channels is generated at the same time, with every 2 channels containing the position information (two-dimensional) of one hand key point, from which the positions of the hand key points are obtained. Further, the key point detector connects the detected hand key points based on part affinity fields (PAF), from which the connection relationships among the multiple hand key points are obtained.
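Decoding key point positions from such a per-channel heatmap amounts to taking, for each channel, the cell with the highest probability. This minimal sketch uses plain nested lists in place of the detector's tensor output; the function name is illustrative.

```python
def decode_heatmap(heatmap):
    """heatmap: [channels][rows][cols] of probabilities, one channel per
    hand key point. Returns one (x, y, confidence) triple per channel,
    taking each channel's argmax cell as that key point's position."""
    keypoints = []
    for channel in heatmap:
        best = (0, 0, channel[0][0])
        for y, row in enumerate(channel):
            for x, p in enumerate(row):
                if p > best[2]:
                    best = (x, y, p)
        keypoints.append(best)
    return keypoints
```

A real detector would decode at the heatmap's lower resolution and rescale the coordinates back to the input image, and would then link the points via the PAF step described above.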
Optionally, after the multiple sets of hand key point information of a potential user are obtained, the shape change and/or displacement of the potential user's hand can be determined from those sets of hand key point information, and the potential user's hand motion determined accordingly.
When a user's hand key points are too closely packed, or some hand key points are missing from the detection result, the user's hand motion may be misjudged or missed. The movement direction of the user's elbow key point can therefore be combined to assist in judging the movement direction of the user's hand, and hence the user's hand motion. Optionally, the region to be recognized corresponding to a potential user in a first image may include both the potential user's hand image and elbow image. Step 203 may then be implemented by: performing key point detection on the regions to be recognized corresponding to the potential user in the multiple frames of first images to obtain multiple sets of hand and elbow key point information of the potential user, and determining the potential user's hand motion according to those sets of hand and elbow key point information.
Step 204: determine a target user among the one or more potential users as the gesture recognition object, where the target user's hand motion matches a preset gesture.
Optionally, the preset gesture comprises the initial part of a gesture to be recognized. For example, if a complete gesture to be recognized requires 10 frames of images to identify, the gesture corresponding to its first 3 frames can be selected as the preset gesture. In this way, when a user wants to perform an air gesture operation, the user can simply perform the gesture to be recognized within the camera's shooting area, without executing some other specific wake-up gesture to enable the device's gesture recognition function. The gesture recognition object is determined without the user perceiving it, which simplifies user operation and improves the user experience.
Here, a gesture to be recognized is a gesture preconfigured in the display device that can be converted into a control instruction. For example, in a conference scenario, the gestures to be recognized preconfigured on a conference terminal may include a page-up gesture, a page-down gesture, a page-left gesture, a page-right gesture, and a screenshot gesture. When it is determined that a potential user's hand moves from left to right, it can be judged that the potential user's hand motion matches the initial part of the page-left gesture, and the potential user can then be determined as the gesture recognition object.
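Matching a hand motion against the initial part of the configured gestures can be sketched as a check on the dominant displacement of a tracked hand key point over the first few frames. The gesture names, the wrist as reference point, and the 20-pixel minimum shift are illustrative assumptions, not values fixed by the document.

```python
def match_gesture_start(wrist_positions, min_shift=20):
    """wrist_positions: list of (x, y) wrist coordinates from the first frames.
    Returns the name of the gesture whose initial part matches, or None.
    Image coordinates are assumed, with y growing downward."""
    if len(wrist_positions) < 2:
        return None
    dx = wrist_positions[-1][0] - wrist_positions[0][0]
    dy = wrist_positions[-1][1] - wrist_positions[0][1]
    if abs(dx) >= abs(dy):                 # dominant horizontal motion
        if dx >= min_shift:
            return "page_left"             # left-to-right hand movement
        if dx <= -min_shift:
            return "page_right"
    else:                                  # dominant vertical motion
        if dy <= -min_shift:
            return "page_up"
        if dy >= min_shift:
            return "page_down"
    return None                            # motion too small: no match
```

A potential user whose trajectory yields a non-None match is a candidate gesture recognition object; among several matching candidates, the one closest to the camera would be chosen, as described below.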
Optionally, when there are multiple potential users in the camera's shooting area whose hand motions match the preset gesture, the potential user closest to the camera is taken as the target user.
In the embodiments of this application, at most one gesture recognition object can be determined at any moment, and the gesture recognition object may change over time. The closer a user is to the camera, the higher the probability of being determined as the gesture recognition object.
Optionally, the target user stops being the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured of the shooting area that do not include the target user's face image exceeds a count threshold, or the duration for which the target user has been the gesture recognition object reaches a duration threshold. For example, the count threshold may be 3 frames: when more than 3 images captured by the camera do not include the target user's face image, the target user is no longer treated as the gesture recognition object and the gesture recognition object determination procedure is restarted. The duration threshold is a preset aging time, for example 20 seconds: each determined gesture recognition object is valid for at most 20 seconds, after which the determined gesture recognition object becomes invalid and must be determined anew, meeting the need for flexible, changing gesture recognition objects in application scenarios.
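The two release conditions just described can be sketched as a small per-target state object. The class name is illustrative; the concrete values 3 frames and 20 seconds follow the examples in the text.

```python
class GestureTarget:
    """Tracks whether a target user may remain the gesture recognition object."""

    def __init__(self, user_id, start_time, max_lost_frames=3, max_hold_seconds=20.0):
        self.user_id = user_id
        self.start_time = start_time          # when the user became the target
        self.lost_frames = 0                  # consecutive frames without the face
        self.max_lost_frames = max_lost_frames
        self.max_hold_seconds = max_hold_seconds

    def update(self, face_present, now):
        """Call once per captured image; returns True while the user remains
        the gesture recognition object, False once either condition triggers."""
        self.lost_frames = 0 if face_present else self.lost_frames + 1
        if self.lost_frames > self.max_lost_frames:
            return False                      # face missing from too many images
        if now - self.start_time >= self.max_hold_seconds:
            return False                      # aging time reached
        return True
```

Once `update` returns False, the determination procedure of steps 201-204 would be restarted to select a new gesture recognition object.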
Optionally, the condition for ending the target user's role as gesture recognition object may also be that the target user fails to make a correct gesture to be recognized within a certain period (shorter than the aging time) after becoming the gesture recognition object, or that the target user's hand is detected to have been lowered, or that the target user's hand remains motionless (which excludes the case of a user maliciously occupying the gesture recognition object role), and so on. The embodiments of this application do not limit the ending condition under which the target user stops being the gesture recognition object.
For example, assume that 3 frames of images are used to determine a potential user, that the face image in the determination condition is a frontal face image, and that the condition for ending a user's role as gesture recognition object is that the number of images captured of the shooting area after the user becomes the gesture recognition object that do not include the user's face image exceeds the count threshold. The gesture recognition object determination method provided by the embodiments of this application can then be implemented as follows. In the process of determining the gesture recognition object, if there are 3 frames of images that each include a user's frontal face image, and the user's hand motion determined from those 3 frames matches the preset gesture, that user can be determined as the gesture recognition object, and gesture recognition is then performed on that user. Meanwhile, after the user is determined as the gesture recognition object, whether subsequently captured images include the user's frontal face image is detected in real time; if the number of images that do not include the user's frontal face image reaches a certain count, the user's role as gesture recognition object ends and the gesture recognition object determination procedure is restarted. The gesture recognition object determination procedure may be executed when no gesture recognition object exists in the camera's shooting area; that is, after a gesture recognition object is determined, the display device or the post-processing end connected to the display device can stop executing the procedure until the last determined gesture recognition object becomes invalid.
Step 205: obtain the regions to be recognized corresponding to the target user in multiple frames of second images, where the region to be recognized corresponding to the target user includes the target user's hand image.
Here, obtaining the regions to be recognized corresponding to the target user in the multiple frames of second images can be understood as obtaining only the target user's regions to be recognized, and no longer obtaining the regions to be recognized corresponding to users other than the target user. The multiple frames of second images are captured of the shooting area by the camera after the target user is determined as the gesture recognition object; that is, the capture time of a second image follows that of a first image in sequence. For example, in one case, the capture times of the multiple frames of second images are continuous with those of the multiple frames of first images: the first N frames captured of the shooting area are first images, and the images captured after those N frames are second images. In another case, the capture times of the second images may be discontinuous with those of the first images. In this application, "first image" and "second image" distinguish capture timing: a first image is an image captured by the camera before the gesture recognition object is determined, and a second image is an image captured after.
Optionally, after the target user is determined as the gesture recognition object in step 204, the target user's face information can also be saved, so that the target user's hand motions can be associated with the target user through the face information, enabling hand tracking of the target user and hence gesture recognition of the target user. The target user's face information includes the positions and movement trend of the target user's face image in the multiple frames of first images captured by the camera, or alternatively includes the target user's facial features. Step 205 may be implemented by: determining the position of the target user's face image in a second image according to the saved face information of the target user, and determining the region to be recognized corresponding to the target user in the second image according to that position.
Optionally, face detection may be performed on the second image to obtain the face images in the second image. When the saved face information of the target user is the position and motion trend of the target user's face image in the multiple frames of first images captured by the camera, the face image belonging to the target user in the second image can be determined by a face tracking algorithm. For example, face detection may be performed on the second image based on MTCNN, and the IoU values between one or more face detection boxes in the second image and the target user's face detection box in the previous frame may be calculated to determine the target user's face detection box in the second image, and thus the target user's face image position in the second image. Alternatively, when the saved face information of the target user is the target user's face features, after one or more face images are obtained from the second image, which face image belongs to the target user can be determined by computing face similarity.
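The IoU-based association between frames can be sketched as follows. This is a minimal illustration rather than the application's implementation: the `(x1, y1, x2, y2)` box format and the 0.3 matching threshold are assumptions introduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def track_target_face(prev_target_box, detected_boxes, iou_threshold=0.3):
    """Among the face boxes detected in the current frame, pick the one
    that best overlaps the target user's box from the previous frame;
    return None when no detection overlaps enough (track lost)."""
    best = max(detected_boxes, key=lambda b: iou(prev_target_box, b), default=None)
    if best is None or iou(prev_target_box, best) < iou_threshold:
        return None
    return best
```

A lost track (return value `None`) would correspond to the case where the target user's face cannot be found in the current frame.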
Optionally, for the manner of determining the to-be-recognized region corresponding to the target user in the second image according to the target user's face image position in the second image, reference may be made to the implementation in step 202 of determining the to-be-recognized region corresponding to a potential user in the first image according to the potential user's face image position in the first image; details are not repeated here.
Step 206: Perform gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
While the target user serves as the gesture recognition object, gesture recognition may be performed on the target user continuously. Performing gesture recognition on the target user may mean judging whether the target user's hand movements match a preset to-be-recognized gesture.
Optionally, step 206 may be implemented by inputting the to-be-recognized regions corresponding to the target user in the multiple frames of second images into a gesture recognition model as an image sequence, to obtain the gesture recognition result output by the model. The gesture recognition result may indicate a particular preset to-be-recognized gesture, meaning that the target user performed that gesture during the capture period of the second images; or it may indicate that no to-be-recognized gesture matched, meaning that the target user performed no preset to-be-recognized gesture during that period; or it may include a confidence value for each preset to-be-recognized gesture, in which case the display device, or a post-processing end connected to the display device, may take the to-be-recognized gesture with the highest confidence, provided that confidence exceeds a certain threshold, as the gesture performed by the target user. If no gesture's confidence exceeds the threshold, the target user performed no preset to-be-recognized gesture during the capture period of the second images.
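The confidence-based variant of interpreting the model output can be illustrated with a small helper. The gesture names and the 0.8 threshold below are hypothetical; the application only requires that the highest-confidence gesture be taken when its confidence exceeds some threshold.

```python
def select_gesture(confidences, threshold=0.8):
    """Given a mapping from each preset to-be-recognized gesture to the
    confidence reported by the gesture recognition model, return the
    highest-confidence gesture if it exceeds the threshold; otherwise
    return None, meaning no preset gesture was performed."""
    if not confidences:
        return None
    gesture = max(confidences, key=confidences.get)
    return gesture if confidences[gesture] > threshold else None
```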
Further, after gesture recognition is performed on the target user, if it is determined that the target user performed a to-be-recognized gesture during the capture period of the multiple frames of second images, the control instruction corresponding to that gesture is executed, realizing an air-gesture operation, and gesture recognition of the target user continues until the condition for ending the target user's role as the gesture recognition object is met. If it is determined that the target user performed no to-be-recognized gesture during that period, gesture recognition of the target user may likewise continue until that condition is met.
Optionally, if the preset gesture used to determine the gesture recognition object is the initial part of a to-be-recognized gesture, then whether the target user performed the to-be-recognized gesture containing that preset gesture can be judged from the to-be-recognized regions corresponding to the target user in the multiple frames of first images together with those in the multiple frames of second images. That is, after a user performs a to-be-recognized gesture in the camera's shooting area, the display device, or the post-processing end connected to the display device, can determine that user as the gesture recognition object based on the gesture, and then execute the control instruction corresponding to it.
In this embodiment of the application, while the target user serves as the gesture recognition object, the display device or the post-processing end connected to the display device performs gesture recognition only on the target user for a period of time, and not on any other user. Locking recognition to one user's gestures for a period of time avoids the problem of users' gestures interfering with one another and making accurate gesture control impossible.
The order of the steps of the gesture recognition object determination method provided in the embodiments of this application can be adjusted appropriately, and steps can be added or removed as the situation requires. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in this application falls within the protection scope of this application and is not described further here.
To sum up, in the gesture recognition object determination method provided by the embodiments of this application, a user whose face image appears in every frame of the multiple frames of images captured by the camera, and whose hand movements match the preset gesture, is determined as the gesture recognition object within the camera's shooting area. The gesture recognition object can thus be determined automatically from the images captured by the camera, and gesture recognition can then be performed on that object to realize air-gesture operation. The method is suitable for gesture recognition in many scenarios, especially multi-user scenarios, and is simple to implement. In addition, while gesture recognition is performed on the gesture recognition object, the gestures of other users are not recognized, which avoids gestures from different users interfering with one another and preventing accurate gesture control. Optionally, users in the shooting area who are turned sideways to the camera can be excluded, so that the gesture recognition object is determined only among users facing the camera, reducing the probability of misjudging the gesture recognition object. Optionally, the initial part of a to-be-recognized gesture serves as the preset gesture used to determine the gesture recognition object: when a user wants to perform an air-gesture operation, the to-be-recognized gesture can be performed directly in the camera's shooting area, with no other specific wake-up gesture needed to enable the device's gesture recognition function. The gesture recognition object is thus determined without the user being aware of it, simplifying user operation and improving the user experience.
Fig. 6 is a schematic structural diagram of a gesture recognition object determination apparatus provided by an embodiment of this application. As shown in Fig. 6, the apparatus 600 includes:
A first determining module 601, configured to determine one or more potential users in the shooting area according to multiple frames of first images obtained by the camera shooting the shooting area, where a potential user satisfies: every frame of the multiple frames of first images includes the potential user's face image.
A second determining module 602, configured to determine the potential user's hand movements according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, where the to-be-recognized region corresponding to the potential user includes the potential user's hand image.
A third determining module 603, configured to determine a target user among the one or more potential users as the gesture recognition object, where the target user's hand movements match the preset gesture.
Optionally, as shown in Fig. 7, the apparatus 600 further includes: a first obtaining module 604, configured to, after the target user among the one or more potential users is determined as the gesture recognition object, obtain the to-be-recognized regions corresponding to the target user in multiple frames of second images, where the to-be-recognized region corresponding to the target user includes the target user's hand image, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and a gesture recognition module 605, configured to perform gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
Optionally, the preset gesture includes the initial part of a to-be-recognized gesture, and the gesture recognition module 605 is configured to judge whether the target user performed the to-be-recognized gesture according to the to-be-recognized regions corresponding to the target user in the multiple frames of first images and in the multiple frames of second images.
Optionally, the first obtaining module 604 is configured to: determine the face image position of the target user in a second image according to the saved face information of the target user; and determine the to-be-recognized region corresponding to the target user in the second image according to that position.
Optionally, as shown in Fig. 8, the apparatus 600 further includes: a fourth determining module 606, configured to end the target user's role as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera from the shooting area that do not include the target user's face image exceeds a number threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
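The two release conditions just described (a face-missing image count and a lock duration) can be sketched as a small state holder. The threshold values and the cumulative counting of frames without the face are assumptions made for illustration; the application does not fix these details.

```python
import time

class GestureLock:
    """Decides when to stop treating the target user as the gesture
    recognition object. Threshold values here are illustrative only."""

    def __init__(self, max_missing_frames=30, max_duration_s=60.0):
        self.max_missing_frames = max_missing_frames
        self.max_duration_s = max_duration_s
        self.missing = 0                  # images captured without the target's face
        self.start = time.monotonic()     # moment the target user was locked

    def update(self, face_visible):
        """Call once per image captured after locking; returns True while the
        target user should remain the gesture recognition object."""
        if not face_visible:
            self.missing += 1
        if self.missing > self.max_missing_frames:
            return False                  # face absent in too many images: release
        if time.monotonic() - self.start > self.max_duration_s:
            return False                  # locked for too long: release
        return True
```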
Optionally, as shown in Fig. 9, the apparatus 600 further includes: a fifth determining module 607, configured to determine the face image position of a potential user in the first image after the one or more potential users in the shooting area are determined; and a sixth determining module 608, configured to determine the to-be-recognized region corresponding to the potential user in the first image according to that position.
Optionally, the third determining module 603 is configured to: when there are multiple potential users in the shooting area whose hand movements match the preset gesture, take the potential user closest to the camera as the target user.
Optionally, as shown in Fig. 10, the apparatus 600 further includes: a second obtaining module 609, configured to obtain the distance from a potential user to the camera; and an output module 610, configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, where the distance prompt is used to prompt the potential user to move closer to the camera. If the gesture recognition object determination apparatus is a display device, the output module 610 is specifically a display module; alternatively, if the apparatus is a post-processing end, the output module 610 is specifically a sending module.
Optionally, the second obtaining module 609 is configured to determine the distance from the potential user to the camera according to the camera's focal length, the potential user's interocular distance in the first image, and a preset user interocular distance.
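This distance estimate follows from similar triangles in a pinhole camera model: distance = focal_length × real interocular distance ÷ interocular distance in the image. A sketch, assuming the focal length is expressed in pixels and using a hypothetical 63 mm preset interpupillary distance:

```python
def estimate_distance(focal_length_px, eye_distance_px, real_eye_distance_m=0.063):
    """Pinhole-camera estimate of the user-to-camera distance in metres
    from the interocular spacing measured in the image (in pixels).
    0.063 m is an assumed average adult interpupillary distance."""
    if eye_distance_px <= 0:
        raise ValueError("interocular distance in the image must be positive")
    return focal_length_px * real_eye_distance_m / eye_distance_px
```

With a 1000-pixel focal length, a user whose eyes appear 63 pixels apart would be estimated at about 1 m from the camera.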
Optionally, the second determining module 602 is configured to: perform key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and determine the potential user's hand movements according to the multiple sets of hand key point information.
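How hand movements might be derived from per-frame key point sets can be illustrated with a deliberately simple rule; a real system would typically feed the key point sequence to a learned classifier. The wrist-first key point layout, the gesture labels, and the swipe threshold are all assumptions introduced here.

```python
def hand_action_from_keypoints(keypoint_sets, min_swipe_px=40):
    """Classify a coarse hand action from per-frame hand key point sets.
    Each element of keypoint_sets is a list of (x, y) key points for one
    frame; the wrist is assumed to be the first key point. If the wrist
    moves far enough horizontally over the sequence, report a swipe;
    otherwise report a static hand."""
    wrist_xs = [points[0][0] for points in keypoint_sets]
    dx = wrist_xs[-1] - wrist_xs[0]
    if dx > min_swipe_px:
        return "swipe_right"
    if dx < -min_swipe_px:
        return "swipe_left"
    return "static"
```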
Optionally, the to-be-recognized region corresponding to the potential user further includes the potential user's elbow image. The second determining module 602 is configured to: perform key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and determine the potential user's hand movements according to the multiple sets of hand and elbow key point information.
Optionally, the above face image is a frontal face image.
Regarding the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 11 is a block diagram of a gesture recognition object determination device provided by an embodiment of this application. The device may be a general-purpose computing device; for example, in a conference scenario, it may be a conference terminal or a post-conference processing end. Optionally, the conference terminal may be a large screen or an electronic whiteboard, and the post-conference processing end may be a server, a server cluster composed of multiple servers, or a cloud computing platform. As shown in Fig. 11, the gesture recognition object determination device 1100 includes a processor 1101 and a memory 1102.
The memory 1102 is configured to store a computer program, the computer program including program instructions.
The processor 1101 is configured to invoke the computer program to implement the method steps shown in Fig. 2 of the foregoing method embodiment.
Optionally, the gesture recognition object determination device 1100 further includes a communication bus 1103 and a communication interface 1104.
The processor 1101 includes one or more processing cores, and executes various functional applications and performs data processing by running the computer program.
The memory 1102 may be used to store the computer program. Optionally, the memory may store an operating system and the application program units required for at least one function. The operating system may be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, or OS X.
There may be multiple communication interfaces 1104, used to communicate with other storage devices or network devices. For example, in this embodiment of the application, when the gesture recognition object determination device is a post-conference processing end, the communication interface of the post-conference processing end may be used to send gesture recognition object determination results and the like to the conference terminal. A network device may be a switch, a router, or the like.
The memory 1102 and the communication interface 1104 are each connected to the processor 1101 through the communication bus 1103.
An embodiment of this application further provides a computer-readable storage medium storing instructions which, when executed by a processor, implement the method steps shown in Fig. 2 of the foregoing method embodiment.
An embodiment of this application further provides a computer program product including a computer program which, when executed by a processor, implements the method steps shown in Fig. 2 of the foregoing method embodiment.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the embodiments of this application, the terms "first", "second", and "third" are used for description purposes only and are not to be understood as indicating or implying relative importance.
In this application, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" in this document generally indicates an "or" relationship between the preceding and following objects.
The above are only optional embodiments of this application and are not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the concept and principles of this application shall fall within its protection scope.

Claims (27)

1. A gesture recognition object determination method, characterized in that the method comprises:
    determining one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, wherein a potential user satisfies: every frame of the multiple frames of first images includes a face image of the potential user;
    determining hand movements of the potential user according to to-be-recognized regions corresponding to the potential user in the multiple frames of first images, wherein the to-be-recognized region corresponding to the potential user includes a hand image of the potential user; and
    determining a target user among the one or more potential users as a gesture recognition object, wherein hand movements of the target user match a preset gesture.
2. The method according to claim 1, characterized in that, after the target user among the one or more potential users is determined as the gesture recognition object, the method further comprises:
    obtaining to-be-recognized regions corresponding to the target user in multiple frames of second images, wherein the to-be-recognized region corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and
    performing gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
3. The method according to claim 2, characterized in that the preset gesture comprises an initial part of a to-be-recognized gesture, and the performing gesture recognition on the target user according to the to-be-recognized regions corresponding to the target user in the multiple frames of second images comprises:
    judging whether the target user has performed the to-be-recognized gesture according to the to-be-recognized regions corresponding to the target user in the multiple frames of first images and the to-be-recognized regions corresponding to the target user in the multiple frames of second images.
4. The method according to claim 2 or 3, characterized in that the obtaining to-be-recognized regions corresponding to the target user in multiple frames of second images comprises:
    determining a face image position of the target user in a second image according to saved face information of the target user; and
    determining the to-be-recognized region corresponding to the target user in the second image according to the face image position of the target user in the second image.
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    ending the target user's role as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images captured by the camera from the shooting area that do not include a face image of the target user exceeds a number threshold, or when the duration for which the target user has been the gesture recognition object reaches a duration threshold.
6. The method according to any one of claims 1 to 5, characterized in that, after the one or more potential users in the shooting area are determined, the method further comprises:
    determining a face image position of the potential user in the first image; and
    determining the to-be-recognized region corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
7. The method according to any one of claims 1 to 6, characterized in that the determining a target user among the one or more potential users as a gesture recognition object comprises:
    when there are multiple potential users in the shooting area whose hand movements match the preset gesture, taking the potential user closest to the camera as the target user.
8. The method according to any one of claims 1 to 7, characterized in that, after the one or more potential users in the shooting area are determined, the method further comprises:
    obtaining a distance from the potential user to the camera; and
    when the distance from the potential user to the camera exceeds a distance threshold, outputting a distance prompt, wherein the distance prompt is used to prompt the potential user to move closer to the camera.
9. The method according to claim 8, characterized in that the obtaining a distance from the potential user to the camera comprises:
    determining the distance from the potential user to the camera according to a focal length of the camera, an interocular distance of the potential user in the first image, and a preset user interocular distance.
10. The method according to any one of claims 1 to 9, characterized in that the determining hand movements of the potential user according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images comprises:
    performing key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and
    determining the hand movements of the potential user according to the multiple sets of hand key point information of the potential user.
11. The method according to any one of claims 1 to 9, characterized in that the to-be-recognized region corresponding to the potential user further includes an elbow image of the potential user, and the determining hand movements of the potential user according to the to-be-recognized regions corresponding to the potential user in the multiple frames of first images comprises:
    performing key point detection separately on the to-be-recognized regions corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and
    determining the hand movements of the potential user according to the multiple sets of hand and elbow key point information of the potential user.
12. The method according to any one of claims 1 to 11, characterized in that the face image is a frontal face image.
13. A gesture recognition object determination apparatus, characterized in that the apparatus comprises:
    a first determining module, configured to determine one or more potential users in a shooting area according to multiple frames of first images obtained by a camera shooting the shooting area, wherein a potential user satisfies: every frame of the multiple frames of first images includes a face image of the potential user;
    a second determining module, configured to determine hand movements of the potential user according to to-be-recognized regions corresponding to the potential user in the multiple frames of first images, wherein the to-be-recognized region corresponding to the potential user includes a hand image of the potential user; and
    a third determining module, configured to determine a target user among the one or more potential users as a gesture recognition object, wherein hand movements of the target user match a preset gesture.
  14. The apparatus according to claim 13, wherein the apparatus further comprises:
    a first acquisition module, configured to, after the target user among the one or more potential users is determined as the gesture recognition object, acquire regions to be recognized corresponding to the target user in multiple frames of second images, wherein the region to be recognized corresponding to the target user includes a hand image of the target user, and the multiple frames of second images are obtained by the camera shooting the shooting area after the target user is determined as the gesture recognition object; and
    a gesture recognition module, configured to perform gesture recognition on the target user according to the regions to be recognized corresponding to the target user in the multiple frames of second images.
  15. The apparatus according to claim 14, wherein the preset gesture comprises an initial part of a gesture to be recognized, and the gesture recognition module is configured to:
    determine, according to the regions to be recognized corresponding to the target user in the multiple frames of first images and the regions to be recognized corresponding to the target user in the multiple frames of second images, whether the target user has performed the gesture to be recognized.
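Because the preset gesture is only the beginning of the full gesture, the judgment in this claim spans both image sequences. A minimal sketch, assuming per-frame motion labels (the labels, the `performed_full_gesture` helper, and the in-order matching rule are illustrative assumptions):

```python
# Sketch: concatenate the frames observed before selection (first images)
# with those observed after selection (second images), then check that
# the full gesture's steps appear in order in the combined sequence.

def performed_full_gesture(first_frames, second_frames, full_gesture):
    """Each *_frames argument is a list of per-frame motion labels."""
    sequence = first_frames + second_frames
    it = iter(sequence)
    # 'step in it' consumes the iterator, so steps must occur in order
    return all(step in it for step in full_gesture)

print(performed_full_gesture(["raise"], ["hold", "swipe_left"],
                             ["raise", "swipe_left"]))  # True
```

The point of the combined check is that frames used to select the target user are not discarded: they contribute the initial part of the gesture to be recognized.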
  16. The apparatus according to claim 14 or 15, wherein the first acquisition module is configured to:
    determine a face image position of the target user in the second image according to stored face information of the target user; and
    determine the region to be recognized corresponding to the target user in the second image according to the face image position of the target user in the second image.
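Deriving a region to be recognized from a face position, as in this claim (and claim 18 for potential users), can be sketched as a geometric expansion of the face bounding box. The expansion factors and the `region_from_face` helper are illustrative assumptions; the patent does not specify how the region is constructed from the face position.

```python
# Sketch: widen the face box sideways and extend it downward so that it
# plausibly covers the user's hands, then clip the result to the image.

def region_from_face(face_box, img_w, img_h, side=1.5, down=3.0):
    x, y, w, h = face_box  # face bounding box: top-left corner, width, height
    rx = max(0, int(x - side * w))            # widen to both sides
    ry = max(0, y)                            # keep the face's top edge
    rw = min(img_w, int(x + w + side * w)) - rx
    rh = min(img_h, int(y + h + down * h)) - ry
    return (rx, ry, rw, rh)

print(region_from_face((100, 50, 40, 40), 640, 480))  # (40, 50, 160, 160)
```

Anchoring the region to the tracked face keeps hand detection cheap: only the cropped region, not the whole second image, needs per-frame processing.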
  17. The apparatus according to any one of claims 13 to 16, wherein the apparatus further comprises:
    a fourth determination module, configured to stop using the target user as the gesture recognition object when, after the target user becomes the gesture recognition object, the number of images obtained by the camera shooting the shooting area that do not include the face image of the target user exceeds a quantity threshold, or when the duration for which the target user serves as the gesture recognition object reaches a duration threshold.
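The two release conditions in this claim reduce to a simple disjunction. A minimal sketch (the threshold values and the `should_release` helper are illustrative assumptions):

```python
# Sketch: the target user stops being the gesture recognition object
# once too many captured frames lack their face (user likely left),
# or once they have held the role for the maximum allowed duration.

def should_release(missing_face_count, elapsed_s,
                   count_threshold=30, duration_threshold_s=60.0):
    return (missing_face_count > count_threshold
            or elapsed_s >= duration_threshold_s)

print(should_release(5, 10.0))    # False: still tracked, within time
print(should_release(31, 10.0))   # True: face missing too often
print(should_release(0, 60.0))    # True: duration threshold reached
```

The duration cap prevents one user from monopolizing gesture control indefinitely, while the missing-face count handles the user walking out of frame.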
  18. The apparatus according to any one of claims 13 to 17, wherein the apparatus further comprises:
    a fifth determination module, configured to determine a face image position of the potential user in the first image after the one or more potential users in the shooting area are determined; and
    a sixth determination module, configured to determine the region to be recognized corresponding to the potential user in the first image according to the face image position of the potential user in the first image.
  19. The apparatus according to any one of claims 13 to 18, wherein the third determination module is configured to:
    when there are, in the shooting area, multiple potential users whose hand motions match the preset gesture, take the potential user closest to the camera as the target user.
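The tie-breaking rule in this claim is a nearest-candidate selection. A minimal sketch (the `pick_target` helper and the candidate tuple layout are illustrative assumptions):

```python
# Sketch: among the potential users whose hand motion matched the
# preset gesture, pick the one closest to the camera as the target.

def pick_target(candidates):
    """candidates: list of (user_id, distance_to_camera_m) tuples for
    users whose hand motion matched the preset gesture."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])[0]

print(pick_target([("alice", 2.4), ("bob", 1.1), ("carol", 3.0)]))  # bob
```

The per-user distances could come from the focal-length/interocular estimate of claim 21, which keeps the whole selection monocular.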
  20. The apparatus according to any one of claims 13 to 19, wherein the apparatus further comprises:
    a second acquisition module, configured to acquire a distance from the potential user to the camera; and
    an output module, configured to output a distance prompt when the distance from the potential user to the camera exceeds a distance threshold, wherein the distance prompt is used to prompt the potential user to move closer to the camera.
  21. The apparatus according to claim 20, wherein the second acquisition module is configured to:
    determine the distance from the potential user to the camera according to a focal length of the camera, an interocular distance of the potential user in the first image, and a preset user interocular distance.
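The distance estimate in this claim follows directly from similar triangles in the pinhole camera model: distance ≈ focal length (in pixels) × real interocular distance ÷ interocular distance measured in the image. A minimal sketch; the `user_distance` helper and the preset interocular distance of ~0.063 m (a commonly cited adult average) are assumptions, not values from the patent.

```python
# Sketch of monocular distance estimation from the interocular distance:
# with a pinhole model, an object of real size S at distance D projects
# to S * f / D pixels, so D = f * S / (projected size in pixels).

def user_distance(focal_length_px, eye_dist_px, preset_eye_dist_m=0.063):
    if eye_dist_px <= 0:
        raise ValueError("interocular pixel distance must be positive")
    return focal_length_px * preset_eye_dist_m / eye_dist_px

print(round(user_distance(1000.0, 42.0), 3))  # 1.5 (metres)
```

Because only one camera is needed, this keeps the apparatus monocular; the preset interocular distance introduces a per-user error of a few percent, which is acceptable for a coarse "move closer" prompt.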
  22. The apparatus according to any one of claims 13 to 21, wherein the second determination module is configured to:
    perform key point detection separately on the regions to be recognized corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand key point information of the potential user; and
    determine the hand motion of the potential user according to the multiple sets of hand key point information of the potential user.
  23. The apparatus according to any one of claims 13 to 21, wherein the region to be recognized corresponding to the potential user further includes an elbow image of the potential user, and the second determination module is configured to:
    perform key point detection separately on the regions to be recognized corresponding to the potential user in the multiple frames of first images, to obtain multiple sets of hand and elbow key point information of the potential user; and
    determine the hand motion of the potential user according to the multiple sets of hand and elbow key point information of the potential user.
  24. The apparatus according to any one of claims 13 to 23, wherein the face image is a frontal face image.
  25. A gesture recognition object determination device, comprising a processor and a memory, wherein:
    the memory is configured to store a computer program, the computer program comprising program instructions; and
    the processor is configured to invoke the computer program to implement the gesture recognition object determination method according to any one of claims 1 to 12.
  26. A computer-readable storage medium, wherein instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the gesture recognition object determination method according to any one of claims 1 to 12 is implemented.
  27. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the gesture recognition object determination method according to any one of claims 1 to 12 is implemented.
PCT/CN2022/078623 2021-06-30 2022-03-01 Gesture recognition object determination method and apparatus WO2023273372A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110736357.6 2021-06-30
CN202110736357 2021-06-30
CN202111034365.2A CN115565241A (en) 2021-06-30 2021-09-03 Gesture recognition object determination method and device
CN202111034365.2 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023273372A1

Family

ID=84689882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078623 WO2023273372A1 (en) 2021-06-30 2022-03-01 Gesture recognition object determination method and apparatus

Country Status (1)

Country Link
WO (1) WO2023273372A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100266206A1 (en) * 2007-11-13 2010-10-21 Olaworks, Inc. Method and computer-readable recording medium for adjusting pose at the time of taking photos of himself or herself
CN107239727A (en) * 2016-12-07 2017-10-10 北京深鉴智能科技有限公司 Gesture identification method and system
CN108960163A (en) * 2018-07-10 2018-12-07 亮风台(上海)信息科技有限公司 Gesture identification method, device, equipment and storage medium
CN109977906A (en) * 2019-04-04 2019-07-05 睿魔智能科技(深圳)有限公司 Gesture identification method and system, computer equipment and storage medium
CN110032966A (en) * 2019-04-10 2019-07-19 湖南华杰智通电子科技有限公司 Human body proximity test method, intelligent Service method and device for intelligent Service

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116301363A (en) * 2023-02-27 2023-06-23 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116301363B (en) * 2023-02-27 2024-02-27 荣耀终端有限公司 Space gesture recognition method, electronic equipment and storage medium
CN116301361A (en) * 2023-03-08 2023-06-23 百度在线网络技术(北京)有限公司 Target selection method and device based on intelligent glasses and electronic equipment

Similar Documents

Publication Publication Date Title
TWI751161B (en) Terminal equipment, smart phone, authentication method and system based on face recognition
US10043308B2 (en) Image processing method and apparatus for three-dimensional reconstruction
CN104956292B (en) The interaction of multiple perception sensing inputs
CN103353935B (en) A kind of 3D dynamic gesture identification method for intelligent domestic system
WO2023273372A1 (en) Gesture recognition object determination method and apparatus
US11776322B2 (en) Pinch gesture detection and recognition method, device and system
CN112585566B (en) Hand-covering face input sensing for interacting with device having built-in camera
US20110273551A1 (en) Method to control media with face detection and hot spot motion
WO2019214442A1 (en) Device control method, apparatus, control device and storage medium
US20120019684A1 (en) Method for controlling and requesting information from displaying multimedia
CN106155315A (en) The adding method of augmented reality effect, device and mobile terminal in a kind of shooting
US20150370336A1 (en) Device Interaction with Spatially Aware Gestures
CN110688914A (en) Gesture recognition method, intelligent device, storage medium and electronic device
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
WO2023173668A1 (en) Input recognition method in virtual scene, device and storage medium
CN114138121B (en) User gesture recognition method, device and system, storage medium and computing equipment
CN115565241A (en) Gesture recognition object determination method and device
CN112083801A (en) Gesture recognition system and method based on VR virtual office
US20150185851A1 (en) Device Interaction with Self-Referential Gestures
Perra et al. Adaptive eye-camera calibration for head-worn devices
CN111103981A (en) Control instruction generation method and device
WO2024055957A1 (en) Photographing parameter adjustment method and apparatus, electronic device and readable storage medium
WO2023169282A1 (en) Method and apparatus for determining interaction gesture, and electronic device
CN117813581A (en) Multi-angle hand tracking
CN112367468B (en) Image processing method and device and electronic equipment

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE