WO2023201512A1 - Gesture recognition method, interaction method, gesture interaction system, electronic device, and storage medium


Info

Publication number
WO2023201512A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2022/087576
Other languages
English (en)
French (fr)
Inventor
王美丽
吕耀宇
陈丽莉
董学
张浩
王佳斌
李扬冰
王明东
王雷
Original Assignee
京东方科技集团股份有限公司
北京京东方技术开发有限公司
Application filed by 京东方科技集团股份有限公司, 北京京东方技术开发有限公司
Priority to CN202280000791.0A (published as CN117255982A)
Priority to PCT/CN2022/087576 (published as WO2023201512A1)
Publication of WO2023201512A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer

Definitions

  • Embodiments of the present disclosure relate to a gesture recognition method, an interaction method, a gesture interaction system, an electronic device, and a non-transitory computer-readable storage medium.
  • Objects shown on a 3D display need to appear to move out of and into the screen.
  • When people watch 3D displays and want to interact immersively with the displayed content, an interaction system or interaction method is needed that can obtain depth information and that offers high precision, low latency, and a large field of view, so as to realize interaction between the user and the 3D display content.
  • gesture recognition technology is a major research hotspot, and it is applied in many fields such as naked-eye 3D displays, VR/AR/MR, vehicles, games and entertainment, smart wearables, and industrial design.
  • the core of realizing interactive functions based on gesture recognition technology is to collect the user's gesture information through sensing devices, such as cameras, recognize gestures through related recognition and classification algorithms, and assign different semantic information to different gestures to achieve different interactive functions.
  • At least one embodiment of the present disclosure provides a gesture recognition method, including: acquiring multiple sets of images taken at different shooting moments for a gesture action object, wherein each set of images includes at least a pair of corresponding depth map and grayscale image; and, according to the multiple sets of images, using the depth map in each set of images to obtain spatial information and using the grayscale image in each set of images to obtain the posture information of the gesture action object, so as to identify the dynamic gesture changes of the gesture action object.
  • Using a depth map in each set of images to obtain spatial information includes: determining a gesture area in the depth map according to the depth map, wherein the spatial information includes the gesture area in the depth map. Using the grayscale image in each set of images to obtain the posture information of the gesture action object includes: determining, according to the gesture area in the depth map and the grayscale image, the posture information for the gesture action object corresponding to each set of images. Identifying the dynamic gesture changes of the gesture action object includes: determining the dynamic gesture changes of the gesture action object according to the posture information for the gesture action object respectively corresponding to the multiple sets of images.
  • Determining the gesture area in the depth map according to the depth map includes: traversing the depth map and counting the depth data in the depth map to establish a depth histogram; and selecting an adaptive depth threshold corresponding to the depth map and determining the gesture area in the depth map according to the adaptive depth threshold and the depth histogram.
  • the posture information for the gesture action object corresponding to each group of images includes finger status information and position information.
  • Determining the posture information for the gesture action object corresponding to each set of images includes: applying the gesture area in the depth map to the grayscale image to obtain the gesture analysis area in the grayscale image; performing binarization processing on the gesture analysis area to obtain the gesture connected domain; performing convex hull detection on the gesture connected domain to obtain the finger status information, wherein the finger status information includes whether there are fingers stretched out and the number of fingers stretched out; and determining the position information based on the depth map, wherein the position information includes the coordinate position of the gesture action object in the gesture interaction space.
  • Determining the dynamic gesture change of the gesture action object based on the posture information for the gesture action object corresponding to the multiple sets of images includes: determining, according to the finger state information and position information respectively corresponding to the multiple sets of images, the finger extension state changes and position changes of the gesture action object within the recognition period composed of the different shooting moments; and determining the dynamic gesture change of the gesture action object according to the finger extension state changes and position changes.
  • The coordinate position includes a depth coordinate, and the dynamic gesture changes of the gesture action object include gesture actions. Determining the dynamic gesture change of the gesture action object includes: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the depth coordinate of the target recognition point in the gesture action object first decreases and then increases during at least part of that period, determining that the gesture action is a click gesture.
  • The coordinate position includes a depth coordinate, and the dynamic gesture changes of the gesture action object include gesture actions. Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the depth coordinate of the target recognition point first decreases and then remains unchanged during at least part of that period, with the duration of the hold exceeding the first threshold, determining that the gesture action is a long press gesture.
  • The dynamic gesture changes of the gesture action object include gesture actions, and determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the distance by which the target recognition point slides along a preset direction exceeds the second threshold during at least part of that period, determining that the gesture action is a sliding gesture, wherein the sliding distance is calculated from the position information of the target recognition point of the gesture action object in the multiple sets of images.
  • The dynamic gesture changes of the gesture action object include gesture actions, and determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that, within the recognition period, the gesture action object transitions from at least one finger being in the extended state to no finger being in the extended state, determining that the gesture action is a grab gesture.
  • The dynamic gesture changes of the gesture action object include gesture actions, and determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that, within the recognition period, the gesture action object transitions from no finger being in the extended state to at least one finger being in the extended state, determining that the gesture action is a release gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change further includes: before determining the gesture action, determining whether the gesture action object has a static time exceeding a third threshold before the action change occurs; in response to the existence of a static time exceeding the third threshold, continuing to determine the gesture action; and in response to the absence of a static time exceeding the third threshold, determining that no gesture action change has occurred.
  • The dynamic gesture change of the gesture action object further includes a gesture position, and determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change further includes: in response to the gesture action being a click gesture, a long press gesture, or a sliding gesture, determining the gesture position based on the position information of the target recognition point in the gesture action object, wherein the target recognition point includes the fingertip point of the target finger; and in response to the gesture action being a grab gesture or a release gesture, determining the gesture position based on the position information of the gesture center of the gesture action object, wherein the gesture center is the center of the largest inscribed circle of the gesture connected domain.
  • Determining the gesture position based on the position information of the target recognition point in the gesture action object includes: obtaining a plurality of pieces of position information respectively corresponding to a plurality of sampling points at preset positions around the target recognition point, and obtaining the gesture position according to the plurality of pieces of position information and the position information of the target recognition point. Determining the gesture position based on the position information of the gesture center of the gesture action object includes: obtaining a plurality of pieces of position information respectively corresponding to a plurality of sampling points at preset positions around the gesture center, and obtaining the gesture position according to the plurality of pieces of position information and the position information of the gesture center.
  • Obtaining multiple sets of images of the gesture action object taken at different shooting moments includes: using at least one shooting device to continuously shoot the gesture action object to obtain multiple sets of images respectively corresponding to the different shooting moments, wherein each shooting device is configured to synchronously output a pair of corresponding depth image and grayscale image at one shooting moment.
  • the gesture action object is closest to the at least one shooting device relative to other objects in each image.
  • In response to the number of the at least one shooting device being multiple, each set of images includes multiple pairs of corresponding depth maps and grayscale images; the multiple pairs of depth maps and grayscale images are obtained by the multiple shooting devices synchronously photographing the gesture action object at the same shooting moment, and they have different shooting angles.
  • Using the depth map in each set of images to obtain spatial information and using the grayscale image in each set of images to obtain the posture information of the gesture action object so as to identify the dynamic gesture changes of the gesture action object further includes: determining, based on the multiple pairs of depth maps and grayscale images that are obtained by the same shooting device, belong to the multiple sets of images, and correspond to the different shooting moments, the intermediate gesture changes of the gesture action object corresponding to that shooting device; and performing weighting and filtering processing on the multiple intermediate gesture changes corresponding to the multiple shooting devices to obtain the dynamic gesture changes of the gesture action object.
  • Using at least one shooting device to continuously shoot the gesture action object to obtain multiple sets of images corresponding to the different shooting moments includes: using each shooting device to continuously photograph the gesture action object to obtain multiple pairs of depth images and grayscale images output by that shooting device and respectively corresponding to the different shooting moments.
  • Each shooting device includes a first acquisition unit configured to acquire a grayscale image in every first frame and to acquire a depth map in every N first frames, wherein the depth map is generated based on the N grayscale images acquired in every N consecutive first frames, the N grayscale images respectively correspond to N different phases, and the depth map and one of the N grayscale images are output by the shooting device synchronously, where N is a positive integer greater than 1. Using each shooting device to continuously shoot the gesture action object to obtain multiple pairs of depth images and grayscale images respectively corresponding to the different shooting moments includes: using the shooting device to output a pair of corresponding depth image and grayscale image in each first frame, wherein the output depth map is predicted by smooth trajectory fitting based on the N grayscale images and the one depth map.
  • Each shooting device includes a first acquisition unit configured to acquire a grayscale image in every first frame and to acquire a depth map in every N first frames, wherein the depth map is generated based on the N grayscale images acquired in every N consecutive first frames, the N grayscale images respectively correspond to N different phases, and the depth map and one of the N grayscale images are output by the shooting device synchronously, where N is a positive integer greater than 1. Using each shooting device to continuously shoot the gesture action object to obtain multiple pairs of depth images and grayscale images respectively corresponding to the different shooting moments includes: using the shooting device to output a pair of corresponding depth image and grayscale image at most every N-1 first frames, wherein the output depth map is calculated from the grayscale images of the N-1 first frames adjacent to the output grayscale image, and the output grayscale image and the grayscale images of the adjacent N-1 first frames correspond to the N different phases.
  • each shooting device includes a first acquisition unit and a second acquisition unit, and the second acquisition unit is configured to output a grayscale image in each second frame.
  • The first acquisition unit is configured to output a depth map every M second frames, where M is a positive integer greater than 1. Using each shooting device to continuously shoot the gesture action object to obtain the multiple pairs of depth images and grayscale images respectively corresponding to the different shooting moments includes: using the shooting device to output a pair of corresponding depth image and grayscale image at most every M-1 second frames, wherein the output depth map includes a reference depth map, or a depth map predicted by smooth trajectory fitting based on the reference depth map and at least one grayscale image corresponding to the reference depth map; the reference depth map includes the depth map output by the first acquisition unit at or before the current second frame, the current second frame being the second frame in which the pair of corresponding depth image and grayscale image is output; and the at least one grayscale image includes the grayscale images output by the second acquisition unit between the second frame corresponding to the reference depth map and the current second frame.
  • The first acquisition unit is further configured to obtain a pair of corresponding depth image and grayscale image in each first frame, where the obtained depth map is calculated from the grayscale images of the N-1 first frames adjacent to the obtained grayscale image, the obtained grayscale image and the grayscale images of the adjacent N-1 first frames correspond to N different phases, the frame length of the first frame is greater than the frame length of the second frame, and N is a positive integer greater than 1.
  • At least one embodiment of the present disclosure provides an interaction method, including: displaying a control; using the gesture recognition method described in any embodiment of the present disclosure to identify the dynamic gesture changes of a user performing a target action; and triggering the control based on the identified dynamic gesture changes and the target action.
  • The dynamic gesture changes include gesture actions, and triggering the control according to the recognized dynamic gesture changes and the target action includes: in response to the user's gesture action being consistent with the target action, triggering the control and displaying a visual feedback effect.
  • The dynamic gesture changes include gesture actions and gesture positions, and triggering the control according to the recognized dynamic gesture changes and the target action includes: in response to the user's gesture action being consistent with the target action and the user's gesture position matching the control position of the control, triggering the control and displaying a visual feedback effect, wherein the gesture position matching the control position means that the gesture position, mapped according to the mapping relationship into the coordinate system where the control is located, is consistent with the control position.
  • At least one embodiment of the present disclosure provides a gesture interaction system, including: at least one shooting device configured to continuously shoot the gesture action object to obtain multiple sets of images taken at different shooting moments; a gesture recognition unit configured to receive the multiple sets of images, execute the gesture recognition method described in any embodiment of the present disclosure, and output the recognition result of the dynamic gesture changes of the gesture action object; and a display unit configured to receive the recognition result and display an interactive effect according to the recognition result.
  • The gesture interaction system includes multiple shooting devices, and the multiple shooting devices are configured to synchronously shoot the gesture action object from different angles, so that corresponding pairs of depth images and grayscale images of the gesture action object are obtained at the same shooting moment.
  • Each shooting device includes a first acquisition unit and a second acquisition unit, and the first acquisition unit and the second acquisition unit are configured to synchronously capture the gesture action object.
  • the plurality of shooting devices are configured to select some or all of the shooting devices to shoot the gesture action object according to the position of the gesture action object in the gesture interaction space.
  • the gesture recognition unit includes a digital signal processor.
  • At least one embodiment of the present disclosure provides an electronic device, including: a memory non-transiently storing computer-executable instructions; and a processor configured to run the computer-executable instructions, wherein the computer-executable instructions, when run by the processor, implement the gesture recognition method according to any embodiment of the present disclosure or the interaction method according to any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions.
  • Figure 1 shows a schematic flow chart of gesture recognition
  • Figure 2 shows a schematic diagram of the detection process of a TOF camera
  • Figure 3 is a schematic flow chart of a gesture recognition method provided by at least one embodiment of the present disclosure
  • Figure 4 is a schematic diagram of a gesture interaction space provided by at least one embodiment of the present disclosure.
  • Figure 5A is a schematic diagram of convex hull detection provided by at least one embodiment of the present disclosure.
  • Figure 5B is a schematic diagram of the gesture information extraction process provided by an embodiment of the present disclosure.
  • Figure 6 shows the relationship between the depth image and the grayscale image of the TOF camera
  • Figure 7A is a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by an embodiment of the present disclosure
  • Figure 7B is a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by another embodiment of the present disclosure.
  • Figure 7C is a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic flow chart of an interactive method provided by at least one embodiment of the present disclosure.
  • Figure 9 is a schematic block diagram of a gesture interaction system provided by at least one embodiment of the present disclosure.
  • Figure 10 is a schematic block diagram of a gesture recognition unit provided by at least one embodiment of the present disclosure.
  • Figure 11 is a schematic diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • Figure 12 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure
  • Figure 13 is a schematic diagram of a hardware environment provided by at least one embodiment of the present disclosure.
  • Gesture interaction refers to the use of computer graphics and other technologies to recognize the body language of gesture objects and convert it into commands to operate the device. It is a new way of human-computer interaction after mouse, keyboard and touch screen. Among the many interaction methods, gesture interaction is in line with human communication habits and is used most frequently in daily life. It has irreplaceable natural advantages, such as:
  • gesture operations can be performed anytime and anywhere.
  • gesture interaction is a very popular research field and can be applied to a variety of application scenarios.
  • the sensing devices for collecting gestures use monocular cameras, binocular cameras, structured light cameras, TOF (Time of flight) cameras, etc.
  • Gesture images collected by monocular cameras do not contain depth information, so deep learning methods are often used to extract abstract gesture features and complete the gesture classification task. This approach demands very high system computing power and relatively high image resolution (for example, 1920 pixels * 1080 pixels), and the processing speed is slow.
  • Binocular cameras can calculate depth through disparity information, and usually use deep learning methods to position the three-dimensional joint points of gestures.
  • However, the disparity registration of binocular data itself requires a huge amount of computation, so this method also places very high demands on system computing power.
  • A structured light camera uses an infrared light source to project a pattern into the space; since the infrared camera is sensitive to this light, the depth information of the scene can be calculated from the deformation of the pattern. This method requires a very large amount of data to be processed and usually needs a dedicated processing chip to calculate the depth information, which is costly.
  • TOF cameras use time-of-flight technology to calculate the depth information of the scene through time difference or phase difference information, which requires little computing power.
  • However, it is still necessary to use deep learning methods to extract the gesture joint points in the image, so the requirements on computing power remain high, and the processing time is usually above 20 ms.
  • Figure 1 shows a schematic flow chart of gesture recognition.
  • Gesture features include global features and local features.
  • Gesture feature forms include color histograms, skin color information, edge information, regional information, etc., which are used to realize gesture detection and segmentation.
  • image preprocessing takes 0.5ms
  • global feature extraction takes 18ms
  • local feature extraction takes 5ms
  • gesture detection and segmentation takes 4.5ms.
  • Abstract features are then extracted, such as LBP (Local Binary Patterns), Hu moments (image moments), and SURF (Speeded-Up Robust Features), to complete the classification of gestures.
  • the joint point parameters can also be extracted through the established gesture joint point model to complete the recognition of the 3D gesture.
  • abstract feature extraction takes 5.3ms
  • gesture classification takes 4.5ms
  • joint point parameter extraction takes 6.5ms
  • 3D gesture recognition takes 3.3ms
  • loading the gesture joint point model takes 2000ms.
  • semantic information processing takes 0.2ms.
  • the current gesture recognition method requires high system computing power and a large delay.
  • the recognition delay of this method is usually 20ms or more, making it difficult to achieve real-time interaction.
  • image collection in existing gesture interaction systems is usually implemented using binocular cameras, structured light cameras, TOF cameras, etc.
  • the raw signal data (Raw Data) collected by the photosensitive element (sensor) needs to be preprocessed in the computer using the Image Signal Processor (ISP) algorithm.
  • the preprocessing includes black level compensation, Lens correction, bad pixel correction, noise removal, phase correction, depth calculation, data calibration, etc.
  • Since structured light cameras need to process a very large amount of data, they are generally equipped with a dedicated processing chip to execute the ISP algorithm. As a result, a typical depth camera has a resolution of 840 pixels * 480 pixels and a frame rate of 30 fps (frames per second).
  • Figure 2 shows a schematic diagram of the detection process of a TOF camera.
  • the photosensitive element used in a TOF camera is a silicon-based image sensor.
  • the TOF camera includes at least three parts: a light source, a receiving array, and a circuit.
  • the circuit includes a signal transmitter, a modulation unit, a demodulation unit, and a calculation unit.
  • the light source emits a modulated infrared light, which is reflected after hitting the target.
  • the reflected modulated square wave passes through the lens and is finally received by the receiving array.
  • The received signal is then demodulated by the demodulation unit, and the calculation unit computes the distance information from the time difference or phase difference.
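  • As an illustration of the phase-based distance calculation described above, the following minimal Python sketch (not part of the patent text) shows the standard continuous-wave TOF formula for recovering distance from four DCS phase samples; the modulation frequency and the 4-phase scheme are assumptions made for the example.

```python
import numpy as np

C = 299_792_458.0   # speed of light, m/s
F_MOD = 20e6        # assumed modulation frequency of the emitted infrared light, Hz

def depth_from_dcs(dcs0, dcs1, dcs2, dcs3):
    """Recover per-pixel distance (m) from four DCS grayscale frames (phases 0/90/180/270 deg)."""
    d0, d1, d2, d3 = (np.asarray(d, dtype=np.float64) for d in (dcs0, dcs1, dcs2, dcs3))
    phase = np.arctan2(d3 - d1, d0 - d2)           # wrapped phase difference
    phase = np.mod(phase, 2.0 * np.pi)             # map to [0, 2*pi)
    return C * phase / (4.0 * np.pi * F_MOD)       # distance = c * phi / (4 * pi * f_mod)
```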
  • In gesture interaction technology, because the frame rate of the depth camera is low and processing the information occupies more resources, the delay of gesture interaction is further increased and resource consumption rises.
  • Gesture recognition methods currently mostly use deep learning methods. As mentioned above, their execution delays are high and system resource overhead is also large. Therefore, the current gesture interaction system consumes a lot of system resources as a whole, has high latency, and has low accuracy. This is one of the reasons why it is difficult to promote gesture interaction technology now.
  • The gesture recognition method includes: acquiring multiple sets of images taken at different shooting moments for the gesture action object, wherein each set of images includes at least a pair of corresponding depth image and grayscale image; and, based on the multiple sets of images, using the depth map in each set of images to obtain spatial information and using the grayscale image in each set of images to obtain the posture information of the gesture action object, so as to identify the dynamic gesture changes of the gesture action object.
  • This gesture recognition method uses synchronously collected depth maps and grayscale images containing the gesture action object, extracts spatial information from the depth map, and obtains posture information from the grayscale image, thereby realizing the recognition and positioning of gestures. Since it does not use complex deep learning algorithms, the overall processing time is reduced, gesture recognition results are obtained quickly, system resource usage is reduced, and real-time gesture interaction is ensured.
  • the processing time of gesture recognition can be reduced from 20 ms to 5 ms.
  • The gesture interaction method provided by the embodiments of the present disclosure can be applied to mobile terminals (such as mobile phones, tablets, etc.). It should be noted that the gesture recognition method provided by the embodiments of the present disclosure can be applied in the gesture interaction method provided by the embodiments of the present disclosure.
  • the gesture interaction system can be configured on an electronic device.
  • the electronic device may be a personal computer, a mobile terminal, etc.
  • the mobile terminal may be a mobile phone, a tablet computer, or other hardware devices with various operating systems.
  • FIG. 3 is a schematic flow chart of a gesture recognition method provided by at least one embodiment of the present disclosure.
  • the gesture recognition method provided by at least one embodiment of the present disclosure includes steps S10 to S20.
  • Step S10 Obtain multiple sets of images taken at different shooting times for the gesture action object.
  • each set of images includes at least one pair of corresponding depth images and grayscale images.
  • corresponding depth image and grayscale image means that the depth image and the grayscale image correspond to the same shooting moment.
  • the gesture action object may include a human hand, such as a user's hand.
  • Gesture action objects may also include other hand-shaped objects, such as items with a human hand shape (for example, an inflatable ball shaped like a hand in a fist state); this disclosure does not specifically limit this.
  • step S10 may include: using at least one shooting device to continuously shoot the gesture action object to obtain multiple sets of images respectively corresponding to the different shooting moments.
  • each shooting device is configured to synchronously output a pair of corresponding depth images and grayscale images at one shooting moment.
  • a gesture action object performs a gesture action within the shooting range of one or more shooting devices.
  • the one or more shooting devices synchronously shoot the gesture action object within a preset recognition period.
  • Each shooting device synchronously outputs a pair of corresponding depth image and grayscale image at a shooting moment, and the pair of depth image and grayscale image is captured by that shooting device at the same shooting moment. For example, when one shooting device is provided, a set of images is obtained at one shooting moment, and the set of images includes one pair of depth image and grayscale image; after shooting over a preset recognition period, multiple sets of images are obtained, and the multiple sets of images correspond to different shooting moments.
  • When T shooting devices are provided, a set of images is obtained at one shooting moment; the set of images includes T pairs of depth images and grayscale images, and the T pairs respectively come from the T shooting devices. After shooting over a preset recognition period, multiple sets of images are obtained; the multiple sets of images correspond to different shooting moments, and each set of images includes T pairs of depth maps and grayscale images corresponding to the same shooting moment.
  • T is a positive integer greater than 1.
  • T shooting devices need to simultaneously shoot gesture action objects.
  • T shooting devices receive trigger instructions at the same time, and when receiving the trigger instructions, they simultaneously shoot gesture action objects to obtain a set of images corresponding to the same shooting moment.
  • The set of images includes T pairs of depth images and grayscale images.
  • The T shooting devices are set up with different shooting angles and shooting positions so as to shoot the gesture action object from different angles, so that the multiple pairs of depth images and grayscale images obtained at the same shooting moment have different shooting angles. To a certain extent this makes gesture capture robust against occlusion, reducing or avoiding recognition failures caused by the gesture action object being undetectable because the gesture is occluded.
  • When photographing a gesture action object, the gesture action object is closest to the at least one shooting device relative to other objects in the image.
  • the shooting device can be integrated at the center of the lower edge of the display unit, and the camera optical axis is tilted forward and upward to shoot the gesture action object, so the gesture action object can be considered to be the object closest to the shooting device.
  • Figure 4 is a schematic diagram of a gesture interaction space provided by at least one embodiment of the present disclosure.
  • the width of the display unit is 70 cm and the height is 40 cm.
  • the display unit is used to display display content that interacts with gesture action objects.
  • The display content includes general, easy-to-understand controls, and the controls are preset with triggering behaviors or functions. When a control is triggered, a visual feedback effect is displayed, such as a change in material, color, or brightness, to remind the user that the interaction is completed.
  • controls can include three-dimensional virtual controls, and controls can be presented in different ways such as buttons, icons, 3D models, etc.
  • the display unit can be implemented as a three-dimensional display.
  • the three-dimensional display can be a naked-eye three-dimensional display, that is, the user can see the three-dimensional display effect with both eyes without using other tools.
  • The gesture action object (that is, the user's hand) needs to be about 30 cm away from the eyes so as not to block the line of sight. Therefore, the gesture interaction space is a three-dimensional space with a size of 70cm*40cm*35cm between the display unit and the user.
  • the gesture action object moves in the gesture interaction space, and the gesture action object is closest to the shooting device relative to other objects (such as other body parts of the user).
  • The shooting device continuously shoots the gesture action object to obtain multiple sets of images for the recognition of dynamic gesture changes.
  • Figure 4 shows a possible way to define the gesture interaction space.
  • The size of the gesture interaction space may also be different; this disclosure does not impose specific restrictions on this.
  • the photographing device includes a first acquisition unit, for example, the first acquisition unit includes a TOF camera.
  • the first acquisition unit is configured to acquire the grayscale image every first frame, and acquire the depth image every N first frames.
  • the depth map is generated based on N grayscale images obtained for every N consecutive first frames.
  • the N grayscale images respectively correspond to N different phases.
  • the depth map is different from the N grayscale images.
  • N is a positive integer and greater than 1.
  • the first frame represents the time of a Differential Correlation Sample (DCS) phase, such as 25ms (milliseconds).
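  • The summary above also describes outputting a depth map at every first frame by reusing the most recent N phase frames. The sketch below is an illustration only, assuming N = 4 and reusing the depth_from_dcs helper shown earlier: it keeps a sliding window of the latest phase frames and recomputes a depth map whenever a new grayscale frame arrives.

```python
from collections import deque

N = 4                      # assumed number of DCS phases per depth map (the patent only requires N > 1)
window = deque(maxlen=N)   # most recent N (phase_index, grayscale_frame) pairs

def on_new_phase_frame(phase_index, gray_frame):
    """Return a synchronized (depth, grayscale) pair for this first frame, or None while the window fills."""
    window.append((phase_index % N, gray_frame))
    if len(window) < N:
        return None
    ordered = [frame for _, frame in sorted(window, key=lambda item: item[0])]  # phases 0..N-1
    depth = depth_from_dcs(*ordered)        # helper sketched earlier for the 4-phase case
    return depth, gray_frame                # corresponding pair for the current first frame
```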
  • the shooting device can be configured to further include a second acquisition unit.
  • the second acquisition unit is, for example, a high-speed grayscale camera.
  • the second acquisition unit is configured to output a grayscale image every second frame
  • the first acquisition unit is configured to output a depth image every M second frames, where M is a positive integer and greater than 1.
  • the frame length of the second frame is smaller than the frame length of the first frame.
  • a high-speed grayscale camera outputs a grayscale image every 8.3ms, that is, the frame length of the second frame is 8.3ms.
  • In this way, the shooting device can output a pair of depth image and grayscale image at shortest every second frame, that is, a pair of depth image and grayscale image every 8.3ms, as opposed to outputting a pair of depth image and grayscale image only every 100ms.
  • The delay of image acquisition is thus greatly reduced and the frame rate is greatly increased, meeting the needs of low-latency real-time interaction and realizing rapid recognition of gesture actions and gesture positions overall; the absolute delay of the processing process (including image acquisition time and image processing time) can reach 15ms or less, and the reporting rate reaches 120Hz.
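  • Between two depth outputs of the first acquisition unit, the depth at a grayscale-only second frame has to be predicted by smooth trajectory fitting, as mentioned in the summary above. A minimal sketch using a low-order polynomial fit over the recent depth values of a tracked point follows; the fitting order and window length are assumptions, since the patent does not specify the fitting method.

```python
import numpy as np

def predict_depth(times, depths, t_query, order=2):
    """Fit a smooth trajectory to recent depth samples of a tracked point (e.g. the fingertip)
    and evaluate it at the timestamp of a grayscale-only second frame."""
    times = np.asarray(times, dtype=float)
    depths = np.asarray(depths, dtype=float)
    coeffs = np.polyfit(times, depths, deg=min(order, len(times) - 1))
    return float(np.polyval(coeffs, t_query))
```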
  • The present disclosure does not impose specific restrictions on the shooting device and is not limited to the structure described in the embodiments of the present disclosure, as long as the shooting device can synchronously output a pair of depth image and grayscale image corresponding to one shooting moment and can output multiple pairs of depth images and grayscale images corresponding to multiple shooting moments.
  • In step S20, according to the multiple sets of images, the depth map in each set of images is used to obtain spatial information, and the grayscale image in each set of images is used to obtain the posture information of the gesture action object, so as to identify the dynamic gesture changes of the gesture action object.
  • spatial information includes the gesture area in the depth map, the position information of the gesture action object, etc.
  • The recognition of dynamic gesture changes is performed per shooting device. That is, if one shooting device is provided, so that each set of images includes one pair of depth image and grayscale image, the dynamic gesture changes are recognized based on the multiple sets of images. If multiple shooting devices are provided, so that each set of images includes multiple pairs of depth maps and grayscale images, then the gesture action changes corresponding to each shooting device are determined from the multiple pairs of depth maps and grayscale images that are obtained by that shooting device, belong to the multiple sets of images, and correspond to the different shooting moments; the final recognition result of the gesture action changes is then obtained based on the gesture action changes corresponding to the multiple shooting devices.
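  • The patent does not spell out the weighting and filtering used to merge the per-device results; the sketch below shows one plausible choice, purely as an illustration: a majority vote on the gesture action and a weighted average of the positions.

```python
import numpy as np
from collections import Counter

def fuse_camera_results(actions, positions, weights=None):
    """Fuse per-camera intermediate results: vote on the gesture action, weight-average the position."""
    positions = np.asarray(positions, dtype=float)          # shape (num_cameras, 3)
    if weights is None:
        weights = np.ones(len(positions))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused_position = (weights[:, None] * positions).sum(axis=0)
    fused_action = Counter(actions).most_common(1)[0][0]    # simple majority vote
    return fused_action, fused_position
```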
  • The following takes one shooting device as an example, where a set of images includes one pair of depth image and grayscale image, and specifically describes the process of recognizing the gesture action changes corresponding to that shooting device.
  • Each set of images includes one or more pairs of depth images and grayscale images, and each pair of depth image and grayscale image undergoes the same processing.
  • The depth map can be used to extract spatial information, for example extracting the area closest to the shooting device as the gesture area in the depth map; the gesture area in the depth map is then applied to the corresponding grayscale image to obtain the gesture analysis area in the grayscale image, and the posture information of the gesture action object is obtained based on the gesture analysis area.
  • the following describes in detail the process of obtaining the posture information of the corresponding gesture action object based on a pair of depth images and grayscale images.
  • using the depth map in each group of images to obtain spatial information may include: determining the gesture area in the depth map according to the depth map.
  • the spatial information includes the gesture area in the depth map.
  • Determining the gesture area in the depth map based on the depth map may include: traversing the depth map and counting the depth data in the depth map to establish a depth histogram; and selecting an adaptive depth threshold corresponding to the depth map and determining the gesture area in the depth map according to the adaptive depth threshold and the depth histogram.
  • Each pixel in the depth map can represent a point in the gesture interaction space.
  • the depth data in the depth map is counted to create a depth histogram.
  • the depth histogram can reflect the occupancy of each depth value in the image.
  • Local thresholds are calculated based on the depth distribution of different areas in the depth map, so different thresholds can be adaptively calculated for different areas of the depth map. For example, two adaptive thresholds are calculated for the depth map, and all pixels whose depth falls within the range defined by the two adaptive thresholds are determined as the gesture area in the depth map.
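  • A minimal numpy sketch of this step is shown below. The bin count and the way the two adaptive thresholds are chosen (walking from the near end of the histogram until enough pixels are accumulated, since the hand is assumed to be the nearest object) are illustrative heuristics, not the patent's exact rule.

```python
import numpy as np

def gesture_region_from_depth(depth, bins=256, min_pixels=500):
    """Return a boolean mask of the gesture area: the nearest depth band in the histogram."""
    valid = depth[depth > 0]                       # ignore invalid (zero) depth readings
    if valid.size == 0:
        return np.zeros_like(depth, dtype=bool)
    hist, edges = np.histogram(valid, bins=bins)   # depth histogram of the whole frame
    cum = np.cumsum(hist)
    near_bin = int(np.argmax(hist > 0))            # first occupied bin = nearest surface
    far_bin = int(np.searchsorted(cum, cum[near_bin] + min_pixels))
    lo = edges[near_bin]                           # adaptive near threshold
    hi = edges[min(far_bin + 1, bins)]             # adaptive far threshold
    return (depth >= lo) & (depth <= hi)
```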
  • In step S20, using the grayscale image in each set of images to obtain the posture information of the gesture action object may include: determining, based on the gesture area in the depth map and the grayscale image, the posture information for the gesture action object corresponding to each set of images.
  • the posture information for the gesture action object corresponding to each set of images includes finger status information and position information.
  • Determining the posture information for the gesture action object corresponding to each set of images may include: applying the gesture area in the depth map to the grayscale image to obtain the gesture analysis area in the grayscale image; binarizing the gesture analysis area to obtain the gesture connected domain; performing convex hull detection on the gesture connected domain to obtain the finger status information; and determining the position information based on the depth map.
  • Applying the gesture area in the depth map to the grayscale image means selecting the area at the same position in the grayscale image as the gesture analysis area.
  • convex hull detection can be implemented using any feasible convex hull detection method, and this disclosure does not impose specific limitations on this.
  • the finger status information includes whether there are extended fingers in the gesture action object and the number of extended fingers.
  • FIG. 5A is a schematic diagram of convex hull detection provided by at least one embodiment of the present disclosure.
  • the finger state information obtained by the convex hull detection includes that the fingers are in an extended state, and the number of extended fingers is 5.
  • the finger state information obtained by the convex hull detection includes no fingers in the extended state and the number of extended fingers is 0.
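  • The following OpenCV sketch illustrates the binarization, connected-domain extraction, and convex hull detection steps described above. The Otsu threshold and the defect-angle test are common heuristics chosen for illustration, not values taken from the patent.

```python
import cv2
import numpy as np

def count_extended_fingers(gray, gesture_mask):
    """Binarize the gesture analysis area, take the gesture connected domain,
    and estimate the number of extended fingers from convexity defects of its convex hull."""
    roi = np.where(gesture_mask, gray, 0).astype(np.uint8)
    _, binary = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)              # gesture connected domain
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0                                           # no deep defects found
    gaps = 0
    for start, end, far, _ in defects[:, 0]:
        a, b, c = hand[start][0], hand[end][0], hand[far][0]
        v1, v2 = a - c, b - c                              # vectors from the defect point
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
        if np.arccos(np.clip(cos, -1.0, 1.0)) < np.pi / 2: # narrow valley => between two fingers
            gaps += 1
    # Gaps count the valleys between fingers; a single extended finger produces no deep
    # valley, so a real implementation would add a further check for that case.
    return gaps + 1 if gaps > 0 else 0
```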
  • the position information includes the coordinate position of the gesture action object in the gesture interaction space.
  • Each pixel in the depth map can represent the three-dimensional coordinates of a point in the gesture interaction space. The three-dimensional coordinates include the abscissa, the ordinate, and the depth coordinate, where the depth coordinate represents the distance between the object corresponding to the pixel and the shooting device; since the shooting device is usually arranged on the plane where the display unit is located, the depth coordinate also represents the distance between the object corresponding to the pixel and the display unit.
  • the three-dimensional coordinate position of the gesture action object in the gesture interaction space can be obtained based on the depth map.
  • the coordinate position of the gesture action object in the gesture interaction space includes the coordinate position of each pixel in the gesture analysis area.
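  • Reading a coordinate position from the depth map amounts to back-projecting a pixel through the camera model; a minimal pinhole-camera sketch is given below, where the intrinsics fx, fy, cx, cy are assumed calibration values not given in the patent.

```python
import numpy as np

def pixel_to_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with its depth value to a 3D point in camera coordinates."""
    z = float(depth)
    x = (u - cx) * z / fx        # abscissa
    y = (v - cy) * z / fy        # ordinate
    return np.array([x, y, z])   # z is the depth coordinate (distance from the shooting device)
```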
  • Figure 5B is a schematic diagram of the gesture information extraction process provided by an embodiment of the present disclosure.
  • the depth images are traversed, and the depth data in the depth images are counted to establish a depth histogram.
  • the depth histogram is as shown in Figure 5B.
  • Two adaptive thresholds are selected in the depth histogram, all pixels within the range of the two adaptive thresholds are determined as the gesture area in the depth map, and the gesture area in the depth map is applied to the grayscale image to obtain the gesture analysis area in the grayscale image, as shown in Figure 5B.
  • the gesture analysis area is binarized to obtain the gesture connected domain.
  • The gesture connected domain is shown as the gray area marked "gesture connected domain" in Figure 5B.
  • Convex hull detection is performed on the gesture connected domain to obtain finger status information.
  • The finger status information indicates that one and only one finger is in the extended state.
  • the position information of the gesture action object can also be determined.
  • the specific process is as mentioned above and will not be described again here.
  • the dynamic gesture changes of the gesture action object can be determined based on multiple sets of images corresponding to different shooting times obtained by continuously shooting the gesture action object by the shooting device.
  • identifying the dynamic gesture changes of the gesture action object may include: determining the dynamic gesture changes of the gesture action object based on the posture information corresponding to the gesture action object in multiple sets of images.
  • Determining the dynamic gesture changes of the gesture action object based on the posture information for the gesture action object corresponding to the multiple sets of images may include: determining, based on the finger state information and position information respectively corresponding to the multiple sets of images, the finger extension state changes and position changes of the gesture action object within the recognition period composed of the different shooting moments; and determining the dynamic gesture changes of the gesture action object based on the finger extension state changes and position changes.
  • When performing gesture interaction, common interaction scenarios include clicking an icon, long-pressing an icon, sliding to switch scenes, and grabbing a model to move, rotate, or scale it. From these, five natural dynamic gestures commonly used by humans can be extracted, namely the click gesture, long press gesture, sliding gesture, grab gesture, and release gesture.
  • The finger extension state changes and position changes of the gesture action object are determined within the recognition period composed of the different shooting moments (that is, the period during which the dynamic gesture object is photographed).
  • the dynamic gesture changes of the gesture action object include gesture actions, and the gesture actions include at least any one of a click gesture, a long press gesture, a slide gesture, a grab gesture, and a release gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the depth coordinate of the target recognition point in the gesture action object first decreases and then increases during at least part of that period, determining that the gesture action is a click gesture.
  • the target recognition point may be the fingertip of the target finger.
  • When one finger is in the extended state for at least part of the recognition period, that finger is the target finger; when multiple fingers are in the extended state for at least part of the recognition period, the index finger or the middle finger is preferred as the target finger.
  • In this case, it is determined that the gesture action object has performed a click gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change may include: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the depth coordinate of the target recognition point first decreases and then remains unchanged during at least part of that period, with the duration of the hold exceeding the first threshold, determining that the gesture action is a long press gesture.
  • the target recognition point may be the fingertip of the target finger.
  • When one finger is in the extended state for at least part of the recognition period, that finger is the target finger; when multiple fingers are in the extended state for at least part of the recognition period, the index finger or the middle finger is preferred as the target finger.
  • When the depth coordinate of the fingertip of the target finger first decreases and then remains unchanged, and the duration of the hold exceeds the first threshold, it is determined that the gesture action object has performed a long press gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that at least one finger of the gesture action object is in the extended state for at least part of the recognition period, and the position change indicating that the distance by which the target recognition point slides along the preset direction exceeds the second threshold during at least part of that period, determining that the gesture action is a sliding gesture, wherein the sliding distance is calculated from the position information of the target recognition point of the gesture action object in the multiple sets of images.
  • the target recognition point may be the fingertip of the target finger.
  • When one finger is in the extended state for at least part of the recognition period, that finger is the target finger; when multiple fingers are in the extended state for at least part of the recognition period, the index finger or the middle finger is preferred as the target finger.
  • the preset direction may be the direction indicated by the prompt information displayed in the display unit.
  • the preset direction may be a horizontal direction, a vertical direction, or a direction at a certain angle with the horizontal direction. This disclosure does not impose specific limitations on this.
  • the second threshold may be a distance value in the gesture interaction space.
  • the coordinate system where the control in the display unit is located has a preset mapping relationship with the coordinate system of the gesture interaction space.
  • the second threshold can be the distance value in the coordinate system where the control is located.
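  • Because the gesture interaction space and the coordinate system of the displayed controls are related by a preset mapping, a threshold expressed in one system can be converted to the other. A simple linear mapping is sketched below; the ranges (the 70cm*40cm*35cm space of Figure 4 and an assumed display coordinate range) are illustrative, not prescribed by the patent.

```python
import numpy as np

GESTURE_MIN = np.array([0.0, 0.0, 0.0])          # gesture interaction space, in cm
GESTURE_MAX = np.array([70.0, 40.0, 35.0])       # width x height x depth from Figure 4
CONTROL_MIN = np.array([0.0, 0.0, 0.0])          # assumed control coordinate range
CONTROL_MAX = np.array([1920.0, 1080.0, 100.0])

def map_gesture_to_control(p_gesture):
    """Linearly map a point from the gesture interaction space to the control coordinate system."""
    t = (np.asarray(p_gesture, dtype=float) - GESTURE_MIN) / (GESTURE_MAX - GESTURE_MIN)
    return CONTROL_MIN + t * (CONTROL_MAX - CONTROL_MIN)
```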
  • When the distance that the fingertip of the target finger slides along the preset direction (for example, sliding left or right in the horizontal direction, or up or down in the vertical direction) exceeds the second threshold, it is determined that the gesture action object has performed a sliding gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that, within the recognition period, the gesture action object transitions from at least one finger being in the extended state to no finger being in the extended state, determining that the gesture action is a grab gesture.
  • Determining the dynamic gesture change of the gesture action object based on the finger extension state change and the position change includes: in response to the finger extension state change indicating that, within the recognition period, the gesture action object transitions from no finger being in the extended state to at least one finger being in the extended state, determining that the gesture action is a release gesture.
  • In order to reduce the possibility of incorrect recognition, during dynamic gesture recognition it is possible to detect whether the gesture action object hovers before performing the gesture action.
  • determining the dynamic gesture change of the gesture action object based on the finger extension state change and position change may also include: before determining the gesture action, determining whether the gesture action object has a static time exceeding a third threshold before the action change occurs, In response to the presence of a still time exceeding the third threshold, the gesture action continues to be determined; in response to the absence of a still time exceeding the third threshold, it is determined that no change in the gesture action has occurred.
  • If the gesture action object has a static time exceeding the third threshold before performing a click gesture, long press gesture, etc., the specific gesture action performed by the gesture action object continues to be determined in the manner described above; if the gesture action object does not have a static time exceeding the third threshold before performing such a gesture, it is determined that no gesture action change has occurred, and the specific gesture action is no longer determined, so as to reduce the possibility of misrecognition.
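  • To make the decision logic above concrete, here is a rule-based sketch (illustrative only; the thresholds are placeholders, the preset sliding direction is assumed to be the horizontal axis, and the pre-action hover check is reduced to a simple time check). It maps a sequence of per-frame finger states and target-point positions to one of the five gestures.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Frame:
    t: float                            # shooting moment, in seconds
    fingers: int                        # number of extended fingers (from convex hull detection)
    pos: Tuple[float, float, float]     # target recognition point / gesture centre (x, y, z)

def classify(frames: List[Frame],
             first_threshold: float = 0.5,    # minimum hold time for a long press, s
             second_threshold: float = 5.0,   # minimum sliding distance, same unit as pos
             third_threshold: float = 0.2,    # required static time before the action, s
             hold_tolerance: float = 0.5) -> Optional[str]:
    """Classify a recognition period into click / long_press / slide / grab / release, or None."""
    if len(frames) < 3:
        return None
    # Simplified hover check: the first frames should span at least third_threshold seconds.
    if frames[1].t - frames[0].t < third_threshold:
        return None
    fingers = [f.fingers for f in frames]
    xs = [f.pos[0] for f in frames]
    zs = [f.pos[2] for f in frames]
    if fingers[0] >= 1 and fingers[-1] == 0:
        return "grab"                        # extended fingers disappear during the period
    if fingers[0] == 0 and fingers[-1] >= 1:
        return "release"                     # extended fingers appear during the period
    if not any(n >= 1 for n in fingers):
        return None
    k = zs.index(min(zs))                    # frame where the fingertip is deepest
    if 0 < k < len(zs) - 1:
        if zs[-1] - zs[k] > hold_tolerance:
            return "click"                   # depth first decreases, then increases
        if frames[-1].t - frames[k].t > first_threshold:
            return "long_press"              # depth first decreases, then holds long enough
    if abs(xs[-1] - xs[0]) > second_threshold:
        return "slide"                       # displacement along the preset (here: x) direction
    return None
```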
  • In the gesture recognition method, when determining the dynamic gesture change of the gesture action object, it is only necessary to determine whether any fingers are extended, without determining which fingers are in the extended state, which reduces the workload of convex hull detection, does not require support from complex deep learning algorithms, improves detection speed, and reduces the demand on system computing power.
  • In other words, a simplified recognition algorithm for gesture actions is designed: a depth histogram is established to extract the gesture area in the grayscale image, the gesture connected domain is analyzed and convex hull detection is performed, and gesture actions are then recognized by combining the finger extension state changes and position changes, without using complex deep learning algorithms, so that rapid recognition of gesture actions is achieved as a whole (a sketch of this pipeline follows below).
  • the recognition time of gesture actions is less than or equal to 5ms (milliseconds).
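  • A minimal sketch of this simplified pipeline is given below, using OpenCV 4.x building blocks; the histogram margin, the defect-depth threshold and the finger-counting heuristic are assumptions for illustration rather than the disclosed implementation (a single extended finger, in particular, would need an additional fingertip check).

```python
import cv2
import numpy as np

def analyze_frame(depth_map, gray, depth_margin=80.0):
    """Depth-histogram segmentation + convex-hull finger counting.
    depth_map: float depth in mm (0 = invalid); gray: 8-bit grayscale image.
    Returns (number of extended fingers, gesture center (u, v) or None)."""
    valid = depth_map[depth_map > 0]
    if valid.size == 0:
        return 0, None
    # the gesture object is assumed to be the object closest to the camera
    hist, edges = np.histogram(valid, bins=256)
    near = edges[int(np.argmax(hist > 0))]                 # nearest populated depth bin
    gesture_area = (depth_map > 0) & (depth_map < near + depth_margin)

    # apply the depth gesture area to the grayscale image and binarize it
    masked = np.where(gesture_area, gray, 0).astype(np.uint8)
    _, binary = cv2.threshold(masked, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # keep the largest connected domain as the hand
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    if n < 2:
        return 0, None
    hand_label = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    hand = (labels == hand_label).astype(np.uint8) * 255

    # gesture center: center of the largest inscribed circle of the connected domain
    dist = cv2.distanceTransform(hand, cv2.DIST_L2, 5)
    center = tuple(np.unravel_index(int(np.argmax(dist)), dist.shape)[::-1])   # (u, v)

    # convex hull defects give a rough count of extended fingers
    contours, _ = cv2.findContours(hand, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(cnt, returnPoints=False)
    fingers = 0
    if hull is not None and len(hull) > 3:
        defects = cv2.convexityDefects(cnt, hull)
        if defects is not None:
            # deep defects correspond to valleys between extended fingers (rough heuristic)
            deep = int(np.sum(defects[:, 0, 3] / 256.0 > 20.0))
            fingers = deep + 1 if deep > 0 else 0
    return fingers, center
```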
  • the control itself also contains position information.
  • the dynamic gesture change of the gesture action object also includes the gesture position, and the gesture action and gesture position can be recognized synchronously.
  • For example, determining the dynamic gesture change of the gesture action object based on the finger extension state change and position change also includes: in response to the gesture action being a click gesture, a long press gesture, or a sliding gesture, determining that the gesture position is obtained based on the position information of the target recognition point in the gesture action object, wherein the target recognition point includes the fingertip point of the target finger; in response to the gesture action being a grab gesture or a release gesture, determining that the gesture position is obtained based on the position information of the gesture center of the gesture action object, wherein the gesture center is the center of the largest inscribed circle of the gesture connected domain.
  • For example, when the recognized gesture action is a click gesture, a long press gesture or a sliding gesture, the gesture position is determined as the position of the target recognition point; when the recognized gesture action is a grab gesture or a release gesture, the gesture position is determined as the position of the gesture center.
  • For example, the positions of multiple preset sampling points near the gesture position can also be taken into account, so that the final gesture position is located from this small area of points rather than from a single point, to improve the accuracy and precision of the gesture position.
  • determining the gesture position is based on the position information of the target recognition point in the gesture action object, which may include: obtaining multiple position information corresponding to multiple sampling points at preset positions around the target recognition point; according to the multiple position information and the position information of the target recognition point to obtain the gesture position.
  • determining the gesture position is based on the position information of the target identification point in the gesture action object, which may include: obtaining multiple position information corresponding to multiple sampling points at preset positions around the gesture center; based on the multiple position information and The position information of the gesture center is used to obtain the gesture position.
  • multiple sampling points around the gesture center or target recognition point are selected, and the position information of these sampling points and the position information of the gesture center or target recognition point are weighted to obtain a weighted result, and the weighted result is used as the final gesture position.
  • the position information includes a three-dimensional position
  • only the depth coordinates can be selected for calculation, which reduces the amount of calculation and improves the accuracy of depth position measurement.
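  • A minimal sketch of this refinement is given below, assuming hypothetical sampling offsets and fixed weights and refining only the depth coordinate, as suggested above; the disclosure does not fix a particular weighting scheme.

```python
import numpy as np

def refine_gesture_position(depth_map, point_uv,
                            offsets=((0, 0), (-2, 0), (2, 0), (0, -2), (0, 2)),
                            weights=(0.6, 0.1, 0.1, 0.1, 0.1)):
    """Weight the depth at the target recognition point (or gesture center) together
    with preset sampling points around it; only the depth coordinate is refined,
    which keeps the amount of calculation small."""
    h, w = depth_map.shape
    u, v = point_uv
    samples, used = [], []
    for (du, dv), wgt in zip(offsets, weights):
        uu = min(max(u + du, 0), w - 1)        # clamp sampling points to the image
        vv = min(max(v + dv, 0), h - 1)
        d = float(depth_map[vv, uu])
        if d > 0:                               # skip invalid (zero) depth pixels
            samples.append(d)
            used.append(wgt)
    if not samples:
        return 0.0
    used = np.asarray(used)
    return float(np.dot(np.asarray(samples), used / used.sum()))
```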
  • For example, step S20 may include: based on multiple pairs of depth maps and grayscale images that are obtained by the same shooting device, belong to the multiple sets of images, and correspond to different shooting moments, determining the intermediate gesture changes of the gesture action object corresponding to that shooting device; and performing weighting and filtering processing on the multiple intermediate gesture changes corresponding to the multiple shooting devices to obtain the dynamic gesture changes of the gesture action object.
  • multiple shooting devices are used to simultaneously locate gesture actions and gesture positions.
  • Each shooting device uses the multiple pairs of depth maps and grayscale images obtained by that shooting device and corresponding to different shooting moments to obtain the intermediate gesture changes of the gesture action object corresponding to that shooting device.
  • the specific recognition process of gesture changes is as mentioned above and will not be described again here.
  • the recognition results of multiple shooting devices are combined to perform weighting correction, band-pass filtering correction and other processes to obtain the final recognition result.
  • the results of multiple shooting devices can be used to locate the gesture position, and the gesture action recognized by one of the shooting devices is selected as the recognition result of the gesture action, so as to improve the accuracy of gesture position positioning and reduce the amount of calculation.
  • This method can ensure the anti-occlusion capability of gesture recognition to a certain extent, improve the accuracy of the final recognition results, and improve the robustness of dynamic gesture recognition, achieving high-precision measurement (a fusion sketch follows below).
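  • The weighting and filtering of the per-device intermediate results might look like the following sketch; the per-device weights and the simple exponential filter (standing in for the band-pass filtering mentioned above) are assumptions, and the gesture action is taken from a single preferred device, as the text suggests.

```python
import numpy as np

class MultiDeviceFusion:
    def __init__(self, weights, alpha=0.5):
        self.weights = np.asarray(weights, dtype=float)  # one weight per shooting device
        self.alpha = alpha                               # smoothing factor of the filter
        self.filtered = None

    def fuse(self, positions, actions, preferred_device=0):
        """positions: one (x, y, z) intermediate gesture position per device
        (None if the hand is occluded for that device); actions: per-device gesture labels."""
        pts, wts = [], []
        for p, w in zip(positions, self.weights):
            if p is not None:
                pts.append(p)
                wts.append(w)
        if pts:
            wts = np.asarray(wts) / np.sum(wts)
            merged = np.average(np.asarray(pts, dtype=float), axis=0, weights=wts)
            # simple exponential filter standing in for the weighting/filtering correction step
            self.filtered = merged if self.filtered is None else \
                self.alpha * merged + (1 - self.alpha) * self.filtered
        action = actions[preferred_device]   # one device's action is used as the action result
        return self.filtered, action
```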
  • For example, the gesture connected domains obtained by the multiple shooting devices can also be combined to obtain a final gesture connected domain, and recognition can be performed on this combined gesture connected domain; the recognition process is the same as the process described above and will not be repeated here. This method can likewise improve the accuracy and robustness of the gesture recognition results.
  • the accurate positioning of the gesture position and the accurate recognition of the gesture action can be achieved as a whole.
  • the positioning error of the gesture position is less than 2mm, and the recognition accuracy of the gesture action is greater than 95%.
  • gesture position detection and action recognition can be realized simultaneously.
  • The processing time is within 5ms, which is only about one quarter of the time required by gesture recognition methods commonly combined with deep learning (20ms and above), and requires little system computing power. Moreover, the gesture recognition method provided by at least one embodiment of the present disclosure can be implemented with image processing operations that require little computing power, which is conducive to hardware implementation of the algorithm; in addition, the multiple dynamic gestures defined are more in line with people's natural interaction habits, reducing the amount of calculation while ensuring real-time gesture interaction and improving user experience.
  • For depth cameras such as binocular cameras, structured light cameras, and TOF cameras, the processing cost is high, the frame rate is low, and the delay makes real-time interaction difficult to achieve, which affects the user experience.
  • a structural arrangement scheme based on a depth camera or a combination of a depth camera and a high-speed grayscale camera is used.
  • a customized parallel processing process is implemented to achieve rapid detection of dynamic gesture changes through a combination of fast and slow speeds, trajectory prediction, and other methods.
  • For example, using at least one shooting device to continuously shoot the gesture action object to obtain multiple sets of images corresponding to different shooting moments includes: using each shooting device to continuously shoot the gesture action object to obtain multiple pairs of depth maps and grayscale images output by that shooting device and corresponding to the different shooting moments.
  • each photographing device includes a first acquisition unit, for example, the first acquisition unit is a TOF camera.
  • FIG. 6 shows the relationship between the depth image and the grayscale image of a TOF camera.
  • each first frame represents the time of a DCS phase, that is, DCS0, DCS1, DCS2, DCS3, etc. in Figure 6.
  • DCS0, DCS1, DCS2, and DCS3 respectively represent different phases, for example, DCS0 corresponds to phase 0°, DCS1 corresponds to phase 90°, DCS2 corresponds to phase 180°, and DCS3 corresponds to phase 270°.
  • phase means the phase difference between the signal sent by the first acquisition unit and the signal received.
  • For example, DCS0 means that the acquired grayscale image corresponds to a phase difference of 0° between the signal sent by the first acquisition unit and the signal received.
  • N can also take other values, such as 2 or 8.
  • For example, grayscale images corresponding to 4 different phases are obtained in four consecutive first frames (DCS0 to DCS3), denoted grayscale images Gray0, Gray1, Gray2 and Gray3; a depth map Dep0 is calculated based on Gray0, Gray1, Gray2 and Gray3, and the depth map Dep0 and Gray3 can be output from the first acquisition unit synchronously.
  • Similarly, grayscale images Gray4, Gray5, Gray6 and Gray7 corresponding to 4 different phases are obtained in the next four first frames; a depth map Dep1 is calculated based on Gray4, Gray5, Gray6 and Gray7, and the depth map Dep1 and Gray7 can be output from the first acquisition unit synchronously.
  • the calculation process of the depth map Dep2 is the same as the aforementioned process and will not be described again here.
  • the first acquisition unit outputs a pair of depth images and grayscale images every 100ms.
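  • The disclosure does not state how the depth map is computed from the four phase images; for orientation, a commonly used four-phase TOF demodulation is sketched below, with the modulation frequency f_mod as an assumed parameter and sign conventions that may differ from the actual sensor.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def depth_from_four_phases(dcs0, dcs1, dcs2, dcs3, f_mod=20e6):
    """Standard 4-phase demodulation: DCS0..DCS3 are the correlation (grayscale)
    images at 0, 90, 180 and 270 degrees; f_mod is the assumed modulation frequency."""
    i = dcs0.astype(np.float64) - dcs2.astype(np.float64)
    q = dcs3.astype(np.float64) - dcs1.astype(np.float64)
    phase = np.mod(np.arctan2(q, i), 2 * np.pi)      # phase difference in [0, 2*pi)
    depth = (C * phase) / (4 * np.pi * f_mod)        # unambiguous range = C / (2 * f_mod)
    amplitude = 0.5 * np.hypot(i, q)                 # can serve as the grayscale output
    return depth, amplitude
```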
  • the grayscale image and the depth image have the same output frame rate and delay. But at this time, it is impossible to achieve the delay requirement of 10ms or lower. Therefore, in order to improve the frame rate of the grayscale image and depth image output by the shooting device, some embodiments of the present disclosure provide the following various implementation solutions.
  • For example, using each shooting device to continuously shoot the gesture action object to obtain multiple pairs of depth maps and grayscale images output by the shooting device corresponding to different shooting moments may include: using the shooting device to output a pair of corresponding depth map and grayscale image in each first frame.
  • For example, the output depth map is predicted by smooth trajectory fitting based on the N grayscale images and the one depth map.
  • That is to say, the shooting device can output a pair of corresponding depth map and grayscale image in each first frame, where the output depth map is obtained by smooth trajectory fitting prediction based on the N grayscale images and the one depth map, that is, the depth map that was calculated from the N grayscale images.
  • FIG. 7A is a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by an embodiment of the present disclosure.
  • every four grayscale images corresponding to different phases correspond to a depth map, that is, the first acquisition unit aligns with the depth map every four first frames to achieve position calibration of depth information and grayscale information.
  • For example, any feasible prediction algorithm can be applied to perform smooth trajectory fitting prediction on the depth map Dep0 using the grayscale images Gray0, Gray1, Gray2 and Gray3, obtaining the depth map Dep0_1 corresponding to the grayscale image Gray0, the depth map Dep0_2 corresponding to the grayscale image Gray1, and the depth map Dep0_3 corresponding to the grayscale image Gray2, so as to improve the frame rate of depth information.
  • Similarly, any feasible prediction algorithm can be applied to perform smooth trajectory fitting prediction on the depth map Dep1 using the grayscale images Gray4, Gray5, Gray6 and Gray7, obtaining the depth map Dep1_1 corresponding to the grayscale image Gray4, the depth map Dep1_2 corresponding to the grayscale image Gray5, and the depth map Dep1_3 corresponding to the grayscale image Gray6, so as to improve the frame rate of depth information.
  • the prediction algorithm may include an interpolation algorithm, a Kalman filter algorithm, etc., and this disclosure does not impose specific limitations on this.
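  • One possible (non-authoritative) realization of the smooth trajectory fitting is sketched below: the motion of the gesture region between two real depth maps is extrapolated with a constant-velocity model and applied to the last computed depth map; a Kalman filter or spline interpolation could be substituted.

```python
import numpy as np

class DepthPredictor:
    """Predict intermediate depth maps between two real TOF depth maps by
    extrapolating the motion of the gesture region (constant-velocity model).
    Call update() with each real depth map before calling predict()."""

    def __init__(self, frames_between_real_maps=4):
        self.prev_depth = None       # last real depth map
        self.prev_mean = None        # mean gesture depth of the last real map
        self.velocity = 0.0          # change of the mean gesture depth per first frame
        self.step = frames_between_real_maps

    def update(self, depth_map, gesture_mask):
        mean = float(np.mean(depth_map[gesture_mask]))
        if self.prev_mean is not None:
            self.velocity = (mean - self.prev_mean) / self.step
        self.prev_depth, self.prev_mean = depth_map, mean

    def predict(self, steps_ahead, gesture_mask):
        """Return a predicted depth map 'steps_ahead' first frames after the last real one."""
        predicted = self.prev_depth.copy()
        predicted[gesture_mask] += self.velocity * steps_ahead
        return predicted
```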
  • the first acquisition unit can output a pair of corresponding grayscale images and depth images in each first frame.
  • the depth map corresponding to some grayscale images that currently do not have a corresponding depth map can also be calculated.
  • For example, the depth map corresponding to the grayscale image Gray1 and the depth map corresponding to the grayscale image Gray5 are calculated, so that a pair of corresponding depth map and grayscale image is output every two first frames.
  • the present disclosure is not limited to outputting a pair of corresponding depth images and grayscale images in each first frame, as long as the frame rate of the depth image can be increased.
  • In this way, the frame rate at which the shooting device outputs the depth map and the grayscale image can reach the frame rate at which the first acquisition unit acquires grayscale images, the delay can reach the frame length of the first frame, and no additional grayscale camera is required, which saves hardware cost.
  • the accuracy of the predicted depth map is limited by the accuracy of the prediction algorithm.
  • If N is large, the prediction accuracy of the depth map may be reduced. Therefore, the gesture recognition method provided by at least one embodiment of the present disclosure also provides another way of obtaining the depth map and the grayscale image.
  • In this way, a depth map is generated using the grayscale images of every four first frames. Since the four phases change periodically, for example 0°, 90°, 180°, 270°, 0°, 90°, ..., any four grayscale images of different phases can be combined to obtain a new depth map.
  • For example, using each shooting device to continuously shoot the gesture action object to obtain multiple pairs of depth maps and grayscale images output by the shooting device corresponding to different shooting moments may include: using the shooting device to output a pair of corresponding depth map and grayscale image at most every N-1 first frames.
  • For example, the output depth map is calculated from the grayscale images of the N-1 first frames adjacent to the output grayscale image.
  • For example, the output grayscale image and the adjacent N-1 grayscale images of the first frames correspond to the N different phases.
  • That is to say, the shooting device can output a pair of corresponding depth map and grayscale image at most every N-1 first frames, where the output depth map is calculated from the grayscale images of the N-1 first frames adjacent to the output grayscale image, and the output grayscale image together with those adjacent N-1 grayscale images corresponds to the N different phases.
  • FIG. 7B shows a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by another embodiment of the present disclosure.
  • For example, the depth map Dep0 is calculated based on the grayscale images Gray0, Gray1, Gray2 and Gray3 corresponding to four different phases; the depth map Dep1_1 is calculated based on the grayscale images Gray1, Gray2, Gray3 and Gray4 corresponding to four different phases; the depth map Dep1_2 is calculated based on the grayscale images Gray2, Gray3, Gray4 and Gray5 corresponding to four different phases; ..., and so on.
  • each first frame outputs a pair of depth map and grayscale map.
  • For example, the depth map Dep1_1 is calculated based on the three grayscale images adjacent to the grayscale image Gray4, and these three grayscale images are obtained before the grayscale image Gray4.
  • For example, the three grayscale images are the grayscale image Gray1, the grayscale image Gray2, and the grayscale image Gray3.
  • The grayscale image Gray1, the grayscale image Gray2, the grayscale image Gray3, and the grayscale image Gray4 correspond to 4 different phases respectively, so that a depth map is calculated using grayscale images corresponding to four different phases.
  • Figure 7B shows a schematic.
  • When N takes other values, the depth map corresponding to each grayscale image can be calculated in a similar manner.
  • Of course, the present disclosure is not limited to outputting a pair of corresponding depth map and grayscale image in each first frame; it is sufficient that the frame rate of the depth map can be increased, that is, that a pair of corresponding depth map and grayscale image is output at most every N-1 first frames.
  • the frame rate of the depth map and grayscale image output by the shooting device can reach the acquisition frame rate of the grayscale image in the first acquisition unit, and the delay can reach the frame length of the first frame, without the need to add additional grayscale images. degree camera, which can save hardware costs. Moreover, the accuracy of the depth map obtained in this way is not affected by the prediction algorithm, and good image accuracy of the depth map can be maintained even if N is large.
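  • A sketch of this sliding-window scheme is shown below: the most recent N phase images are kept in a ring buffer and a depth map is produced for every incoming grayscale frame once all N different phases are present; the depth computation itself is delegated to a function such as the four-phase demodulation sketched earlier. The names and structure are illustrative assumptions.

```python
from collections import deque

class SlidingWindowDepth:
    """Produce one depth map per first frame from the most recent N phase images."""

    def __init__(self, n_phases, depth_fn):
        self.n = n_phases
        self.depth_fn = depth_fn               # e.g. depth_from_four_phases for N = 4
        self.window = deque(maxlen=n_phases)

    def push(self, phase_index, gray):
        """Feed the grayscale image of the current first frame; returns (depth_fn result, gray)
        as soon as the window covers all N different phases, else (None, gray)."""
        self.window.append((phase_index, gray))
        phases = {p for p, _ in self.window}
        if len(self.window) == self.n and len(phases) == self.n:
            # order the buffered images by phase (0, 90, 180, 270, ...) before demodulation
            ordered = [g for _, g in sorted(self.window, key=lambda item: item[0])]
            return self.depth_fn(*ordered), gray
        return None, gray
```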
  • However, in this case the grayscale image and the depth map need to share the same transmission bandwidth, so if the frame rate is relatively high the transmission delay may be high, and the frame rate is limited by the grayscale image acquisition frame rate of the first acquisition unit.
  • the shooting device may further include a second acquisition unit.
  • the second acquisition unit includes a high-speed grayscale camera, and ultra-high-speed depth information prediction and calibration are achieved through the high frame rate of the high-speed grayscale camera.
  • using each shooting device to continuously shoot gesture action objects, and obtaining multiple pairs of depth images and grayscale images output by the shooting device corresponding to different shooting moments may include: using the shooting device to output at most every M-1 second frames A pair of corresponding depth maps and grayscale images, wherein the output depth map includes a reference depth map, or a depth map predicted using smooth trajectory fitting based on the reference depth map and at least one grayscale image corresponding to the reference depth map, wherein, the reference depth map includes a depth map output by the first acquisition unit in the current second frame or before the current second frame, and the current second frame is the second frame that outputs a pair of corresponding depth maps and grayscale images, at least One grayscale image includes a grayscale image output by the second acquisition unit between the second frame corresponding to the reference depth map and the current second frame.
  • the shooting device can output a pair of corresponding depth images and grayscale images at most every M-1 second frames.
  • the output depth map may be a reference depth map, or a depth map predicted by smooth trajectory fitting based on the reference depth map and at least one grayscale image corresponding to the reference depth map.
  • the reference depth map includes a depth map output by the first acquisition unit in the current second frame or before the current second frame, and the current second frame is the second frame that outputs a pair of corresponding depth maps and grayscale images, at least One grayscale image includes a grayscale image output by the second acquisition unit between the second frame corresponding to the reference depth map and the current second frame.
  • For example, grayscale images can also be used in combination with the reference depth map for prediction.
  • For example, the grayscale images output in the M second frames before the current second frame can be combined with the reference depth map and processed using a prediction algorithm or an interpolation algorithm to obtain the depth map corresponding to the current second frame.
  • In this way, the second acquisition unit can be aligned once with the depth map output by the first acquisition unit every M second frames, so as to calibrate the positions of the depth information and the grayscale information.
  • In the other M-1 second frames, the prediction algorithm is used to perform smooth trajectory fitting prediction on the depth map using the high-frame-rate grayscale images obtained by the second acquisition unit; that is to say, in those M-1 second frames, even if there is no depth map output by the first acquisition unit, the depth map can still be predicted from the grayscale images output by the second acquisition unit, thereby realizing the acquisition and calibration of high-speed depth information.
  • For example, a pair of corresponding grayscale image and depth map can be output in every second frame, or a pair of corresponding grayscale image and depth map can be output every two second frames; this disclosure does not impose specific restrictions on this, as long as the output frame rate of the depth map can be increased, that is, a pair of corresponding depth map and grayscale image is output at most every M-1 second frames.
  • FIG. 7C is a schematic diagram of the corresponding relationship between the depth image and the grayscale image provided by yet another embodiment of the present disclosure.
  • DCS0, DCS1, DCS2, DCS3 respectively represent the first frame corresponding to different phases, f0_M, f1_1, f1_2,..., f1_M, f2_1,..., f2_M, f3_1, f3_2... represent the second frame respectively.
  • the first acquisition unit is configured to output a depth map every M second frames, where M is a positive integer and greater than 1.
  • the first acquisition unit outputs the depth map Dep_i in the second frame f1_1, outputs the depth map Dep_i+1 in the second frame f2_1, and outputs the depth map Dep_i+2 in the second frame f3_1.
  • That is, the depth map Dep_i, the depth map Dep_i+1, and the depth map Dep_i+2 are output at intervals of M second frames.
  • For example, the second acquisition unit is configured to output a grayscale image in each second frame, for example, outputting the grayscale image 1_1 in the second frame f1_1, the grayscale image 1_2 in the second frame f1_2, the grayscale image 1_3 in the second frame f1_3, and so on.
  • the frame length of the second frame is shorter than the frame length of the first frame.
  • the frame length of the first frame is 25ms and the frame length of the second frame is 8.3ms.
  • the following takes the process of outputting the depth map from the second frame f1_1 to the second frame f1_M as an example to illustrate the acquisition process of the depth map.
  • For example, the reference depth map is the depth map Dep_i output by the first acquisition unit, and the reference depth map Dep_i is calculated from the four grayscale images of different phases acquired in the four first frames.
  • For example, the depth map Dep_i is aligned with the grayscale image 1_1 output by the second acquisition unit and is output from the shooting device synchronously with it in the second frame f1_1.
  • For example, in the second frame f1_3, the reference depth map Dep_i and the grayscale image 1_1, grayscale image 1_2, and grayscale image 1_3 are combined with the prediction algorithm to predict the depth map D3 of the second frame f1_3, and the depth map D3 is output from the shooting device synchronously with the grayscale image 1_3.
  • the frame rate of a high-speed grayscale camera can reach up to several hundred hertz
  • the combination of a high-speed grayscale camera and a depth camera can achieve millisecond-level low latency, significantly reducing the delay required to acquire images.
  • the frame rate of image acquisition is increased, gesture movements and gesture position determinations are quickly realized, and the speed of interactive experience in 3D space is improved.
  • the accuracy of the predicted depth map is limited by the accuracy of the prediction algorithm.
  • If M is large, the prediction accuracy of the depth map may be reduced. Therefore, the grayscale images of different phases can first be used, as described above, to obtain a depth map for each first frame, and the depth map obtained in each first frame can then be used for prediction, so as to reduce the interval between images during prediction and improve the accuracy of the predicted depth map.
  • For example, the first acquisition unit is further configured to obtain a pair of corresponding depth map and grayscale image in each first frame, where the obtained depth map is calculated from the grayscale images of the N-1 first frames adjacent to the obtained grayscale image.
  • For example, the obtained grayscale image and the adjacent N-1 grayscale images of the first frames correspond to N different phases.
  • For example, the frame length of the first frame is greater than the frame length of the second frame, and N is a positive integer greater than 1.
  • the first acquisition unit can obtain a pair of corresponding depth images and grayscale images in each first frame.
  • For example, the obtained depth map is calculated from the grayscale images of the N-1 first frames adjacent to the obtained grayscale image, and the obtained grayscale image and the adjacent N-1 grayscale images of the first frames correspond to N different phases.
  • the second acquisition unit can output a pair of corresponding depth images and grayscale images at most every M'-1 second frames.
  • For example, the output depth map includes a reference depth map, or a depth map predicted by smooth trajectory fitting based on the reference depth map and at least one grayscale image corresponding to the reference depth map.
  • That is, for any first frame, if that first frame does not have a corresponding depth map, the grayscale images output in the N-1 first frames adjacent to and before that first frame are used to calculate the depth map corresponding to that first frame; the specific process will not be repeated here.
  • a depth map is obtained from the first acquisition unit in each first frame, and the frame rate of the depth map output by the first acquisition unit is the same as the frame rate of the grayscale image obtained by the first acquisition unit.
  • In this way, the second acquisition unit can be aligned once with the depth map output by the first acquisition unit every M' second frames, so as to calibrate the positions of the depth information and the grayscale information.
  • In the other M'-1 second frames, the prediction algorithm is used to perform smooth trajectory fitting prediction on the depth map output by the first acquisition unit using the high-frame-rate grayscale images obtained by the second acquisition unit; that is to say, in those second frames the depth map can be predicted from the grayscale images output by the second acquisition unit, thereby achieving high-speed acquisition and calibration of depth information.
  • For example, M' is a positive integer and M' = M/N, where M represents the ratio of the frame rate of the grayscale images output by the second acquisition unit to the frame rate of the depth maps output by the first acquisition unit when the first acquisition unit outputs one depth map every N first frames.
  • the frame rate of a high-speed grayscale camera can reach up to several hundred hertz
  • the combination of a high-speed grayscale camera and a depth camera can achieve millisecond-level low latency, significantly reducing the delay required to acquire images.
  • the frame rate of image acquisition is increased, gesture actions and position determination are quickly realized, and the speed of interactive experience in 3D space is improved.
  • the interval between the reference depth maps used for prediction is shortened at this time, for example, in the embodiment described in FIG. 7C , the interval between the reference depth maps is 4 first frames, and in this embodiment, The interval between the reference depth maps is shortened to 1 first frame, so the accuracy of prediction is improved, the interpolation multiple is reduced, and the accuracy of the predicted depth map is improved.
  • In this way, the frame rate of the grayscale images and depth maps output by the shooting device is greatly improved, which reduces the delay in the image acquisition process and meets the low-latency interaction requirements. For example, the specific indicators obtained during testing include: a delay within 20ms, a point reporting rate of 120Hz, a positioning accuracy within 2mm, and a recognition rate of over 95%. Therefore, the gesture recognition method provided by at least one embodiment of the present disclosure can quickly and accurately locate gesture positions and recognize gesture actions, ensuring the smoothness and accuracy of the interaction process and improving user experience.
  • Figure 8 is a schematic flowchart of an interactive method provided by at least one embodiment of the present disclosure.
  • the interaction method provided by at least one embodiment of the present disclosure includes at least steps S30-S50.
  • In step S30, controls are displayed.
  • controls can be displayed in a display unit, and the display unit includes any display device with display effects, such as a three-dimensional display, a large-size screen, etc.
  • the display unit can display display content that interacts with the gesture action object.
  • the display content includes general and easy-to-understand controls.
  • For the display unit and the displayed controls, please refer to the foregoing description, which will not be repeated here. It should be noted that this disclosure does not impose specific restrictions on the type, shape, or performance of the display unit, nor on the number, material, color, shape, etc. of the controls.
  • In step S40, the dynamic gesture changes when the user performs the target action are recognized.
  • the shooting device is integrated near the display unit and faces the user.
  • the user performs a target action based on information indicated by the display content or based on other information.
  • For example, the shooting device continuously captures the process of the user performing the target action, and the gesture recognition method provided by any embodiment of the present disclosure is used to identify the dynamic gesture changes when the user performs the target action.
  • In step S50, the control is triggered according to the recognized dynamic gesture change and the target action.
  • dynamic gesture changes include gesture actions, and gesture actions include, for example, any one of a click gesture, a long press gesture, a slide gesture, a grab gesture, and a release gesture.
  • step S50 may include: in response to the user's gesture action being consistent with the target action, triggering the control and displaying a visual feedback effect.
  • the gesture recognition method provided by any embodiment of the present disclosure is used to identify changes in gesture actions performed by the user. For example, if the gesture action is detected as a click gesture, if the target action is also a click gesture, the control is triggered and visual feedback is displayed. Effects, such as changes in control material, color, brightness, etc., or other visual feedback effects to remind the user that the interaction is completed.
  • the visual feedback effect can also be based on different gesture actions, including switching scenes, spatial movement, rotation, scaling, etc. This disclosure does not impose specific restrictions on this.
  • dynamic gesture changes also include gesture positions, and the control itself also contains position information. This control can only be triggered when the position of the gesture interaction coincides with the position of the control.
  • step S50 may include: in response to the user's gesture action being consistent with the target action and the user's gesture position matching the control position of the control, triggering the control and displaying a visual feedback effect.
  • Here, the user's gesture position matching the control position of the control means that the gesture position, mapped according to the mapping relationship into the coordinate system where the control is located, coincides with the control position.
  • For example, there is a preset mapping relationship between the coordinate system where the control is located and the coordinate system of the gesture interaction space where the gesture position is located.
  • The gesture position is mapped into the coordinate system where the control is located according to this mapping relationship; if the two positions coincide, the control is triggered and a visual feedback effect is displayed (see the sketch below).
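  • The triggering condition of step S50 can be expressed as the small check below; the affine mapping matrix between the gesture interaction space and the control coordinate system and the hit radius are assumed placeholders, not values given by the disclosure.

```python
import numpy as np

def should_trigger(gesture_action, target_action,
                   gesture_pos_interaction_space,
                   control_pos, mapping_matrix, hit_radius=10.0):
    """Trigger the control when the recognized action matches the target action and
    the mapped gesture position coincides with the control position."""
    if gesture_action != target_action:
        return False
    # map the gesture position from the gesture-interaction-space coordinate system
    # into the coordinate system where the control is located (homogeneous affine map,
    # mapping_matrix assumed to be 3x4 or 4x4)
    p = np.append(np.asarray(gesture_pos_interaction_space, dtype=float), 1.0)
    mapped = (mapping_matrix @ p)[:3]
    return bool(np.linalg.norm(mapped - np.asarray(control_pos, dtype=float)) <= hit_radius)
```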
  • FIG. 9 is a schematic block diagram of a gesture interaction system provided by at least one embodiment of the present disclosure.
  • the gesture interaction system includes one or more shooting devices 901 , a gesture recognition unit 902 and a display unit 903 .
  • At least one shooting device 901 is configured to continuously shoot the gesture action object, so as to obtain multiple sets of images of the gesture action object that are respectively shot at different shooting moments.
  • the gesture recognition unit 902 is configured to receive multiple sets of images, execute the gesture recognition method described in any embodiment of the present disclosure, and output the recognition results of dynamic gesture changes of the gesture action object.
  • the display unit 903 is configured to receive the recognition result and display the interactive effect according to the recognition result.
  • For example, the gesture recognition unit 902 includes a digital signal processor (DSP); that is, the gesture recognition method provided by at least one embodiment of the present disclosure is executed in a multi-channel DSP, and the recognition results of the dynamic gesture changes of the gesture action object are output according to the multiple sets of images.
  • the processing of obtaining high frame rate grayscale images and depth images in a shooting device can also be implemented in hardware, such as in a digital signal processor.
  • the shooting device captures the raw signal data (Raw Data) and transmits the raw signal data to the gesture recognition unit.
  • For example, the gesture recognition unit completes the reading of the raw signal data, ISP processing, gesture positioning and recognition, timing control, and many other functions.
  • gesture collection and dynamic gesture recognition are all completed on the slave computer, and the recognition results are directly reported to the display unit through a preset interface (such as a USB interface), so that the display unit can display corresponding interactive effects based on the content.
  • the processing workload of the host computer is reduced, the processing efficiency of the system is improved, and the processing delay is reduced to ensure the real-time performance of gesture recognition; and the limited host computer resources can be used for the display of interactive effects to improve the user experience.
  • the recognition results include the gesture position, that is, the three-dimensional coordinates of the gesture action object in the gesture interaction space.
  • the recognition result may also include gesture actions.
  • the gesture actions are transmitted to the display unit according to a preset tag code. For example, a click gesture corresponds to "1", a long press gesture corresponds to "2”, and so on.
  • gesture actions can also be transmitted to the display unit according to other preset corresponding relationships, and this disclosure does not impose specific limitations on this.
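  • The reporting of a recognition result to the display unit might be packed as a small fixed-layout record, as sketched below; the tag codes beyond the click/long-press examples given above and the byte layout are assumptions for illustration only.

```python
import struct

# assumed tag codes; the text only states click -> "1" and long press -> "2"
GESTURE_TAGS = {"click": 1, "long_press": 2, "slide": 3, "grab": 4, "release": 5}

def pack_recognition_result(gesture_action, gesture_pos):
    """Pack one recognition result (tag code + 3D coordinates in the gesture
    interaction space) into bytes for the preset interface, e.g. a USB endpoint."""
    tag = GESTURE_TAGS.get(gesture_action, 0)      # 0 = no recognized gesture
    x, y, z = (float(v) for v in gesture_pos)
    return struct.pack("<Bfff", tag, x, y, z)      # 1-byte tag + three float32 coordinates

# example: report a click at (12.5, -3.0, 250.0) mm
payload = pack_recognition_result("click", (12.5, -3.0, 250.0))
```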
  • the display unit may include a host computer, and the display unit may use the host computer to develop interactive effects in combination with the display content. Since the processing speed of the lower computer is faster, this setting method can improve the overall processing efficiency of the system, maximize the use of system resources, reduce processing delays, and achieve real-time gesture interaction.
  • the gesture interaction system includes multiple shooting devices 901 configured to synchronously shoot gesture action objects from different angles to obtain corresponding pairs of depth images and grayscale images at the same shooting moment.
  • a plurality of photographing devices 901 are disposed around the display unit and facing the user, for example, at a central position of the upper edge, lower edge, left edge or right edge of the display screen.
  • each photographing device includes a first acquisition unit and a second acquisition unit, and the first acquisition unit and the second acquisition unit are configured to synchronously photograph the gesture action object.
  • the first acquisition unit is a TOF camera
  • the second acquisition unit is a high-speed grayscale camera.
  • That is, the multiple shooting devices synchronously capture the gesture action object, and the first acquisition unit and the second acquisition unit included in each shooting device also synchronously capture the gesture action object, so as to obtain multiple pairs of depth maps and grayscale images corresponding to the same shooting moment.
  • the gesture interaction system can intelligently call the corresponding shooting device according to the gesture position, control the exposure timing of multiple cameras, and collect multiple pairs of depth images and grayscale images at one shooting moment.
  • multiple photographing devices are configured to select some or all of the photographing devices to photograph the gesture action object according to the position of the gesture action object in the gesture interaction space.
  • the shooting device located near the lower right corner of the display unit can be configured not to capture the gesture interaction object, so as to reduce system resource consumption and hardware overhead.
  • Figure 10 is a schematic block diagram of a gesture recognition unit provided by at least one embodiment of the present disclosure.
  • the gesture recognition unit 902 may include: an image acquisition module 9021 and a processing module 9022.
  • these modules can be implemented by hardware (such as circuit) modules, software modules, or any combination of the two.
  • these modules can be implemented using DSP, or they can also be implemented through a central processing unit (CPU), a graphics processor (GPU), a tensor processor (TPU), or a field programmable gate array (FPGA).
  • the image acquisition module 9021 is configured to acquire multiple sets of images taken at different shooting moments for the gesture action object.
  • each set of images includes at least one pair of corresponding depth images and grayscale images.
  • the processing module 9022 is configured to, based on multiple sets of images, use the depth map in each set of images to obtain spatial information, and use the grayscale image in each set of images to obtain posture information of the gesture action object to recognize the gesture. Dynamic gesture changes for action objects.
  • the image acquisition module 9021 and the processing module 9022 may include codes and programs stored in the memory; the processor may execute the codes and programs to implement some or all of the functions of the image acquisition module 9021 and the processing module 9022 as described above.
  • the image acquisition module 9021 and the processing module 9022 may be dedicated hardware devices used to implement some or all of the functions of the image acquisition module 9021 and the processing module 9022 as described above.
  • the image acquisition module 9021 and the processing module 9022 may be one circuit board or a combination of multiple circuit boards, used to implement the functions described above.
  • the one circuit board or a combination of multiple circuit boards may include: (1) one or more processors; (2) one or more non-transitory memories connected to the processors; and (3) Firmware stored in memory that is executable by the processor.
  • the image acquisition module 9021 can be used to implement step S10 shown in Figure 3, and the processing module 9022 can be used to implement step S20 shown in Figure 3. Therefore, for a specific description of the functions that the image acquisition module 9021 and the processing module 9022 can implement, please refer to the relevant descriptions of steps S10 to S20 in the embodiment of the above gesture recognition method, and repeated descriptions will not be repeated.
  • the gesture recognition unit 902 can achieve similar technical effects to the foregoing gesture recognition method, which will not be described again here.
  • the gesture recognition unit 902 may include more or less circuits or units, and the connection relationship between the various circuits or units is not limited and may be determined according to actual needs. .
  • the specific construction method of each circuit or unit is not limited. It can be composed of analog devices according to the circuit principle, or it can be composed of digital chips, or it can be composed in other applicable ways.
  • FIG. 11 is a schematic diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • the electronic device includes a processor 101, a communication interface 102, a memory 103 and a communication bus 104.
  • the processor 101, the communication interface 102, and the memory 103 communicate with each other through the communication bus 104.
  • the processor 101, the communication interface 102, the memory 103 and other components can also communicate through network connections.
  • This disclosure does not limit the type and function of the network. It should be noted that the components of the electronic device shown in FIG. 11 are only exemplary and not restrictive. The electronic device may also have other components according to actual application requirements.
  • memory 103 is used to store computer-readable instructions on a non-transitory basis.
  • the processor 101 is used to implement the gesture recognition method according to any of the above embodiments when executing computer readable instructions.
  • For the specific implementation of each step of the gesture recognition method, please refer to the above embodiments of the gesture recognition method, which will not be repeated here.
  • The implementation of the gesture recognition method by the processor 101 executing computer-readable instructions stored in the memory 103 is the same as the implementations mentioned in the foregoing method embodiments and will not be repeated here.
  • the communication bus 104 may be a Peripheral Component Interconnect Standard (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the communication bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 102 is used to implement communication between the electronic device and other devices.
  • the processor 101 and the memory 103 can be provided on the server side (or cloud).
  • processor 101 may control other components in the electronic device to perform desired functions.
  • the processor 101 may be a central processing unit (CPU), a network processor (NP), a tensor processor (TPU) or a graphics processing unit (GPU) or other device with data processing capabilities and/or program execution capabilities; it may also be Digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the central processing unit (CPU) can be X86 or ARM architecture, etc.
  • memory 103 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like.
  • One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 101 may execute the computer-readable instructions to implement various functions of the electronic device.
  • Various applications and various data can also be stored in the storage medium.
  • the electronic device may further include an image acquisition component.
  • the image acquisition component is used to acquire images.
  • The memory 103 is also used to store the acquired images.
  • the image acquisition component may be a photographing device as described above.
  • FIG. 12 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
  • the storage medium 1000 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 1001 may be non-transitoryly stored on the storage medium 1000 .
  • the computer readable instructions 1001 are executed by a processor, one or more steps in the gesture recognition method described above may be performed.
  • the storage medium 1000 can be applied to the above-mentioned electronic device.
  • the storage medium 1000 can include a memory in the electronic device.
  • For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard drive of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), a flash memory, any combination of the above storage media, or other suitable storage media.
  • the description of the storage medium 1000 may refer to the description of the memory in the embodiment of the electronic device, and repeated descriptions will not be repeated.
  • Figure 13 shows a schematic diagram of a hardware environment provided for at least one embodiment of the present disclosure.
  • the electronic device provided by the present disclosure can be applied in an Internet system.
  • the computer system provided in FIG. 13 can be used to realize the functions of the gesture recognition device and/or the electronic device involved in the present disclosure.
  • Such computer systems may include personal computers, laptops, tablets, mobile phones, personal digital assistants, smart glasses, smart watches, smart rings, smart helmets, and any smart portable or wearable device.
  • the specific system in this example illustrates a hardware platform including a user interface using a functional block diagram.
  • Such computer equipment may be a general purpose computer equipment, or a special purpose computer equipment. Both computer devices can be used to implement the gesture recognition device and/or electronic device in this embodiment.
  • a computer system may include any component that implements the information currently described needed to implement gesture recognition.
  • a computer system can be implemented by a computer device through its hardware devices, software programs, firmware, and combinations thereof.
  • For example, the computer functions described in this embodiment for implementing the information processing required for gesture recognition can be implemented in a distributed manner by a group of similar platforms, dispersing the processing load of the computer system.
  • the computer system may include a communication port 250, which is connected to a network that implements data communication.
  • For example, the computer system may send and receive information and data through the communication port 250; that is, the communication port 250 enables the computer system to communicate wirelessly or by wire with other electronic devices to exchange data.
  • the computer system may also include a processor set 220 (ie, the processors described above) for executing program instructions.
  • the processor group 220 may be composed of at least one processor (eg, CPU).
  • the computer system may include an internal communications bus 210.
  • the computer system may include different forms of program storage units and data storage units (i.e., the memory or storage media described above), such as hard disk 270, read-only memory (ROM) 230, and random access memory (RAM) 240, which can be used to store Various data files used by the computer for processing and/or communications, and possibly program instructions executed by the processor set 220.
  • the computer system may also include an input/output component 260 for implementing input/output data flow between the computer system and other components (eg, user interface 280, etc.).
  • For example, the following devices may be connected to the input/output component 260: input devices including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices including, for example, a magnetic tape, a hard disk, etc.; and communication interfaces.
  • FIG. 13 illustrates a computer system having various devices, it should be understood that the computer system is not required to have all of the devices shown and may instead have more or fewer devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

A gesture recognition method, an interaction method, a gesture interaction system, an electronic device, and a storage medium. The gesture recognition method includes: acquiring multiple sets of images of a gesture action object captured respectively at different shooting moments, where each set of images includes at least one pair of corresponding depth map and grayscale image; and, according to the multiple sets of images, using the depth map in each set of images to obtain spatial information and using the grayscale image in each set of images to obtain posture information of the gesture action object, so as to recognize dynamic gesture changes of the gesture action object. The gesture recognition method reduces the overall processing time, obtains gesture recognition results quickly, reduces system resource usage, and ensures real-time gesture interaction.

Description

手势识别方法、交互方法、手势交互系统、电子设备、存储介质 技术领域
本公开的实施例涉及一种手势识别方法、交互方法、手势交互系统、电子设备和非瞬时性计算机可读存储介质。
背景技术
随着裸眼3D(3-Dimenson,三维)光场显示的不断发展,为了达到立体3D显示的效果,3D显示的物体会有出屏和入屏的需求。人们在观看3D显示时,为了能沉浸式地与3D显示的内容进行交互,需要提供可获得深度信息,且具有高精度、低延时、大视场的交互系统或交互方法来实现用户和3D显示内容之间的互动。
目前,基于手势识别技术实现交互功能是一大研究热点,在裸眼3D显示器、VR/AR/MR、车载、游戏娱乐、智能穿戴、工业设计等多个领域均有应用。基于手势识别技术实现交互功能的核心是通过传感器件,例如相机等采集用户的手势信息,通过相关识别与分类算法识别手势,对不同手势赋予不同的语义信息,实现不同的交互功能。
发明内容
本公开至少一实施例提供一种手势识别方法,包括:获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像,其中,每组图像包括至少一对对应的深度图和灰度图;根据所述多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别所述手势动作对象的动态手势变化。
例如,在本公开至少一实施例提供的手势识别方法中,使用每组图像中的深度图来获取空间信息,包括:根据所述深度图确定所述深度图中的手势区域,其中,所述空间信息包括所述深度图中的手势区域;使用每组图像中的灰度图来获取手势动作对象的姿态信息,包括:根据所述深度图中的手势区域和所述灰度图,确定所述每组图像对应的针对所述手势动作对象的姿态信息;识别所述手势动作对象的动态手势变化,包括:根据所述多组图像分 别对应的针对所述手势动作对象的姿态信息,确定所述手势动作对象的动态手势变化。
例如,在本公开至少一实施例提供的手势识别方法中,根据所述深度图确定所述深度图中的手势区域,包括:遍历所述深度图,统计所述深度图中的深度数据以建立深度直方图;选取所述深度图对应的自适应深度阈值,根据所述自适应深度阈值和所述深度直方图,确定所述深度图中的手势区域。
例如,在本公开至少一实施例提供的手势识别方法中,所述每组图像对应的针对所述手势动作对象的姿态信息包括手指状态信息和位置信息,根据所述深度图中的手势区域和所述灰度图,确定所述每组图像对应的针对所述手势动作对象的姿态信息,包括:将所述深度图中的手势区域作用于所述灰度图,得到所述灰度图中的手势分析区域;对所述手势分析区域进行二值化处理,得到手势连通域;对所述手势连通域进行凸包检测,得到所述手指状态信息,其中,所述手指状态信息包括是否存在手指伸出,以及伸出手指的数量;基于所述深度图,确定所述位置信息,其中,所述位置信息包括所述手势动作对象在手势交互空间中的坐标位置。
例如,在本公开至少一实施例提供的手势识别方法中,根据所述多组图像分别对应的针对所述手势动作对象的姿态信息,确定所述手势动作对象的动态手势变化,包括:根据所述多组图像分别对应的手指状态信息和位置信息,确定所述手势动作对象在所述不同拍摄时刻组成的识别时段内的手指伸出状态变化和位置变化;根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化。
例如,在本公开至少一实施例提供的手势识别方法中,所述坐标位置包括深度坐标,所述手势动作对象的动态手势变化包括手势动作,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内的深度坐标先减小后增大,确定所述手势动作为单击手势。
例如,在本公开至少一实施例提供的手势识别方法中,所述坐标位置包括深度坐标,所述手势动作对象的动态手势变化包括手势动作,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化, 包括:响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,以及所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内的深度坐标先减小后保持,且所述保持动作的时长超过第一阈值,确定所述手势动作为长按手势。
例如,在本公开至少一实施例提供的手势识别方法中,所述手势动作对象的动态手势变化包括手势动作,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,以及所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内沿预设方向滑动的距离超过第二阈值,确定所述手势动作为滑动手势,其中,所述滑动的距离基于所述多组图像中所述手势动作对象的目标识别点的位置信息计算得到。
例如,在本公开至少一实施例提供的手势识别方法中,所述手势动作对象的动态手势变化包括手势动作,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:响应于所述手指伸出状态变化指示所述手势动作对象在所述识别时段内从存在至少一个手指处于伸出状态转换为无手指处于伸出状态,确定所述手势动作为抓取手势。
例如,在本公开至少一实施例提供的手势识别方法中,所述手势动作对象的动态手势变化包括手势动作,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:响应于所述手指伸出状态变化指示所述手势动作对象在所述识别时段内从无手指处于伸出状态转换为存在至少一个手指处于伸出状态,确定所述手势动作为释放手势。
例如,在本公开至少一实施例提供的手势识别方法中,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,还包括:在确定所述手势动作前,确定所述手势动作对象是否在发生动作变化前存在超过第三阈值的静止时间,响应于存在超过第三阈值的静止时间,继续确定所述手势动作;响应于不存在超过第三阈值的静止时间,确定未发生手势动作变化。
例如,在本公开至少一实施例提供的手势识别方法中,所述手势动作对象的动态手势变化还包括手势位置,根据所述手指伸出状态变化和所述位置 变化,确定所述手势动作对象的动态手势变化,还包括:响应于手势动作为单击手势、长按手势或滑动手势,确定所述手势位置基于所述手势动作对象中的目标识别点的位置信息得到,其中,所述目标识别点包括目标手指的指尖点;响应于手势动作为抓取手势或释放手势,确定所述手势位置基于所述手势动作对象的手势中心的位置信息得到,其中,所述手势中心为所述手势连通域的最大内接圆的圆心。
例如,在本公开至少一实施例提供的手势识别方法中,确定所述手势位置基于所述手势动作对象中的目标识别点的位置信息得到,包括:获取所述目标识别点周围的预设位置处的多个采样点分别对应的多个位置信息;根据所述多个位置信息和所述目标识别点的位置信息,得到所述手势位置;以及确定所述手势位置基于所述手势动作对象的手势中心的位置信息得到,包括:获取所述手势中心周围的预设位置处的多个采样点分别对应的多个位置信息;根据所述多个位置信息和所述手势中心的位置信息,得到所述手势位置。
例如,在本公开至少一实施例提供的手势识别方法中,获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像,包括:利用至少一个拍摄装置连续拍摄所述手势动作对象,得到分别对应所述不同拍摄时刻的多组图像,其中,每个拍摄装置配置为在一个拍摄时刻同步输出一对对应的深度图和灰度图。
例如,在本公开至少一实施例提供的手势识别方法中,所述手势动作对象相对于每个图像中的其他对象最靠近所述至少一个拍摄装置。
例如,在本公开至少一实施例提供的手势识别方法中,响应于所述至少一个拍摄装置的数量是多个,每组图像包括多对对应的深度图和灰度图,所述多对深度图和灰度图由所述多个拍摄装置在同一拍摄时刻同步对所述手势动作对象进行拍摄得到,所述多对深度图和灰度图具有不同的拍摄角度。
例如,在本公开至少一实施例提供的手势识别方法中,根据所述多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别所述手势动作对象的动态手势变化,还包括:基于同一拍摄装置得到的分别属于所述多组图像的、且对应所述不同拍摄时刻的多对深度图和灰度图,确定所述同一拍摄装置对应的所述手势动作对象的中间手势变化;对所述多个拍摄装置分别对应的多个中间手势变 化进行加权和滤波处理,得到所述手势动作对象的动态手势变化。
例如,在本公开至少一实施例提供的手势识别方法中,利用至少一个拍摄装置连续拍摄所述手势动作对象,得到分别对应所述不同拍摄时刻的多组图像,包括:利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图。
例如,在本公开至少一实施例提供的手势识别方法中,每个拍摄装置包括第一获取单元,所述第一获取单元配置为在每个第一帧获取灰度图,以及每N个第一帧获取深度图,其中,所述深度图基于所述每N个连续的第一帧获取的N个灰度图生成,所述N个灰度图分别对应N个不同的相位,所述一个深度图与所述N个灰度图中的一个灰度图同步输出所述拍摄装置,其中,N为正整数且大于1;利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包括:利用所述拍摄装置在每个第一帧输出一对对应的深度图和灰度图,其中,所述输出的深度图根据所述N个灰度图和所述一个深度图,利用平滑轨迹拟合预测得到。
例如,在本公开至少一实施例提供的手势识别方法中,每个拍摄装置包括第一获取单元,所述第一获取单元配置为在每个第一帧获取灰度图,以及每N个第一帧获取深度图,其中,所述深度图基于所述每N个连续的第一帧获取的N个灰度图生成,所述N个灰度图分别对应N个不同的相位,所述一个深度图与所述N个灰度图中的一个灰度图同步输出所述拍摄装置,其中,N为正整数且大于1;利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包括:利用所述拍摄装置在至多每N-1个第一帧输出一对对应的深度图和灰度图,所述输出的深度图通过与所述输出的灰度图相邻的N-1个第一帧的灰度图计算得到,所述输出的灰度图和所述相邻的N-1个第一帧的灰度图对应所述N个不同的相位。
例如,在本公开至少一实施例提供的手势识别方法中,每个拍摄装置包括第一获取单元和第二获取单元,所述第二获取单元配置为在每个第二帧输出一个灰度图,所述第一获取单元配置为每M个第二帧输出一个深度图,M为正整数且大于1,利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包 括:利用所述拍摄装置在至多每M-1个第二帧输出一对对应的深度图和灰度图,其中,所述输出的深度图包括基准深度图,或者基于所述基准深度图和所述基准深度图对应的至少一个灰度图、利用平滑轨迹拟合预测得到的深度图,其中,所述基准深度图包括在当前第二帧或在所述当前第二帧之前由所述第一获取单元输出的深度图,所述当前第二帧为输出所述一对对应的深度图和灰度图的第二帧,所述至少一个灰度图包括在所述基准深度图对应的第二帧和所述当前第二帧之间由所述第二获取单元输出的灰度图。
例如,在本公开至少一实施例提供的手势识别方法中,所述第一获取单元还配置为,在每个第一帧得到一对对应的深度图和灰度图,所述得到的深度图通过与所述得到的灰度图相邻的N-1个第一帧的灰度图计算得到,所述得到的灰度图和所述相邻的N-1个第一帧的灰度图对应N个不同的相位,所述第一帧的帧长大于所述第二帧的帧长,N为正整数且大于1。
本公开至少一实施例提供一种交互方法,包括:显示控件;利用本公开任一实施例所述的手势识别方法识别用户执行目标动作时的动态手势变化;根据识别的所述动态手势变化和所述目标动作,触发所述控件。
例如,在本公开至少一实施例提供的交互方法中,所述动态手势变化包括手势动作,根据识别的所述动态手势变化和所述目标动作,触发所述控件,包括:响应于所述用户的手势动作与所述目标动作一致,触发所述控件并显示可视化反馈效果。
例如,在本公开至少一实施例提供的交互方法中,所述动态手势变化包括手势动作和手势位置;根据识别的所述动态手势变化和所述目标动作,触发所述控件,包括:响应于所述用户的手势动作与所述目标动作一致,所述用户的手势位置与所述控件的控件位置匹配,触发所述控件并显示可视化反馈效果,其中,所述用户的手势位置与所述控件的控件位置匹配表示,所述手势位置按映射关系映射至控件所在的坐标系中的位置与所述控件位置一致。
本公开至少一实施例提供一种手势交互系统,包括:至少一个拍摄装置,所述至少一个拍摄装置配置为连续拍摄所述手势动作对象,以获取针对所述手势动作对象的分别在不同拍摄时刻拍摄的多组图像;手势识别单元,配置为接收所述多组图像,执行本公开任一实施例所述的手势识别方法,输出所述手势动作对象的动态手势变化的识别结果;显示单元,配置为接收所述识 别结果,根据所述识别结果显示交互效果。
例如,在本公开至少一实施例提供的手势交互系统中,所述手势交互系统包括多个拍摄装置,所述多个拍摄装置配置为从不同角度同步拍摄所述手势动作对象,以在同一拍摄时刻得到对应的多对深度图和灰度图。
例如,在本公开至少一实施例提供的手势交互系统中,每个拍摄装置包括第一获取单元和第二获取单元,所述第一获取单元和所述第二获取单元配置为同步拍摄所述手势动作对象。
例如,在本公开至少一实施例提供的手势交互系统中,所述多个拍摄装置配置为根据所述手势动作对象在手势交互空间中的位置,选择部分或全部拍摄装置拍摄所述手势动作对象。
例如,在本公开至少一实施例提供的手势交互系统中,所述手势识别单元包括数字信号处理器。
本公开至少一实施例提供一种电子设备,包括:存储器,非瞬时性地存储有计算机可执行指令;处理器,配置为运行所述计算机可执行指令,其中,所述计算机可执行指令被所述处理器运行时实现根据本公开任一实施例所述的手势识别方法或者根据本公开任一实施例所述的交互方法。
本公开至少一实施例提供一种非瞬时性计算机可读存储介质,其中,所述非瞬时性计算机可读存储介质存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现根据本公开任一实施例所述的手势识别方法或者根据本公开任一实施例所述的交互方法。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例的附图作简单地介绍,显而易见地,下面描述中的附图仅仅涉及本公开的一些实施例,而非对本公开的限制。
图1示出了一种手势识别的示意性流程图;
图2示出了TOF相机的检测过程的示意图;
图3为本公开至少一实施例提供的一种手势识别方法的示意性流程图;
图4为本公开至少一实施例提供的手势交互空间的示意图;
图5A为本公开至少一实施例提供的凸包检测示意图;
图5B为本公开一实施例提供的姿态信息提取过程示意图;
图6示出了TOF相机的深度图和灰度图的关系;
图7A为本公开一实施例提供的深度图和灰度图的对应关系示意图;
图7B为本公开另一实施例提供的深度图和灰度图的对应关系示意图;
图7C为本公开再一实施例提供的深度图和灰度图的对应关系示意图；
图8为本公开至少一实施例提供的交互的方法示意性流程图;
图9为本公开至少一实施例提供的一种手势交互系统的示意性框图;
图10为本公开至少一实施例提供的一种手势识别单元的示意性框图;
图11为本公开至少一实施例提供的一种电子设备的示意图;
图12为本公开至少一实施例提供的一种非瞬时性计算机可读存储介质的示意图;
图13为本公开至少一实施例提供的一种硬件环境的示意图。
具体实施方式
为了使得本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例的附图,对本公开实施例的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。
除非另外定义,本公开使用的技术术语或者科学术语应当为本公开所属领域内具有一般技能的人士所理解的通常意义。本公开中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性,而只是用来区分不同的组成部分。“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同,而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。“上”、“下”、“左”、“右”等仅用于表示相对位置关系,当被描述对象的绝对位置改变后,则该相对位置关系也可能相应地改变。为了保持本公开实施例的以下说明清楚且简明,本公开省略了部分已知功能和已知部件的详细说明。
手势交互是指利用计算机图形学等技术识别手势动作对象的肢体语言,并转化为命令来操作设备。它是继鼠标、键盘和触屏之后新的人机交互方式。在众多的交互方式中,手势交互符合人类的交流习惯,在日常生活中使用也 最为频繁,具有无法替代的天然优势,例如:
1)手势交互的学习成本较低,不需要像传统交互方式那样记住双击和鼠标左右点击的区别;
2)手势交互可以脱离实体接触,实现远距离控制;
3)交互动作更加丰富和自然,可根据场景需求设计多种不同的手势,实现多种不同的交互效果;
4)对用户正常活动影响较少,可以随时随地进行手势操作。
基于这些优势,手势交互是非常热门的研究领域,可以适用于多种应用场景。
通常,采集手势的传感器件采用单目相机、双目相机、结构光相机、TOF(Time of flight,飞行时间)相机等。
单目相机采集的手势图像不包含深度信息,常采用深度学习的方法提取手势抽象特征,完成手势的分类任务,这种方式对系统算力要求非常高,对图像分辨率的要求也比较高(例如分辨率需要达到1920像素*1080像素),处理速度慢。
双目相机可通过视差信息计算深度,通常也采用深度学习的方法实现手势三维关节点的定位。但双目数据的视差配准本身就需要极大的系统算力,这种方式也对系统算力要求非常高。
结构光相机是采用红外光源向空间中投射图案(pattern),由于红外相机感光,通过图案的变形程度计算场景的深度信息。这种方式需要处理的数据量非常大,且通常需配有专用的处理芯片来计算深度信息,成本较高。
TOF相机是采用光飞时间技术,通过时间差或相位差信息计算场景的深度信息,所需的算力小。但是在其用来做手势识别时,同样需要采用深度学习的方法提取图像中的手势关节点,对算力的要求也较高,处理速度通常在20ms以上。
因此,现阶段的手势识别方法均是采用深度学习的方法实现,图1示出了一种手势识别的示意性流程图。
如图1所示,首先,对图像进行预处理,提取一些手势特征,手势特征包括全局特征和局部特征,手势特征形式包括颜色直方图、肤色信息、边缘信息、区域信息等,由此实现手势的检测与分割。如图1所示,图像预处理需要花费0.5ms,全局特征提取需要花费18ms,局部特征提取需要花费5ms, 手势检测和分割需要花费4.5ms。
之后,提取抽象特征,如LBP(Local Binary Patterns,局部二值模式)、Hu矩(图像矩)、SURF(Speeded-Up Robust Features,加速稳健特征)等,完成手势的分类。例如,若手势中存在关节点,还可以通过建立的手势关节点模型,提取关节点参数,完成3D手势的识别。如图1所示,抽象特征提取需要花费5.3ms,手势分类需要花费4.5ms,关节点参数提取需要花费6.5ms,3D手势识别需要花费3.3ms,加载手势关节点模型需要花费2000ms。
最后,通过设置指定的语义信息,为不同手势赋予不同的功能。如图1所示,语义信息处理需要花费0.2ms。
因此,目前的手势识别方法所需的系统算力高,且延时较大,如图1所示,这种方法识别延时通常要在20ms及以上,难以做到实时交互。
此外,如前所述,现有的手势交互系统中的图像采集通常采用双目相机、结构光相机、TOF相机等实现。在采集单元中,需要将感光元件(sensor)采集的原始信号数据(Raw Data),在计算机中利用影像处理器(Image Signal Processor,简称ISP)算法进行预处理,预处理包括黑电平补偿、镜头校正、坏点校正、噪声去除、相位校正、深度计算、数据校准等。特别地,结构光相机因为需要处理的数据量非常大,因此一般会配有专用的处理芯片来执行ISP算法。因此,一般深度相机分辨率为840像素*480像素,帧率为30fps(每秒传输帧数,frame per second)。
图2示出了TOF相机的检测过程的示意图。
如图2所示,TOF相机中使用的感光元件为硅基图像传感器,TOF相机至少包括光源、接收阵列、电路三部分,电路包括信号发射机、调制单元、解调单元、计算单元等。在拍摄时,首先光源发出一束调制的红外光,照射到目标后发生反射,经过反射后的调制方波经过镜头后最终被接收阵列接收,然后再通过解调单元和计算单元对信息进行解调和计算,计算出距离信息。
由于深度相机的帧率较低,处理信息占用较多资源,这也进一步增加了手势交互的延时,增加了资源消耗。而手势识别方法目前多采用深度学习方法,如前所述,其执行延时较高,系统资源开销也较大。因此,目前的手势交互系统整体上系统资源消耗大,延时较高,精度也较低,这也是现在手势交互技术难以推广的原因之一。
本公开至少一实施例提供一种手势识别方法、交互方法、手势交互系统、电子设备和非瞬时性计算机可读存储介质。该手势识别方法包括:获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像,其中,每组图像包括至少一对对应的深度图和灰度图;根据多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别手势动作对象的动态手势变化。
该手势识别方法利用同步采集的包含手势动作对象的深度图与灰度图，利用深度图提取空间信息，利用灰度图获取姿态信息，通过简单的图像处理，实现手势的识别与定位，由于不使用复杂的深度学习算法，整体上降低了处理时长，能够快速得到手势识别结果，减少系统资源占用，保证手势交互的实时性。例如，采用本公开至少一实施例提供的手势识别方法，能够将手势识别的处理时长由20ms及以上降低至5ms。
本公开实施例提供的手势交互方法可以应用在移动终端(例如,手机、平板电脑等)中,需要说明的是,本公开实施例提供的手势识别方法可应用于本公开实施例提供的手势交互系统,该手势交互系统可被配置于电子设备上。该电子设备可以是个人计算机、移动终端等,该移动终端可以是手机、平板电脑等具有各种操作系统的硬件设备。
下面结合附图对本公开的实施例进行详细说明,但是本公开并不限于这些具体的实施例。
图3为本公开至少一实施例提供的一种手势识别方法的示意性流程图。
如图3所示,本公开至少一实施例提供的手势识别方法包括步骤S10至步骤S20。
步骤S10,获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像。
例如,每组图像包括至少一对对应的深度图和灰度图,这里,“对应的深度图和灰度图”指深度图和灰度图对应同一拍摄时刻。
例如,手势动作对象可以包括人体的手部,例如用户手部。例如,手势动作对象也可以包括其他具有与人体手部相同形态的手部对象,例如具有人体手部形态的物品,如形态为握拳状态手部的充气球等,本公开对此不做具体限制。
例如，步骤S10可以包括：利用至少一个拍摄装置连续拍摄手势动作对象，得到分别对应所述不同拍摄时刻的多组图像。例如，每个拍摄装置配置为在一个拍摄时刻同步输出一对对应的深度图和灰度图。
例如,手势动作对象在一个或多个拍摄装置的拍摄范围内执行手势动作,该一个或多个拍摄装置在预设识别时段内同步拍摄手势动作对象,每个拍摄装置在一个拍摄时刻同步输出一对对应的深度图和灰度图,该一对对应的深度图和灰度图由该拍摄装置在同一个拍摄时刻拍摄得到。
例如,若拍摄装置的数量是1个,则在一个拍摄时刻得到一组图像,该一组图像包括一对深度图和灰度图,经过预设识别时段的拍摄后,得到多组图像,多组图像分别对应不同的拍摄时刻。
例如,若拍摄装置的数量是T个,则在一个拍摄时刻得到一组图像,该一组图像包括T对深度图和灰度图,该T对深度图和灰度图分别来自该T个拍摄装置,经过预设识别时段的拍摄后,得到多组图像,多组图像分别对应不同的拍摄时刻,并且,每组图像都包括对应同一拍摄时刻的T对深度图和灰度图。这里,T为大于1的正整数。
例如,T个拍摄装置需要同步拍摄手势动作对象。例如,T个拍摄装置同时接收触发指令,在接收到触发指令时同步拍摄手势动作对象,以得到对应同一个拍摄时刻的一组图像,该一组图像包括T对深度图和灰度图。
例如,T个拍摄装置设置有不同的拍摄角度和拍摄位置,以从不同角度拍摄手势动作对象,从而在同一拍摄时刻得到的多对深度图和灰度图具有不同的拍摄角度,这在一定程度上保证了手势的防遮挡,减少或避免由于手势被遮挡导致检测不到手势动作对象的动作以至于识别失败。
例如,在拍摄手势动作对象时,手势动作对象相对于图像中的其他对象最靠近该至少一个拍摄装置。
例如,从交互场景来看,可以将拍摄装置集成于显示单元下边缘的中央位置,相机光轴向前上方倾斜拍摄手势动作对象,所以可以认为手势动作对象是距离拍摄装置最近的物体。
图4为本公开至少一实施例提供的手势交互空间的示意图。
如图4所示,显示单元的宽度为70厘米,高度为40厘米,显示单元用于显示与手势动作对象进行交互的显示内容,例如显示内容包括通用的、易理解的控件,控件会预先设置好触发后的行为或功能,当控件触发时,会显示可视化反馈效果,如材质、颜色、亮度等发生变化,以此提醒用户交互完 成。
例如,控件可以包括三维虚拟控件,控件可以以按钮、图标、3D模型等不同方式呈现。例如,显示单元可以实现为三维显示器,例如,该三维显示器可以是裸眼三维显示器,也即用户不借助其他工具,通过双眼即可看到三维显示效果。
如图4所示,假设人体臂长为65厘米,用户在屏幕前方65cm处观看,在手势交互时需要手势动作对象(也即用户手部)距离眼睛30cm以外才不会遮挡视线,因此,手势交互空间为在显示单元和用户之间,尺寸为70cm*40cm*35cm的三维空间。
例如,手势动作对象在手势交互空间中运动,手势动作对象相对于其他对象(如用户其他身体部位)最靠近拍摄装置,拍摄装置连续拍摄手势动作对象以得到多组图像,以基于多组图像进行动态手势变化的识别。
需要说明的是,图4所示的是一种可能的手势交互空间定义方式,根据显示单元的类型、显示单元的形状、显示单元的尺寸、用户臂长不同,手势交互空间的尺寸也可能不同,本公开对此不做具体限制。
例如,在一些实施例中,拍摄装置包括第一获取单元,例如,第一获取单元包括TOF相机。为了获得较高的深度精度,一般TOF相机会采用对应N个不同相位的N个灰度图来计算获得一个深度图(例如N=1,4或8等)。
例如,第一获取单元配置为在每个第一帧获取灰度图,以及每N个第一帧获取深度图。例如,深度图基于每N个连续的第一帧获取的N个灰度图生成,所述N个灰度图分别对应N个不同的相位,该一个深度图与该N个灰度图中的一个灰度图同步输出拍摄装置。这里,N为正整数且大于1。
这里,第一帧即表示一个差分相关采样(Differential Correlation Sample,简称DCS)相位的时间,例如25ms(毫秒)。每N个第一帧,TOF相机同步输出一对灰度图和深度图,也即深度图和灰度图每N个第一帧实现一次对齐和校准。因此,此时,拍摄装置每N个第一帧输出一对灰度图和深度图,例如,在N=4时,第一帧的帧长为25ms时,每间隔100ms输出一对灰度图和深度图。
虽然第一获取单元所需的算力较小,但是延时高,帧率低,因此可以设置拍摄装置还包括第二获取单元,第二获取单元例如为高速灰度相机。
例如，第二获取单元配置为在每个第二帧输出一个灰度图，第一获取单元配置为每M个第二帧输出一个深度图，M为正整数且大于1。
例如,第二帧的帧长小于第一帧的帧长,例如,高速灰度相机每间隔8.3ms输出一个灰度图,也即第二帧的帧长为8.3ms。
例如,高速灰度相机和TOF相机每M个第二帧实现一次对齐和校准,此时,结合预测算法等,拍摄装置输出一对灰度图和深度图的时间能够大幅降低,例如拍摄装置最短可以每个第二帧输出一对深度图和灰度图,也即此时拍摄装置能够每间隔8.3ms输出一对深度图和灰度图,相对于每100ms输出一对深度图和灰度图,图像采集的延时大幅降低,帧率大幅提高,满足低延时的实时交互需求,整体实现手势动作和手势位置的快速识别,处理过程的绝对延时(包括图像采集时间和图像处理时间)能够达到小于等于15ms,报点率达到120Hz。
关于拍摄装置提高输出灰度图和深度图的帧率的具体处理过程可以参考后文所述的内容。
当然,需要说明的是,在手势识别过程中,本公开对拍摄装置不做具体限制,并不限制为本公开实施例中所述的结构,只要能够实现同步输出对应一个拍摄时刻的一对深度图和灰度图,以及能够输出对应多个拍摄时刻的多对深度图和灰度图即可。
例如,在步骤S20,根据多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别手势动作对象的动态手势变化。
例如,空间信息包括深度图中的手势区域、手势动作对象的位置信息等。
例如,动态手势变化的识别以拍摄装置为单位进行,也就是说,若提供有一个拍摄装置,也即每组图像包括一对深度图和灰度图,则基于多组图像识别动态手势变化;若提供有多个拍摄装置,也即每组图像包括多对深度图和灰度图,基于同一拍摄装置得到的分别属于多组图像的、且对应不同拍摄时刻的多对深度图和灰度图,确定该拍摄装置对应的手势动作变化,最终基于多个拍摄装置分别对应的手势动作变化,得到最终的手势动作变化识别结果。
因此,基于每个拍摄装置得到的图像进行处理的过程相同,下面以提供有一个拍摄装置为例,例如此时一组图像包括一对深度图和灰度图,具体说明获得该拍摄装置对应的手势动作变化的识别过程。
例如,不论提供有一个还是多个拍摄装置,也即每组图像包括一对或多对深度图和灰度图,每对深度图和灰度图都进行相同的处理,得到每对深度图和灰度图对应的手势动作对象的姿态信息。
例如,由于手势动作对象是距离拍摄装置最近的物体,因此可以利用深度图提取空间信息,例如提取距离拍摄装置最近的区域作为深度图中的手势区域,并将该深度图中的手势区域作用于对应的灰度图,以得到灰度图中的手势分析区域,进而根据手势分析区域得到手势动作对象的姿态信息。
下面具体说明基于一对深度图和灰度图,得到对应的手势动作对象的姿态信息的过程。
例如,在步骤S20中,使用每组图像中的深度图来获取空间信息,可以包括:根据深度图确定深度图中的手势区域,例如,空间信息包括深度图中的手势区域。
例如，根据深度图确定深度图中的手势区域，可以包括：遍历深度图，统计深度图中的深度数据以建立深度直方图；选取深度图对应的自适应深度阈值，根据自适应深度阈值和深度直方图，确定深度图中的手势区域。
在深度图中,横坐标和纵坐标对应像素位置,每个位置的像素灰度值对应的是该像素对应的物体距离拍摄装置的距离,所以深度图中的每个像素可以表示手势交互空间中一个点的三维坐标。
例如,统计深度图中的深度数据以建立深度直方图,深度直方图能体现每个深度值在图像中的占有率。根据深度图中不同区域的深度分布,计算局部阈值,所以对于深度图中不同区域,能够自适应计算不同的阈值。例如,计算得到深度图中两个自适应阈值,确定两个自适应阈值范围内的所有像素点作为深度图中的手势区域。
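例如，下面给出一个建立深度直方图并选取自适应阈值以提取深度图中手势区域的示意性代码草图（基于NumPy，其中的分箱宽度、阈值选取策略等均为说明性假设，并非对本公开实施例的限定）：

```python
import numpy as np

def extract_gesture_region(depth_map, bin_width=10, margin=60):
    """根据深度直方图自适应地确定深度图中的手势区域（示意性实现）。

    假设：手势动作对象是距离拍摄装置最近的物体，深度值 0 表示无效测量，
    bin_width、margin 为说明性假设（单位与深度图一致，例如毫米）。
    """
    valid = depth_map[depth_map > 0]                       # 去除无效深度
    hist, edges = np.histogram(
        valid, bins=np.arange(valid.min(), valid.max() + bin_width, bin_width))
    # 仅在较近的深度区间内寻找显著峰值，将其视为手势所在的深度范围
    first_peak = np.argmax(hist[:max(1, len(hist) // 4)])
    near_thr = edges[first_peak]                           # 自适应阈值1：手势前表面深度
    far_thr = near_thr + margin                            # 自适应阈值2：手势厚度范围
    mask = (depth_map >= near_thr) & (depth_map <= far_thr)
    return mask, (near_thr, far_thr)
```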
例如,在步骤S20中,使用每组图像中的灰度图来获取手势动作对象的姿态信息,可以包括:根据深度图中的手势区域和灰度图,确定每组图像对应的针对手势动作对象的姿态信息。
例如,每组图像对应的针对手势动作对象的姿态信息包括手指状态信息和位置信息。
例如，根据深度图中的手势区域和灰度图，确定每组图像对应的针对手势动作对象的姿态信息，可以包括：将深度图中的手势区域作用于灰度图，得到灰度图中的手势分析区域；对手势分析区域进行二值化处理，得到手势连通域；对手势连通域进行凸包检测，得到手指状态信息；基于深度图，确定位置信息。
例如,由于深度图和灰度图对应于同一拍摄时刻,因此,将深度图中的手势区域作用于灰度图,包括根据深度图中的手势区域,在灰度图中选择相同位置的区域作为手势分析区域。
例如,拟合手势连通域的最大内接圆,定位最大内接圆的圆心为手势中心,沿该中心向外扩展,对手势连通域进行凸包检测,以得到手指状态信息。例如,凸包检测可以采用任何可行的凸包检测方法实现,本公开对此不做具体限制。
例如,手指状态信息包括手势动作对象中是否存在手指伸出,以及伸出手指的数量。
图5A为本公开至少一实施例提供的凸包检测示意图。例如,如图5A中左侧的(1)所示,若手势动作对象处于手指全展开状态,则凸包检测得到的手指状态信息包括有手指处于伸出状态,伸出手指的数量为5。例如,如图5A中右侧的(2)所示,若手势动作对象处于握拳状态,则凸包检测得到的手指状态信息包括无手指处于伸出状态,伸出手指的数量为0。
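例如，下面给出一个由手势分析区域得到手势连通域、手势中心（最大内接圆圆心）以及手指伸出状态信息的示意性代码草图（基于OpenCV和NumPy，其中的二值化方式、指缝判定阈值等均为说明性假设，并非对本公开实施例的限定）：

```python
import cv2
import numpy as np

def extract_pose_info(gray, mask):
    """由灰度图中的手势分析区域提取手势连通域、手势中心与手指状态信息（示意性实现）。"""
    roi = cv2.bitwise_and(gray, gray, mask=mask.astype(np.uint8))
    # 二值化得到手势连通域（此处用Otsu阈值仅作示例，可替换为其他阈值方法）
    _, binary = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hand = max(contours, key=cv2.contourArea)              # 取最大连通域作为手势连通域
    # 最大内接圆：距离变换的最大值点即为圆心，对应手势中心
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    _, radius, _, center = cv2.minMaxLoc(dist)
    # 凸包检测：通过凸缺陷的深度估计伸出手指的数量，无需区分具体是哪根手指
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    gaps = 0
    if defects is not None:
        for s, e, f, d in defects[:, 0]:
            if d / 256.0 > radius * 0.5:                   # 缺陷深度明显大于内接圆半径的一半时计为指缝（假设阈值）
                gaps += 1
    finger_count = gaps + 1 if gaps > 0 else 0             # n个指缝近似对应n+1根伸出手指
    return {"center": center, "radius": radius, "finger_count": finger_count}
```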
例如,位置信息包括手势动作对象在手势交互空间中的坐标位置。如前所述,由于深度图和灰度图是对应同一拍摄时刻,而深度图中每个像素可以表示手势交互空间中一个点的三维坐标,该三维坐标可以包括每个像素在手势交互空间中的横坐标、纵坐标、深度坐标,深度坐标表示该像素对应的对象距离拍摄装置的距离。由于通常拍摄装置设置在显示单元所在平面上,因此深度坐标也表示像素对应的对象距离显示单元的距离。
因此,基于深度图即可得到手势动作对象在手势交互空间中的三维坐标位置。这里,手势动作对象在手势交互空间中的坐标位置,包括手势分析区域中各个像素点的坐标位置。
图5B为本公开一实施例提供的姿态信息提取过程示意图。
如图5B所示,针对对应同一拍摄时刻的一对深度图和灰度图,遍历深度图,统计深度图中的深度数据以建立深度直方图,深度直方图如图5B中所示。
选取深度直方图中两个自适应阈值（如图5B中深度直方图中两条竖线所示），确定两个自适应阈值范围内的所有像素点作为深度图中的手势区域，将深度图中的手势区域作用于灰度图，得到如图5B中所示的位于灰度图中的手势分析区域。
对手势分析区域进行二值化处理,得到手势连通域,手势连通域如图5B中标记为“手势连通域”的图中灰色区域所示。
对手势连通域进行凸包检测,得到手指状态信息,例如,手指状态信息包括有且仅有1根手指处于伸出状态。
例如,基于深度图,还可以确定手势动作对象的位置信息,具体过程如前所述,这里不再赘述。
在得到一对深度图和灰度图对应的姿态信息后,可以基于该拍摄装置连续拍摄手势动作对象所得到的分别对应不同拍摄时刻的多组图像,确定手势动作对象的动态手势变化。
例如,在步骤S20中,识别手势动作对象的动态手势变化,可以包括:根据多组图像分别对应的针对手势动作对象的姿态信息,确定手势动作对象的动态手势变化。
例如,根据多组图像分别对应的针对手势动作对象的姿态信息,确定手势动作对象的动态手势变化,可以包括:根据多组图像分别对应的手指状态信息和位置信息,确定手势动作对象在不同拍摄时刻组成的识别时段内的手指伸出状态变化和位置变化;根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化。
在进行手势交互时,通常的交互场景包括单击图标、长按图标、滑动切换场景、抓取模型以使模型移动、旋转、缩放等。由此,可以提取人类常使用的5种自然的动态手势,即单击手势、长按手势、滑动手势、抓取手势和释放手势。
例如,根据多组图像分别对应的手指状态信息和位置信息,可以得到在不同拍摄时刻组成的识别时段中(也即拍摄动态手势对象的时段),手势动作对象的手指伸出状态变化和位置变化。
例如,手势动作对象的动态手势变化包括手势动作,手势动作至少包括单击手势、长按手势、滑动手势、抓取手势和释放手势中的任意一个。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,包括:响应于手指伸出状态变化指示手势动作对象中的至少一个手指在识别时段内的至少部分时段处于伸出状态,位置变化指示手势动作对象 中的目标识别点在至少部分时段内的深度坐标先减小后增大,确定手势动作为单击手势。
例如,目标识别点可以是目标手指的指尖,例如,当有一个手指在识别时段内的至少部分时段处于伸出状态,则目标手指为该手指;当有多个手指在识别时段内的至少部分时段处于伸出状态,则目标手指优先选择食指或者中指。
例如,在识别时段内,识别到一个或多个手指处于伸出状态,目标手指的指尖的深度坐标先减小后增大,确定识别到手势动作对象执行了单击手势。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,可以包括:响应于手指伸出状态变化指示手势动作对象中的至少一个手指在识别时段内的至少部分时段处于伸出状态,以及位置变化指示手势动作对象中的目标识别点在至少部分时段内的深度坐标先减小后保持,且保持动作的时长超过第一阈值,确定手势动作为长按手势。
同样的,目标识别点可以是目标手指的指尖,例如,当有一个手指在识别时段内的至少部分时段处于伸出状态,则目标手指为该手指;当有多个手指在识别时段内的至少部分时段处于伸出状态,则目标手指优先选择食指或者中指。
例如,在识别时段内,识别到一个或多个手指处于伸出状态,目标手指的指尖的深度坐标先减小后保持,且保持的时长超过第一阈值,确定识别到手势动作对象执行了长按手势。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,包括:响应于手指伸出状态变化指示手势动作对象中的至少一个手指在识别时段内的至少部分时段处于伸出状态,以及位置变化指示手势动作对象中的目标识别点在至少部分时段内沿预设方向滑动的距离超过第二阈值,确定手势动作为滑动手势,其中,滑动的距离通过多组图像中手势动作对象的目标识别点的位置信息计算得到。
同样的,目标识别点可以是目标手指的指尖,例如,当有一个手指在识别时段内的至少部分时段处于伸出状态,则目标手指为该手指;当有多个手指在识别时段内的至少部分时段处于伸出状态,则目标手指优先选择食指或者中指。
例如,预设方向可以是显示单元中显示的提示信息指示的方向,例如预设方向可以是水平方向、竖直方向或者与水平方向呈一定夹角的方向,本公开对此不做具体限制。
例如,第二阈值可以是在手势交互空间中的距离值。
例如,显示单元中的控件所在的坐标系与手势交互空间的坐标系具有预设映射关系,第二阈值可以是控件所在的坐标系中的距离值,此时根据多组图像中手势动作对象的目标识别点的位置信息,按映射关系进行映射,以得到手势动作对象滑动的距离,从而根据该滑动的距离与第二阈值进行比较。
例如,在识别时段内,识别到一个或多个手指处于伸出状态,目标手指的指尖沿预设方向滑动的距离(例如沿水平方向左右滑动的距离,或沿竖直方向上下滑动的距离)超过第二阈值,确定识别到手势动作对象执行了滑动手势。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,包括:响应于手指伸出状态变化指示手势动作对象在识别时段内从存在至少一个手指处于伸出状态转换为无手指处于伸出状态,确定手势动作为抓取手势。
例如,根据手指伸出状态变化和所述位置变化,确定手势动作对象的动态手势变化,包括:响应于手指伸出状态变化指示手势动作对象在识别时段内从无手指处于伸出状态转换为存在至少一个手指处于伸出状态,确定手势动作为释放手势。
为降低手势识别的误操作可能性,在动态手势识别时,可以检测手势动作对象是否在执行手势动作前存在悬停操作。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,还可以包括:在确定手势动作前,确定手势动作对象是否在发生动作变化前存在超过第三阈值的静止时间,响应于存在超过第三阈值的静止时间,继续确定手势动作;响应于不存在超过第三阈值的静止时间,确定未发生手势动作变化。
也就是说，若检测到手势动作对象在执行单击手势、长按手势等之前，存在超过第三阈值的静止时间，则继续按照上述方式确定手势动作对象具体执行的手势动作；若检测到手势动作对象在执行单击手势、长按手势等之前，不存在超过第三阈值的静止时间，则确定手势动作对象没有发生手势动作变化，不再继续确定具体的手势动作，以降低误识别的可能性。
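例如，下面给出一个根据识别时段内的手指伸出状态变化和位置变化判定上述五种手势动作、并先行检查静止时间的示意性代码草图（其中的各阈值取值、函数名、滑动方向均为说明性假设，并非对本公开实施例的限定）：

```python
def classify_gesture(frames, t1=0.5, t2=30.0, t3=0.3, still_eps=3.0):
    """根据识别时段内的手指伸出状态变化和位置变化判定动态手势（示意性实现）。

    frames: 按时间排序的列表，元素为 (timestamp, finger_count, (x, y, z))，z 为深度坐标；
    t1: 长按保持时长阈值（第一阈值，单位秒，假设值）；
    t2: 滑动距离阈值（第二阈值，单位毫米，假设值）；
    t3: 动作变化前静止时长阈值（第三阈值，单位秒，假设值）。
    """
    # 先判断动作发生前是否存在超过第三阈值的静止时间，否则视为未发生手势动作变化
    if not has_still_time(frames, t3, still_eps):
        return None
    counts = [c for _, c, _ in frames]
    xs = [p[0] for _, _, p in frames]
    zs = [p[2] for _, _, p in frames]
    if counts[0] > 0 and counts[-1] == 0:
        return "grab"                      # 由有手指伸出变为无手指伸出：抓取手势
    if counts[0] == 0 and counts[-1] > 0:
        return "release"                   # 由无手指伸出变为有手指伸出：释放手势
    if any(c > 0 for c in counts):
        k = zs.index(min(zs))              # 目标识别点深度坐标最小的时刻
        if 0 < k < len(zs) - 1:
            if zs[-1] > zs[k] + still_eps:
                return "click"             # 深度坐标先减小后增大：单击手势
            hold = frames[-1][0] - frames[k][0]
            if abs(zs[-1] - zs[k]) <= still_eps and hold > t1:
                return "long_press"        # 深度坐标先减小后保持且超过第一阈值：长按手势
        if abs(xs[-1] - xs[0]) > t2:
            return "swipe"                 # 沿预设方向（此处假设为水平方向）滑动超过第二阈值：滑动手势
    return None


def has_still_time(frames, t3, eps):
    """判断动作变化前是否存在超过 t3 的静止时间（位置变化小于 eps 视为静止，示意性实现）。"""
    start_t = frames[0][0]
    for i in range(1, len(frames)):
        moved = sum(abs(a - b) for a, b in zip(frames[i][2], frames[0][2]))
        if moved > eps:
            return (frames[i][0] - start_t) > t3
    return True
```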
需要说明的是,在本公开至少一实施例提供的手势识别方法中,确定手势动作对象的动态手势变化时,只需要确定手指是否伸出即可,而不需要确定具体哪些手指处于伸出状态,这减少了凸包检测的工作量,也不需要复杂的深度学习算法支撑,提升了检测速度,降低了对系统算力的需求。
在本公开至少一实施例提供的手势识别方法中,设计手势动作的简化识别算法,即建立深度直方图以提取灰度图中的手势区域,分析手势连通域并进行凸包检测,再结合手指状态变化和位置变化情况识别手势动作,不使用复杂的深度学习算法,整体上实现手势动作的快速识别,手势动作的识别用时小于等于5ms(毫秒)。
例如,控件本身也包含位置信息,当手势交互动作符合预期的目标动作,手势交互的位置也与控件位置也重合时,触发此控件。因此,在本公开至少一实施例提供的手势识别方法中,手势动作对象的动态手势变化还包括手势位置,可以同步识别手势动作和手势位置。
例如,根据手指伸出状态变化和位置变化,确定手势动作对象的动态手势变化,还包括:响应于手势动作为单击手势、长按手势或滑动手势,确定手势位置基于手势动作对象中的目标识别点的位置信息得到,其中,目标识别点包括目标手指的指尖点;响应于手势动作为抓取手势或释放手势,确定手势位置基于所述手势动作对象的手势中心的位置信息得到,其中,所述手势中心为手势连通域的最大内接圆的圆心。
例如,当识别到的手势动作为单击手势、长按手势或滑动手势时,确定手势位置为目标识别点的位置;例如,当识别到的手势动作为抓取手势或释放手势时,确定手势位置为手势中心的位置。
例如,为实现高精度的位置测量,例如深度测量,还可以统计手势位置附近的多个预设采样点的位置,根据面精度定位最终的手势位置,提升手势位置的准确性和精度。
例如,确定手势位置基于手势动作对象中的目标识别点的位置信息得到,可以包括:获取目标识别点周围的预设位置处的多个采样点分别对应的多个位置信息;根据多个位置信息和目标识别点的位置信息,得到手势位置。
例如，确定手势位置基于手势动作对象的手势中心的位置信息得到，可以包括：获取手势中心周围的预设位置处的多个采样点分别对应的多个位置信息；根据多个位置信息和手势中心的位置信息，得到手势位置。
例如,选择手势中心或者目标识别点周围的多个采样点,将这些采样点的位置信息以及手势中心或者目标识别点的位置信息进行加权计算,得到加权结果,将加权结果作为最终的手势位置。
当然,由于位置信息包括三维位置,也可以仅选择其中的深度坐标进行计算,减少计算量的同时提高深度位置测量的准确性。
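例如，下面给出一个在目标识别点或手势中心周围取预设采样点、对深度坐标进行加权以细化手势位置的示意性代码草图（采样点偏移量与权重均为说明性假设）：

```python
import numpy as np

def refine_position(depth_map, anchor,
                    offsets=((0, 0), (2, 0), (-2, 0), (0, 2), (0, -2)),
                    weights=None):
    """在目标识别点或手势中心周围取若干预设采样点，对深度坐标加权得到更稳定的手势位置（示意性实现）。

    anchor: 目标识别点或手势中心的像素坐标 (u, v)；offsets、weights 均为示例性假设。
    """
    h, w = depth_map.shape
    samples = []
    for du, dv in offsets:
        u = min(max(anchor[0] + du, 0), w - 1)
        v = min(max(anchor[1] + dv, 0), h - 1)
        samples.append(depth_map[v, u])
    if weights is None:
        weights = np.ones(len(samples))
    depth = float(np.average(samples, weights=weights))    # 加权得到的深度坐标
    return anchor[0], anchor[1], depth
```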
例如,在一些实施例中,在提供有多个拍摄装置时,步骤S20可以包括:基于同一拍摄装置得到的分别属于多组图像的、且对应不同拍摄时刻的多对深度图和灰度图,确定同一拍摄装置对应的手势动作对象的中间手势变化;对多个拍摄装置分别对应的多个中间手势变化进行加权和滤波处理,得到手势动作对象的动态手势变化。
例如,此时使用多个拍摄装置同步定位手势动作和手势位置,每个拍摄装置利用该拍摄装置得到的对应不同拍摄时刻的多对深度图和灰度图,得到拍摄装置对应的手势动作对象的手势变化,具体识别过程如前所述,这里不再赘述。之后,结合多个拍摄装置的识别结果进行加权修正、带通滤波修正等处理,得到最终的识别结果。
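例如，下面给出一个对多个拍摄装置分别得到的中间识别结果进行加权与滤波修正的示意性代码草图（此处用简单的指数平滑代替带通滤波仅作说明，权重设置为假设值，并非对本公开实施例的限定）：

```python
import numpy as np

def fuse_camera_results(positions, weights=None, history=None, alpha=0.6):
    """对多个拍摄装置识别出的手势位置进行加权与滤波修正（示意性实现）。

    positions: 各拍摄装置换算到同一手势交互空间坐标系后的位置列表 [(x, y, z), ...]；
    weights:   各拍摄装置的置信度权重（假设值，可按视角遮挡程度设置）；
    history:   上一时刻的融合结果，用于简单的指数平滑，代替带通滤波仅作示意。
    """
    positions = np.asarray(positions, dtype=float)
    if weights is None:
        weights = np.ones(len(positions))
    fused = np.average(positions, axis=0, weights=weights)          # 多路结果加权融合
    if history is not None:
        fused = alpha * fused + (1.0 - alpha) * np.asarray(history)  # 平滑滤波修正
    return tuple(fused)
```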
例如,在另一些实施例中,可以使用多个拍摄装置的结果定位手势位置,选择其中一个拍摄装置识别的手势动作作为手势动作的识别结果,以提高手势位置定位的精度,降低计算量。
由于多个拍摄装置具有不同的拍摄角度和/或拍摄位置,这种方式可以在一定程度上保证了手势的防遮挡,且提升了最终识别结果的准确性,提升了动态手势识别的鲁棒性,实现高精度测量。
例如,在另一些实施例中,也可以在姿态信息获取阶段,综合多个拍摄装置的手势连通域(例如加权修正或带通滤波等),得到一个最终的手势连通域,利用这个手势连通域进行后续的动态手势识别,识别过程与上述过程相同,这里不再赘述。这种方式同样也可以提升手势识别结果的准确性和鲁棒性。
由此,通过设置多个拍摄装置,整体上实现手势位置的准确定位和手势动作的准确识别,经过实际测量,手势位置的定位误差小于2mm,手势动作的识别准确率大于95%。
通过上述方法,能够同步实现手势的位置检测以及动作识别。经过实际 测量处理用时在5ms以内,仅为目前常用的结合深度学习方法进行手势识别所需时间(20ms及以上)的四分之一,对系统算力需求小;并且,本公开至少一实施例提供的手势识别方法使用算力要求较小的图像处理方法即可实现,有利于算法硬件化;此外,定义的多种动态手势更符合人的自然交互需求,在降低计算量的同时能够保证手势交互的实时性,提升用户体验。
如前所述,目前现有的深度相机(例如双目相机、结构光相机和TOF相机)虽然都可实现深度信息的获取,但其处理成本高、帧率低、延时大,在实时交互中会影响用户的使用体验。
因此,本公开至少一实施例提供的手势识别方法中,为达到高精度、低延时、大视角的实时互动需求,基于深度相机或深度相机和高速灰度相机相结合的结构排布方案,并加以自定义的并行处理流程,通过快慢速结合,轨迹预测等方式实现动态手势变化的快速检测。
例如,利用至少一个拍摄装置连续拍摄手势动作对象,得到分别对应不同拍摄时刻的多组图像,包括:利用每个拍摄装置连续拍摄手势动作对象,得到拍摄装置输出的分别对应不同拍摄时刻的多对深度图和灰度图。
例如,每个拍摄装置包括第一获取单元,例如第一获取单元为TOF相机。
例如,图6示出了TOF相机的深度图和灰度图的关系。参考如前所述的内容,每个第一帧表示一个DCS相位的时间,也即图6中的DCS0、DCS1、DCS2、DCS3等,DCS0、DCS1、DCS2、DCS3分别表示不同的相位,例如,DCS0对应相位0°,DCS1对应相位90°,DCS2对应相位180°,DCS3对应相位270°。这里,“相位”表示第一获取单元发送信号和接收信号之间的相位差,例如,DCS0表示所获取的灰度图像对应于第一获取单元发送信号和接收信号之间的相位差为0°。
例如,在图6中,TOF相机会采用对应4个不同相位的4个灰度图来计算获得一个深度图,也即此时N=4。当然,需要说明的是,图6仅给出了一种示意,在其他实施例中,N还可以取其他值,例如2或8等。
如图6所示，在包括4个DCS，也即4个第一帧的frame1中，分别在每个DCS获取对应4个不同相位的4个灰度图，表示为灰度图Gray0、灰度图Gray1、灰度图Gray2和灰度图Gray3，基于灰度图Gray0、灰度图Gray1、灰度图Gray2和灰度图Gray3计算得到一个深度图Dep0，深度图Dep0和灰度图Gray3可以同步输出第一获取单元。
如图6所示，在包括4个DCS，也即4个第一帧的frame2中，分别在每个DCS获取对应4个不同相位的4个灰度图，表示为灰度图Gray4、灰度图Gray5、灰度图Gray6和灰度图Gray7，基于灰度图Gray4、灰度图Gray5、灰度图Gray6和灰度图Gray7计算得到一个深度图Dep1，深度图Dep1和灰度图Gray7可以同步输出第一获取单元。
关于深度图Dep2的计算过程与前述过程相同,这里不再赘述。
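例如，下面给出一个由4个不同相位的灰度图计算深度图的示意性代码草图，采用的是常见的四相位连续波解调公式，其中的调制频率等参数为假设值，并非对本公开实施例所用计算方式的限定：

```python
import numpy as np

C = 299792458.0          # 光速，单位 m/s
F_MOD = 20e6             # 调制频率，假设为 20MHz

def depth_from_four_phases(dcs0, dcs1, dcs2, dcs3):
    """由对应 0°/90°/180°/270° 相位的4幅灰度图计算深度图（常见的四相位解调公式，仅作示意）。"""
    phase = np.arctan2(dcs3.astype(float) - dcs1, dcs0.astype(float) - dcs2)
    phase = np.mod(phase, 2 * np.pi)                      # 将相位差归一化到 [0, 2π)
    depth = C * phase / (4 * np.pi * F_MOD)               # 单位：米
    amplitude = 0.5 * np.hypot(dcs3.astype(float) - dcs1,
                               dcs0.astype(float) - dcs2)  # 幅值可用作灰度信息或置信度
    return depth, amplitude
```

例如，按上述假设的20MHz调制频率，最大不模糊测量距离约为C/(2×F_MOD)=7.5米，足以覆盖前述深度约35cm的手势交互空间。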
如前所述,当第一帧的帧长为25ms时,则第一获取单元每间隔100ms输出一对深度图和灰度图,灰度图和深度图具有相同的输出帧率和延时,但此时是无法实现10ms或更低的延时要求。因此,为提高拍摄装置输出灰度图和深度图的帧率,本公开一些实施例提供了如下的多种实施方案。
例如,在一些实施例中,利用每个拍摄装置连续拍摄手势动作对象,得到拍摄装置输出的分别对应不同拍摄时刻的多对深度图和灰度图,可以包括:利用拍摄装置在每个第一帧输出一对对应的深度图和灰度图,其中,输出的深度图根据N个灰度图和一个深度图,利用平滑轨迹拟合预测得到。
也就是说,此时拍摄装置能够在每个第一帧输出一对对应的深度图和灰度图,其中,输出的深度图是根据N个灰度图和该一个深度图,利用平滑轨迹拟合预测得到,也就是该1个深度图是基于该N个灰度图计算得到。
图7A为本公开一实施例提供的深度图和灰度图的对应关系示意图。
结合图6,每四个对应不同相位的灰度图对应一个深度图,即第一获取单元每4个第一帧与深度图进行一次对齐,实现深度信息和灰度信息的位置校准。
例如，在frame1的DCS0、DCS1、DCS2和DCS3中，可应用任意可行的预测算法，通过灰度图Gray0、灰度图Gray1、灰度图Gray2和灰度图Gray3对深度图Dep0进行平滑轨迹拟合预测，得到灰度图Gray0对应的深度图Dep0_1、灰度图Gray1对应的深度图Dep0_2、灰度图Gray2对应的深度图Dep0_3，以提升深度信息的帧率。
例如，在frame2的DCS0、DCS1、DCS2中，可应用任意可行的预测算法，通过灰度图Gray4、灰度图Gray5、灰度图Gray6和灰度图Gray7对深度图Dep1进行平滑轨迹拟合预测，得到灰度图Gray4对应的深度图Dep1_1、灰度图Gray5对应的深度图Dep1_2、灰度图Gray6对应的深度图Dep1_3，以提升深度信息的帧率。
关于frame3中的处理过程与前述过程相同,这里不再赘述。
例如,预测算法可以包括内插算法、卡尔曼滤波算法等,本公开对此不做具体限制。
由此,第一获取单元能够在每个第一帧,输出一对对应的灰度图和深度图。
需要说明的是,也可以计算部分目前不具有对应的深度图的灰度图所对应的深度图,例如计算灰度图Gray1对应的深度图、灰度图Gray5对应的深度图,以间隔两个第一帧输出一对对应的深度图和灰度图,本公开不限制于在每个第一帧输出一对对应的深度图和灰度图,只要能够提高深度图的帧率即可。
在这些实施例中,拍摄装置输出深度图和灰度图的帧率可以达到第一获取单元中获取灰度图的帧率,延时可以达到一个第一帧的帧长,并且不需要额外增加灰度相机,可节约硬件成本。
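例如，下面给出一个利用相邻灰度图估计手势平移、进而对已有深度图进行平滑轨迹拟合预测的示意性代码草图（以相位相关估计平移仅为一种可能的预测方式，亦可替换为前述的内插或卡尔曼滤波等算法，函数名为说明性假设）：

```python
import cv2
import numpy as np

def predict_depth(base_depth, base_gray, cur_gray):
    """利用相邻灰度图之间的平移量，对最近一次得到的深度图做平滑轨迹拟合预测（示意性实现）。

    假设：手势在相邻帧之间近似做平移运动，用相位相关估计平移量并对深度图做相应平移。
    """
    shift, _ = cv2.phaseCorrelate(base_gray.astype(np.float32),
                                  cur_gray.astype(np.float32))
    dx, dy = shift
    m = np.float32([[1, 0, dx], [0, 1, dy]])              # 平移变换矩阵
    predicted = cv2.warpAffine(base_depth.astype(np.float32), m,
                               (base_depth.shape[1], base_depth.shape[0]))
    return predicted
```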
上述实施例中,预测得到的深度图的精度受限于预测算法的精度,在N较大时,可能会降低深度图的预测精度。因此,本公开至少一实施例提供的手势识别方法还提供另一种深度图和灰度图的获取方式,参考图6,利用每4个第一帧的灰度图生成一幅深度图,由于4个相位是周期变化的,例如为0°,90°,180°,270°,0°,90°,…,因此可以选择4个不同相位的灰度图一起运算,获得新的深度图。
例如,在另一些实施例中,利用每个拍摄装置连续拍摄手势动作对象,得到拍摄装置输出的分别对应不同拍摄时刻的多对深度图和灰度图,可以包括:利用拍摄装置在至多每N-1个第一帧输出一对对应的深度图和灰度图,输出的深度图通过与输出的灰度图相邻的N-1个第一帧的灰度图计算得到,输出的灰度图和相邻的N-1个第一帧的灰度图对应N个不同的相位。
也就是说，此时拍摄装置能够在至多每N-1个第一帧输出一对对应的深度图和灰度图，输出的深度图通过与输出的灰度图相邻的N-1个第一帧的灰度图计算得到，输出的灰度图和相邻的N-1个第一帧的灰度图对应N个不同的相位。
图7B示出了本公开另一实施例提供的深度图和灰度图的对应关系示意图。
如图7B所示,深度图Dep0是基于对应4个不同相位的灰度图Gray0、灰度图Gray1、灰度图Gray2和灰度图Gray3计算得到;深度图Dep1_1是基于对应4个不同相位的灰度图Gray1、灰度图Gray2、灰度图Gray3和灰度图Gray4计算得到,深度图Dep1_2是基于对应4个不同相位的灰度图Gray2、灰度图Gray3、灰度图Gray4和灰度图Gray5计算得到,…,以此类推。
也就是说,在图7B中,每个第一帧输出一对深度图和灰度图。例如,以在一个第一帧输出深度图Dep1_1和灰度图Gray4为例,深度图Dep1_1是基于与灰度图Gray4相邻的3个灰度图,且该3个灰度图是在灰度图Gray4之前得到,该3个灰度图为灰度图Gray1、灰度图Gray2、灰度图Gray3,灰度图Gray1、灰度图Gray2、灰度图Gray3和灰度图Gray4分别对应4个不同的相位。由此,利用对应4个不同相位的灰度图计算得到一个深度图。
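例如，下面给出一个按图7B所示的滑动窗口方式、利用最近4个第一帧的不同相位灰度图计算当前第一帧对应深度图的示意性代码草图（其中沿用了上文示意的depth_from_four_phases函数，窗口管理方式为说明性假设）：

```python
from collections import deque

# 维持一个长度为4的滑动窗口，窗口内保存最近4个第一帧的灰度图及其相位标号（0°/90°/180°/270°循环）
window = deque(maxlen=4)

def on_new_gray(gray, phase_index):
    """每个第一帧到来时调用：phase_index 取 0/1/2/3，对应 0°/90°/180°/270°（示意性实现）。"""
    window.append((phase_index, gray))
    if len(window) < 4:
        return None                        # 窗口未满时暂不输出
    # 按相位重新排序，保证送入四相位解调公式的顺序正确
    by_phase = {p: g for p, g in window}
    depth, _ = depth_from_four_phases(by_phase[0], by_phase[1], by_phase[2], by_phase[3])
    return depth, gray                     # 输出与当前灰度图对应的一对深度图和灰度图
```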
当然,图7B所示为一种示意,当N取不同值时,可以以类似的方式计算得到灰度图对应的深度图。
需要说明的是，也可以计算部分灰度图对应的深度图，例如计算Gray5对应的深度图、Gray7等对应的深度图，以间隔两个第一帧输出一对对应的深度图和灰度图。因此，本公开不限制于在每个第一帧输出一对对应的深度图和灰度图，只要能够提高深度图的帧率即可，也即实现至多每N-1个第一帧输出一对对应的深度图和灰度图。
在上述实施例中,拍摄装置输出深度图和灰度图的帧率可以达到第一获取单元中灰度图的获取帧率,延时可以达到一个第一帧的帧长,不需要额外增加灰度相机,可节约硬件成本。并且,这种方式得到的深度图的精度不受预测算法的影响,即使N较大也能保持较好的深度图的图像精度。
在上述方式中,能够实现不增加灰度相机的前提下提高拍摄装置的帧率,减少手势识别的处理延时。但是,这种方式中灰度图和深度图需要共享相同传输带宽,因此如果若帧率比较高,可能使得传输延时较高,并且,帧率受限于第一获取单元中的灰度图的获取帧率。
因此,在另一些实施例中,拍摄装置还可以包括第二获取单元,例如,第二获取单元包括高速灰度相机,通过高速灰度相机的高帧率实现超高速的深度信息预测和校准。
例如,利用每个拍摄装置连续拍摄手势动作对象,得到拍摄装置输出的 分别对应不同拍摄时刻的多对深度图和灰度图,可以包括:利用拍摄装置在至多每M-1个第二帧输出一对对应的深度图和灰度图,其中,输出的深度图包括基准深度图,或者基于基准深度图和基准深度图对应的至少一个灰度图、利用平滑轨迹拟合预测得到的深度图,其中,基准深度图包括在当前第二帧或在当前第二帧之前由第一获取单元输出的深度图,当前第二帧为输出一对对应的深度图和灰度图的第二帧,至少一个灰度图包括在基准深度图对应的第二帧和当前第二帧之间由第二获取单元输出的灰度图。
也就是说,此时拍摄装置能够在至多每M-1个第二帧输出一对对应的深度图和灰度图。并且,输出的深度图可以是基准深度图,或者是基于基准深度图和基准深度图对应的至少一个灰度图,利用平滑轨迹拟合预测得到的深度图。这里,基准深度图包括在当前第二帧或在当前第二帧之前由第一获取单元输出的深度图,当前第二帧为输出一对对应的深度图和灰度图的第二帧,至少一个灰度图包括在基准深度图对应的第二帧和当前第二帧之间由第二获取单元输出的灰度图。
当然,根据预测算法的不同,也可以使用更多的灰度图结合基准深度图进行预测,例如,在当前第二帧没有对应的深度图时,可以利用当前第二帧之前的M个第二帧输出的灰度图,结合基准深度图,利用预测算法或插值算法进行处理,得到当前第二帧对应的深度图。
例如,由于第二获取单元输出灰度图的帧率是第一获取单元输出深度图的帧率的M倍,因此第二获取单元每M个第二帧就能跟第一获取单元输出的深度图进行一次对齐,实现深度和灰度信息位置的校准。在其他第二帧,利用预测算法,通过第二获取单元得到的高帧率灰度图对深度图进行平滑轨迹拟合预测,也就是说在其他M-1个第二帧中,即使没有深度图也能通过第二获取单元输出的灰度图实现对深度图的预测,进而实现高速图像深度信息的获取和校准。
需要说明的是，可以在每个第二帧输出一对对应的灰度图和深度图，也可以在每两个第二帧输出一对对应的灰度图和深度图，本公开对此不做具体限制，只要能够提高深度图的输出帧率即可，也即实现至多每M-1个第二帧输出一对对应的深度图和灰度图。
下面结合附图具体说明此时拍摄装置的具体处理过程。
图7C为本公开再一实施例提供的深度图和灰度图的对应关系示意图。
例如,在图7C中,DCS0、DCS1、DCS2、DCS3分别表示对应不同相位的第一帧,f0_M、f1_1、f1_2、...、f1_M、f2_1、...、f2_M、f3_1、f3_2...分别表示第二帧。
例如,如图7C所示,第一获取单元配置为每M个第二帧输出一个深度图,M为正整数且大于1。例如,第一获取单元在第二帧f1_1输出深度图Dep_i,在第二帧f2_1输出深度图Dep_i+1,在第二帧f3_1输出深度图Dep_i+2。关于深度图Dep_i、深度图Dep_i+1、深度图Dep_i+2的计算过程可以参考前述实施例,这里不再赘述。
如图7C所示,第二获取单元配置为在每个第二帧输出一个灰度图,例如,在第二帧f1_1输出灰度图1_1,在第二帧f1_2输出灰度图1_2,在第二帧f1_3输出灰度图1_3,以此类推。
第二帧的帧长比第一帧的帧长短,例如,第一帧的帧长为25ms,第二帧的帧长为8.3ms。
下面以第二帧f1_1至第二帧f1_M输出深度图的过程为例,说明深度图的获取过程。
需要说明的是，对于第二帧f1_1至第二帧f1_M输出的深度图，基准深度图为由第一获取单元输出的深度图Dep_i，基准深度图Dep_i是利用4个第一帧得到的分别对应4个不同相位的灰度图计算得到，深度图Dep_i和第二获取单元输出的灰度图1_1对齐，在第二帧f1_1同步输出拍摄装置。
在当前第二帧为第二帧f1_2时,利用基准深度图Dep_i和灰度图1_1、灰度图1_2,结合预测算法预测第二帧f1_2的深度图D2,将深度图D2和灰度图1_2同步输出拍摄装置。
在当前第二帧为第二帧f1_3时,利用基准深度图Dep_i和灰度图1_1、灰度图1_2、灰度图1_3,结合预测算法预测第二帧f1_3的深度图D3,将深度图D3和灰度图1_3同步输出拍摄装置。
以此类推,后续过程不再赘述。
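例如，下面给出一个TOF相机（第一获取单元）与高速灰度相机（第二获取单元）组合输出深度图和灰度图对的示意性流程草图（其中沿用了上文示意的predict_depth函数，类名与方法名均为说明性假设，并非对本公开实施例的限定）：

```python
class PairedStream:
    """TOF相机与高速灰度相机组合输出深度图/灰度图对的示意性流程。"""

    def __init__(self):
        self.base_depth = None     # 基准深度图（由第一获取单元输出）
        self.base_gray = None      # 与基准深度图对齐的灰度图

    def on_tof_depth(self, depth, gray):
        """每 M 个第二帧，第一获取单元输出一个深度图，与当前灰度图完成一次对齐校准。"""
        self.base_depth, self.base_gray = depth, gray

    def on_fast_gray(self, gray):
        """每个第二帧，第二获取单元输出一个灰度图；无基准深度图时暂不输出。"""
        if self.base_depth is None:
            return None
        depth = predict_depth(self.base_depth, self.base_gray, gray)  # 平滑轨迹拟合预测
        return depth, gray         # 同步输出一对对应的深度图和灰度图
```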
由于高速灰度相机的帧率最高可达几百赫兹,因此在上述实施例中,利用高速灰度相机和深度相机的结合,能够实现毫秒级别的低时延,大幅降低获取图像所需的延时,提高图像获取帧率,快速实现手势动作及手势位置的判定,提高3D空间中交互体验的速度。
在该实施例中,预测得到的深度图的精度受限于预测算法的精度,在M 较大时,可能会降低深度图的预测精度。因此,可以先采用如上所述的利用不同相位的灰度图,得到每个第一帧对应的深度图,之后,再利用每个第一帧得到的深度图进行预测,降低预测时图像之间的间隔,提高预测的深度图的准确率。
例如,第一获取单元还配置为,在每个第一帧得到一对对应的深度图和灰度图,得到的深度图通过与得到的灰度图相邻的N-1个第一帧的灰度图计算得到,得到的灰度图和相邻的N-1个第一帧的灰度图对应N个不同的相位,第一帧的帧长大于第二帧的帧长,N为正整数且大于1。关于第一获取单元输出深度图的具体处理过程可以参考图7B相关的实施例的描述,这里不再赘述。
此时，第一获取单元能够在每个第一帧得到一对对应的深度图和灰度图，得到的深度图通过与得到的灰度图相邻的N-1个第一帧的灰度图计算得到，得到的灰度图和相邻的N-1个第一帧的灰度图对应N个不同的相位。拍摄装置能够在至多每M’-1个第二帧输出一对对应的深度图和灰度图，例如，输出的深度图包括基准深度图，或者基于基准深度图和基准深度图对应的至少一个灰度图、利用平滑轨迹拟合预测得到的深度图。
例如,参考图7B相关的实施例,在任一个第一帧,若第一帧没有对应的深度图,则利用与该第一帧相邻的、且在该第一帧之前的N-1个第一帧输出的灰度图,计算生成该第一帧对应的深度图,具体过程不再赘述。
由此,实现在每个第一帧从第一获取单元得到一个深度图,第一获取单元输出的深度图的帧率与第一获取单元得到灰度图的帧率相同。
此时,由于第二获取单元输出灰度图的帧率是第一获取单元输出深度图的帧率的M’倍,因此第二获取单元每M’个第二帧就能跟第一获取单元输出的深度图进行一次对齐,实现深度和灰度信息位置的校准。在其他第二帧,利用预测算法,通过第二获取单元得到的高帧率灰度图对第一获取单元输出的深度图进行平滑轨迹拟合预测,也就是说在其他M’-1个第二帧中,即使没有深度图也能通过第二获取单元输出的灰度图实现对深度图的预测,进而实现高速图像深度信息的获取和校准。
这里，M’为正整数，且M’=M/N，其中M表示在第一获取单元每N个第一帧输出一个深度图时，第二获取单元输出灰度图的帧率与第一获取单元输出深度图的帧率的比值。
关于预测的具体过程可以参考图7C所述的内容,重复之处不再赘述。
由于高速灰度相机的帧率最高可达几百赫兹,因此在上述实施例中,利用高速灰度相机和深度相机的结合,能够实现毫秒级别的低时延,大幅降低获取图像所需的延时,提高图像获取帧率,快速实现手势动作及位置的判定,提高3D空间中交互体验的速度。
并且,由于此时用于预测的基准深度图之间的间隔缩短,例如在图7C所述的实施例中,基准深度图之间的间隔为4个第一帧,而在该实施例中,基准深度图之间的间隔缩短为1个第一帧,因此预测时的准确率得到提高,减少内插倍数,提升预测得到的深度图的精度。
因此,在本公开至少一实施例提供的手势识别方法中,拍摄装置输出灰度图和深度图的帧率大幅提高,降低了图像采集过程的延时,满足低延时的交互需求,例如,在采用帧率为120fps的高速灰度相机和帧率为30fps的TOF相机组合作为拍摄装置进行拍摄时,测试得到的具体指标包括:延时20ms以内,报点率120Hz,定位精度达到2mm以内,识别率95%以上。因此,本公开至少一实施例提供的手势识别方法能够快速准确地定位手势位置与手势动作,保证交互过程的流畅度和准确率,提升用户体验。
本公开至少一实施例还提供一种交互方法。图8为本公开至少一实施例提供的交互的方法示意性流程图。
如图8所示,本公开至少一实施例提供的交互方法至少包括步骤S30-S50。
在步骤S30,显示控件。
例如,可以在显示单元中显示控件,显示单元包括任意具有显示效果的显示装置,如三维显示器、大尺寸屏幕等。显示单元能够显示与手势动作对象进行交互的显示内容,例如显示内容包括通用的、易理解的控件,关于显示单元、显示控件的内容可以参考前述内容,这里不再赘述。需要说明的是,本公开对显示单元的类型、形状、性能不做具体限制,同样对控件的数量、材质、颜色、形状等也不做具体限制。
在步骤S40,识别用户执行目标动作时的动态手势变化。
例如,拍摄装置集成在显示单元附近且面向用户,例如,用户根据显示内容指示的信息,或者根据其他信息执行目标动作,拍摄装置连续拍摄用户执行目标动作的过程,并利用本公开任一实施例所述的手势识别方法识别用 户执行目标动作时的动态手势变化。
关于手势识别方法参考如前所述的内容,这里不再赘述。
在步骤S50,根据识别的所述动态手势变化和所述目标动作,触发所述控件。
例如,在一些实施例中,动态手势变化包括手势动作,手势动作例如包括单击手势、长按手势、滑动手势、抓取手势及释放手势中的任一个。
例如,步骤S50可以包括:响应于用户的手势动作与目标动作一致,触发控件并显示可视化反馈效果。
例如,此时利用本公开任一实施例提供的手势识别方法,识别用户执行的手势动作变化,例如检测到手势动作为单击手势,若目标动作也是单击手势,则触发控件并显示可视化反馈效果,例如控件材质、颜色、亮度等发生变化,或者其他可视化反馈效果,以提醒用户交互完成。
例如,可视化反馈效果还可以根据手势动作的不同,包括切换场景、空间移动、旋转、缩放等,本公开对此不做具体限制。
例如,在另一些实施例中,动态手势变化还包括手势位置,控件本身也包含位置信息,只有当手势交互的位置与控件的位置重合时,才能触发此控件。
例如,步骤S50可以包括:响应于用户的手势动作与目标动作一致,用户的手势位置与控件的控件位置匹配,触发控件并显示可视化反馈效果。这里,用户的手势位置与控件的控件位置匹配表示,手势位置按映射关系映射至控件所在的坐标系中的位置与控件位置一致。
如前所述,控件所在的坐标系与手势位置所在的手势交互空间中的坐标系存在预定映射关系,将手势位置按照该映射关系映射至控件所在的坐标系,若两者位置重合,则触发控件并显示可视化反馈效果。
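例如，下面给出一个将手势位置按预设映射关系映射到控件所在坐标系、并在手势动作与手势位置均匹配时触发控件的示意性代码草图（映射矩阵、位置容差、控件数据结构等均为说明性假设）：

```python
import numpy as np

def maybe_trigger(control, gesture_action, gesture_pos, target_action, mapping, tol=10.0):
    """当手势动作与目标动作一致且手势位置映射后与控件位置匹配时触发控件（示意性实现）。

    mapping: 3x4 映射矩阵，将手势交互空间坐标映射到控件所在坐标系（假设已预先标定）；
    control: 假设为包含 "position" 和 "on_trigger" 回调的字典。
    """
    if gesture_action != target_action:
        return False
    p = np.asarray(list(gesture_pos) + [1.0])              # 齐次坐标
    mapped = mapping @ p                                    # 映射到控件所在坐标系
    if np.linalg.norm(mapped[:2] - np.asarray(control["position"][:2])) <= tol:
        control["on_trigger"]()                             # 触发控件并由显示单元给出可视化反馈
        return True
    return False
```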
本公开至少一实施例还提供一种手势交互系统,图9为本公开至少一实施例提供的一种手势交互系统的示意性框图。
如图9所示,手势交互系统包括一个或多个拍摄装置901、手势识别单元902和显示单元903。
例如,至少一个拍摄装置901配置为连续拍摄手势动作对象,以获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像。
手势识别单元902配置为接收多组图像,执行本公开任一实施例所述的 手势识别方法,输出手势动作对象的动态手势变化的识别结果。
显示单元903配置为接收识别结果,根据识别结果显示交互效果。
例如,手势识别单元902包括数字信号处理器(Digital Signal Processor,简称DSP),也即本公开至少一实施例提供的手势识别方法在多路DSP中执行,根据多组图像,输出手势动作对象的动态手势变化的识别结果。
例如,拍摄装置中获得高帧率的灰度图和深度图的处理,也可以在硬件中实现,例如在数字信号处理器中实现。
例如，在一些实施例中，拍摄装置拍摄得到原始信号数据（Raw Data）并将原始信号数据传输给手势识别单元，手势识别单元完成原始信号数据的读取、ISP处理、手势定位与识别、时序控制等多项功能。
传统方案通常在专用处理芯片或上位机中执行图像预处理,同时基于深度学习算法在上位机执行手势识别,占用大量的系统资源。本公开至少一实施例提供的手势识别方法使用的图像处理方法实现容易,算法能够实现硬件化,图像信号处理与动态手势识别均在DSP中执行,节省系统资源,快速准确地定位手势位置与手势动作,保证交互过程的流畅度和准确率。
例如，手势采集、动态手势识别均在下位机完成，识别结果直接通过预设接口（例如USB接口）上报给显示单元，以供显示单元结合内容显示相应的交互效果。由此，降低上位机的处理工作量，提升系统的处理效率，降低处理延时，保障手势识别的实时性；并且可以将有限的上位机资源用于交互效果的显示，提升用户体验。
例如,识别结果包括手势位置,也即手势动作对象在手势交互空间中的三维坐标。例如,识别结果还可以包括手势动作,例如,手势动作按照预设的标记码传输给显示单元,例如,单击手势对应“1”,长按手势对应“2”,以此类推。当然,手势动作还可以按其他预设的对应关系传输给显示单元,本公开对此不作具体限制。
例如,显示单元可以包括上位机,显示单元可以利用上位机结合显示内容开发交互效果。由于下位机处理速度较快,因此这种设置方式可以整体上提升系统的处理效率,最大化利用系统资源,降低处理延时,实现手势交互的实时性。
例如，手势交互系统包括多个拍摄装置901，多个拍摄装置901配置为从不同角度同步拍摄手势动作对象，以在同一拍摄时刻得到对应的多对深度图和灰度图。
例如,多个拍摄装置901设置在显示单元的周围且面向用户,例如设置在显示屏幕的上边缘、下边缘、左边缘或右边缘的中央位置。
例如,每个拍摄装置包括第一获取单元和第二获取单元,第一获取单元和第二获取单元配置为同步拍摄手势动作对象。例如,第一获取单元为TOF相机,第二获取单元为高速灰度相机。
例如,当存在多个拍摄装置时,多个拍摄装置同步拍摄手势动作对象,并且每个拍摄装置包括的第一获取单元和第二获取单元也同步拍摄手势动作对象,以得到对应同一拍摄时刻的多对深度图和灰度图。
关于第一获取单元、第二获取单元、拍摄装置的具体设置和处理参考如前所述的实施例,这里不再赘述。
例如,手势交互系统能够根据手势位置智能化调用相应的拍摄装置,控制多相机的曝光时序,在一个拍摄时刻可采集多对深度图与灰度图。例如,多个拍摄装置配置为根据手势动作对象在手势交互空间中的位置,选择部分或全部拍摄装置拍摄手势动作对象。
例如,若检测到手势位置位于显示单元的左上角,则可以设置位于显示单元右下角附近的拍摄装置不需拍摄手势交互对象,以降低系统资源消耗,降低硬件开销。
图10为本公开至少一实施例提供的一种手势识别单元的示意性框图。
如图10所示,手势识别单元902可以包括:图像获取模块9021、处理模块9022。
例如,这些模块可以通过硬件(例如电路)模块、软件模块或二者的任意组合等实现,以下实施例与此相同,不再赘述。例如,如前所述,这些模块可以利用DSP实现,或者,也可以通过中央处理单元(CPU)、图像处理器(GPU)、张量处理器(TPU)、现场可编程逻辑门阵列(FPGA)或者具有数据处理能力和/或指令执行能力的其它形式的处理单元以及相应计算机指令来实现这些模块。
例如,图像获取模块9021被配置为获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像。例如,每组图像包括至少一对对应的深度图和灰度图。
例如，处理模块9022被配置为根据多组图像，使用每组图像中的深度图来获取空间信息，使用每组图像中的灰度图来获取手势动作对象的姿态信息，以识别手势动作对象的动态手势变化。
例如,图像获取模块9021、处理模块9022可以包括存储在存储器中的代码和程序;处理器可以执行该代码和程序以实现如上所述的图像获取模块9021、处理模块9022的一些功能或全部功能。例如,图像获取模块9021、处理模块9022可以是专用硬件器件,用来实现如上所述的图像获取模块9021、处理模块9022的一些或全部功能。例如,图像获取模块9021、处理模块9022可以是一个电路板或多个电路板的组合,用于实现如上所述的功能。在本申请实施例中,该一个电路板或多个电路板的组合可以包括:(1)一个或多个处理器;(2)与处理器相连接的一个或多个非暂时的存储器;以及(3)处理器可执行的存储在存储器中的固件。
需要说明的是,图像获取模块9021可以用于实现图3所示的步骤S10,处理模块9022可以用于实现图3所示的步骤S20。从而关于图像获取模块9021、处理模块9022能够实现的功能的具体说明可以参考上述手势识别方法的实施例中的步骤S10至步骤S20的相关描述,重复之处不再赘述。此外,手势识别单元902可以实现与前述手势识别方法相似的技术效果,在此不再赘述。
需要注意的是,在本公开的实施例中,该手势识别单元902可以包括更多或更少的电路或单元,并且各个电路或单元之间的连接关系不受限制,可以根据实际需求而定。各个电路或单元的具体构成方式不受限制,可以根据电路原理由模拟器件构成,也可以由数字芯片构成,或者以其他适用的方式构成。
本公开至少一实施例还提供一种电子设备,图11为本公开至少一实施例提供的一种电子设备的示意图。
例如,如图11所示,电子设备包括处理器101、通信接口102、存储器103和通信总线104。处理器101、通信接口102、存储器103通过通信总线104实现相互通信,处理器101、通信接口102、存储器103等组件之间也可以通过网络连接进行通信。本公开对网络的类型和功能在此不作限制。应当注意,图11所示的电子设备的组件只是示例性的,而非限制性的,根据实际应用需要,该电子设备还可以具有其他组件。
例如，存储器103用于非瞬时性地存储计算机可读指令。处理器101用于执行计算机可读指令，计算机可读指令被处理器101执行时，实现根据上述任一实施例所述的手势识别方法。关于该手势识别方法的各个步骤的具体实现以及相关解释内容可以参见上述手势识别方法的实施例，在此不作赘述。
例如,处理器101执行存储器103上所存放的计算机可读指令而实现的手势识别方法的其他实现方式,与前述方法实施例部分所提及的实现方式相同,这里也不再赘述。
例如,通信总线104可以是外设部件互连标准(PCI)总线或扩展工业标准结构(EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
例如,通信接口102用于实现电子设备与其他设备之间的通信。
例如,处理器101和存储器103可以设置在服务器端(或云端)。
例如,处理器101可以控制电子设备中的其它组件以执行期望的功能。处理器101可以是中央处理器(CPU)、网络处理器(NP)、张量处理器(TPU)或者图形处理器(GPU)等具有数据处理能力和/或程序执行能力的器件;还可以是数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。中央处理器(CPU)可以为X86或ARM架构等。
例如,存储器103可以包括一个或多个计算机程序产品的任意组合,计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。非易失性存储器例如可以包括只读存储器(ROM)、硬盘、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、USB存储器、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机可读指令,处理器101可以运行所述计算机可读指令,以实现电子设备的各种功能。在存储介质中还可以存储各种应用程序和各种数据等。
例如，在一些实施例中，电子设备还可以包括图像获取部件。图像获取部件用于获取图像。存储器103还用于存储获取的图像。
例如,图像获取部件可以是如前所述的拍摄装置。
例如，关于电子设备执行手势识别的过程的详细说明可以参考手势识别方法的实施例中的相关描述，重复之处不再赘述。
图12为本公开至少一实施例提供的一种非瞬时性计算机可读存储介质的示意图。例如,如图12所示,存储介质1000可以为非瞬时性计算机可读存储介质,在存储介质1000上可以非暂时性地存储一个或多个计算机可读指令1001。例如,当计算机可读指令1001由处理器执行时可以执行根据上文所述的手势识别方法中的一个或多个步骤。
例如,该存储介质1000可以应用于上述电子设备中,例如,该存储介质1000可以包括电子设备中的存储器。
例如,存储介质可以包括智能电话的存储卡、平板电脑的存储部件、个人计算机的硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、闪存、或者上述存储介质的任意组合,也可以为其他适用的存储介质。
例如,关于存储介质1000的说明可以参考电子设备的实施例中对于存储器的描述,重复之处不再赘述。
图13示出了为本公开至少一实施例提供的一种硬件环境的示意图。本公开提供的电子设备可以应用在互联网系统。
利用图13中提供的计算机系统可以实现本公开中涉及的手势识别装置和/或电子设备的功能。这类计算机系统可以包括个人电脑、笔记本电脑、平板电脑、手机、个人数码助理、智能眼镜、智能手表、智能指环、智能头盔及任何智能便携设备或可穿戴设备。本实施例中的特定系统利用功能框图解释了一个包含用户界面的硬件平台。这种计算机设备可以是一个通用目的的计算机设备,或一个有特定目的的计算机设备。两种计算机设备都可以被用于实现本实施例中的手势识别装置和/或电子设备。计算机系统可以包括实施当前描述的实现手势识别所需要的信息的任何组件。例如,计算机系统能够被计算机设备通过其硬件设备、软件程序、固件以及它们的组合所实现。为了方便起见,图13中只绘制了一台计算机设备,但是本实施例所描述的实现手势识别所需要的信息的相关计算机功能是可以以分布的方式、由一组相似的平台所实施的,分散计算机系统的处理负荷。
如图13所示,计算机系统可以包括通信端口250,与之相连的是实现数据通信的网络,例如,计算机系统可以通过通信端口250发送和接收信息及数据,即通信端口250可以实现计算机系统与其他电子设备进行无线或有线 通信以交换数据。计算机系统还可以包括一个处理器组220(即上面描述的处理器),用于执行程序指令。处理器组220可以由至少一个处理器(例如,CPU)组成。计算机系统可以包括一个内部通信总线210。计算机系统可以包括不同形式的程序储存单元以及数据储存单元(即上面描述的存储器或存储介质),例如硬盘270、只读存储器(ROM)230、随机存取存储器(RAM)240,能够用于存储计算机处理和/或通信使用的各种数据文件,以及处理器组220所执行的可能的程序指令。计算机系统还可以包括一个输入/输出组件260,输入/输出组件260用于实现计算机系统与其他组件(例如,用户界面280等)之间的输入/输出数据流。
通常,以下装置可以连接输入/输出组件260:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置;包括例如磁带、硬盘等的存储装置;以及通信接口。
虽然图13示出了具有各种装置的计算机系统,但应理解的是,并不要求计算机系统具备所有示出的装置,可以替代地,计算机系统可以具备更多或更少的装置。
对于本公开,还有以下几点需要说明:
(1)本公开实施例附图只涉及与本公开实施例相关的结构，其他结构可参考通常设计。
(2)为了清晰起见,在用于描述本发明的实施例的附图中,层或结构的厚度和尺寸被放大。可以理解,当诸如层、膜、区域或基板之类的元件被称作位于另一元件“上”或“下”时,该元件可以“直接”位于另一元件“上”或“下”,或者可以存在中间元件。
(3)在不冲突的情况下,本公开的实施例及实施例中的特征可以相互组合以得到新的实施例。
以上所述仅为本公开的具体实施方式,但本公开的保护范围并不局限于此,本公开的保护范围应以所述权利要求的保护范围为准。

Claims (32)

  1. 一种手势识别方法,包括:
    获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像,其中,每组图像包括至少一对对应的深度图和灰度图;
    根据所述多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别所述手势动作对象的动态手势变化。
  2. 根据权利要求1所述的手势识别方法,其中,
    使用每组图像中的深度图来获取空间信息,包括:根据所述深度图确定所述深度图中的手势区域,其中,所述空间信息包括所述深度图中的手势区域;
    使用每组图像中的灰度图来获取手势动作对象的姿态信息,包括:根据所述深度图中的手势区域和所述灰度图,确定所述每组图像对应的针对所述手势动作对象的姿态信息;
    识别所述手势动作对象的动态手势变化,包括:根据所述多组图像分别对应的针对所述手势动作对象的姿态信息,确定所述手势动作对象的动态手势变化。
  3. 根据权利要求2所述的手势识别方法,其中,根据所述深度图确定所述深度图中的手势区域,包括:
    遍历所述深度图,统计所述深度图中的深度数据以建立深度直方图;
    选取所述深度图对应的自适应深度阈值,根据所述自适应深度阈值和所述深度直方图,确定所述深度图中的手势区域。
  4. 根据权利要求2或3所述的手势识别方法,其中,所述每组图像对应的针对所述手势动作对象的姿态信息包括手指状态信息和位置信息,
    根据所述深度图中的手势区域和所述灰度图,确定所述每组图像对应的针对所述手势动作对象的姿态信息,包括:
    将所述深度图中的手势区域作用于所述灰度图,得到所述灰度图中的手势分析区域;
    对所述手势分析区域进行二值化处理,得到手势连通域;
    对所述手势连通域进行凸包检测，得到所述手指状态信息，其中，所述手指状态信息包括是否存在手指伸出，以及伸出手指的数量；
    基于所述深度图,确定所述位置信息,其中,所述位置信息包括所述手势动作对象在手势交互空间中的坐标位置。
  5. 根据权利要求4所述的手势识别方法,其中,根据所述多组图像分别对应的针对所述手势动作对象的姿态信息,确定所述手势动作对象的动态手势变化,包括:
    根据所述多组图像分别对应的手指状态信息和位置信息,确定所述手势动作对象在所述不同拍摄时刻组成的识别时段内的手指伸出状态变化和位置变化;
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化。
  6. 根据权利要求5所述的手势识别方法,其中,所述坐标位置包括深度坐标,
    所述手势动作对象的动态手势变化包括手势动作,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:
    响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内的深度坐标先减小后增大,确定所述手势动作为单击手势。
  7. 根据权利要求5所述的手势识别方法,其中,所述坐标位置包括深度坐标,所述手势动作对象的动态手势变化包括手势动作,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:
    响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,以及所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内的深度坐标先减小后保持,且所述保持动作的时长超过第一阈值,确定所述手势动作为长按手势。
  8. 根据权利要求5所述的手势识别方法,其中,所述手势动作对象的动态手势变化包括手势动作,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:
    响应于所述手指伸出状态变化指示所述手势动作对象中的至少一个手指在所述识别时段内的至少部分时段处于伸出状态,以及所述位置变化指示所述手势动作对象中的目标识别点在所述至少部分时段内沿预设方向滑动的距离超过第二阈值,确定所述手势动作为滑动手势,其中,所述滑动的距离基于所述多组图像中所述手势动作对象的目标识别点的位置信息计算得到。
  9. 根据权利要求5所述的手势识别方法,其中,
    所述手势动作对象的动态手势变化包括手势动作,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:
    响应于所述手指伸出状态变化指示所述手势动作对象在所述识别时段内从存在至少一个手指处于伸出状态转换为无手指处于伸出状态,确定所述手势动作为抓取手势。
  10. 根据权利要求5所述的手势识别方法,其中,
    所述手势动作对象的动态手势变化包括手势动作,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,包括:
    响应于所述手指伸出状态变化指示所述手势动作对象在所述识别时段内从无手指处于伸出状态转换为存在至少一个手指处于伸出状态,确定所述手势动作为释放手势。
  11. 根据权利要求6-10任一项所述的手势识别方法,其中,根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,还包括:
    在确定所述手势动作前,确定所述手势动作对象是否在发生动作变化前存在超过第三阈值的静止时间,
    响应于存在超过第三阈值的静止时间,继续确定所述手势动作;
    响应于不存在超过第三阈值的静止时间,确定未发生手势动作变化。
  12. 根据权利要求5-11任一项所述的手势识别方法,其中,
    所述手势动作对象的动态手势变化还包括手势位置,
    根据所述手指伸出状态变化和所述位置变化,确定所述手势动作对象的动态手势变化,还包括:
    响应于手势动作为单击手势、长按手势或滑动手势,确定所述手势位置基于所述手势动作对象中的目标识别点的位置信息得到,其中,所述目标识别点包括目标手指的指尖点;
    响应于手势动作为抓取手势或释放手势,确定所述手势位置基于所述手势动作对象的手势中心的位置信息得到,其中,所述手势中心为所述手势连通域的最大内接圆的圆心。
  13. 根据权利要求12所述的手势识别方法,其中,确定所述手势位置基于所述手势动作对象中的目标识别点的位置信息得到,包括:
    获取所述目标识别点周围的预设位置处的多个采样点分别对应的多个位置信息;
    根据所述多个位置信息和所述目标识别点的位置信息,得到所述手势位置;以及,
    确定所述手势位置基于所述手势动作对象的手势中心的位置信息得到,包括:
    获取所述手势中心周围的预设位置处的多个采样点分别对应的多个位置信息;
    根据所述多个位置信息和所述手势中心的位置信息,得到所述手势位置。
  14. 根据权利要求1-13任一项所述的手势识别方法,其中,获取针对手势动作对象的分别在不同拍摄时刻拍摄的多组图像,包括:
    利用至少一个拍摄装置连续拍摄所述手势动作对象,得到分别对应所述不同拍摄时刻的多组图像,其中,每个拍摄装置配置为在一个拍摄时刻同步输出一对对应的深度图和灰度图。
  15. 根据权利要求14所述的手势识别方法,其中,所述手势动作对象相对于每个图像中的其他对象最靠近所述至少一个拍摄装置。
  16. 根据权利要求14或15所述的手势识别方法,其中,响应于所述至少一个拍摄装置的数量是多个,每组图像包括多对对应的深度图和灰度图,所述多对深度图和灰度图由所述多个拍摄装置在同一拍摄时刻同步对所述手势动作对象进行拍摄得到,所述多对深度图和灰度图具有不同的拍摄角 度。
  17. 根据权利要求16所述的手势识别方法,其中,根据所述多组图像,使用每组图像中的深度图来获取空间信息,使用每组图像中的灰度图来获取手势动作对象的姿态信息,以识别所述手势动作对象的动态手势变化,还包括:
    基于同一拍摄装置得到的分别属于所述多组图像的、且对应所述不同拍摄时刻的多对深度图和灰度图,确定所述同一拍摄装置对应的所述手势动作对象的中间手势变化;
    对所述多个拍摄装置分别对应的多个中间手势变化进行加权和滤波处理,得到所述手势动作对象的动态手势变化。
  18. 根据权利要求14-17任一项所述的手势识别方法,其中,利用至少一个拍摄装置连续拍摄所述手势动作对象,得到分别对应所述不同拍摄时刻的多组图像,包括:
    利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图。
  19. 根据权利要求18所述的手势识别方法,其中,每个拍摄装置包括第一获取单元,
    所述第一获取单元配置为在每个第一帧获取灰度图,以及每N个第一帧获取深度图,其中,所述深度图基于所述每N个连续的第一帧获取的N个灰度图生成,所述N个灰度图分别对应N个不同的相位,所述一个深度图与所述N个灰度图中的一个灰度图同步输出所述拍摄装置,其中,N为正整数且大于1;
    利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包括:
    利用所述拍摄装置在每个第一帧输出一对对应的深度图和灰度图,其中,所述输出的深度图根据所述N个灰度图和所述一个深度图,利用平滑轨迹拟合预测得到。
  20. 根据权利要求18所述的手势识别方法,其中,每个拍摄装置包括第一获取单元,
    所述第一获取单元配置为在每个第一帧获取灰度图,以及每N个第一帧获取深度图,其中,所述深度图基于所述每N个连续的第一帧获取的N个 灰度图生成,所述N个灰度图分别对应N个不同的相位,所述一个深度图与所述N个灰度图中的一个灰度图同步输出所述拍摄装置,其中,N为正整数且大于1;
    利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包括:
    利用所述拍摄装置在至多每N-1个第一帧输出一对对应的深度图和灰度图,所述输出的深度图通过与所述输出的灰度图相邻的N-1个第一帧的灰度图计算得到,所述输出的灰度图和所述相邻的N-1个第一帧的灰度图对应所述N个不同的相位。
  21. 根据权利要求18所述的手势识别方法,其中,每个拍摄装置包括第一获取单元和第二获取单元,
    所述第二获取单元配置为在每个第二帧输出一个灰度图,所述第一获取单元配置为每M个第二帧输出一个深度图,M为正整数且大于1,
    利用每个拍摄装置连续拍摄所述手势动作对象,得到所述拍摄装置输出的分别对应所述不同拍摄时刻的多对深度图和灰度图,包括:
    利用所述拍摄装置在至多每M-1个第二帧输出一对对应的深度图和灰度图,其中,所述输出的深度图包括基准深度图,或者基于所述基准深度图和所述基准深度图对应的至少一个灰度图、利用平滑轨迹拟合预测得到的深度图,
    其中,所述基准深度图包括在当前第二帧或在所述当前第二帧之前由所述第一获取单元输出的深度图,所述当前第二帧为输出所述一对对应的深度图和灰度图的第二帧,所述至少一个灰度图包括在所述基准深度图对应的第二帧和所述当前第二帧之间由所述第二获取单元输出的灰度图。
  22. 根据权利要求21所述的手势识别方法,其中,所述第一获取单元还配置为,在每个第一帧得到一对对应的深度图和灰度图,所述得到的深度图通过与所述得到的灰度图相邻的N-1个第一帧的灰度图计算得到,所述得到的灰度图和所述相邻的N-1个第一帧的灰度图对应N个不同的相位,所述第一帧的帧长大于所述第二帧的帧长,N为正整数且大于1。
  23. 一种交互方法,包括:
    显示控件;
    利用权利要求1-22任一项所述的手势识别方法识别用户执行目标动作 时的动态手势变化;
    根据识别的所述动态手势变化和所述目标动作,触发所述控件。
  24. 根据权利要求23所述的交互方法,其中,所述动态手势变化包括手势动作,
    根据识别的所述动态手势变化和所述目标动作,触发所述控件,包括:
    响应于所述用户的手势动作与所述目标动作一致,触发所述控件并显示可视化反馈效果。
  25. 根据权利要求23所述的交互方法,其中,所述动态手势变化包括手势动作和手势位置;
    根据识别的所述动态手势变化和所述目标动作,触发所述控件,包括:
    响应于所述用户的手势动作与所述目标动作一致,所述用户的手势位置与所述控件的控件位置匹配,触发所述控件并显示可视化反馈效果,其中,所述用户的手势位置与所述控件的控件位置匹配表示,所述手势位置按映射关系映射至控件所在的坐标系中的位置与所述控件位置一致。
  26. 一种手势交互系统,包括:
    至少一个拍摄装置,所述至少一个拍摄装置配置为连续拍摄所述手势动作对象,以获取针对所述手势动作对象的分别在不同拍摄时刻拍摄的多组图像;
    手势识别单元,配置为接收所述多组图像,执行权利要求1-22任一项所述的手势识别方法,输出所述手势动作对象的动态手势变化的识别结果;
    显示单元,配置为接收所述识别结果,根据所述识别结果显示交互效果。
  27. 根据权利要求26所述的手势交互系统,其中,所述手势交互系统包括多个拍摄装置,所述多个拍摄装置配置为从不同角度同步拍摄所述手势动作对象,以在同一拍摄时刻得到对应的多对深度图和灰度图。
  28. 根据权利要求26或27所述的手势交互系统,其中,每个拍摄装置包括第一获取单元和第二获取单元,所述第一获取单元和所述第二获取单元配置为同步拍摄所述手势动作对象。
  29. 根据权利要求27所述的手势交互系统,其中,所述多个拍摄装置配置为根据所述手势动作对象在手势交互空间中的位置,选择部分或全部拍摄装置拍摄所述手势动作对象。
  30. 根据权利要求26-29任一项所述的手势交互系统,其中,所述手势 识别单元包括数字信号处理器。
  31. 一种电子设备,包括:
    存储器,非瞬时性地存储有计算机可执行指令;
    处理器,配置为运行所述计算机可执行指令,
    其中,所述计算机可执行指令被所述处理器运行时实现根据权利要求1-22任一项所述的手势识别方法或权利要求23-25所述的交互方法。
  32. 一种非瞬时性计算机可读存储介质,其中,所述非瞬时性计算机可读存储介质存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现根据权利要求1-22中任一项所述的手势识别方法或权利要求23-25所述的交互方法。
PCT/CN2022/087576 2022-04-19 2022-04-19 手势识别方法、交互方法、手势交互系统、电子设备、存储介质 WO2023201512A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280000791.0A CN117255982A (zh) 2022-04-19 2022-04-19 手势识别方法、交互方法、手势交互系统、电子设备、存储介质
PCT/CN2022/087576 WO2023201512A1 (zh) 2022-04-19 2022-04-19 手势识别方法、交互方法、手势交互系统、电子设备、存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/087576 WO2023201512A1 (zh) 2022-04-19 2022-04-19 手势识别方法、交互方法、手势交互系统、电子设备、存储介质

Publications (1)

Publication Number Publication Date
WO2023201512A1 true WO2023201512A1 (zh) 2023-10-26

Family

ID=88418846

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087576 WO2023201512A1 (zh) 2022-04-19 2022-04-19 手势识别方法、交互方法、手势交互系统、电子设备、存储介质

Country Status (2)

Country Link
CN (1) CN117255982A (zh)
WO (1) WO2023201512A1 (zh)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160033753A1 (en) * 2014-07-31 2016-02-04 Canon Kabushiki Kaisha Image acquiring apparatus
CN106648078A (zh) * 2016-12-05 2017-05-10 北京光年无限科技有限公司 应用于智能机器人的多模态交互方法及系统

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160033753A1 (en) * 2014-07-31 2016-02-04 Canon Kabushiki Kaisha Image acquiring apparatus
CN106648078A (zh) * 2016-12-05 2017-05-10 北京光年无限科技有限公司 应用于智能机器人的多模态交互方法及系统

Also Published As

Publication number Publication date
CN117255982A (zh) 2023-12-19

Similar Documents

Publication Publication Date Title
CA3016921C (en) System and method for deep learning based hand gesture recognition in first person view
US20180224948A1 (en) Controlling a computing-based device using gestures
US9696859B1 (en) Detecting tap-based user input on a mobile device based on motion sensor data
WO2018177379A1 (zh) 手势识别、控制及神经网络训练方法、装置及电子设备
US9734392B2 (en) Image processing device and image processing method
CN110209273A (zh) 手势识别方法、交互控制方法、装置、介质与电子设备
US20140104168A1 (en) Touchless input
KR20100138602A (ko) 실시간으로 피사체의 손을 검출하기 위한 장치 및 방법
WO2014200665A1 (en) Robust tracking using point and line features
US10528145B1 (en) Systems and methods involving gesture based user interaction, user interface and/or other features
US10607069B2 (en) Determining a pointing vector for gestures performed before a depth camera
US20170344104A1 (en) Object tracking for device input
CN103679788A (zh) 一种移动终端中3d图像的生成方法和装置
US10861169B2 (en) Method, storage medium and electronic device for generating environment model
US9760177B1 (en) Color maps for object tracking
CN111598149B (zh) 一种基于注意力机制的回环检测方法
US9377866B1 (en) Depth-based position mapping
CN117581275A (zh) 眼睛注视分类
US20160140762A1 (en) Image processing device and image processing method
CN106569716B (zh) 单手操控方法及操控系统
WO2020061792A1 (en) Real-time multi-view detection of objects in multi-camera environments
WO2023201512A1 (zh) 手势识别方法、交互方法、手势交互系统、电子设备、存储介质
CN117132515A (zh) 一种图像处理方法及电子设备
JP7293362B2 (ja) 撮影方法、装置、電子機器及び記憶媒体
CN103558948A (zh) 一种应用在虚拟光学键盘人机交互方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937750

Country of ref document: EP

Kind code of ref document: A1