CN116820251B - Gesture track interaction method, intelligent glasses and storage medium - Google Patents

Gesture track interaction method, intelligent glasses and storage medium

Info

Publication number
CN116820251B
Authority
CN
China
Prior art keywords: hand, target, gesture, user, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311097207.0A
Other languages
Chinese (zh)
Other versions
CN116820251A (en)
Inventor
刘浩楠
潘仲光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Original Assignee
Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Priority to CN202311097207.0A
Publication of CN116820251A
Application granted
Publication of CN116820251B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 Eye tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The embodiment of the application provides a gesture track interaction method, intelligent glasses and a storage medium. In the method, in response to an initialization event, the hand is photographed while the user rotates it, yielding continuous multiple target hand images of the hand at different angles; a target hand joint tracking strategy is then determined from a plurality of hand joint tracking strategies according to these target hand images. The user's hand motions are photographed to obtain continuous multiple groups of hand images; the user's gesture track is recognized from the groups of hand images; and, according to the gesture track, the user's gesture pattern in real space is predicted and the matched interaction instruction is executed. In this way, for users whose hands differ in completeness, the target key points to be tracked can be determined accurately from each user's individual hand information, so that the interaction instruction the user intends to issue is executed correctly.

Description

Gesture track interaction method, intelligent glasses and storage medium
Technical Field
The application relates to the technical field of virtual reality, in particular to a gesture track interaction method, intelligent glasses and a storage medium.
Background
Virtual Reality (VR) interaction technology is an emerging, comprehensive information technology: with the necessary devices, users can interact with objects in a virtual environment and obtain experiences comparable to those of a real environment.
Currently, users typically interact with virtual reality devices (such as VR and XR glasses) through gestures. However, when a user with a partially missing hand interacts with such a device, the device often cannot tell whether a key point is undetectable because that part of the hand is missing or because it is occluded by an object, and may therefore execute an interaction instruction erroneously. A solution to this problem is needed.
Disclosure of Invention
The application provides a gesture track interaction method, intelligent glasses and a storage medium, so that, for users whose hands differ in completeness, the interaction instruction the user intends to issue can be executed accurately.
The embodiment of the application provides a gesture track interaction method which is suitable for intelligent glasses, wherein the intelligent glasses comprise binocular cameras, and the method comprises the following steps: responding to an initialization event of a user, and shooting the hand through the binocular camera in the process of rotating the hand by the user to obtain continuous multiple target hand images of the hand at different angles; determining a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to the continuous target hand images; the different hand joint tracking strategies are used for tracking target key points on hands under different complete conditions, and the target key points on the hands under different complete conditions are not completely the same; shooting hand actions of the user in a real space through the binocular camera to obtain continuous multiple groups of hand images; any set of hand images including left eye images and right eye images; according to the multiple groups of hand images, recognizing gesture tracks of the user by utilizing the target hand joint tracking strategy; predicting a target gesture pattern of the user in real space according to the gesture track; and executing interaction instructions matched with the target gesture pattern.
Further optionally, determining a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to the continuous plurality of target hand images includes: determining the complete condition of the hand according to the continuous multiple target hand images; and determining a target hand joint tracking strategy from the preset various hand joint tracking strategies according to the corresponding relation between the preset complete condition and the hand joint tracking strategy.
Further optionally, determining the integrity of the hand according to the continuous multiple target hand images includes: determining coordinates of at least one hand key point of the hand under different angles based on the continuous multiple target hand images by using a hand joint tracking algorithm; establishing a hand model according to coordinates of hand key points of the hand under different angles; comparing the hand model with a preset general hand model, and determining the complete condition of the hand according to the comparison result.
Further optionally, identifying the gesture track of the user according to the multiple sets of hand images using the target hand joint tracking strategy includes: determining target key points for recording tracks by utilizing the target hand joint tracking strategy; identifying coordinates of the target key points in a plurality of groups of hand images shot by the binocular cameras by adopting a binocular depth algorithm to obtain a coordinate sequence of the target key points; and determining the gesture track of the user according to the coordinate sequence of the target key point.
Further optionally, determining a target key point for recording a trajectory using the target hand joint tracking strategy includes any one of: when the target hand joint tracking strategy corresponds to the complete condition of hand soundness, responding to target key point configuration operation of a user, and determining the target key point; when the target hand joint tracking strategy corresponds to the complete condition of partial finger deficiency, selecting a key point closest to the missing part on the missing finger as a target key point; when the target hand joint tracking strategy corresponds to the complete condition of single finger deformity or multiple finger deformity, shielding the completely missing fingers, and responding to target key point configuration operation of a user aiming at the unshielded fingers to determine the target key points; and when the target hand joint tracking strategy corresponds to the complete condition of the overall incomplete hand, determining the limb ending of the user closest to the incomplete hand as the target key point.
Further alternatively, a binocular depth algorithm is adopted to identify coordinates of the target key points in the multiple groups of hand images shot by the binocular cameras, and a coordinate sequence of the target key points is obtained, including: identifying the target keypoints from the plurality of sets of hand images; for a group of hand images shot at any moment, determining parallax of the target key points in a left eye hand image and a right eye hand image in any group of hand images; and determining coordinates of the target key points at the shooting time of any group of hand images by adopting a binocular depth algorithm according to the parallax.
Further optionally, determining the gesture track of the user according to the coordinate sequence of the target key point includes: determining coordinates meeting a gesture motion starting condition in the coordinate sequence of the target key points as the starting position of the gesture track; determining coordinates meeting a gesture motion stop condition in the coordinate sequence as the stop position of the gesture track; and determining the gesture track of the user according to the starting position, the stop position and the coordinates between the starting position and the stop position.
Further optionally, the gesture motion starting condition includes: the depth value in the coordinates is greater than a set first depth threshold; the first depth threshold comprises: the depth value of the virtual desktop provided by the intelligent glasses; the gesture motion stop condition includes: the depth value in the coordinates is less than or equal to a set second depth threshold; the second depth threshold is determined from the first depth threshold.
Further optionally, before predicting the target gesture pattern of the user in the real space according to the gesture track, the method further includes: determining a hand image corresponding to any coordinate in the gesture track; performing background recognition on the hand image to judge whether the hands of the user in the hand image are attached to a designated surface or not; if yes, determining the normal direction of the appointed surface by adopting a computer vision algorithm; determining the sight direction of the binocular camera at the shooting moment of the hand image; and carrying out homography transformation correction on the coordinates according to the included angle between the normal direction and the sight direction to obtain corrected coordinates, and updating the gesture track according to the corrected coordinates.
Further optionally, before predicting the target gesture pattern of the user in the real space according to the gesture track, the method further includes: judging whether the intelligent glasses move within the shooting time range of the gesture track according to the movement data of the intelligent glasses; if yes, correcting the gesture track according to the motion vector of the intelligent glasses in the shooting time range, and obtaining the corrected gesture track.
Further optionally, predicting a gesture pattern of the user in the real space according to the gesture track includes: calculating the similarity between the gesture track and gesture patterns in a preset pattern library; and determining a gesture pattern with the highest similarity with the gesture track from the pattern library according to the calculated similarity, and taking the gesture pattern as a target gesture pattern corresponding to the gesture track.
Further optionally, executing the interaction instruction matched with the target gesture pattern includes: determining an interaction mode corresponding to the target gesture pattern according to the corresponding relation between the interaction mode and the gesture pattern; displaying the target gesture pattern and an interaction mode corresponding to the target gesture pattern on a display screen of the intelligent glasses; and responding to the triggering operation of the appointed event, and executing the interaction instruction corresponding to the interaction mode.
The embodiment of the application also provides intelligent glasses, comprising: a camera assembly, a memory and a processor; wherein the memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions to call the camera assembly and execute the steps of the gesture track interaction method.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the steps of the gesture track interaction method.
In this embodiment, the smart glasses may respond to the initialization event and photograph the hand while the user rotates it, obtaining a plurality of continuous target hand images of the hand at different angles, and then determine a target hand joint tracking strategy from a plurality of hand joint tracking strategies according to the target hand images. The user's hand motions are photographed to obtain continuous multiple groups of hand images; the user's gesture track is recognized from the groups of hand images; and, according to the gesture track, the user's gesture pattern in real space is predicted and the matched interaction instruction is executed. In this way, for users whose hands differ in completeness, the target key points to be tracked can be determined accurately from each user's individual hand information, so that the interaction instruction the user intends to issue is executed accurately.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flowchart of a gesture track interaction method according to an exemplary embodiment of the present application;
fig. 2 is a schematic structural diagram of smart glasses according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Currently, users typically interact with virtual reality devices (such as VR and XR glasses) through gestures. However, when a user with a partially missing hand interacts with such a device, the device often cannot tell whether a key point is undetectable because that part of the hand is missing or because it is occluded by an object, and may therefore execute an interaction instruction erroneously.
In view of the foregoing technical problems, in some embodiments of the present application, a solution is provided, and in the following, the technical solutions provided by the embodiments of the present application will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a gesture track interaction method according to an exemplary embodiment of the present application. As shown in fig. 1, the method includes:
Step 11, responding to an initialization event of a user, and shooting the hand by using the binocular camera in the process of rotating the hand by the user to obtain continuous multiple target hand images of the hand at different angles.
Step 12, determining a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to a plurality of continuous target hand images; the different hand joint tracking strategies are used for tracking target key points on hands with different complete conditions, and the target key points on the hands with different complete conditions are not identical.
Step 13, shooting hand actions of a user in a real space through a binocular camera to obtain continuous multiple groups of hand images; any set of hand images includes left eye images and right eye images.
Step 14, recognizing gesture tracks of the user by utilizing a target hand joint tracking strategy according to the plurality of groups of hand images.
Step 15, predicting a target gesture pattern of the user in the real space according to the gesture track.
Step 16, executing interaction instructions matched with the target gesture pattern.
This embodiment may be performed by an intelligent device, which may be implemented as smart glasses provided with a binocular camera, such as VR (Virtual Reality) glasses, MR (Mixed Reality) glasses or a VR head mounted display (HMD); the present embodiment is not limited in this respect. Hereinafter, the smart glasses are taken as the execution subject for an exemplary description.
In this embodiment, in response to an initialization event of a user, in a process of rotating a hand of the user, the hand is photographed by the binocular camera, so as to obtain continuous multiple target hand images of the hand at different angles. The initialization event may be the first use of the smart glasses by the user, or an initialization instruction sent by the user. The binocular camera includes: two monocular cameras. In this way, the intelligent glasses can collect hand information of the hands of the user more completely in the initialization stage.
The intelligent glasses can determine a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to a plurality of continuous target hand images. The hand joint tracking strategies are used for tracking target key points on hands under different complete conditions, and the target key points on the hands under different complete conditions are not identical. Wherein the target key points are used for recording the track. The complete condition of the hand of the user includes, but is not limited to, any of the following: the hand is sound, the fingers are partially incomplete, the single finger is incomplete, the fingers are incomplete, and the whole hand is incomplete.
In other words, the plurality of hand joint tracking strategies are personalized tracking strategies preset for hands of different completeness. Because the complete condition of the hands of the users may be different, the hand joint tracking strategy can individually and accurately determine the target key points to be tracked for the hands of each user. Further, the track of the target key point may be recorded during the subsequent interaction of the user through gestures, which will be described in detail later.
Based on the initializing step, the intelligent glasses can accurately determine the corresponding hand joint tracking strategy for the hand of the user based on the hand image of the user.
After the initialization process is completed, the intelligent glasses can shoot the hand actions of the user in the real space through the binocular camera to obtain continuous multiple groups of hand images. Any group of hand images comprise a left eye image and a right eye image, and the left eye image and the right eye image are respectively shot by two monocular cameras. The continuous image refers to an image having a continuously variable spatial position and gray scale in a two-dimensional coordinate system. In this embodiment, the hand motion may be photographed with a higher photographing frequency, so that the plurality of groups of hand images perceived by the human eye are visually continuous.
The intelligent glasses can identify gesture tracks of users by utilizing a target hand joint tracking strategy according to a plurality of groups of hand images. The intelligent glasses can utilize a target hand joint tracking strategy to identify coordinates of target key points in a plurality of groups of hand images, and form gesture tracks of users according to the coordinates of the target key points.
Then, the intelligent glasses can predict the target gesture pattern of the user in the real space according to the recognized gesture track. The intelligent glasses can screen gesture patterns corresponding to the gesture tracks from gesture patterns in a preset pattern library to serve as target gesture patterns; the gesture pattern prediction model can be used for prediction, wherein the gesture pattern prediction model has the gesture pattern prediction capability after a large number of samples are trained.
In this embodiment, the correspondence between the gesture pattern and the interaction instruction may be preset in advance. For example, a circular gesture pattern corresponds to a screen capture instruction, an x-shaped gesture pattern corresponds to a shutdown instruction, a spiral gesture pattern corresponds to a delete instruction, and so on; these will not be enumerated in detail. Furthermore, the intelligent glasses can match the corresponding interaction instruction to the target gesture pattern according to the preset correspondence between gesture patterns and interaction instructions, and execute the interaction instruction matched with the target gesture pattern.
In this embodiment, the smart glasses may respond to the initialization event and photograph the hand while the user rotates it, obtaining a plurality of continuous target hand images of the hand at different angles, and then determine a target hand joint tracking strategy from a plurality of hand joint tracking strategies according to the target hand images. The user's hand motions are photographed to obtain continuous multiple groups of hand images; the user's gesture track is recognized from the groups of hand images; and, according to the gesture track, the user's gesture pattern in real space is predicted and the matched interaction instruction is executed. In this way, for users whose hands differ in completeness, the target key points to be tracked can be determined accurately from each user's individual hand information, so that the interaction instruction the user intends to issue is executed accurately.
In some alternative embodiments, determining the target hand-joint tracking strategy from the preset plurality of hand-joint tracking strategies according to the continuous plurality of target hand images may be implemented based on the following steps:
step 121, determining the complete condition of the hand according to the continuous multiple target hand images.
The intelligent glasses can determine coordinates of at least one hand key point of the hand at different angles based on the continuous multiple target hand images by using a hand joint tracking algorithm. The hand joint tracking algorithm is used to locate the hand key points of the user's hand in each target hand image, so that the hand key points can be tracked. Then, the intelligent glasses can establish a hand model according to the coordinates of the hand key points at the different angles; the hand model may be built by three-dimensional reconstruction techniques such as NeRF (Neural Radiance Field). The intelligent glasses can acquire a preset general hand model, which is a complete hand model with no missing part, compare the reconstructed hand model with it, and determine the completeness of the hand according to the comparison result. A specific way of determining the hand's completeness may be: the intelligent glasses extract point clouds of the hand model and the general hand model respectively; perform point cloud matching and calculate a chamfer distance or another matching function; and iteratively discard random joint parts of the general hand model and refit it to the hand model until the fit is best, the completeness of the hand then being determined from the fitted general hand model.
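For illustration only, the comparison step above might be sketched in Python as follows; the point-cloud format, the per-joint grouping of the general model and the greedy fitting loop are assumptions of this sketch rather than details fixed by the embodiment.

```python
import numpy as np

def chamfer_distance(pc_a: np.ndarray, pc_b: np.ndarray) -> float:
    """Symmetric chamfer distance between two (N, 3) / (M, 3) point clouds."""
    d2 = np.sum((pc_a[:, None, :] - pc_b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())

def estimate_missing_joints(user_pc: np.ndarray, generic_joint_pcs: dict) -> set:
    """Greedily drop joints of the general hand model while that improves the fit.

    generic_joint_pcs maps a joint name (e.g. 'index_tip') to its points.
    The joints dropped by the loop approximate the missing parts of the hand.
    """
    kept = dict(generic_joint_pcs)
    missing = set()
    best = chamfer_distance(user_pc, np.concatenate(list(kept.values())))
    improved = True
    while improved and len(kept) > 1:
        improved = False
        for name in list(kept):
            trial = np.concatenate([p for n, p in kept.items() if n != name])
            score = chamfer_distance(user_pc, trial)
            if score < best:          # the model fits better without this joint
                best, improved = score, True
                missing.add(name)
                del kept[name]
                break
    return missing
```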
Step 122, determining a target hand joint tracking strategy from a plurality of hand joint tracking strategies according to the corresponding relation between the preset complete condition and the hand joint tracking strategy.
In this way, the intelligent glasses can accurately determine the hand joint tracking strategy based on the hand integrity condition of the user.
In some alternative embodiments, step 14, "identify gesture trajectories of the user using a target hand joint tracking strategy from multiple sets of hand images," may be implemented based on the following steps:
Step 141, determining target key points for recording the track by utilizing the target hand joint tracking strategy. The target key point may be determined in any of the following ways (an illustrative sketch of the selection logic follows the list):
Mode 1, when the target hand joint tracking strategy corresponds to the complete condition of a fully intact hand, the target key point is determined in response to a target key point configuration operation of the user. That is, the target key points may be user configured; for example, the user may configure the target key point as any single key point, such as a fingertip or a first knuckle, or as any several key points, such as a fingertip, a first knuckle and a second knuckle.
Mode 2, when the target hand joint tracking strategy corresponds to the complete condition of a partially missing finger, the key point closest to the missing part of that finger is selected as the target key point. That is, according to the target hand joint tracking strategy, if part of any finger is missing, the key point nearest to the missing part can serve as the target key point and inherit the interaction logic of the finger's most distal joint in the non-missing case.
Mode 3, when the target hand joint tracking strategy corresponds to the complete condition of one or more completely missing fingers, the completely missing fingers are shielded, and the target key points are determined in response to the user's target key point configuration operation on the unshielded fingers. In this way, when the hand key points are tracked by the hand joint tracking algorithm, the algorithm no longer identifies key points on the shielded fingers, which avoids potential false touches.
Mode 4, when the target hand joint tracking strategy corresponds to the complete condition of an entirely missing hand, the limb extremity of the user nearest to the missing hand is determined as the target key point. The nearest limb extremity may be, for example, the end of the wrist or of the elbow.
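For illustration, a minimal Python sketch of dispatching between the four modes above is given below; the condition labels, parameter names and default key point are assumptions of the sketch, not identifiers used by the embodiment.

```python
from enum import Enum, auto

class HandCondition(Enum):
    INTACT = auto()           # Mode 1: fully intact hand
    PARTIAL_FINGER = auto()   # Mode 2: part of a finger is missing
    MISSING_FINGERS = auto()  # Mode 3: one or more fingers fully missing
    WHOLE_HAND = auto()       # Mode 4: the whole hand is missing

def select_target_keypoints(condition, user_choice=None, nearest_to_missing=None,
                            visible_keypoints=(), nearest_limb_end="wrist"):
    """Return the key point(s) whose trajectory will be recorded."""
    if condition is HandCondition.INTACT:
        # Mode 1: the user configures the target key point(s) directly.
        return list(user_choice or ["index_tip"])
    if condition is HandCondition.PARTIAL_FINGER:
        # Mode 2: the key point nearest to the missing part stands in for the
        # finger's most distal joint.
        return [nearest_to_missing]
    if condition is HandCondition.MISSING_FINGERS:
        # Mode 3: fully missing fingers are already shielded, i.e. excluded
        # from visible_keypoints; the user picks among what remains.
        return list(user_choice or visible_keypoints[:1])
    # Mode 4: track the nearest limb extremity (e.g. wrist or elbow) instead.
    return [nearest_limb_end]
```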
Through the various alternatives in step 141, the smart glasses can determine the target key point for recording the track more accurately. Thereafter, step 142 may continue:
Step 142, identifying coordinates of the target key points in the multiple groups of hand images shot by the binocular camera by adopting a binocular depth algorithm to obtain a coordinate sequence of the target key points. The binocular depth algorithm, also called a binocular positioning algorithm or binocular vision algorithm, simulates the principle of human vision and uses a computer to passively sense distance. Its main principle is: an object is observed from two viewpoints, images at different visual angles are obtained, and the position of the object is calculated from the pixel correspondences between the images and the triangulation principle, as described further below.
The smart glasses may identify the target key points from the multiple groups of hand images and determine their coordinates based on the parallax of the target key points in each group. Taking a group of hand images shot at any one time as an example: the parallax of the target key points between the left-eye hand image and the right-eye hand image in that group is determined, where each hand image corresponds to a shooting timestamp. Using a stereo vision matching algorithm and the extrinsic parameters of the binocular camera, the smart glasses can perform stereo matching on the left-eye and right-eye hand images of the group so as to find the matched corresponding points (i.e. the target key points) in both images and calculate the parallax of the target key points between them. The smart glasses can then determine the coordinates of the target key points at the shooting time of that group of hand images by adopting the binocular depth algorithm according to the parallax. Optionally, after determining the coordinates of a target key point, the smart glasses may use their infrared sensor to collect the distance between the infrared sensor and the target key point, and correct the position of the determined coordinates with this distance value. Such position correction reduces the error between the coordinates and the actual position of the target key point, further improving the accuracy of the coordinates.
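As an illustration of the triangulation involved, the following sketch recovers a key point's 3D coordinates from its pixel positions in a rectified left/right pair using the standard pinhole relation Z = f·B/d; the calibration values in the example are placeholders, not parameters of the glasses.

```python
import numpy as np

def keypoint_xyz(u_left, v_left, u_right, fx, fy, cx, cy, baseline_m):
    """3D camera-frame coordinates of a key point from its pixel positions in a
    rectified left/right image pair (matched points share the same image row)."""
    disparity = u_left - u_right                 # parallax in pixels
    if disparity <= 0:
        raise ValueError("non-positive disparity: no valid depth")
    z = fx * baseline_m / disparity              # depth along the optical axis
    x = (u_left - cx) * z / fx                   # back-project into 3D
    y = (v_left - cy) * z / fy
    return np.array([x, y, z])

# Example with placeholder calibration values (not parameters of the glasses).
print(keypoint_xyz(u_left=640.0, v_left=360.0, u_right=610.0,
                   fx=800.0, fy=800.0, cx=640.0, cy=360.0, baseline_m=0.06))
# -> approximately [0.0, 0.0, 1.6] metres
```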
Through the step 142, the intelligent glasses can accurately identify the coordinates of the target key points in the multiple groups of hand images shot by the binocular cameras. Thereafter, the following step 143 may be continued:
step 143, determining a gesture track of the user according to the coordinate sequence of the target key point. Specifically, the intelligent glasses can determine coordinates meeting the gesture motion starting condition in the coordinate sequence of the target key points as the starting position of the gesture track; and determining coordinates meeting the gesture motion stop condition in the coordinate sequence as a stop position of gesture estimation. In some alternative embodiments, the gesture motion initiation condition may include: the depth value in the coordinates is greater than the set first depth threshold. The first depth threshold may be determined based on user-defined settings, or may be determined based on a virtual desktop provided by the smart glasses. The smart glasses may provide a virtual desktop corresponding to a depth value, which may be the first depth threshold. Alternatively, the gesture motion stop condition may be understood as a gesture motion pause condition or a gesture motion end condition, and the gesture stop condition may include: the depth value in the coordinates is less than or equal to the set second depth threshold. Accordingly, the second depth threshold may be determined based on user-defined settings, or may be determined based on a virtual desktop provided by the smart glasses. The depth threshold value in the gesture motion stop condition may be the same as or different from the depth threshold value in the gesture motion start condition. For example, the second depth threshold may be a virtual desktop depth value-2, or a virtual desktop depth value-3, etc., so as to prevent erroneous judgment caused by excessive user operation.
In other words, the coordinates in the coordinate sequence with the depth value greater than the set depth threshold value are the start positions of the gesture tracks, and the coordinates in the coordinate sequence with the depth value less than or equal to the set depth threshold value are the stop positions of the gesture tracks.
The smart glasses may then determine a gesture trajectory of the user based on the start position, the stop position, and coordinates between the start position and the stop position. The intelligent glasses can perform curve fitting on the starting position, the stopping position and coordinates between the starting position and the stopping position to form a gesture track of a user. In this way, the intelligent glasses can accurately identify gesture tracks of users according to the plurality of groups of hand images.
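For illustration, the sketch below segments a coordinate sequence with the start/stop depth conditions described above and fits a smooth curve through the segment; the use of a smoothing spline for the curve fitting, like the threshold handling, is an assumption of the sketch.

```python
import numpy as np
from scipy.interpolate import splev, splprep

def extract_trajectory(coords, start_depth, stop_depth):
    """coords: (N, 3) coordinate sequence of the target key point (x, y, depth).

    The first coordinate whose depth exceeds start_depth opens the gesture; the
    first later coordinate whose depth drops to stop_depth or below ends it.
    """
    depth = coords[:, 2]
    started = np.flatnonzero(depth > start_depth)
    if started.size == 0:
        return None                                 # the gesture never started
    start_idx = started[0]
    stopped = np.flatnonzero(depth[start_idx:] <= stop_depth)
    stop_idx = start_idx + stopped[0] if stopped.size else len(depth) - 1
    segment = coords[start_idx:stop_idx + 1]
    if len(segment) < 4:                            # too short for a cubic spline
        return segment
    # Curve fitting: a parametric smoothing spline through the segment.
    tck, _ = splprep(segment.T, s=1e-4)
    u = np.linspace(0.0, 1.0, 100)
    return np.stack(splev(u, tck), axis=1)          # resampled gesture track
```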
The steps 141-143 are described in detail above, and through these steps, the smart glasses can accurately identify the gesture tracks of the user with different hand integrity conditions according to the multiple sets of hand images by using the target hand joint tracking strategy.
In some optional embodiments, before predicting the target gesture pattern of the user in the real space according to the gesture track, the smart glasses may further correct the gesture track according to the hand image corresponding to the gesture track, so as to improve accuracy of the identified gesture track. An exemplary description will be made below.
In some embodiments a, the gesture track may be modified based on the background of the hand image.
And correcting the gesture track, including correcting any coordinate in the gesture track. The intelligent glasses can determine a hand image corresponding to any coordinate in the gesture track. The intelligent glasses can search the region where the coordinates are located to obtain the hand image.
Then, the smart glasses can perform background recognition on the hand image to identify whether the user's hand is attached to a surface (as opposed to being suspended in mid-air) and, if so, which surface it is attached to, for example a desk, a display screen, a sheet of paper, a wall, a chair or a blackboard. Furthermore, the smart glasses can judge whether the surface to which the user's hand is attached is a designated surface, where the designated surface may be any surface such as a display screen, a paper surface, a wall surface, a table surface or a chair surface.
Taking any coordinate and its corresponding hand image as an example: if it is determined from the hand image that the user's hand is attached to the designated surface, the normal direction of the designated surface can be determined by a computer vision algorithm. The smart glasses may also determine the gaze direction of the binocular camera at the shooting time of the hand image; the gaze direction may be determined from the parameters of the inertial measurement unit of the smart glasses, which may include, but is not limited to, a gyroscope. Then, the smart glasses can perform a homography transformation correction on the coordinate according to the included angle between the determined normal direction and the gaze direction, obtaining the corrected coordinate. The homography transformation maps points in one plane (the designated surface) into another plane (the viewing plane perpendicular to the line of sight of the smart glasses); the point to which the coordinate is mapped in the second plane is the corrected coordinate. Based on the above steps, the smart glasses may update the gesture track according to the corrected coordinates.
In this way, by means of the homography transformation correction, the smart glasses can correct the gesture track to what the human eye would see under perpendicular viewing, preventing distortion of the gesture track caused by the viewing angle.
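For illustration, the correction might be sketched as a rotation-only homography that turns the recognized surface normal into the viewing direction, as below; modelling the correction this way, and the intrinsics used in the example, are assumptions of the sketch rather than the embodiment's prescribed computation.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix turning unit vector a onto unit vector b (Rodrigues form;
    assumes a and b are not exactly opposite)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v, c = np.cross(a, b), float(np.dot(a, b))
    if np.isclose(c, 1.0):
        return np.eye(3)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def correct_point(pt_px, surface_normal, sight_dir, K):
    """Warp one pixel coordinate of the gesture track to a fronto-parallel view.

    The correction is modelled here as the rotation-only homography H = K R K^-1,
    where R turns the recognized surface normal into the viewing direction.
    """
    R = rotation_between(surface_normal, sight_dir)
    H = K @ R @ np.linalg.inv(K)
    p = H @ np.array([pt_px[0], pt_px[1], 1.0])
    return p[:2] / p[2]

# Placeholder intrinsics and directions (assumed, not taken from the glasses).
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
print(correct_point((700.0, 400.0),
                    surface_normal=np.array([0.0, 0.3, 1.0]),
                    sight_dir=np.array([0.0, 0.0, 1.0]), K=K))
```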
In some practical scenarios, the user may move the head during wearing the smart glasses, so that a deviation exists between the gesture track collected by the smart glasses and the real track.
Therefore, in some optional embodiments B, before predicting the target gesture pattern of the user in the real space according to the gesture track, the smart glasses may further determine whether the smart glasses move within the shooting time range of the gesture track according to the movement data of the smart glasses, and correct the gesture track according to the movement data of the smart glasses.
The intelligent glasses can acquire motion data by using the inertial measurement unit, wherein the motion data can be a moving direction of the intelligent glasses in a shooting time range and a moving amount of the intelligent glasses in the moving direction. The motion data may be determined from a rotation matrix and a translation matrix of an inertial measurement unit of the smart device. Wherein the rotation matrix is used to determine the direction of movement and the translation matrix is used to determine the amount of movement in a certain direction.
If the intelligent glasses are judged to move within the shooting time range of the gesture track, the gesture track can be corrected according to the motion vector of the intelligent glasses within the shooting time range, and the corrected gesture track is obtained. For example, a rotation matrix and a translation matrix of the inertial measurement unit in a shooting time range may be superimposed on the gesture track to obtain a new gesture track, so as to correct the gesture track.
In this way, the gesture trajectory may be modified to reduce the deviation between the gesture trajectory and the real trajectory due to the shaking of the user's head.
In some alternative embodiments C, the gesture trajectory may be modified based on hand images identified by multiple sets of binocular cameras.
The smart glasses may be provided with multiple groups of binocular cameras. The smart glasses can determine several gesture tracks from the hand images obtained by the groups of binocular cameras synchronously shooting the user's hand, and compensate these gesture tracks against each other. For any one of the gesture tracks, the smart glasses may interpolate its missing parts using another of the gesture tracks, obtaining optimized gesture tracks; in other words, the gesture tracks complement one another so that each becomes more complete. On this basis, the smart glasses may compute a weighted average of the gesture tracks to obtain an average gesture track.
Optionally, when the frame rate of the smart glasses is higher than a first preset threshold, for example, higher than 60FPS, the smart glasses may smooth the average gesture track, so that the gesture track is smoother. Optionally, when the frame rate of the smart glasses is lower than the second preset threshold, the smart glasses may also smooth the average gesture track. The second threshold may be less than or equal to the first threshold, which is not limited in this embodiment.
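A minimal sketch of this fusion step is given below; the NaN convention for missing samples, the weighting and the moving-average smoothing window are assumptions of the sketch rather than details specified by the embodiment.

```python
import numpy as np

def fuse_tracks(tracks, weights=None, smooth_window=5):
    """Fuse gesture tracks from several binocular camera groups.

    tracks: list of (N, 3) arrays sampled at the same timestamps; NaN rows mark
    samples a camera group missed. Missing samples are filled from the other
    tracks, the tracks are weight-averaged, and the result is smoothed with a
    moving average.
    """
    stacked = np.stack(tracks)                          # (G, N, 3)
    weights = np.ones(len(tracks)) if weights is None else np.asarray(weights, float)
    # Interpolate missing parts of each track from the per-sample mean of the rest.
    sample_mean = np.nanmean(stacked, axis=0)           # (N, 3)
    filled = np.where(np.isnan(stacked), sample_mean, stacked)
    fused = np.einsum("g,gnd->nd", weights, filled) / weights.sum()
    # Moving-average smoothing along the time axis.
    kernel = np.ones(smooth_window) / smooth_window
    return np.stack([np.convolve(fused[:, d], kernel, mode="same")
                     for d in range(fused.shape[1])], axis=1)
```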
In some alternative embodiments, the smart glasses predict the gesture pattern of the user in the real space according to the gesture track, which can be achieved based on the following steps:
step 151, calculating the similarity between the gesture track and the gesture pattern in the preset pattern library. Wherein, gesture patterns in the pattern library can be added, deleted or modified by user definition. For example, the similarity of the gesture track calculated by the smart glasses to the gesture patterns A1-A6 in the pattern library is 20%, 0%, 30%, 40%, 70% and 95%, respectively.
Step 152, determining a gesture pattern with the highest similarity to the gesture track from the pattern library according to the calculated similarity, and using the gesture pattern as a target gesture pattern corresponding to the gesture track. Along the foregoing examples, the smart glasses may determine, from the pattern library, a gesture pattern A6 having the highest similarity to the gesture track as the target gesture pattern corresponding to the gesture track.
In this way, the smart glasses can perform approximation matching in the pattern library for the gesture track so as to accurately predict the gesture pattern of the user in the real space.
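For illustration, one way to realize such a similarity score is to resample and normalize both curves and compare them point by point, as sketched below; the concrete measure is an assumption, since the embodiment does not fix one.

```python
import numpy as np

def _resample(points, n=64):
    """Resample a polyline to n points, evenly spaced by arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t = t / t[-1] if t[-1] > 0 else np.linspace(0.0, 1.0, len(points))
    u = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(u, t, points[:, d]) for d in range(points.shape[1])],
                    axis=1)

def _normalise(points):
    p = points - points.mean(axis=0)        # remove position
    scale = np.abs(p).max()
    return p / scale if scale > 0 else p    # remove size

def similarity(track, pattern):
    """Similarity in (0, 1]: 1 means identical shape after normalisation.
    track and pattern must have the same dimensionality (e.g. both 2D)."""
    a = _normalise(_resample(np.asarray(track)))
    b = _normalise(_resample(np.asarray(pattern)))
    return float(1.0 / (1.0 + np.linalg.norm(a - b, axis=1).mean()))

def best_pattern(track, pattern_library):
    """Name of the library pattern whose similarity to the track is highest."""
    return max(pattern_library, key=lambda name: similarity(track, pattern_library[name]))
```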
In some alternative embodiments, the smart glasses execute the interaction instruction matched with the target gesture pattern, which can be implemented based on the following steps:
step 161, determining an interaction mode corresponding to the target gesture pattern according to the correspondence between the interaction mode and the gesture pattern.
Step 162, displaying the target gesture pattern and the interaction mode corresponding to the target gesture pattern on the display screen of the smart glasses.
Step 163, responding to the triggering operation of the designated event, and executing the interaction instruction corresponding to the interaction mode.
The interaction modes include, but are not limited to: opening a designated program, returning to the previous-level menu, entering the next-level menu, play, pause, shutdown, and so on. The smart glasses may display the target gesture pattern and its corresponding interaction mode on the display screen. The designated event may be a user event, such as a click event or a voice confirmation event of the user, or a system event, such as the end of a countdown that starts after the target gesture pattern and its corresponding interaction mode are displayed, or the end of a countdown associated with each interaction mode. Different interaction modes may have priorities, and the countdown durations differ between priorities.
In this way, the smart glasses may execute an interaction instruction matching the target gesture pattern that is closer to the user's intent.
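For illustration, a minimal sketch of the pattern-to-mode mapping with a priority-dependent confirmation countdown is given below; the concrete patterns, modes, priorities and countdown durations are placeholders, not values configured on the glasses.

```python
import time

# Illustrative correspondence table: gesture pattern -> (interaction mode, priority).
INTERACTION_MODES = {
    "circle": ("screen_capture", 1),
    "cross": ("shutdown", 3),
    "spiral": ("delete", 2),
}
# Higher-priority modes get a longer confirmation countdown (assumed durations).
COUNTDOWN_SECONDS = {1: 1.0, 2: 2.0, 3: 3.0}

def confirm_and_execute(target_pattern, user_confirmed):
    """Display the matched mode, wait for a user confirmation event or for the
    countdown to end, then return the interaction instruction to execute."""
    mode, priority = INTERACTION_MODES[target_pattern]
    deadline = time.monotonic() + COUNTDOWN_SECONDS[priority]
    print(f"pattern={target_pattern} -> mode={mode} (confirm or wait)")
    while time.monotonic() < deadline:
        if user_confirmed():             # e.g. a click or voice confirmation event
            break
        time.sleep(0.05)
    return mode

# Example: no explicit confirmation, so the countdown simply runs out.
print(confirm_and_execute("circle", user_confirmed=lambda: False))
```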
In some optional embodiments, in response to an initialization event of a user, in a process of rotating a hand of the user, shooting the hand by using a binocular camera to obtain continuous multiple target hand images of the hand at different angles, which can be achieved based on the following ways:
in response to an initialization event of a user, shooting the hand through a binocular camera in the process of rotating the hand by the user to obtain a plurality of first hand images; screening a plurality of second hand images with definition higher than a preset definition threshold from the first hand images; based on an image recognition algorithm or a hand joint tracking algorithm, whether the hand of the user in each second hand image meets preset gesture conditions or not is recognized. The preset posture conditions at least comprise: the hands of the user are completely flattened. Then, the intelligent glasses can screen out hand images meeting preset gesture conditions from the second hand images to serve as continuous target hand images.
In some alternative embodiments, the smart glasses may also take, as training samples, the continuous target hand images of the user's hand at different angles together with the finally determined target key points of the user's hand, and perform transfer training on a preset hand joint tracking model with these training samples, thereby obtaining a personalized hand joint tracking model. The personalized hand joint tracking model obtained by training is able to determine target key points from the user's hand images, and can accurately determine the target key points to be tracked for users whose hands differ in completeness.
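For illustration, transfer training of such a model could look like the following sketch, which freezes a generic backbone and fine-tunes only the key-point head on the user's samples; the model architecture, sample format and hyperparameters are placeholders, not the model actually deployed on the glasses.

```python
import torch
from torch import nn

# Stand-in for a pretrained hand joint tracking model: a feature backbone plus a
# head that regresses 21 key points (x, y). The model actually deployed on the
# glasses is not specified by the embodiment.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten()),   # backbone
    nn.Linear(16, 21 * 2),                                  # key-point head
)

def transfer_train(model, samples, epochs=5, lr=1e-4):
    """Fine-tune only the key-point head on the user's (image, keypoints) samples."""
    for p in model[0].parameters():          # keep the generic features frozen
        p.requires_grad = False
    opt = torch.optim.Adam(model[1].parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for image, keypoints in samples:     # tensors: (1, 3, H, W) and (1, 42)
            opt.zero_grad()
            loss = loss_fn(model(image), keypoints)
            loss.backward()
            opt.step()
    return model

# Example with a single synthetic sample (placeholder data).
transfer_train(model, [(torch.rand(1, 3, 128, 128), torch.rand(1, 42))])
```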
The gesture track interaction method will be further described below in connection with the actual scenario.
In an actual scene, after the completeness of the hand is determined, the smart glasses can, in the case of a partially missing finger, find the key point nearest to the missing part, define it as the distal joint, and let it inherit the interaction logic of the finger's most distal joint in the non-missing case. If one or more fingers are entirely missing, the missing joints are shielded in the general hand joint tracking algorithm so that the algorithm no longer identifies the missing fingers, avoiding potential false touches. In the cases of an intact hand, a partially missing finger joint, or one or more missing fingers, the three-dimensionally reconstructed hand model is retrained with transfer learning to obtain a personalized hand joint tracking model. If the fingers are completely missing or the whole hand is missing, the hand joint tracking algorithm for that hand is cancelled and replaced with a target detection algorithm that detects limb extremities; optionally, the interaction logic is changed to tracking and recognizing the three-dimensional motion track of the limb extremity.
In an actual scene, after determining the target key point, the smart glasses can decide from which frame to start tracking and record the motion track of every subsequent frame until the end. For each frame, the data of the three-axis accelerometer and of the inertial measurement unit are acquired, and the position and orientation of the inertial measurement unit for that frame are calculated from them. The position and orientation of the initial frame's inertial measurement unit in the current frame's coordinate system are calculated; translation and rotation matrices are generated; the inverses of the translation and rotation matrices are calculated; and the current frame's track coordinates are matrix-multiplied by these inverses to obtain the corrected coordinates. Through this concrete motion compensation and hand joint tracking procedure, the smart glasses can track the target key point and thereby obtain the user's gesture track.
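For illustration, the compensation of a single track point might be sketched as below; the convention that the rotation and translation describe the initial-frame IMU pose in the current frame, and the example values, are assumptions of the sketch.

```python
import numpy as np

def compensate_point(pt_current, R_init_in_current, t_init_in_current):
    """Map a track point expressed in the current frame's coordinate system back
    into the initial frame's coordinate system.

    R_init_in_current / t_init_in_current describe the pose of the initial-frame
    IMU as seen from the current frame (rotation matrix and translation vector
    assembled from the IMU and three-axis accelerometer data); applying their
    inverse removes the head motion accumulated between the two frames.
    """
    T = np.eye(4)
    T[:3, :3] = R_init_in_current
    T[:3, 3] = t_init_in_current
    p = np.array([pt_current[0], pt_current[1], pt_current[2], 1.0])
    return (np.linalg.inv(T) @ p)[:3]        # inverse of the accumulated motion

# Example: the head yawed 10 degrees and moved 2 cm along x since the initial frame.
yaw = np.deg2rad(10.0)
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw),  np.cos(yaw), 0.0],
              [0.0, 0.0, 1.0]])
print(compensate_point(np.array([0.1, 0.0, 0.5]), R, np.array([0.02, 0.0, 0.0])))
```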
It should be noted that, the execution subjects of each step of the method provided in the above embodiment may be the same device, or the method may also be executed by different devices. For example, the execution subject of steps 11 to 14 may be the device a; for another example, the execution subject of steps 11 and 12 may be device a, and the execution subject of steps 13 and 14 may be device B; etc.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a specific order are included, but it should be clearly understood that the operations may be performed out of the order in which they appear herein or performed in parallel, the sequence numbers of the operations such as 11, 12, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
Fig. 2 is a schematic structural diagram of smart glasses according to an exemplary embodiment of the present application, as shown in fig. 2, the smart glasses include: camera assembly 201, memory 202, and processor 203.
Memory 202 is used to store computer programs and may be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, video, etc.
In some embodiments, the processor 203 is coupled to the memory 202 for executing the computer program in the memory 202 for: responding to an initialization event of a user, and shooting the hand through the binocular camera in the process of rotating the hand by the user to obtain continuous multiple target hand images of the hand at different angles; determining a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to the continuous target hand images; the different hand joint tracking strategies are used for tracking target key points on hands under different complete conditions, and the target key points on the hands under different complete conditions are not completely the same; shooting hand actions of the user in a real space through the binocular camera to obtain continuous multiple groups of hand images; any set of hand images including left eye images and right eye images; according to the multiple groups of hand images, recognizing gesture tracks of the user by utilizing the target hand joint tracking strategy; predicting a target gesture pattern of the user in real space according to the gesture track; and executing interaction instructions matched with the target gesture pattern.
Further optionally, when the processor 203 determines the target hand joint tracking strategy from the preset plurality of hand joint tracking strategies according to the continuous plurality of target hand images, the method is specifically used for: determining the complete condition of the hand according to the continuous multiple target hand images; and determining a target hand joint tracking strategy from the preset various hand joint tracking strategies according to the corresponding relation between the preset complete condition and the hand joint tracking strategy.
Further optionally, the processor 203 is specifically configured to, when determining the integrity of the hand according to the continuous multiple target hand images: determining coordinates of at least one hand key point of the hand under different angles based on the continuous multiple target hand images by using a hand joint tracking algorithm; establishing a hand model according to coordinates of hand key points of the hand under different angles; comparing the hand model with a preset general hand model, and determining the complete condition of the hand according to the comparison result.
Further optionally, the processor 203 is specifically configured to, when identifying the gesture track of the user according to the multiple sets of hand images and using the target hand joint tracking strategy: determining target key points for recording tracks by utilizing the target hand joint tracking strategy; identifying coordinates of the target key points in a plurality of groups of hand images shot by the binocular cameras by adopting a binocular depth algorithm to obtain a coordinate sequence of the target key points; and determining the gesture track of the user according to the coordinate sequence of the target key point.
Further optionally, the processor 203 uses the target hand joint tracking strategy to determine a target key point for recording a trajectory, specifically for any of the following: when the target hand joint tracking strategy corresponds to the complete condition of hand soundness, responding to target key point configuration operation of a user, and determining the target key point; when the target hand joint tracking strategy corresponds to the complete condition of partial finger deficiency, selecting a key point closest to the missing part on the missing finger as a target key point; when the target hand joint tracking strategy corresponds to the complete condition of single finger deformity or multiple finger deformity, shielding the completely missing fingers, and responding to target key point configuration operation of a user aiming at the unshielded fingers to determine the target key points; and when the target hand joint tracking strategy corresponds to the complete condition of the overall incomplete hand, determining the limb ending of the user closest to the incomplete hand as the target key point.
Further optionally, the processor 203 identifies coordinates of the target key point in the multiple groups of hand images captured by the binocular camera by using a binocular depth algorithm, and when obtaining a coordinate sequence of the target key point, the method is specifically used for: identifying the target keypoints from the plurality of sets of hand images; for a group of hand images shot at any moment, determining parallax of the target key points in a left eye hand image and a right eye hand image in any group of hand images; and determining coordinates of the target key points at the shooting time of any group of hand images by adopting a binocular depth algorithm according to the parallax.
Further optionally, when the processor 203 determines the gesture track of the user according to the coordinate sequence of the target key point, it is specifically configured to: determine the coordinates meeting a gesture motion starting condition in the coordinate sequence of the target key points as the starting position of the gesture track; determine the coordinates meeting a gesture motion stop condition in the coordinate sequence as the stop position of the gesture track; and determine the gesture track of the user according to the starting position, the stop position and the coordinates between the starting position and the stop position.
Further optionally, the gesture motion starting condition includes: the depth value in the coordinates is greater than a set first depth threshold; the first depth threshold comprises: the depth value of the virtual desktop provided by the intelligent glasses; the gesture motion stop condition includes: the depth value in the coordinates is less than or equal to a set second depth threshold; the second depth threshold is determined from the first depth threshold.
Optionally, the processor 203 is further configured to, before predicting the target gesture pattern of the user in the real space according to the gesture track: determining a hand image corresponding to any coordinate in the gesture track; performing background recognition on the hand image to judge whether the hands of the user in the hand image are attached to a designated surface or not; if yes, determining the normal direction of the appointed surface by adopting a computer vision algorithm; determining the sight direction of the binocular camera at the shooting moment of the hand image; and carrying out homography transformation correction on the coordinates according to the included angle between the normal direction and the sight direction to obtain corrected coordinates, and updating the gesture track according to the corrected coordinates.
Optionally, before predicting the target gesture pattern of the user in the real space according to the gesture track, the processor 203 is further configured to: judge, according to motion data of the smart glasses, whether the smart glasses moved within the capture time range of the gesture track; and if so, correct the gesture track according to the motion vector of the smart glasses within that time range to obtain a corrected gesture track.
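As a sketch only, with motion_at(t) standing in for whatever motion data the smart glasses expose (for example, integrated IMU readings); the name and signature are assumptions of this example, not an API of the device:

    def compensate_glasses_motion(track, timestamps, motion_at):
        corrected = []
        for (x, y, z), t in zip(track, timestamps):
            dx, dy, dz = motion_at(t)                    # displacement of the glasses since the track began
            corrected.append((x + dx, y + dy, z + dz))   # undo the apparent shift the motion introduces
        return corrected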
Optionally, when the processor 203 predicts the gesture pattern of the user in the real space according to the gesture track, it is specifically configured to: calculate the similarity between the gesture track and each gesture pattern in a preset pattern library; and determine, according to the calculated similarities, the gesture pattern in the pattern library with the highest similarity to the gesture track as the target gesture pattern corresponding to the gesture track.
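The embodiment does not fix a particular similarity measure. One common choice, sketched below under that assumption, is to resample the (x, y) projection of the track to a fixed number of points, normalise position and scale, and score each library pattern by the mean point-to-point distance (a lower distance meaning a higher similarity):

    import numpy as np

    def resample(points, n=32):
        points = np.asarray(points, dtype=float)
        seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
        arc = np.concatenate(([0.0], np.cumsum(seg)))            # cumulative arc length
        targets = np.linspace(0.0, arc[-1], n)
        return np.column_stack([np.interp(targets, arc, points[:, 0]),
                                np.interp(targets, arc, points[:, 1])])

    def normalise(points):
        points = points - points.mean(axis=0)                    # translate to the centroid
        scale = np.abs(points).max()
        return points / scale if scale > 0 else points

    def best_matching_pattern(track_xy, pattern_library):
        """pattern_library: dict of pattern name -> list of (x, y) points."""
        probe = normalise(resample(track_xy))
        distances = {name: np.linalg.norm(probe - normalise(resample(p)), axis=1).mean()
                     for name, p in pattern_library.items()}
        return min(distances, key=distances.get)                 # highest-similarity pattern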
Optionally, when the processor 203 executes the interaction instruction matched with the target gesture pattern, it is specifically configured to: determine the interaction mode corresponding to the target gesture pattern according to a correspondence between interaction modes and gesture patterns; display the target gesture pattern and its corresponding interaction mode on a display screen of the smart glasses; and execute the interaction instruction corresponding to the interaction mode in response to a trigger operation of a designated event.
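Illustration only: the gesture-to-interaction table below is invented for this example, and display, wait_for_trigger, and execute are placeholders for the corresponding capabilities of the glasses rather than real APIs:

    INTERACTION_MODES = {          # hypothetical correspondence between gesture patterns and interaction modes
        "circle": "open_menu",
        "check_mark": "confirm",
        "letter_Z": "undo",
    }

    def handle_target_pattern(pattern, display, wait_for_trigger, execute):
        mode = INTERACTION_MODES.get(pattern)
        if mode is None:
            return
        display(pattern, mode)     # show the pattern and its interaction mode on the glasses' screen
        if wait_for_trigger():     # the designated trigger event, e.g. a pinch or a voice confirmation
            execute(mode)          # execute the matching interaction instruction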
Further, as shown in fig. 2, the smart glasses further include other components such as a communication component 204 and a display component 205. Only some of the components are schematically shown in fig. 2, which does not mean that the smart glasses include only the components shown in fig. 2.
The memory in fig. 2 described above may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The communication assembly of fig. 2 is configured to facilitate wired or wireless communication between the device in which the communication assembly is located and other devices. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In this embodiment, the smart glasses can, in response to an initialization event, capture the hand while the user rotates it, to obtain a plurality of continuous target hand images of the hand at different angles, and determine a target hand joint tracking strategy from a plurality of hand joint tracking strategies according to the target hand images. The smart glasses then capture the hand motions of the user to obtain continuous multiple groups of hand images, recognize the gesture track of the user from these images, predict the gesture pattern of the user in the real space according to the gesture track, and execute the matching interaction instruction. In this way, for users whose hands are in different integrity conditions, the target key point to be tracked can be accurately determined according to the user's individual hand information, so that the interaction instruction the user intends to issue can be executed accurately.
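How the integrity condition might be derived from the multi-angle hand images (detailed in claim 3 below) can be pictured with the following sketch; the joint naming, the per-finger comparison against a generic hand model, and the returned labels are all assumptions of this illustration:

    # finger -> joint names expected by a generic hand model (assumed naming)
    GENERIC_HAND = {
        "thumb":  ["thumb_mcp", "thumb_ip", "thumb_tip"],
        "index":  ["index_mcp", "index_pip", "index_dip", "index_tip"],
        "middle": ["middle_mcp", "middle_pip", "middle_dip", "middle_tip"],
        "ring":   ["ring_mcp", "ring_pip", "ring_dip", "ring_tip"],
        "little": ["little_mcp", "little_pip", "little_dip", "little_tip"],
    }

    def classify_integrity(detected_joints):
        """detected_joints: set of joint names seen consistently across the rotated-hand images."""
        if not detected_joints:
            return "hand_absent"
        missing = {f: [j for j in joints if j not in detected_joints]
                   for f, joints in GENERIC_HAND.items()}
        fully_missing = [f for f, m in missing.items() if len(m) == len(GENERIC_HAND[f])]
        partly_missing = [f for f, m in missing.items() if 0 < len(m) < len(GENERIC_HAND[f])]
        if fully_missing:
            return "missing_fingers"
        if partly_missing:
            return "partial_finger"
        return "intact"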
Accordingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when executed, the computer program can implement the steps executable by the smart glasses in the above method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (12)

1. A gesture track interaction method, applicable to smart glasses, wherein the smart glasses comprise a binocular camera, the method comprising:
in response to an initialization event of a user, capturing the hand through the binocular camera while the user rotates the hand, to obtain a plurality of continuous target hand images of the hand at different angles;
determining a target hand joint tracking strategy from a plurality of preset hand joint tracking strategies according to the continuous target hand images, wherein different hand joint tracking strategies are used for tracking target key points on hands under different integrity conditions, and the target key points on hands under different integrity conditions are not entirely the same;
capturing hand motions of the user in the real space through the binocular camera to obtain continuous multiple groups of hand images, wherein any group of hand images includes a left-eye image and a right-eye image;
recognizing a gesture track of the user by using the target hand joint tracking strategy according to the multiple groups of hand images;
predicting a target gesture pattern of the user in the real space according to the gesture track;
executing an interaction instruction matched with the target gesture pattern;
wherein recognizing the gesture track of the user by using the target hand joint tracking strategy according to the multiple groups of hand images comprises: determining a target key point for recording a track by using the target hand joint tracking strategy; identifying coordinates of the target key point in the multiple groups of hand images captured by the binocular camera by using a binocular depth algorithm, to obtain a coordinate sequence of the target key point; and determining the gesture track of the user according to the coordinate sequence of the target key point;
wherein determining the target key point for recording the track by using the target hand joint tracking strategy comprises: when the target hand joint tracking strategy corresponds to the integrity condition of a fully sound hand, determining the target key point in response to a target key point configuration operation of the user; when the target hand joint tracking strategy corresponds to the integrity condition of a partially missing finger, selecting the key point on the affected finger closest to the missing part as the target key point; when the target hand joint tracking strategy corresponds to the integrity condition of a single missing finger or multiple missing fingers, masking the completely missing fingers and determining the target key point in response to a target key point configuration operation of the user on the unmasked fingers; and when the target hand joint tracking strategy corresponds to the integrity condition of an entirely missing hand, determining the limb extremity of the user closest to the missing hand as the target key point.
2. The method of claim 1, wherein determining the target hand joint tracking strategy from the plurality of preset hand joint tracking strategies according to the continuous target hand images comprises:
determining the integrity condition of the hand according to the continuous target hand images; and
determining the target hand joint tracking strategy from the plurality of preset hand joint tracking strategies according to a preset correspondence between integrity conditions and hand joint tracking strategies.
3. The method of claim 2, wherein determining the integrity condition of the hand according to the continuous target hand images comprises:
determining coordinates of at least one hand key point of the hand at different angles from the continuous target hand images by using a hand joint tracking algorithm;
establishing a hand model according to the coordinates of the hand key points at different angles; and
comparing the hand model with a preset general hand model, and determining the integrity condition of the hand according to the comparison result.
4. The method of claim 1, wherein identifying the coordinates of the target key point in the multiple groups of hand images captured by the binocular camera by using a binocular depth algorithm to obtain the coordinate sequence of the target key point comprises:
identifying the target key point from the multiple groups of hand images;
for a group of hand images captured at any moment, determining the disparity of the target key point between the left-eye hand image and the right-eye hand image in that group; and
determining, according to the disparity and by using the binocular depth algorithm, the coordinates of the target key point at the capture moment of that group of hand images.
5. The method of claim 1, wherein determining the gesture track of the user according to the coordinate sequence of the target key point comprises:
determining, from the coordinate sequence of the target key point, a coordinate that meets a gesture motion start condition as the start position of the gesture track, and determining, from the coordinate sequence, a coordinate that meets a gesture motion stop condition as the stop position of the gesture track; and
determining the gesture track of the user according to the start position, the stop position, and the coordinates between the start position and the stop position.
6. The method of claim 5, wherein the gesture motion start condition includes: the depth value in the coordinate is greater than a set first depth threshold, the first depth threshold including the depth value of the virtual desktop provided by the smart glasses; and the gesture motion stop condition includes: the depth value in the coordinate is less than or equal to a set second depth threshold, the second depth threshold being determined from the first depth threshold.
7. The method of claim 1, wherein before predicting the target gesture pattern of the user in the real space according to the gesture track, the method further comprises:
determining the hand image corresponding to any coordinate in the gesture track;
performing background recognition on the hand image to judge whether the user's hand in the hand image is resting on a designated surface;
if so, determining the normal direction of the designated surface by using a computer vision algorithm;
determining the line-of-sight direction of the binocular camera at the capture moment of the hand image; and
performing homography transformation correction on the coordinate according to the angle between the normal direction and the line-of-sight direction to obtain a corrected coordinate, and updating the gesture track according to the corrected coordinate.
8. The method of claim 1, wherein before predicting the target gesture pattern of the user in the real space according to the gesture track, the method further comprises:
judging, according to motion data of the smart glasses, whether the smart glasses moved within the capture time range of the gesture track; and
if so, correcting the gesture track according to the motion vector of the smart glasses within the capture time range to obtain a corrected gesture track.
9. The method of any one of claims 1-8, wherein predicting the target gesture pattern of the user in the real space according to the gesture track comprises:
calculating the similarity between the gesture track and each gesture pattern in a preset pattern library; and
determining, according to the calculated similarities, the gesture pattern in the pattern library with the highest similarity to the gesture track as the target gesture pattern corresponding to the gesture track.
10. The method of any one of claims 1-8, wherein executing the interaction instruction matched with the target gesture pattern comprises:
determining the interaction mode corresponding to the target gesture pattern according to a correspondence between interaction modes and gesture patterns;
displaying the target gesture pattern and its corresponding interaction mode on a display screen of the smart glasses; and
executing the interaction instruction corresponding to the interaction mode in response to a trigger operation of a designated event.
11. Smart glasses, comprising: a camera assembly, a memory, and a processor; wherein the memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions to invoke the camera assembly and perform the steps of the method of any one of claims 1-10.
12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to implement the steps of the method of any one of claims 1-10.
CN202311097207.0A 2023-08-28 2023-08-28 Gesture track interaction method, intelligent glasses and storage medium Active CN116820251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311097207.0A CN116820251B (en) 2023-08-28 2023-08-28 Gesture track interaction method, intelligent glasses and storage medium

Publications (2)

Publication Number Publication Date
CN116820251A CN116820251A (en) 2023-09-29
CN116820251B true CN116820251B (en) 2023-11-07

Family

ID=88126067

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629283A (en) * 2018-04-02 2018-10-09 北京小米移动软件有限公司 Face tracking method, device, equipment and storage medium
CN109753876A (en) * 2018-12-03 2019-05-14 西北工业大学 A kind of construction method of the extraction identification and three-dimensional gesture interaction system of three-dimension gesture
CN109977791A (en) * 2019-03-04 2019-07-05 山东海博科技信息系统股份有限公司 A kind of hand physiologic information detection method
CN110532948A (en) * 2019-08-29 2019-12-03 南京泛在地理信息产业研究院有限公司 A kind of high-precision pedestrian track extracting method based on video
CN111079695A (en) * 2019-12-30 2020-04-28 北京华宇信息技术有限公司 Human body key point detection and self-learning method and device
CN112926518A (en) * 2021-03-29 2021-06-08 上海交通大学 Gesture password track restoration system based on video in complex scene
CN113139417A (en) * 2020-11-24 2021-07-20 深圳云天励飞技术股份有限公司 Action object tracking method and related equipment
CN113646736A (en) * 2021-07-17 2021-11-12 华为技术有限公司 Gesture recognition method, device and system and vehicle
CN113900512A (en) * 2021-09-22 2022-01-07 绵阳师范学院 Interaction method based on gesture recognition
CN114610156A (en) * 2022-03-23 2022-06-10 浙江猫精人工智能科技有限公司 Interaction method and device based on AR/VR glasses and AR/VR glasses
CN115443445A (en) * 2020-02-26 2022-12-06 奇跃公司 Hand gesture input for wearable systems

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11106273B2 (en) * 2015-10-30 2021-08-31 Ostendo Technologies, Inc. System and methods for on-body gestural interfaces and projection displays
CN110163048B (en) * 2018-07-10 2023-06-02 腾讯科技(深圳)有限公司 Hand key point recognition model training method, hand key point recognition method and hand key point recognition equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant