CN114967943A - Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition


Info

Publication number
CN114967943A
CN114967943A (application CN202210624185.8A)
Authority
CN
China
Prior art keywords
coordinate system
3dof
handle
virtual display
display device
Prior art date
Legal status
Pending
Application number
CN202210624185.8A
Other languages
Chinese (zh)
Inventor
曾杰 (Zeng Jie)
李斌 (Li Bin)
Current Assignee
Hisense Electronic Technology Shenzhen Co., Ltd.
Original Assignee
Hisense Electronic Technology Shenzhen Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology Shenzhen Co., Ltd.
Priority to CN202210624185.8A
Publication of CN114967943A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 - Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/033 - Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F 3/0346 - Pointing devices displaced or positioned by the user, with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • G06F 3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to the technical field of virtual display, and provides a method and equipment for determining a 6DOF pose based on 3D gesture recognition. Four fingertips other than the thumb are selected as marker points for positioning a 3DOF handle: the four fingertips are fixed at preset position points of the 3DOF handle, while the thumb remains free to press keys, so operation of the 3DOF handle is not affected. Through 3D gesture recognition, the 3D coordinates of the four fingertips in the virtual display device coordinate system are calculated directly; combined with the 3D coordinates of the four fingertips in the 3DOF handle coordinate system, the relative pose of the handle and the virtual display device is obtained through 3D-3D coordinate alignment, unifying the reference coordinate systems of the handle and the virtual display device. The IMU-integrated pose of the handle and the aligned 3D-gesture-estimated pose are then jointly optimized and output. A low-cost, low-power, structurally simple 3DOF handle is thus used to output a high-precision 6DOF pose.

Description

Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition
Technical Field
The application relates to the technical field of virtual display, in particular to a method and equipment for determining a 6DOF (degree of freedom) pose based on 3D gesture recognition.
Background
For virtual display devices such as Virtual Reality (VR) and Augmented Reality (AR) devices, a handle is usually used to implement conventional interaction.
Currently, commonly used handles include the 3 Degree-of-Freedom (DOF) handle and the 6DOF handle. The 3DOF handle has a simple structure, low manufacturing cost, and mature positioning technology: its rotation attitude is provided mainly by an onboard Inertial Measurement Unit (IMU). Based on this rotation attitude, the 3DOF handle implements basic functions such as clicking and dragging in the virtual world, and it is often paired with 3DOF virtual display devices such as VR and AR headsets. However, the accumulated error of obtaining translation by IMU integration is large, so the translation position of a 3DOF handle is unusable; lacking a translation position, the 3DOF handle cannot support complex game actions. Manufacturers have therefore successively introduced 6DOF handles, which output both a rotation attitude and a translation position.
The 6DOF handle mainly adopts visual positioning technology to output its 6DOF pose. Specifically, the handle is fitted with an LED lamp ring; according to the LED lamp ring images acquired by the multi-view cameras on AR, VR and other virtual display devices, a visual positioning algorithm outputs the 6DOF pose of the handle relative to the device. However, compared with the 3DOF handle, which uses only an IMU to obtain a 3DOF pose, the 6DOF handle has a more complex structure and higher hardware cost and power consumption.
Therefore, outputting a 6DOF pose using a structurally simple, low-cost 3DOF handle is a problem to be solved urgently.
Disclosure of Invention
The embodiments of the present application provide a method and equipment for determining a 6DOF pose based on 3D gesture recognition, so as to reduce the cost and power consumption of outputting a 6DOF pose with a handle.
In one aspect, an embodiment of the present application provides a method for determining a 6DOF pose based on 3D gesture recognition, applied to a 3DOF handle and comprising the following steps:
recognizing gestures in hand images acquired by a multi-view camera of the virtual display device, and determining first 3D coordinates, in the virtual display device coordinate system, of four fingertips other than the thumb that are fixed at preset position points of the 3DOF handle;
acquiring second 3D coordinates of the four fingertips in the 3DOF handle coordinate system;
determining, according to the first 3D coordinates and the second 3D coordinates, a relative pose relationship between the virtual display device and the 3DOF handle, so as to align the reference coordinate systems of the virtual display device and the 3DOF handle;
integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
and performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose with the visual 6DOF pose.
In another aspect, an embodiment of the present application provides a virtual display device, where the virtual display device includes a memory, a processor, a first communication interface, and a second communication interface, and the first communication interface, the second communication interface, and the memory are connected to the processor through a bus;
the virtual display device is connected with the 3DOF handle through the first communication interface, and the virtual display device is connected with the multi-view camera through the second communication interface;
the memory includes a data storage unit and a program storage unit, the program storage unit storing computer program instructions, the processor performing the following operations in accordance with the computer program instructions:
acquiring hand images acquired by the multi-view camera through the second communication interface, and storing the hand images in a data storage unit;
acquiring measurement data of the IMU of the 3DOF handle through the first communication interface, and storing the measurement data in a data storage unit;
recognizing gestures in the hand images acquired by the multi-view camera, and determining first 3D coordinates, in the virtual display device coordinate system, of four fingertips other than the thumb that are fixed at preset position points of the 3DOF handle;
acquiring second 3D coordinates of the four fingertips in the 3DOF handle coordinate system;
determining, according to the first 3D coordinates and the second 3D coordinates, a relative pose relationship between the virtual display device and the 3DOF handle, so as to align the reference coordinate systems of the virtual display device and the 3DOF handle;
integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
and performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose with the visual 6DOF pose.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform a method for determining a 6DOF pose based on 3D gesture recognition provided by embodiments of the present application.
In the method and equipment for determining a 6DOF pose based on 3D gesture recognition provided above, a structurally simple, low-cost 3DOF handle can be used to output a 6DOF pose. The method performs gesture recognition on hand images acquired by the multi-view camera of the virtual display device and determines the first 3D coordinates, in the virtual display device coordinate system, of the four fingertips other than the thumb that are fixed at preset position points of the 3DOF handle. According to the structure of the 3DOF handle, it acquires the second 3D coordinates of the four fingertips in the 3DOF handle coordinate system, and from the two sets of coordinates determines the relative pose relationship between the virtual display device and the 3DOF handle, thereby aligning their reference coordinate systems. From the alignment time point, the measurement data of the IMU of the 3DOF handle is integrated to determine the initial 6DOF pose of the 3DOF handle in the reference coordinate system. Considering that IMU integration accumulates a large error in the translation position while its rotation attitude remains relatively accurate, the method also performs visual positioning of the 3DOF handle using the hand images acquired by the multi-view camera to obtain a visual 6DOF pose. Because the visual positioning error is small, it can be used to update the initial 6DOF pose, reducing the accumulated translation error of IMU integration, improving the usability of the translation position output by the 3DOF handle, and finally obtaining a 6DOF pose containing both the translation position and the rotation attitude.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2A is a schematic view of a 3DOF handle provided by an embodiment of the present application;
FIG. 2B is a schematic view of a 6DOF handle provided by embodiments of the present application;
fig. 2C is a schematic diagram of a virtual display device including a multi-view camera according to an embodiment of the present disclosure;
FIG. 2D is another schematic view of a 6DOF handle provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an overall scheme for outputting a 6DOF pose using a 3DOF handle provided by an embodiment of the present application;
fig. 4 is a flowchart of a method for determining a 6DOF pose based on 3D gesture recognition according to an embodiment of the present application;
fig. 5 is a schematic view of a process of extracting 3D coordinates of four fingertips in a virtual display device coordinate system through gesture recognition according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for determining 3D coordinates of identified hand joint points according to an embodiment of the present application;
FIG. 7 is a schematic diagram of 21 hand joint points provided by an embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for determining 3D coordinates of a hand joint point in a virtual display device coordinate system according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a process for improving the 3D coordinate accuracy of hand joint points through gesture correction according to an embodiment of the present disclosure;
FIG. 10 is a flowchart of a method for determining the pose of the 3DOF handle in the reference frame after alignment according to an embodiment of the present disclosure;
FIG. 11 is a flowchart of a method for updating IMU positioning using visual positioning according to an embodiment of the present application;
FIG. 12 is a flowchart of a complete method for determining 6DOF poses based on 3D gesture recognition according to an embodiment of the present disclosure;
fig. 13 is a hardware configuration diagram of a virtual display device according to an embodiment of the present application;
fig. 14 is a functional structure diagram of a virtual display device according to an embodiment of the present application.
Detailed Description
Virtual display devices such as AR and VR devices generally refer to head-mounted display devices (abbreviated as head displays or helmets, e.g. VR glasses and AR glasses) with independent processors, which provide independent computation, input, and output. A virtual display device can be connected to an external handle, and the user controls the virtual picture displayed by the device by operating the handle, thereby realizing conventional interaction. For this reason, handles are often sold bundled with virtual display devices such as AR and VR devices.
Taking a game scene as an example, fig. 1 is a schematic view of an application scene of a VR device and a handle provided in an embodiment of the present application. As shown in fig. 1, by taking advantage of the large screen of a television, the virtual game picture of the VR device is projected onto the television for greater entertainment value. The player controls the game picture displayed by the VR headset through the handle and reacts physically to changes in the game scene, experiencing immersion and increasing the fun of the game.
In the game scene shown in fig. 1, during interaction the relative pose between the handle and the virtual display device (AR, VR, etc.) is calculated by a handle positioning algorithm, thereby realizing three-dimensional interaction with the virtual display device in three-dimensional space and improving the immersive experience.
At present, according to the pose they output, commonly used handles comprise 3DOF handles and 6DOF handles. A 3DOF handle outputs a 3-dimensional rotation attitude, while a 6DOF handle outputs a 3-dimensional translation position and a 3-dimensional rotation attitude; compared with the 3DOF handle, the 6DOF handle supports more complex game actions and is more engaging.
As shown in fig. 2A, a 3DOF handle schematic is provided for embodiments of the present application. The 3DOF handle uses an internal IMU to provide a 3DOF rotation attitude for 3DOF virtual display devices such as VR and AR headsets. Specifically, according to the measurement data acquired by the IMU, a stable rotation attitude is output using algorithms such as complementary filtering or Kalman filtering, and basic functions such as clicking and dragging in the virtual world are achieved based on this rotation attitude. However, the accumulated error of obtaining translation by IMU integration is large, so the translation position of the 3DOF handle is not used; lacking a translation position, the 3DOF handle cannot support complex game actions, and the experience is poor.
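The complementary filtering mentioned above can be sketched in a few lines: the gyroscope reading is integrated for fast response, and the result is pulled toward the accelerometer's gravity-based tilt estimate to cancel drift. This is a minimal single-axis (pitch) sketch under assumed sample values, not the handle's actual firmware; the gain `alpha`, the rates, and all function names are illustrative:

```python
import math

def accel_to_pitch(ax, ay, az):
    """Tilt (pitch) implied by the gravity direction measured by the accelerometer."""
    return math.atan2(-ax, math.sqrt(ay * ay + az * az))

def complementary_filter(pitch, gyro_rate, accel_pitch, dt, alpha=0.98):
    """Blend gyro integration (fast but drifting) with the accelerometer
    tilt estimate (noisy but drift-free)."""
    return alpha * (pitch + gyro_rate * dt) + (1.0 - alpha) * accel_pitch

# Stationary handle whose gyro reports a small constant bias of 0.01 rad/s.
pitch = 0.0
for _ in range(1000):  # 10 s of samples at 100 Hz
    pitch = complementary_filter(pitch, gyro_rate=0.01, accel_pitch=0.0, dt=0.01)
# Pure integration would have drifted to 0.1 rad; the accelerometer term
# keeps the estimate bounded near alpha*bias*dt/(1 - alpha), about 0.005 rad.
```

With `alpha = 0.98` the gyro dominates over short horizons while the accelerometer bounds long-term drift; a full implementation would filter all three axes (or a quaternion) and, as the paragraph notes, often use Kalman filtering instead.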
As shown in fig. 2B, a 6DOF handle schematic diagram is provided in the embodiment of the present application. The 6DOF handle uses visual positioning technology to provide a 6DOF pose (comprising a 3-dimensional rotation attitude and a 3-dimensional translation position) for 6DOF virtual display devices such as AR and VR headsets. As shown in fig. 2B, the 6DOF handle has one more ring of LED lamps than an ordinary 3DOF handle; each white point hole is the location of an LED lamp. Virtual display devices such as AR and VR headsets contain multiple cameras, typically 2 or 4, as shown in fig. 2C. The multi-view system on the virtual display device acquires LED lamp ring images, a visual positioning algorithm outputs the 6DOF pose of the 6DOF handle relative to the device, and after conversion into the virtual world, unconstrained 6DOF interactive operation can be achieved.
As shown in fig. 2D, another 6DOF handle schematic is provided for the present embodiment. This 6DOF handle emits infrared light; because infrared light is generally difficult to observe, the multi-view camera of the virtual display device (AR, VR, etc.) is generally set to a mode that cyclically switches between short exposure and natural exposure, so as to observe the position of the 6DOF handle and output its 6DOF pose.
Although the 6DOF handle outputs a translation position that the 3DOF handle lacks, its structural design and circuit control are more complicated, its cost is higher, and its power consumption is larger. This is not conducive to reducing the cost of VR and AR equipment and limits their popularization and application.
In view of this, the embodiments of the present application provide a method and an apparatus for determining a 6DOF pose based on 3D gesture recognition, in which a 3DOF handle outputs a 6DOF pose through 3D gesture recognition technology, reducing the cost and power consumption of obtaining a 6DOF pose. The embodiment of the application selects the four fingertips other than the thumb as marker points for positioning the 3DOF handle: the four fingertips are fixed at preset position points of the 3DOF handle, while the thumb remains free to press keys, ensuring positioning precision without affecting operation of the handle. The 3D coordinates of the four fingertips in the virtual display device coordinate system are calculated directly through 3D gesture recognition; combined with the 3D coordinates of the four fingertips in the 3DOF handle coordinate system, 3D-3D coordinate alignment is performed, the measurement data of the IMU of the aligned 3DOF handle is integrated to obtain the relative 6DOF pose of the 3DOF handle and the head display, and this pose is converted into the aligned reference coordinate system for output. A low-cost, low-power, structurally simple 3DOF handle is thus used to output a high-precision 6DOF pose. Meanwhile, to ensure stable output of accurate fingertip 3D coordinates, a hand posture correction module is added to the 3D gesture recognition pipeline: taking a standard hand model holding the handle as a reference, the recognition result is corrected and optimized, accurate fingertip 3D coordinates are output, and the accuracy of the 6DOF pose is improved.
Referring to fig. 3, a schematic diagram of the overall scheme for outputting a 6DOF pose using a 3DOF handle is provided for an embodiment of the present application. As shown in fig. 3, the virtual display device has a multi-camera system that acquires hand images of the hand holding the 3DOF handle and sends them to the processor of the virtual display device. The processor performs gesture recognition on the received hand images, extracts the four fingertips other than the thumb fixed at the preset position points of the 3DOF handle, and calculates the 3D coordinates of the four fingertips in the virtual display device coordinate system. The preset position points where the four fingertips rest can be determined from the structure of the 3DOF handle itself, so, by means of these preset position points, the 3D coordinates of the four fingertips in the 3DOF handle coordinate system can be acquired. The relative pose relationship between the virtual display device and the 3DOF handle is determined by aligning the 3D coordinates in the virtual display device coordinate system with the 3D coordinates in the 3DOF handle coordinate system, completing the unification of the reference coordinate systems of the virtual display device and the 3DOF handle. Furthermore, from the time point when the reference coordinate systems are aligned, the measurement data of the IMU of the 3DOF handle is integrated and the initial 6DOF pose of the 3DOF handle in the reference coordinate system is determined, taking into account that IMU integration accumulates a large error in the translation position while its rotation attitude remains accurate.
In the embodiment of the application, considering that the thumb needs to operate the handle keys and is not conducive to 3D gesture calculation, only the other four fingertips fixed at the preset position points of the 3DOF handle are selected for 6DOF positioning; structurally, grooves are designed for the four fingertips according to ergonomics, making the 3DOF handle convenient to hold and the 3D gesture convenient to calculate. On the other hand, the hand images acquired by the multi-view camera are used to visually position the 3DOF handle, and the visual positioning result is used to optimize the initial 6DOF pose of the 3DOF handle, improving the precision of the 6DOF pose.
Based on the overall scheme shown in fig. 3 and applied to a 3DOF handle, fig. 4 is a flowchart of the method for determining a 6DOF pose based on 3D gesture recognition provided by the present application. The process is executed by a virtual display device that is connected to the 3DOF handle and has an independent processor, and mainly includes the following steps:
S401: Recognizing gestures in hand images acquired by the multi-view camera of the virtual display device, and determining first 3D coordinates, in the virtual display device coordinate system, of four fingertips other than the thumb fixed at preset position points of the 3DOF handle.
Generally, in order to facilitate holding, the handle is structurally provided with grooves (i.e. the preset position points) where the user can place the fingertips. Considering that the thumb, unlike the other four fingers, is used to operate the handle keys, the embodiment of the application places the remaining four fingertips in the grooves of the 3DOF handle, i.e. fixes the four fingertips other than the thumb at the preset position points of the 3DOF handle.
Referring to fig. 5, a schematic process diagram for extracting first 3D coordinates of four fingertips in a virtual display device coordinate system through gesture recognition is provided in the embodiment of the present application. As shown in fig. 5, the main contents of the process include gesture area detection, hand joint point extraction, and first 3D coordinate determination, and the specific implementation flow mainly includes the following steps, referring to fig. 6:
S4011: From the hand images acquired by the multi-view camera, a gesture region holding the 3DOF handle is detected.
The outer surface of the virtual display device contains a plurality of cameras; each camera is oriented differently, so hand images from different angles can be acquired. Each hand image is input into a pre-trained target detection model, which detects the gesture region holding the 3DOF handle, as shown in fig. 5.
The embodiment of the present application places no limiting requirement on the target detection model; for example, a conventional machine learning algorithm (e.g., a Support Vector Machine (SVM)) may be adopted, or a deep learning algorithm (e.g., Convolutional Neural Networks (CNN) or the YOLOv3 network).
S4012: Performing gesture estimation on the gesture region and extracting hand joint points.
In S4012, a pre-trained hand joint detection model is used to perform gesture estimation on each gesture region and extract 21 hand joint points. Fig. 7 is a schematic diagram of the 21 hand joint points provided for the embodiment of the present application, where each hand joint point corresponds to a unique identifier. The technology for extracting hand joint points is mature; as it is not the focus of the present application, it is not described further.
S4013: and determining the first 3D coordinates of each hand joint point in the coordinate system of the virtual display device by a multi-mesh matching algorithm.
In the embodiment of the application, according to hand joint points extracted from gesture areas corresponding to a plurality of cameras, depth information of each hand joint point can be determined, and a first 3D coordinate of each hand joint point in a virtual display device coordinate system can be determined by combining internal parameters calibrated in advance by each camera. The specific implementation process is shown in fig. 8, and mainly includes the following steps:
S4013_1: Matching the hand joint points extracted from the gesture region corresponding to the main camera with the hand joint points extracted from the gesture regions corresponding to the other cameras.
The multi-view cameras capture hand images from different angles. According to the richness of the hand information contained in the captured hand images, one camera is selected as the main camera and the others as auxiliary cameras, and the hand joint points extracted from the gesture region corresponding to the main camera are matched with the hand joint points extracted from the gesture regions corresponding to the other cameras.
S4013_2: Determining the depth information of each hand joint point according to each matching result.
According to the results of matching the hand joint points extracted from the gesture region corresponding to the main camera with those extracted from the gesture regions corresponding to the other cameras, the distance from each hand joint point to the corresponding camera is calculated.
S4013_3: Determining the first 3D coordinates of each hand joint point in the virtual display device coordinate system according to the internal parameters pre-calibrated for the multi-view camera, the depth information of each hand joint point, and the image coordinates of each hand joint point in the corresponding gesture region.
In S4013_3, the image coordinates of each hand joint point in its gesture region can be read directly, and the depth information is taken along the Z axis perpendicular to the virtual display device; combined with the internal parameters pre-calibrated for the multi-view camera, the first 3D coordinates of each hand joint point in the virtual display device coordinate system are determined.
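For the simplest two-camera case with rectified views, the depth recovery of S4013_2 and the back-projection of S4013_3 can be sketched as follows. The intrinsic matrix, baseline, and disparity values below are illustrative assumptions, not calibration data from any actual device:

```python
import numpy as np

def disparity_to_depth(fx, baseline, disparity):
    """Depth of a matched joint point from its disparity between two
    rectified cameras: Z = f * B / d."""
    return fx * baseline / disparity

def backproject(u, v, depth, K):
    """Lift an image coordinate (u, v) with known depth into the camera
    (virtual display device) coordinate system using the pre-calibrated
    intrinsic matrix K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative intrinsics and a matched fingertip with a 20 px disparity.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
Z = disparity_to_depth(fx=500.0, baseline=0.1, disparity=20.0)  # 2.5 m
p = backproject(420.0, 240.0, Z, K)  # 3D point in the device frame
```

In practice the cameras of the virtual display device are calibrated in advance, and matching a joint point across more than two views improves robustness of the recovered depth.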
S4014: Acquiring, from the hand joint points, the first 3D coordinates of the four fingertips fixed at the preset position points of the 3DOF handle.
For example, taking fig. 7 again, the identifiers of the four fingertips other than the thumb among the 21 hand joint points are 8, 12, 16, and 20 respectively; therefore, the first 3D coordinates of the four fingertips fixed at the preset position points of the 3DOF handle can be obtained from the first 3D coordinates of the 21 hand joint points by their identifiers.
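In code, picking the four non-thumb fingertips out of the 21 joint points is a plain index lookup by identifier (8, 12, 16, 20). A trivial sketch, with a dummy joint array standing in for real gesture-recognition output:

```python
import numpy as np

# Identifiers of the four fingertips used as marker points; the thumb
# tip is excluded because the thumb moves to press keys.
FINGERTIP_IDS = [8, 12, 16, 20]

def select_fingertips(joints_3d):
    """joints_3d: (21, 3) array of first 3D coordinates of all hand
    joint points; returns the (4, 3) fingertip subset."""
    return joints_3d[FINGERTIP_IDS]

joints = np.arange(63, dtype=float).reshape(21, 3)  # dummy 21 joints
tips = select_fingertips(joints)                    # shape (4, 3)
```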
Considering that hand jitter or a deviation in how the handle is held may increase the error in the first 3D coordinates of the hand joint points extracted by gesture recognition, in some embodiments a gesture 3D correction step is added before the first 3D coordinates of the four fingertips are determined, as shown in fig. 9. In specific implementation, the detected gesture is optimized with a least squares method against a pre-established standard gesture reference model, so that errors in the first 3D coordinates of the four fingertips caused by hand jitter or an incorrect grip are reduced and the accuracy of the determined 6DOF pose is improved.
S402: Acquiring second 3D coordinates of the four fingertips in the 3DOF handle coordinate system.
In the embodiment of the application, the preset position points of the 3DOF handle where the four fingertips rest can be determined from the structure of the 3DOF handle itself, so the second 3D coordinates of the four fingertips in the 3DOF handle coordinate system can be acquired by means of these preset position points.
When the virtual display device and the 3DOF handle are in use, they move independently as two separate devices. They therefore have their own reference coordinate systems, which need to be aligned; see S403 for the specific procedure.
S403: determining a relative pose relationship between the virtual display device and the 3DOF handle to align a reference coordinate system of the virtual display device and the 3DOF handle according to the first 3D coordinate and the second 3D coordinate.
Assume the first 3D coordinates of the four fingertips in the virtual display device coordinate system are P_i^D (i = 1, 2, 3, 4), and their second 3D coordinates in the 3DOF handle coordinate system are P_i^H (i = 1, 2, 3, 4). The relative pose relationship T = (R, t) between the virtual display device and the 3DOF handle is determined by aligning the first 3D coordinates in the virtual display device coordinate system with the second 3D coordinates in the 3DOF handle coordinate system. The formula (formula 1) is as follows:

    (R, t) = argmin Σ_{i=1}^{4} ‖ P_i^D − (R·P_i^H + t) ‖²
Through the above relative pose relationship T, the pose of the 3DOF handle in the first reference coordinate system can be converted into the second reference coordinate system of the virtual display device, and likewise the pose of the virtual display device in the second reference coordinate system can be converted into the first reference coordinate system of the 3DOF handle, thereby realizing the alignment of the reference coordinate systems.
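This least-squares alignment of two small point sets has a well-known closed-form solution (the Kabsch method via SVD). The sketch below assumes NumPy is available and uses illustrative fingertip coordinates; it is one standard way to solve the alignment, not necessarily the patent's exact implementation:

```python
import numpy as np

def align_rigid(p_handle, p_device):
    """Closed-form least-squares rotation R and translation t such that
    p_device ≈ R @ p_handle + t (Kabsch method on the four fingertips)."""
    p_handle = np.asarray(p_handle, dtype=float)
    p_device = np.asarray(p_device, dtype=float)
    ch, cd = p_handle.mean(axis=0), p_device.mean(axis=0)
    H = (p_handle - ch).T @ (p_device - cd)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ ch
    return R, t
```

With the relative pose (R, t) recovered, any handle-frame point maps into the device frame as R·p + t, which is exactly the reference coordinate system alignment of S403.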
S404: from the reference coordinate system alignment time point, the measurement data of the IMU of the 3DOF handpiece is integrated, determining the initial 6DOF pose of the 3DOF handpiece in the reference coordinate system.
After the reference coordinate systems of the virtual display device and the 3DOF handle are aligned, the poses of the virtual display device and the 3DOF handle can be determined under the same reference coordinate system (e.g., a second reference coordinate system in which the virtual display device is located). The position of the virtual display device in the reference coordinate system can be directly read out through a positioning device in the virtual display device, the position of the 3DOF handle can be predicted from the alignment time point of the reference coordinate system by adopting Kalman filtering, namely, the measurement data of the IMU of the 3DOF handle is integrated in the aligned reference coordinate system, and the initialization positioning is completed.
The process of determining the pose of the 3DOF handle in the reference coordinate system after alignment is shown in fig. 10, and mainly includes the following steps:
s4041: and acquiring acceleration measurement values of an accelerometer in the IMU at a time point aligned with a reference coordinate system, and performing secondary integration on the acceleration measurement values in a time dimension to obtain a translation position of the 3DOF handle in the reference coordinate system.
According to the mathematical relationship among the acceleration, the speed and the displacement, from the alignment time point of the reference coordinate system, the acceleration measurement value collected by the accelerometer in the IMU is integrated for the first time in the time dimension, so that the speed information of the 3DOF handle can be obtained, and the translational position (i.e. the displacement information) of the 3DOF handle in the reference coordinate system can be obtained after the speed is integrated for the first time (i.e. the acceleration measurement value is integrated for the second time).
S4042: and acquiring angular velocity measurement values of a gyroscope in the IMU at the alignment time point of a reference coordinate system, and performing primary integration on the angular velocity measurement values in the time dimension to obtain the rotation posture of the 3DOF handle in the reference coordinate system.
From the mathematical relationship between the rotation angle and the angular velocity, it can be known that the rotation posture of the 3DOF handle in the reference coordinate system (i.e. the three-axis rotation angle) can be obtained by integrating the angular velocity measurement values collected by the gyroscope in the IMU once in the time dimension from the alignment time point of the reference coordinate system.
S4043: and determining an initial 6DOF pose of the 3DOF handle in the reference coordinate system according to the translation position and the rotation posture.
Wherein the front three dimensions of the 6DOF pose are the translation position, and the rear three dimensions are the rotation posture.
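The two integration steps and the pose assembly (S4041 to S4043) can be sketched as a discrete Euler integration over synthetic IMU samples. This is a hedged illustration, assuming gravity has already been removed from the accelerometer readings and using a small-angle rotation model; the names and sample values are not from the patent:

```python
def integrate_imu(accel_samples, gyro_samples, dt):
    """Integrate IMU samples from the alignment time point: acceleration
    twice for the translation position, angular velocity once for the
    (small-angle) rotation posture, then assemble the 6DOF pose
    [tx, ty, tz, rx, ry, rz]."""
    v = [0.0, 0.0, 0.0]
    p = [0.0, 0.0, 0.0]
    r = [0.0, 0.0, 0.0]
    for a, w in zip(accel_samples, gyro_samples):
        for k in range(3):
            v[k] += a[k] * dt   # 1st integration of acceleration -> velocity
            p[k] += v[k] * dt   # 2nd integration -> translation position
            r[k] += w[k] * dt   # 1st integration of angular rate -> rotation
    return p + r                # front three: translation, rear three: rotation
```

For example, one second of constant 1 m/s² acceleration along X and 0.1 rad/s rotation about Z (100 samples at dt = 0.01 s) yields roughly 0.5 m of displacement and 0.1 rad of rotation, illustrating why the twice-integrated translation drifts far faster than the once-integrated rotation.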
However, the IMU integral drifts: the longer the accumulation time, the larger the offset. The offset affects the rotational attitude only slightly but affects the translational position strongly, making the translational position inaccurate; the initial 6DOF pose therefore needs to be corrected.
S405: and performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose by using the visual 6DOF pose.
In the embodiment of the application, in order to offset the accumulated error of the IMU integral positioning, the 3DOF handle can be visually positioned, and the initial 6DOF pose is updated with the visual 6DOF pose. The specific implementation process is shown in fig. 11, and mainly includes the following steps:
s4051: and tracking the 3DOF handle by using the hand image acquired by the multi-view camera from the time point aligned by the reference coordinate system, and re-determining the relative pose relationship between the virtual display equipment and the 3DOF handle.
In the embodiment of the application, after a gesture region is detected from the hand images acquired by the multi-view camera, the 3DOF handle is tracked with the hand images from the time point when the reference coordinate systems are aligned: the gesture region at the current moment is re-determined, the first 3D coordinates of the hand joint points extracted from that gesture region are determined, and, combining the second 3D coordinates of each hand joint point at the current moment, the relative pose relationship between the virtual display device and the 3DOF handle at the current moment is re-determined using formula 1.
In S4051, compared with performing gesture area detection on every frame of hand image, the tracking method saves computation and increases the positioning speed.
S4052: and determining the visual 6DOF pose of the 3DOF handle in the reference coordinate system according to the 6DOF pose of the virtual display equipment in the reference coordinate system and the new relative pose relationship.
In S4052, the 6DOF pose of the virtual display device in the reference coordinate system can be directly read by a positioning device in the virtual display device, and the visual 6DOF pose of the 3DOF handle in the reference coordinate system can be determined by combining the new relative pose relationship at the current time.
S4053: the initial 6DOF pose is updated with the visual 6DOF pose.
Generally, the measurement frequency of the IMU is higher than the acquisition frame rate of the camera, so that the IMU has already performed an integration process within the time range of the camera acquiring two adjacent frames of hand images, and thus, the initial 6DOF pose of the 3DOF handle can be updated with each frame of visual positioning result of the multi-view camera. Because the accuracy of the 6DOF pose of the vision positioning is higher, the accumulated error of the IMU integral on the translation position can be accurately corrected, and the positioning accuracy is improved.
In an alternative embodiment, updating the IMU positioning result with the visual positioning result may be implemented by introducing Kalman filtering. Kalman filtering is an error estimation algorithm that, for a linear system, solves for the minimum system variance using a state equation over the input predicted data and observed data. Its core idea is: first select a random dynamic variable in the system and establish a prediction model, then calculate the optimal estimate from the state equation using the system's real-time observation data, in a continuous predict/update calculation cycle. It has the advantages of a small data processing load and strong real-time performance.
In the prediction part of the embodiments of the present application: the angular velocity measurement values and acceleration measurement values of the IMU in the 3DOF handle are integrated respectively, so that the initial 6DOF pose of the 3DOF handle in the reference coordinate system can be roughly calculated. During the integration, the covariance matrix of the prediction error is also iterated: P(k+1) = F·P(k)·Fᵀ + Q, where F denotes the pose prediction matrix, Q is the Gaussian white-noise matrix, and P(k) is the covariance matrix at the previous time instant. Because of Gaussian white noise and random walk, the longer the prediction time, the larger the drift-induced error.
In the update part of the embodiment of the present application: after the visual positioning succeeds, the relative pose relationship between the virtual display device and the 3DOF handle is obtained, and the 6DOF pose of the 3DOF handle in the reference coordinate system can be derived by combining it with the 6DOF pose of the virtual display device in the reference coordinate system. The 6DOF pose obtained by visual positioning has high precision and can be used to update the IMU-integrated pose of the handle. In the update process, Kalman filtering calculates the Kalman gain and updates the predicted value with the measured value, suppressing noise interference.
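The update step (computing the Kalman gain, then correcting the IMU prediction with the visual measurement) in its simplest scalar form, as a hedged sketch; the real filter operates on the full 6DOF state and covariance matrices:

```python
def kalman_update(x_pred, p_pred, z_visual, r_meas):
    """Scalar Kalman update: the gain K weights the visual measurement
    against the IMU prediction. A small measurement noise r_meas (an
    accurate visual pose) pulls the state strongly toward the measurement."""
    K = p_pred / (p_pred + r_meas)
    x_new = x_pred + K * (z_visual - x_pred)
    p_new = (1.0 - K) * p_pred
    return x_new, p_new, K

# Equal trust in prediction and measurement -> midpoint, halved variance:
print(kalman_update(0.0, 1.0, 1.0, 1.0))  # (0.5, 0.5, 0.5)
```

With the high-precision visual pose, r_meas is small, so K approaches 1 and the drifted IMU translation is pulled almost entirely onto the visual result, which is the correction described above.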
It should be noted that the use of linear Kalman filtering for the fused visual/IMU positioning in the embodiment of the present application is only an example and not a limiting requirement. For example, a nonlinear algorithm may also be employed to optimize the IMU positioning result with the visual positioning result. In a specific implementation, the IMU pre-integration theory is adopted, and the 6DOF pose of the 3DOF handle is jointly optimized using the reprojection errors of the 6DOF poses from N visual positionings together with the IMU pre-integration residuals.
In some embodiments, since the IMU integration continuously predicts the 6DOF pose of the handle, an occasional handle tracking anomaly does not interrupt the output of the handle positioning result; it only degrades the positioning accuracy. If tracking fails for a long time, however, the handle has likely moved out of the camera field of view; the translational positioning accuracy is then severely affected, and the output can fall back to the rotational pose only. Specifically, when tracking of the 3DOF handle in the hand images acquired by the multi-view camera fails, and the tracking failure duration is greater than a set time threshold, the rotation information in the initial 6DOF pose is optimized to obtain the 3DOF pose of the 3DOF handle in the reference coordinate system. When the 3DOF handle moves back into the camera field of view and is tracked again, the Kalman filtering process is restarted and the 6DOF pose of the 3DOF handle continues to be output.
Referring to fig. 12, a flowchart of a complete method for outputting a 6DOF pose by using a 3DOF handle according to an embodiment of the present application mainly includes the following steps:
s1201: and acquiring hand images acquired by the multi-view camera.
S1202: recognizing gestures in the hand image, and determining first 3D coordinates of four fingertips fixed at preset position points of the 3DOF handle except thumbs in a virtual display device coordinate system.
S1203: and determining whether the visual initialization positioning is successful, if not, executing S1204, and if so, executing S1207.
And if the first reference coordinate system of the 3DOF handle is aligned with the second reference coordinate system of the virtual display device, the visual initialization positioning is successful, otherwise, the failure of the visual initialization positioning is indicated.
S1204: a second 3D coordinate of the four fingertips in the 3DOF handle coordinate system is acquired.
S1205: determining a relative pose relationship between the virtual display device and the 3DOF handle to align a reference coordinate system of the virtual display device and the 3DOF handle according to the first 3D coordinate and the second 3D coordinate.
S1206: starting from the time point when the reference coordinate system is aligned, starting Kalman filtering to integrate the measurement data of the IMU of the 3DOF handle, and determining the initial 6DOF pose of the DOF handle in the reference coordinate system.
S1207: and determining whether the visual tracking positioning is successful, if so, executing S1208, otherwise, executing S1211.
If the hand image collected by the multi-view camera is tracked to the 3DOF handle, the visual tracking positioning is successful, otherwise, the visual tracking positioning fails.
S1208: the relative pose relationship between the virtual display device and the 3DOF handle is re-determined.
S1209: and determining the visual 6DOF pose of the 3DOF handle in the reference coordinate system according to the 6DOF pose of the virtual display equipment in the reference coordinate system and the new relative pose relationship.
S1210: and updating the initial 6DOF pose by using the visual 6DOF pose by adopting Kalman filtering.
S1211: and when the failure duration of the visual positioning and tracking is greater than a set time threshold, optimizing the rotation information in the initial 6DOF pose to obtain a 3DOF pose of the 3DOF handle in a reference coordinate system.
In the method for determining the 6DOF pose based on 3D gesture recognition, gesture recognition is performed on hand images acquired by the multi-view camera of the virtual display device, and the first 3D coordinates, in the virtual display device coordinate system, of the four fingertips (other than the thumb) fixed at the preset position points of the 3DOF handle are determined; since the thumb remains free to press keys, positioning accuracy is ensured without hindering operation of the 3DOF handle. The second 3D coordinates of the four fingertips in the 3DOF handle coordinate system are then acquired according to the structure of the 3DOF handle, the relative pose relationship between the virtual display device and the 3DOF handle is determined, and their reference coordinate systems are aligned. From the alignment time point, the measurement data of the IMU of the 3DOF handle is integrated to determine the initial 6DOF pose of the 3DOF handle in the reference coordinate system. Considering that the IMU integral accumulates a large error in the translational position while the rotational pose remains comparatively accurate, the method visually positions the 3DOF handle using the hand images acquired by the multi-view camera to obtain a visual 6DOF pose, which is used to update the initial 6DOF pose. This reduces the accumulated IMU integration error in the translational position, improves the usability of the translational position output by the 3DOF handle, and finally yields a 6DOF pose containing both a translational position and a rotational pose.
Based on the same technical concept, the embodiment of the application provides a virtual display device, which can be a VR device or an AR device, can implement the method steps for determining the 6DOF pose based on 3D gesture recognition in the above embodiments, and can achieve the same technical effects.
Referring to fig. 13, the virtual display device includes a processor 1301, a memory 1302, a multi-view camera 1303 and a communication interface 1304, where the communication interface 1304, the multi-view camera 1303, the memory 1302 and the processor 1301 are connected through a bus 1305;
the memory 1302 includes data storage units and program storage units that store computer program instructions according to which the processor 1301 performs the following operations:
the virtual display equipment is connected with the 3DOF handle through the communication interface 1304, obtains the measurement data of the IMU of the 3DOF handle and stores the measurement data in a data storage unit;
acquiring hand images acquired by the multi-view camera 1303 and storing the hand images in a data storage unit;
recognizing gestures in a hand image acquired by the multi-view camera, and determining first 3D coordinates of the four fingertips, other than the thumb, fixed at preset position points of the 3DOF handle in the virtual display device coordinate system;
acquiring second 3D coordinates of the four fingertips in a 3DOF handle coordinate system;
determining a relative pose relationship between the virtual display device and the 3DOF handle to align a reference coordinate system of the virtual display device and the 3DOF handle according to the first 3D coordinate and the second 3D coordinate;
integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
and performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose by using a visual 6DOF pose.
Optionally, the processor 1301 performs visual positioning on the 3DOF handle, and updates the initial 6DOF pose with a visual 6DOF pose, specifically:
tracking the 3DOF handle by using the hand image acquired by the multi-view camera from the time point of alignment of the reference coordinate system, and re-determining the relative pose relationship between the virtual display equipment and the 3DOF handle;
determining a visual 6DOF pose of the 3DOF handle in the reference coordinate system according to the 6DOF pose of the virtual display device in the reference coordinate system and the new relative pose relationship;
updating the initial 6DOF pose with the visual 6DOF pose.
Optionally, when the hand image acquired by the multi-view camera fails to track the 3DOF handle and the tracking failure duration is greater than a set time threshold, the processor 1301 further performs:
optimizing the rotation information in the initial 6DOF pose to obtain a 3DOF pose of the 3DOF handle in the reference coordinate system.
Optionally, before determining the first 3D coordinates of the four fingertips in the virtual display device coordinate system, the processor 1301 further performs:
and optimizing the detected gesture by adopting a least square method according to a pre-established standard gesture reference model so as to reduce the determination error of the first 3D coordinates of the four fingertips caused by hand shaking or wrong gestures.
Optionally, the processor 1301 identifies a gesture in a hand image acquired by a multi-view camera of the virtual display device, and determines a first 3D coordinate of four fingertips fixed at a preset position point of the 3DOF handle in a virtual display device coordinate system, where the first 3D coordinate is specifically:
detecting a gesture area holding the 3DOF handle from a hand image acquired by the multi-view camera;
performing gesture estimation on the gesture area, and extracting hand joint points;
determining a first 3D coordinate of each hand joint point in the virtual display device coordinate system through a multi-view matching algorithm; first 3D coordinates of four fingertips fixed at preset position points of the 3DOF handle are acquired from each hand joint point.
Optionally, the processor 1301 determines, through a multi-view matching algorithm, a first 3D coordinate of each hand joint point in the virtual display device coordinate system, and the specific operations are as follows:
matching the hand joint points extracted from the gesture area corresponding to the main camera with the hand joint points extracted from the gesture areas corresponding to the other cameras respectively;
determining the depth information of each hand joint point according to each matching result;
and determining a first 3D coordinate of each hand joint point under the virtual display equipment coordinate system according to the internal parameters calibrated in advance by the multi-view camera, the depth information of each hand joint point and the image coordinate of each hand joint point in the corresponding gesture area.
Optionally, the processor 1301 integrates measurement data of the IMU of the 3DOF handle from the time point when the reference coordinate system is aligned, and determines an initial 6DOF pose of the 3DOF handle in the reference coordinate system, specifically:
acquiring an acceleration measurement value of an accelerometer in the IMU at the alignment time point of the reference coordinate system, and performing secondary integration on the acceleration measurement value in a time dimension to obtain a translation position of the 3DOF handle in the reference coordinate system;
acquiring angular velocity measurement values of a gyroscope in the IMU at the alignment time point of the reference coordinate system, and performing primary integration on the angular velocity measurement values in a time dimension to obtain a rotation posture of the 3DOF handle in the reference coordinate system;
determining an initial 6DOF pose of the 3DOF handle in the reference coordinate system from the translational position and the rotational pose.
Optionally, the determination formula of the relative pose relationship between the virtual display device and the 3DOF handle is:

    (R, t) = argmin Σ_{i=1}^{4} ‖ P_i^D − (R·P_i^H + t) ‖²

wherein P_i^D (i = 1, 2, 3, 4) respectively represent the first 3D coordinates of the four fingertips in the virtual display device coordinate system, and P_i^H (i = 1, 2, 3, 4) respectively represent the second 3D coordinates of the four fingertips in the 3DOF handle coordinate system.
It should be noted that fig. 13 is only an example, giving the hardware necessary for the virtual display device to execute the steps of the method for determining the 6DOF pose based on 3D gesture recognition provided in the embodiment of the present application; although not shown, the virtual display device may further include conventional hardware such as left and right lenses, a speaker, and a microphone.
The Processor referred to in fig. 13 in this Application may be a Central Processing Unit (CPU), a general purpose Processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application-specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
Referring to fig. 14, a functional block diagram of a virtual display device capable of implementing a method for determining a 6DOF pose based on 3D gesture recognition according to an embodiment of the present disclosure is shown, where the virtual display device includes a visual positioning module 1401, an acquisition module 1402, a coordinate system alignment module 1403, an IMU positioning module 1404, and a pose updating module 1405, where:
a visual positioning module 1401, configured to recognize gestures in a hand image collected by a multi-view camera of a virtual display device, and determine first 3D coordinates of four fingertips fixed at a preset position point of the 3DOF handle, excluding a thumb, in a virtual display device coordinate system;
an acquiring module 1402, configured to acquire second 3D coordinates of the four fingertips in a 3DOF handle coordinate system;
a coordinate system alignment module 1403 for determining the relative pose relationship between the virtual display device and the 3DOF handle according to the first 3D coordinate and the second 3D coordinate to align the reference coordinate systems of the virtual display device and the 3DOF handle;
an IMU positioning module 1404 for integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
a pose update module 1405 for visually positioning the 3DOF handle, updating the initial 6DOF pose with a visual 6DOF pose.
The functional modules are mutually matched, so that the method steps of determining the 6DOF pose based on 3D gesture recognition can be realized, and the same technical effect can be achieved. The specific implementation of each functional module is referred to the foregoing embodiments, and is not repeated here.
Embodiments of the present application also provide a computer-readable storage medium for storing instructions that, when executed, may implement the methods of the foregoing embodiments.
The embodiments of the present application also provide a computer program product for storing a computer program, where the computer program is used to execute the method of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for determining a 6DOF pose based on 3D gesture recognition is applied to a 3DOF handle and comprises the following steps:
recognizing gestures in hand images acquired by a multi-view camera of the virtual display device, and determining first 3D coordinates of the four fingertips, other than the thumb, fixed at preset position points of the 3DOF handle in a virtual display device coordinate system;
acquiring second 3D coordinates of the four fingertips in a 3DOF handle coordinate system;
determining a relative pose relationship between the virtual display device and the 3DOF handle to align a reference coordinate system of the virtual display device and the 3DOF handle according to the first 3D coordinate and the second 3D coordinate;
integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
and performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose by using a visual 6DOF pose.
2. The method of claim 1, wherein visually positioning the 3DOF handle, updating the initial 6DOF pose with a visual 6DOF pose, comprises:
tracking the 3DOF handle by using the hand image acquired by the multi-view camera from the time point of alignment of the reference coordinate system, and re-determining the relative pose relationship between the virtual display equipment and the 3DOF handle;
determining a visual 6DOF pose of the 3DOF handle in the reference coordinate system according to the 6DOF pose of the virtual display device in the reference coordinate system and the new relative pose relationship;
updating the initial 6DOF pose with the visual 6DOF pose.
3. The method of claim 2, wherein when tracking of the 3DOF handle with a hand image acquired by the multi-view camera fails and a tracking failure duration is greater than a set time threshold, the method further comprises:
optimizing the rotation information in the initial 6DOF pose to obtain a 3DOF pose of the 3DOF handle in the reference coordinate system.
4. The method of claim 1, wherein prior to determining the first 3D coordinates of the four fingertips in the virtual display device coordinate system, the method further comprises:
and optimizing the detected gesture by adopting a least square method according to a pre-established standard gesture reference model so as to reduce the determination error of the first 3D coordinates of the four fingertips caused by hand shaking or wrong gestures.
5. The method of claim 1, wherein the recognizing gestures in hand images captured by a multi-view camera of a virtual display device, determining first 3D coordinates of the four fingertips, other than the thumb, fixed at preset location points of the 3DOF handle under a virtual display device coordinate system, comprises:
detecting a gesture region holding the 3DOF handle from a hand image acquired by the multi-view camera;
performing gesture estimation on the gesture area, and extracting hand joint points;
determining a first 3D coordinate of each hand joint point in the virtual display device coordinate system through a multi-view matching algorithm;
first 3D coordinates of four fingertips fixed at preset position points of the 3DOF handle are acquired from each hand joint point.
6. The method of claim 5, wherein said determining, by a multi-view matching algorithm, first 3D coordinates of each hand joint point in the virtual display device coordinate system comprises:
matching the hand joint points extracted from the gesture area corresponding to the main camera with the hand joint points extracted from the gesture areas corresponding to the other cameras respectively;
determining the depth information of each hand joint point according to each matching result;
determining the first 3D coordinate of each hand joint point in the virtual display device coordinate system according to pre-calibrated intrinsic parameters of the multi-view camera, the depth information of each hand joint point, and the image coordinates of each hand joint point in the corresponding gesture area.
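The last step of claim 6 can be sketched under a pinhole camera model: depth recovered from the match between cameras (shown here for the simplest rectified-stereo case), then the pixel coordinates back-projected with the pre-calibrated intrinsics. Parameter names (fx, fy, cx, cy, baseline) are conventional assumptions, not taken from the patent.

```python
def depth_from_disparity(fx, baseline, disparity):
    """For a rectified stereo pair, the depth of a joint matched between the
    main camera and a second camera follows from its pixel disparity."""
    return fx * baseline / disparity

def backproject(u, v, depth, fx, fy, cx, cy):
    """Recover a 3D point in the camera frame from image coordinates (u, v),
    the triangulated depth Z, and the calibrated intrinsics
    (focal lengths fx, fy and principal point cx, cy):
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    return [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
```

For example, a joint seen at pixel (420, 240) with 72 px disparity, fx = fy = 600, principal point (360, 240) and a 6 cm baseline lies at 0.5 m depth, 5 cm right of the optical axis. A multi-view (more than two cameras) system would triangulate across all matched views instead of using a single disparity.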
7. The method of claim 1, wherein the integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system comprises:
acquiring acceleration measurement values of an accelerometer in the IMU from the reference coordinate system alignment time point, and performing a double integration of the acceleration measurement values in the time dimension to obtain a translation position of the 3DOF handle in the reference coordinate system;
acquiring angular velocity measurement values of a gyroscope in the IMU from the reference coordinate system alignment time point, and performing a single integration of the angular velocity measurement values in the time dimension to obtain a rotation posture of the 3DOF handle in the reference coordinate system;
determining an initial 6DOF pose of the 3DOF handle in the reference coordinate system from the translational position and the rotational pose.
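The integration scheme of claim 7 can be sketched as simple Euler dead-reckoning. The sketch assumes each IMU sample already has gravity removed and is expressed in the reference frame, and it keeps attitude as an accumulated rotation vector for brevity; a real implementation would integrate quaternions and compensate biases.

```python
def integrate_imu(samples, dt):
    """Dead-reckon the handle pose from the coordinate-alignment instant.
    Each sample is (accel, gyro): linear acceleration (m/s^2, gravity
    removed) and angular rate (rad/s), both in the reference frame.
    Acceleration is integrated twice for translation; angular rate once
    for orientation."""
    vel = [0.0, 0.0, 0.0]
    pos = [0.0, 0.0, 0.0]
    rot = [0.0, 0.0, 0.0]   # accumulated rotation vector (small-angle model)
    for accel, gyro in samples:
        for k in range(3):
            vel[k] += accel[k] * dt   # first integration: velocity
            pos[k] += vel[k] * dt     # second integration: position
            rot[k] += gyro[k] * dt    # single integration: attitude
    return pos, rot
```

One second of constant 2 m/s² acceleration along x at 100 Hz yields roughly the expected 1 m of travel (1.01 m with this discrete scheme), which also shows why pure IMU integration drifts and needs the visual correction of claim 2.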
8. The method of any of claims 1-7, wherein the relative pose relationship between the virtual display device and the 3DOF handle is determined by the formula:

(R, t) = argmin over (R, t) of Σ_{i=1..4} ‖ P_i^C − (R·P_i^H + t) ‖²

wherein P_1^C, P_2^C, P_3^C, P_4^C respectively represent the first 3D coordinates of the four fingertips in the virtual display device coordinate system, and P_1^H, P_2^H, P_3^H, P_4^H respectively represent the second 3D coordinates of the four fingertips in the 3DOF handle coordinate system.
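The least-squares rigid alignment between the two fingertip point sets has a standard closed-form solution via SVD (the Kabsch/Umeyama method). The sketch below shows that solution; it is an illustrative implementation consistent with claim 8's formulation, not necessarily the exact method filed.

```python
import numpy as np

def relative_pose(p_device, p_handle):
    """Closed-form least-squares rigid transform (R, t) mapping the four
    fingertip coordinates from the 3DOF handle frame into the virtual
    display device frame: p_device ~= R @ p_handle + t."""
    A = np.asarray(p_handle, dtype=float)   # second 3D coordinates (handle frame)
    B = np.asarray(p_device, dtype=float)   # first 3D coordinates (device frame)
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)               # 3x3 cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against a reflection solution
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t
```

Four non-coplanar fingertip points fully determine the transform; with noisy detections the same formula returns the least-squares optimum, which is why the gesture regularization of claim 4 directly improves this estimate.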
9. A virtual display device, comprising a memory, a processor, a multi-view camera and a communication interface, wherein the communication interface, the multi-view camera and the memory are connected to the processor through a bus;
the memory includes a data storage unit and a program storage unit, the program storage unit storing computer program instructions, the processor performing the following operations in accordance with the computer program instructions:
connecting to the 3DOF handle through the communication interface, acquiring measurement data of the IMU of the 3DOF handle, and storing the measurement data in the data storage unit;
acquiring hand images acquired by the multi-view camera and storing the hand images in a data storage unit;
recognizing a gesture in the hand images acquired by the multi-view camera, and determining first 3D coordinates, in a virtual display device coordinate system, of four fingertips other than the thumb fixed at preset position points of the 3DOF handle;
acquiring second 3D coordinates of the four fingertips in a 3DOF handle coordinate system;
determining, according to the first 3D coordinates and the second 3D coordinates, a relative pose relationship between the virtual display device and the 3DOF handle, so as to align reference coordinate systems of the virtual display device and the 3DOF handle;
integrating measurement data of the IMU of the 3DOF handle from the reference coordinate system alignment time point to determine an initial 6DOF pose of the 3DOF handle in the reference coordinate system;
performing visual positioning on the 3DOF handle, and updating the initial 6DOF pose with a visual 6DOF pose.
10. The virtual display device of claim 9, wherein the processor visually positions the 3DOF handle, updates the initial 6DOF pose with a visual 6DOF pose, by:
tracking the 3DOF handle by using hand images acquired by the multi-view camera from the reference coordinate system alignment time point, and re-determining the relative pose relationship between the virtual display device and the 3DOF handle;
determining a visual 6DOF pose of the 3DOF handle in the reference coordinate system according to the 6DOF pose of the virtual display device in the reference coordinate system and the new relative pose relationship;
updating the initial 6DOF pose with the visual 6DOF pose.
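The overall fusion described in claims 9-10 (high-rate IMU dead-reckoning, periodically corrected by the visual 6DOF pose) can be sketched minimally as follows. Here the visual fix simply replaces the drifted state; a production system would more likely blend the two with a Kalman-style filter. Class and method names are illustrative assumptions.

```python
class PoseFuser:
    """Keep the IMU-integrated 6DOF pose as the high-rate estimate and
    re-anchor it whenever a slower visual 6DOF pose arrives, so that IMU
    drift stays bounded between camera frames."""

    def __init__(self, dt):
        self.dt = dt
        self.pos = [0.0, 0.0, 0.0]
        self.vel = [0.0, 0.0, 0.0]
        self.rot = [0.0, 0.0, 0.0]  # rotation vector, small-angle model

    def predict(self, accel, gyro):
        """Advance the pose by one IMU sample (reference-frame, gravity-free)."""
        for k in range(3):
            self.vel[k] += accel[k] * self.dt
            self.pos[k] += self.vel[k] * self.dt
            self.rot[k] += gyro[k] * self.dt

    def correct(self, visual_pos, visual_rot):
        """Replace the dead-reckoned state with the visual 6DOF pose."""
        self.pos = list(visual_pos)
        self.rot = list(visual_rot)
        self.vel = [0.0, 0.0, 0.0]  # conservative reset; a filter could retain it
```

When visual tracking fails for longer than the threshold of claim 3, `correct` simply stops being called and only the rotation component of the prediction remains trustworthy, which matches the claim's fallback to a 3DOF pose.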
CN202210624185.8A 2022-06-02 2022-06-02 Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition Pending CN114967943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210624185.8A CN114967943A (en) 2022-06-02 2022-06-02 Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition

Publications (1)

Publication Number Publication Date
CN114967943A true CN114967943A (en) 2022-08-30

Family

ID=82960119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210624185.8A Pending CN114967943A (en) 2022-06-02 2022-06-02 Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition

Country Status (1)

Country Link
CN (1) CN114967943A (en)

Similar Documents

Publication Publication Date Title
CN110047104B (en) Object detection and tracking method, head-mounted display device, and storage medium
US10818092B2 (en) Robust optical disambiguation and tracking of two or more hand-held controllers with passive optical and inertial tracking
CN110310329B (en) Method of operating display device, information processing system, and non-transitory storage medium
CN109146965B (en) Information processing apparatus, computer readable medium, and head-mounted display apparatus
US10674142B2 (en) Optimized object scanning using sensor fusion
EP3469458B1 (en) Six dof mixed reality input by fusing inertial handheld controller with hand tracking
US10521026B2 (en) Passive optical and inertial tracking in slim form-factor
KR102670987B1 (en) eye tracking correction techniques
Memo et al. Head-mounted gesture controlled interface for human-computer interaction
CN107408314B (en) Mixed reality system
US20140168261A1 (en) Direct interaction system mixed reality environments
KR20120068253A (en) Method and apparatus for providing response of user interface
EP2842016A1 (en) Image processing apparatus, image processing method, and program
KR20230072757A (en) Wearable electronic device and operating method of wearable electronic device
CN110688002B (en) Virtual content adjusting method, device, terminal equipment and storage medium
Lee et al. Tunnelslice: Freehand subspace acquisition using an egocentric tunnel for wearable augmented reality
JP6582205B2 (en) Information processing apparatus, information processing apparatus program, head mounted display, and information processing system
CN114967943A (en) Method and equipment for determining 6DOF (degree of freedom) pose based on 3D (three-dimensional) gesture recognition
US20210302587A1 (en) Power-efficient hand tracking with time-of-flight sensor
RU2695053C1 (en) Method and device for control of three-dimensional objects in virtual space
KR20150081975A (en) Apparatus for pose estimation of wearable display device using hybrid sensors
KR20200063727A (en) Method and apparatus for recognizing gesture
WO2023238678A1 (en) Information processing device, controller display method and computer program
TWI779332B (en) Augmented reality system and display method for anchor virtual object thereof
WO2023157338A1 (en) Information processing apparatus and method for estimating device position

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination