CN111696140B - Monocular-based three-dimensional gesture tracking method

Info

Publication number: CN111696140B
Application number: CN202010387724.1A
Authority: CN (China)
Prior art keywords: dimensional, tracking, coordinates, bone, head
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN111696140A
Inventors: 吴涛, 周锋宜
Current assignee: Qingdao Xiaoniao Kankan Technology Co Ltd
Original assignee: Qingdao Xiaoniao Kankan Technology Co Ltd
Application filed by Qingdao Xiaoniao Kankan Technology Co Ltd; priority to CN202010387724.1A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/269 - Analysis of motion using gradient-based methods
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/20 - Input arrangements for video game devices
    • A63F13/21 - Input arrangements for video game devices characterised by their sensors, purposes or types
    • A63F13/213 - Input arrangements for video game devices characterised by their sensors, purposes or types comprising photodetecting means, e.g. cameras, photodiodes or infrared cells
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/10 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
    • A63F2300/1087 - Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person

Abstract

The invention provides a monocular-based three-dimensional gesture tracking method, which comprises the following steps: training a hand detection model and a skeleton point recognition model; starting the hand detection model and the tracking module according to the number of hands detected in the previous frame image; recognizing skeleton points in the region of interest of the current frame stored in the tracking queue Trackhand through the skeleton point recognition model, and performing smoothing filtering on the recognized skeleton points; recording the position and pose of the head in each frame image and storing the head data in real time into the queue Trackhead of the tracking module; determining the three-dimensional skeleton coordinates of the smoothed skeleton points by combining the head data in the Trackhead; and rendering the three-dimensional skeleton coordinates to complete gesture tracking. Replacing two infrared binocular cameras with one monocular camera reduces cost; even if two monocular cameras are installed, the cost is still lower than that of one infrared binocular camera. The overall power consumption and heat dissipation are reduced, the overall weight of the headset is lightened, and wearing comfort is improved.

Description

Monocular-based three-dimensional gesture tracking method
Technical Field
The invention relates to the field of computer vision, in particular to a monocular-based three-dimensional gesture tracking method.
Background
In order to enhance the immersion of the virtual-real combination in VR/AR/MR and provide a better experience, a human-computer interaction module is indispensable; in particular, high-precision, real-time restoration of the 3D gesture of the hand in a VR/AR/MR scene greatly influences the immersion of the user experience.
At present, in the VR/AR/MR field, a gesture recognition tracker must be additionally added to mainstream headsets; the traditional method separately adds two infrared binocular cameras or depth cameras to realize finger tracking, which raises the following problems: 1. additional cost; 2. additional power consumption: mainstream headsets are all-in-one devices powered by their own battery, so the power consumption of the whole system greatly affects the available interaction time; 3. the increased power consumption also makes heat dissipation a significant challenge; 4. the added cameras increase the complexity of the structural design and the industrial design (ID), running against the development goals of a head-mounted all-in-one device that is small, portable, and comfortable for long-term wear; 5. the FOV of currently mature, popular depth cameras is generally no larger than about 90 degrees, while the FOV required by a headset is generally about 110 degrees, so with a depth camera some hand motion trajectories are very easily lost.
Therefore, there is a need for a monocular-based three-dimensional gesture tracking method that saves cost, reduces power consumption and heat dissipation, enlarges the visible area, lightens the headset, and improves wearing comfort.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a monocular-based three-dimensional gesture tracking method, so as to solve the problems of the existing methods: the high cost, high power consumption, and high heat dissipation of infrared binocular cameras; the increased complexity of the headset's structural design; the increased headset volume and the discomfort caused by long-term wear; and the small viewing angle, which very easily leaves hand motion trajectories incompletely tracked.
The invention provides a monocular-based three-dimensional gesture tracking method, which is characterized by comprising the following steps:
training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand region of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
starting the hand detection model and the tracking module according to the number of hands detected in the previous frame image to acquire the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module, the data information of the current frame at least comprising the region of interest of the current frame;
performing skeleton point recognition on the region of interest of the current frame in the Trackhand through the skeleton point recognition model, and performing smoothing filtering on the recognized skeleton points according to the historical data in the Trackhand;
and storing data about the position and pose of the head in each frame image into a queue Trackhead of the tracking module in real time, and determining the three-dimensional skeleton coordinates of the smoothed skeleton points by combining the head data in the Trackhead, so as to complete gesture tracking.
Preferably, in training the hand detection model and the skeletal point recognition model,
collecting hand image data of at least 100 users by using a head tracking camera as an action behavior case;
and inputting the action behavior cases into the hand detection model and the bone point recognition model for model training.
Preferably, in the process of starting the hand detection model and the tracking module according to the detection number of the hands in the previous frame of image,
if the detection number is 0 or 1, starting the hand detection model and the tracking module;
if the detection number is 2, only the tracking module is started.
Preferably, in the process of performing skeleton point recognition on the region of interest of the current frame in the Trackhand,
the region of interest comprises the position coordinates of the hand in the image and the size of the region corresponding to the hand;
the number of the skeleton points is 21.
Preferably, in the process of acquiring the region of interest of the current frame, the method further includes:
and estimating the region of interest of the next frame according to the region of interest of the current frame based on an optical flow tracking algorithm so as to provide a reference for bone point identification of the next frame.
Preferably, in determining the three-dimensional skeleton coordinates of the smoothed skeleton points in combination with the head data in the Trackhead,
reading the head data in the Trackhead, and acquiring the translation matrix T and the rotation matrix R of the head in the current frame relative to the previous frame;
and determining the three-dimensional coordinates of the skeleton points according to preset calibration parameters of the tracking camera, the translation matrix T and the rotation matrix R.
Preferably, the preset calibration parameters of the tracking camera are

\(K = \begin{bmatrix} fx & 0 & cx \\ 0 & fy & cy \\ 0 & 0 & 1 \end{bmatrix}\)

wherein fx, fy represent the pixel focal lengths of the tracking camera, and cx, cy represent the position of the tracking camera's optical axis in image coordinates.
Preferably, in the process of determining the three-dimensional coordinates of the skeleton points according to the preset calibration parameters of the tracking camera, the translation matrix T and the rotation matrix R,
selecting any one of the bone points to perform three-dimensional coordinate calculation;
each bone point is calculated in turn until all bone points of both hands are calculated.
Preferably, in selecting any one of the skeleton points for three-dimensional coordinate calculation,
selecting a point P from the skeleton points and acquiring the coordinate information of the point P: the three-dimensional coordinates of the point P in the previous frame are \(P_1 = (X_1, Y_1, Z_1)^T\), with two-dimensional image coordinates \(L_1 = (u_1, v_1)^T\); the two-dimensional image coordinates of the point P in the current frame are known to be \(L_2 = (u_2, v_2)^T\); and the three-dimensional coordinates of the point P in the current frame are assumed to be \(P_2 = (X_2, Y_2, Z_2)^T\);
from the coordinate information of the point P, \(Z_1 (u_1, v_1, 1)^T = K P_1\), \(Z_2 (u_2, v_2, 1)^T = K P_2\), and \(P_2 = R P_1 + T\); then \(Z_2 (u_2, v_2, 1)^T = K (R Z_1 K^{-1} (u_1, v_1, 1)^T + T)\), wherein \(K^{-1}\) is the inverse matrix of the calibration parameters K;
acquiring the three-dimensional skeleton coordinates \(P_2 = (X_2, Y_2, Z_2)^T\) of the point P in the current frame, and completing the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates.
Preferably, in determining the three-dimensional skeleton coordinates of the smoothed skeleton points in combination with the head data in the Trackhead to complete gesture tracking,
fusing the smoothed skeleton points with the head tracking data, and transforming the fused data from the camera coordinate system to the coordinate system of the VR (virtual reality) headset to form three-dimensional gesture information;
and transmitting the three-dimensional gesture information to a game engine, rendering it, and transmitting it back to the VR headset in real time for display, so as to complete gesture tracking.
According to the technical scheme, the monocular-based three-dimensional gesture tracking method trains the hand detection model and the skeleton point recognition model to obtain the region of interest in the hand image shot by the monocular camera, estimates the region of interest of the next frame image from the region of interest of the previous frame image, and performs skeleton point recognition on the region of interest; at the same time, head motion data is acquired, and the two-dimensional image data captured by the monocular camera is lifted to three dimensions by combining the head data, so that the three-dimensional coordinates of all skeleton points are determined and the two-dimensional to three-dimensional conversion is completed. The three-dimensional coordinates of the hand region can therefore be displayed with one ordinary monocular camera. Replacing two infrared binocular cameras with one monocular camera reduces cost, and even if two monocular cameras are installed, the cost is still lower than that of one infrared binocular camera. The overall power consumption and heat dissipation are reduced; mounting a monocular camera on the headset simplifies the structural design, lightens the overall weight of the headset, and improves wearing comfort, so the user feels no discomfort even when wearing the headset for a long time. In addition, the monocular camera has a larger FOV, so the motion trajectory of the whole hand can be captured more comprehensively.
Drawings
Other objects and attainments together with a more complete understanding of the invention will become apparent and appreciated by referring to the following description taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 is a schematic diagram of the monocular-based three-dimensional gesture tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the hand skeleton points in the monocular-based three-dimensional gesture tracking method according to an embodiment of the present invention.
Detailed Description
At present, in the VR/AR/MR field, a gesture recognition tracker must be additionally added to mainstream headsets, i.e., two infrared binocular cameras or depth cameras are added as the gesture recognition tracker, which has the following problems: 1. additional cost; 2. additional power consumption: mainstream headsets are all-in-one devices powered by their own battery, so the power consumption of the whole system greatly affects the available interaction time; 3. the increased power consumption also makes heat dissipation a significant challenge; 4. the added cameras increase the complexity of the structural design and the industrial design (ID), running against the development goals of a head-mounted all-in-one device that is small, portable, and comfortable for long-term wear; 5. the FOV of currently mature, popular depth cameras is generally no larger than about 90 degrees, while the FOV required by a headset is generally about 110 degrees, so with a depth camera some hand motion trajectories are very easily lost.
In view of the foregoing, the present invention provides a three-dimensional gesture tracking method based on monocular, and specific embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In order to illustrate the monocular-based three-dimensional gesture tracking method provided by the invention, FIG. 1 illustrates the method according to an embodiment of the invention.
The following description of the exemplary embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses. Techniques and equipment known to those of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
As shown in FIG. 1, the monocular-based three-dimensional gesture tracking method provided by the invention comprises the following steps:
S110: training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand region of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
s120: firstly judging the detection number of hands in the previous frame of image, starting the hand detection model and the tracking module according to the detection number of the hands in the previous frame of image so as to acquire the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module;
s130: acquiring a region of interest on a current frame image from the Trackhand, identifying a bone point of the region of interest of the current frame in the Trackhand through the bone point identification model, and carrying out smoothing filter processing on the identified bone point according to historical data in the Trackhand;
s140: and counting data of the head part on the position and the gesture in each frame of image, storing the data of the head part into a queue Trackhead of the tracking module in real time, and determining three-dimensional skeleton coordinates of skeleton points after smoothing filter processing by combining the data of the head part in the Trackhead so as to finish gesture tracking.
As shown in FIG. 1, in step S110, a head tracking camera is first used to collect hand image data of at least 100 users as action behavior cases, and after collection these cases are input into the hand detection model and the skeleton point recognition model for model training. The head tracking camera is not particularly limited: it may be an infrared binocular camera or a monocular camera. The acquisition method is likewise not particularly limited: it may be the traditional method of manually annotating skeleton points to obtain hand image data labeled with skeleton points, or a cutting-edge automated annotation. In this embodiment, images are collected with the head tracking camera worn on the head, and the skeleton points of the hand images are then annotated, so that the annotation is accurate and the hand detection model and the skeleton point recognition model are trained more accurately. If one head tracking camera is used for head position tracking, each frame input to the hand detection model and the skeleton point recognition model is a single image; if several head tracking cameras are used, each frame input to the models consists of several images. In this embodiment, a single head tracking camera is used. The amount of collected hand image data is not particularly limited, and the more data, the higher the model accuracy; in this embodiment, hand images of at least 100 users are collected as action behavior cases, so that the hand detection model and the skeleton point recognition model are more accurate.
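As an illustration of what one collected training sample might look like, the sketch below shows a hypothetical record holding a hand bounding box and 21 annotated skeleton points; the patent only specifies hand images labeled with skeleton points, so every field name, path, and value here is an assumption for illustration.

```python
# Hypothetical shape of one annotated training sample; all field names,
# the path, and the values are illustrative assumptions, not the patent's
# actual data format.
sample = {
    "user_id": 17,
    "image_path": "hands/user017/frame_000123.png",    # placeholder path
    "hand_rois": [(212, 148, 96, 96)],                 # (x, y, w, h) per hand
    "skeleton_points": [                               # 21 (u, v) per hand
        [(230.5 + 3.0 * i, 160.2 + 2.0 * i) for i in range(21)],
    ],
}
```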
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the invention, in step S120, the number of hands detected in the previous frame image is first judged. If the number is 0 or 1, the hand detection model and the tracking module are started; if the number is 2, only the tracking module is started. That is, a person has at most two hands: if the detected number is 2, both hands were already in the picture in the previous frame, and only the data information of the current frame needs to be stored in the tracking module to await subsequent skeleton point recognition; if no hand or only one hand was detected in the previous frame, the region of interest of the current frame must first be acquired so that subsequent skeleton point recognition can be performed on it.
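A minimal sketch of this dispatch logic follows; the detector/tracker interfaces, the queue length, and the naive merge of detected and tracked regions are assumptions, since the patent does not specify an implementation.

```python
# Per-frame dispatch sketch: run detection plus tracking when fewer than
# two hands were found in the previous frame, tracking only otherwise.
from collections import deque

class GestureTracker:
    def __init__(self, detector, tracker, history=30):
        self.detector = detector                 # trained hand detection model
        self.tracker = tracker                   # tracking module
        self.trackhand = deque(maxlen=history)   # tracking queue "Trackhand"

    def process_frame(self, frame, prev_hand_count):
        if prev_hand_count < 2:
            # 0 or 1 hands in the previous frame: start the hand detection
            # model and the tracking module (duplicate merging omitted)
            rois = self.detector.detect(frame) + self.tracker.track(frame)
        else:
            # both hands already in the picture: tracking module only
            rois = self.tracker.track(frame)
        # store the current frame's data information, including its ROIs
        self.trackhand.append({"frame": frame, "rois": rois})
        return rois
```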
As shown in FIG. 1 and FIG. 2 together, in step S120, the region of interest comprises the position coordinates of the hand in the image and the size of the corresponding hand region, and the total number of skeleton points of a hand is 21. In the process of acquiring the region of interest of the current frame and recognizing skeleton points, the tracking module is started and the data information of the current frame is stored into the tracking queue Trackhand of the tracking module; meanwhile, based on an optical flow tracking algorithm, the region of interest of the next frame is estimated from the region of interest of the current frame so as to provide a reference for the skeleton point recognition of the next frame.
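The patent names only "an optical flow tracking algorithm", so the following sketch, which predicts the next frame's region of interest with pyramidal Lucas-Kanade flow (OpenCV's cv2.calcOpticalFlowPyrLK), is one plausible choice rather than the patented implementation; the feature-tracking parameters are placeholders.

```python
# Predict the next frame's hand ROI by tracking corner features inside the
# current ROI with Lucas-Kanade optical flow and shifting the ROI by the
# mean displacement of the successfully tracked features.
import cv2
import numpy as np

def predict_next_roi(prev_gray, next_gray, roi):
    x, y, w, h = roi                        # ROI: hand position and size
    patch = prev_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(patch, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return roi                          # nothing to track: keep old ROI
    pts = pts + np.array([[[x, y]]], dtype=np.float32)  # patch -> image coords
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                  pts, None)
    mask = status.flatten() == 1
    if not mask.any():
        return roi
    dx, dy = (new_pts[mask] - pts[mask]).reshape(-1, 2).mean(axis=0)
    return (int(x + dx), int(y + dy), w, h)  # shifted ROI for the next frame
```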
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the invention, in step S130, the region of interest of the hand on the current frame image is obtained from the Trackhand, skeleton point recognition of the hand is performed on the region of interest through the skeleton point recognition model, and each skeleton point is then smoothed by comparison with its historical data; this avoids the possibility that some skeleton point in some frame is recognized unstably, and improves the accuracy and stability of hand skeleton point recognition. The historical data here refers to all the data stored in the tracking queue Trackhand each time step S120 is performed, that is, the accumulated set of the data information of each stored current frame.
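The patent does not name the smoothing filter, so the sketch below assumes a simple exponential moving average over the skeleton history kept in Trackhand; the blending weight alpha is a placeholder.

```python
# Smooth the 21 recognized skeleton points against the stored history to
# suppress per-frame jitter; an exponential moving average is assumed here.
import numpy as np

def smooth_skeleton(current, history, alpha=0.6):
    """current: (21, 2) image coords; history: list of earlier (21, 2) arrays."""
    if not history:
        return np.asarray(current)
    prev = np.asarray(history[-1])          # most recent smoothed skeleton
    # blend the new measurement with the history
    return alpha * np.asarray(current) + (1.0 - alpha) * prev
```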
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the invention, in step S140, the data about the position and pose of the head in each frame image is recorded, the head data is stored into the queue Trackhead of the tracking module in real time, and the three-dimensional skeleton coordinates of the skeleton points are calculated by combining the data in the Trackhead, completing two-dimensional to three-dimensional gesture tracking.
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the invention, in step S140, in the process of determining the three-dimensional skeleton coordinates of the smoothed skeleton points by combining the head data in the Trackhead, the head data in the Trackhead is first read, and the translation matrix T and the rotation matrix R of the head in the current frame relative to the previous frame are acquired; the three-dimensional coordinates of the skeleton points are then determined according to the preset calibration parameters of the tracking camera, the translation matrix T and the rotation matrix R. The preset calibration parameters of the tracking camera are

\(K = \begin{bmatrix} fx & 0 & cx \\ 0 & fy & cy \\ 0 & 0 & 1 \end{bmatrix}\)

wherein fx, fy represent the pixel focal lengths of the tracking camera, and cx, cy represent the position of the tracking camera's optical axis in image coordinates. When performing the coordinate calculation, one skeleton point can be selected for three-dimensional calculation, and then the three-dimensional coordinates of the remaining forty-one skeleton points of the two hands (each hand comprises 21 skeleton points) are calculated in turn by the same method until the three-dimensional coordinates of all skeleton points of both hands have been calculated; subsequent rendering then completes the gesture tracking of both hands.
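For concreteness, the following sketch builds K from placeholder calibration values and shows the back-projection z * K^{-1} (u, v, 1)^T that the conversion below relies on; the numeric values are assumptions, not real device calibration.

```python
# The intrinsic matrix K described above, plus the back-projection that
# lifts a pixel at a known depth into camera coordinates.
import numpy as np

fx, fy = 460.0, 460.0       # pixel focal lengths (assumed values)
cx, cy = 320.0, 240.0       # optical-axis position in image coordinates

K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def backproject(u, v, z):
    """Lift pixel (u, v) at depth z to camera coordinates: z * K^-1 (u, v, 1)^T."""
    return z * np.linalg.inv(K) @ np.array([u, v, 1.0])
```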
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the present invention, in step S140, the specific way of selecting any one of the skeleton points for three-dimensional coordinate calculation is not particularly limited. In this embodiment, a point P is first selected from the skeleton points and its coordinate information is acquired: the three-dimensional coordinates of the point P in the previous frame are \(P_1 = (X_1, Y_1, Z_1)^T\), with two-dimensional image coordinates \(L_1 = (u_1, v_1)^T\); the two-dimensional image coordinates of the point P in the current frame are known to be \(L_2 = (u_2, v_2)^T\); and the three-dimensional coordinates of the point P in the current frame are assumed to be \(P_2 = (X_2, Y_2, Z_2)^T\). Here P1, P2, L1 and L2 are treated as column vectors, and the operations below are matrix operations.

From the coordinate information of the point P:

\(Z_1 (u_1, v_1, 1)^T = K P_1\); (1)

\(Z_2 (u_2, v_2, 1)^T = K P_2\); (2)

\(P_2 = R P_1 + T\); (3)

and derived from (1), (2) and (3):

\(Z_2 (u_2, v_2, 1)^T = K (R Z_1 K^{-1} (u_1, v_1, 1)^T + T)\)

wherein \(K^{-1}\) is the inverse matrix of the calibration parameters K.

The three-dimensional skeleton coordinates \(P_2 = (X_2, Y_2, Z_2)^T\) of the point P in the current frame are thus acquired, and the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates is completed; the calculation above omits an intermediate solving step, which is performed by the chip inside the headset.
As shown in FIG. 1, in the monocular-based three-dimensional gesture tracking method provided by the present invention, in step S140, in the process of determining the three-dimensional skeleton coordinates of the smoothed skeleton points by combining the head data in the Trackhead so as to complete gesture tracking,
the smoothed skeleton points are fused with the head tracking data, and the fused data is transformed from the camera coordinate system to the coordinate system of the VR (virtual reality) headset to form three-dimensional gesture information; the three-dimensional gesture information is then transmitted to a game engine, rendered, and transmitted back to the VR headset for display, so that the user sees the corresponding picture on the display of the VR headset and the gesture tracking is completed.
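A hedged sketch of this final coordinate move is shown below; the camera-to-headset extrinsics R_ch and t_ch stand in for the device's real calibration, which the patent does not give.

```python
# Transform the fused 3D skeleton from the camera coordinate system into the
# VR headset coordinate system; R_ch (rotation) and t_ch (offset) are
# placeholder extrinsics for illustration.
import numpy as np

def to_headset_frame(points_cam, R_ch, t_ch):
    """points_cam: (21, 3) skeleton in camera coordinates."""
    return points_cam @ R_ch.T + t_ch   # rotate, then translate each point

# Example: camera assumed 5 cm in front of and 2 cm below the headset origin
R_ch = np.eye(3)
t_ch = np.array([0.0, -0.02, 0.05])
hand_headset = to_headset_frame(np.zeros((21, 3)), R_ch, t_ch)
# hand_headset would then be handed to the game engine for rendering, and the
# rendered view streamed back to the headset display.
```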
According to the monocular-based three-dimensional gesture tracking method described above, the hand detection model and the skeleton point recognition model are trained to obtain the region of interest in the hand image shot by the monocular camera; the region of interest of the next frame image is estimated from the region of interest of the previous frame image, and skeleton point recognition is then performed on the region of interest; meanwhile, head motion data is acquired, and the two-dimensional image data captured by the monocular camera is lifted to three dimensions by combining the head data, so that the three-dimensional coordinates of all skeleton points are determined and the conversion from two-dimensional to three-dimensional is completed. The three-dimensional coordinates of the hand region can therefore be displayed with one ordinary monocular camera, reducing camera cost; even if two monocular cameras are installed on the headset, the cost remains low, the overall power consumption and heat dissipation are reduced, the structural design is simplified, the overall weight of the headset is lightened, and wearing comfort is improved, so the user feels no discomfort even after wearing it for a long time. In addition, the monocular camera has a larger FOV, so the hand motion trajectory can be captured more comprehensively.
The proposed monocular-based three-dimensional gesture tracking method according to the present invention is described above by way of example with reference to the accompanying drawings. However, it will be appreciated by those skilled in the art that various modifications may be made to the monocular-based three-dimensional gesture tracking method set forth above without departing from the teachings of the present disclosure. Accordingly, the scope of the invention should be determined from the following claims.

Claims (9)

1. A monocular-based three-dimensional gesture tracking method, comprising:
training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand region of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
starting the hand detection model and the tracking module according to the number of hands detected in the previous frame image, acquiring the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module, the data information of the current frame at least comprising the region of interest of the current frame;
performing skeleton point recognition on the region of interest of the current frame in the Trackhand through the skeleton point recognition model, and performing smoothing filtering on the recognized skeleton points according to the historical data in the Trackhand;
storing data about the position and pose of the head in each frame image into a queue Trackhead of the tracking module in real time, and determining the three-dimensional skeleton coordinates of the smoothed skeleton points by combining the head data in the Trackhead, so as to complete gesture tracking;
in determining the three-dimensional skeleton coordinates of the smoothed skeleton points in combination with the head data in the Trackhead,
reading the head data in the Trackhead, and acquiring the translation matrix T and the rotation matrix R of the head in the current frame relative to the previous frame;
and determining the three-dimensional coordinates of the skeleton points according to preset calibration parameters of a tracking camera, the translation matrix T and the rotation matrix R.
2. The method of claim 1, wherein, in training the hand detection model and the skeletal point recognition model,
collecting hand image data of at least 100 users by using a head tracking camera as an action behavior case;
and inputting the action behavior cases into the hand detection model and the bone point recognition model for model training.
3. The method of claim 1, wherein in the process of starting the hand detection model and the tracking module according to the number of hands detected in the previous frame of image,
if the detection number is 0 or 1, starting the hand detection model and the tracking module;
if the detection number is 2, only the tracking module is started.
4. The method of claim 1, wherein during skeletal point recognition of a region of interest of a current frame in the Trackhand,
the region of interest comprises the position coordinates of the hand in the image and the size of the region corresponding to the hand;
the number of the skeleton points is 21.
5. The monocular-based three-dimensional gesture tracking method of claim 1, further comprising, in acquiring the region of interest of the current frame:
and estimating the region of interest of the next frame according to the region of interest of the current frame based on an optical flow tracking algorithm so as to provide a reference for bone point identification of the next frame.
6. The monocular-based three-dimensional gesture tracking method of claim 1, wherein the preset calibration parameters of the tracking camera are

\(K = \begin{bmatrix} fx & 0 & cx \\ 0 & fy & cy \\ 0 & 0 & 1 \end{bmatrix}\)

wherein fx, fy represent the pixel focal lengths of the tracking camera, and cx, cy represent the position of the tracking camera's optical axis in image coordinates.
7. The monocular-based three-dimensional gesture tracking method of claim 1, wherein, in determining the three-dimensional coordinates of the skeleton points according to the preset calibration parameters of the tracking camera, the translation matrix T and the rotation matrix R,
selecting any one of the bone points to perform three-dimensional coordinate calculation;
each bone point is calculated in turn until all bone points of both hands are calculated.
8. The monocular-based three-dimensional gesture tracking method of claim 6, wherein, in selecting any one of the skeleton points for three-dimensional coordinate calculation,
selecting a point P from the skeleton points and acquiring the coordinate information of the point P: the three-dimensional coordinates of the point P in the previous frame are \(P_1 = (X_1, Y_1, Z_1)^T\), with two-dimensional image coordinates \(L_1 = (u_1, v_1)^T\); the two-dimensional image coordinates of the point P in the current frame are known to be \(L_2 = (u_2, v_2)^T\); and the three-dimensional coordinates of the point P in the current frame are assumed to be \(P_2 = (X_2, Y_2, Z_2)^T\);
from the coordinate information of the point P, \(Z_1 (u_1, v_1, 1)^T = K P_1\), \(Z_2 (u_2, v_2, 1)^T = K P_2\), and \(P_2 = R P_1 + T\); then \(Z_2 (u_2, v_2, 1)^T = K (R Z_1 K^{-1} (u_1, v_1, 1)^T + T)\), wherein \(K^{-1}\) is the inverse matrix of the calibration parameters K;
and acquiring the three-dimensional skeleton coordinates \(P_2 = (X_2, Y_2, Z_2)^T\) of the point P in the current frame, completing the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates.
9. The monocular-based three-dimensional gesture tracking method of claim 1, wherein, in determining the three-dimensional skeleton coordinates of the smoothed skeleton points in combination with the head data in the Trackhead to complete gesture tracking,
fusing the smoothed skeleton points with the head tracking data, and transforming the fused data from the camera coordinate system to the coordinate system of the VR (virtual reality) headset to form three-dimensional gesture information;
and transmitting the three-dimensional gesture information to a game engine, rendering it, and transmitting it back to the VR headset in real time for display, so as to complete gesture tracking.
CN202010387724.1A (priority date 2020-05-09, filing date 2020-05-09): Monocular-based three-dimensional gesture tracking method. Status: Active. Granted as CN111696140B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010387724.1A | 2020-05-09 | 2020-05-09 | Monocular-based three-dimensional gesture tracking method


Publications (2)

Publication Number | Publication Date
CN111696140A (en) | 2020-09-22
CN111696140B (en) | 2024-02-13

Family

ID: 72477396

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202010387724.1A | Active | CN111696140B (en) | 2020-05-09 | 2020-05-09 | Monocular-based three-dimensional gesture tracking method

Country Status (1)

CN: CN111696140B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416125A (en) * 2020-11-17 2021-02-26 青岛小鸟看看科技有限公司 VR head-mounted all-in-one machine
CN112927259A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Multi-camera-based bare hand tracking display method, device and system
CN112927290A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Bare hand data labeling method and system based on sensor
CN113240741B (en) 2021-05-06 2023-04-07 青岛小鸟看看科技有限公司 Transparent object tracking method and system based on image difference
CN113674395B (en) * 2021-07-19 2023-04-18 广州紫为云科技有限公司 3D hand lightweight real-time capturing and reconstructing system based on monocular RGB camera
CN117809380A (en) * 2024-02-29 2024-04-02 万有引力(宁波)电子科技有限公司 Gesture tracking method, gesture tracking device, gesture tracking apparatus, gesture tracking program product and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104460967A (en) * 2013-11-25 2015-03-25 安徽寰智信息科技股份有限公司 Recognition method of upper limb bone gestures of human body
CN104833360A (en) * 2014-02-08 2015-08-12 无锡维森智能传感技术有限公司 Method for transforming two-dimensional coordinates into three-dimensional coordinates
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
CN106250867A (en) * 2016-08-12 2016-12-21 南京华捷艾米软件科技有限公司 A kind of skeleton based on depth data follows the tracks of the implementation method of system
CN106648103A (en) * 2016-12-28 2017-05-10 歌尔科技有限公司 Gesture tracking method for VR headset device and VR headset device
CN106945059A (en) * 2017-03-27 2017-07-14 中国地质大学(武汉) A kind of gesture tracking method based on population random disorder multi-objective genetic algorithm
CN108196679A (en) * 2018-01-23 2018-06-22 河北中科恒运软件科技股份有限公司 Gesture-capture and grain table method and system based on video flowing
CN108919943A (en) * 2018-05-22 2018-11-30 南京邮电大学 A kind of real-time hand method for tracing based on depth transducer
CN110825234A (en) * 2019-11-11 2020-02-21 江南大学 Projection type augmented reality tracking display method and system for industrial scene


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Wan Qin; Yu Hongshan; Wu Di; Lin Guohan. A survey of multi-moving-target tracking methods based on three-dimensional vision systems. Computer Engineering and Applications, 2017, No. 19, full text. *
Yang Kai; Wei Benzheng; Ren Xiaoqiang; Wang Qingxiang; Liu Huaihui. Human motion pose tracking and recognition algorithm based on depth images. Journal of Data Acquisition and Processing, 2015, No. 5, full text. *
Yang Lujing et al. Intelligent Image Processing and Applications. 2019, full text. *
Chen Hanxiong; Huang Yayun; Liu Yu; Yan Mengkui; Liu Feng. Research and implementation of mid-air gesture tracking and recognition based on Kinect. Video Engineering, 2015, No. 21, full text. *
Huang Dunbo; Lin Zhixian; Yao Jianmin; Guo Tailiang. Monocular gesture tracking based on adaptive extraction and improved CAMSHIFT. Video Engineering, 2016, No. 7, full text. *

Also Published As

Publication number Publication date
CN111696140A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696140B (en) Monocular-based three-dimensional gesture tracking method
CN106873778B (en) Application operation control method and device and virtual reality equipment
CN106251399B (en) A kind of outdoor scene three-dimensional rebuilding method and implementing device based on lsd-slam
CN103140879B Information presentation device, digital camera, head mounted display, projector, information presentation method, and information presentation program
CN107545302B (en) Eye direction calculation method for combination of left eye image and right eye image of human eye
CN110363867B (en) Virtual decorating system, method, device and medium
CN103400119B (en) Face recognition technology-based mixed reality spectacle interactive display method
CN104978548A (en) Visual line estimation method and visual line estimation device based on three-dimensional active shape model
US7404774B1 (en) Rule based body mechanics calculation
CN109343700B (en) Eye movement control calibration data acquisition method and device
JP7015152B2 (en) Processing equipment, methods and programs related to key point data
CN108983982B (en) AR head display equipment and terminal equipment combined system
CN104571511B (en) The system and method for object are reappeared in a kind of 3D scenes
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
CN104364733A (en) Position-of-interest detection device, position-of-interest detection method, and position-of-interest detection program
CN112198959A (en) Virtual reality interaction method, device and system
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
JP5526465B2 (en) Nail position data detection device, nail position data detection method, and nail position data detection program
CN110955329A (en) Transmission method, electronic device, and computer storage medium
Perra et al. Adaptive eye-camera calibration for head-worn devices
WO2014100449A1 (en) Capturing photos without a camera
CN110841266A (en) Auxiliary training system and method
TWI768852B (en) Device for detecting human body direction and method for detecting human body direction
TWI361093B (en) Measuring object contour method and measuring object contour apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant