CN111696140A - Monocular-based three-dimensional gesture tracking method - Google Patents

Monocular-based three-dimensional gesture tracking method

Info

Publication number
CN111696140A
Authority
CN
China
Prior art keywords
dimensional
tracking
bone
data
head
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010387724.1A
Other languages
Chinese (zh)
Other versions
CN111696140B (en)
Inventor
吴涛
周锋宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Xiaoniao Kankan Technology Co Ltd
Original Assignee
Qingdao Xiaoniao Kankan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Xiaoniao Kankan Technology Co Ltd filed Critical Qingdao Xiaoniao Kankan Technology Co Ltd
Priority to CN202010387724.1A priority Critical patent/CN111696140B/en
Publication of CN111696140A publication Critical patent/CN111696140A/en
Application granted granted Critical
Publication of CN111696140B publication Critical patent/CN111696140B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/20 Input arrangements for video game devices
    • A63F13/21 Input arrangements for video game devices characterised by their sensors, purposes or types
    • A63F13/213 Input arrangements for video game devices characterised by their sensors, purposes or types comprising photodetecting means, e.g. cameras, photodiodes or infrared cells
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/10 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
    • A63F2300/1087 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The invention provides a monocular-based three-dimensional gesture tracking method comprising the following steps: training a hand detection model and a skeleton point recognition model; starting the hand detection model and a tracking module according to the number of hands detected in the previous frame of image; recognizing skeleton points in the region of interest of the current frame held in the tracking queue Trackhand through the skeleton point recognition model, and smooth-filtering the recognized skeleton points; recording the position and posture data of the head for each frame and storing it in real time in the tracking module's queue Trackhead; and determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead, then rendering them to complete gesture tracking. A single monocular camera replaces two infrared binocular cameras; even installing two monocular cameras costs less than the infrared binocular cameras. This reduces overall power consumption and heat dissipation, lightens the headset, and improves wearing comfort.

Description

Monocular-based three-dimensional gesture tracking method
Technical Field
The invention relates to the field of computer vision, in particular to a monocular-based three-dimensional gesture tracking method.
Background
To enhance the immersion of the virtual-real combination in VR/AR/MR and deliver a better experience, a human-computer interaction module is indispensable; in particular, high-precision, real-time restoration of the 3D gesture of the hand in a VR/AR/MR scene greatly influences how immersive the user's experience is.
At present, in the VR/AR/MR field, a gesture recognition tracker must be added to a mainstream all-in-one headset; conventionally, this means separately adding 2 infrared binocular cameras, or using a depth camera for finger tracking. Either approach raises the following problems: 1. additional cost; 2. additional power consumption: mainstream headsets are battery-powered all-in-one devices, so the power consumption of the whole system greatly affects the user's interaction time; 3. alongside the power consumption, heat dissipation also becomes a significant challenge; 4. added complexity in the structural and industrial design (ID), contrary to the development goals of a small, lightly worn all-in-one headset that remains comfortable over long sessions; 5. the FOV of mature, widely available depth cameras is small, generally about 90 degrees, whereas a headset generally requires about 110 degrees, so the conventional depth-camera approach easily fails to track part of the hand's motion trajectory.
Therefore, there is a need for a monocular-based three-dimensional gesture tracking method that saves cost, reduces power consumption and heat dissipation, enlarges the visible area, lightens the headset, and improves wearing comfort.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a monocular-based three-dimensional gesture tracking method, so as to solve the problems of existing methods: the high cost, high power consumption, and heat dissipation of infrared binocular cameras; the added complexity and volume of the headset's structural design; discomfort when worn for a long time; and a small viewing angle that cannot fully track the hand's motion trajectory.
The invention provides a monocular-based three-dimensional gesture tracking method which is characterized by comprising the following steps:
training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand area of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
starting the hand detection model and the tracking module according to the number of hands detected in the previous frame of image, to acquire the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module, the data information of the current frame comprising at least the region of interest of the current frame;
recognizing skeleton points in the region of interest of the current frame in the Trackhand through the skeleton point recognition model, and smooth-filtering the recognized skeleton points according to the historical data in the Trackhand; and
storing the position and posture data of the head for each frame of image into a queue Trackhead of the tracking module in real time, and determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead, so as to complete gesture tracking.
Preferably, in the process of training the hand detection model and the skeleton point recognition model,
hand image data from at least 100 users are collected with a head tracking camera as action behavior cases;
and the action behavior cases are input into the hand detection model and the skeleton point recognition model for model training.
Preferably, in the process of starting the hand detection model and the tracking module according to the number of hands detected in the previous frame of image,
if the number detected is 0 or 1, both the hand detection model and the tracking module are started;
if the number detected is 2, only the tracking module is started.
Preferably, in the process of recognizing the skeleton points of the region of interest of the current frame in the Trackhand,
the region of interest comprises the position coordinates of the hand in the image and the size of the region corresponding to the hand;
and the number of skeleton points per hand is 21.
Preferably, the process of acquiring the region of interest of the current frame further includes:
estimating the region of interest of the next frame from the region of interest of the current frame with an optical flow tracking algorithm, to provide a reference for skeleton point recognition in the next frame.
Preferably, in the process of determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead,
the head data in the Trackhead are read, and the translation matrix T and rotation matrix R of the current frame relative to the previous frame's head pose are acquired;
and the three-dimensional coordinates of the skeleton points are determined from the preset calibration parameters of the tracking camera, the translation matrix T, and the rotation matrix R.
Preferably, the preset calibration parameter of the tracking camera is

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix},$$

where fx, fy denote the focal length of the tracking camera in pixels, and cx, cy denote the position of the tracking camera's optical axis in image coordinates.
Preferably, in the process of determining the three-dimensional coordinates of the skeleton points from the preset calibration parameters of the tracking camera, the translation matrix T, and the rotation matrix R,
any one of the skeleton points is selected for three-dimensional coordinate calculation;
and each skeleton point is calculated in turn until all the skeleton points of both hands have been calculated.
Preferably, in selecting any one of the skeleton points for three-dimensional coordinate calculation,
a point P among the skeleton points is selected and its coordinate information acquired. Let the three-dimensional coordinate of point P in the previous frame be

$$P_1 = (X_1, Y_1, Z_1)^T,$$

its two-dimensional image coordinate in the previous frame be

$$p_1 = (u_1, v_1)^T,$$

and its known two-dimensional image coordinate in the current frame be

$$p_2 = (u_2, v_2)^T.$$

Assume the three-dimensional coordinate of point P in the current frame is

$$P_2 = (X_2, Y_2, Z_2)^T.$$

From the coordinate information of point P,

$$Z_1 (u_1, v_1, 1)^T = K P_1, \quad Z_2 (u_2, v_2, 1)^T = K P_2, \quad P_2 = R P_1 + T;$$

then

$$Z_2 K^{-1} (u_2, v_2, 1)^T = R\, Z_1 K^{-1} (u_1, v_1, 1)^T + T,$$

where K^{-1} is the inverse matrix of the calibration parameter K.

Solving for Z_2 and substituting back yields the three-dimensional skeleton coordinate of point P in the current frame,

$$P_2 = (X_2, Y_2, Z_2)^T,$$

completing the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates.
Preferably, in the process of determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead to complete gesture tracking,
the smooth-filtered skeleton points are fused with the tracking data of the head, and the fused data are transformed from the camera coordinate system to the VR headset coordinate system to form three-dimensional gesture information;
and the three-dimensional gesture information is transmitted to a game engine, rendered, and then transmitted back to the VR headset in real time for display, completing gesture tracking.
From the above technical solutions, the monocular-based three-dimensional gesture tracking method provided by the present invention trains a hand detection model and a skeleton point recognition model to obtain the region of interest in the hand image captured by a monocular camera, estimates the region of interest of the next frame from that of the current frame, recognizes the skeleton points in the region of interest, and simultaneously acquires head motion data; the two-dimensional image data captured by the monocular camera are combined with the head data to compute the three-dimensional coordinates of each skeleton point, completing the conversion from two dimensions to three. The three-dimensional coordinates of the hand can therefore be recovered and displayed with one ordinary monocular camera. Replacing two infrared binocular cameras with one monocular camera reduces camera cost (even installing two monocular cameras costs less than one infrared binocular camera) and lowers overall power consumption and heat dissipation. Mounting the monocular camera on the headset also simplifies the structural design, lightens the headset, and improves wearing comfort, so the user feels no discomfort even after long wear; in addition, the larger FOV of the monocular camera captures the hand's motion trajectory more completely.
Drawings
Other objects and results of the present invention will become more apparent and more readily appreciated as the same becomes better understood by reference to the following specification taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 is a schematic diagram of the monocular-based three-dimensional gesture tracking method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating hand skeleton points in a monocular-based three-dimensional gesture tracking method according to an embodiment of the present invention.
Detailed Description
At present, in the VR/AR/MR field, a gesture recognition tracker must be added to a mainstream all-in-one headset, namely 2 infrared binocular cameras or a depth camera, which raises the following problems: 1. additional cost; 2. additional power consumption: mainstream headsets are battery-powered all-in-one devices, so the power consumption of the whole system greatly affects the user's interaction time; 3. alongside the power consumption, heat dissipation also becomes a significant challenge; 4. added complexity in the structural and industrial design (ID), contrary to the development goals of a small, lightly worn all-in-one headset that remains comfortable over long sessions; 5. the FOV of mature, widely available depth cameras is small, generally about 90 degrees, whereas a headset generally requires about 110 degrees, so the conventional depth-camera approach easily fails to track part of the hand's motion trajectory.
In view of the above problems, the present invention provides a monocular-based three-dimensional gesture tracking method; an embodiment of the present invention is described in detail below with reference to the accompanying drawings.
To explain the monocular-based three-dimensional gesture tracking method provided by the present invention, fig. 1 illustrates the method according to the embodiment of the present invention.
The following description of the exemplary embodiment(s) is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. Techniques and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be considered a part of the specification where appropriate.
As shown in fig. 1, the monocular-based three-dimensional gesture tracking method provided by the present invention includes:
S110: training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand area of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
S120: judging the number of hands detected in the previous frame of image, starting the hand detection model and the tracking module accordingly to acquire the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module;
S130: acquiring the region of interest of the current frame image from the Trackhand, recognizing its skeleton points through the skeleton point recognition model, and smooth-filtering the recognized skeleton points according to the historical data in the Trackhand;
S140: recording the position and posture data of the head in each frame of image, storing the head data into a queue Trackhead of the tracking module in real time, and determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead, so as to complete gesture tracking.
As shown in fig. 1, in step S110 of the monocular-based three-dimensional gesture tracking method provided by the present invention, a head tracking camera first collects hand image data from at least 100 users as action behavior cases, and these cases are then input into the hand detection model and the skeleton point recognition model for training. The head tracking camera is not specifically limited: it may be an infrared binocular camera or a monocular camera. The collection method is likewise not limited: the skeleton points of the hand images may be annotated manually, in the traditional way, or with state-of-the-art automated tooling. In this embodiment, the images are collected with the head-mounted tracking camera and the hand skeleton points are then annotated, so that the annotations are accurate and the hand detection model and skeleton point recognition model are trained more precisely. If 1 head tracking camera is used for head position tracking, each frame fed to the hand detection model and the skeleton point recognition model is a single image; if several head tracking cameras are used, each frame comprises several images. This embodiment uses a single head tracking camera. The amount of hand image data collected is not specifically limited; the more data, the higher the model accuracy. In this embodiment, hand images from at least 100 users are collected as action behavior cases, giving the hand detection model and the skeleton point recognition model higher precision.
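For illustration only (the patent does not prescribe an implementation), the following minimal Python sketch shows how the two trained models could be chained at inference time; the class name, the detect/predict interfaces, and the ROI format are assumptions, not the patent's API.

```python
import numpy as np

KEYPOINTS_PER_HAND = 21  # the patent specifies 21 skeleton points per hand

class GesturePipeline:
    """Chains the two trained models: the detector locks the hand region
    of interest, then the skeleton model recognizes keypoints inside it."""

    def __init__(self, hand_detector, skeleton_model):
        self.hand_detector = hand_detector    # assumed: detect(frame) -> ROIs (x, y, w, h)
        self.skeleton_model = skeleton_model  # assumed: predict(crop) -> (21, 2) keypoints

    def process(self, frame: np.ndarray) -> list:
        hands = []
        for (x, y, w, h) in self.hand_detector.detect(frame):
            crop = frame[y:y + h, x:x + w]
            # Keypoints come back in crop coordinates; shift them into
            # full-image coordinates so later stages share one frame.
            kps = self.skeleton_model.predict(crop) + np.array([x, y])
            hands.append({"roi": (x, y, w, h), "keypoints_2d": kps})
        return hands
```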
As shown in fig. 1, in step S120, the number of hands detected in the previous frame of image is first judged. If the number is 0 or 1, both the hand detection model and the tracking module are started; if the number is 2, only the tracking module is started. That is, a person has at most two hands: if 2 hands were detected, both hands were already in the picture in the previous frame, so only the data information of the current frame needs to be stored in the tracking module to await subsequent skeleton point recognition. If the previous frame contained no hand or only one hand, the region of interest must first be acquired in the current frame so that skeleton point recognition can then be performed on it.
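The dispatch rule above condenses into a short sketch; the detector and tracker objects and their attributes are hypothetical stand-ins for the patent's hand detection model and tracking module.

```python
def acquire_current_frame(prev_hand_count, frame, detector, tracker):
    """Step S120 dispatch: with 0 or 1 hands found in the previous frame,
    run the hand detection model to (re)acquire regions of interest; with
    both hands already in view, the tracking module alone carries the
    predicted ROIs forward."""
    if prev_hand_count < 2:
        rois = detector.detect(frame)     # hand detection model runs
    else:
        rois = tracker.predicted_rois     # both hands already tracked
    entry = {"frame": frame, "rois": rois}
    tracker.trackhand.append(entry)       # enqueue into Trackhand
    return entry
```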
As shown in fig. 1 and fig. 2, in step S120 the region of interest comprises the position coordinates of the hand in the image and the size of the region corresponding to the hand, and each hand has 21 skeleton points. In the process of acquiring the region of interest of the current frame and recognizing the skeleton points, the tracking module is started and the data information of the current frame is stored in the tracking queue Trackhand of the tracking module; meanwhile, the region of interest of the next frame is estimated from the region of interest of the current frame with an optical flow tracking algorithm, providing a reference for skeleton point recognition in the next frame, as sketched below.
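As a hedged sketch of the optical-flow estimate (the patent does not name a specific algorithm), pyramidal Lucas-Kanade flow from OpenCV can carry an ROI from one frame to the next:

```python
import cv2
import numpy as np

def predict_next_roi(prev_gray, next_gray, roi):
    """Estimate where a hand ROI will be in the next frame with pyramidal
    Lucas-Kanade optical flow, one plausible choice of 'optical flow
    tracking algorithm' in the sense of the patent."""
    x, y, w, h = roi
    # Seed points: the four ROI corners plus its centre.
    pts = np.array([[x, y], [x + w, y], [x, y + h], [x + w, y + h],
                    [x + w / 2.0, y + h / 2.0]],
                   dtype=np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    ok = status.ravel() == 1
    if not ok.any():
        return roi  # flow lost: reuse the current ROI as the estimate
    # Shift the ROI by the mean displacement of the tracked points.
    shift = (nxt.reshape(-1, 2)[ok] - pts.reshape(-1, 2)[ok]).mean(axis=0)
    return (x + float(shift[0]), y + float(shift[1]), w, h)
```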
As shown in fig. 1, in step S130 of the monocular-based three-dimensional gesture tracking method provided by the present invention, the region of interest of the hand in the current frame image is obtained from the Trackhand, the hand's skeleton points are recognized in that region through the skeleton point recognition model, and each skeleton point is then smooth-filtered against its historical data. This guards against an unstable recognition of some skeleton point in some frame and improves the accuracy and stability of hand skeleton point recognition. The historical data are all the data already stored in the tracking queue Trackhand each time step S120 is performed, that is, the accumulated sets of current-frame data information stored when the region of interest of each frame was acquired. A simple smoothing scheme is sketched below.
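One possible smoothing filter consistent with this description is a short window average over the keypoint history; the window size and plain averaging are illustrative choices, not taken from the patent.

```python
from collections import deque
import numpy as np

class KeypointSmoother:
    """Smooths each recognized skeleton point against its history in the
    tracking queue, so one unstable recognition in one frame does not
    make the hand jitter."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)   # recent (21, 2) keypoint sets

    def smooth(self, keypoints):
        self.history.append(np.asarray(keypoints, dtype=np.float32))
        return np.mean(self.history, axis=0)  # window average per point
```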
As shown in fig. 1, in step S140 of the monocular-based three-dimensional gesture tracking method provided by the present invention, the position and posture data of the head in each frame of image are recorded and stored in the queue Trackhead of the tracking module in real time, and the three-dimensional skeleton coordinates of the skeleton points are calculated by combining the data in the Trackhead, completing the two-dimensional to three-dimensional gesture tracking.
As shown in fig. 1, in step S140, in the process of determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead, the head data in the Trackhead are read first, and the translation matrix T and rotation matrix R of the current frame relative to the previous frame's head pose are acquired; the three-dimensional coordinates of the skeleton points are then determined from the preset calibration parameters of the tracking camera, the translation matrix T, and the rotation matrix R. The preset calibration parameter of the tracking camera is

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix},$$

where fx, fy denote the focal length of the tracking camera in pixels, and cx, cy denote the position of the tracking camera's optical axis in image coordinates. For the coordinate calculation, one skeleton point is selected for three-dimensional computation, and the remaining 41 skeleton points of the two hands (each hand comprises 21 skeleton points) are then computed in turn by the same method, until the three-dimensional coordinates of all the skeleton points of both hands have been calculated; subsequent rendering then completes the gesture tracking of both hands.
As shown in fig. 1, in step S140, in the process of selecting any one of the skeleton points for three-dimensional coordinate calculation, the specific operation is not specifically limited. In this embodiment, a point P among the skeleton points is first selected and its coordinate information acquired. Let the three-dimensional coordinate of point P in the previous frame be

$$P_1 = (X_1, Y_1, Z_1)^T,$$

its two-dimensional image coordinate in the previous frame be

$$p_1 = (u_1, v_1)^T,$$

and its known two-dimensional image coordinate in the current frame be

$$p_2 = (u_2, v_2)^T.$$

Assume the three-dimensional coordinate of point P in the current frame is

$$P_2 = (X_2, Y_2, Z_2)^T.$$

Here P_1 and P_2 are column vectors, as are the homogeneous image coordinates below; all of the following are matrix operations.

From the coordinate information of point P,

$$Z_1 (u_1, v_1, 1)^T = K P_1, \quad ①$$
$$Z_2 (u_2, v_2, 1)^T = K P_2, \quad ②$$
$$P_2 = R P_1 + T. \quad ③$$

From formulas ①, ②, and ③,

$$Z_2 K^{-1} (u_2, v_2, 1)^T = R\, Z_1 K^{-1} (u_1, v_1, 1)^T + T = R P_1 + T,$$

where K^{-1} is the inverse matrix of the calibration parameter K. Solving this for Z_2 (for example, from the third row) and substituting back gives the three-dimensional skeleton coordinate of point P in the current frame,

$$P_2 = Z_2 K^{-1} (u_2, v_2, 1)^T = (X_2, Y_2, Z_2)^T,$$

completing the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates. The calculation above omits intermediate steps and is executed by the chip inside the headset.
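The derivation condenses into a few lines of linear algebra. The following sketch assumes, as formulas ① to ③ do, that the point is effectively static in the world between the two frames; K, R, T, P_1, and p_2 come from the steps already described, and the numeric values in the example are invented for illustration.

```python
import numpy as np

def lift_to_3d(K, R, T, P1, p2):
    """Recover the current-frame 3D coordinate P2 of one skeleton point
    from formulas ①-③: Z2 * K^-1 * (u2, v2, 1)^T = R * P1 + T.
    K: 3x3 intrinsics, R: 3x3 head rotation, T: (3,) head translation,
    P1: previous-frame 3D point, p2: current-frame 2D observation."""
    rhs = R @ P1 + T                           # = P2 by formula ③
    ray = np.linalg.inv(K) @ np.array([p2[0], p2[1], 1.0])
    Z2 = rhs[2] / ray[2]                       # solve the third component for Z2
    return Z2 * ray                            # P2 = Z2 * K^-1 * (u2, v2, 1)^T

# Example with invented numbers:
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([0.01, 0.0, 0.0])                 # small head translation
P1 = np.array([0.10, -0.05, 0.40])             # previous-frame 3D point (metres)
p2 = K @ (R @ P1 + T); p2 = p2[:2] / p2[2]     # consistent 2D observation
print(lift_to_3d(K, R, T, P1, p2))             # approximately R @ P1 + T
```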
As shown in fig. 1, in step S140, in the process of determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead to complete gesture tracking,
the smooth-filtered skeleton points are fused with the tracking data of the head, and the fused data are transformed from the camera coordinate system to the VR headset coordinate system to form three-dimensional gesture information; the three-dimensional gesture information is then transmitted to a game engine, rendered, and transmitted back to the VR headset in real time for display, so that users see their own hands on the headset's display, completing the gesture tracking.
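A minimal sketch of the coordinate transfer, assuming known camera-to-headset extrinsics (the names R_ch and t_ch are illustrative):

```python
import numpy as np

def to_headset_frame(points_cam, R_ch, t_ch):
    """Transforms fused 3D skeleton points from the tracking-camera
    coordinate system into the VR headset coordinate system before they
    are handed to the game engine. R_ch (3x3) and t_ch (3,) are the
    camera-to-headset extrinsics, assumed known from headset calibration."""
    points_cam = np.asarray(points_cam)        # shape (N, 3)
    return points_cam @ R_ch.T + t_ch          # row-vector convention

# Usage: hands_hmd = to_headset_frame(hands_cam, R_ch, t_ch)
```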
It can be seen from the foregoing embodiments that the monocular-based three-dimensional gesture tracking method provided by the present invention trains a hand detection model and a skeleton point recognition model to obtain the region of interest in the hand image captured by a monocular camera, estimates the region of interest of the next frame from that of the current frame, recognizes the skeleton points in the region of interest, and simultaneously acquires head motion data; the two-dimensional image data captured by the monocular camera are combined with the head data to compute the three-dimensional coordinates of each skeleton point, completing the two-dimensional to three-dimensional conversion. The three-dimensional coordinates of the hand can therefore be displayed with one ordinary monocular camera, reducing camera cost (even installing two monocular cameras costs less than one infrared binocular camera) and lowering overall power consumption and heat dissipation. Mounting the monocular camera on the headset also simplifies the structural design, lightens the headset, and improves wearing comfort, so the user feels no discomfort even after long wear; in addition, the larger FOV of the monocular camera captures the hand's motion trajectory more completely.
The proposed monocular-based three-dimensional gesture tracking method according to the present invention is described above by way of example with reference to the accompanying drawings. However, it should be understood by those skilled in the art that various modifications can be made to the monocular-based three-dimensional gesture tracking method of the present invention without departing from the scope of the present invention. Therefore, the scope of the present invention should be determined by the contents of the appended claims.

Claims (10)

1. A monocular-based three-dimensional gesture tracking method is characterized by comprising the following steps:
training a hand detection model and a skeleton point recognition model, so that the hand detection model automatically locks the hand area of an image as a region of interest, and the skeleton point recognition model automatically recognizes the skeleton points in the region of interest;
starting the hand detection model and the tracking module according to the number of hands detected in the previous frame of image, acquiring the region of interest of the current frame, and storing the data information of the current frame into a tracking queue Trackhand of the tracking module, the data information of the current frame comprising at least the region of interest of the current frame;
recognizing skeleton points in the region of interest of the current frame in the Trackhand through the skeleton point recognition model, and smooth-filtering the recognized skeleton points according to the historical data in the Trackhand; and
storing the position and posture data of the head for each frame of image into a queue Trackhead of the tracking module in real time, and determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead, so as to complete gesture tracking.
2. The monocular-based three-dimensional gesture tracking method of claim 1, wherein, in the process of training the hand detection model and the skeleton point recognition model,
hand image data from at least 100 users are collected with a head tracking camera as action behavior cases;
and the action behavior cases are input into the hand detection model and the skeleton point recognition model for model training.
3. The monocular-based three-dimensional gesture tracking method according to claim 1, wherein, in the process of starting the hand detection model and the tracking module according to the number of hands detected in the previous frame of image,
if the number detected is 0 or 1, both the hand detection model and the tracking module are started;
if the number detected is 2, only the tracking module is started.
4. The monocular-based three-dimensional gesture tracking method according to claim 1, wherein, in the process of recognizing the skeleton points of the region of interest of the current frame in the Trackhand,
the region of interest comprises the position coordinates of the hand in the image and the size of the region corresponding to the hand;
and the number of skeleton points per hand is 21.
5. The monocular-based three-dimensional gesture tracking method according to claim 1, further comprising, in the process of acquiring the region of interest of the current frame:
estimating the region of interest of the next frame from the region of interest of the current frame with an optical flow tracking algorithm, to provide a reference for skeleton point recognition in the next frame.
6. The monocular-based three-dimensional gesture tracking method according to claim 1, wherein, in determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points in combination with the head data in the Trackhead,
the head data in the Trackhead are read, and the translation matrix T and rotation matrix R of the current frame relative to the previous frame's head pose are acquired;
and the three-dimensional coordinates of the skeleton points are determined from the preset calibration parameters of the tracking camera, the translation matrix T, and the rotation matrix R.
7. The monocular-based three-dimensional gesture tracking method of claim 6, wherein the preset calibration parameter of the tracking camera is

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix},$$

where fx, fy denote the focal length of the tracking camera in pixels, and cx, cy denote the position of the tracking camera's optical axis in image coordinates.
8. The monocular-based three-dimensional gesture tracking method according to claim 6, wherein, in the process of determining the three-dimensional coordinates of the skeleton points from the preset calibration parameters of the tracking camera, the translation matrix T, and the rotation matrix R,
any one of the skeleton points is selected for three-dimensional coordinate calculation;
and each skeleton point is calculated in turn until all the skeleton points of both hands have been calculated.
9. The monocular-based three-dimensional gesture tracking method according to any one of claims 6 to 8, wherein, in selecting any one of the skeleton points for three-dimensional coordinate calculation,
a point P among the skeleton points is selected and its coordinate information acquired. Let the three-dimensional coordinate of point P in the previous frame be

$$P_1 = (X_1, Y_1, Z_1)^T,$$

its two-dimensional image coordinate in the previous frame be

$$p_1 = (u_1, v_1)^T,$$

and its known two-dimensional image coordinate in the current frame be

$$p_2 = (u_2, v_2)^T.$$

Assume the three-dimensional coordinate of point P in the current frame is

$$P_2 = (X_2, Y_2, Z_2)^T.$$

From the coordinate information of point P,

$$Z_1 (u_1, v_1, 1)^T = K P_1, \quad Z_2 (u_2, v_2, 1)^T = K P_2, \quad P_2 = R P_1 + T;$$

then

$$Z_2 K^{-1} (u_2, v_2, 1)^T = R\, Z_1 K^{-1} (u_1, v_1, 1)^T + T,$$

where K^{-1} is the inverse matrix of the calibration parameter K.

Solving for Z_2 and substituting back yields the three-dimensional skeleton coordinate of point P in the current frame,

$$P_2 = (X_2, Y_2, Z_2)^T,$$

completing the conversion from two-dimensional skeleton coordinates to three-dimensional skeleton coordinates.
10. The monocular-based three-dimensional gesture tracking method according to claim 1, wherein, in the process of determining the three-dimensional skeleton coordinates of the smooth-filtered skeleton points by combining the head data in the Trackhead to complete gesture tracking,
the smooth-filtered skeleton points are fused with the tracking data of the head, and the fused data are transformed from the camera coordinate system to the VR headset coordinate system to form three-dimensional gesture information;
and the three-dimensional gesture information is transmitted to a game engine, rendered, and then transmitted back to the VR headset in real time for display, completing gesture tracking.
CN202010387724.1A 2020-05-09 2020-05-09 Monocular-based three-dimensional gesture tracking method Active CN111696140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010387724.1A CN111696140B (en) 2020-05-09 2020-05-09 Monocular-based three-dimensional gesture tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387724.1A CN111696140B (en) 2020-05-09 2020-05-09 Monocular-based three-dimensional gesture tracking method

Publications (2)

Publication Number Publication Date
CN111696140A true CN111696140A (en) 2020-09-22
CN111696140B CN111696140B (en) 2024-02-13

Family

ID=72477396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387724.1A Active CN111696140B (en) 2020-05-09 2020-05-09 Monocular-based three-dimensional gesture tracking method

Country Status (1)

Country Link
CN (1) CN111696140B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927259A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Multi-camera-based bare hand tracking display method, device and system
CN112927290A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Bare hand data labeling method and system based on sensor
CN113674395A (en) * 2021-07-19 2021-11-19 广州紫为云科技有限公司 3D hand lightweight real-time capturing and reconstructing system based on monocular RGB camera
WO2022105613A1 (en) * 2020-11-17 2022-05-27 青岛小鸟看看科技有限公司 Head-mounted vr all-in-one machine
WO2022233111A1 (en) * 2021-05-06 2022-11-10 青岛小鸟看看科技有限公司 Transparent object tracking method and system based on image difference

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104460967A (en) * 2013-11-25 2015-03-25 安徽寰智信息科技股份有限公司 Recognition method of upper limb bone gestures of human body
CN104833360A (en) * 2014-02-08 2015-08-12 无锡维森智能传感技术有限公司 Method for transforming two-dimensional coordinates into three-dimensional coordinates
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
CN106250867A (en) * 2016-08-12 2016-12-21 南京华捷艾米软件科技有限公司 A kind of skeleton based on depth data follows the tracks of the implementation method of system
CN106648103A (en) * 2016-12-28 2017-05-10 歌尔科技有限公司 Gesture tracking method for VR headset device and VR headset device
CN106945059A (en) * 2017-03-27 2017-07-14 中国地质大学(武汉) A kind of gesture tracking method based on population random disorder multi-objective genetic algorithm
CN108196679A (en) * 2018-01-23 2018-06-22 河北中科恒运软件科技股份有限公司 Gesture-capture and grain table method and system based on video flowing
CN108919943A (en) * 2018-05-22 2018-11-30 南京邮电大学 A kind of real-time hand method for tracing based on depth transducer
CN110825234A (en) * 2019-11-11 2020-02-21 江南大学 Projection type augmented reality tracking display method and system for industrial scene

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104460967A (en) * 2013-11-25 2015-03-25 安徽寰智信息科技股份有限公司 Recognition method of upper limb bone gestures of human body
CN104833360A (en) * 2014-02-08 2015-08-12 无锡维森智能传感技术有限公司 Method for transforming two-dimensional coordinates into three-dimensional coordinates
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
CN106250867A (en) * 2016-08-12 2016-12-21 南京华捷艾米软件科技有限公司 A kind of skeleton based on depth data follows the tracks of the implementation method of system
US20180047175A1 (en) * 2016-08-12 2018-02-15 Nanjing Huajie Imi Technology Co., Ltd Method for implementing human skeleton tracking system based on depth data
CN106648103A (en) * 2016-12-28 2017-05-10 歌尔科技有限公司 Gesture tracking method for VR headset device and VR headset device
CN106945059A (en) * 2017-03-27 2017-07-14 中国地质大学(武汉) A kind of gesture tracking method based on population random disorder multi-objective genetic algorithm
CN108196679A (en) * 2018-01-23 2018-06-22 河北中科恒运软件科技股份有限公司 Gesture-capture and grain table method and system based on video flowing
CN108919943A (en) * 2018-05-22 2018-11-30 南京邮电大学 A kind of real-time hand method for tracing based on depth transducer
CN110825234A (en) * 2019-11-11 2020-02-21 江南大学 Projection type augmented reality tracking display method and system for industrial scene

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
VR陀螺: "uSens releases three-dimensional hand gesture skeleton recognition on a monocular RGB camera", pages 1 - 6 *
万琴; 余洪山; 吴迪; 林国汉: "A survey of multi-moving-target tracking methods based on three-dimensional vision systems", no. 19 *
杨凯; 魏本征; 任晓强; 王庆祥; 刘怀辉: "Human motion pose tracking and recognition algorithm based on depth images", no. 05 *
杨露菁 et al.: "Intelligent Image Processing and Applications" *
陈翰雄; 黄雅云; 刘宇; 闫梦奎; 刘峰: "Research and implementation of mid-air gesture tracking and recognition based on Kinect", no. 21 *
黄敦博; 林志贤; 姚剑敏; 郭太良: "Monocular hand gesture tracking based on adaptive extraction and improved CAMSHIFT", no. 07 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022105613A1 (en) * 2020-11-17 2022-05-27 青岛小鸟看看科技有限公司 Head-mounted vr all-in-one machine
US11941167B2 (en) 2020-11-17 2024-03-26 Qingdao Pico Technology Co., Ltd Head-mounted VR all-in-one machine
CN112927259A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Multi-camera-based bare hand tracking display method, device and system
CN112927290A (en) * 2021-02-18 2021-06-08 青岛小鸟看看科技有限公司 Bare hand data labeling method and system based on sensor
WO2022174594A1 (en) * 2021-02-18 2022-08-25 青岛小鸟看看科技有限公司 Multi-camera-based bare hand tracking and display method and system, and apparatus
US20220383523A1 (en) * 2021-02-18 2022-12-01 Qingdao Pico Technology Co., Ltd. Hand tracking method, device and system
US11798177B2 (en) * 2021-02-18 2023-10-24 Qingdao Pico Technology Co., Ltd. Hand tracking method, device and system
WO2022233111A1 (en) * 2021-05-06 2022-11-10 青岛小鸟看看科技有限公司 Transparent object tracking method and system based on image difference
US11645764B2 (en) 2021-05-06 2023-05-09 Qingdao Pico Technology Co., Ltd. Image difference-based method and system for tracking a transparent object
CN113674395A (en) * 2021-07-19 2021-11-19 广州紫为云科技有限公司 3D hand lightweight real-time capturing and reconstructing system based on monocular RGB camera

Also Published As

Publication number Publication date
CN111696140B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN111696140B (en) Monocular-based three-dimensional gesture tracking method
CN106873778B (en) Application operation control method and device and virtual reality equipment
CN103140879B (en) Information presentation device, digital camera, head mounted display, projecting apparatus, information demonstrating method and information are presented program
CN110363867B (en) Virtual decorating system, method, device and medium
CN105389539B (en) A kind of three-dimension gesture Attitude estimation method and system based on depth data
CN107545302B (en) Eye direction calculation method for combination of left eye image and right eye image of human eye
CN104978548B (en) A kind of gaze estimation method and device based on three-dimensional active shape model
JP7015152B2 (en) Processing equipment, methods and programs related to key point data
CN108983982B (en) AR head display equipment and terminal equipment combined system
WO2020042542A1 (en) Method and apparatus for acquiring eye movement control calibration data
JP4692526B2 (en) Gaze direction estimation apparatus, gaze direction estimation method, and program for causing computer to execute gaze direction estimation method
US20090042661A1 (en) Rule based body mechanics calculation
KR101892735B1 (en) Apparatus and Method for Intuitive Interaction
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
JP4936491B2 (en) Gaze direction estimation apparatus, gaze direction estimation method, and program for causing computer to execute gaze direction estimation method
JP5526465B2 (en) Nail position data detection device, nail position data detection method, and nail position data detection program
Gurbuz et al. Model free head pose estimation using stereovision
CN112416125A (en) VR head-mounted all-in-one machine
CN107422844A (en) A kind of information processing method and electronic equipment
Zou et al. Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking
JP2015100032A (en) Video display device, video presentation method and program
CN112766097B (en) Sight line recognition model training method, sight line recognition device and sight line recognition equipment
Lages et al. Enhanced geometric techniques for point marking in model-free augmented reality
CN115223240B (en) Motion real-time counting method and system based on dynamic time warping algorithm
JP2017227687A (en) Camera assembly, finger shape detection system using camera assembly, finger shape detection method using camera assembly, program implementing detection method, and recording medium of program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant