CN116403280A - Monocular camera augmented reality gesture interaction method based on key point detection - Google Patents

Monocular camera augmented reality gesture interaction method based on key point detection

Info

Publication number
CN116403280A
CN116403280A
Authority
CN
China
Prior art keywords
hand
virtual
key
key point
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310309434.9A
Other languages
Chinese (zh)
Inventor
张玉梅
肖跃灵
吴晓军
李鼎钺
戎宇莹
赵焱青
刘诗轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202310309434.9A priority Critical patent/CN116403280A/en
Publication of CN116403280A publication Critical patent/CN116403280A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/245 Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

A monocular camera augmented reality gesture interaction method based on key point detection comprises the steps of acquiring an input image, detecting hand key point coordinates, determining character string data, transmitting the character string data S, constructing a virtual hand model, storing the character string data S, driving virtual hand motion, triggering augmented reality gesture interaction, and evaluating performance. The invention applies a hand key point detection network to monocular-camera gesture tracking and recognition in augmented reality, so that the hand features of the current frame are effectively screened; at the same time the tracked hand key point information is standardized, making the data more regular and easier to use for driving the motion of the virtual hand, with high real-time performance. The method has the advantages of high real-time performance, strong immersion and low equipment cost, and can perform gesture tracking and recognition against different backgrounds.

Description

Monocular camera augmented reality gesture interaction method based on key point detection
Technical Field
The invention belongs to the technical field of augmented reality interaction, and particularly relates to a method for gesture tracking, recognition and data transmission.
Background
In recent years, with the gradual popularization of virtual reality equipment, virtual reality interaction has become a very active topic and research hotspot. As the economy and society develop at high speed, almost everyone lives under some degree of pressure and anxiety; in the virtual reality world, people can temporarily step away from their current emotional world and enter a brand-new virtual world. For the virtual reality experience, immersion is very important, and immersion largely comes from the naturalness and realism of interaction: an interaction experience closer to reality brings stronger immersion. Research on interaction that is closer to natural interaction therefore has profound significance.
Most interaction involves the hand. Human hand movement mainly controls the motion of the fingers through the muscles, with the nerves driving the muscles and tendons to move the bones. The bones of the human hand comprise the carpal bones, metacarpal bones and phalanges; the metacarpophalangeal joints and the finger joints mainly provide the functions of flexion, extension, adduction, abduction and rotation, so the posture of the hand depends largely on the positions of the hand joints. If the positions of the real hand joints can be obtained through a monocular camera and transmitted to a virtual hand in the virtual world, the various poses of the virtual hand can be controlled, and hand movement in the virtual world can be driven directly by the movement of the real hand.
There are various interaction modes for virtual reality. The most traditional interaction is performed through a game controller, where interaction is completed by pressing the various buttons on the controller; this differs greatly from direct hand interaction in the real world and offers low immersion. Interaction based on data gloves gives a strong sense of immersion, but the experience cost is high, which makes wide popularization difficult.
Interaction based on monocular camera machine vision is a new research direction in the field of gesture interaction; introducing machine learning brings interaction closer to its original goal, namely artificial intelligence. Deep learning methods include artificial neural networks, convolutional neural networks and recurrent neural networks, and deep learning can automatically learn features from big data. At present, deep learning can effectively perform gesture tracking in the field of gesture interaction.
In the field of extended reality interaction technology, an urgent technical problem is to provide a gesture interaction method with higher accuracy, stronger immersion and lower equipment cost.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a monocular camera augmented reality gesture interaction method based on key point detection that has higher accuracy, stronger immersion and lower equipment cost.
The technical scheme adopted for solving the technical problems comprises the following steps.
(1) Acquiring an input image
Taking a real-time image shot by a monocular camera as an input image, wherein the width w of the input image is at least 400 pixels, and the height h is at least 200 pixels.
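As an illustration only (not part of the patent text), the following Python sketch grabs a real-time frame with opencv-python — one of the libraries listed in the embodiment — and checks the minimum resolution stated above; the camera index and the helper name get_input_image are assumptions.

```python
import cv2

MIN_W, MIN_H = 400, 200  # minimum input size required by the method

def get_input_image(cap: cv2.VideoCapture):
    """Grab one real-time frame from the monocular camera and validate its size."""
    ok, frame = cap.read()
    if not ok:
        return None
    h, w = frame.shape[:2]
    if w < MIN_W or h < MIN_H:
        raise ValueError(f"input image {w}x{h} is smaller than {MIN_W}x{MIN_H}")
    return frame

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)   # hypothetical camera index
    frame = get_input_image(cap)
    cap.release()
```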
(2) Detecting hand keypoint coordinates
The hand key point detection network is used for obtaining hand key point coordinates of an input image according to the following method:
1) Set a hand confidence threshold θ, θ ∈ (0, 1); the hand detection model is started when the hand confidence is lower than θ.
2) Carry out hand detection on the input image; from left to right the hands are numbered 0 and 1, n is 0 or 1, the n-th hand is denoted H_n, and H_n carries a left-hand label l or a right-hand label r.
3) And positioning the detected hand, and cutting out the hand area.
4) Input each hand region into the hand key point detection network to detect the hand key points, and output the coordinates of the 21 key points of hand H_n, as follows:
Key point 0 is the wrist; key points 1 to 4 are the four joint points from the root of the thumb to the fingertip; key points 5 to 8 are the four joint points from the root of the index finger to the fingertip; key points 9 to 12 are the four joint points from the root of the middle finger to the fingertip; key points 13 to 16 are the four joint points from the root of the ring finger to the fingertip; and key points 17 to 20 are the four joint points from the root of the little finger to the fingertip.
Hand key point j of H_n and its x-, y- and z-axis coordinates are denoted by symbols that appear as formula images in the original publication. The x-axis and y-axis coordinates are coordinates relative to key point 0 on the input image; the z-axis coordinate of key point 0 is a minimal value and is taken as the z-axis origin of the n-th hand H_n. If the z-axis value is negative, the wrist root is farther from the camera; otherwise the wrist is nearer to the camera.
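The patent does not publish the hand key point detection network itself. Since the embodiment lists mediapipe 0.9.1 among its third-party libraries, the sketch below shows one plausible way to obtain 21 key points per hand with MediaPipe Hands; mapping θ onto MediaPipe's detection and tracking confidence parameters, and the variable names, are assumptions.

```python
import cv2
import mediapipe as mp

THETA = 0.68  # hand confidence threshold θ from the preferred embodiment

hands = mp.solutions.hands.Hands(
    static_image_mode=False,        # video mode: detector reruns only when tracking is lost
    max_num_hands=2,                # hands numbered 0 and 1 from left to right
    min_detection_confidence=THETA,
    min_tracking_confidence=THETA,
)

def detect_keypoints(frame_bgr):
    """Return a list of (label, [(x, y, z)] * 21) for each detected hand H_n."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    detections = []
    if results.multi_hand_landmarks:
        for landmarks, handedness in zip(results.multi_hand_landmarks,
                                         results.multi_handedness):
            label = handedness.classification[0].label   # 'Left' or 'Right'
            pts = [(lm.x, lm.y, lm.z) for lm in landmarks.landmark]  # 21 key points
            detections.append((label, pts))
    return detections
```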
(3) Determining character string data
The key point coordinates of hand H_n are converted according to the conversion formulas given as images in the original publication, yielding converted x-, y- and z-axis coordinates for each key point. The distance between key point 5 and key point 17 of the n-th hand H_n is calculated as the palm width L_n. The character string data S is then assembled from the converted coordinates of the key points k and the palm width, where k ∈ {0, 1, ..., 20} (both formulas are likewise given as images in the original).
(4) Transmitting character string data S
The character string data S is transmitted to the Unity engine through the User Datagram Protocol.
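The exact layout of S is defined by a formula image in the original, so the comma-separated field order below (handedness label, 21 × 3 converted coordinates, palm width), the endpoint address and the function name are assumptions; the UDP transport itself matches the description and uses Python's standard socket module.

```python
import socket

UDP_ADDR = ("127.0.0.1", 5052)  # hypothetical Unity-side endpoint
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_hand_string(label, keypoints, palm_width):
    """Serialize one hand as comma-separated text and send it via UDP.

    keypoints: 21 (x, y, z) tuples. The receiver is assumed to route on the
    leading label and store the remaining 64 values (63 coordinates + palm width).
    """
    fields = [label]
    for x, y, z in keypoints:
        fields += [f"{x:.6f}", f"{y:.6f}", f"{z:.6f}"]
    fields.append(f"{palm_width:.6f}")
    sock.sendto(",".join(fields).encode("utf-8"), UDP_ADDR)
```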
(5) Constructing virtual hand models
The bone positions are drawn using the Unity engine, and bone rotation angles and relative displacements are added at the key points.
(6) Storing character string data S
1) The hand coordinate data is transmitted to the Unity engine.
2) Normalized character string data S' is obtained according to the normalization formula given as an image in the original publication.
3) Storing normalized string data S'
The normalized string data S' is stored into the left-hand string h_l and the right-hand string h_r according to the formulas given as images in the original, where N is a null value.
(7) Virtual hand movement
1) The left-hand string h_l and the right-hand string h_r are split at commas into a left-hand string array F_l' and a right-hand string array F_r', each containing 64 substrings. Using the float.Parse function, F_l' and F_r' are converted into a left-hand floating point array F_l and a right-hand floating point array F_r, respectively. When the left-hand string h_l is N, nothing is assigned to the left-hand floating point array F_l; when the right-hand string h_r is N, nothing is assigned to the right-hand floating point array F_r.
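In the patent this parsing runs inside the Unity engine using float.Parse; for continuity with the other sketches the equivalent logic is shown here in Python. The 64-substring layout (21 × 3 coordinates plus one palm-width value) and the null marker N follow the description above, while the helper name is hypothetical.

```python
def parse_hand_string(hand_string):
    """Split a per-hand string on commas and convert it to a float array F.

    Returns None when the string is the null value 'N' (no hand tracked).
    """
    if hand_string == "N":
        return None
    parts = hand_string.split(",")      # 64 substrings: 21*3 coordinates + palm width
    if len(parts) != 64:
        raise ValueError(f"expected 64 substrings, got {len(parts)}")
    return [float(p) for p in parts]    # counterpart of Unity's float.Parse
```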
2) The virtual hand as a whole moves with the camera as a sub-object of the camera.
3) Determining relative coordinates of key points
The relative coordinates of left-hand key point i with respect to left-hand key point 0 and the left-hand palm width d_lH are determined according to the formulas given as images in the original. In these formulas, element 3i of the left-hand floating point array F_l receives the x-axis coordinate of left-hand key point i, element 0 receives the x-axis coordinate of left-hand key point 0, element 3i+1 receives the y-axis coordinate of left-hand key point i, element 1 receives the y-axis coordinate of left-hand key point 0, element 3i+2 receives the z-axis coordinate of left-hand key point i, and element 2 receives the z-axis coordinate of left-hand key point 0.
The relative coordinates of right-hand key point i with respect to right-hand key point 0 and the right-hand palm width d_rH are determined in the same way from the right-hand floating point array F_r: element 3i receives the x-axis coordinate of right-hand key point i, element 0 the x-axis coordinate of right-hand key point 0, element 3i+1 the y-axis coordinate of right-hand key point i, element 1 the y-axis coordinate of right-hand key point 0, element 3i+2 the z-axis coordinate of right-hand key point i, and element 2 the z-axis coordinate of right-hand key point 0.
4) Determining virtual hand scaling ratio
The distance d_lR between left-hand key point 0 and left-hand key point 1 and the distance d_rR between right-hand key point 0 and right-hand key point 1 are determined according to the formulas given as images in the original. From these, the virtual left-hand scaling ratio M_l and the virtual right-hand scaling ratio M_r are determined, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model and d_rM is the distance between key point 0 and key point 1 of the right hand of the virtual hand model.
5) Determining relative hand movement position
The coordinates C_lx, C_ly, C_lz of the virtual left-hand position relative to the camera are determined according to the formulas given as images in the original, where D_lx, D_ly, D_lz are the coordinates of the initial position D_l of the virtual left hand relative to the camera. The coordinates C_rx, C_ry, C_rz of the virtual right-hand position relative to the camera are determined in the same way, where D_rx, D_ry, D_rz are the coordinates of the initial position D_r of the virtual right hand relative to the camera.
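The formulas for steps 3) to 5) are published only as images, so the sketch below is a guess at one plausible implementation under explicit assumptions: relative coordinates as simple differences from key point 0, the scaling ratio as the real key point 0-1 distance divided by the model distance d_M, and the hand position as the initial position plus the key point 0 offset. It is meant to make the data flow concrete, not to reproduce the patented formulas.

```python
import math

def virtual_hand_update(F, d_M, D):
    """Sketch of steps 3)-5) for one hand, under the stated assumptions.

    F   : 64-element float array (21*3 coordinates followed by the palm width)
    d_M : distance between key points 0 and 1 of the virtual hand model
    D   : (x, y, z) initial position of the virtual hand relative to the camera
    """
    # 3) coordinates of key point i relative to key point 0 (assumed: differences)
    rel = [(F[3*i] - F[0], F[3*i+1] - F[1], F[3*i+2] - F[2]) for i in range(21)]
    d_H = F[63]                        # palm width (assumed to be the last value)

    # 4) scaling ratio (assumed: real 0-1 distance over the model's 0-1 distance)
    d_R = math.dist(rel[0], rel[1])
    M = d_R / d_M

    # 5) position relative to the camera (assumed: initial position plus wrist offset)
    C = (D[0] + F[0], D[1] + F[1], D[2] + F[2])
    return rel, d_H, M, C
```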
(8) Gesture interactions that trigger augmented reality
1) When the distance between the current position of the virtual hand of the augmented reality and the virtual object is less than or equal to 0.3 and the distances between the virtual hand key points 8, 12, 16 and 20 and the virtual hand key point 0 are all less than or equal to 0.05, triggering gesture interaction for picking up the virtual object.
2) When the distances between the virtual hand key points 8, 12, 16 and 20 of the augmented reality and the virtual hand key point 0 are all more than 0.05, triggering gesture interaction for putting down the virtual object.
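A small sketch of the two trigger conditions, reusing the thresholds 0.3 and 0.05 from the text; treating them as Euclidean distances in Unity world units, and the function names, are assumptions.

```python
import math

PICK_DIST = 0.3    # max distance between the virtual hand and the virtual object
CURL_DIST = 0.05   # max fingertip-to-wrist distance for a closed (grasping) hand
FINGERTIPS = (8, 12, 16, 20)

def should_pick_up(hand_pos, object_pos, keypoints):
    """keypoints[i] is the virtual-hand position of key point i."""
    near_object = math.dist(hand_pos, object_pos) <= PICK_DIST
    hand_closed = all(math.dist(keypoints[t], keypoints[0]) <= CURL_DIST
                      for t in FINGERTIPS)
    return near_object and hand_closed

def should_put_down(keypoints):
    return all(math.dist(keypoints[t], keypoints[0]) > CURL_DIST for t in FINGERTIPS)
```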
(9) Performance evaluation
The number of image frames processed per second (FPS) by the monocular camera augmented reality gesture interaction method based on key point detection is evaluated as:
FPS = 1 / (t_e - t_s)
where t_e is the time at which processing of a frame finishes and t_s is the time at which processing of that frame starts. When the FPS of processed images exceeds 30 frames/second, the method has high real-time performance.
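A minimal sketch of the per-frame evaluation FPS = 1/(t_e - t_s); the use of time.perf_counter and the loop structure are assumptions.

```python
import time

def measure_fps(process_frame, frames):
    """Report FPS = 1 / (t_e - t_s) for each processed frame."""
    fps_values = []
    for frame in frames:
        t_s = time.perf_counter()   # time processing of the frame starts
        process_frame(frame)
        t_e = time.perf_counter()   # time processing of the frame finishes
        fps_values.append(1.0 / (t_e - t_s))
    return fps_values
```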
In the step (2) of detecting hand keypoint coordinates of the present invention, the confidence threshold of the hand detection is preferably 0.68.
In step (5) of constructing the virtual hand model, the bone rotation angles added at the key points are as follows: the rotation angle of wrist key point 0 is 0-180 degrees, the rotation angles of the remaining key points are 0-90 degrees, and the radius of a finger on the camera is within 12d_lM, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model.
The invention adopts a hand confidence threshold in the hand key point detection network step, and the hand detection model is restarted only when the hand confidence falls below the threshold. Hand detection followed by key point detection is needed only for the first frame of the image: because the video is continuous, the hand region can be predicted from the hand key point coordinates of the previous frame and sent to the key point detection model for the next frame. The hand detection model therefore does not have to be run repeatedly; each frame only needs to infer the hand region from the key points of the previous frame and send it to the key point detection model of the next frame.
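The sketch below illustrates the decision just described — run the full hand detection model only when the confidence drops below θ, and otherwise crop the hand region predicted from the previous frame's key points; the padding factor and the placeholder callables are assumptions.

```python
def track_or_detect(frame, prev_keypoints, prev_confidence, theta,
                    detect_hands, detect_keypoints_in_roi, pad=0.25):
    """Reuse the previous frame's key points to predict the hand region.

    detect_hands / detect_keypoints_in_roi are placeholders for the hand
    detection model and the key point detection network.
    """
    if prev_keypoints is None or prev_confidence < theta:
        rois = detect_hands(frame)                 # (re)start the hand detection model
    else:
        xs = [p[0] for p in prev_keypoints]
        ys = [p[1] for p in prev_keypoints]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        rois = [(min(xs) - pad * w, min(ys) - pad * h,
                 max(xs) + pad * w, max(ys) + pad * h)]   # predicted hand region
    return [detect_keypoints_in_roi(frame, roi) for roi in rois]
```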
Compared with the prior art, the invention has the following advantages:
According to the invention, a hand key point detection network is adopted and applied to monocular-camera gesture tracking and recognition in augmented reality, so the hand features of the current frame are effectively screened; at the same time the tracked hand key point information is standardized, making the data more regular and easier to use for driving the motion of the virtual hand, with high real-time performance. The method has the advantages of high real-time performance, strong immersion and low equipment cost, and can perform gesture tracking and recognition against different backgrounds.
Drawings
Fig. 1 is a flow chart of embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the hand key point numbering.
Fig. 3 is a frame rate plot of the hand keypoint tracking detection of the method of example 1.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the following embodiments.
Example 1
The monocular camera augmented reality gesture interaction method based on key point detection of the embodiment comprises the following steps (see fig. 1):
(1) Acquiring an input image
Taking a real-time image shot by a monocular camera as an input image, wherein the width w of the input image is at least 400 pixels, and the height h is at least 200 pixels.
(2) Detecting hand keypoint coordinates
The hand key point detection network is used for obtaining hand key point coordinates of an input image according to the following method:
1) Set a hand confidence threshold θ, θ ∈ (0, 1); in this embodiment θ is 0.68, and the hand detection model is started when the hand confidence is lower than θ.
2) Carry out hand detection on the input image; from left to right the hands are numbered 0 and 1, n is 0 or 1, the n-th hand is denoted H_n, and H_n carries a left-hand label l or a right-hand label r.
3) And positioning the detected hand, and cutting out the hand area.
4) Input each hand region into the hand key point detection network to detect the hand key points, and output the coordinates of the 21 key points of hand H_n, as follows:
In FIG. 2, key point 0 is the wrist; key points 1 to 4 are the four joint points from the root of the thumb to the fingertip; key points 5 to 8 are the four joint points from the root of the index finger to the fingertip; key points 9 to 12 are the four joint points from the root of the middle finger to the fingertip; key points 13 to 16 are the four joint points from the root of the ring finger to the fingertip; and key points 17 to 20 are the four joint points from the root of the little finger to the fingertip.
Hand key point j of H_n and its x-, y- and z-axis coordinates are denoted by symbols that appear as formula images in the original publication. The x-axis and y-axis coordinates are coordinates relative to key point 0 on the input image; the z-axis coordinate of key point 0 is a minimal value and is taken as the z-axis origin of the n-th hand H_n. If the z-axis value is negative, the wrist root is farther from the camera; otherwise the wrist is nearer to the camera.
(3) Determining character string data
The key point coordinates of hand H_n are converted according to the conversion formulas given as images in the original publication, yielding converted x-, y- and z-axis coordinates for each key point. The distance between key point 5 and key point 17 of the n-th hand H_n is calculated as the palm width L_n. The character string data S is then assembled from the converted coordinates of the key points k and the palm width, where k ∈ {0, 1, ..., 20} (both formulas are likewise given as images in the original).
The invention adopts the step of determining the character string data so that the data are better suited to transmission and use over the User Datagram Protocol, which improves the data transmission speed and processing efficiency.
(4) Transmitting character string data S
The character string data S is transmitted to the Unity engine through the User Datagram Protocol.
(5) Constructing virtual hand models
The bone positions are drawn using the Unity engine, and bone rotation angles and relative displacements are added at the key points. The rotation angle of wrist key point 0 ranges from 0 to 180 degrees and the rotation angles of the other key points range from 0 to 90 degrees; in this embodiment the rotation angle of wrist key point 0 is 90 degrees, the rotation angles of the other key points are 45 degrees, and the radius of a finger on the camera is within 12d_lM, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model.
(6) Storing character string data S
1) The hand coordinate data is transmitted to the Unity engine.
2) Normalized character string data S' is obtained according to the normalization formula given as an image in the original publication.
3) Storing normalized string data S'
The normalized string data S' is stored into the left-hand string h_l and the right-hand string h_r according to the formulas given as images in the original, where N is a null value.
The invention adopts the step of storing the character string data, screens out the meaningful information obtained from the user datagram protocol transmission, and lays a foundation for driving the motion of the virtual hand.
(7) Virtual hand movement
1) The left-hand string h_l and the right-hand string h_r are split at commas into a left-hand string array F_l' and a right-hand string array F_r', each containing 64 substrings. Using the float.Parse function, F_l' and F_r' are converted into a left-hand floating point array F_l and a right-hand floating point array F_r, respectively. When the left-hand string h_l is N, nothing is assigned to the left-hand floating point array F_l; when the right-hand string h_r is N, nothing is assigned to the right-hand floating point array F_r.
2) The virtual hand as a whole moves with the camera as a sub-object of the camera.
3) Determining relative coordinates of key points
The relative coordinates of left-hand key point i with respect to left-hand key point 0 and the left-hand palm width d_lH are determined according to the formulas given as images in the original. In these formulas, element 3i of the left-hand floating point array F_l receives the x-axis coordinate of left-hand key point i, element 0 receives the x-axis coordinate of left-hand key point 0, element 3i+1 receives the y-axis coordinate of left-hand key point i, element 1 receives the y-axis coordinate of left-hand key point 0, element 3i+2 receives the z-axis coordinate of left-hand key point i, and element 2 receives the z-axis coordinate of left-hand key point 0.
The relative coordinates of right-hand key point i with respect to right-hand key point 0 and the right-hand palm width d_rH are determined in the same way from the right-hand floating point array F_r: element 3i receives the x-axis coordinate of right-hand key point i, element 0 the x-axis coordinate of right-hand key point 0, element 3i+1 the y-axis coordinate of right-hand key point i, element 1 the y-axis coordinate of right-hand key point 0, element 3i+2 the z-axis coordinate of right-hand key point i, and element 2 the z-axis coordinate of right-hand key point 0.
4) Determining virtual hand scaling ratio
The distance d_lR between left-hand key point 0 and left-hand key point 1 and the distance d_rR between right-hand key point 0 and right-hand key point 1 are determined according to the formulas given as images in the original. From these, the virtual left-hand scaling ratio M_l and the virtual right-hand scaling ratio M_r are determined, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model and d_rM is the distance between key point 0 and key point 1 of the right hand of the virtual hand model.
5) Determining relative hand movement position
The coordinates C_lx, C_ly, C_lz of the virtual left-hand position relative to the camera are determined according to the formulas given as images in the original, where D_lx, D_ly, D_lz are the coordinates of the initial position D_l of the virtual left hand relative to the camera. The coordinates C_rx, C_ry, C_rz of the virtual right-hand position relative to the camera are determined in the same way, where D_rx, D_ry, D_rz are the coordinates of the initial position D_r of the virtual right hand relative to the camera.
Because the invention drives the virtual hand to move with data, the motion of the virtual hand is finer and more natural and better matches real hand motion, which improves the experimenter's interactive immersion in the augmented reality environment.
(8) Gesture interactions that trigger augmented reality
1) When the distance between the current position of the virtual hand of the augmented reality and the virtual object is less than or equal to 0.3 and the distances between the virtual hand key points 8, 12, 16 and 20 and the virtual hand key point 0 are all less than or equal to 0.05, triggering gesture interaction for picking up the virtual object.
2) When the distances between the virtual hand key points 8, 12, 16 and 20 of the augmented reality and the virtual hand key point 0 are all more than 0.05, triggering gesture interaction for putting down the virtual object.
Because the invention captures hand data for interaction with a key point detection network, compared with traditional controller and data glove interaction, which is expensive and requires special hardware, the invention has lower equipment cost and stronger universality.
(9) Performance evaluation
The number of image frames processed per second (FPS) by the monocular camera augmented reality gesture interaction method based on key point detection is evaluated as:
FPS = 1 / (t_e - t_s)
where t_e is the time at which processing of a frame finishes and t_s is the time at which processing of that frame starts. When the FPS of processed images exceeds 30 frames/second, the method has high real-time performance.
And (3) completing the monocular camera augmented reality gesture interaction method based on key point detection.
Example 2
The monocular camera augmented reality gesture interaction method based on key point detection in the embodiment comprises the following steps:
(1) Acquiring an input image
This step is the same as in example 1.
(2) Detecting hand keypoint coordinates
The hand key point detection network is used for obtaining hand key point coordinates of an input image according to the following method:
1) Set a hand confidence threshold θ, θ ∈ (0, 1); in this embodiment θ is 0.01, and the hand detection model is started when the hand confidence is lower than θ.
The other steps of this step are the same as those of example 1.
(3) Determining character string data
This step is the same as in example 1.
(4) Transmitting character string data S
This step is the same as in example 1.
(5) Constructing virtual hand models
The bone positions are drawn using the Unity engine, and bone rotation angles and relative displacements are added at the key points. The rotation angle of wrist key point 0 ranges from 0 to 180 degrees and the rotation angles of the other key points range from 0 to 90 degrees; in this embodiment the rotation angle of wrist key point 0 is 0 degrees, the rotation angles of the other key points are 0 degrees, and the radius of a finger on the camera is within 12d_lM, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model.
The other steps were the same as in example 1. And (3) completing the monocular camera augmented reality gesture interaction method based on key point detection.
Example 3
The monocular camera augmented reality gesture interaction method based on key point detection in the embodiment comprises the following steps:
(1) Acquiring an input image
This step is the same as in example 1.
(2) Detecting hand keypoint coordinates
The hand key point detection network is used for obtaining hand key point coordinates of an input image according to the following method:
1) Set a hand confidence threshold θ, θ ∈ (0, 1); in this embodiment θ is 0.98, and the hand detection model is started when the hand confidence is lower than θ.
The other steps of this step are the same as those of example 1.
(3) Determining character string data
This step is the same as in example 1.
(4) Transmitting character string data S
This step is the same as in example 1.
(5) Constructing virtual hand models
The bone positions are drawn using the Unity engine, and bone rotation angles and relative displacements are added at the key points. The rotation angle of wrist key point 0 ranges from 0 to 180 degrees and the rotation angles of the other key points range from 0 to 90 degrees; in this embodiment the rotation angle of wrist key point 0 is 180 degrees, the rotation angles of the other key points are 90 degrees, and the radius of a finger on the camera is within 12d_lM, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model.
The other steps were the same as in example 1. And (3) completing the monocular camera augmented reality gesture interaction method based on key point detection.
In order to verify the beneficial effects of the invention, the inventor adopts the method of the embodiment 1 of the invention to carry out simulation experiments, and the experimental conditions are as follows:
1. simulation conditions
Software environment: pyCharm 2019.3.1x64.
The hardware conditions are as follows: 1 personal computer, 1 Nvidia3060Ti video card, 1 1080P camera, 1 personal mobile phone.
Computer configuration:
1) A processor: intel (R) Core (TM) i7-10700 CPU@2.90GHz 2.90GHz.
2) Memory: 32.0GB.
The software platform is as follows: python3.8.
Other third-party libraries: opencv-python 4.6.0, mediapipe 0.9.1, socket.
2. Simulation content and results
Experiments were performed under the above simulation conditions, and the experimental results are shown in fig. 3.
In fig. 3, the abscissa represents the running time of the invention and the ordinate represents the number of frames the invention can process per second, i.e. the FPS. As can be seen from fig. 3, the number of video image frames processed per second fluctuates around 30, indicating that the video image processing speed is high and real-time performance is achieved.

Claims (3)

1. The monocular camera augmented reality gesture interaction method based on key point detection is characterized by comprising the following steps of:
(1) Acquiring an input image
Taking a real-time image shot by a monocular camera as an input image, wherein the width w of the input image is at least 400 pixels, and the height h is at least 200 pixels;
(2) Detecting hand keypoint coordinates
The hand key point detection network is used for obtaining hand key point coordinates of an input image according to the following method:
1) Setting a hand confidence threshold θ, θ ∈ (0, 1), and starting the hand detection model when the hand confidence is lower than θ;
2) Carrying out hand detection on the input image; from left to right the hands are numbered 0 and 1, n is 0 or 1, the n-th hand is denoted H_n, and H_n includes a left-hand label l or a right-hand label r;
3) Positioning the detected hand, and cutting out a hand area;
4) Inputting each hand region into the hand key point detection network to detect the hand key points, and outputting the coordinates of the 21 key points of hand H_n, as follows:
key point 0 is the wrist; key points 1 to 4 are the four joint points from the root of the thumb to the fingertip; key points 5 to 8 are the four joint points from the root of the index finger to the fingertip; key points 9 to 12 are the four joint points from the root of the middle finger to the fingertip; key points 13 to 16 are the four joint points from the root of the ring finger to the fingertip; and key points 17 to 20 are the four joint points from the root of the little finger to the fingertip;
hand key point j of H_n and its x-, y- and z-axis coordinates are denoted by symbols that appear as formula images in the original publication; the x-axis and y-axis coordinates are coordinates relative to key point 0 on the input image, and the z-axis coordinate of key point 0 is a minimal value taken as the z-axis origin of the n-th hand H_n; if the z-axis value is negative, the wrist root is farther from the camera, otherwise the wrist is nearer to the camera;
(3) Determining character string data
The key point coordinates of hand H_n are converted according to the conversion formulas given as images in the original publication, yielding converted x-, y- and z-axis coordinates for each key point; the distance between key point 5 and key point 17 of the n-th hand H_n is calculated as the palm width L_n; the character string data S is then assembled from the converted coordinates of the key points k and the palm width, wherein k ∈ {0, 1, ..., 20} (both formulas are likewise given as images in the original);
(4) Transmitting character string data S
Transmitting the character string data S to the Unity engine through a user datagram protocol;
(5) Constructing virtual hand models
Drawing skeleton positions by using a Unity engine, and adding skeleton rotation angles and relative displacement on key points;
(6) Storing character string data S
1) The hand coordinate data is transmitted to a Unity engine;
2) Normalized character string data S' is obtained according to the normalization formula given as an image in the original publication;
3) Storing normalized string data S'
the normalized string data S' is stored into the left-hand string h_l and the right-hand string h_r according to the formulas given as images in the original, wherein N is a null value;
(7) Virtual hand movement
1) The left-hand string h_l and the right-hand string h_r are split at commas into a left-hand string array F_l' and a right-hand string array F_r', each containing 64 substrings; using the float.Parse function, F_l' and F_r' are converted into a left-hand floating point array F_l and a right-hand floating point array F_r, respectively; when the left-hand string h_l is N, nothing is assigned to the left-hand floating point array F_l, and when the right-hand string h_r is N, nothing is assigned to the right-hand floating point array F_r;
2) The virtual hand as a whole is used as a sub-object of the camera to move along with the camera;
3) Determining relative coordinates of key points
the relative coordinates of left-hand key point i with respect to left-hand key point 0 and the left-hand palm width d_lH are determined according to the formulas given as images in the original; in these formulas, element 3i of the left-hand floating point array F_l receives the x-axis coordinate of left-hand key point i, element 0 receives the x-axis coordinate of left-hand key point 0, element 3i+1 receives the y-axis coordinate of left-hand key point i, element 1 receives the y-axis coordinate of left-hand key point 0, element 3i+2 receives the z-axis coordinate of left-hand key point i, and element 2 receives the z-axis coordinate of left-hand key point 0;
the relative coordinates of right-hand key point i with respect to right-hand key point 0 and the right-hand palm width d_rH are determined in the same way from the right-hand floating point array F_r: element 3i receives the x-axis coordinate of right-hand key point i, element 0 the x-axis coordinate of right-hand key point 0, element 3i+1 the y-axis coordinate of right-hand key point i, element 1 the y-axis coordinate of right-hand key point 0, element 3i+2 the z-axis coordinate of right-hand key point i, and element 2 the z-axis coordinate of right-hand key point 0;
4) Determining virtual hand scaling ratio
The distance d_lR between left-hand key point 0 and left-hand key point 1 and the distance d_rR between right-hand key point 0 and right-hand key point 1 are determined according to the formulas given as images in the original; the virtual left-hand scaling ratio M_l and the virtual right-hand scaling ratio M_r are then determined from these distances, wherein d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model and d_rM is the distance between key point 0 and key point 1 of the right hand of the virtual hand model;
5) Determining relative hand movement position
The coordinates C_lx, C_ly, C_lz of the virtual left-hand position relative to the camera are determined according to the formulas given as images in the original, wherein D_lx, D_ly, D_lz are the coordinates of the initial position D_l of the virtual left hand relative to the camera; the coordinates C_rx, C_ry, C_rz of the virtual right-hand position relative to the camera are determined in the same way, wherein D_rx, D_ry, D_rz are the coordinates of the initial position D_r of the virtual right hand relative to the camera;
(8) Gesture interactions that trigger augmented reality
1) Triggering gesture interaction for picking up the virtual object when the distance between the current position of the virtual hand of the augmented reality and the virtual object is less than or equal to 0.3 and the distances between the virtual hand key points 8, 12, 16 and 20 and the virtual hand key point 0 are all less than or equal to 0.05;
2) Triggering gesture interaction for putting down a virtual object when the distances between the virtual hand key points 8, 12, 16 and 20 of the augmented reality and the virtual hand key point 0 are all more than 0.05;
(9) Performance evaluation
The number of image frames processed per second (FPS) by the monocular camera augmented reality gesture interaction method based on key point detection is evaluated as FPS = 1 / (t_e - t_s), wherein t_e is the time at which processing of a frame finishes and t_s is the time at which processing of that frame starts; when the FPS of processed images exceeds 30 frames/second, the method has high real-time performance.
2. The monocular camera augmented reality gesture interaction method based on keypoint detection of claim 1, wherein the method is characterized by: in the step (2) of detecting the hand key point coordinates, the confidence threshold of the hand detection is 0.68.
3. The monocular camera augmented reality gesture interaction method based on key point detection of claim 1, characterized in that: in step (5) of constructing the virtual hand model, the bone rotation angles added at the key points are as follows: the rotation angle of wrist key point 0 is 0-180 degrees, the rotation angles of the remaining key points are 0-90 degrees, and the radius of a finger on the camera is within 12d_lM, where d_lM is the distance between key point 0 and key point 1 of the left hand of the virtual hand model.
CN202310309434.9A 2023-03-28 2023-03-28 Monocular camera augmented reality gesture interaction method based on key point detection Pending CN116403280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310309434.9A CN116403280A (en) 2023-03-28 2023-03-28 Monocular camera augmented reality gesture interaction method based on key point detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310309434.9A CN116403280A (en) 2023-03-28 2023-03-28 Monocular camera augmented reality gesture interaction method based on key point detection

Publications (1)

Publication Number Publication Date
CN116403280A true CN116403280A (en) 2023-07-07

Family

ID=87015360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310309434.9A Pending CN116403280A (en) 2023-03-28 2023-03-28 Monocular camera augmented reality gesture interaction method based on key point detection

Country Status (1)

Country Link
CN (1) CN116403280A (en)

Similar Documents

Publication Publication Date Title
WO2021103648A1 (en) Hand key point detection method, gesture recognition method, and related devices
CN107563494B (en) First-view-angle fingertip detection method based on convolutional neural network and heat map
Zhou et al. A novel finger and hand pose estimation technique for real-time hand gesture recognition
CN112926423B (en) Pinch gesture detection and recognition method, device and system
CN109597485B (en) Gesture interaction system based on double-fingered-area features and working method thereof
CN108509026B (en) Remote maintenance support system and method based on enhanced interaction mode
CN111680594A (en) Augmented reality interaction method based on gesture recognition
US20100103092A1 (en) Video-based handwritten character input apparatus and method thereof
CN107832736B (en) Real-time human body action recognition method and real-time human body action recognition device
CN109145802B (en) Kinect-based multi-person gesture man-machine interaction method and device
Linqin et al. Dynamic hand gesture recognition using RGB-D data for natural human-computer interaction
CN111444488A (en) Identity authentication method based on dynamic gesture
CN107292295B (en) Gesture segmentation method and device
JP2003256850A (en) Movement recognizing device and image processor and its program
WO2024078088A1 (en) Interaction processing method and apparatus
Liu et al. Ultrasonic positioning and IMU data fusion for pen-based 3D hand gesture recognition
CN116403280A (en) Monocular camera augmented reality gesture interaction method based on key point detection
Dhamanskar et al. Human computer interaction using hand gestures and voice
CN117011929A (en) Head posture estimation method, device, equipment and storage medium
Jiang et al. A brief analysis of gesture recognition in VR
CN114077307A (en) Simulation system and method with input interface
Wang Real-time hand-tracking as a user input device
Dutta et al. A Hand Gesture-operated System for Rehabilitation using an End-to-End Detection Framework
CN113703564A (en) Man-machine interaction equipment and system based on facial features
CN113705280B (en) Human-computer interaction method and device based on facial features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination