CN114860072A - Gesture recognition interaction equipment based on monocular camera - Google Patents
- Publication number
- CN114860072A (application CN202210404958.1A)
- Authority
- CN
- China
- Prior art keywords
- palm
- detector
- detecting
- gesture recognition
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The application discloses a gesture recognition interaction device based on a monocular camera, comprising a camera, a detector, and a feature extractor used for background scene awareness. The interaction method comprises the following steps: 1) establishing a palm three-dimensional coordinate image library through a three-dimensional palm feature point model; 2) displaying an intelligent interactive picture; 3) operating on the image with the detector and calculating the palm position; 4) controlling playback of the intelligent interactive picture according to the user's palm movement data. With this technical scheme, in daily use the feature extractor operates on the whole image and calculates hand positions, and the three-dimensional hand feature point model operates on the regions extracted by the detector and predicts an approximate three-dimensional surface through regression. Because the palm is accurately cropped first, the need for common data augmentation is greatly reduced; resource consumption is low, so the required hardware threshold is low. The scheme is highly compatible, can run across systems, and is universal enough to suit cameras and operating platforms of various models.
Description
Technical Field
The invention relates to gesture recognition interaction equipment based on a monocular camera.
Background
In online PC live-broadcast teaching, interactive equipment is lacking: the digitizing tablet, the closest substitute for interactive equipment, has not been popularized, and within the constraints of the live-broadcast setup (PC teaching) a teacher must rely on traditional input devices such as a mouse. The resulting experience is poor and can even degrade teaching quality, so improvement is needed.
Disclosure of Invention
The present invention is directed to solving one of the technical problems of the prior art.
The application provides a gesture recognition interaction device based on a monocular camera, comprising:
a camera for acquiring an image;
a feature extractor for locating the rough range in the image where the palm lies;
a detector for accurately cropping the image within the rough range to obtain a palm image and/or for operating the intelligent interactive picture;
and a marker for identifying palm joint feature points in the palm image to locate the palm.
Meanwhile, an interaction method based on the gesture recognition interaction equipment is disclosed, and comprises the following steps:
1) establishing a palm three-dimensional coordinate image library through a three-dimensional palm characteristic point model;
2) displaying an intelligent interactive picture;
3) operating on the image with the detector and calculating the palm position;
4) controlling playback of the intelligent interactive picture according to the user's palm movement data.
The palm three-dimensional coordinate image library is established as follows:
1) an ML pipeline is formed by two deep neural network models working together in real time;
2) the detector operates on the intelligent interactive picture and calculates the hand position;
3) the three-dimensional palm feature point model operates on those positions and predicts an approximate three-dimensional surface through regression;
4) the coordinates of the 21 three-dimensional finger joints in the detected hand region are predicted directly through regression, the model learning a consistent internal hand posture representation;
5) the real images located by the 21 three-dimensional finger joint coordinates are manually annotated to establish the palm three-dimensional coordinate image library.
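The 21 finger joint coordinates follow MediaPipe's fixed landmark indexing (wrist first, then four joints per finger, base to tip). A minimal sketch of packing one detection into a named record for such an image library — the helper name and record layout are illustrative, not from the patent:

```python
# Names of the 21 hand feature points in MediaPipe's indexing order:
# 0 = wrist, then four joints per finger from base to tip.
LANDMARK_NAMES = [
    "WRIST",
    "THUMB_CMC", "THUMB_MCP", "THUMB_IP", "THUMB_TIP",
    "INDEX_MCP", "INDEX_PIP", "INDEX_DIP", "INDEX_TIP",
    "MIDDLE_MCP", "MIDDLE_PIP", "MIDDLE_DIP", "MIDDLE_TIP",
    "RING_MCP", "RING_PIP", "RING_DIP", "RING_TIP",
    "PINKY_MCP", "PINKY_PIP", "PINKY_DIP", "PINKY_TIP",
]

def to_record(landmarks):
    """Turn a list of 21 (x, y, z) tuples into a named coordinate record."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 landmarks, got %d" % len(landmarks))
    return {name: xyz for name, xyz in zip(LANDMARK_NAMES, landmarks)}
```

Annotated records of this shape, one per real image, would form the entries of the coordinate image library.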
Further comprising:
training the detector, modeling the palm with a square bounding box and ignoring other aspect ratios;
wherein the focal loss is minimized during training,
allowing the deep neural network to devote most of its computational capacity to the accuracy of coordinate prediction.
The marker is generated from the palm feature points identified in the previous frame; the detector is invoked to relocate the palm only when the feature point model can no longer identify the presence of the palm.
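This detect-only-when-tracking-fails strategy can be sketched as a small loop; `detect` and `track` stand in for the palm detector and the feature point model, and the function names are illustrative:

```python
def track_palm(frames, detect, track):
    """Run the cheap feature-point tracker on every frame; fall back to the
    expensive full-image palm detector only when tracking loses the hand."""
    region = None
    results = []
    for frame in frames:
        if region is None:
            region = detect(frame)          # full-image palm detection
        landmarks = track(frame, region) if region is not None else None
        if landmarks is None:
            region = None                   # hand lost: re-detect next frame
        results.append(landmarks)
    return results
```

The detector thus runs only on the first frame and after tracking failures, which is what keeps the per-frame cost low.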
The intelligent interactive picture is operated by the following steps:
1) initializing the environment, and starting the MediaPipe and OpenCV frameworks;
2) initializing the Hand Detector class;
3) initializing the main program of the drawing board, and calling the Hand Detector class;
4) judging whether the camera has opened successfully, and starting the gesture detection algorithm once it has;
5) selection mode: two or more fingers raised, or an open palm, selects a drawing tool;
6) drawing mode: only the index finger raised; the currently selected drawing tool is determined and its color is overlaid.
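The mode switch in steps 5) and 6) can be sketched as a pure function on a fingers-up pattern; the mode names are illustrative, and the thumb-to-pinky ordering follows common MediaPipe-based tutorials rather than this document:

```python
def select_mode(fingers_up):
    """Map a fingers-up pattern [thumb, index, middle, ring, pinky]
    (1 = raised) to an interaction mode."""
    if fingers_up == [0, 1, 0, 0, 0]:
        return "draw"       # drawing mode: only the index finger is raised
    if sum(fingers_up) >= 2:
        return "select"     # tool selection: two or more fingers / open palm
    return "idle"           # no actionable gesture
```

The main loop would call this once per frame and either paint at the index fingertip or hit-test the toolbar, depending on the returned mode.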
The Hand Detector class includes the following members:
1) detecting, segmenting and marking the palm and joints;
2) detecting the specific positions of the palm and each fingertip;
3) detecting finger gestures;
4) detecting finger positions;
5) detecting and calculating the distance between fingers;
wherein detecting a finger gesture includes determining which fingers are raised.
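Members 2), 4) and 5) can be sketched as a skeleton class operating on precomputed pixel landmarks; the method names are illustrative and the MediaPipe/OpenCV plumbing is omitted:

```python
import math

class HandDetector:
    """Skeleton of the detector-class members described above, operating on
    precomputed (x, y) pixel landmarks rather than live video."""

    def __init__(self, landmarks):
        self.landmarks = landmarks          # dict: landmark index -> (x, y)

    def finger_position(self, index):
        """Members 2/4: the specific position of a fingertip or joint."""
        return self.landmarks[index]

    def find_distance(self, i, j):
        """Member 5: Euclidean distance between two landmarks, e.g. between
        the thumb tip (4) and index fingertip (8)."""
        (x1, y1), (x2, y2) = self.landmarks[i], self.landmarks[j]
        return math.hypot(x2 - x1, y2 - y1)
```

A pinch-style interaction, for instance, would threshold `find_distance(4, 8)` against a few tens of pixels.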
The initialization steps of the main program of the drawing board are as follows:
1) importing an external library and a module;
2) adjusting the size of the brush;
3) opening a Library folder, opening and calling a designed interactive picture, and initializing a UI (user interface);
4) turning on a camera by using an OpenCV framework;
5) and setting the size of the software window.
The gesture detection process comprises the following steps:
1) calling the Hand Detector class;
2) creating an img variable and storing the picture captured by the camera in real time into it;
3) detecting the fingers, and judging and detecting the positions of the index finger and the middle finger;
4) detecting the palm posture;
wherein detecting the palm posture includes determining which fingers are raised.
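Determining which fingers are raised is usually done geometrically from the 21 landmarks: a fingertip lying above its PIP joint counts as raised, while the thumb is compared horizontally. A sketch under the assumption that landmarks are (x, y) pixel tuples with y growing downward; the exact thumb rule is a common heuristic, not specified by the patent:

```python
def fingers_up(lm, handedness="right"):
    """Decide which fingers are raised from 21 (x, y) landmarks in image
    coordinates (y grows downward). Returns [thumb, index, middle, ring,
    pinky] with 1 = raised."""
    up = []
    # Thumb heuristic: tip (4) further out along x than the IP joint (3);
    # the comparison direction flips with handedness and hand orientation.
    if handedness == "right":
        up.append(1 if lm[4][0] < lm[3][0] else 0)
    else:
        up.append(1 if lm[4][0] > lm[3][0] else 0)
    # Other fingers: tip y above PIP y (the PIP sits two indices below
    # the tip in MediaPipe's ordering: 8->6, 12->10, 16->14, 20->18).
    for tip in (8, 12, 16, 20):
        up.append(1 if lm[tip][1] < lm[tip - 2][1] else 0)
    return up
```

This is the primitive that both the palm-posture check and the selection/drawing mode switch build on.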
The invention has the following beneficial effects:
1. low resource consumption and a low hardware threshold;
2. strong compatibility, with cross-system operation;
3. strong universality, suiting cameras and operating platforms of various models;
4. a wide range of applications.
Drawings
Fig. 1 is a schematic diagram illustrating a principle of a gesture recognition interaction device and an interaction method based on a monocular camera in an embodiment of the present application;
FIG. 2 is a schematic diagram of positions of 21 three-dimensional palm feature points in the embodiment of the present application;
FIG. 3 is a flowchart of an interaction method of the gesture recognition interaction device in the embodiment of the present application;
FIG. 4 is a flowchart of a step of creating a palm three-dimensional coordinate image library in the embodiment of the present application;
FIG. 5 is a flowchart illustrating the operation steps of an intelligent interactive screen in the embodiment of the present application;
FIG. 6 is a flowchart of the initialization procedure of the main program of the drawing board in the embodiment of the present application;
FIG. 7 is a flowchart illustrating a main program initialization procedure of a drawing board according to an embodiment of the present application;
FIG. 8 is a flow chart of gesture detection in an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims are used to distinguish similar elements and do not necessarily describe a particular sequence or chronological order. It should be appreciated that data so termed may be interchanged where appropriate, so that embodiments of the application can operate in sequences other than those illustrated or described herein. The terms "first", "second", etc. generally denote a class of objects without limiting their number; for example, a first object may be one or more. In the specification and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The device provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings, through specific embodiments and their application scenarios.
As shown in fig. 1 to 7, an embodiment of the present application provides a gesture recognition interaction device based on a monocular camera, including: a camera for acquiring an image; a feature extractor for locating the rough range in the image where the palm lies; a detector for accurately cropping the image within the rough range to obtain a palm image and/or for operating the intelligent interactive picture; and a marker for identifying palm joint feature points in the palm image to locate the palm.
Further, the method comprises the following steps:
1) establishing a palm three-dimensional coordinate image library through a three-dimensional palm characteristic point model;
2) displaying an intelligent interactive picture;
3) operating on the image with the detector and calculating the palm position;
4) controlling playback of the intelligent interactive picture according to the user's palm movement data.
Further, the palm three-dimensional coordinate image library is established as follows:
1) an ML pipeline is formed by two deep neural network models working together in real time;
2) the detector operates on the intelligent interactive picture and calculates the hand position;
3) the three-dimensional palm feature point model operates on those positions and predicts an approximate three-dimensional surface through regression;
4) the coordinates of the 21 three-dimensional finger joints in the detected hand region are predicted directly through regression, the model learning a consistent internal hand posture representation;
5) the real images located by the 21 three-dimensional finger joint coordinates are manually annotated to establish the palm three-dimensional coordinate image library.
Further, the method also comprises: training the detector, modeling the palm with a square bounding box, ignoring other aspect ratios, and minimizing the focal loss during training.
Preferably, the deep neural network is allowed to devote most of its computational capacity to the accuracy of coordinate prediction.
Preferably, the marker is generated from the palm feature points identified in the previous frame, and the detector is invoked to relocate the palm only when the feature point model can no longer identify the presence of the palm.
Further, the intelligent interactive picture is operated by the following steps:
1) initializing the environment, and starting the MediaPipe and OpenCV frameworks;
2) initializing the Hand Detector class;
3) initializing the main program of the drawing board, and calling the Hand Detector class;
4) judging whether the camera has opened successfully, and starting the gesture detection algorithm once it has;
5) selection mode: two or more fingers raised, or an open palm, selects a drawing tool;
6) drawing mode: only the index finger raised; the currently selected drawing tool is determined and its color is overlaid.
Further, the Hand Detector class includes the following members:
1) detecting, segmenting and marking the palm and joints;
2) detecting the specific positions of the palm and each fingertip;
3) detecting finger gestures;
4) detecting finger positions;
5) detecting and calculating the distance between fingers;
wherein detecting a finger gesture includes determining which fingers are raised.
Further, the initialization steps of the main program of the drawing board are as follows:
1) importing an external library and a module;
2) adjusting the size of the brush;
3) opening a Library folder, opening and calling a designed interactive picture, and initializing a UI (user interface);
4) turning on a camera by using an OpenCV framework;
5) and setting the size of the software window.
Further, the gesture detection process is as follows:
1) calling the Hand Detector class;
2) creating an img variable and storing the picture captured by the camera in real time into it;
3) detecting the fingers, and judging and detecting the positions of the index finger and the middle finger;
4) detecting the palm posture;
wherein detecting the palm posture includes determining which fingers are raised.
In the embodiment of the application, MediaPipe Hands is a high-fidelity hand and finger tracking solution. It uses machine learning (ML) to infer 21 three-dimensional feature points of a hand. Whereas the current state-of-the-art methods mainly rely on powerful desktop environments for inference, MediaPipe completes the identification in real time on a mobile phone and can even be extended to the simultaneous identification of multiple hands.
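MediaPipe Hands reports each of the 21 feature points in normalized image coordinates (x and y in [0, 1], relative to frame width and height). A small sketch of the usual conversion to pixel positions, assuming landmarks arrive as (x, y, z) tuples:

```python
def to_pixels(norm_landmarks, width, height):
    """Convert normalized [0, 1] landmark coordinates, as returned by
    MediaPipe Hands, to integer pixel positions for a width x height frame."""
    return [(int(x * width), int(y * height)) for x, y, _z in norm_landmarks]
```

All the drawing-board logic below (toolbar hit tests, fingertip painting) operates on these pixel positions, not on the normalized values.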
MediaPipe ML Pipeline (machine learning pipeline):
MediaPipe Hands uses an ML pipeline composed of multiple models working together. A palm detection model traverses the entire image and returns an oriented palm bounding box.
The detector adopts a palm detector.
The hand feature point model operates on the cropped image region defined by the palm detector and returns high-fidelity 3D hand key points. The strategy is as follows:
The ML pipeline consists of two deep neural network models working together in real time. A detector operates on the entire image and calculates hand positions, and a three-dimensional hand feature point model operates on those positions and predicts an approximate three-dimensional surface through regression. Accurately cropping the palm first greatly reduces the need for common data augmentation (such as affine transformations composed of rotation, translation, and scale changes), which allows the network to devote most of its computational capacity to the accuracy of coordinate prediction. Furthermore, in our pipeline the markers can also be generated from the hand feature points identified in the previous frame, invoking the palm detector to relocate the palm only when the feature point model can no longer identify the presence of a hand.
The pipeline is implemented as a MediaPipe graph that uses the hand landmark tracking subgraph from the hand feature points module and renders with a dedicated hand renderer subgraph. Inside the hand landmark tracking subgraph, a hand feature point subgraph from the same module and a palm detection subgraph from the palm detection module are used. First, we train a palm detector rather than a hand detector, because estimating the bounding box of a rigid object like a palm or fist is much simpler than detecting a hand with articulated fingers. In addition, since palms are small objects, non-maximum suppression works well even when the two hands are not open and overlap (as in a handshake). Furthermore, the palm can be modeled with square bounding boxes (called anchors in ML), ignoring other aspect ratios, which reduces the number of anchors by a factor of 3-5. Second, a feature extractor with an encoder-decoder structure is used for greater scene context awareness, even for small objects (similar to the RetinaNet approach). Finally, we minimize the focal loss during training to cope with the large number of anchors caused by the high scale variance.
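The loss minimized during training here is RetinaNet's focal loss for class-imbalanced detection, which down-weights the many easy background anchors. A minimal sketch in its binary form — the α = 0.25 and γ = 2 defaults below are the common RetinaNet values, not taken from this document:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    p is the predicted probability of the positive class, y the 0/1 label.
    Easy examples (p_t near 1) are strongly down-weighted by the
    (1 - p_t)^gamma factor, so abundant background anchors cannot
    dominate the gradient."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α = 1 the expression reduces to the ordinary cross-entropy loss mentioned in the baseline comparison below.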
With the above techniques, we achieve an average precision of 95.7% in palm detection; with ordinary cross-entropy loss and no decoder, the baseline is only 86.22%.
After palm detection over the whole image, the subsequent hand feature point model performs accurate key-point localization of the 21 three-dimensional finger joint coordinates in the detected hand region through regression, i.e., direct coordinate prediction. The model learns a consistent internal hand posture representation and is robust even to partially visible or occluded hands.
To obtain real data, we manually annotated about 3 million real images with the 21 three-dimensional coordinates shown in the figure below (where an image depth map existed, we took the Z value from it at each corresponding coordinate). To better cover the possible hand gestures and provide additional supervision on hand geometry, we also rendered a high-quality synthetic hand model against various backgrounds and mapped it onto the corresponding three-dimensional coordinates.
the gesture recognition and interaction algorithm can be applied to various scenes to realize different functions, such as gesture control of a mouse.
When detecting the finger position and judging which drawing tool is selected:
1. if the finger lies between x coordinates 160 and 360, the brush color is set to magenta/purple-red (255, 0, 255);
2. if the finger lies between x coordinates 500 and 700, the brush color is set to blue (255, 125, 0);
3. if the finger lies between x coordinates 815 and 965, the brush color is set to green (0, 255, 0);
4. if the finger lies between x coordinates 1050 and 1250, the brush is set to the eraser, i.e. colorless (0, 0, 0).
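The four coordinate bands above can be sketched as a lookup table; the colours are OpenCV-style (B, G, R) tuples, and the fallback behaviour (keep the current colour when the finger is between bands) is an assumption, not stated in the patent:

```python
# Toolbar bands along the x axis and the brush colour (B, G, R) each
# selects; the ranges mirror the embodiment above.
TOOL_BANDS = [
    ((160, 360),   (255, 0, 255)),   # magenta/purple-red brush
    ((500, 700),   (255, 125, 0)),   # blue brush
    ((815, 965),   (0, 255, 0)),     # green brush
    ((1050, 1250), (0, 0, 0)),       # eraser ("colorless")
]

def select_tool(x, current=(255, 0, 255)):
    """Return the brush colour for a fingertip at horizontal position x,
    keeping the current colour when x falls between toolbar bands."""
    for (lo, hi), colour in TOOL_BANDS:
        if lo <= x <= hi:
            return colour
    return current
```

In the selection mode described earlier, the drawing board would call this with the index fingertip's pixel x coordinate on every frame.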
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it. It should further be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed; depending on the functions involved, the functions may be performed substantially simultaneously or in reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the scope of the invention as defined by the appended claims.
Claims (10)
1. A gesture recognition interaction device based on a monocular camera, characterized by comprising:
a camera for acquiring an image;
a feature extractor for locating a rough range in the image where the palm is located;
the detector is used for accurately cropping the image within the rough range to obtain a palm image and/or for operating the intelligent interactive picture;
and the marker is used for identifying palm joint feature points in the palm image to locate the palm.
2. An interaction method based on the gesture recognition interaction device of claim 1, characterized by comprising the following steps:
1) establishing a palm three-dimensional coordinate image library through a three-dimensional palm characteristic point model;
2) displaying an intelligent interactive picture;
3) operating on the image with the detector and calculating the palm position;
4) controlling playback of the intelligent interactive picture according to the user's palm movement data.
3. The interaction method of the gesture recognition interaction device according to claim 2, wherein the step of establishing the palm three-dimensional coordinate image library comprises the steps of:
1) an ML pipeline is formed by two deep neural network models working together in real time;
2) operating the intelligent interactive picture through the detector and calculating the hand position;
3) operating on those positions through the three-dimensional palm feature point model and predicting an approximate three-dimensional surface through regression;
4) directly predicting the coordinates of the 21 three-dimensional finger joints in the detected hand region through regression, the model learning a consistent internal hand posture representation;
5) manually annotating the real images located by the 21 three-dimensional finger joint coordinates to establish the palm three-dimensional coordinate image library.
4. The interaction method of the gesture recognition interaction device according to claim 3, further comprising:
training the detector, modeling by using a square bounding box, and neglecting other length-width ratios;
wherein the focal loss is minimized during training.
5. The interaction method of the gesture recognition interaction device according to claim 2, characterized in that:
the deep neural network is allowed to devote most of its computational capacity to the accuracy of coordinate prediction.
6. The interaction method of the gesture recognition interaction device according to claim 2, characterized in that:
the marker is generated from the palm feature points identified in the previous frame, and the detector is invoked to reposition the palm when the feature point model is no longer able to identify the presence of the palm.
7. The interaction method of the gesture recognition interaction device according to any one of claims 1 to 6, wherein the intelligent interactive picture is operated by the following steps:
1) initializing the environment, and starting the MediaPipe and OpenCV frameworks;
2) initializing the Hand Detector class;
3) initializing the main program of the drawing board, and calling the Hand Detector class;
4) judging whether the camera has opened successfully, and starting the gesture detection algorithm once it has;
5) selection mode: two or more fingers raised, or an open palm, selects a drawing tool;
6) drawing mode: only the index finger raised; the currently selected drawing tool is determined and its color is overlaid.
8. The interaction method of the gesture recognition interaction device according to claim 7, wherein the Hand Detector class comprises the following members:
1) detecting, segmenting and marking the palm and joints;
2) detecting the specific positions of the palm and each fingertip;
3) detecting finger gestures;
4) detecting finger positions;
5) detecting and calculating the distance between fingers;
wherein detecting a finger gesture includes determining which fingers are raised.
9. The interaction method of the gesture recognition interaction device according to claim 7, wherein the initialization step of the main program of the drawing board is as follows:
1) importing an external library and a module;
2) adjusting the size of the brush;
3) opening a Library folder, opening and calling designed interactive pictures, and initializing a UI (user interface);
4) turning on a camera by using an OpenCV framework;
5) and setting the size of the software window.
10. The interaction method of the gesture recognition interaction device according to claim 7, wherein the gesture detection process is as follows:
1) calling the Hand Detector;
2) creating an img variable, and storing a picture captured by a camera in real time into the img variable;
3) detecting the fingers, and judging and detecting the positions of the index finger and the middle finger;
4) detecting the palm posture;
wherein detecting the palm posture includes determining which fingers are raised.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210404958.1A CN114860072A (en) | 2022-04-18 | 2022-04-18 | Gesture recognition interaction equipment based on monocular camera |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114860072A true CN114860072A (en) | 2022-08-05 |
Family
ID=82631443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210404958.1A Pending CN114860072A (en) | 2022-04-18 | 2022-04-18 | Gesture recognition interaction equipment based on monocular camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114860072A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258655A (en) * | 2022-12-13 | 2023-06-13 | 合肥工业大学 | Real-time image enhancement method and system based on gesture interaction |
CN116258655B (en) * | 2022-12-13 | 2024-03-12 | 合肥工业大学 | Real-time image enhancement method and system based on gesture interaction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6079832B2 (en) | Human computer interaction system, hand-to-hand pointing point positioning method, and finger gesture determination method | |
CN108776773B (en) | Three-dimensional gesture recognition method and interaction system based on depth image | |
TWI654539B (en) | Virtual reality interaction method, device and system | |
US11308655B2 (en) | Image synthesis method and apparatus | |
Beyeler | OpenCV with Python blueprints | |
CN111178170B (en) | Gesture recognition method and electronic equipment | |
CN109240494B (en) | Control method, computer-readable storage medium and control system for electronic display panel | |
CN109839827B (en) | Gesture recognition intelligent household control system based on full-space position information | |
CN105046249B (en) | A kind of man-machine interaction method | |
CN113506377A (en) | Teaching training method based on virtual roaming technology | |
CN114327064A (en) | Plotting method, system, equipment and storage medium based on gesture control | |
CN114860072A (en) | Gesture recognition interaction equipment based on monocular camera | |
Inoue et al. | Tracking Robustness and Green View Index Estimation of Augmented and Diminished Reality for Environmental Design | |
CN114373050A (en) | Chemistry experiment teaching system and method based on HoloLens | |
CN104239119A (en) | Method and system for realizing electric power training simulation upon kinect | |
US20160342831A1 (en) | Apparatus and method for neck and shoulder landmark detection | |
JP6174277B1 (en) | Image processing system, image processing apparatus, image processing method, and program | |
CN110442242B (en) | Intelligent mirror system based on binocular space gesture interaction and control method | |
CN111383343B (en) | Home decoration design-oriented augmented reality image rendering coloring method based on generation countermeasure network technology | |
KR20190027287A (en) | The method of mimesis for keyboard and mouse function using finger movement and mouth shape | |
CN104699243B (en) | A kind of incorporeity virtual mouse method based on monocular vision | |
CN115061577A (en) | Hand projection interaction method, system and storage medium | |
Reza et al. | Real time mouse cursor control based on bare finger movement using webcam to improve HCI | |
Patil et al. | Gesture Recognition for Media Interaction: A Streamlit Implementation with OpenCV and MediaPipe | |
Vysocky et al. | Generating synthetic depth image dataset for industrial applications of hand localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||