CN112199994B - Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time


Info

Publication number
CN112199994B
Authority
CN
China
Prior art keywords
hand
neural network
video
gesture
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010916742.4A
Other languages
Chinese (zh)
Other versions
CN112199994A (en)
Inventor
薛聪
吴彦坤
向继
查达仁
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010916742.4A priority Critical patent/CN112199994B/en
Publication of CN112199994A publication Critical patent/CN112199994A/en
Application granted granted Critical
Publication of CN112199994B publication Critical patent/CN112199994B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time. The method comprises the following steps: training a convolutional neural network that takes video frames as input and predicts the 3D hand pose, 6D object pose, hand action and object category of each frame; training an interaction recurrent neural network that takes the 3D hand poses and 6D object poses detected by the convolutional neural network as input and uses the temporal information in the video to obtain the interaction category of the hand and the object; and feeding the video to be detected into the trained convolutional neural network and interaction recurrent neural network to obtain the 3D hand pose, 6D object pose, hand action and object category of each frame, as well as the interaction of the hand and the object in the video. The invention requires neither depth images nor ground-truth object pose coordinates as input, improves the accuracy of hand action recognition, greatly broadens the recognition range, and is more convenient to apply in daily life.

Description

Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time
Technical Field
The invention relates to hand-object interaction recognition, and aims to detect the motion trajectories and interaction categories of a hand and an unknown object in RGB video in real time.
Background
In recent years, with the development of computer vision and virtual reality technology and the growing demand for smart homes, human-centered action recognition and behavior understanding have become research hotspots in computer vision. In behavior understanding, recognizing hand-object interaction is of particular importance; it includes recognizing both the hand action category and the object category, and the resulting semantic information allows us to better understand user intent and predict the next action. Meanwhile, real-time hand shape detection and motion tracking have always been core components of sign language recognition and gesture control systems, and play an important role in some augmented reality experiences.
Currently, hand recognition methods can be broadly divided into non-contact, vision-based approaches and contact, sensor-based approaches. Sensor-based methods require the operator to wear devices such as data gloves, and their parameters must be recalibrated whenever the operator changes; although they can directly obtain the three-dimensional pose of the hand in space in real time, their inconvenience makes them difficult to popularize in practice. In contrast, vision-based gesture recognition allows operators to interact with machines in a more natural manner. In future human-machine interaction and monitoring it is therefore highly desirable to rely on vision systems to let machines perceive human intention, for which vision-based action recognition and behavior understanding are particularly important.
However, although it is critical to a semantically meaningful interpretation of visual scenes, the problem of jointly understanding hands and objects has received little attention. Much current research treats humans and objects in isolation. Conventional hand motion recognition methods segment the hand from a first-person view to recognize its pose (G. Rogez, J. Supancic, and D. Ramanan. First-Person Pose Recognition Using Egocentric Workspaces. In CVPR, 2015) or estimate hand pose in RGB images from first- and third-person views (U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand Pose Estimation via Latent 2.5D Heatmap Regression. In ECCV, 2018), but they do not jointly model the objects the hand interacts with. Some methods use object interaction as an additional constraint when estimating hand pose (C. Choi, S. H. Yoon, C. Chen, and K. Ramani. Robust Hand Pose Estimation during the Interaction with an Unknown Object. In ICCV, 2017), which improves the accuracy of hand pose estimation but relies on depth images as input. Other methods reconstruct the poses of the hand and the object (Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid. Learning Joint Reconstruction of Hands and Manipulated Objects. In CVPR, 2019), but learn no semantic information. There are also methods that can recognize hand-object interactions (B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In CVPR, 2019), but they can only recognize objects known from the dataset and lack generalization.
Although existing methods can analyze the semantic information of hand-object interaction, the recognizable object categories are limited by hand datasets: the object categories involved in existing hand-motion datasets are very limited, and annotating new data costs considerable manpower and material resources. It is therefore of practical value to propose a method that can recognize the interaction of a hand with an unknown object from RGB video.
Disclosure of Invention
The invention aims to provide a method and a device capable of detecting, in real time and from RGB video, the spatial poses of a 3D hand and an unknown object and their interaction category.
The inventors found that many prior-art methods estimate the pose of the hand or of the object in isolation; such gesture recognition methods can only recognize the shape of the hand and some simple gestures (for example a thumbs-up) and cannot recognize the interaction with an object. Some methods that reconstruct hand and object poses recover the object contours well but do not analyze the semantic information of the scene. Some action recognition methods rely on depth images as input, without which their accuracy is very low. Some object pose estimation methods do not compute the 6D pose directly, but first generate a 2D bounding box and then compute the 6D pose with a PnP algorithm, losing part of the information. The invention solves these problems: it is an end-to-end method that completes several tasks at once and, given RGB video, simultaneously predicts 3D hand and object poses, actions and category estimates, without requiring depth images or ground-truth object pose coordinates as input, thereby improving the accuracy of hand action recognition.
As shown in Fig. 1, the invention mainly comprises a convolutional neural network (CNN) and an interaction recurrent neural network (interaction RNN). The convolutional neural network recognizes, for each frame, the 3D hand pose, the 6D object pose (the 3D position and 3D orientation of the object), the hand action (pour, open, close, etc.) and the object category (milk, detergent, juice box, etc.); the recurrent neural network extracts and integrates the temporal features in the video to obtain the interaction category of the hand and the object over the whole video (pouring milk, opening a juice box, etc.). The method of the invention is divided into a training stage and a use stage. The training stage consists of two steps: first, the convolutional neural network is trained with video frames as input to predict the 3D hand pose, 6D object pose, hand action and object category of each frame, and its parameters are fixed after training; then the recurrent neural network is trained, taking the detected hand and object keypoint coordinates as input and outputting the estimated interaction category of the hand and the object in the whole video. In the use stage, the complete model takes a sequence of video frames as input and, after passing through the two neural networks, outputs the 3D hand and object pose predictions of each frame and the object and action category estimates of the whole sequence.
The technical scheme adopted by the invention mainly comprises the following steps (unless otherwise specified, the steps are executed by the software and hardware of a computer or electronic device):
(1) Model building and training. When the model is used for the first time, the user first trains the convolutional neural network and the interaction recurrent neural network; the trained model can then be used for action recognition.
(2) Video input. The model detects in real time the 3D hand pose (i.e., 3D hand position), the 6D object pose, the hand action and the object category estimate of each frame in the video, as well as the interaction of the hand and the object over the whole video.
Further, in the detailed design of the model, as shown in Fig. 2, 21 keypoints are specified for each of the hand and the object (the hand keypoints are the four joints of each finger plus the wrist node; the object keypoints are the eight vertices, the center point, and the midpoints of the 12 edges of the object bounding box), and the pose (i.e., the 3D hand pose and the 6D object pose) is determined by predicting the coordinates of these keypoints.
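For illustration, the 21 object keypoints described above can be derived from an axis-aligned 3D bounding box as in the following sketch; the helper name and the corner ordering are assumptions made here for clarity, not details specified by the invention.

```python
import itertools
import numpy as np

def object_keypoints(box_min, box_max):
    """21 object keypoints of an axis-aligned 3D bounding box:
    its 8 vertices, its center point, and the midpoints of its 12 edges."""
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)
    # 8 vertices: all combinations of (min, max) along x, y and z.
    corners = np.array(list(itertools.product([box_min[0], box_max[0]],
                                              [box_min[1], box_max[1]],
                                              [box_min[2], box_max[2]])))
    center = (box_min + box_max) / 2.0
    # 12 edges: corner pairs that differ in exactly one coordinate.
    edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if np.count_nonzero(corners[i] != corners[j]) == 1]
    midpoints = np.array([(corners[i] + corners[j]) / 2.0 for i, j in edges])
    return np.vstack([corners, center[None, :], midpoints])  # shape (21, 3)

# Example: a unit cube gives 8 + 1 + 12 = 21 keypoints.
print(object_keypoints([0, 0, 0], [1, 1, 1]).shape)  # (21, 3)
```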
Further, the method for predicting the coordinates of the key points and predicting the hand actions and the object categories by the convolutional neural network comprises the following steps:
as shown in fig. 3 and 4, each picture frame is divided into h×w grids, and D grids (H, W, D respectively representing height, width, and depth) are extended to depth, each grid having a size of C in units of pixels (pixels) in a plane and meters (meters) in a depth direction u ×C v Pixel×C z And (5) rice. In this grid coordinate system, the upper left corner of the grid is taken as the origin of the coordinate system, and a grid is taken as a unit.
To jointly predict the hand and object pose and category, as in Fig. 4, two vectors are stored in each cell (i.e., grid unit), one predicting the hand and one predicting the object, each of the form

v = (p_1, …, p_{N_c}, P_a, P_o, c)

where p_1, …, p_{N_c} ∈ R^3 are the coordinates of the hand or object keypoints and N_c is the number of hand or object keypoints; P_a is the action category probability vector and N_a the number of action categories; P_o is the object category probability vector and N_o the number of object categories, to which a background class is added (if the object is unknown, it is classified into the background class and then passed to the zero-shot learning classifier for identification); and c is the confidence. The cells containing the wrist node and the object center point are used to predict the action and object categories. The two vectors stored in each cell are produced by the convolutional neural network. The method first determines the coordinates (u, v, z) of the cell containing a keypoint and then predicts the offsets Δu, Δv, Δz of the keypoint relative to the upper-left corner of that cell in the three dimensions, from which the coordinates of the keypoint in the grid coordinate system are obtained:
u' = u + g(Δu)
v' = v + g(Δv)
z' = z + g(Δz)
Since the cells containing the wrist node and the object center point are responsible for predicting the action and object categories, g(x) is used to constrain the offsets of these two points to [0, 1], thereby keeping them within the cell responsible for predicting the action and object categories. The expression of g(x) is:
g(x) = sigmoid(x), if the keypoint is the wrist node or the object center point;
g(x) = x, otherwise.
where g(x) is the function constraining the offsets of the wrist node and the object center point, x is the offset Δu, Δv or Δz of the keypoint relative to the upper-left corner of its cell in the corresponding dimension, and sigmoid is an activation function with value range (0, 1) that maps a real number into the interval (0, 1); it ensures that the wrist node and the object center point, after the offset is applied, still lie within the cell that predicts the action and object categories.
In addition, from the three-dimensional position in the grid coordinate system and the camera intrinsic matrix K, the three-dimensional coordinates of the keypoint in the camera coordinate system can be computed as:
(x, y, z)^T = (z'·C_z) · K^{-1} · (u'·C_u, v'·C_v, 1)^T
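A minimal numerical sketch of this decoding step, assuming a pinhole intrinsic matrix K, cell sizes (C_u, C_v, C_z) and the sigmoid constraint g applied only to the wrist node and object center point; the function names and the numeric values in the example are illustrative, not taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_keypoint(cell_uvz, offset, K, cell_size, is_root=False):
    """Map a predicted keypoint from cell index + offset to grid coordinates,
    then to 3D camera coordinates.

    cell_uvz : (u, v, z) integer index of the cell containing the keypoint
    offset   : raw predicted offsets (du, dv, dz) w.r.t. the cell's upper-left corner
    K        : 3x3 camera intrinsic matrix
    cell_size: (C_u, C_v, C_z) -- pixels, pixels, meters per cell
    is_root  : True for the wrist node or object center point, whose offsets
               are squashed to [0, 1] by the sigmoid (the g(x) above)
    """
    offset = np.asarray(offset, dtype=float)
    offset = sigmoid(offset) if is_root else offset
    u, v, z = np.asarray(cell_uvz, dtype=float) + offset     # grid coordinates
    C_u, C_v, C_z = cell_size
    pixel_hom = np.array([u * C_u, v * C_v, 1.0])            # homogeneous pixel coords
    return (z * C_z) * np.linalg.inv(K) @ pixel_hom          # (x, y, z) in camera frame

# Example with an assumed intrinsic matrix for a 416x416 input and 13x13x5 grid.
K = np.array([[600.0, 0.0, 208.0],
              [0.0, 600.0, 208.0],
              [0.0,   0.0,   1.0]])
print(decode_keypoint((6, 7, 2), (0.3, -0.1, 0.4), K, (32.0, 32.0, 0.15), is_root=True))
```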
further, a higher confidence is set for the grid in which the adversary or object is present, and a confidence function is set as:
c(x) = exp(α(1 − D_T(x)/d_th)), if D_T(x) < d_th;
c(x) = 0, otherwise.
where D_T(x) is the Euclidean distance between the predicted point and the ground-truth point, α is a hyperparameter and d_th is a set threshold. The closer the prediction is to the ground truth, the smaller D_T(x) and the larger c(x), i.e., the higher the confidence; conversely, the confidence is lower. The total confidence is:
C = (1/N_c) · Σ_{i=1…N_c} c(x_i)

where x_i denotes the i-th predicted keypoint and D_T(x_i) its Euclidean distance to the corresponding ground-truth keypoint.
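The confidence computation can be sketched as follows, under the assumption that the exponential form above is used per keypoint and that the total confidence is the average over the N_c keypoints; the numeric values of α and d_th are placeholders.

```python
import numpy as np

def keypoint_confidence(dist, alpha=2.0, d_th=75.0):
    """Confidence of a keypoint given its Euclidean distance to the ground truth.
    alpha and d_th are the hyperparameter and threshold named in the text;
    the values used here are placeholders."""
    dist = np.asarray(dist, dtype=float)
    conf = np.exp(alpha * (1.0 - dist / d_th))
    return np.where(dist < d_th, conf, 0.0)   # zero beyond the distance threshold

def total_confidence(pred_kpts, gt_kpts, alpha=2.0, d_th=75.0):
    """Average confidence over all N_c keypoints of a hand or object."""
    dists = np.linalg.norm(np.asarray(pred_kpts) - np.asarray(gt_kpts), axis=-1)
    return float(keypoint_confidence(dists, alpha, d_th).mean())
```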
further, when the probability of the background class of the object is maximum, it is determined that the object belongs to an unknown class. As shown in fig. 6, unknown object categories are identified by introducing semantic information using a zero-order learning classifier module. The zero-order learning classifier module multiplies the probabilities of other prediction classes except the background by vectors in a semantic space respectively, adds the obtained semantic vectors to be used as final prediction semantic vectors, calculates the class and the similarity thereof in the semantic space, and considers that the unknown object belongs to the class with the highest similarity when the highest similarity value is not lower than a threshold value.
Further, the total loss function of the convolutional neural network of the present invention is:
L = λ_pose·L_pose + λ_conf·L_conf + λ_actcls·L_act + λ_objcls·L_obj

where λ_pose is the loss weight for predicting the hand and object positions, λ_conf the loss weight for the confidence, λ_actcls the loss weight for predicting the action category, λ_objcls the loss weight for predicting the object category, and G_t the regular fixed grid into which the picture is divided, over which the loss terms are accumulated. The quantities entering these terms are the predicted hand coordinates ĥ, the predicted object coordinates ô, the predicted confidence of the hand ĉ_h, the predicted confidence of the object ĉ_o, the predicted object category probabilities p̂_o and the predicted action category probabilities p̂_a.
Further, since the convolutional network only learns per-frame information and does not exploit the temporal information in the video, the invention adds an interaction recurrent neural network part, as shown in Fig. 5. The hand and object keypoint coordinate vectors x_1, …, x_N computed by the convolutional network are each fed into a multi-layer perceptron that models their relationship, and the results are then used as the input of the recurrent neural network. The model of the recurrent network is:

ŷ = f_φ(g_θ(x_1), g_θ(x_2), …, g_θ(x_N))
where f_φ is the recurrent neural network model and g_θ is the multi-layer perceptron model; the final output is the interaction category of the hand and the object in the video.
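A minimal PyTorch-style sketch of this interaction network is given below. The LSTM choice for f_φ, the layer widths and the class count are assumptions; the two recurrent hidden layers and the 2 × 21 × 3 keypoint input follow the embodiment described later.

```python
import torch
import torch.nn as nn

class InteractionRNN(nn.Module):
    """g_theta (MLP over per-frame hand + object keypoints) followed by f_phi (LSTM)."""
    def __init__(self, in_dim=2 * 21 * 3, hidden=256, num_classes=45):
        super().__init__()
        self.g_theta = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.f_phi = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, frames, 2*21*3 keypoint coords)
        b, t, d = x.shape
        feats = self.g_theta(x.reshape(b * t, d)).reshape(b, t, -1)
        out, _ = self.f_phi(feats)
        return self.classifier(out[:, -1])   # interaction-class logits for the video

# Example: 8 videos of 16 frames, 2 x 21 keypoints x 3 coordinates per frame.
logits = InteractionRNN()(torch.randn(8, 16, 2 * 21 * 3))
print(logits.shape)  # torch.Size([8, 45])
```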
Based on the same inventive concept, the invention also provides a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time using the above method, comprising:
the model training module is used for taking the video frames as input and training a convolutional neural network, and the convolutional neural network predicts the 3D hand gesture, the 6D object gesture, the hand actions and the object types of each frame of image; the 3D hand gesture and the 6D object gesture detected by the convolutional neural network are used as input to train an interactive cyclic neural network, and the cyclic neural network utilizes time sequence information in the video to obtain the interactive category of the hand and the object in the video;
the real-time detection module is used for inputting the video to be detected into the convolutional neural network and the interactive cyclic neural network after training is completed, and obtaining the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video of each frame of image in the video.
The method for recognizing the interaction between a 3D hand and an object in RGB video greatly improves practicability, specifically:
(1) The method does not need to rely on depth images captured by an RGB-D camera; it can detect hand-object interaction in RGB video from a sequence of input frames, which greatly widens its range of application in daily life.
(2) The method can, at real-time speed, detect the position trajectories of the hand and the object while estimating the action category and the object category, and can be applied to abnormal behavior detection.
(3) The method can detect object categories that are not in the training set, which greatly widens the recognition range, improves generalization, and makes the method more convenient to apply in daily life.
Drawings
FIG. 1 is a flow chart of the method for recognizing 3D hand-object interaction based on RGB video, where I_1 to I_N denote N video frames, CNN is the convolutional neural network, and RNN is the recurrent neural network.
FIG. 2 is a schematic diagram of the hand and object keypoints; (a) shows the 21 hand keypoints and (b) the 21 object keypoints. In (a), P, R, M, I, T denote the 5 fingers, TIP the fingertips, DIP the distal interphalangeal joints, PIP the proximal interphalangeal joints, MCP the metacarpophalangeal joints, and Wrist the wrist.
FIG. 3 is a diagram of a grid coordinate system of an input image;
FIG. 4 is a schematic diagram of the hand and object positions in the grid coordinate system and the vectors stored in their cells;
FIG. 5 is a schematic diagram of the interaction recurrent network in the model, where x_1 to x_N denote the inputs of the interaction recurrent neural network.
Fig. 6 is a schematic diagram of the zero-shot learning classifier module in the model.
Detailed Description
The process according to the invention is further described below with reference to the accompanying drawings and specific examples.
The hand motion recognition method does not need to rely on an external detection algorithm and only requires end-to-end training on single images. After a single RGB image is input and one feed-forward pass through the neural network is performed, the poses of the 3D hand and the object are estimated jointly, their interaction is modeled in three dimensions, and the object and action categories are recognized; when the object is recognized as the background category, the unknown category of the object is predicted by the zero-shot learning classifier module, which finds the closest category in the semantic space. The pose information of the hand and the object is then further combined and propagated in the time domain to infer the interaction between the hand and object trajectories and to recognize the action. The method takes a sequence of frames as input and outputs the 3D hand and object pose predictions of each frame and the object and action category estimates of the whole sequence.
Fig. 1 is a schematic flow chart of the method for recognizing 3D hand-object interaction based on RGB video. The method mainly comprises the following steps:
(1) Model training. The model training is divided into two parts: first the convolutional neural network is trained, then its parameters are fixed and the interaction recurrent neural network is trained. The convolutional neural network is a YOLO-based architecture with 31 layers in total; except for the last layer, which is the predictor, the remaining layers are convolutional or pooling layers, and the output is an H × W × D × 2 × (3·N_c + 1 + N_a + N_o) tensor, corresponding to the two vectors (hand and object) stored in each cell of the grid. In the method of this embodiment, H = W = 13 and D = 5, and the input picture size is 416 × 416. After the convolutional network is trained, the hand and object keypoint vectors obtained from each frame by the convolutional network are processed by a multi-layer perceptron with one hidden layer, then fed into a recurrent neural network with two hidden layers, and finally the interaction category estimate is output. The training dataset of this embodiment is the First-Person Hand Action (FPHA) dataset, a publicly available 3D hand-object interaction recognition dataset containing labels for 3D hand poses, 6D object poses and action categories. FPHA contains videos of 6 actors belonging to 45 different activity categories, in which the subjects perform complex actions corresponding to daily human activities. One subset of the dataset contains annotations of the 6D object pose, with corresponding mesh models for 4 objects involved in 10 different action categories. The data are divided according to the object categories interacting with the hand into a training set and a test set, where the test set contains object categories (unknown categories) not present in the training set.
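For concreteness, the size of the predictor output described above can be checked as follows; N_a = 45 and N_o = 5 (4 object categories plus a background class) are assumptions inferred from the dataset description rather than values stated explicitly.

```python
def predictor_output_size(H=13, W=13, D=5, N_c=21, N_a=45, N_o=5):
    """Number of values predicted per image: H*W*D cells, each storing two vectors
    (hand and object) of length 3*N_c + 1 + N_a + N_o
    (keypoint coordinates + confidence + action probs + object probs)."""
    per_vector = 3 * N_c + 1 + N_a + N_o
    return H * W * D * 2 * per_vector

print(predictor_output_size())  # 192660 values for the assumed class counts
```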
(2) Detection stage. A sequence of video frames is input into the model, and the 3D poses of the hand and the object in each frame and the interaction category of the hand and the object over the whole sequence are estimated. When the predicted object category is the background class, the unknown category of the object is predicted by the zero-shot learning classifier.
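A high-level sketch of this detection stage is shown below, with the per-frame network, the interaction network and the zero-shot classifier passed in as components; their interfaces (dictionary keys, callable signature) are assumptions made for illustration.

```python
import torch

def detect_interaction(frames, cnn, interaction_rnn, zero_shot_classify, background_idx=0):
    """frames: tensor of shape (num_frames, 3, 416, 416).
    Returns the per-frame predictions and the interaction category of the sequence."""
    per_frame = []
    with torch.no_grad():
        for frame in frames:
            out = cnn(frame.unsqueeze(0))     # assumed dict of keypoints, probs, confidences
            if out["object_probs"].argmax().item() == background_idx:
                # Unknown object: resolve its category in the semantic space.
                out["object_label"] = zero_shot_classify(out["object_probs"])
            per_frame.append(out)
        # Stack the hand + object keypoint vectors of all frames for the RNN.
        keypoints = torch.stack([torch.cat([o["hand_kpts"].flatten(),
                                            o["obj_kpts"].flatten()])
                                 for o in per_frame]).unsqueeze(0)
        interaction = interaction_rnn(keypoints).argmax(dim=-1).item()
    return per_frame, interaction
```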
Fig. 2 is a schematic diagram of the hand and object keypoints; 21 keypoints are used for each to simplify unified computation. The hand keypoints are the four joints of each finger plus the wrist node. The object keypoints are the eight vertices, the center point, and the midpoints of the 12 edges of its bounding box. The cells containing the wrist node and the object center point are used to predict the object and action categories.
Fig. 3 is a diagram of the grid coordinate system of an input image; the upper-left corner of the grid is taken as the origin of coordinates, one cell is one unit, and the grid coordinates are the number of cells offset from the upper-left corner.
Fig. 4 is a schematic diagram of the hand and object positions and the vectors stored in their cells in the grid coordinate system; whether an object is present in a cell is determined by whether a keypoint falls into that cell.
FIG. 5 is a schematic diagram of the interaction recurrent network in the model. Each frame is first passed through the convolutional network to obtain the hand and object keypoint vectors; their relationship is then modeled by a multi-layer perceptron, the resulting vectors learn the temporal information of the video through a recurrent neural network with two hidden layers, and the interaction category estimate is finally output.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model. When the probability of the background class is the largest, the object is judged to belong to an unknown class. The probabilities of the prediction classes other than background are multiplied by the corresponding vectors in the semantic space, the resulting semantic vectors are summed to obtain the final predicted semantic vector, the similarity with each class in the semantic space is computed, and, when the highest similarity is not below a threshold, the unknown object is regarded as belonging to the class with the highest similarity.
Even in complex real scenes, the method can effectively recognize, in real time and from RGB video, the trajectories, categories and interactions of a hand and an unknown object, capturing the semantic and temporal information of the sequence. It greatly improves action recognition efficiency and solves the problem that conventional gesture recognition cannot recover the semantic information of interaction with objects; moreover, it can recognize unknown objects interacting with the hand without requiring depth images or ground-truth object coordinates as input, providing a good basis for wide application.
The convolutional neural network, zero-shot learning classifier and recurrent neural network modules of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations of the modules are described, but as long as the combination of modules is not contradictory, it should be considered within the scope of this specification.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for detecting 3D hand interaction with an unknown object in RGB video in real time using the above method, comprising:
the model training module is used for taking the video frames as input and training a convolutional neural network, and the convolutional neural network predicts the 3D hand gesture, the 6D object gesture, the hand actions and the object types of each frame of image; the 3D hand gesture and the 6D object gesture detected by the convolutional neural network are used as input to train an interactive cyclic neural network, and the cyclic neural network utilizes time sequence information in the video to obtain the interactive category of the hand and the object in the video;
the real-time detection module is used for inputting the video to be detected into the convolutional neural network and the interactive cyclic neural network after training is completed, and obtaining the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video of each frame of image in the video.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A method for detecting the interaction of a 3D hand with an unknown object in RGB video in real time, comprising the following steps:
training a convolutional neural network with video frames as input, wherein the convolutional neural network predicts the 3D hand pose, 6D object pose, hand action and object category of each frame of image;
training an interaction recurrent neural network with the 3D hand poses and 6D object poses detected by the convolutional neural network as input, wherein the recurrent neural network uses the temporal information in the video to obtain the interaction category of the hand and the object in the video;
inputting the video to be detected into the trained convolutional neural network and interaction recurrent neural network, and obtaining the 3D hand pose, 6D object pose, hand action and object category of each frame of image in the video, as well as the interaction of the hand and the object in the video;
wherein the convolutional neural network predicts the keypoint coordinates and predicts the hand action and object category by the following steps:
dividing each picture frame into H×W grid cells and extending D cells along the depth, where each cell has a size of C_u × C_v pixels in the image plane and C_z meters in the depth direction; in this grid coordinate system, the upper-left corner of the grid is taken as the origin and one cell as the unit;
two vectors are stored in each cell
Figure FDA0004052160370000011
To predict hand and object features, respectively, wherein
Figure FDA0004052160370000012
Figure FDA0004052160370000013
Coordinates of key points of hand and object, respectively, < ->
Figure FDA0004052160370000014
N c For the number of key points of hands or objects, +.>
Figure FDA0004052160370000015
For action category probability, ++>
Figure FDA0004052160370000016
N a For action category number->
Figure FDA0004052160370000017
For object class probability ++>
Figure FDA0004052160370000018
N o The number of object categories; the grids where the wrist nodes and the center points of the objects are positioned are used for predicting actions and categories of the objects; />
Figure FDA0004052160370000019
For confidence level->
Figure FDA00040521603700000110
The two vectors stored in each cell are obtained by a convolutional neural network;
the coordinates (u, v, z) of the cell where the key point is located are determined, and then the offsets deltau, deltav, deltaz of the key point relative to the upper left corner of the cell where the key point is located in three dimensions are predicted, so that the coordinates of the key point in a grid coordinate system can be obtained:
u' = u + g(Δu)
v' = v + g(Δv)
z' = z + g(Δz)
wherein, since the cells containing the wrist node and the object center point are responsible for predicting the action and object categories, g(x) is used to constrain the offsets of these two points to between 0 and 1, thereby determining the cell responsible for predicting the action and object categories; the expression of g(x) is as follows:
g(x) = sigmoid(x), if the keypoint is the wrist node or the object center point;
g(x) = x, otherwise;
wherein g(x) represents the function constraining the offsets of the wrist node and the object center point, x represents the offset Δu, Δv or Δz of the keypoint relative to the upper-left corner of its cell in the corresponding dimension, and sigmoid represents an activation function with value range (0, 1) that maps a real number into the interval (0, 1);
the interaction recurrent neural network takes the hand and object keypoint coordinate vectors obtained by the convolutional neural network as input, models their interaction relationship with a multi-layer perceptron whose output serves as the input of the recurrent neural network, and finally outputs the interaction category of the hand and the object in the video.
2. The method of claim 1, wherein 21 keypoints are specified for each of the hand and the object, and the convolutional neural network determines the 3D hand pose and the 6D object pose by predicting the coordinates of the keypoints, wherein the hand keypoints are the four joints of each finger plus the wrist node, and the object keypoints are the eight vertices, the center point and the midpoints of the 12 edges of the object bounding box.
3. The method of claim 1, wherein a higher confidence is set for the grid cells in which a hand or an object is present, and the confidence function is set as:
c(x) = exp(α(1 − D_T(x)/d_th)), if D_T(x) < d_th;
c(x) = 0, otherwise;
wherein D_T(x) is the Euclidean distance between the predicted point and the ground-truth point, α represents a hyperparameter, and d_th represents a set threshold; the total confidence is:
C = (1/N_c) · Σ_{i=1…N_c} c(x_i)

wherein x_i denotes the i-th predicted keypoint and D_T(x_i) its Euclidean distance to the corresponding ground-truth keypoint.
4. The method according to claim 1, wherein, when the probability of the background class of the object is the largest, the object is judged to belong to an unknown class, and the unknown class of the object is identified by introducing semantic information with a zero-shot learning classifier; the zero-shot learning classifier multiplies the probabilities of the prediction classes other than background by the corresponding vectors in a semantic space, sums the resulting semantic vectors to obtain the final predicted semantic vector, computes the similarity with each class in the semantic space, and, when the highest similarity is not below a threshold, regards the unknown object as belonging to the class with the highest similarity.
5. The method of claim 1, wherein the convolutional neural network has a total loss function of:
L = λ_pose·L_pose + λ_conf·L_conf + λ_act·L_act + λ_obj·L_obj

wherein λ_pose is the loss weight for predicting the hand and object positions, λ_conf the loss weight for the confidence, λ_act the loss weight for predicting the action category, λ_obj the loss weight for predicting the object category, and G_t the regular fixed grid into which the picture is divided, over which the loss terms are accumulated; the quantities entering these terms are the predicted hand coordinates ĥ, the predicted object coordinates ô, the predicted confidence of the hand ĉ_h, the predicted confidence of the object ĉ_o, the predicted object category probabilities p̂_o and the predicted action category probabilities p̂_a.
6. An apparatus for detecting 3D hand interaction with an unknown object in RGB video in real time using the method of any one of claims 1-5, comprising:
a model training module, configured to train a convolutional neural network with video frames as input, the convolutional neural network predicting the 3D hand pose, 6D object pose, hand action and object category of each frame of image, and to train an interaction recurrent neural network with the 3D hand poses and 6D object poses detected by the convolutional neural network as input, the recurrent neural network using the temporal information in the video to obtain the interaction category of the hand and the object in the video;
a real-time detection module, configured to input the video to be detected into the trained convolutional neural network and interaction recurrent neural network, and to obtain the 3D hand pose, 6D object pose, hand action and object category of each frame of image in the video, as well as the interaction of the hand and the object in the video.
7. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-5.
CN202010916742.4A 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time Expired - Fee Related CN112199994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Publications (2)

Publication Number Publication Date
CN112199994A CN112199994A (en) 2021-01-08
CN112199994B true CN112199994B (en) 2023-05-12

Family

ID=74005883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010916742.4A Expired - Fee Related CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Country Status (1)

Country Link
CN (1) CN112199994B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504B (en) * 2021-01-20 2023-03-28 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527B (en) * 2017-04-25 2019-10-18 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
EP3467707B1 (en) * 2017-10-07 2024-03-13 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning
CN109919078B (en) * 2019-03-05 2024-08-09 腾讯科技(深圳)有限公司 Video sequence selection method, model training method and device

Also Published As

Publication number Publication date
CN112199994A (en) 2021-01-08


Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant
CF01  Termination of patent right due to non-payment of annual fee (granted publication date: 20230512)