CN112199994A - Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time - Google Patents

Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time

Info

Publication number
CN112199994A
CN112199994A (application CN202010916742.4A)
Authority
CN
China
Prior art keywords
hand
neural network
video
convolutional neural
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010916742.4A
Other languages
Chinese (zh)
Other versions
CN112199994B (en)
Inventor
Xue Cong
Wu Yankun
Xiang Ji
Zha Daren
Wang Lei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010916742.4A priority Critical patent/CN112199994B/en
Publication of CN112199994A publication Critical patent/CN112199994A/en
Application granted granted Critical
Publication of CN112199994B publication Critical patent/CN112199994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for detecting, in real time, the interaction between a 3D hand and an unknown object in an RGB video. The method comprises the following steps: training a convolutional neural network that takes video frames as input and predicts the 3D hand pose, 6D object pose, hand action and object class of each frame; training an interactive recurrent neural network that takes the 3D hand pose and 6D object pose detected by the convolutional neural network as input and uses the temporal information in the video to obtain the interaction class of the hand and the object; and inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network to obtain the 3D hand pose, 6D object pose, hand action and object class of each frame, together with the hand-object interaction in the video. Because neither depth images nor ground-truth object pose coordinates are required as input, the accuracy of hand action recognition is improved, the recognition range is greatly extended, and the method is easier to apply in daily life.

Description

Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time
Technical Field
The invention relates to hand and object interaction identification, and aims to detect the motion tracks and interaction categories of hands and unknown objects in RGB (red, green and blue) videos in real time.
Background
In recent years, with the development of computer vision and virtual reality technology and the growing demand for intelligent home applications, "people-centered" action recognition and behavior understanding have become research hotspots in the field of computer vision. In behavior understanding, the recognition of hand-object interaction is crucial; it comprises recognizing the type of hand action and the type of object, and only with the semantic information of hand-object interaction can a system better understand the user's intention and predict the next action. Meanwhile, real-time hand shape detection and motion tracking have always been core components of sign language recognition and gesture control systems, and they play an important role in some augmented reality experiences.
Currently, hand recognition methods can be broadly divided into vision-based non-contact approaches and sensor-based contact approaches. Sensor-based methods require the operator to wear equipment such as data gloves, and the parameters must be re-adjusted whenever the operator changes; although the three-dimensional pose of the gesture in space can be obtained directly in real time, the inconvenience of operation makes such methods difficult to popularize in practice. In contrast, vision-based gesture recognition allows the operator to interact with machines in a more natural manner. In future human-computer interaction and monitoring it is therefore highly desirable for machines to perceive human intent through vision systems, where vision-based action recognition and behavior understanding are particularly important.
However, although it is crucial to a semantically meaningful interpretation of visual scenes, the problem of jointly understanding humans and objects has received little attention. Much research still treats the visual understanding of humans and of objects in isolation. Conventional hand motion recognition methods segment the hand alone from a first-person view to recognize its pose (G. Rogez, J. S. Supancic, and D. Ramanan. First-Person Pose Recognition Using Egocentric Workspaces. In CVPR, 2015.), or estimate hand pose in RGB images from first- and third-person views (U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand Pose Estimation via Latent 2.5D Heatmap Regression. In ECCV, 2018.), but they do not model the objects that interact with the hand. Some methods use object interaction as an additional constraint when estimating hand pose (C. Choi, S. H. Yoon, C. Chen, and K. Ramani. Robust Hand Pose Estimation during the Interaction with an Unknown Object. In ICCV, 2017.), which improves the accuracy of hand pose estimation but relies on depth images as input. Some methods reconstruct the poses of hands and objects (Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid. Learning Joint Reconstruction of Hands and Manipulated Objects. In CVPR, 2019.), but do not learn semantic information. Some methods can identify hand-object interactions (B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In CVPR, 2019.), but they only recognize objects known in the dataset and lack generalization.
Although existing methods can analyze the semantic information of hand-object interaction, the object classes they can recognize are limited by the available hand datasets: existing hand action datasets contain very few object classes that interact with the hand, and labeling new data consumes a great deal of manpower and material resources. It is therefore of practical value to propose a method that can recognize the interaction of a hand with an unknown object from RGB video.
Disclosure of Invention
The invention aims to provide a method and a device for detecting, in real time, the spatial pose and the interaction category of a 3D hand and an unknown object from RGB video.
The inventors note that many prior-art methods solve for the pose of the hand or the object in isolation. Such gesture recognition methods can only recognize the shape of the hand and some simple gestures (such as a thumbs-up or a victory sign) and cannot recognize the interaction with an object. Some methods that reconstruct hand and object poses can restore object edges well, but do not analyze the semantic information of the scene. Some action recognition methods must rely on depth images as input, otherwise their accuracy is low. Some object pose estimation methods do not regress the 6D pose directly, but first generate a 2D box and then compute the 6D pose through a PnP algorithm, losing part of the information. The invention solves these problems with a single end-to-end model that completes multiple tasks at once: from RGB video input it simultaneously predicts the 3D hand and object poses, actions and category estimates, requires neither depth images nor ground-truth object pose coordinates as input, and improves the accuracy of hand action recognition.
As shown in FIG. 1, the invention mainly comprises a convolutional neural network (CNN) that identifies, for each frame of the image, the 3D hand pose, the 6D object pose (the 3D position and 3D orientation of the object), the hand action (pour, open, close, etc.) and the object class (milk, detergent, juice box, etc.), and an interactive recurrent neural network (interactive RNN) that extracts and integrates the temporal features in the video to obtain the interaction class of the hand and the object over the whole video (pour milk, open juice box, etc.). The method is divided into a training stage and a usage stage. In the training stage, training proceeds in two steps: first, video frames are used as input to train the convolutional neural network, which predicts the 3D hand pose, 6D object pose, hand action and object class of each frame; after training, the parameters of the convolutional neural network are fixed and the recurrent neural network is trained, taking the detected keypoint coordinates of the hand and the object as input and outputting the interaction-class estimate of the hand and the object over the whole video. In the usage stage, the complete model takes a series of video frames as input and, after passing through the two neural networks, outputs the 3D hand pose and object pose predictions of each frame and the object and action class estimates of the whole video frame sequence.
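The two-step schedule above can be summarized in code. The following is a minimal PyTorch-style sketch of the training order only; the module names (`frame_cnn`, `interaction_rnn`), the data loaders and the loss functions are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the two-step training schedule (assumed module and loader names).
import torch

def train_two_stage(frame_cnn, interaction_rnn, frame_loader, video_loader,
                    cnn_loss_fn, rnn_loss_fn, epochs_cnn=10, epochs_rnn=10):
    # Step 1: train the per-frame CNN on single images.
    opt_cnn = torch.optim.Adam(frame_cnn.parameters(), lr=1e-4)
    for _ in range(epochs_cnn):
        for image, target in frame_loader:
            pred = frame_cnn(image)            # per-cell hand/object vectors
            loss = cnn_loss_fn(pred, target)   # pose + confidence + class terms
            opt_cnn.zero_grad(); loss.backward(); opt_cnn.step()

    # Step 2: freeze the CNN, train the interaction RNN on keypoint sequences.
    for p in frame_cnn.parameters():
        p.requires_grad = False
    opt_rnn = torch.optim.Adam(interaction_rnn.parameters(), lr=1e-4)
    for _ in range(epochs_rnn):
        for frames, interaction_label in video_loader:   # frames: (T, 3, H, W)
            with torch.no_grad():
                keypoints = [frame_cnn(f.unsqueeze(0)) for f in frames]
            logits = interaction_rnn(torch.stack(keypoints))
            loss = rnn_loss_fn(logits, interaction_label)
            opt_rnn.zero_grad(); loss.backward(); opt_rnn.step()
```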
The technical scheme adopted by the invention mainly comprises the following steps (unless otherwise stated, the steps are executed by the software and hardware of a computer or other electronic equipment):
(1) Model building and training. When the model is used for the first time, the user first trains the convolutional neural network and the interactive recurrent neural network, and then uses the trained model for action recognition.
(2) Video input. When a segment of RGB video is input, the model detects in real time the 3D position (i.e., 3D pose) of the hand, the 6D pose of the object, the hand action, the object class estimate, and the interaction of the hand and the object over the whole video.
Further, in the detailed design of the model, as shown in FIG. 2, 21 keypoints are specified for the hand and for the object respectively (the keypoints of the hand are the four joints of each finger plus the wrist node; the keypoints of the object are the eight vertices, the center point and the midpoints of the 12 edges of its bounding box), and the poses of the hand and the object (i.e., the 3D hand pose and the 6D object pose) are determined by predicting the coordinates of these keypoints.
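As a concrete illustration of the 21 object keypoints, the sketch below builds them from an axis-aligned bounding box; the axis alignment and the helper name are simplifications for the example (the patent's bounding boxes follow the 6D object pose).

```python
import numpy as np
from itertools import product

def object_keypoints(box_min, box_max):
    """Build the 21 object keypoints described above: 8 box corners,
    the center point, and the midpoints of the 12 edges."""
    box_min, box_max = np.asarray(box_min, float), np.asarray(box_max, float)
    corners = np.array([[x, y, z] for x, y, z in
                        product(*zip(box_min, box_max))])        # 8 vertices
    center = (box_min + box_max) / 2.0                           # 1 center point
    # The 12 edges connect corner pairs that differ in exactly one coordinate.
    edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if np.sum(corners[i] != corners[j]) == 1]
    midpoints = np.array([(corners[i] + corners[j]) / 2.0 for i, j in edges])
    return np.vstack([corners, [center], midpoints])             # shape (21, 3)
```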
Further, the method for predicting the coordinates of the key points and predicting the hand motion and object category by the convolutional neural network is as follows:
as shown in fig. 3 and 4, each picture frame is divided into H × W meshes, and the depth is expanded by D meshes (H, W, D represents height, width, and depth, respectively), and the pixels (pixels) are provided on a plane, and meters (meters) are provided in the depth direction, that is, the size of each mesh is Cu×CvPixel by CzAnd (4) rice. In the grid coordinate system, the upper left corner of the grid is taken as the origin of the coordinate system, and one grid is taken as a unit.
In order to jointly predict the hand and object poses and categories simultaneously, two vectors are stored in each cell (i.e., grid cell), as shown in FIG. 4:

$$\mathbf{v}^h = \big[\hat{\mathbf{p}}^h,\ \hat{\mathbf{a}},\ \hat{c}^h\big], \qquad \mathbf{v}^o = \big[\hat{\mathbf{p}}^o,\ \hat{\mathbf{o}},\ \hat{c}^o\big],$$

to predict the characteristics of the hand and of the object respectively, where $\hat{\mathbf{p}}^h, \hat{\mathbf{p}}^o \in \mathbb{R}^{3N_c}$ are the coordinates of the hand and object keypoints and $N_c$ is the number of keypoints of the hand or the object; $\hat{\mathbf{a}} \in \mathbb{R}^{N_a}$ is the action-class probability and $N_a$ the number of action classes; $\hat{\mathbf{o}} \in \mathbb{R}^{N_o}$ is the object-class probability and $N_o$ the number of object classes (the invention adds a background class: if the object is unknown it is assigned to the background class, and a zero-shot learning classifier is then used to identify the unknown object); $\hat{c}^h, \hat{c}^o \in [0, 1]$ are the confidences. The cells containing the wrist node and the object center point are used to predict the action and the object class.
the two vectors stored per cell are derived from a convolutional neural network. The invention firstly determines the coordinates (u, v, z) of the cell where the key point is located, and then predicts the deviations delta u, delta v, delta z of the key point relative to the upper left corner of the cell where the key point is located in three dimensions, so as to obtain the coordinates of the key point in a grid coordinate system:
Figure BDA00026652811800000310
Figure BDA00026652811800000311
Figure BDA00026652811800000312
since the cell where the wrist node and the object center point are located is responsible for predicting the motion and the object type, g (x) is used for controlling the offset of the two points to be between [0 and 1], so that the cell responsible for predicting the motion and the object type is determined. g (x) the expression is as follows:
Figure BDA0002665281180000041
wherein g (x) represents a function for restricting the deviation of the wrist node and the center point of the object, x represents the deviation delta u, delta v, delta z of the key point relative to the upper left corner of the cell in which the key point is located in three dimensions, sigmoid represents an activation function, the value range is (0,1), a real number can be mapped to the interval of (0,1), and the function is utilized to enable the wrist node and the center point of the object to be still located in the cell in which the wrist node and the center point of the object are located after being deviated to predict the action and the object category.
In addition, given the three-dimensional position in the grid coordinate system and the camera intrinsics $K$, the three-dimensional coordinates of the keypoint in the camera coordinate system can be computed as

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \hat{z}\, C_z \, K^{-1} \begin{bmatrix} \hat{u}\, C_u \\ \hat{v}\, C_v \\ 1 \end{bmatrix}.$$
further, a higher confidence is set for the mesh where the hand or object is present, setting the confidence function as:
Figure BDA0002665281180000043
wherein D isT(x) Is the Euclidean distance between the predicted point and the real point, alpha represents the hyper parameter, dthIndicating a set threshold value, D, as the predicted value is closer to the true valueT(x) Smaller c (x) indicates greater confidence, and conversely, less confidence. The total confidence is:
Figure BDA0002665281180000044
wherein:
Figure BDA0002665281180000045
further, when the probability of the background class of the object is the maximum, it is determined that the object belongs to an unknown class. Referring to fig. 6, a zero-learning classifier module is used to identify unknown object classes by introducing semantic information. The zero-time learning classifier module multiplies the probabilities of other prediction classes except the background by the vectors of the prediction classes in the semantic space respectively, adds the obtained semantic vectors to be used as the final predicted semantic vector, then calculates the class and the similarity of the class in the semantic space, and considers that the unknown object belongs to the class with the highest similarity when the highest similarity value is not lower than a threshold value.
Further, the overall loss function of the convolutional neural network of the invention is

$$\mathcal{L} = \lambda_{pose}\,\mathcal{L}_{pose} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{actcls}\,\mathcal{L}_{actcls} + \lambda_{objcls}\,\mathcal{L}_{objcls},$$

where $\lambda_{pose}$ denotes the weight of the loss on the predicted hand and object positions, $\lambda_{conf}$ the weight of the confidence loss, $\lambda_{actcls}$ the weight of the loss on the predicted action class, $\lambda_{objcls}$ the weight of the loss on the predicted object class, and $G_t$ denotes the regular fixed grid into which the picture is divided and over whose cells the losses are accumulated; $\hat{\mathbf{p}}^h$ denotes the predicted hand coordinates, $\hat{\mathbf{p}}^o$ the predicted object coordinates, $\hat{c}^h$ the confidence of the hand prediction, $\hat{c}^o$ the confidence of the object prediction, $\hat{\mathbf{o}}$ the predicted object-class probability, and $\hat{\mathbf{a}}$ the predicted action-class probability.
Further, because the convolutional network only learns information from each individual frame and does not use the temporal information in the video, the invention adds an interactive recurrent neural network, as shown in FIG. 5. The keypoint coordinate vectors of the hand and the object computed by the convolutional network, $\hat{\mathbf{p}}^h$ and $\hat{\mathbf{p}}^o$, are fed into a multilayer perceptron that models their relationship, and its output is used as the input of the recurrent neural network:

$$y = f_\phi\!\big(g_\theta(\hat{\mathbf{p}}^h_1, \hat{\mathbf{p}}^o_1),\ \ldots,\ g_\theta(\hat{\mathbf{p}}^h_N, \hat{\mathbf{p}}^o_N)\big),$$

where $f_\phi$ is the recurrent neural network model and $g_\theta$ is the multilayer perceptron model; the final output is the interaction class of the hand and the object in the video.
Based on the same inventive concept, the invention also provides a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time using the above method, the device comprising:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
The method of the invention for identifying the interaction between a 3D hand and an object in RGB video greatly improves practicability, specifically:
(1) the method does not need to rely on the depth image shot by the RGB-D camera, and can detect the hand-object interaction in the RGB video by inputting a series of frames, so that the applicable range in life is greatly increased.
(2) The method can simultaneously detect the position tracks of the hands and the objects, the motion types and the object type estimation at real-time speed, and can be applied to abnormal behavior detection.
(3) The method can detect the unknown object types which are not in the training set, greatly improves the identification range, improves the generalization and is more convenient to be applied to life.
Drawings
FIG. 1 is a flow chart of the method for recognizing 3D hand-object interaction based on RGB video; I1~IN denote the N video frames, CNN is the convolutional neural network, and RNN is the recurrent neural network.
FIG. 2 is a schematic diagram of the hand and object keypoints; (a) shows the 21 keypoints of the hand and (b) the 21 keypoints of the object. In the figure, P, R, M, I, T denote the 5 fingers, TIP the fingertip, DIP the distal interphalangeal joint, PIP the proximal interphalangeal joint, MCP the metacarpophalangeal (palm) joint, and Wrist the wrist.
FIG. 3 is a schematic diagram of a grid coordinate system of an input image;
FIG. 4 is a vector diagram of the hand and object positions and their cell storage in a grid coordinate system;
FIG. 5 is a schematic diagram of the interactive recurrent network in the model; x1~xN denote the inputs to the interactive recurrent neural network.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model.
Detailed Description
The method of the invention is further described below with reference to the figures and specific examples.
The hand action recognition method disclosed by the invention does not rely on an external detection algorithm and only requires end-to-end training on single images. A single RGB image is input and, in one feed-forward pass through the neural network, the 3D hand and object poses are estimated jointly, their interaction is modeled, and the object and action classes are identified; when the object class is identified as the background class, the zero-shot learning classifier module searches the semantic space for the closest class to predict the unknown object class. The hand and object pose information is then further merged and propagated in the time domain to infer the interaction between the hand and object trajectories and to recognize the action. The method takes a series of frames as input and outputs the 3D hand and object pose predictions of each frame and the object and action class estimates of the whole sequence.
Fig. 1 is a schematic flow chart of a method for recognizing interaction between a 3D hand and an object based on RGB video, the method mainly includes the following steps:
(1) Model training. Model training has two parts: the convolutional neural network is trained first, its parameters are then fixed, and the interactive recurrent neural network is trained. The convolutional neural network is based on the YOLO architecture and has 31 layers in total; except for the last layer, which is the predictor, all layers are convolutional or pooling layers. Through the final predictor the convolutional neural network outputs an $H \times W \times D \times 2 \times (3N_c + 1 + N_a + N_o)$ tensor, corresponding to the two vectors (hand and object) stored in each grid cell. In this embodiment H = W = 13 and D = 5, and the input picture size is 416 × 416. After the convolutional network is trained, the hand and object keypoint vectors obtained from each frame are passed through a multilayer perceptron with one hidden layer to learn the interaction relationship, then fed into a recurrent neural network with two hidden layers, which finally outputs the interaction-class estimate. The training data in this embodiment is the First-Person Hand Action (FPHA) dataset, a publicly available 3D hand-object interaction recognition dataset that includes labels for 3D hand poses, 6D object poses and action classes. FPHA contains videos of 6 actors belonging to 45 different activity categories, in which the subjects perform complex actions corresponding to daily human activities. A subset of the dataset contains annotations of the 6D object pose and corresponding mesh models of 4 objects involving 10 different action classes. According to the object classes that interact with the hand, the data are divided into a training set and a test set, where the test set contains object classes (unknown classes) that do not appear in the training set.
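For the embodiment's settings, the size of the predictor output can be checked as follows; the action and object class counts and the memory layout used in the reshape are illustrative assumptions.

```python
# Sketch of the predictor output layout for H = W = 13, D = 5, N_c = 21
# (input 416 x 416); class counts and layout are assumptions, not the
# authors' exact configuration.
import torch

H, W, D = 13, 13, 5
N_c, N_a, N_o = 21, 45, 5              # keypoints, action classes, object classes
per_target = 3 * N_c + 1 + N_a + N_o   # coords + confidence + action + object
out = torch.randn(1, D * 2 * per_target, H, W)   # raw CNN output feature map
cells = out.view(1, D, 2, per_target, H, W)      # two vectors (hand, object)
print(cells.shape)                               # per (u, v, z) grid cell
```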
(2) Detection stage. By inputting a series of video frames into the model, the 3D poses of the hand and the object in each frame and the interaction class of the hand and the object over the whole sequence are estimated. When the object is predicted to be the background class, the zero-shot learning classifier predicts the unknown class of the object.
Fig. 2 is a schematic diagram of key points of a hand and an object, and 21 key points are taken for unified calculation. The key points of the hand are the four joints of each finger, and the wrist joint. The key points of an object take the eight vertices, the center point, and the midpoints of the 12 edges of its bounding box. The grids in which the wrist nodes and the object center points are located are used for predicting the types of the objects and the actions.
Fig. 3 is a schematic diagram of a grid coordinate system of an input image, where the upper left corner of the grid is set as the origin of coordinates, each grid is a unit, and the grid coordinates are the number of grids shifted from the upper left corner.
FIG. 4 is a vector diagram of the positions of the hand and the object and the cells they occupy in the grid coordinate system; whether a cell contains the hand or the object is determined by whether the corresponding keypoint falls into that cell.
FIG. 5 is a schematic diagram of the interactive recurrent network in the model. Each frame of the image first passes through the convolutional network, and the relationship between the resulting hand and object keypoint vectors $\hat{\mathbf{p}}^h$ and $\hat{\mathbf{p}}^o$ is modeled by the multilayer perceptron; the resulting vector is passed to the recurrent neural network with two hidden layers to learn the temporal information in the video, and finally the interaction-class estimate is output.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model; when the probability of the background class is the highest, the object is judged to belong to an unknown class. The probabilities of the predicted classes other than the background are multiplied by the vectors of those classes in the semantic space, the resulting semantic vectors are summed as the final predicted semantic vector, the similarity between this vector and the classes in the semantic space is computed, and when the highest similarity is not lower than a threshold the unknown object is considered to belong to the class with the highest similarity.
Even in complex real scenes, the method can effectively identify, in real time and from RGB video, the trajectories, classes and interaction of the hand and an unknown object, and obtains the semantic and temporal information of the captured sequence in the video. This greatly improves action recognition efficiency, solves the problem that traditional gesture recognition cannot identify the semantic information of interaction with objects, and can identify an unknown object interacting with the hand without requiring a depth image or ground-truth object coordinates as input, providing a good theoretical basis for wide application of the method.
The convolutional neural network, the zero-shot learning classifier and the recurrent neural network of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the modules are described; however, as long as a combination of modules is not contradictory, it should be considered within the scope of this description.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for detecting interaction between a 3D hand and an unknown object in RGB video in real time by using the above method, including:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A method for detecting interaction between a 3D hand and an unknown object in an RGB video in real time is characterized by comprising the following steps:
training a convolutional neural network with a video frame as input, the convolutional neural network predicting 3D hand gestures, 6D object poses, hand motions and object classes of each frame of image;
taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing time sequence information in the video;
and inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
2. The method of claim 1, wherein 21 keypoints of the hand and the object are specified, and the convolutional neural network determines the 3D hand pose and the 6D object pose by predicting the coordinates of the keypoints, wherein the keypoints of the hand are the four joints and wrist nodes of each finger, and the keypoints of the object are the eight vertices, the center point, and the midpoint of the 12 edges of the bounding box of the object.
3. The method of claim 2, wherein the convolutional neural network predicts the coordinates of the keypoints and predicts the hand action and object class by the following steps:

dividing each picture frame into H × W grids and extending it to D grids in depth, with pixels as the unit in the plane and meters as the unit in the depth direction, i.e., each grid cell has a size of $C_u \times C_v$ pixels by $C_z$ meters; in the grid coordinate system the upper-left corner of the grid is taken as the origin and one grid cell as one unit;

storing in each cell two vectors
$$\mathbf{v}^h = \big[\hat{\mathbf{p}}^h,\ \hat{\mathbf{a}},\ \hat{c}^h\big], \qquad \mathbf{v}^o = \big[\hat{\mathbf{p}}^o,\ \hat{\mathbf{o}},\ \hat{c}^o\big],$$
to predict the characteristics of the hand and of the object respectively, where $\hat{\mathbf{p}}^h, \hat{\mathbf{p}}^o \in \mathbb{R}^{3N_c}$ are the coordinates of the hand and object keypoints and $N_c$ is the number of keypoints of the hand or the object, $\hat{\mathbf{a}} \in \mathbb{R}^{N_a}$ is the action-class probability and $N_a$ the number of action classes, $\hat{\mathbf{o}} \in \mathbb{R}^{N_o}$ is the object-class probability and $N_o$ the number of object classes, and $\hat{c}^h, \hat{c}^o \in [0, 1]$ are the confidences; the cells containing the wrist node and the object center point are used to predict the action and the object class; the two vectors stored in each cell are obtained from the convolutional neural network;

first determining the coordinates $(u, v, z)$ of the cell containing a keypoint and then predicting the offsets $\Delta u, \Delta v, \Delta z$ of the keypoint, in the three dimensions, relative to the upper-left corner of that cell, so that the coordinates of the keypoint in the grid coordinate system are
$$\hat{u} = u + g(\Delta u), \qquad \hat{v} = v + g(\Delta v), \qquad \hat{z} = z + g(\Delta z);$$

the cells containing the wrist node and the object center point are responsible for predicting the action and the object class, so $g(x)$ is used to constrain the offsets of these two points to lie in $[0, 1]$, thereby determining the cells responsible for predicting the action and object class; $g(x)$ is defined as
$$g(x) = \begin{cases} \operatorname{sigmoid}(x), & \text{for the wrist node and the object center point}, \\ x, & \text{otherwise}, \end{cases}$$
where $g(x)$ denotes the function restricting the offsets of the wrist node and the object center point, $x$ denotes an offset $\Delta u, \Delta v, \Delta z$ of a keypoint relative to the upper-left corner of its cell, and $\operatorname{sigmoid}$ denotes the activation function with range $(0, 1)$, which maps a real number into the interval $(0, 1)$.
4. The method of claim 3, wherein a higher confidence is assigned to the grid cells in which a hand or object is present, the confidence function being set as
$$c(x) = \begin{cases} e^{\alpha\left(1 - \frac{D_T(x)}{d_{th}}\right)}, & D_T(x) < d_{th}, \\ 0, & \text{otherwise}, \end{cases}$$
where $D_T(x)$ is the Euclidean distance between the predicted point and the ground-truth point, $\alpha$ denotes a hyper-parameter and $d_{th}$ denotes a set threshold; the total confidence is the average over the $N_c$ keypoints,
$$\hat{c} = \frac{1}{N_c} \sum_{i=1}^{N_c} c(\mathbf{p}_i), \qquad \text{where } D_T(\mathbf{p}_i) = \lVert \mathbf{p}_i - \mathbf{p}_i^{gt} \rVert_2 .$$
5. The method according to claim 3, wherein when the probability of the background class of the object is the largest, the object is judged to belong to an unknown class, and a zero-shot learning classifier identifies the unknown class of the object by introducing semantic information; the zero-shot learning classifier multiplies the probabilities of the predicted classes other than the background by the vectors of those classes in the semantic space, sums the resulting semantic vectors as the final predicted semantic vector, then computes the similarity between this vector and the classes in the semantic space, and when the highest similarity is not lower than a threshold the unknown object is considered to belong to the class with the highest similarity.
6. The method of claim 3, wherein the overall loss function of the convolutional neural network is
$$\mathcal{L} = \lambda_{pose}\,\mathcal{L}_{pose} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{actcls}\,\mathcal{L}_{actcls} + \lambda_{objcls}\,\mathcal{L}_{objcls},$$
where $\lambda_{pose}$ denotes the weight of the loss on the predicted hand and object positions, $\lambda_{conf}$ the weight of the confidence loss, $\lambda_{actcls}$ the weight of the loss on the predicted action class, $\lambda_{objcls}$ the weight of the loss on the predicted object class, and $G_t$ denotes the regular fixed grid into which the picture is divided; $\hat{\mathbf{p}}^h$ denotes the predicted hand coordinates, $\hat{\mathbf{p}}^o$ the predicted object coordinates, $\hat{c}^h$ the confidence of the hand prediction, $\hat{c}^o$ the confidence of the object prediction, $\hat{\mathbf{o}}$ the predicted object-class probability, and $\hat{\mathbf{a}}$ the predicted action-class probability.
7. The method according to claim 3, wherein the interactive recurrent neural network takes as input the keypoint coordinate vectors of the hand and the object obtained by the convolutional neural network; their interaction relationship is modeled by a multilayer perceptron whose output serves as the input of the recurrent neural network, which finally outputs the interaction class of the hand and the object in the video.
8. An apparatus for detecting 3D hand interaction with unknown objects in RGB video in real time by using the method of any one of claims 1 to 7, comprising:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202010916742.4A 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time Active CN112199994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Publications (2)

Publication Number Publication Date
CN112199994A true CN112199994A (en) 2021-01-08
CN112199994B CN112199994B (en) 2023-05-12

Family

ID=74005883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010916742.4A Active CN112199994B (en) Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Country Status (1)

Country Link
CN (1) CN112199994B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504A (en) * 2021-01-20 2021-04-30 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN109919078A (en) * 2019-03-05 2019-06-21 腾讯科技(深圳)有限公司 A kind of method, the method and device of model training of video sequence selection
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning
CN109919078A (en) * 2019-03-05 2019-06-21 腾讯科技(深圳)有限公司 A kind of method, the method and device of model training of video sequence selection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANKUN WU et al.: "A Method for Detecting Interaction between 3D Hands and Unknown Objects in RGB Video", 2021 2nd International Workshop on Electronic Communication and Artificial Intelligence (IWECAI 2021) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504A (en) * 2021-01-20 2021-04-30 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video

Also Published As

Publication number Publication date
CN112199994B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
Doosti et al. Hope-net: A graph-based model for hand-object pose estimation
Kwon et al. H2o: Two hands manipulating objects for first person interaction recognition
Wang et al. Atloc: Attention guided camera localization
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
Lao et al. Automatic video-based human motion analyzer for consumer surveillance system
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
Han et al. Enhanced computer vision with microsoft kinect sensor: A review
Wang et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
Elgammal et al. Tracking people on a torus
Wang et al. Predicting camera viewpoint improves cross-dataset generalization for 3d human pose estimation
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
WO2021098802A1 (en) Object detection device, method, and systerm
Li et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
Zhang et al. Handsense: smart multimodal hand gesture recognition based on deep neural networks
CN112199994B (en) Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time
Cao et al. Real-time gesture recognition based on feature recalibration network with multi-scale information
Wu et al. Context-aware deep spatiotemporal network for hand pose estimation from depth images
Kourbane et al. A graph-based approach for absolute 3D hand pose estimation using a single RGB image
Le et al. A survey on 3D hand skeleton and pose estimation by convolutional neural network
Raman et al. Emotion and Gesture detection
Nie et al. A child caring robot for the dangerous behavior detection based on the object recognition and human action recognition
US20220180548A1 (en) Method and apparatus with object pose estimation
Lu et al. Dynamic hand gesture recognition using HMM-BPNN model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant