CN112199994A - Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time - Google Patents

Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time

Info

Publication number
CN112199994A
CN112199994A (application CN202010916742.4A)
Authority
CN
China
Prior art keywords
hand
neural network
video
convolutional neural
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010916742.4A
Other languages
Chinese (zh)
Other versions
CN112199994B (en)
Inventor
Xue Cong
Wu Yankun
Xiang Ji
Zha Daren
Wang Lei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010916742.4A priority Critical patent/CN112199994B/en
Publication of CN112199994A publication Critical patent/CN112199994A/en
Application granted granted Critical
Publication of CN112199994B publication Critical patent/CN112199994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for detecting, in real time, the interaction between a 3D hand and an unknown object in an RGB video. The method comprises the following steps: training a convolutional neural network that takes video frames as input and predicts the 3D hand pose, 6D object pose, hand action and object class of each frame; training an interactive recurrent neural network that takes the 3D hand pose and 6D object pose detected by the convolutional neural network as input and uses the temporal information in the video to obtain the interaction class of the hand and the object; and inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network to obtain the 3D hand pose, 6D object pose, hand action and object class of each frame, together with the hand-object interaction in the video. Because neither depth images nor ground-truth object pose coordinates are required as input, the accuracy of hand action recognition is improved, the recognition range is greatly extended, and the method is easier to apply in daily life.

Description

Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time
Technical Field
The invention relates to hand and object interaction identification, and aims to detect the motion tracks and interaction categories of hands and unknown objects in RGB (red, green and blue) videos in real time.
Background
In recent years, with the development of computer vision and virtual reality technology and the growing demand for intelligent home applications, "people-centered" action recognition and behavior understanding have become research hotspots in the field of computer vision. In behavior understanding, the recognition of hand-object interaction is crucial; it comprises recognizing the type of hand action and the type of object, and only with the semantic information of hand-object interaction can a system better understand the user's intention and predict the next action. Meanwhile, real-time hand shape detection and motion tracking have always been core components of sign language recognition and gesture control systems, and they play an important role in some augmented reality experiences.
Currently, hand recognition methods can be broadly divided into vision-based non-contact approaches and sensor-based contact approaches. Sensor-based methods require the operator to wear equipment such as data gloves, and the parameters must be re-adjusted whenever the operator changes; although the three-dimensional pose of the gesture in space can be obtained directly in real time, the inconvenience of operation makes such methods difficult to popularize in practice. In contrast, vision-based gesture recognition allows the operator to interact with machines in a more natural manner. In future human-computer interaction and monitoring it is therefore highly desirable for machines to perceive human intent through vision systems, where vision-based action recognition and behavior understanding are particularly important.
However, although it is crucial to a semantically meaningful interpretation of visual scenes, the problem of jointly understanding humans and objects has received little attention. Much research still treats the visual understanding of humans and of objects in isolation. Conventional hand motion recognition methods segment the hand alone from a first-person view to recognize its pose (G. Rogez, J. S. Supancic, and D. Ramanan. First-Person Pose Recognition Using Egocentric Workspaces. In CVPR, 2015.), or estimate hand pose in RGB images from first- and third-person views (U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand Pose Estimation via Latent 2.5D Heatmap Regression. In ECCV, 2018.), but they do not model the objects that interact with the hand. Some methods use object interaction as an additional constraint when estimating hand pose (C. Choi, S. H. Yoon, C. Chen, and K. Ramani. Robust Hand Pose Estimation during the Interaction with an Unknown Object. In ICCV, 2017.), which improves the accuracy of hand pose estimation but relies on depth images as input. Some methods reconstruct the poses of hands and objects (Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid. Learning Joint Reconstruction of Hands and Manipulated Objects. In CVPR, 2019.), but do not learn semantic information. Some methods can identify hand-object interactions (B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In CVPR, 2019.), but they only recognize objects known in the dataset and lack generalization.
Although existing methods can analyze the semantic information of hand-object interaction, the object classes they can recognize are limited by the available hand datasets: existing hand action datasets contain very few object classes that interact with the hand, and labeling new data consumes a great deal of manpower and material resources. It is therefore of practical value to propose a method that can recognize the interaction of a hand with an unknown object from RGB video.
Disclosure of Invention
The invention aims to provide a method and a device for detecting, in real time, the spatial pose and the interaction category of a 3D hand and an unknown object from RGB video.
The inventors note that many prior-art methods solve for the pose of the hand or the object in isolation. Such gesture recognition methods can only recognize the shape of the hand and some simple gestures (such as a thumbs-up or a victory sign) and cannot recognize the interaction with an object. Some methods that reconstruct hand and object poses can restore object edges well, but do not analyze the semantic information of the scene. Some action recognition methods must rely on depth images as input, otherwise their accuracy is low. Some object pose estimation methods do not regress the 6D pose directly, but first generate a 2D box and then compute the 6D pose through a PnP algorithm, losing part of the information. The invention solves these problems with a single end-to-end model that completes multiple tasks at once: from RGB video input it simultaneously predicts the 3D hand and object poses, actions and category estimates, requires neither depth images nor ground-truth object pose coordinates as input, and improves the accuracy of hand action recognition.
As shown in FIG. 1, the invention mainly comprises a convolutional neural network (CNN) that identifies, for each frame of the image, the 3D hand pose, the 6D object pose (the 3D position and 3D orientation of the object), the hand action (pour, open, close, etc.) and the object class (milk, detergent, juice box, etc.), and an interactive recurrent neural network (interactive RNN) that extracts and integrates the temporal features in the video to obtain the interaction class of the hand and the object over the whole video (pour milk, open juice box, etc.). The method is divided into a training stage and a usage stage. In the training stage, training proceeds in two steps: first, video frames are used as input to train the convolutional neural network, which predicts the 3D hand pose, 6D object pose, hand action and object class of each frame; after training, the parameters of the convolutional neural network are fixed and the recurrent neural network is trained, taking the detected keypoint coordinates of the hand and the object as input and outputting the interaction-class estimate of the hand and the object over the whole video. In the usage stage, the complete model takes a series of video frames as input and, after passing through the two neural networks, outputs the 3D hand pose and object pose predictions of each frame and the object and action class estimates of the whole video frame sequence.
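The two-step schedule above can be summarized in code. The following is a minimal PyTorch-style sketch of the training order only; the module names (`frame_cnn`, `interaction_rnn`), the data loaders and the loss functions are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the two-step training schedule (assumed module and loader names).
import torch

def train_two_stage(frame_cnn, interaction_rnn, frame_loader, video_loader,
                    cnn_loss_fn, rnn_loss_fn, epochs_cnn=10, epochs_rnn=10):
    # Step 1: train the per-frame CNN on single images.
    opt_cnn = torch.optim.Adam(frame_cnn.parameters(), lr=1e-4)
    for _ in range(epochs_cnn):
        for image, target in frame_loader:
            pred = frame_cnn(image)            # per-cell hand/object vectors
            loss = cnn_loss_fn(pred, target)   # pose + confidence + class terms
            opt_cnn.zero_grad(); loss.backward(); opt_cnn.step()

    # Step 2: freeze the CNN, train the interaction RNN on keypoint sequences.
    for p in frame_cnn.parameters():
        p.requires_grad = False
    opt_rnn = torch.optim.Adam(interaction_rnn.parameters(), lr=1e-4)
    for _ in range(epochs_rnn):
        for frames, interaction_label in video_loader:   # frames: (T, 3, H, W)
            with torch.no_grad():
                keypoints = [frame_cnn(f.unsqueeze(0)) for f in frames]
            logits = interaction_rnn(torch.stack(keypoints))
            loss = rnn_loss_fn(logits, interaction_label)
            opt_rnn.zero_grad(); loss.backward(); opt_rnn.step()
```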
The technical scheme adopted by the invention mainly comprises the following steps (unless otherwise stated, the steps are executed by the software and hardware of a computer or other electronic equipment):
(1) Model building and training. When the model is used for the first time, the user first trains the convolutional neural network and the interactive recurrent neural network, and then uses the trained model for action recognition.
(2) Video input. When a segment of RGB video is input, the model detects in real time the 3D position (i.e., 3D pose) of the hand, the 6D pose of the object, the hand action, the object class estimate, and the interaction of the hand and the object over the whole video.
Further, in the detailed design of the model, as shown in FIG. 2, 21 keypoints are specified for the hand and for the object respectively (the keypoints of the hand are the four joints of each finger plus the wrist node; the keypoints of the object are the eight vertices, the center point and the midpoints of the 12 edges of its bounding box), and the poses of the hand and the object (i.e., the 3D hand pose and the 6D object pose) are determined by predicting the coordinates of these keypoints.
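As a concrete illustration of the 21 object keypoints, the sketch below builds them from an axis-aligned bounding box; the axis alignment and the helper name are simplifications for the example (the patent's bounding boxes follow the 6D object pose).

```python
import numpy as np
from itertools import product

def object_keypoints(box_min, box_max):
    """Build the 21 object keypoints described above: 8 box corners,
    the center point, and the midpoints of the 12 edges."""
    box_min, box_max = np.asarray(box_min, float), np.asarray(box_max, float)
    corners = np.array([[x, y, z] for x, y, z in
                        product(*zip(box_min, box_max))])        # 8 vertices
    center = (box_min + box_max) / 2.0                           # 1 center point
    # The 12 edges connect corner pairs that differ in exactly one coordinate.
    edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if np.sum(corners[i] != corners[j]) == 1]
    midpoints = np.array([(corners[i] + corners[j]) / 2.0 for i, j in edges])
    return np.vstack([corners, [center], midpoints])             # shape (21, 3)
```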
Further, the method for predicting the coordinates of the key points and predicting the hand motion and object category by the convolutional neural network is as follows:
as shown in fig. 3 and 4, each picture frame is divided into H × W meshes, and the depth is expanded by D meshes (H, W, D represents height, width, and depth, respectively), and the pixels (pixels) are provided on a plane, and meters (meters) are provided in the depth direction, that is, the size of each mesh is Cu×CvPixel by CzAnd (4) rice. In the grid coordinate system, the upper left corner of the grid is taken as the origin of the coordinate system, and one grid is taken as a unit.
In order to jointly predict the hand and object poses and categories simultaneously, two vectors are stored in each cell (i.e., grid cell), as shown in FIG. 4:

$$\mathbf{v}^h = \big[\hat{\mathbf{p}}^h,\ \hat{\mathbf{a}},\ \hat{c}^h\big], \qquad \mathbf{v}^o = \big[\hat{\mathbf{p}}^o,\ \hat{\mathbf{o}},\ \hat{c}^o\big],$$

to predict the characteristics of the hand and of the object respectively, where $\hat{\mathbf{p}}^h, \hat{\mathbf{p}}^o \in \mathbb{R}^{3N_c}$ are the coordinates of the hand and object keypoints and $N_c$ is the number of keypoints of the hand or the object; $\hat{\mathbf{a}} \in \mathbb{R}^{N_a}$ is the action-class probability and $N_a$ the number of action classes; $\hat{\mathbf{o}} \in \mathbb{R}^{N_o}$ is the object-class probability and $N_o$ the number of object classes (the invention adds a background class: if the object is unknown it is assigned to the background class, and a zero-shot learning classifier is then used to identify the unknown object); $\hat{c}^h, \hat{c}^o \in [0, 1]$ are the confidences. The cells containing the wrist node and the object center point are used to predict the action and the object class.
the two vectors stored per cell are derived from a convolutional neural network. The invention firstly determines the coordinates (u, v, z) of the cell where the key point is located, and then predicts the deviations delta u, delta v, delta z of the key point relative to the upper left corner of the cell where the key point is located in three dimensions, so as to obtain the coordinates of the key point in a grid coordinate system:
Figure BDA00026652811800000310
Figure BDA00026652811800000311
Figure BDA00026652811800000312
since the cell where the wrist node and the object center point are located is responsible for predicting the motion and the object type, g (x) is used for controlling the offset of the two points to be between [0 and 1], so that the cell responsible for predicting the motion and the object type is determined. g (x) the expression is as follows:
Figure BDA0002665281180000041
wherein g (x) represents a function for restricting the deviation of the wrist node and the center point of the object, x represents the deviation delta u, delta v, delta z of the key point relative to the upper left corner of the cell in which the key point is located in three dimensions, sigmoid represents an activation function, the value range is (0,1), a real number can be mapped to the interval of (0,1), and the function is utilized to enable the wrist node and the center point of the object to be still located in the cell in which the wrist node and the center point of the object are located after being deviated to predict the action and the object category.
In addition, given the three-dimensional position in the grid coordinate system and the camera intrinsics $K$, the three-dimensional coordinates of the keypoint in the camera coordinate system can be computed as

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \hat{z}\, C_z \, K^{-1} \begin{bmatrix} \hat{u}\, C_u \\ \hat{v}\, C_v \\ 1 \end{bmatrix}.$$
further, a higher confidence is set for the mesh where the hand or object is present, setting the confidence function as:
Figure BDA0002665281180000043
wherein D isT(x) Is the Euclidean distance between the predicted point and the real point, alpha represents the hyper parameter, dthIndicating a set threshold value, D, as the predicted value is closer to the true valueT(x) Smaller c (x) indicates greater confidence, and conversely, less confidence. The total confidence is:
Figure BDA0002665281180000044
wherein:
Figure BDA0002665281180000045
further, when the probability of the background class of the object is the maximum, it is determined that the object belongs to an unknown class. Referring to fig. 6, a zero-learning classifier module is used to identify unknown object classes by introducing semantic information. The zero-time learning classifier module multiplies the probabilities of other prediction classes except the background by the vectors of the prediction classes in the semantic space respectively, adds the obtained semantic vectors to be used as the final predicted semantic vector, then calculates the class and the similarity of the class in the semantic space, and considers that the unknown object belongs to the class with the highest similarity when the highest similarity value is not lower than a threshold value.
Further, the overall loss function of the convolutional neural network of the invention is

$$\mathcal{L} = \lambda_{pose}\,\mathcal{L}_{pose} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{actcls}\,\mathcal{L}_{actcls} + \lambda_{objcls}\,\mathcal{L}_{objcls},$$

where $\lambda_{pose}$ denotes the weight of the loss on the predicted hand and object positions, $\lambda_{conf}$ the weight of the confidence loss, $\lambda_{actcls}$ the weight of the loss on the predicted action class, $\lambda_{objcls}$ the weight of the loss on the predicted object class, and $G_t$ denotes the regular fixed grid into which the picture is divided and over whose cells the losses are accumulated; $\hat{\mathbf{p}}^h$ denotes the predicted hand coordinates, $\hat{\mathbf{p}}^o$ the predicted object coordinates, $\hat{c}^h$ the confidence of the hand prediction, $\hat{c}^o$ the confidence of the object prediction, $\hat{\mathbf{o}}$ the predicted object-class probability, and $\hat{\mathbf{a}}$ the predicted action-class probability.
Further, because the convolutional network only learns information from each individual frame and does not use the temporal information in the video, the invention adds an interactive recurrent neural network, as shown in FIG. 5. The keypoint coordinate vectors of the hand and the object computed by the convolutional network, $\hat{\mathbf{p}}^h$ and $\hat{\mathbf{p}}^o$, are fed into a multilayer perceptron that models their relationship, and its output is used as the input of the recurrent neural network:

$$y = f_\phi\!\big(g_\theta(\hat{\mathbf{p}}^h_1, \hat{\mathbf{p}}^o_1),\ \ldots,\ g_\theta(\hat{\mathbf{p}}^h_N, \hat{\mathbf{p}}^o_N)\big),$$

where $f_\phi$ is the recurrent neural network model and $g_\theta$ is the multilayer perceptron model; the final output is the interaction class of the hand and the object in the video.
Based on the same inventive concept, the invention also provides a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time using the above method, the device comprising:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
The method of the invention for identifying the interaction between a 3D hand and an object in RGB video greatly improves practicability, specifically:
(1) the method does not need to rely on the depth image shot by the RGB-D camera, and can detect the hand-object interaction in the RGB video by inputting a series of frames, so that the applicable range in life is greatly increased.
(2) The method can simultaneously detect the position tracks of the hands and the objects, the motion types and the object type estimation at real-time speed, and can be applied to abnormal behavior detection.
(3) The method can detect the unknown object types which are not in the training set, greatly improves the identification range, improves the generalization and is more convenient to be applied to life.
Drawings
FIG. 1 is a flow chart of the method for recognizing 3D hand-object interaction based on RGB video; I1~IN denote the N video frames, CNN is the convolutional neural network, and RNN is the recurrent neural network.
FIG. 2 is a schematic diagram of the hand and object keypoints; (a) shows the 21 keypoints of the hand and (b) the 21 keypoints of the object. In the figure, P, R, M, I, T denote the 5 fingers, TIP the fingertip, DIP the distal interphalangeal joint, PIP the proximal interphalangeal joint, MCP the metacarpophalangeal (palm) joint, and Wrist the wrist.
FIG. 3 is a schematic diagram of a grid coordinate system of an input image;
FIG. 4 is a vector diagram of the hand and object positions and their cell storage in a grid coordinate system;
FIG. 5 is a schematic diagram of the interactive recurrent network in the model; x1~xN denote the inputs to the interactive recurrent neural network.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model.
Detailed Description
The method of the invention is further described below with reference to the figures and specific examples.
The hand action recognition method disclosed by the invention does not rely on an external detection algorithm and only requires end-to-end training on single images. A single RGB image is input and, in one feed-forward pass through the neural network, the 3D hand and object poses are estimated jointly, their interaction is modeled, and the object and action classes are identified; when the object class is identified as the background class, the zero-shot learning classifier module searches the semantic space for the closest class to predict the unknown object class. The hand and object pose information is then further merged and propagated in the time domain to infer the interaction between the hand and object trajectories and to recognize the action. The method takes a series of frames as input and outputs the 3D hand and object pose predictions of each frame and the object and action class estimates of the whole sequence.
Fig. 1 is a schematic flow chart of a method for recognizing interaction between a 3D hand and an object based on RGB video, the method mainly includes the following steps:
(1) Model training. Model training has two parts: the convolutional neural network is trained first, its parameters are then fixed, and the interactive recurrent neural network is trained. The convolutional neural network is based on the YOLO architecture and has 31 layers in total; except for the last layer, which is the predictor, all layers are convolutional or pooling layers. Through the final predictor the convolutional neural network outputs an $H \times W \times D \times 2 \times (3N_c + 1 + N_a + N_o)$ tensor, corresponding to the two vectors (hand and object) stored in each grid cell. In this embodiment H = W = 13 and D = 5, and the input picture size is 416 × 416. After the convolutional network is trained, the hand and object keypoint vectors obtained from each frame are passed through a multilayer perceptron with one hidden layer to learn the interaction relationship, then fed into a recurrent neural network with two hidden layers, which finally outputs the interaction-class estimate. The training data in this embodiment is the First-Person Hand Action (FPHA) dataset, a publicly available 3D hand-object interaction recognition dataset that includes labels for 3D hand poses, 6D object poses and action classes. FPHA contains videos of 6 actors belonging to 45 different activity categories, in which the subjects perform complex actions corresponding to daily human activities. A subset of the dataset contains annotations of the 6D object pose and corresponding mesh models of 4 objects involving 10 different action classes. According to the object classes that interact with the hand, the data are divided into a training set and a test set, where the test set contains object classes (unknown classes) that do not appear in the training set.
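For the embodiment's settings, the size of the predictor output can be checked as follows; the action and object class counts and the memory layout used in the reshape are illustrative assumptions.

```python
# Sketch of the predictor output layout for H = W = 13, D = 5, N_c = 21
# (input 416 x 416); class counts and layout are assumptions, not the
# authors' exact configuration.
import torch

H, W, D = 13, 13, 5
N_c, N_a, N_o = 21, 45, 5              # keypoints, action classes, object classes
per_target = 3 * N_c + 1 + N_a + N_o   # coords + confidence + action + object
out = torch.randn(1, D * 2 * per_target, H, W)   # raw CNN output feature map
cells = out.view(1, D, 2, per_target, H, W)      # two vectors (hand, object)
print(cells.shape)                               # per (u, v, z) grid cell
```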
(2) Detection stage. By inputting a series of video frames into the model, the 3D poses of the hand and the object in each frame and the interaction class of the hand and the object over the whole sequence are estimated. When the object is predicted to be the background class, the zero-shot learning classifier predicts the unknown class of the object.
Fig. 2 is a schematic diagram of key points of a hand and an object, and 21 key points are taken for unified calculation. The key points of the hand are the four joints of each finger, and the wrist joint. The key points of an object take the eight vertices, the center point, and the midpoints of the 12 edges of its bounding box. The grids in which the wrist nodes and the object center points are located are used for predicting the types of the objects and the actions.
Fig. 3 is a schematic diagram of a grid coordinate system of an input image, where the upper left corner of the grid is set as the origin of coordinates, each grid is a unit, and the grid coordinates are the number of grids shifted from the upper left corner.
FIG. 4 is a vector diagram of the positions of the hand and the object and the cells they occupy in the grid coordinate system; whether a cell contains the hand or the object is determined by whether the corresponding keypoint falls into that cell.
FIG. 5 is a schematic diagram of the interactive recurrent network in the model. Each frame of the image first passes through the convolutional network, and the relationship between the resulting hand and object keypoint vectors $\hat{\mathbf{p}}^h$ and $\hat{\mathbf{p}}^o$ is modeled by the multilayer perceptron; the resulting vector is passed to the recurrent neural network with two hidden layers to learn the temporal information in the video, and finally the interaction-class estimate is output.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model; when the probability of the background class is the highest, the object is judged to belong to an unknown class. The probabilities of the predicted classes other than the background are multiplied by the vectors of those classes in the semantic space, the resulting semantic vectors are summed as the final predicted semantic vector, the similarity between this vector and the classes in the semantic space is computed, and when the highest similarity is not lower than a threshold the unknown object is considered to belong to the class with the highest similarity.
Even in complex real scenes, the method can effectively identify, in real time and from RGB video, the trajectories, classes and interaction of the hand and an unknown object, and obtains the semantic and temporal information of the captured sequence in the video. This greatly improves action recognition efficiency, solves the problem that traditional gesture recognition cannot identify the semantic information of interaction with objects, and can identify an unknown object interacting with the hand without requiring a depth image or ground-truth object coordinates as input, providing a good theoretical basis for wide application of the method.
The convolutional neural network, the zero-shot learning classifier and the recurrent neural network of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the modules are described; however, as long as a combination of modules is not contradictory, it should be considered within the scope of this description.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for detecting interaction between a 3D hand and an unknown object in RGB video in real time by using the above method, including:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A method for detecting interaction between a 3D hand and an unknown object in an RGB video in real time is characterized by comprising the following steps:
training a convolutional neural network with a video frame as input, the convolutional neural network predicting 3D hand gestures, 6D object poses, hand motions and object classes of each frame of image;
taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing time sequence information in the video;
and inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
2. The method of claim 1, wherein 21 keypoints of the hand and the object are specified, and the convolutional neural network determines the 3D hand pose and the 6D object pose by predicting the coordinates of the keypoints, wherein the keypoints of the hand are the four joints and wrist nodes of each finger, and the keypoints of the object are the eight vertices, the center point, and the midpoint of the 12 edges of the bounding box of the object.
3. The method of claim 2, wherein the convolutional neural network predicts the coordinates of the keypoints and predicts the hand action and object class by the following steps:

dividing each picture frame into H × W grids and extending it to D grids in depth, with pixels as the unit in the plane and meters as the unit in the depth direction, i.e., each grid cell has a size of $C_u \times C_v$ pixels by $C_z$ meters; in the grid coordinate system the upper-left corner of the grid is taken as the origin and one grid cell as one unit;

storing in each cell two vectors
$$\mathbf{v}^h = \big[\hat{\mathbf{p}}^h,\ \hat{\mathbf{a}},\ \hat{c}^h\big], \qquad \mathbf{v}^o = \big[\hat{\mathbf{p}}^o,\ \hat{\mathbf{o}},\ \hat{c}^o\big],$$
to predict the characteristics of the hand and of the object respectively, where $\hat{\mathbf{p}}^h, \hat{\mathbf{p}}^o \in \mathbb{R}^{3N_c}$ are the coordinates of the hand and object keypoints and $N_c$ is the number of keypoints of the hand or the object, $\hat{\mathbf{a}} \in \mathbb{R}^{N_a}$ is the action-class probability and $N_a$ the number of action classes, $\hat{\mathbf{o}} \in \mathbb{R}^{N_o}$ is the object-class probability and $N_o$ the number of object classes, and $\hat{c}^h, \hat{c}^o \in [0, 1]$ are the confidences; the cells containing the wrist node and the object center point are used to predict the action and the object class; the two vectors stored in each cell are obtained from the convolutional neural network;

first determining the coordinates $(u, v, z)$ of the cell containing a keypoint and then predicting the offsets $\Delta u, \Delta v, \Delta z$ of the keypoint, in the three dimensions, relative to the upper-left corner of that cell, so that the coordinates of the keypoint in the grid coordinate system are
$$\hat{u} = u + g(\Delta u), \qquad \hat{v} = v + g(\Delta v), \qquad \hat{z} = z + g(\Delta z);$$

the cells containing the wrist node and the object center point are responsible for predicting the action and the object class, so $g(x)$ is used to constrain the offsets of these two points to lie in $[0, 1]$, thereby determining the cells responsible for predicting the action and object class; $g(x)$ is defined as
$$g(x) = \begin{cases} \operatorname{sigmoid}(x), & \text{for the wrist node and the object center point}, \\ x, & \text{otherwise}, \end{cases}$$
where $g(x)$ denotes the function restricting the offsets of the wrist node and the object center point, $x$ denotes an offset $\Delta u, \Delta v, \Delta z$ of a keypoint relative to the upper-left corner of its cell, and $\operatorname{sigmoid}$ denotes the activation function with range $(0, 1)$, which maps a real number into the interval $(0, 1)$.
4. The method of claim 3, wherein a higher confidence is assigned to the grid cells in which a hand or object is present, the confidence function being set as
$$c(x) = \begin{cases} e^{\alpha\left(1 - \frac{D_T(x)}{d_{th}}\right)}, & D_T(x) < d_{th}, \\ 0, & \text{otherwise}, \end{cases}$$
where $D_T(x)$ is the Euclidean distance between the predicted point and the ground-truth point, $\alpha$ denotes a hyper-parameter and $d_{th}$ denotes a set threshold; the total confidence is the average over the $N_c$ keypoints,
$$\hat{c} = \frac{1}{N_c} \sum_{i=1}^{N_c} c(\mathbf{p}_i), \qquad \text{where } D_T(\mathbf{p}_i) = \lVert \mathbf{p}_i - \mathbf{p}_i^{gt} \rVert_2 .$$
5. The method according to claim 3, wherein when the probability of the background class of the object is the largest, the object is judged to belong to an unknown class, and a zero-shot learning classifier identifies the unknown class of the object by introducing semantic information; the zero-shot learning classifier multiplies the probabilities of the predicted classes other than the background by the vectors of those classes in the semantic space, sums the resulting semantic vectors as the final predicted semantic vector, then computes the similarity between this vector and the classes in the semantic space, and when the highest similarity is not lower than a threshold the unknown object is considered to belong to the class with the highest similarity.
6. The method of claim 3, wherein the overall loss function of the convolutional neural network is
$$\mathcal{L} = \lambda_{pose}\,\mathcal{L}_{pose} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{actcls}\,\mathcal{L}_{actcls} + \lambda_{objcls}\,\mathcal{L}_{objcls},$$
where $\lambda_{pose}$ denotes the weight of the loss on the predicted hand and object positions, $\lambda_{conf}$ the weight of the confidence loss, $\lambda_{actcls}$ the weight of the loss on the predicted action class, $\lambda_{objcls}$ the weight of the loss on the predicted object class, and $G_t$ denotes the regular fixed grid into which the picture is divided; $\hat{\mathbf{p}}^h$ denotes the predicted hand coordinates, $\hat{\mathbf{p}}^o$ the predicted object coordinates, $\hat{c}^h$ the confidence of the hand prediction, $\hat{c}^o$ the confidence of the object prediction, $\hat{\mathbf{o}}$ the predicted object-class probability, and $\hat{\mathbf{a}}$ the predicted action-class probability.
7. The method according to claim 3, wherein the interactive recurrent neural network takes as input the keypoint coordinate vectors of the hand and the object obtained by the convolutional neural network; their interaction relationship is modeled by a multilayer perceptron whose output serves as the input of the recurrent neural network, which finally outputs the interaction class of the hand and the object in the video.
8. An apparatus for detecting 3D hand interaction with unknown objects in RGB video in real time by using the method of any one of claims 1 to 7, comprising:
a model training module for training a convolutional neural network with the video frames as input, the convolutional neural network predicting 3D hand poses, 6D object poses, hand motions and object classes of each frame of image; taking the 3D hand posture and the 6D object posture detected by the convolutional neural network as input, training an interactive cyclic neural network, and obtaining the interactive category of the hand and the object in the video by the cyclic neural network by utilizing the time sequence information in the video;
and the real-time detection module is used for inputting the video to be detected into the trained convolutional neural network and interactive cyclic neural network to obtain the 3D hand posture, the 6D object posture, the hand action, the object type of each frame of image in the video and the interactive action of the hands and the object in the video.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202010916742.4A 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time Active CN112199994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Publications (2)

Publication Number Publication Date
CN112199994A true CN112199994A (en) 2021-01-08
CN112199994B CN112199994B (en) 2023-05-12

Family

ID=74005883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010916742.4A Active CN112199994B (en) Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Country Status (1)

Country Link
CN (1) CN112199994B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504A (en) * 2021-01-20 2021-04-30 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN109919078A (en) * 2019-03-05 2019-06-21 腾讯科技(深圳)有限公司 A kind of method, the method and device of model training of video sequence selection
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527A (en) * 2017-04-25 2017-09-15 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning
CN109919078A (en) * 2019-03-05 2019-06-21 腾讯科技(深圳)有限公司 A kind of method, the method and device of model training of video sequence selection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANKUN WU et al.: "A Method for Detecting Interaction between 3D Hands and Unknown Objects in RGB Video", 2021 2nd International Workshop on Electronic Communication and Artificial Intelligence (IWECAI 2021) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504A (en) * 2021-01-20 2021-04-30 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video

Also Published As

Publication number Publication date
CN112199994B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
Doosti et al. Hope-net: A graph-based model for hand-object pose estimation
Kwon et al. H2o: Two hands manipulating objects for first person interaction recognition
Wang et al. Atloc: Attention guided camera localization
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
Lao et al. Automatic video-based human motion analyzer for consumer surveillance system
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
Han et al. Enhanced computer vision with microsoft kinect sensor: A review
Wang et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
Elgammal et al. Tracking people on a torus
Wang et al. Predicting camera viewpoint improves cross-dataset generalization for 3d human pose estimation
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
WO2021098802A1 (en) Object detection device, method, and systerm
Li et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
Zhang et al. Handsense: smart multimodal hand gesture recognition based on deep neural networks
CN112199994B (en) Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time
Cao et al. Real-time gesture recognition based on feature recalibration network with multi-scale information
Wu et al. Context-aware deep spatiotemporal network for hand pose estimation from depth images
Kourbane et al. A graph-based approach for absolute 3D hand pose estimation using a single RGB image
Le et al. A survey on 3D hand skeleton and pose estimation by convolutional neural network
Raman et al. Emotion and Gesture detection
Nie et al. A child caring robot for the dangerous behavior detection based on the object recognition and human action recognition
US20220180548A1 (en) Method and apparatus with object pose estimation
Lu et al. Dynamic hand gesture recognition using HMM-BPNN model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant